h a l f b a k e r y
"Look on my works, ye Mighty, and despair!"


Lossy Text Compression

Compress your text files using the technology that makes JPEG so great for pictures!
  (+4, -7)

Of the 256 possible characters that a byte can represent, only a small fraction are used in most text. Couldn't we somehow take advantage of this, and make our text files smaller by eliminating the need for the rarely used letters? Why yes. Yes we can.

Of the 8 bits in a byte, the most significant bit is rarely used (standard ASCII is only 7 bits long). So first we can eliminate the 8th bit, as is sometimes done anyway.
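
A minimal sketch of that first step, assuming the input is pure 7-bit ASCII. The names pack7 and unpack7 are made up for illustration. The character count is passed to the decoder separately, because the padding bits in the last byte could otherwise be read as an extra character.

```python
def pack7(text: str) -> bytes:
    """Pack 7-bit ASCII characters into a contiguous bit stream."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently in the accumulator
    out = bytearray()
    for b in data:
        bits = (bits << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)


def unpack7(packed: bytes, count: int) -> str:
    """Recover `count` characters from a stream produced by pack7."""
    bits = 0
    nbits = 0
    chars = []
    for byte in packed:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 7 and len(chars) < count:
            nbits -= 7
            chars.append(chr((bits >> nbits) & 0x7F))
        bits &= (1 << nbits) - 1
    return "".join(chars)


message = "hello world"
packed = pack7(message)
print(len(message), len(packed))
print(unpack7(packed, len(message)))
```

Every 8 characters fit in 7 bytes, so this step on its own saves 12.5% and loses nothing.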

Next we can take a lesson from the 31337-5p34kers: many letters don't add any value in most situations, and can be replaced by more common letters. For example, a lower-case "L", capital "I", and the number "1" look nearly identical in most situations, so they can be consolidated.

There is little need for capital letters, so they can be dispensed with. The decompression/display program could even automatically capitalize the first letter of every sentence, common names, etc., if you like.
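
The display-side capitalization could be sketched roughly like this. It assumes sentences end in ".", "!" or "?" followed by whitespace, and the proper_nouns argument is a hypothetical stand-in for whatever list of common names the decoder would carry.

```python
import re


def recapitalize(text: str, proper_nouns=()) -> str:
    """Re-capitalize all-lowercase text for display."""
    # Capitalize the first letter of the text and the first letter
    # after each sentence-ending punctuation mark.
    result = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    # Capitalize any known names (hypothetical list supplied by the caller).
    for name in proper_nouns:
        result = re.sub(r"\b" + re.escape(name) + r"\b", name.capitalize(), result)
    return result


print(recapitalize("i saw alice. she waved.", proper_nouns=("alice",)))
```

Anything the rules cannot guess (acronyms, a name that is also an ordinary word) stays lowercase, which is part of the loss.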

Repeating letters are hardly necessary. All multiple consecutive occurrences of a letter could be compressed into a single occurrence, and most of the meaning is preserved.

Certain letters and letter pairs look and/or sound the same as other letters/letter pairs. For example, "ck" offers nothing that a simple "k" does not.
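
The letter-level rules so far (consolidating look-alikes, dropping capitals, replacing "ck" with "k", collapsing repeats) can be sketched as one lossy pass. The function name and the exact order of the steps are assumptions for illustration; only the rules themselves come from the description above.

```python
import re


def lossy_normalize(text: str) -> str:
    """Apply the lossy letter-level substitutions in a fixed order."""
    # Capital "I" and the digit "1" are folded into lower-case "l".
    text = text.translate(str.maketrans({"I": "l", "1": "l"}))
    # Capital letters are dispensed with.
    text = text.lower()
    # "ck" offers nothing that a plain "k" does not.
    text = text.replace("ck", "k")
    # Runs of the same letter collapse to a single occurrence.
    text = re.sub(r"([a-z])\1+", r"\1", text)
    return text


print(lossy_normalize("Back to the bookkeeper"))
print(lossy_normalize("Illness"))
```

None of these steps can be undone exactly, so the decoder has to guess or leave the text as it is.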

Another technique that might be used is phoneme encoding. The text would be parsed and broken into its phonemes. The decoding program would then expand the phonemes out to something that could be read aloud.

Finally, the text is ready to be compressed in a more traditional way. Huffman compression would be used, so the most commonly used characters would require fewer bits to express. Each language would have its own general purpose Huffman key that would always be used, eliminating the need to recalculate and send the key along with each message.
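
A rough sketch of the shared-key idea: both ends build the same Huffman code from a fixed frequency table, so no key travels with the message. REFERENCE_TEXT is a made-up stand-in for a real per-language frequency table, and the function names are hypothetical. Characters missing from the table raise a KeyError here.

```python
import heapq
from collections import Counter


def build_code(freqs):
    """Return {symbol: bitstring} Huffman code for a symbol -> weight mapping.

    Assumes at least two distinct symbols.
    """
    # Heap entries are (weight, tiebreak, {symbol: code so far}); the unique
    # tiebreak keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def encode(text, code):
    """Encode text as a string of '0'/'1' characters."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Decode a bit string; works because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Illustrative only: a real table would come from a large corpus.
REFERENCE_TEXT = "the quick brown fox jumps over the lazy dog and then the text goes on"
SHARED_CODE = build_code(Counter(REFERENCE_TEXT))

message = "the lazy dog"
bits = encode(message, SHARED_CODE)
print(len(bits), "bits instead of", 8 * len(message))
print(decode(bits, SHARED_CODE))
```

The bit string is kept as text for readability; a real encoder would pack it into bytes.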

Work has been done on lossy text compression on a word level, but to my knowledge it has not been done on the letter level as I described. With such a method, text files could meaningfully be compressed by tremendous amounts. The character set could be reduced to a mere two or three dozen letters, I'm sure.

Applications could include email and instant messaging over slow connections, or the storage of large amounts of textual information.

Uberminky, Jun 07 2002

Word-level compression http://sequence.rutgers.edu/lossy/
(NOT an example of the described compression method) [Uberminky, Jun 08 2002, last modified Oct 17 2004]

A Plan for the Improvement of English Spelling http://www.neth.de/...piele/newspell.html
Seminal couple of paragraphs by Mark Twain. [jutta, Jun 08 2002]

Lossy PNG (ie zlib) compression http://membled.com/work/apps/lossy_png/
A modified zlib which allows lossy matching in the Lempel-Ziv stage. The intended application is image compression, but it would be easy to change the code for particular letter mismatches. However I don't think it would be that much better compression than ordinary gzip. [Ed Avis, Oct 11 2002, last modified Oct 17 2004]

anfractuosity compressor https://www.anfract...y-text-compression/
We simply pick the shortest alternative word from a thesaurus, in order to compress text in a lossy fashion. [mofosyne, Apr 17 2016]


       Baked! See link.   

       Welcome to the bakery, Uber-dave.
[ sctld ], Jun 07 2002

       Lossless compression can condense text files to 10% of their original size, or less. Efficient lossless encoding makes use of techniques such as efficiently encoding whitespace, using dictionaries of common words or sentence fragments, predicting what letter is most likely to follow, and avoiding the duplication of repeated strings. However, since text files tend to be only a very small part of the total data transmitted and stored, technology has perhaps lagged behind compared to image compression.   

       Nonetheless, I think most people would be very reluctant to use any lossy text compression method. Not only would it be unable to handle irregular strings such as car registration numbers, but it might change the meaning of sentences in subtle ways.
pottedstu, Jun 07 2002

       sctld: Forgive me, I'm new here -- what link did you want me to look at? The only link I see is the one I myself submitted, which is not an example of what I mentioned. I now see that I should have used the "link" link, thanks yamahito.   

       ravenswood: Actually the savings would be tremendous for things like instant messaging, etc. Worthwhile? No, probably not. I noticed that sarcasm doesn't go over very well around these parts. I'll keep this in mind.   

       It's good to feel welcome!
Uberminky, Jun 08 2002

       If it's nothing to do with the idea, then why link it?
[ sctld ], Jun 08 2002

       He didn't say it wasn't to do with the idea, just that it wasn't an example of it. The link refers to compression at a word level, not a letter level.
yamahito, Jun 08 2002

       What ravenswood and pottedstu said. To clarify, the savings that would be negligible are those compared to straightforward Huffman compression, not compared to the existing text.
jutta, Jun 08 2002

       yamahito: What's the difference?
[ sctld ], Jun 08 2002

       the scale on which it's done, I guess.
yamahito, Jun 08 2002

       Sorry, but I don't think you can compress letters any more than you can compress words. When you compress words, effectively you are removing letters deemed 'un-needed'. You can't compress letters, because they are the root, they are the fundamental particles of words, paragraphs, chapters, books, volumes. They are to language what quarks are to elements. So, actually, you can't compress letters.
[ sctld ], Jun 08 2002

       there's a difference between compressing letters and compressing words on the letter-level.
yamahito, Jun 08 2002

       Explain it to me.
[ sctld ], Jun 08 2002

       No, I don't feel like bickering over semantics. What Uberminky describes is what I would call compressing words on a letter by letter level. His link does it word by word, as I would imagine Huffman compression would.
yamahito, Jun 08 2002

       It's a left-brain/right-brain issue. You can use lossy compression for JPEGs because that kind of visual information is handled by our right brains, which are holistic and relatively unaffected by loss of detail information--capable, indeed, of discerning meaning even in the face of awesomely unfavorable signal-to-noise ratios. The left brain, OTOH, is responsible for linear, logic-oriented thinking where attention to detail can be paramount. This is the part of the brain responsible for language and reading. Lossy compression is simply not appropriate for text compression because for the left brain to read the decompressed text, it must first work hard to "manufacture" the missing details (possibly calling on the right brain to step in and help?).   

       However, this would make a fun programming project for a rainy day, or a way to get your feet wet in a new programming environment. Certainly more useful than the fifteen-minute-hack pig-latin translator I wrote the other day.
madscientist, Oct 11 2002

       If English were as diverse as ASCII itself, there wouldn't be any words that look foreign to us. Dictionary compression is baked, I'm sure. Storing the text in all caps and uncompressing it to the rules of our language would be a great idea too. I guess you could intelligently remove and replace commas too.   

       Hell, you get a bun for this. I'll bet this idea isn't baked nearly like it could be. I want to see BURNT edges, dammit!
kevinthenerd, Jun 22 2010

       The anfractuosity compressor in the link is pretty interesting.   

       According to it, Alice in Wonderland compresses from 164K to 157K (while still being just about readable)!   

       Basically uses a thesaurus to do the compression by picking the shortest word.
mofosyne, Apr 17 2016

