Halfbakery: Lookup Text Compression

Point: Ask Jeeves says the Oxford English Dictionary has approximately 615,000 unique words in it (this includes derivative forms, e.g. climbs, climber, climbed, etc.).

Point: One character in the ASCII character set is normally stored in 8 bits (one byte).

Point: 20 bits can have 1,048,576 different values.

Point: Most English words have more than 3 letters.

Point: Rules of pronunciation and grammar aside, 3 letters have 17,576 possible configurations (26 x 26 x 26), or 140,608 (52 x 52 x 52) if you allow arbitrary capitalization, which is silly.

Round 20 bits up to a nicer number like 24 bits (3 bytes) and we get 16,777,216 different values. 3 bytes is the amount of space necessary to store a three-letter word in plain text, and English has no more than 17,576 of those. In fact, we've got in total more than 27 times as many possible states as there are words of ANY length in the entire English language.
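The arithmetic above is easy to check. A quick sketch in Python, where OED_WORDS is just the approximate figure quoted earlier (not an authoritative count) and the variable names are made up for illustration:

```python
# Sanity check of the figures quoted in the text.
# OED_WORDS is the approximate count cited above, taken on trust.
OED_WORDS = 615_000

three_letter_strings = 26 ** 3   # lowercase letters only
mixed_case_strings = 52 ** 3     # arbitrary capitalization allowed
states_20_bits = 2 ** 20
states_24_bits = 2 ** 24

# How many 24-bit states exist per dictionary word, and how many are spare.
states_per_word = states_24_bits / OED_WORDS
spare_states = states_24_bits - OED_WORDS

print(f"3-letter strings (lowercase): {three_letter_strings:,}")
print(f"3-letter strings (any case):  {mixed_case_strings:,}")
print(f"20-bit states:                {states_20_bits:,}")
print(f"24-bit states:                {states_24_bits:,}")
print(f"states per OED word:          {states_per_word:.2f}")
print(f"spare states:                 {spare_states:,}")
```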

So, we can easily represent every word, of any length, in the same amount of digital space as a three-letter word (ignoring the space required for the lookup table, which I roughly estimate would be not more than 50 megabytes). We also have plenty of room left over for capital letters, punctuation, numbers, common word combinations ("of the" "as if" etc.) and anything else we feel like storing.

By storing words, phrases, punctuation, etc. as 3-byte integers, rather than storing the letters themselves, we can produce a massive reduction in the file size of any text document (remember, most words are more than three letters, and with all the common compound phrases we can store as well, we may average more like 2 bytes per uncompressed word). Of course, if we restrict ourselves to a lookup table based on the OED we'll miss out on all the proper nouns, some of the slang, many hardcore scientific terms, and anything misspelled, but these can be encoded "plain" on a case-by-case basis and on the whole will probably only represent a small percentage of the total text. Besides, some of this stuff may be covered in the roughly 16,160,000 extra states we have to fill.
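To make the scheme concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration, not part of the idea as posted: the five-word WORDS list stands in for the real dictionary, words are assumed to be separated by single spaces (which are implied rather than stored), and the "plain" fallback is done by reserving the code 0xFFFFFF as an escape followed by a one-byte length and the raw UTF-8 bytes.

```python
# Toy lookup-table codec: every known word becomes a 3-byte integer.
# WORDS is a hypothetical stand-in for the full dictionary.
WORDS = ["the", "dictionary", "compresses", "ordinary", "sentences"]
TABLE = {word: code for code, word in enumerate(WORDS)}

ESCAPE = 0xFFFFFF  # assumed reserved code meaning "plain text follows"


def encode(text, table):
    """Encode space-separated tokens as 3-byte codes, escaping unknown ones."""
    out = bytearray()
    for token in text.split(" "):
        code = table.get(token)
        if code is not None:
            out += code.to_bytes(3, "big")
        else:
            raw = token.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("token too long for a one-byte length field")
            out += ESCAPE.to_bytes(3, "big")
            out.append(len(raw))  # one-byte length prefix
            out += raw            # the token itself, stored "plain"
    return bytes(out)


def decode(data, words):
    """Invert encode(): read 3-byte codes and rebuild the text."""
    tokens = []
    i = 0
    while i < len(data):
        code = int.from_bytes(data[i:i + 3], "big")
        i += 3
        if code == ESCAPE:
            length = data[i]
            i += 1
            tokens.append(data[i:i + length].decode("utf-8"))
            i += length
        else:
            tokens.append(words[code])
    return " ".join(tokens)


sample = "the dictionary compresses ordinary sentences"
packed = encode(sample, TABLE)
print(len(sample.encode("utf-8")), "bytes plain ->", len(packed), "bytes packed")
print(decode(packed, WORDS) == sample)
```

Note that with this particular escape format an unknown word costs four bytes more than storing it plain, so the savings depend on how rarely the fallback is needed.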

Naturally most of the 615,000 words in the OED are hardly ever used. A "common case" compressor that only stored a lookup table for the 200,000 words in common use today (or even the 20,000 words most people actually know) could be a reasonably small program and would only suffer minor inefficiencies in compression when it had to plaintext encode an unusual or nonsense word.
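How much a smaller table costs can be estimated before building anything. The sketch below picks the N most frequent tokens from a sample text and adds up the bytes each token would need. The helper names are hypothetical, and the cost model is an assumption: 3 bytes per known word, and 4 bytes of overhead (a 3-byte escape code plus a 1-byte length) on top of the raw bytes for anything stored plain.

```python
from collections import Counter


def top_n_table(tokens, n):
    """Return the set of the n most frequent tokens in the sample."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def estimated_size(tokens, table, code_bytes=3, escape_overhead=4):
    """Estimate encoded size in bytes under the assumed cost model."""
    total = 0
    for token in tokens:
        if token in table:
            total += code_bytes
        else:
            # Plain fallback: overhead plus the token's own bytes.
            total += escape_overhead + len(token.encode("utf-8"))
    return total


sample = "the cat sat on the mat and the dog sat on the rug"
tokens = sample.split()

for n in (3, 100):
    table = top_n_table(tokens, n)
    print(f"table of up to {n} words: {estimated_size(tokens, table)} bytes "
          f"(plain text is {len(sample.encode('utf-8'))} bytes)")
```

On a toy sentence made of very short words the numbers are unflattering, which is consistent with the premise that the savings come from words longer than three letters; a realistic estimate would need a large, representative corpus.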

Naturally this processed text could be further compressed by running it through a conventional data compressor (e.g. WinZip).
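As a rough illustration of that second pass, the sketch below packs a list of word codes into 3-byte integers and then runs the result through Python's standard zlib module, standing in for WinZip. The code list is invented and deliberately repetitive, so the ratio it prints says nothing about real text; the point is only that the two stages chain together and round-trip cleanly.

```python
import zlib


def pack_codes(codes):
    """Pack each integer code into 3 big-endian bytes."""
    return b"".join(code.to_bytes(3, "big") for code in codes)


def unpack_codes(data):
    """Split a byte string back into 3-byte integer codes."""
    return [int.from_bytes(data[i:i + 3], "big") for i in range(0, len(data), 3)]


# Hypothetical word codes; a real document would be far less repetitive.
codes = [0, 1, 2, 0, 1, 2] * 50

packed = pack_codes(codes)
squeezed = zlib.compress(packed, 9)

print(len(packed), "bytes after lookup stage")
print(len(squeezed), "bytes after zlib")
print(unpack_codes(zlib.decompress(squeezed)) == codes)
```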