Halfbakery: Lookup Text Compression

Computer: Compression
Lookup Text Compression (+7, -2) [vote for, against]
3 bytes is all you need

Point: Ask Jeeves says the Oxford English Dictionary has approximately 615,000 unique words in it (this is including derivative forms, i.e. climbs, climber, climbed, etc.).

Point: One character in the ASCII character set is 8 bits (one byte)

Point: 20 bits can have 1,048,576 different values.

Point: Most english words have more than 3 letters.

Point: Rules of pronounciation and grammar aside, 3 letters have 17,576 possible configurations. 140,608 if you allow arbitrary capitalization, which is silly.

Round 20 bits up to a nicer number like 24 bits (3 bytes) and we get 16,777,216 different values. 3 bytes is the amount of space necessary to store a three-letter word in plain text--which, in English, there not more than 17,576 of. In fact, we've got in total more than 27 times as many possible states as there are words of ANY length in the entire English Language.

So, we can easily represent every word, of any length, in the same amount of digital space as a three-letter word (ignoring the space required for the lookup table, which I roughly estimate would be not more than 50 megabytes). We also have plenty of room left over for capital letters, punctuation, numbers, common word combinations ("of the" "as if" etc.) and anything else we feel like storing.

By storing words, phrases, punctuation, etc. as 3-byte integers, rather than storing the letters themselves, we can produce a massive reduction in filesize of any text document (remember, most words are more than three letters--and with all the common compound phrases we can store as well, we may average more like 2 bytes per uncompressed word). Of course, if we restrict ourselves to a lookup table based on the OED we'll miss out on all the proper nouns, some of the slang, many hardcore scientific terms, and anything misspelled, but these can be encoded "plain" on a case-by-case basis and on the whole will probably only represent a small percentage of the total text. Besides, some of this stuff may be covered in the extra 16,100,000 states we have to fill.

Naturally most of the 615,000 words in the OED are hardly ever used. A "common case" compressor that only stored a lookup table for the 200,000 words in common use today (or even the 20,000 words most people actually know) could be a reasonably small program and would only suffer minor ineffieciences in compression when it had to plaintext encode an unusual or nonsense word.

Naturally this processed text could be further compressed by running it through a conventional data compressor (i.e. winzip).
-- 5th Earth, Jan 30 2005

(?) Baked by WRT? http://www.cs.fit.e...ion/#WRT_benchmarks
It's interesting that they also follow up the lookup compression with winzip-type compression, as you suggest. [robinism, Jan 30 2005]

(?) WRT alternate dictionaries http://www.ii.uni.w...arch/WRT10large.zip
[robinism, Jan 30 2005]

Your system saves even more space if you compare it to the unicode character set, which uses two bytes per character. Unicode is becoming more popular because it accomodates many more languages than ascii does. Could your system expand to include other dictionaries in other languages?
-- robinism, Jan 30 2005

May be baked - see link.
-- robinism, Jan 30 2005

A text compression scheme that uses a flat 3 bytes per word will pale in comparison to better methods such as those used in PKZIP and similar programs. If you took the output of the dictionary lookup and used it to feed into something like a dynamic Huffman encoder, you might end up with something a bit better, but still not much better than PKZIP which requires no external dictionary.<p>

Otherwise, I would posit that using the OED is a massive overkill. A simpler approach would be to have two dictionaries, of the most common 96 words and the most common 8,192 words. All words in the first dictionary would get coded as 1 byte; words in the second would get coded as 2 bytes. Other words would be coded at their actual length. Further, all words would start with an implicit space if the preceding character was a letter, digit, comma, period, question mark, exclamation point, close paren, close bracket, or close brace. If one of those characters had to be immediately followed by a word, a null byte would be inserted.

Saving two bytes off each occurrence of the 96 most common words and one byte off the next 8,192 most common words would more than make up for the less common words which had to be represented at full length.
-- supercat, Jan 30 2005

[Supercat], unfortuantely your scheme won't work as described because the computer has no way of knowing whether a given sequence of binary is intended to be 1 or 2 bytes. Of course this can be remedied--since one byte gives us a surplus (256 states) anyway (and 2 bytes is 65536), we can just use the first bit of each byte (or byte-pair as the case my be) as a binary switch to declare the following stored word as a one-byte or two-byte word. We lose an order of magnitude of states, but we still have enough states to store the common words (as you've defined them) plus some extras.

I only used the OED as a comprehensive example anyway--as I mentioned, a much smaller dictionary would suffice for most applications. The difficulty is that, without the full 3-byte allotment for the small-dictionary versions, the different programs lose compatablity. The full OED version could decode a small-dictionary encoded file with some clever coding and planned compatability between the two versions, (which I've thought out but won't go into here) but I don't think the smaller dictionary programs could cope with an OED encoded file. A minor issue, but I like being able to use one program for everything.

At any rate, bear in mind that the flat-rate 3 byte file can still be compressed yet again, because as yet we haven't actually done any mathematical compression at all. I would posit that a 3-byte encoded file would still compress at a significant ratio--although I'd want to verify this expirimentally. Anyway, chances are zipping a smaller initial file will generate a smaller output, even if it can't compress it as much.

[robinism], that's an interesting link--it certainly seems to show what I'm describing, right up to the 3-byte standard. Some extra research shows that PAQ got something like 3% more effeciency by adding the WRT program, but other reasearch casts doubt on whether that version of PAQ was as good at normal compression as the original WRT-less version, so this number may be skewed. The WRT documentation itself shows another test that generates 10% greater effeciency. I also don't know what's in WRT or how it works really--its dictionary is only around 1 MB so it certainly doesn't have the full English language in there.
-- 5th Earth, Jan 30 2005

There is a 5 mb version of the dictionary available.

Supercat's idea of using separate dictionaries for more common words is feasible; in fact WRT does this.

"We use 3 dictionaries, which outputs from 1 to 3 characters. Most frequent words are in first dictionary (128 words). Average used words are in second dictionary (128*128 words). And rarely used words are in third dictionary (up to 128*128*128 words). You can use own dictionaries, but remember that files can be properly decoded only with dictionaries used while encoding. Dictionaries supplied in this package were created from files from: Calgary Corpus, Canterbury Corpus, Large Canterbury Corpus and ACT text files. There are also large dictionaries that are generated with added almost all text files from the Guttenberg Project. Large dictionaries give similar compression results on texts in English, but can also handle multi-language (German, French, Italian) text files. Large dictionaries can be downloaded (5MB) from:" (see link)
-- robinism, Jan 30 2005

I realise that you are trying to make the smallest file possible. but I was just wondering if it would be possible to use the unused space in the of the original idea to have separate entries for each example of a word, for words with more than one meaning. i.e. maroon 1, maroon 2, maroon 3, maroon 4 or orange 1, and orange 2.

from my grasp of what was wrote, there is more than enough.

I could use such a dictionary for my speech to text idea.
a unique entry for each idea.
this would avoid the need to covert the spoken English into Chinese, and then translate it back into English.

I believe that such a dictionary would also be of use to people involved with language based AI.
-- j paul, Jun 30 2011

random, halfbakery