An irrational number is a number with an infinite number of decimal places and no repeating pattern. Examples are "pi", "e", the square root of 2, the square root of 3, etc. Since there is no repeating pattern in the decimals, any combination of digits is possible somewhere along the line. Meaning, any finite string of digits is likely to be found somewhere in an irrational number.

The idea behind the irrational archiver is this: use predefined algorithms to calculate irrational numbers. Divide the string you have to compress into shorter strings (not strictly necessary), and compare the strings to the calculated value of the irrational number. When a match is found, the string is compressed as "the number Pi, 300 decimals, starting at the 28394059th." If the string can't be matched, the compressor tries another number. If no match is found in any of the irrational numbers defined by the compression standard, the string is left uncompressed. Of course, if "most" of the string fits, the fitting part is marked, and the "deviant" digits are matched separately.
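The matching step described above can be sketched in a few lines of Python. This is only a toy illustration, not anything from the original post: `pi_digits` uses Gibbons' unbounded spigot algorithm to stream decimal digits of Pi, and `find_in_pi` (a name invented here) searches a finite prefix of the expansion for a given digit string.

```python
from itertools import islice

def pi_digits():
    """Stream the decimal digits of pi (3, 1, 4, 1, 5, ...) using
    Gibbons' unbounded spigot algorithm."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def find_in_pi(pattern, max_digits=1_000):
    """Return the 0-based decimal place at which `pattern` (a string of
    digits) first occurs after the decimal point of pi, or -1 if it is
    not found within the first `max_digits` decimals."""
    digits = ''.join(str(d) for d in islice(pi_digits(), max_digits + 1))
    return digits[1:].find(pattern)   # digits[0] is the leading "3"
```

A match like `find_in_pi("26535") == 5` would then be stored as the record "Pi, 5 decimals, starting at the 6th" - the compressed form proposed above.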

The decompression software would calculate the irrational numbers to the farthest decimal place needed (i.e. until that iteration where the farthest decimal place needed is calculated correctly), and combine the strings into a single file.

A predefined calculation algorithm with standard variable types would ensure that the decompression software gets the same values for Pi as the compression algorithms.

The effectiveness of the compression would vary greatly, depending on the file. The ideal files would be, say, 35623730950488016887242096... (n decimal places of the square root of 2, starting at the 6th decimal place); 1415926535897932384626433832795... (the first n decimal places of Pi); 7182818284590452353602874713526... (the first n decimals of "e") and would be compressed down to a few bytes. But more demanding files might lead to the number of the decimal places being bigger than the strings themselves. The compressor should check every irrational number defined in the standard for a match closer to the decimal point.

This method would require a very powerful computer, both for compression and decompression, and both would take virtually forever, but theoretically, this could work.

I had this same idea several years ago, but soon discovered the fatal flaw in the scheme: the "pigeonhole principle". Basically, it will take at least as much data to encode where the data is as the data itself.

To use Pi as an example: let's say there is a five-digit string you want to compress. On average, it will first occur somewhere between the 10,000th and the 100,000th decimal place of Pi, so it takes about the same amount of information to describe the location of the information as the information itself.
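The arithmetic behind this pigeonhole argument is easy to check. Assuming the digits of Pi behave like random digits, a given k-digit pattern first shows up around the 10^k-th decimal place on average, and encoding that offset costs exactly as many bits as the pattern itself:

```python
import math

k = 5                                      # length of the digit string
expected_pos = 10 ** k                     # typical first occurrence: ~100,000th place
bits_for_string = k * math.log2(10)        # raw payload: ~16.6 bits
bits_for_offset = math.log2(expected_pos)  # just the offset: ~16.6 bits
print(bits_for_string, bits_for_offset)    # the two are identical
```

And that is before counting the bits needed for the string's length or the constant's name.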

The best use for Pi 'compression' is as a "one time pad" - a system of perfect encryption.

No, bigsleep, the GIF compression algorithm is not "very similar to this". I don't know why people keep coming back to LZW (a similarly incorrect statement is next door, in the "sliced cube data compression") - is it the only compression algorithm anyone knows?

[Lt_frank], remember any number (including decimals) can be represented as a binary number (or hexadecimal or octal), and any alphabet can be chosen to use all possible combinations in a particular base.

Also note that the irrational numbers are uncountably infinite, so some pretty random irrational numbers can be used as an OTP.

Short strings and very long strings have the same problem: their position in Pi can only be defined by a number so big, its digital representation is longer than the string itself (especially true for short strings). But the algorithm would try several irrational numbers for matches, not only Pi, and give up after a number of decimal places (i.e., if the decimal place is represented by a number longer than, or not adequately shorter than, the string itself). The shortest representation is selected. If no matching string can be found, the string is left as it was, uncompressed. Strings shorter than the defined minimum compression block (consisting of the code of the irrational number used, the number of decimals used, and the starting decimal place) wouldn't be compressed. Very large strings would be divided into shorter ones (not a guarantee of a match closer to the decimal point; stochastic math doesn't really follow what we believe is "common sense"). This means the worst-case scenario is zero compression (the end result is as big as the original), but everything else would be, strictly speaking, compressed (effectiveness notwithstanding).
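The selection rule in the paragraph above - take the shortest representation, fall back to storing the string raw - can be sketched as follows. Everything here is hypothetical glue code: `encode_block` and its crude cost model (counting decimal digits rather than bits) are invented for illustration, and the candidate matches are assumed to come from a prior search step.

```python
def encode_block(block, candidates):
    """Pick the cheapest representation of `block`, a string of digits.

    `candidates` is a list of (constant_id, start_position) matches
    found in precomputed irrational-number expansions. A reference
    costs the digits of: the constant's id, the start position, and
    the block length. If no reference beats the raw block, store raw.
    """
    best = None
    for const_id, pos in candidates:
        cost = len(str(const_id)) + len(str(pos)) + len(str(len(block)))
        if best is None or cost < best[2]:
            best = (const_id, pos, cost)
    if best is not None and best[2] < len(block):
        return ('ref', best[0], best[1], len(block))
    return ('raw', block)   # worst case: zero compression, as noted above
```

A match deep in the expansion, e.g. `encode_block("12345", [(1, 987654321)])`, correctly falls back to `('raw', "12345")`, since naming the position costs more than the five digits themselves.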

[xav]: I should have known that, like many mathematical ideas, it has undoubtedly been thought of before, and that if it were feasible, it would have been implemented somehow. I just couldn't find a flaw in my reasoning. The great thing about the Halfbakery is that people tell me where I went wrong with my (half)baking (provided there really is something wrong with it).

It's more accurate to describe pi as transcendental rather than irrational (although it is irrational too). I agree with the others that the data needed to encode where the sequence starts will be longer than the sequence itself in the majority of cases.

For god's sake man, haven't you SEEN "A Beautiful Mind"? This research is a very slippery slope indeed, soon you'll be followed by agents and receive coded messages in the newspapers.

Edited the title for less clarity. For some reason, everyone is stuck on Pi as the only number that would be used for compression; there are many other (well known, researched, thoroughly pondered and meticulously mulled over) irrational and/or transcendental numbers to work with, and the algorithm would try several of them for conveniently placed matching strings, not just Pi.

I like the less clarity thing. And the whole problem would have gone away if the legislation to rationalize pi to 16/5 would have passed. (In Indiana, at least.)

[Lt_Frank], I agree that using any *known* irrational/transcendental number is not a good idea for an OTP. I'm instead suggesting that there are plenty of unpublished irrational/transcendental numbers to choose from. Of course it would be easier to simply give the OTP string than try to define some obscure irrational/transcendental number - I was using this as a round-about way of suggesting to [Veho] that his scheme resembled a viable encryption system rather than compression system.

[Veho], having a *set* of irrational numbers rather than a single irrational number will still suffer from the same problem - it'll take valuable data to describe which irrational number you're using.

So, for example, you enumerate your various numbers: 1. Pi, 2. e, 3. phi, etc. When you encode the data, you'd need to say: the 5-digit string starting at the 37,654th decimal place in the 3rd listed number (phi).

[xaviergisz], the question here is whether the cost of the code for the irrational number used outgrows the savings on the number representing the string's decimal place - that is, whether using more irrational numbers gains more than it costs. The idea is that a different number might contain the same string so much closer to the decimal point that it would (more than) compensate for adding a few bits to identify which irrational number was used.

Like other compression algorithms, the biggest problem is the nature/layout of the string you are trying to compress, making for worst/best case scenarios. But while other compression methods rely on defined algorithms, this one depends on chance. Since a general string is more or less random, and the digits of irrational numbers are as well, the only way to check the effectiveness of this compression method would be to actually try it out, because the math behind a theoretical discussion of its effectiveness is hideously incomprehensible to most. There is no mathematical proof that any given (irrational) number would have a completely random string "too far" from or "near enough" to the decimal point - mainly because it's random. For a theoretical proof, you would have to (statistically) calculate the distance for every possible string, which is impossible.
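One can at least simulate the trade-off. Irrational expansions are only conjectured (via normality), not proven, to behave like random digit streams, so the sketch below substitutes seeded pseudo-random streams for the real constants; `first_pos` and `avg_best_pos` are names invented here.

```python
import random

def first_pos(rng, pattern):
    """Index of the first occurrence of `pattern` in a freshly
    generated stream of random decimal digits."""
    buf = ''
    while True:
        buf += ''.join(rng.choices('0123456789', k=1000))
        i = buf.find(pattern)
        if i != -1:
            return i

def avg_best_pos(num_streams, pattern, trials, seed):
    """Average, over `trials` runs, of the *closest* match position
    across `num_streams` independent digit streams."""
    rng = random.Random(seed)
    return sum(min(first_pos(rng, pattern) for _ in range(num_streams))
               for _ in range(trials)) / trials

one = avg_best_pos(1, '123', trials=200, seed=42)    # roughly 1000
four = avg_best_pos(4, '123', trials=200, seed=42)   # roughly 250
```

With four streams the best match sits roughly four times closer, saving about log2(4) = 2 bits on the offset - exactly the 2 bits it takes to say which of the four streams was used. The gain and the cost cancel, which is the pigeonhole objection in miniature.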

Perhaps you could really make it tricky by saying
"Starting at the 123456th decimal, use x digits then subtract y" or
"Starting at the 123456th decimal, use x digits backwards then add y" or
"Starting at the 123456th decimal, use x digits and starting at the 1234567th decimal, use y digits, and then multiply together."

Don't advanced math programs use placeholders for repeating decimals (7/3) and irrationals (pi, e) when appropriate, instead of converting to wonky decimal or binary representations?

I like this idea! So much so that I have broken my own rule of not giving more than five loaves for every three fish. [+]

I would have pitched it a bit differently, starting with:

If you say "Genesis 1" to a biblical scholar, he will know that you mean the book of Genesis from the beginning. He would likewise know what was meant by "Leviticus 10-16".

For computers, the book would be an irrational number, such as the square root of two, calculated to 1 GB of hard drive space, with the software set up to continue reading from the beginning when it gets to the end. Giving 8 000 000 000 * 1 000 000 = 8*10^15 possible strings of data 1 KB or larger in size. This data could be either archived files or streamed video.

For archiving files, exact matches would be needed. If a match cannot be found, the file would need to be split in some way so that the fewest matches cover the whole file: breaking an unmatched file into halves, any unmatched halves into quarters, etc. For video, the files would need to be of a pre-set length, but the match only needs to be best-fit, because each image is only on screen for a fraction of a second.
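The halving scheme can be sketched directly. This is a toy version under obvious simplifications: the 1 GB digit expansion is stood in for by a short Python string, and `match_or_split` is a name invented here.

```python
def match_or_split(chunk, dictionary, min_len=4):
    """Encode `chunk` against a fixed `dictionary` string: emit a
    (position, length) reference if the whole chunk occurs, otherwise
    split it into halves and recurse. Chunks too short to be worth
    splitting further are stored raw."""
    pos = dictionary.find(chunk)
    if pos != -1:
        return [('ref', pos, len(chunk))]
    if len(chunk) < 2 * min_len:
        return [('raw', chunk)]
    mid = len(chunk) // 2
    return (match_or_split(chunk[:mid], dictionary, min_len) +
            match_or_split(chunk[mid:], dictionary, min_len))
```

Against the dictionary "31415926535897932384", the chunk "9265359999" splits into a reference ('ref', 5, 5) plus a raw remainder ('raw', '59999') - the unmatched half survives uncompressed, as described above.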

Decompressing, if it can be called that, for the video would involve starting at the specified bit on the hard drive and reading the correct number of bytes for each broadcast address. Decompressing a 1 MB file might involve reassembling the following in the correct order: address 1: 250 KB, address 2: 125 KB, address 3: 125 KB, address 4: 250 KB, address 5: 250 KB.

I don't see why people were so opposed to the idea. I hope to return to this idea as the start of something epic later in the year, so if I have missed something fundamental, let me know.

//so if I have missed something fundamental, let me know.//

You didn't read your own anno.

//Giving 8 000 000 000 * 1 000 000 = 8*10^15 possible strings of data 1 K byte or larger in size.//

Work out the probability that you could actually use your 1 KB data chunks versus the 8-byte key. You're actually better off designing a dictionary than using random numbers. What proportion of number sequences could be used to encrypt ASCII words, for example?
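The probability in question can be put into numbers. Taking the 1 GB dictionary from the earlier anno as the assumption, the chance that a specific arbitrary 1 KB chunk occurs anywhere in it is vanishingly small:

```python
import math

chunk_bits = 1024 * 8               # one arbitrary 1 KB chunk = 8192 bits
dict_bytes = 10 ** 9                # the 1 GB digit dictionary proposed above
positions = dict_bytes - 1024 + 1   # possible start offsets for a match

# For a random dictionary, one specific chunk matches one specific
# offset with probability 2**-8192; a union bound over all offsets:
log2_p = math.log2(positions) - chunk_bits
print(log2_p)   # about -8162: the match probability is ~2**-8162
```

So essentially no 1 KB chunk of real data will ever be found in the dictionary; only chunks short enough that a gigabyte of offsets can cover their whole value space (a few bytes) have any realistic chance of matching.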

You need a finite set of irrational numbers, with the set having the property that, for any string you might want to encode, the string occurs, in at least one member of the set, at a position which requires fewer bits to encode than does the original string. Moreover, the number of bits saved must be more than the number required to specify which member of the set you're using.

Is such a set even possible? That sounds like a Serious Mathematical Question to which the answer might be "no."

This is not an unheard-of idea in current science fiction, which, if history is any guide, may give it some real-world validity in the future. I'm not going to spend hours flipping through my dozens of short-story anthologies to cite specific titles, but I've run across it a half-dozen times at least.

[big] Are you saying that, in order to have a chance of getting a match, it would need a much, much bigger chunk of hard drive space? Going back to the analogy, 1 GB is more like a page from the Beano than the Bible.

[mp] In my take on [Veho]'s idea there is a very finite set of random numbers (one; I would suggest the square root of two). It just needs to be a lot bigger than I had thought. My PC is quite old and decrepit, and to make up for its lack of RAM it writes blocks of active memory onto the hard drive all the time; I believe the process is called paging. If it took more memory space to specify the address of these blocks of memory than they can store, why would my PC use paging?

[alter] I too like SF, and that may be why the idea appeals to me. After all, once you have every possible combination of ones and zeros, the only limit to what you can have a PC do with them is imagination!

Your hard drive metaphor is flawed (I believe). The correct metaphor would be a read-only hard drive which came from the factory pre-written with digits of pi, e, sqrt(2), etc. Paging (or virtual memory) would not work as well if you weren't allowed to write to the hard drive those strings you expect to use again, to the exclusion of others you won't need. It's an open question (I believe) whether paging/VM would work *at all* in that case.
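The asymmetry with paging can be made concrete. The numbers below are illustrative assumptions (4 KB pages, 4 GB of frames), not anything from the thread:

```python
import math

page_bits = 4096 * 8               # content of one 4 KB page: 32768 bits
num_frames = 2 ** 20               # 4 GB of RAM holds 2**20 such frames
addr_bits = math.log2(num_frames)  # naming one stored frame: 20 bits

# Paging is cheap because the OS *wrote* those 2**20 pages itself and
# only has to name one of them. Finding an arbitrary page's *content*
# inside fixed, pre-written data would need an offset comparable to
# the 32768 bits of the page - which is this scheme's whole problem.
```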

Edit: according to [bigsleep] and Jorge Luis Borges, it's not an open question.

//Are you saying that in order to have a chance of getting a match it would need a much, much, bigger chunk of hard drive space//

No. I'm saying that the compression wouldn't work, because specifying which bit of the hard drive space to use would take more space than the uncompressed form. In essence, this idea (like Jutta's anno that [hippo] refers to) assumes that if you use an irrational number as a white-noise source dictionary, then all your compression is doing is a white-noise lookup, which by definition cannot achieve any compression - it becomes a simple substitution cipher at best.

The dictionary needs to be designed for the content it will be used against - QED.

I must be really stupid or something. I honestly cannot see how it would take over a kilobyte of information to instruct a PC to look at a given address on its (non-metaphorical) hard drive and read off the next thousand consecutive bytes.

Consider decompressing 1-byte sequences. You need 256 bytes in the table, and the address to look up each byte must cover 2^8 values, i.e. one byte. One byte in, one byte out - no gain.

Consider sequences of 2 bytes. You need 256x256x2 bytes in the table, that's 131072 bytes. To address any of the 256x256 combinations you need a representation that will hold values up to 65536, i.e. 2 bytes. Two bytes in, two bytes out - no gain.

You really think it will be any different for sequences of greater than 2 bytes?
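The counting argument generalises mechanically. A quick check, assuming nothing beyond the table sizes already given above:

```python
import math

for n in (1, 2, 3, 4):
    combos = 256 ** n                   # all possible n-byte sequences
    addr_bytes = math.log2(combos) / 8  # bytes needed to address one entry
    assert addr_bytes == n              # the address is exactly as big
    print(n, combos, addr_bytes)        # as the sequence it points to
```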