Halfbakery: Sorting Algorithm

There are a number of sorting algorithms already out there. This one I devised independently, and I don't recall ever encountering it in my various studies of computer-related stuff. That doesn't mean it isn't known, however --it doesn't even mean it isn't widely-enough known that this Idea should be deleted. We shall see!

When I started working heavily with computers about 30 years ago, the quantity of computer memory, available to use for such tasks as sorting, was rather limited. So, after some memory was filled up with data that needed to be sorted, the sorting algorithms of the day were designed to work on the data WHILE that data occupied THAT memory. The fastest such algorithm was called "the quicksort" (see link).

I confess I haven't paid much attention to what was done, with respect to fast sorting algorithms, after computers began having a lot of spare memory due to lower RAM prices. But I certainly thought of one, myself! Today, while considering posting it here, I made a cursory look to see what was "out there" (see link).

I didn't see this Idea, so....

Start with a block of memory containing a lot of random data that you want to sort. This algorithm is a bit nongeneric in the sense that the programmer implementing it needs to know something about that block of data. (On the other hand, all sorting algorithms have a "comparison test" that must be suited to the data being compared!)

In this case, however, what the programmer needs to know is something about how DIFFERENT each piece of data is, from all the others. The simplest way to describe that "difference" is to compare some words like "strength" and "stretch" --the first 4 letters are the same, so this data is rather more similar than different, with respect to considering another pair of words, such as "satisfy" and "submerge".

The programmer also needs to know something about the total range encompassed by the random data. The 4 quoted words in the previous paragrah all begin with the letter "s", so obviously that range is narrower than if the whole alphabet had been encompassed.

The actual comparison test that programmer needs to write can be described as yielding NOT a simple "smaller than" or "greater than" result; instead it yields a result describable as a "percentage of the total range". Note that if the data is fairly different a simple percentage like 37% might be useful enough, while if the data has a lot of similarity (or if the data set is large enough), it might be better if the comparison test yielded a more precise percentage, say 37.586%

AFTER figuring out how accurate a range-comparison test is needed, the programmer is in a position to make a good estimate of how much memory to request from the Operating System, for running this algorithm. We ALWAYS want to reserve several times as much memory as the unsorted data already occupies. (The sorting test chosen is strongly related to the decision whether to request 3 times as much memory or 10 times as much memory --or some other quantity, of course.) I'll call it the Big Block.

After initializing the Big Block (emptying it, usually), we are ready to run the algorithm on the data. For each data item, we compare it to the overall RANGE of randomized data, and translate that result into a memory location in the Big Block. Then we copy that data item to that memory location. That is, if our data item is figured to be 37% of the way through the range of random data, then we place it 37% of the way through the Big Block.

The net effect is that the random data gets sorted while it is simultaneously getting placed into the Big Block. We will have "inflated" the original data across a large amount of memory.

There is an obvious complication that can happen if the comparison test isn't "fine" enough, or if the Big Block is too small --a just-tested data item might have a location computed which is occupied by a previously-copied item.

But that is the main reason why we have reserved several times as much memory as was originally occupied by the random data. We can simply test the destination to see if it is indeed already ocuupied, and then compare the data already there with the data we were about to put there. It is quite likely that on either side of that data there will be some empty memory locations, so the current about-to-be-copied data-item can be placed correctly, in terms of being sorted.

Only rarely (if the original study of the randomized data was good enough) should it happen that several consecutive occupied memory locations need be tested as just described (with, probably, a simple "bubble sort" being used to get the latest data item into the appropriate place in that portion of the Big Block).

So, after all the original random data has been inflated into the Big Block, now we simply walk through the Big Block and copy the data back to the original block (deflating the data), in sorted order, ignoring the initialized/empty memory of the Big Block. The Big Block can now be returned to the Operating System for some other use.

So, after (A) one pass of emptying the Big Block, and (B) one pass of Primary comparison tests, and (C) a test of each location about-to-be-used, in the Big Block, to be sure it is empty, and (D) some small percentage of additional Secondary comparisons, depending on the quality of that Primary comparison, and (E) two copyings of each data item, which includes (F) a second pass of the Big Block, we are....

Ta-Done!