halfbakery
Futility is persistent.


# Sorting Algorithm

Inflation followed by deflation
(+1)

There are a number of sorting algorithms already out there. This one I devised independently, and I don't recall ever encountering it in my various studies of computer-related stuff. That doesn't mean it isn't known, however --it doesn't even mean it isn't widely-enough known that this Idea should be deleted. We shall see!

When I started working heavily with computers about 30 years ago, the quantity of computer memory, available to use for such tasks as sorting, was rather limited. So, after some memory was filled up with data that needed to be sorted, the sorting algorithms of the day were designed to work on the data WHILE that data occupied THAT memory. The fastest such algorithm was called "the quicksort" (see link).

I confess I haven't paid much attention to what was done, with respect to fast sorting algorithms, after computers began having a lot of spare memory due to lower RAM prices. But I certainly thought of one, myself! Today, while considering posting it here, I made a cursory look to see what was "out there" (see link).

I didn't see this Idea, so....

Start with a block of memory containing a lot of random data that you want to sort. This algorithm is a bit nongeneric in the sense that the programmer implementing it needs to know something about that block of data. (On the other hand, all sorting algorithms have a "comparison test" that must be suited to the data being compared!)

In this case, however, what the programmer needs to know is something about how DIFFERENT each piece of data is from all the others. The simplest way to describe that "difference" is to compare some words: "strength" and "stretch" share their first 4 letters, so that pair of data items is rather more similar than different, compared with a pair such as "satisfy" and "submerge".

The programmer also needs to know something about the total range encompassed by the random data. The 4 quoted words in the previous paragraph all begin with the letter "s", so obviously that range is narrower than if the whole alphabet had been encompassed.

The actual comparison test the programmer needs to write can be described as yielding NOT a simple "smaller than" or "greater than" result; instead it yields a result describable as a "percentage of the total range". Note that if the data is fairly different, a simple percentage like 37% might be useful enough, while if the data has a lot of similarity (or if the data set is large enough), it might be better if the comparison test yielded a more precise percentage, say 37.586%.
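The "percentage of the total range" test above can be sketched as a tiny function. The names (`range_fraction`, `lo`, `hi`) are illustrative, not from the post; it assumes numeric data with a known minimum and maximum.

```python
# Hypothetical "percentage of range" comparison: instead of a pairwise
# less-than/greater-than, map an item to where it falls in the known
# range of the data, as a fraction between 0 and 1.
def range_fraction(value, lo, hi):
    """Return where `value` falls in [lo, hi] as a fraction in [0, 1)."""
    return (value - lo) / (hi - lo + 1)

# Example: 37 sits 37% of the way through the range 0..99.
print(range_fraction(37, 0, 99))  # -> 0.37
```

For string data such as the quoted words, the same idea would treat leading characters as digits (see the base-27 suggestion in the annotations below the post).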

AFTER figuring out how accurate a range-comparison test is needed, the programmer is in a position to make a good estimate of how much memory to request from the Operating System, for running this algorithm. We ALWAYS want to reserve several times as much memory as the unsorted data already occupies. (The comparison test chosen is strongly related to the decision whether to request 3 times as much memory or 10 times as much memory --or some other quantity, of course.) I'll call it the Big Block.

After initializing the Big Block (emptying it, usually), we are ready to run the algorithm on the data. For each data item, we compare it to the overall RANGE of randomized data, and translate that result into a memory location in the Big Block. Then we copy that data item to that memory location. That is, if our data item is figured to be 37% of the way through the range of random data, then we place it 37% of the way through the Big Block.

The net effect is that the random data gets sorted while it is simultaneously getting placed into the Big Block. We will have "inflated" the original data across a large amount of memory.
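The "inflation" pass just described can be sketched in a few lines. This is an illustrative sketch, not the post's own code; it assumes numeric data and a Big Block large enough that no two items land on the same slot (the collision case is handled in the next paragraphs).

```python
# Minimal sketch of the "inflation" pass: each item is written into a
# slot proportional to its fraction of the data's range. `factor` is
# how many times larger the Big Block is than the original data.
def inflate(data, factor=4):
    lo, hi = min(data), max(data)
    big = [None] * (len(data) * factor)   # the "Big Block", pre-emptied
    for x in data:
        # primary comparison: fraction of range -> slot index
        pos = int((x - lo) / (hi - lo + 1) * len(big))
        big[pos] = x                      # assumes this slot is free
    return big

big = inflate([42, 7, 99, 23, 64])
print([x for x in big if x is not None])  # -> [7, 23, 42, 64, 99]
```

Note that the data comes out in sorted order simply by reading the Big Block left to right, exactly as the text describes.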

There is an obvious complication that can happen if the comparison test isn't "fine" enough, or if the Big Block is too small --a just-tested data item might have a location computed which is occupied by a previously-copied item.

But that is the main reason why we have reserved several times as much memory as was originally occupied by the random data. We can simply test the destination to see if it is indeed already occupied, and then compare the data already there with the data we were about to put there. It is quite likely that on either side of that data there will be some empty memory locations, so the current about-to-be-copied data item can be placed correctly, in terms of being sorted.

Only rarely (if the original study of the randomized data was good enough) should it happen that several consecutive occupied memory locations need be tested as just described (with, probably, a simple "bubble sort" being used to get the latest data item into the appropriate place in that portion of the Big Block).
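The collision handling just described can be sketched as a small local insertion (the "bubble sort" mentioned above). `place` is an illustrative helper, not code from the post; it assumes the Big Block always has at least one empty slot to the right of any occupied cluster.

```python
# If the computed slot is taken, walk right past any occupants that
# should sort before x, then insert x there, nudging larger occupants
# one slot further right until an empty slot absorbs the shift.
def place(big, pos, x):
    while big[pos] is not None and big[pos] <= x:
        pos += 1                       # skip occupants that belong before x
    while big[pos] is not None:        # bubble larger occupants rightward
        big[pos], x = x, big[pos]
        pos += 1
    big[pos] = x
```

For example, placing 4 into `[None, 3, 5, None, None]` at slot 1 skips the 3, displaces the 5, and yields `[None, 3, 4, 5, None]` -- local order is preserved.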

So, after all the original random data has been inflated into the Big Block, now we simply walk through the Big Block and copy the data back to the original block (deflating the data), in sorted order, ignoring the initialized/empty memory of the Big Block. The Big Block can now be returned to the Operating System for some other use.
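The "deflation" pass is the simplest step: walk the Big Block and keep only the occupied slots. A sketch, with illustrative names:

```python
# Walk the Big Block in order, copying out every occupied slot and
# skipping the initialized/empty ones.
def deflate(big):
    return [x for x in big if x is not None]

print(deflate([None, 7, None, 23, None, 42]))  # -> [7, 23, 42]
```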

So, after (A) one pass of emptying the Big Block, and (B) one pass of Primary comparison tests, and (C) a test of each location about-to-be-used, in the Big Block, to be sure it is empty, and (D) some small percentage of additional Secondary comparisons, depending on the quality of that Primary comparison, and (E) two copyings of each data item, which includes (F) a second pass of the Big Block, we are....

Ta-Done!

 — Vernon, Sep 08 2011

As mentioned in the main text --the also-mentioned "bubble sort" is presented here, also. [Vernon, Sep 08 2011]

A Google search, as mentioned in the main text. [Vernon, Sep 08 2011]

Merge Sort http://en.wikipedia.org/wiki/Merge_sort
[Klaatu, Sep 08 2011]

Wikipedia: Bucket Sort http://en.wikipedia.org/wiki/Bucket_sort
I've thought about this before, and I think what you are describing would fall under the classification of a "Generic Bucket Sort" [zen_tom, Sep 08 2011]

So... a bell-curve sort of sorts? Compare the current element with the element behind and the one ahead (which I suppose means the very first element actually ends up getting tossed into the output last, 'cuz you have to wait until the last element comes along to be the one "behind" it).
 — FlyingToaster, Sep 08 2011

There's a way of implementing a "Bucket Sort" that sounds like this - Say you have 26 words, each beginning with a different letter of the alphabet, you can trivially place them into 26 "buckets", each representing a letter of the alphabet. It works brilliantly for smoothly distributed data, and yes, falls down a bit when there are multiple instances of the same thing. What's great about it is that the data almost sorts itself. If you're short on memory, you can "double-up" the buckets, so that you only need 13 of them, the first holding a and b, the second c and d etc. As the buckets are filled, they can be expanded and/or split out.
 — zen_tom, Sep 08 2011
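The letter-bucket variant [zen_tom] describes can be sketched as follows. This is an illustrative sketch (names are not from the annotation); it assumes lowercase words beginning with a-z, and sorts each bucket with a simple built-in sort.

```python
# 26 buckets keyed by first letter; the data "almost sorts itself"
# because concatenating the buckets in letter order is most of the work.
def bucket_sort_words(words):
    buckets = [[] for _ in range(26)]
    for w in words:
        buckets[ord(w[0].lower()) - ord('a')].append(w)
    out = []
    for b in buckets:
        out.extend(sorted(b))   # small buckets: any simple sort works
    return out

print(bucket_sort_words(["banana", "apple", "cherry", "avocado"]))
```

The "double-up" trick in the annotation would just use `(ord(w[0]) - ord('a')) // 2` as the bucket index, halving the bucket count.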

I was going to say something similar - you need to have some knowledge of how skewed the data distribution within the range is.
 — hippo, Sep 08 2011

Just right fill with blanks and translate from base 27 (or 53) to base 2... then subtract words from each other for a perfect "percentage difference".
 — FlyingToaster, Sep 08 2011
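[FlyingToaster]'s base-27 suggestion can be sketched like this. The helper name and the fixed `width` are illustrative assumptions; words longer than `width` would need a larger width, and only lowercase letters and blanks are handled.

```python
# Pad words to equal length with blanks, then read the letters as
# digits in base 27 (blank = 0, a = 1, ..., z = 26). Subtracting two
# keys then gives an exact numeric "distance" between the words.
def word_key(w, width=8):
    w = w.lower().ljust(width)          # right-fill with blanks
    key = 0
    for ch in w:
        digit = 0 if ch == ' ' else ord(ch) - ord('a') + 1
        key = key * 27 + digit
    return key
```

Because the padding digit (0) sorts before every letter, shorter words correctly sort ahead of their extensions, e.g. `word_key("apple") < word_key("apples")`.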

[hippo] and [Flying Toaster], the main text mentions using a more accurate percentage, such as 37.586% when it is known that some of the data items are very similar to each other. AND it states that this information should be used to determine how much memory to request, with respect to the planned "inflation" of the data across that memory. So, with enough accuracy and enough memory, you can always avoid the situation where the Primary Comparison Test indicates that two different data items should go into the same location.
 — Vernon, Sep 08 2011

