A while ago I came up with a sorting algorithm that uses slightly fewer than n lg n comparisons, while having the advantage that the results are output in order (i.e. the algorithm finds the first item, then the second, etc.). The one (major!) weakness is that the amount of "bookkeeping" required is substantial.
Nonetheless, if one has a lot of data to sort in a situation where comparisons are expensive and where getting the items served up "in order" is useful (e.g. if one can start doing something with the data once it starts coming out) the algorithm may be of some use.

Each data item must have associated with it three pieces of information: a 'weight' value (discussed below), status (active or inactive), and a list of data items it has "beaten" (also discussed below). A heap, tree, or chain-bucket data structure should be used to allow data items to be read out in the approximate order of weight.

Initially, all items are active and assigned a weight of one. The outer loop of the sort procedure is then:

- Examine active items pairwise, in order of weight. In each pair, the item that should sort earlier has the other item's weight added to its own; the item that should sort later is marked inactive and added to the winner's list of "vanquished" data items, keeping its own weight unchanged. Repeat until only one active item remains.
- The remaining active item is output and permanently removed; all items it had "vanquished" are returned to active status.
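The loop above can be sketched in Python roughly as follows. This is a minimal illustration, not the original implementation: I've assumed lowest-weight-first pairing with ties broken by arrival order, using a heap of active items as the post suggests.

```python
import heapq

def tournament_sort(items, wins=None):
    """Tournament sort as described above: active items are compared
    pairwise, lowest weights first; the winner absorbs the loser's
    weight and records it on its 'vanquished' list; the last active
    item is output and its vanquished list is reactivated.
    Returns (sorted_items, comparison_count)."""
    if wins is None:
        wins = lambda a, b: a <= b       # True if a should sort before b
    n = len(items)
    weight = [1] * n
    beaten = [[] for _ in range(n)]      # indices each item has "vanquished"
    heap = [(1, i, i) for i in range(n)] # active items as (weight, seq, index)
    heapq.heapify(heap)
    seq = n                              # seq breaks ties by arrival order

    result, comparisons = [], 0
    for _ in range(n):
        while len(heap) > 1:             # pair off until one item is left
            _, _, a = heapq.heappop(heap)
            _, _, b = heapq.heappop(heap)
            comparisons += 1
            w, l = (a, b) if wins(items[a], items[b]) else (b, a)
            weight[w] += weight[l]       # winner absorbs loser's weight
            beaten[w].append(l)          # loser goes inactive, remembered
            heapq.heappush(heap, (weight[w], seq, w))
            seq += 1
        _, _, champ = heapq.heappop(heap)
        result.append(items[champ])      # sole survivor is the next output
        for i in beaten[champ]:          # reactivate everything it beat
            heapq.heappush(heap, (weight[i], seq, i))
            seq += 1
    return result, comparisons
```

Run on the eight letters S, O, R, T, A, B, L, E (the worked example further down the page), this sketch sorts them in thirteen comparisons.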

On the first pass through the main loop, all data items will be active; since each comparison eliminates one item, the first pass of the outer loop will require n-1 comparisons before yielding the first item of output.

The second pass through the main loop will be much faster, since the only active items will be those that were compared directly against the eventual winner of the first pass. There will be roughly lg n of those, so the second pass will require roughly lg n comparisons.

The third pass will be even shorter, though exactly how much so will depend upon which data item from the second pass was the "winner".

I have run this algorithm with a 1,000,000 item data set and found that it can be implemented reasonably effectively. The "bookkeeping" is not totally unreasonable, though it is a bit complex. As to whether there are any situations where the cost of comparisons is sufficient to justify it, I have no idea. But perhaps someone here might.

Genetic Algorithms FAQ: http://www-2.cs.cmu...ai/genetic/top.html "Because I didn't want to go into too much depth here" [-alx, Jul 20 2002, last modified Oct 04 2004]


<Dons stupid hat> OK, apart from losing me halfway through here with lots of Gobble-a-Jook, how exactly is this useful? I mean, I understand that it sorts data using a cleverly augmented three-step processing system, but how is it superior/different to anything out there at the moment, and where does the Tournament aspect come in? Or maybe why does it come in? </Dons stupid hat> Oh yes, and when replying please remember that you are addressing someone whose education was almost solely anchored in the arts.

I don't use (or understand) quite a few things mentioned in the halfbakery, but I have no objection to others wandering in those fields. So I hope that algorithms are fair game. Even one like this, where the decisions required to implement the housekeeping outweigh the illusory benefits. I don't like giving fishbones, so I'll award a copy of Knuth's "Fundamental Algorithms" vol 1. Don't skip the exercises!

Likely more useful than the current & quite controversial American College "Football" schema. Also good for handicapping the horses. Well, not *handicapping* them...

Although [pfperry] might have been referring to the state
of my apartment in saying "the decisions required to
implement the housekeeping outweigh the illusory
benefits," I must agree. Even if the benefits were real, I
think you've fudged the number of comparisons -- I don't
see the basis for your assumption of lg n comparisons in
the second pass. It looks to me like if the original list is
close to being already ordered, the algorithm becomes
identical to a plain old selection sort, but less efficient.

I would give it a large fishbone with many ancillary
fishbone data structures attached, but I'm slightly awed
by the fact that you actually coded and ran this beast.

In most sorting applications, comparisons are not particularly expensive, granted. And yes I already have a copy of Knuth (if not, doing some of the bookkeeping would have been impossible). I put forth this algorithm not because I thought it was a good general-purpose alternative to Hoare's quicksort, but because I was curious whether anyone else here had seen anything like it, or could think of a real-world application where such a thing could be useful.

BTW, I conceived this algorithm while watching the Olympics one year. In many of the games, it is assumed that in pairwise competition the stronger player or team will always defeat the weaker. I was curious how many games would be required to produce a complete ranking of all the contestants. Obviously it would not make much sense to run such an algorithm with millions of "data items" (i.e. contestants), but in such an application the "comparisons" (contests) are much more expensive than any amount of associated bookkeeping.

BTW, I'm curious; for sorting many thousands of very long strings whose distribution favors neither radix-sorting nor early-exit comparisons, what sorting algorithm would one use? Also, can anyone think of any real-world applications (other than sports tournaments) where data comparisons can only be done by very computationally-expensive algorithms but where random-access comparisons are no worse than sequential-access?

This reminds me of tournament selection in genetic algorithms (GAs):

Roughly speaking, a GA generates possible solutions to a problem, initially at random. The algorithm then operates on a 'survival of the fittest' basis, where solutions which better solve the problem are more likely to survive and 'breed' with other solutions, in the hope of producing even fitter offspring.

Normally, each individual solution is placed in rank order and then chosen for reproduction probabilistically based on that rank. However, it's been shown that simply choosing two possible solutions and then taking the fitter of the two (tournament selection) produces comparable results, while being algorithmically simpler and sometimes faster.
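A tournament-selection step can be sketched like this. This is a minimal illustration of the general idea, not code from any particular GA library; the population, fitness function, and tournament size k are all toy assumptions.

```python
import random

def tournament_select(population, fitness, k=2):
    """Pick one parent by tournament selection: draw k candidates at
    random (without replacement) and return the fittest of them."""
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

# Selecting two parents for crossover from a toy population of
# candidate solutions, where fitness is just the value itself:
population = [0.1, 0.7, 0.3, 0.9, 0.5]
parents = [tournament_select(population, lambda x: x) for _ in range(2)]
```

The appeal is that no global rank ordering of the population is ever needed: each selection costs only k-1 comparisons.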

I used a system similar to supercat's when I investigated the possibilities of a better way to handicap horses, coming up with my own ranking order in the process. Several factors weighed in - not just the horse's record (which would be the least in quantity in terms of its totality) but also those of the jockey, trainer and owner. Factored in relating directly to the horse itself were results, whether on a time basis or finish order under given track conditions, along with its lineage. Implementing this, whether by looking at results in the newspaper or by actual presence at the track, had me slightly ahead of the break-even point the majority of the time, sometimes well ahead. Once in a while it was a bust, but those were in the early phases, before I recognized which factors needed the most weighting. Bear in mind the owners and trainers are extremely selective in what horses they choose to own or train. Bottom line: my results were always better than those of the various handicappers in the newspaper. Might this be applicable to stocks, medical diagnosis (narrowing symptoms to eliminate false positives > appropriate treatment) or DNA?

One use could be the sorting of pictures. It is *very* expensive to compare pictures, e.g. with computer-vision methods. This of course applies to any complex objects: texts, documents, music, ...

I'm very interested in your tournament sort algorithm. I need such algorithm with minimum comparisons for my software app. I'm very interested in getting the algorithm's results for my situation. If it works very well as you claim (minimum comparisons), I will pay you a lump sum to get the algorithm's code. So please contact me at guntherverstraeten@hotmail.com Thanx

[hob]: I missed your anno previously. The reason the second pass requires approximately lg(n) comparisons is that the first-place person will have played against approximately lg(n) people.

Perhaps I should illustrate with a small example. Suppose we have eight data items: S, O, R, T, A, B, L, and E, and we wish to put them in alphabetical order.

Initially we'll process pairwise comparisons S-O, R-T, A-B, L-E. As we process each of these comparisons, one data item is marked inactive and the other has its weight increased to 2. Thus, until all weight-1 data items are processed, none of the others will be; once all the weight-1 items are complete, there will be four weight-2 items: O, R, A, and E, which we pair as O-R, A-E. Then we end up with two weight-4 data items, which we pair as O-A. Then we end up with a single weight-8 item, which we output (A). Seven comparisons this round.

A had beaten three data items: O (weight 4), E (weight 2), and B (weight 1), so those get paired off as B-E (since those have lowest weights), then B (weight 3)-O. B wins and is the only thing left (with weight 7) so it's output. Two comparisons this round--nine so far.

Since the only things B ever beat were E and O, those get added back (with weights 2 and 4). Those get paired up and E wins; its weight is now 6, it gets output, and everybody it had beaten is added back. One comparison this round--ten so far.

In addition to having beaten O (weight 4) in the last round, E had also beaten L (weight 1) earlier. So we pair those and L wins with weight 5. One comparison this round--eleven so far.

The only thing L ever beat was O, so we add him back (still weight 4). Since he's the only guy around, we output him. No comparisons this round--still eleven so far.

O had beaten S (weight 1) and R (weight 2). So we add those back. R beats S, and is the last guy left with weight 3. So we output him. One comparison this round--twelve so far.

R had beaten S (weight 1) and T (weight 1), so we add those back. S beats T, and is the last guy left with weight 2. So we output him. One comparison this round--thirteen so far.

S had beaten T (weight 1) so we add him back. Since he's the only guy around, we output him. No comparisons this round--thirteen total.

So thirteen pairwise competitions completely sort eight data items.

Note, btw, that while the data for this example was chosen arbitrarily, it does happen to work unusually well for Tournament sort. Normal tournament sort average is 15.7 compares for eight items--right in line with the 'optimal' average of 15.299.

Also note, btw, that the Tournament Sort does quite well at small data sets; although it is possible to sort five items with seven comparisons worst case and Tournament Sort doesn't always do that, its average comparison count with five items is about 7.15--quite respectable.

FYI, real-world numbers, sorting one million items.

Number of comparisons by a perfect sort (lg(n!)) : 18,488,865
Average number of comparisons for Tournament Sort: 18,646,425 (100.85% of perfect)
Average number of comparisons for Borland C's qsort(): 22,439,507 (121.37% of perfect)

So there's a noticeable improvement in the number of comparisons vs. a standard quicksort implementation (I'll admit I don't know what optimizations Borland's qsort() includes).
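The lg(n!) figure is the information-theoretic lower bound on comparison counts, and it can be checked numerically (the exact last digits depend on rounding). A sketch using the standard-library log-gamma function:

```python
import math

def lg_factorial(n):
    """lg(n!), the information-theoretic lower bound on the number of
    pairwise comparisons needed to sort n distinct items.
    math.lgamma(n + 1) returns ln(n!); divide by ln 2 to get lg."""
    return math.lgamma(n + 1) / math.log(2)

# lg(8!) is about 15.3, matching the 'optimal' average quoted earlier
# for eight items; lg(1,000,000!) is about 18.49 million.
```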

//but because I was curious whether anyone else here had seen anything like it//

You mention heaps in the implementation. The overall tree you use is also heap ordered. I believe binary heaps and heapsort are more wasteful of comparisons than your sort, but are otherwise similar.

There are also binomial heaps, which are quite similar to your idea, but only pairs of items of exactly the same weight are compared and vanquished (using your terms). This leaves O(log n) trees. The remaining necessary comparisons to find the winner don't affect the vanquished lists, which is a bit wasteful if you're just using it for sorting.
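The equal-rank linking rule can be sketched as follows. This is a minimal illustration of binomial-heap-style insertion only (no delete-min), with names of my own choosing: each insertion "carries" like binary addition, so only trees of exactly the same rank are ever compared.

```python
class Tree:
    """A tree whose root has beaten 'rank' other roots."""
    def __init__(self, key):
        self.key = key
        self.rank = 0
        self.children = []          # the subtrees this root has "vanquished"

def link(a, b):
    """Join two trees of equal rank with one comparison; the smaller
    root wins and the loser joins its list of children."""
    if b.key < a.key:
        a, b = b, a
    a.children.append(b)
    a.rank += 1
    return a

def binomial_forest(keys):
    """Insert keys one at a time, carrying like binary addition:
    slot r holds at most one tree of rank r, so n items leave one
    tree per set bit in the binary representation of n."""
    slots = []                      # slots[r] is a rank-r tree or None
    for k in keys:
        t = Tree(k)
        r = 0
        while r < len(slots) and slots[r] is not None:
            t = link(t, slots[r])   # equal ranks collide: link and carry
            slots[r] = None
            r += 1
        if r == len(slots):
            slots.append(None)
        slots[r] = t
    return [t for t in slots if t is not None]
```

After all insertions, the overall minimum is one of the O(log n) roots, and those remaining root-vs-root comparisons are the ones that, as noted above, don't contribute to any vanquished list.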

There may be better explanations of binomial heaps on the web, I learnt about them from a textbook so I can't link to that.