== Combine boolean search with whitespace neutrality ==
By 'whitespace neutrality' I mean the style of writing preferred by the ancient Greeks and Romans, in which the words of a document are run together from beginning to end with no whitespace or punctuation. A text or hypertext file is thus prepared for indexing by stripping it of all but alphanumeric characters. Search strings are then matched as substrings of the stripped text and combined with boolean connectives, for example:
autonom and anarch
infallib and doctrin
(solar or photovoltaic) and (addtocart or addtobasket)
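
As a rough illustration of the idea, here is a minimal Python sketch of the stripping step and of the third example query. The function names are hypothetical, and folding everything to lowercase is my own assumption; the text above only specifies that non-alphanumeric characters are removed.

```python
def normalize(text: str) -> str:
    """Strip everything except alphanumeric characters.

    Lowercasing is an assumption added for this sketch so that
    'Add to Cart' and 'addtocart' compare equal.
    """
    return "".join(ch for ch in text if ch.isalnum()).lower()


def matches_example_query(doc: str) -> bool:
    """Evaluate (solar or photovoltaic) and (addtocart or addtobasket)
    by plain substring tests against an already-normalized document."""
    return ("solar" in doc or "photovoltaic" in doc) and (
        "addtocart" in doc or "addtobasket" in doc
    )


page = normalize("Solar panels: Add to Cart!")
print(page)
print(matches_example_query(page))
```

Because whitespace is gone before matching, 'addtocart' finds 'Add to Cart', 'add-to-cart' and 'AddToCart' alike, which seems to be the point of searching on run-together text.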
== Use text dissimilarity in 'ranking' pages ==
I don't know what is ideal or feasible in 'measures of text dissimilarity.' Perhaps 'number of substrings in common' would be appropriate.
I suggest the following method for selecting the first-ranked result. The search string 'autonom' is 7 characters long. One occurrence of a 7-character search string in a 7,000-character file makes up 0.1% of the whole file and is given the numeric value 0.001. Three occurrences of 'autonom' in a 7,000-character file would count for 0.003. The 'and' connective represents multiplication and the 'or' connective represents addition. Unary 'not' could be 1 when the string is absent and 0 when it is present. The first-ranked result should simply be the page for which this formula yields the largest value.
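
The arithmetic above can be sketched in Python as follows. This is only one reading of the proposal: the helper names are hypothetical, occurrences are counted without overlap (an assumption, since the text does not say how overlapping matches are treated), and documents are assumed to be already stripped to alphanumeric characters.

```python
def term_score(doc: str, term: str) -> float:
    """Fraction of the document's characters covered by the term.

    Uses str.count, which counts non-overlapping occurrences.
    """
    if not doc:
        return 0.0
    return doc.count(term) * len(term) / len(doc)


def not_score(doc: str, term: str) -> float:
    """Unary 'not': 1 when the term is absent, 0 when it is present."""
    return 0.0 if term in doc else 1.0


# A 7,000-character document containing 'autonom' three times.
doc = "autonom" * 3 + "x" * (7000 - 21)

autonom = term_score(doc, "autonom")   # 21 / 7000
anarch = term_score(doc, "anarch")     # term is absent here

and_value = autonom * anarch           # 'and' is multiplication
or_value = autonom + anarch            # 'or' is addition
print(autonom, and_value, or_value)
```

One consequence of treating 'and' as multiplication is that a page missing either term scores exactly zero, so it drops out of the ranking without any separate filtering step.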
The tenth-ranked result should not necessarily be the page with the tenth-largest value of the formula. It should be the page that is least textually similar to the first 9 results (perhaps to the first 9 results concatenated).
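
A greedy version of this ranking might look like the sketch below. The similarity measure is a stand-in of my own choosing (Jaccard overlap of character trigrams, loosely in the spirit of 'number of substrings in common'), not something the proposal settles on. The function names, the trigram length, and the decision to rank only pages with a nonzero score are likewise assumptions.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """All distinct character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams (assumed stand-in measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def rank(pages: dict[str, str], scores: dict[str, float], k: int) -> list[str]:
    """Highest-scoring page first, then repeatedly the remaining page
    least similar to the concatenation of the pages already chosen."""
    remaining = [name for name, value in scores.items() if value > 0]
    if not remaining or k <= 0:
        return []
    first = max(remaining, key=lambda name: scores[name])
    ranked = [first]
    remaining.remove(first)
    while remaining and len(ranked) < k:
        seen = "".join(pages[name] for name in ranked)
        nxt = min(remaining, key=lambda name: similarity(pages[name], seen))
        ranked.append(nxt)
        remaining.remove(nxt)
    return ranked


pages = {
    "a": "autonomanarchy",
    "b": "autonomanarchyx",   # near-duplicate of page a
    "c": "autonomzzzzqqqq",
}
scores = {"a": 0.9, "b": 0.8, "c": 0.1}
print(rank(pages, scores, 3))
```

In this toy run the near-duplicate page b is pushed below page c even though b has the higher formula value, which is the behaviour the paragraph above asks for.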
The top position is of course gameable by posting files that contain a single word, but the dissimilarity algorithm should impose a certain signal-to-noise ratio on the results as a whole.