h a l f b a k e r y
Renovating the wheel
== Combine boolean search with whitespace neutrality ==
By 'whitespace neutrality' I mean the style of writing preferred by the ancient Greek and Roman civilizations, in which the words of a document are run together from beginning to end with no whitespace or punctuation. A text or hypertext file is thus prepared for indexing by stripping it of all but alphanumeric characters. Boolean searches on substrings can then be run against the stripped text, for example:
autonom and anarch
infallib and doctrin
(solar or photovoltaic) and (addtocart or addtobasket)
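The stripping step and the example queries can be sketched in a few lines of Python. This is only an illustration under assumptions not in the idea itself: the function names `strip_to_alnum` and `matches` are made up, queries are written as nested tuples rather than parsed from text, and only ASCII letters and digits are kept.

```python
import re


def strip_to_alnum(text: str) -> str:
    """Lowercase the text and drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def matches(stripped: str, query) -> bool:
    """Evaluate a boolean query against already-stripped text.

    A query is either a plain substring, or a tuple of the form
    ("and", q1, q2, ...), ("or", q1, q2, ...) or ("not", q).
    This tuple representation is an assumption made for the sketch.
    """
    if isinstance(query, str):
        return query in stripped
    op, *args = query
    if op == "and":
        return all(matches(stripped, q) for q in args)
    if op == "or":
        return any(matches(stripped, q) for q in args)
    if op == "not":
        return not matches(stripped, args[0])
    raise ValueError(f"unknown operator: {op!r}")


page = strip_to_alnum("Solar panels on sale. Add to cart today!")
query = ("and", ("or", "solar", "photovoltaic"), ("or", "addtocart", "addtobasket"))
print(page)                  # solarpanelsonsaleaddtocarttoday
print(matches(page, query))  # True
```

Because word boundaries are gone, 'addtocart' matches "Add to cart", "add-to-cart" and "AddToCart" alike. The same property means a substring can also match across what used to be a word boundary.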
== Use text dissimilarity in 'ranking' pages ==
I don't know what is ideal or feasible in 'measures of text dissimilarity.' Perhaps 'number of substrings in common' would be appropriate.
I suggest the following method for selecting the first-ranked result. Each search string is scored by the fraction of the file it occupies. The search string 'autonom' is 7 characters long, so a single occurrence in a 7,000-character file is 0.1% of the whole file and has a numeric value of 0.001. Three occurrences of 'autonom' in a 7,000-character file would count for 0.003. The 'and' connective represents multiplication and the 'or' connective addition. Unary 'not' could be 1 for absent and 0 for present. The first-ranked result should simply be the page for which this formula yields the largest value.
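The scoring rule can be written out as a small recursive function. This is a sketch, not a definitive implementation: the name `score` and the nested-tuple query format are assumptions, occurrences are counted without overlap (Python's `str.count`), and the file is assumed to be already stripped to alphanumerics.

```python
import math


def score(stripped: str, query) -> float:
    """Score a stripped file against a query.

    A plain string scores occurrences * len(string) / len(file).
    ("and", ...) multiplies, ("or", ...) adds, and ("not", q) is
    1 when q is absent and 0 when it is present.
    """
    if not stripped:
        return 0.0
    if isinstance(query, str):
        return stripped.count(query) * len(query) / len(stripped)
    op, *args = query
    if op == "and":
        return math.prod(score(stripped, q) for q in args)
    if op == "or":
        return sum(score(stripped, q) for q in args)
    if op == "not":
        return 1.0 if score(stripped, args[0]) == 0 else 0.0
    raise ValueError(f"unknown operator: {op!r}")


# A 7,000-character file containing 'autonom' once, padded with filler.
one_hit = "autonom" + "x" * 6993
# The same length of file containing 'autonom' three times.
three_hits = "autonom" * 3 + "x" * (7000 - 21)

print(score(one_hit, "autonom"))
print(score(three_hits, "autonom"))
print(score(one_hit, ("and", "autonom", ("not", "anarch"))))
```

The first two calls reproduce the 0.001 and 0.003 figures from the text, up to floating-point rounding. Since 'and' multiplies, a page missing either term scores zero for the whole conjunction, which is the behavior one would want from a boolean 'and'.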
The tenth-ranked result should not necessarily be the page with the tenth-largest value of the formula. It should be the page that is least textually similar to the first 9 results (perhaps to the first 9 results concatenated?).
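One way to read this is as a greedy selection: take the top-scoring page first, then repeatedly take whichever remaining page has the least in common with everything already chosen. The sketch below makes several assumptions that are not part of the idea: 'substrings in common' is taken to mean distinct shared 5-character substrings, ties are broken by score and then by name, and the names `ngrams`, `shared_substrings` and `diverse_ranking` are made up. The pages passed in are assumed to match the search terms already.

```python
def ngrams(text: str, n: int = 5) -> set:
    """Return the set of distinct n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_substrings(a: str, b: str, n: int = 5) -> int:
    """Count distinct n-character substrings that occur in both a and b."""
    return len(ngrams(a, n) & ngrams(b, n))


def diverse_ranking(pages: dict, scores: dict, k: int = 10) -> list:
    """Rank up to k page names.

    pages maps a page name to its stripped text; scores maps a page name
    to its value under the scoring formula. The first result is the
    highest-scoring page. Each later result is the remaining page that
    shares the fewest substrings with the concatenation of the results
    chosen so far.
    """
    remaining = sorted(pages)
    if not remaining:
        return []
    first = max(remaining, key=lambda name: scores[name])
    ranked = [first]
    remaining.remove(first)
    while remaining and len(ranked) < k:
        seen = "".join(pages[name] for name in ranked)
        nxt = min(
            remaining,
            key=lambda name: (shared_substrings(pages[name], seen), -scores[name], name),
        )
        ranked.append(nxt)
        remaining.remove(nxt)
    return ranked


pages = {
    "a": "autonomyandanarchyinspain",
    "b": "autonomyandanarchyinspainagain",
    "c": "anarchistsonautonomousvehicles",
}
scores = {"a": 0.5, "b": 0.4, "c": 0.1}
print(diverse_ranking(pages, scores))  # ['a', 'c', 'b']
```

Page "b" scores higher than "c" but is a near copy of "a", so it drops to last place. A raw count of shared substrings favors very short pages, so a real measure would probably need some normalization by length.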
The top position is of course gameable by posting files containing a single word, but the dissimilarity algorithm should impose a certain signal-to-noise ratio on the results as a whole.
||I got lost partway through this, but I think your proposed similarity search has some features in common with the sequence similarity searches used in genomics and proteomics (DNA and protein).
||Either Google or AltaVista used to be able to do ANDs and ORs as well as parentheses. Your disambiguation idea sounds good, except you wouldn't want to go to that page, just refine your search to similar ones.
||//Greek and Roman// had the advantage of common suffixes so reading a flowing text wouldn't require much thought.
||If that tenth result is the first nine concatenated, won't it be more relevant than all of them? And won't that encourage plagiarism?
||No, the tenth result is whatever (among documents that match the search terms) is most dissimilar to the first nine concatenated.
||I think what I meant is: If somebody makes a new webpage by combining the first nine results, shouldn't that new combined page be considered relevant by this algorithm? Wouldn't it take the #1 spot and cause those nine to disappear from the results (or move down the list quite far) because they're so similar to it?