h a l f b a k e r y
Assume a hemispherical cow.




Google Stylometrics Search

Don't be evil. Be nosy to the point of creepiness.
(+11, -1)

Search the Internet for the writings of any given author, even when you don't know the author's name. Enter a sample of text written by a particular author, the longer the better. Google will use a patented algorithm to analyze the text and extract the author's unique "writing-style fingerprint." This stylometric fingerprint is then compared against indexed fingerprints for all of the text in Google's database, and the highest likelihood matches are returned to you as search results.
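
As a rough illustration of the compare-and-rank step described above, here is a minimal Python sketch. It is not Google's algorithm: the `fingerprint` function is a hypothetical stand-in that only counts a handful of common function words, `rank_matches` and the `index` dictionary are names invented for this example, and cosine similarity is an assumed choice of similarity measure.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a "fingerprint": relative frequencies of a few
# common function words. A real system would use far more features.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "i", "it")

def fingerprint(text):
    """Return a vector of function-word frequencies for the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_matches(sample, index):
    """Return (doc_id, score) pairs for every indexed text, best match first."""
    query = fingerprint(sample)
    scored = [(doc_id, cosine(query, fingerprint(text)))
              for doc_id, text in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_matches(sample, index)` with `index` mapping document ids to their text returns every document with a similarity score, so the caller can decide where to cut off the result list.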

This search technology has obvious implications for the fight against cybercrime (see link), but there are countless conventional uses as well. Do you want to put a face to the person sending you unsigned love emails? Concerned that a member of your creative writing staff is wasting time on the Halfbakery? Or do you simply want to find more works by an anonymous poet? Google Stylometrics Search may be able to help you in each of these endeavors.

Once an author's unique fingerprint is known, Google can also provide you with a dossier of that individual, based on the combined content of all search matches. To which websites does this person frequently contribute? About which topics is this person passionate? Does he or she have the personality of a sociopathic CEO or a meek little lily?

The stylometric fingerprint is created by an algorithm that combines the following characteristics of an author's prose:
* word, grammar, and punctuation choice,
* common spelling errors and typos,
* average syllables per word, words per sentence, and words per paragraph,
* statistically improbable phrases (a concept poached from Amazon, see link),
* and other secret ingredients
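
A few of the measurable items in the list above can be sketched in Python. This is only an illustration under stated assumptions: `style_features` and `count_syllables` are hypothetical names, the syllable count is a crude vowel-run heuristic (it miscounts silent vowels), and spelling errors, improbable phrases, and the secret ingredients are not covered.

```python
import re

def count_syllables(word):
    """Crude heuristic: count runs of vowels; every word gets at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def style_features(text):
    """Return a dict of simple per-author style measurements."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words) or 1  # avoid dividing by zero on empty input
    return {
        "syllables_per_word": sum(count_syllables(w) for w in words) / n_words,
        "words_per_sentence": len(words) / (len(sentences) or 1),
        "words_per_paragraph": len(words) / (len(paragraphs) or 1),
        "commas_per_word": text.count(",") / n_words,
        "semicolons_per_word": text.count(";") / n_words,
    }
```

The resulting dictionary could serve as one slice of a fingerprint; how the features would be scaled and combined is left open here.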

<< Are you scared yet? >>

swimswim, Jun 26 2009

(?) From Fingerprint to Writeprint http://ai.eller.ari...to%20Writeprint.pdf
Example that the mechanisms are already there for Google's pickings. [swimswim, Jun 26 2009]

Statistically Improbable Phrases http://en.wikipedia..._Improbable_Phrases
[swimswim, Jun 26 2009]

Plagiarism Checker http://www.plagiarismchecker.com/
one of many [xenzag, Jun 27 2009]

Plagiarism checker on swimswim's text. http://www.google.c...nger+the+better.%22
Illustration that plagiarism checkers can locate original works based on subsets of it, but not all the other work by that same author. plagiarismchecker.com just pumps your text into a Google search, so this is the resulting link. [swimswim, Jun 27 2009]

Googlewhack http://en.wikipedia.org/wiki/Googlewhack
Note that "stylistic googlewhack" is using googlewhack as an analogy for how this software works. [swimswim, Jun 27 2009]


       I think I like it more on a theoretical level than practical [+]
FlyingToaster, Jun 26 2009

       There was another anno here about the implications of Internet hijinks on a child's future career in politics; but it seems to have been deleted. I will try to re-create it: If we could see what types of random things Barack Obama had posted online as a kid (if the Internet had been as widespread then), how would that have affected his presidential campaign? This will undoubtedly be an issue facing the current generation of children; not just for entry into politics, but probably for a lot of job applications as well (not to mention on the dating scene!).
swimswim, Jun 27 2009

       I would imagine a child's //stylometric// would differ from the same individual's adult //stylometric//. So I don't see too many problems there.   

       This would be a pretty good marker for plagiarism: it could establish that, within a given document, certain parts differ from other parts (e.g. paragraph 3 differs stylometrically from the rest of the essay). You would have to work in some form of context sorter here, whereby it is possible to separate main body from conclusion, as they should necessarily differ.
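
       A minimal sketch of that within-document check, in Python. Everything here is an assumption for illustration: `odd_paragraphs` is a made-up name, mean sentence length is the only style measure used, the 1.5 threshold is arbitrary, and no context sorter is included, so a legitimately different conclusion would be flagged too.

```python
import re
import statistics

def mean_sentence_length(paragraph):
    """Average number of words per sentence in one paragraph."""
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    words = re.findall(r"[A-Za-z']+", paragraph)
    return len(words) / (len(sentences) or 1)

def odd_paragraphs(paragraphs, threshold=1.5):
    """Return indexes of paragraphs whose sentence length is a z-score outlier."""
    lengths = [mean_sentence_length(p) for p in paragraphs]
    mean = statistics.mean(lengths)
    spread = statistics.pstdev(lengths)
    if spread == 0:
        return []  # all paragraphs look alike on this measure
    return [i for i, x in enumerate(lengths)
            if abs(x - mean) / spread > threshold]
```

       With only a handful of paragraphs the z-score cannot get very large, so the threshold would need tuning for short essays.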

       I am pretty sure this idea would work in *certain* cases, i.e. comments and posts. However, there are chameleons of prose that might escape your system.

       Most of all, you may end up with a statistically large subset of false positives, due to qwerty *laypouts* and *there* associated common errors.
4whom, Jun 27 2009

       Software already exists like this and is used by universities to detect plagiarism - a common trend amongst the new dumbed-down generation of students. Will try to find link.
xenzag, Jun 27 2009

       That's a good point [xenzag] and I had intended to include it in a link, but I think that the plagiarism detection software is used to find the same works that have simply been "rephrased". This stylometric search looks for the same style in different original works; so not plagiarism.
swimswim, Jun 27 2009

       I think my mic must be off...
4whom, Jun 27 2009

       Sorry [4whom], somehow I missed your anno:

The way I envision it, the plagiarism detectors are a subset of this stylometric search. They could help to point out plagiarism just as you say, by demonstrating that different paragraphs in one essay, for example, are stylometrically different. But this goes beyond that by also producing a list of all the other works that appear to be unique to an individual, with some degree of statistical confidence.

       False positives due to common typos, etc:
In fact, I think this approach would avoid these types of false positives, because it would detect those elements of an individual's style that -differ- from those things that are common to the population at large.

       An individual's stylometric fingerprint changes over time:
You're exactly right, so here's where social science comes in, specifically developmental psychology of personality. This, of course, is a bit more of a stretch, but the algorithm could chart out changes in the individual's education and overall character over time, and use this chain of changes to link an adult to a child.
swimswim, Jun 27 2009

       //but the algorithm could chart out changes in the individual's education and overall character over time, and use this chain of changes to link an adult to a child.//   

       If you can do that (non-invasively) consider me a buyer of some of your equity. In my heart I know it's possible; I have investigated several mechanisms for this very thing (well, similar).

       As to false positives, they are something you just have to get used to. They occur rather Bayesianly, because they do. I am suggesting that you *wilkl* have statistically significant reams of faults due to common errors *taht* we all make due either to the keyboard layout (common to most) or educational inputs common to a large group. These faults will be further skewed by error correction before publication, such as Word's autocorrect. This autocorrect feature will skew (arbitrate) your end results.
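
       To put a number on "Bayesianly", here is a small Python sketch of Bayes' rule applied to a reported match. The function name and all three input rates are invented for illustration; nothing here is measured from a real stylometric system.

```python
def posterior_same_author(prior, true_positive_rate, false_positive_rate):
    """P(same author | reported match), by Bayes' rule."""
    hit = true_positive_rate * prior
    false_alarm = false_positive_rate * (1 - prior)
    return hit / (hit + false_alarm)

# Made-up numbers: one candidate in a million is the real author, the matcher
# catches the real author 99% of the time and wrongly flags 0.1% of others.
chance = posterior_same_author(1e-6, 0.99, 0.001)
print(f"{chance:.4%}")
```

       Under those assumed rates, a reported match is the real author only about one time in a thousand, because the pool of other writers is so large.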
4whom, Jun 27 2009

       I phrased that poorly. I meant //chart out changes in educ. etc.// to mean that the algorithm will know the dates that things were first published online, and it will detect that, for example, certain vocabulary terms are introduced over time. I didn't mean to imply that the algorithm knows that you got your degree from Alabama State, and therefore includes this in its inferences.

       And yes, statistically speaking, false positives will occur. But my point is that if we all make certain mistakes due to keyboard layout or automated grammar checkers, then the algorithm will note these as common to the population, and look for other things that are unique to the individual.   

       I think another way to put this is that the algorithm picks up what we might call "stylistic googlewhacks" (linked), and uses those to ID an individual. Results are ranked by likelihood of the match, so they incorporate statistical confidence and possible errors.
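
       One way to sketch the "common to the population gets ignored" part, in Python. This swaps in a smoothed inverse-document-frequency weighting as an assumed stand-in; `distinctive_tokens` and `document_frequencies` are names made up for this example, and whitespace-separated tokens are a simplification.

```python
import math
from collections import Counter

def document_frequencies(population_docs):
    """Count, for each token, how many population documents contain it."""
    doc_freq = Counter()
    for doc in population_docs:
        doc_freq.update(set(doc.lower().split()))
    return doc_freq

def distinctive_tokens(sample, population_docs):
    """Rank the sample's tokens from rarest to most common in the population."""
    doc_freq = document_frequencies(population_docs)
    n_docs = len(population_docs)
    scores = {
        # Smoothed so that tokens unseen in the population score highest
        # and tokens present in every document score exactly zero.
        tok: math.log((n_docs + 1) / (doc_freq[tok] + 1))
        for tok in set(sample.lower().split())
    }
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
```

       A typo that everybody makes ends up with a weight of zero, while a token almost nobody else uses floats to the top of the list.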
swimswim, Jun 27 2009

       Oh and also, yes, this is -very- invasive, which is why I find its implications so damn creepy.
swimswim, Jun 27 2009

       Good one. I do this by hand from time to time, but it's always helpful to have at least a clue.
ldischler, Jun 27 2009

       What really creeps me out is this has become almost possible. And that is probably why I am overdefending its weaknesses.   

       Of course it does provide a gap for "let me obfuscate that for you" dot com...
4whom, Jun 27 2009

       There was this presentation with a lab-on-chip thingie. The researchers promised it would be able to taste out 4 wines. Journalists smuggled a glass of water into the experiment, which (foreseeably) got identified as one of the wines. Hilarity ensued...

       How that is relevant to this idea: Although stylometrics exists, it has to be trained on a rather large body of text, and will offer rather vague identifications even then. For show, differentiating Nietzsche and Camus would be pretty easy, but smuggling the works of, say, Rowling into the experiment would bust it, showing which of the two the Potter-creator resembles most, but nothing else. Fishing from the huge text-pond of the Internet for all the authors whose writing characteristics resemble those of one particular person will generate a wealth of false positives.

       Consider this forum: how many ideas have been posted to which the first anno read: 'I thought this was Vernon's'?
loonquawl, Jun 29 2009

       [loonquawl], touché, you make a good point. And I think the differences between physical fingerprinting (of fingers) and this stylometric version (or enological fingerprinting) warrant investigation. Such an analogy is not so straightforward. For one, people can't copy the fingerprint of another person's hand, but people often model their writing style on that of others, creating more similarities in writing styles, as you pointed out. This technology will work best to identify the more eccentric of writers. Unlike the enological case, however, the writing style fingerprint can be calibrated off of the jazillions of documents in Google's brain.
swimswim, Jun 29 2009

       //However there are chameleons of prose that might escape your system.// I may have said all that already, <does this mic work?>
4whom, Jun 29 2009

