Vocabulary Highlighter   (+1)
Make the browser automatically PREDICT and highlight unknown and/or forgotten words

The idea is to have the browser automatically highlight the words in a text that the user probably doesn't know or doesn't understand well, so that the user can easily identify the words s/he doesn't know and make the reading more understandable.

How would it work?

There are many ways to make a computer highlight rarer words. There is a large set of words that are rarely used in English. However, it wouldn't be good to highlight all of these words, because a person could be familiar with them anyway.

I think it would be better to create something like a word-highlighting service. Whenever a person reads a text and wants highlighting, s/he can turn the service on. Once it is on, it highlights words in different colors that denote the probability that the word is unknown to the user.
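As a rough illustration of the color idea, here is a minimal sketch that maps a probability to a highlight color. The function name, the thresholds, and the color names are all assumptions made up for this example; nothing above fixes them.

```python
def highlight_color(p_unknown):
    """Map the probability that a word is unknown to a highlight color.

    Returns None when the word is probably known and needs no highlight.
    The thresholds and color names are illustrative assumptions.
    """
    if not 0.0 <= p_unknown <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_unknown >= 0.75:
        return "red"       # very likely unknown
    if p_unknown >= 0.50:
        return "orange"    # probably unknown or forgotten
    if p_unknown >= 0.25:
        return "yellow"    # possibly forgotten
    return None            # likely known: leave the word alone


for p in (0.9, 0.6, 0.3, 0.1):
    print(p, highlight_color(p))
```

A real service would probably let the user pick the colors and the cut-off points.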

The probability could be calculated from the information in the user's account, the law of forgetting (Ebbinghaus's forgetting curve), the user's empirically estimated forgetting patterns (as every person's memory is slightly different), the general frequency of each word in English, the person's age, and possibly other variables such as education, interests, and spoken languages. However, setting all the other variables aside, the two most important would be:

1) How frequently and when the word has been seen (read) since it was marked as defined. (The Vocabulary Highlighter's database should record the actual moments each word was seen/read.)
2) How frequently and when the word has been marked as defined/understood.

And the most important part of the Vocabulary Highlighter program would be applying the user's forgetting patterns to predict the likely-unknown or possibly-forgotten words.
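To make the two variables above concrete, here is a minimal sketch of one possible estimate. It assumes the forgetting curve is approximated by the common exponential form R = exp(-t / S), that every exposure (a "defined" mark or a later sighting) multiplies the stability S by a fixed factor, and that a never-defined word falls back on a rarity-based prior. The function names and the parameters base_stability and growth are hypothetical; a real service would fit them to each user.

```python
import math


def retention(days_since_last_exposure, stability_days):
    """Exponential approximation of the forgetting curve: R = exp(-t / S)."""
    return math.exp(-days_since_last_exposure / stability_days)


def p_unknown(now, defined_at, seen_at, rarity,
              base_stability=2.0, growth=2.0):
    """Estimate the probability that a word is unknown or forgotten.

    now        -- current time, in days
    defined_at -- times (in days) when the word was marked "as defined"
    seen_at    -- times (in days) when the word was seen/read
    rarity     -- prior probability in [0, 1] for a never-defined word
    base_stability, growth -- assumed per-user memory parameters
    """
    past_definitions = [t for t in defined_at if t <= now]
    if not past_definitions:
        return rarity  # nothing known about this word yet
    first_defined = min(past_definitions)
    # Only exposures from the first definition onward count as rehearsal.
    exposures = sorted(
        t for t in list(defined_at) + list(seen_at)
        if first_defined <= t <= now
    )
    # Each exposure after the first makes the memory more stable.
    stability = base_stability * growth ** (len(exposures) - 1)
    elapsed = now - exposures[-1]
    return 1.0 - retention(elapsed, stability)


print(p_unknown(12, [10], [], rarity=0.3))    # defined once, 2 days ago
print(p_unknown(13, [10], [11], rarity=0.3))  # defined, then seen once more
```

With these assumed parameters, a word that was seen again after being defined comes out as less likely forgotten than one that was only defined, which is the behavior the two variables are meant to capture.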

The data about the user's vocabulary should be continuously accumulated and stored both on the user's computer (which makes the highlighting fast) and on a remote server, synchronized from time to time (so the information is not lost over time or when the user changes computers).
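One simple way to picture the synchronization, as a sketch only: if both the local store and the server keep, for each word, the set of moments it was seen or defined, then syncing can be a plain union of the two sets, so no event is lost whichever side recorded it. The data layout and the function name are assumptions for this example.

```python
def merge_histories(local, remote):
    """Merge two {word: set of event timestamps} stores by set union.

    Neither input is modified; the result contains every event that
    either side has recorded.
    """
    merged = {}
    for word in local.keys() | remote.keys():
        merged[word] = set(local.get(word, ())) | set(remote.get(word, ()))
    return merged


home_pc = {"anticoagulant": {1, 2}}
server = {"anticoagulant": {2, 3}, "thrombosis": {5}}
print(merge_histories(home_pc, server))
```

Because a union gives the same result regardless of order or repetition, it does not matter which computer syncs first or how often.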

How would the browser know whether you are reading a word or not? Simple: if you are reading a paragraph and mark a word "as defined" (meaning you have just looked it up in a dictionary), the program assumes that all the words before it in the same paragraph have been seen and read. If the user marks another word "as defined" in the same paragraph, the program assumes that the words between the two "defined" words have been seen too. However, the whole paragraph is not always read, so I think the program should ask whether you have read the preceding words in the paragraph.
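The rule above can be sketched in a few lines. Everything before the first "defined" word, plus everything between "defined" words, amounts to every word before the last "defined" word, minus the "defined" words themselves. The function name and the word-index representation are assumptions for this example, and the confirmation question to the user is left out.

```python
def words_assumed_seen(paragraph_words, defined_indexes):
    """Return the words assumed to have been read in one paragraph.

    paragraph_words -- the paragraph as a list of words
    defined_indexes -- positions of the words marked "as defined"
    """
    if not defined_indexes:
        return []  # no mark, so nothing can be assumed
    if min(defined_indexes) < 0 or max(defined_indexes) >= len(paragraph_words):
        raise IndexError("defined index outside the paragraph")
    last = max(defined_indexes)
    defined = set(defined_indexes)
    # Every word before the last "defined" word counts as seen,
    # except the "defined" words, which are recorded separately.
    return [paragraph_words[i] for i in range(last) if i not in defined]


words = "the patient received an anticoagulant after acute thrombosis yesterday".split()
print(words_assumed_seen(words, [4, 7]))
```

Words after the last mark ("yesterday" here) are not assumed seen, since nothing shows the user read that far.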

I think the idea described so far would work as expected if human languages had no homonyms (that is, if every spelling had only one meaning). To eliminate them, one would have to use a new kind of vocabulary with exactly one meaning per word, in all texts and by all people. That is not very doable. However, it's good to know that many rare words do have only one meaning, so I think the Vocabulary Highlighter described so far would still be useful to a great extent.

Background:

I thought of this idea because I have recently been reading some medical texts on the web. I'm not a doctor, so I had some problems. I had to look up many of the new words and drug names in order to understand the texts. OneLook.com and its dictionaries were very helpful indeed. However, after reading the text I realized I had skipped several words I had heard before but did not completely understand (as I was hurrying to understand the entire text, not its words). I also realized I had not completely understood the mechanisms I was reading about, so I had to reread the text and look the words up, which was time-consuming. So I thought, couldn't the computer predict for me the words I don't know?
-- Inyuki, Apr 11 2003

Highlight Word Bookmarklet http://www.christopher.org/highlightword/
[phoenix, Oct 04 2004, last modified Oct 05 2004]

Dictophane http://www.halfbakery.com/idea/Dictophane
Oops! Already HalfBaked? [phoenix, Oct 04 2004, last modified Oct 05 2004]

Global Web Statistics http://www.cybermul...com/statistics.html
Just the source of the statistics I quoted. [Inyuki, Oct 04 2004, last modified Oct 05 2004]

Nice idea and interesting but, at the core, quite similar to "Dictophane", the idea [phoenix] links to, below. True, Dictophane doesn't have an elaborate "do you know the word or not?" function but it seems like it'd be easier to use.
-- bristolz, Apr 11 2003


At the beginning, reading would be a little time-consuming, as all the words would be highlighted; however, you would quickly mark more and more words "as defined". It is easy to say "I know this word's meaning" or "I have just defined this new word" simply by selecting and right-clicking the word and choosing the proper answer. Also, consider a lifetime account.
-- Inyuki, Apr 11 2003


Maybe I missed this in your explanation, but another source of "known" words to ignore would be the user's own typing. Capture what the user sends in emails, what they type into documents, what they key into web forms, and so on. Any words used there can be assumed to be known and eliminated from the highlighting.

I like the idea though, it turns the whole web into a gigantic "word of the day calendar". It would be a great tool for those learning a language.
-- krelnik, Apr 11 2003


99% of the text I read is on the Net. Even if it were not so, the number of words is limited. And once the machine finds the pattern by which you remember a word, it won't rehighlight it in that color for an extended period of time, or maybe ever. The number of known words is ever-increasing. If you learned 10 new words a day, after 5 years you'd know about 18,000 more words, and you can define hundreds of words each day! Assuming that a word has been read and understood is not so important. More important is the process of marking the words you know and the words you get defined.

Yes, and the words you type could be automatically assumed to be well-known, really :-) [krelnik], a good alternative source.
-- Inyuki, Apr 11 2003


However, the service should be strictly anonymous and secure and use an SSL connection, as I think not everyone would like someone else to know so much about which words you saw, wrote, or defined for yourself, and when.

Overall statistics should be publicly available, I think.
-- Inyuki, Apr 12 2003


I've given this one a few days to toss around, and I keep coming back to the same problem. How often (outside of unusual circumstances such as you describe in your last paragraph) would anyone use this? How often does one come across a word one does not recognize? Unless English is a second language, I don't see the usefulness of this.

As mentioned on the Dictophane idea, Opera does right-click definitions. Surely that's all you need for as often as this comes up.
-- waugsqueke, Apr 12 2003


Whenever I happen across a word which I don't understand (which is increasingly rare), I promptly look it up. I see this as feasible, as much of my reading is done online. It seems quite doable that one's computer could have an interactive database which would help in this type of situation. (+)
-- X2Entendre, Apr 13 2003


As waugs indicated, the Opera Browser does this. It's *the* light, fast, smart, configurable and highly recommended Browser
-- thumbwax, Apr 13 2003


// How does it know what words you don't know? //

Excluding the words assumed to be known because you type them, in the extreme cases (// Pigs don't know that pigs stink //) knowledge and understanding of words can be easily tested by creating special high-credibility tests for words. By tests I mean simply sets of true or false statements about words. However I don't think that it is really necessary.

// How often would anyone use this? //

It depends on the convenience and attractiveness of the service, and on the implementation in general. Also, on who the users are. Maybe there is more up-to-date information, but:
• By 2003, 65% of Web users will be international, and the United States will account for less than half of worldwide Internet commerce. -- International Data Corporation
• 92% of the world's population are not native-English users. -- World Almanac
• Non-native English speakers make up the fastest growing group of Internet users. -- New York Times

// How often does one come across a word one does not recognize? //

That also depends on the actual text and the user.

It would be useful not only for learning a new language, but also for learning a new subject, as every subject has its own terminology. So the use of the service also depends on how often you explore a new subject in depth, and on how many subjects are new to you. Very useful for students and pupils.
-- Inyuki, Apr 13 2003


I expect this feature to be built in to Microsoft IE 8. You won't be able to turn it off.

Don't forget those banner ads that have a nonexistent word in them -- highlighted -- in order to trick you into clicking them.
-- phundug, Apr 14 2003


// You won't be able to turn it off. //

I don't think so. You should be able to browse without any login to the service.

// ..banner ads.. //

Well, they should work. I bet they'd work as well as pop-up ads. So that users could avoid them, there would need to be a set of mouse pointers that ads are prohibited from using.
-- Inyuki, Apr 14 2003


