Computer: Speech: Recognition
low-end speech recognition   (+9, -4)
ju:s hju m@nz fo: wot THe.. gud @t [thanks, hippo]

Speech recognition is hard. It kind of works for picking out one of a limited range of options - a name you want to call, "weather" from "stocks", "no" from "yes" - but free form speech recognition with anonymous speakers, the thing that humans do without thinking about it, is still elusive.

While people are chipping away on that one, I'd love to see a low-end service that sends you your voice mail as text, and that is based on leaving all the hard problems to humans and just transcribing phonemes in some arbitrary language.

The service would not know anything about grammar, context, applications, word boundaries; it would just spew out this incomprehensible stream of syllables.

A human reader would then have to pronounce those syllables to him- or herself in order to reconstitute their meaning. (Eventually, one would get good at it and just "read" the stream.)

I have no idea whether this is possible, or whether phoneme recognition with anonymous speakers is just as hard as the other problems.
-- jutta, Mar 26 2002

Phonetic alphabet http://www.antimoon...onunc-soundsipa.htm
[hippo, Mar 26 2002]

ASCII Phonetic alphabet http://www.antimoon...w/pronunc-ascii.htm
[hippo, Mar 26 2002]

How Telephones Work http://www.howstuff...s.com/telephone.htm
An illuminating article on the simple telephone. [DrBob, Mar 28 2002]

Tones and Accents http://www2.arts.gla.ac.uk/IPA/tones.html
This is how Mr. Bubba [mcscotland, Mar 30 2002]

Hmmm. This is nice. Pure "pronunciation" text. Maybe it wouldn't really need to absolutely recognize phonemes but just get kinda' close; representing "what it heard" as text. Might still be very readable. (No recognition at all would be the useless equivalent of sending a waveform representation, I guess).

The variations in the representation could (possibly) enable people who were used to reading it to be able to distinguish who it was who left the message.
-- bristolz, Mar 26 2002


My cousin (a speech therapist) used to write me letters in phonetic transcription (see 'Phonetic Alphabet' link). I got very good at reading it after a bit. For a low-cost speech recognition system, the installation of a phonetic font on the phone and voicemail -> phonetic transcription conversion should be possible. For a super-low cost system, use the "ASCII Phonetic Alphabet" (see link) which munges the real phonetic alphabet to standard ASCII characters.

du ju si: wot ai mi:n?
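
A minimal sketch of that munging step, assuming the recogniser hands over each word as a list of IPA phoneme tokens. The table holds only the handful of symbols that can be inferred from the examples in this thread, not the full alphabet from the link, and the names `IPA_TO_ASCII` and `munge` are made up for illustration.

```python
# Partial IPA -> ASCII table, inferred only from the examples in this thread.
# It is NOT the complete alphabet from the linked page.
IPA_TO_ASCII = {
    "iː": "i:",
    "uː": "u:",
    "ɔː": "o:",
    "ə": "@",
    "ʊ": "u",
    "ɒ": "o",
    "aɪ": "ai",
    "ð": "TH",
}


def munge(words):
    """Render a list of words, each a list of IPA phoneme tokens, as ASCII."""
    return " ".join(
        # Symbols missing from the table (plain consonants) pass through as-is.
        "".join(IPA_TO_ASCII.get(phoneme, phoneme) for phoneme in word)
        for word in words
    )


message = [["d", "ʊ"], ["j", "ʊ"], ["s", "iː"], ["w", "ɒ", "t"], ["aɪ"], ["m", "iː", "n"]]
print(munge(message))  # du ju si: wot ai mi:n
```

The hard part, getting from sound to those phoneme tokens, is of course not shown.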

(The subtitle of this idea becomes "ju:s hju m@nz fo: wot THe.. gud @t")
-- hippo, Mar 26 2002


We'd need to pick one accent only, if it were to use the alphabet as it is currently used. It could churn things out in standard IPA, but that can take some reading (and certain phonetic sounds are very difficult for one person to understand in the context of their own accent). It might be quite amusing watching people trying to translate. In fact, this is probably doable, if there were a good way of establishing context - the software would have to be aware of a lingua franca accent for voicemail sender and receiver (perhaps an additional option on voicemail to register your accent before speaking?). Without something like this things could get very confusing when your long-lost Uncle Juan from Barcelona calls your voice mail, which is set up to understand Standard American English accents.
-- mcscotland, Mar 26 2002


Since we're talking about a 'dumb' computer interpreting what it hears there would be no way to "pick one accent only" and, thus, no way to arrive at a lingua franca.

"Yoos hoo menz fahr whud thayr goot ahd."
"ju:s hju m@nz fo: wot THe.. gud @t"
Yoos - sounds Brooklyn/New Jersey
fahr/goot - sounds like the person has a German accent
ahd - sounds like the speaker has a stuffed-up nose
-
ju:s - sounds of a heavy Hispanic accent
fo: - inner city

I anticipate there would be no easy way to standardize the display (without better voice recognition - which is what we're trying to sidestep). Everyone sounds different, so everyone would 'read' differently. But once you learned someone's 'voice', I bet you would be able to tell who left the message just by reading it.
-- phoenix, Mar 26 2002


Rods... i'vebeeinusingthattechniqueonmyprofilepage, todisplaymyemailaddress.

I like this idea, on some level. I like the concept of breaking spoken language down into phonetics, and then just spelling the phonetics.
-- waugsqueke, Mar 26 2002


//Since we're talking about a 'dumb' computer interpreting what it hears there would be no way to "pick one accent only" and, thus, no way to arrive at a lingua franca.

It's not the interpretation that is the problem (well it is, but that's not the problem I was attempting to highlight), it's the translation. Try reading Irvine Welsh's "Trainspotting" if you are an American, and see how long it takes if you want an illustration of the point. That's why there would have to be some sort of executive decision taken on the part of the voicemailer to set the accent. There are, in linguistics, attempts to categorize accents dependent on common vocal similarities, in particular vowel sounds. So there is a (sort of) source of selectable, understandable alternatives. The voicemail would have to tag its output as (say) Scottish Standard English, so if the recipient were an Americanised Korean they would at least have a point of reference to begin. There is perhaps another step, which would be to use the standards to convert a stream of one accent output into another.

The biggest problem here is that of course linguistic standards are only *interpreted* rules. So, like everything else in English, the rules hold out most of the time, but there are always exceptions and the exceptions are not rule bound.
-- mcscotland, Mar 27 2002


This would drive me mad... it reminds me of the only one of Iain M Banks' books I stopped reading part way through... it had something like a third of the book spelt phonetically...
-- RobertKidney, Mar 27 2002


If someone mentioned this already, please forgive me; I only skipped through the annos. Virgin trains have an enquiry line which must have speech recognition. It asks your departure and destination, times of travel and date; it must also be able to recognise accents. When you have spoken the details a real life person then answers and asks all the details again!!!
-- arora, Mar 27 2002


RobertKidney: The book you're thinking of is "Feersum Endjinn". It took me hours to get used to the phonetic spelling. But the book was good.
-- herilane, Mar 28 2002


Yeah, Feersum Endjinn was hard going but a decent book.

Telephones operate by modulating electric current (I know, I just read about them. See link). So why not use this instead of phonetics? Such a system should be relatively simple. You don't even have to have a receiver at your end (as you aren't going to be there to hear anything), just measure the strength of the current at regular intervals (say every hundredth of a second, just for the sake of argument) and turn the number pattern into text. Then, all you have to do is learn to recognise the number patterns (perhaps they could be printed like a bar chart rather than just a series of numbers in text format) - the same way that some people can identify records (some younger 'bakers may not remember them) by the pattern in the groove.
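
Something like the following, as a rough sketch only. There is no phone line here, so a sine wave stands in for the current, and the function names, the 20 Hz test tone and the bar width are all invented for the sake of the example.

```python
import math


def sample_signal(duration_s=0.1, interval_s=0.01, freq_hz=20.0):
    """Sample a stand-in 'line current' (a sine wave) at regular intervals."""
    n = round(duration_s / interval_s)
    return [math.sin(2 * math.pi * freq_hz * i * interval_s) for i in range(n)]


def bar_chart(samples, width=20):
    """Turn samples in the range [-1, 1] into one text bar per sample."""
    # -1 maps to an empty bar, +1 to a bar of `width` characters.
    return ["#" * round((s + 1) / 2 * width) for s in samples]


# One reading every hundredth of a second, printed as a sideways bar chart.
for bar in bar_chart(sample_signal()):
    print(bar)
```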
-- DrBob, Mar 28 2002


American Airlines has some amazingly good voice recognition, too. It works about the same way arora mentions...
-- StarChaser, Mar 29 2002


But written texts might be much shorter-lived if they represented phonemes; consider Chinese ideograms, which have now abstracted away most or all of the sound. It's harder to learn literary Chinese than some other languages - maybe - but the texts are durable.

If written English (e.g.) represented only the sound, dialect or slang groups would become mutually incomprehensible faster. Fine for unstored IM'ing, worrisome for novels.
-- hello_c, Mar 29 2002


DrBob: Birdsongs are represented in some books by sonograms, which seem a bit like what you're suggesting.

I once received an email from a gentleman who had suffered a severe stroke and relied on voice recognition software to handle his correspondence (being unable to type). It was a bit spooky--I could 'hear' the difficulty he had pronouncing some words, and his tendency to stammer a bit, all transcribed phonetically. The topic was technical--I had emailed him a physics question, he being a well-respected research physicist--and his response was clearly well-reasoned and succinct, which made the imprecision of his speech all the more poignant.

Um, what was the topic again? Oh, yes, low-end speech-recognition. Well, there are parts of his email I was not able to decipher. I 'spect there would be similar translational problems with any verbal idiosyncrasies, be they personal, regional, or national. BUT! a truly phonetic language--Esperanto?--might minimize such difficulties. I dunno.
-- Dog Ed, Mar 29 2002


One potential offshoot. If people became adept at reading phonemes, it could gradually change English, so that words are pronounced as they are spelt.
-- QuadAlpha, Mar 29 2002


That's a given - almost everything we say has mutated from its root already, and continues to be bstrdzd. One must admit, if some noob came in here using what appears to be chat-speak or AOLingua - they'd be criticized as being semi-literate charlatans.
-- thumbwax, Mar 29 2002


I'd do without that. It's low-end. One can imagine software that puts :-) into the text when it detects a jocular inflection, but by the time you're that good, you can probably do away with the whole phonetic stuff to begin with, and it's no longer my problem...

Re Dr.Bob's idea to just render the sound wave directly:

The parts of speech that people are interested in happen at frequencies around 250-2500 Hz. To encode waves of those frequencies, you need to measure the amplitude of your signal about 5000 times per second. (Search for "nyquist frequency" for the math on that.) If you assign each sample a single letter, you're now reading about three terminal screens full of text per second.
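
The arithmetic, as a quick sketch. The 80x24 terminal size is my assumption for what a "screen" holds; the 2500 Hz ceiling is the figure quoted above, and the function names are made up.

```python
MAX_SPEECH_HZ = 2500                 # upper end of the range quoted above
SCREEN_COLS, SCREEN_ROWS = 80, 24    # assumed terminal size


def samples_per_second(max_freq_hz):
    """Nyquist: sample at least twice as fast as the highest frequency."""
    return 2 * max_freq_hz


def screens_per_second(max_freq_hz, cols=SCREEN_COLS, rows=SCREEN_ROWS):
    """Screens of text per second at one character per sample."""
    return samples_per_second(max_freq_hz) / (cols * rows)


print(samples_per_second(MAX_SPEECH_HZ))            # 5000
print(round(screens_per_second(MAX_SPEECH_HZ), 1))  # 2.6
```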

The closest to "readable" rendering of sound that I've seen are frequency histograms; it's easy to distinguish vowels and (noisy) consonants, but telling A from E takes some training, and telling individual consonants apart is next to impossible for me.
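
For anyone who wants to play with that, here is a minimal sketch of such a histogram using a naive discrete Fourier transform on a single made-up 500 Hz tone. The sample rate, window length and function names are assumptions for illustration; real speech would show several peaks at once rather than one.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive O(n^2) DFT; returns magnitudes for bins 0 .. n//2."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def text_histogram(magnitudes, width=30):
    """One row of '#' per frequency bin, scaled to the loudest bin."""
    peak = max(magnitudes) or 1.0
    return ["#" * round(m / peak * width) for m in magnitudes]


RATE = 5000   # samples per second
N = 100       # window length, so each bin is RATE / N = 50 Hz wide
tone = [math.sin(2 * math.pi * 500 * i / RATE) for i in range(N)]

mags = magnitude_spectrum(tone)
loudest_bin = max(range(len(mags)), key=mags.__getitem__)
print(loudest_bin * RATE / N)  # 500.0
```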
-- jutta, Mar 30 2002


I like this idea and hippo's phonetic suggestion. I will be borrowing it as a front end for something I am going to post. All hail to the both of you.
-- j paul, Jun 25 2011




