text<->speech as audio compression

Use speech recognition and speech synthesis as a voice codec
  (+12, -3)

Current voice compression codecs are good, but the ultimate low-bandwidth voice codec would perform speech recognition on what each user says and send the resulting text string across the Internet. When the other user's computer receives the text string, it uses a speech synthesizer to convert the text back into speech.

A naive implementation of this would probably suck -- nobody likes to listen to a robotic-sounding synthesized voice for very long. But a slightly better version would detect and send intonation metadata along with the text (e.g. "at this point in the sentence, lower the pitch by 50Hz and slow the speech rate by 10%"); this would make the recreated speech sound more natural and lose less connotation.
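
As a rough illustration of what "text plus intonation metadata" might look like on the wire, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `ProsodyCue` and `Utterance` names, the field names, and the choice of JSON as the encoding are not part of the original idea, and a real codec would likely use a more compact format.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ProsodyCue:
    # Hypothetical cue: all field names are illustrative assumptions.
    char_offset: int       # position in the text where the cue takes effect
    pitch_shift_hz: float  # e.g. -50.0 means "lower the pitch by 50 Hz"
    rate_scale: float      # e.g. 0.9 means "slow the speech rate by 10%"


@dataclass
class Utterance:
    text: str
    cues: list = field(default_factory=list)


def encode(utt: Utterance) -> bytes:
    """Serialize the text and its cues into a compact JSON payload."""
    return json.dumps(asdict(utt), separators=(",", ":")).encode("utf-8")


def decode(payload: bytes) -> Utterance:
    """Rebuild the utterance on the receiving side."""
    raw = json.loads(payload.decode("utf-8"))
    return Utterance(text=raw["text"],
                     cues=[ProsodyCue(**c) for c in raw["cues"]])


# Usage: one cue that lowers pitch by 50 Hz and slows speech by 10%.
utt = Utterance("I suppose so.", [ProsodyCue(2, -50.0, 0.9)])
payload = encode(utt)
print(len(payload), "bytes on the wire")
```

The receiver would hand the decoded text and cues to whatever synthesizer it has; that step is left out here because it depends entirely on the synthesizer used.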

In addition to the above, an advanced version could pre-download voiceprint 'skins' for each person you talk to, so that your computer could more accurately simulate Uncle Fred's voice and mannerisms using canned data.

Jeremi, Mar 26 2002


       And this is better than the phone because _________ ?
mcscotland, Mar 26 2002

       ...because you can listen to it over again?
sappho, Mar 26 2002

       The big advantage is high compression. When compressing data further and further you generally hit a limit where you start losing significant content. Most people will have seen over-compressed JPEGs where the artifacts (distortions caused by compression) so distort the image that you can't tell what it is any more.

       The only way to compress further is by using a 'common data dictionary' where the sender and receiver are both in possession of a large identical data dictionary which contains fragments commonly used in sound files, images or whatever you are compressing.   

       Jeremi's idea is a development on this where you share speech data in a common compact format (text) then enhance this using a 'data dictionary' of what the voice sounds like. This gives you massive compression with high quality. One application would be clearer, higher quality mobile phones that use less bandwidth and are therefore cheaper to run.   
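
       The 'common data dictionary' effect can be seen on a small scale with Python's standard `zlib` module, which accepts a preset dictionary (`zdict`) that sender and receiver must both hold. This is only a sketch of the general principle applied to text: the `SHARED_DICT` contents and the helper names are assumptions for illustration, not a voiceprint format.

```python
import zlib

# Hypothetical dictionary of common phrases; both ends must hold an identical copy.
SHARED_DICT = b"hello how are you doing today see you later thanks "


def compress(message: bytes, zdict: bytes = b"") -> bytes:
    """Deflate the message, optionally priming the compressor with a shared dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(message) + comp.flush()


def decompress(payload: bytes, zdict: bytes = b"") -> bytes:
    """Inflate the payload; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(payload) + decomp.flush()


message = b"hello how are you doing today"
plain = compress(message)
primed = compress(message, SHARED_DICT)
print(len(message), len(plain), len(primed))
```

       Because the whole message already appears in the dictionary, the primed payload should come out noticeably smaller than the unprimed one, at the cost of both sides storing the dictionary in advance.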

       My croissant already given, I would suggest one change, though. Rather than trying to render the speech into text, render it only as far as phonetics. That would allow the data to be communicated without the codec needing to understand the sense of the sentence. Render it back using a Phonetic Markup Language (almost posted that a while back).
st3f, Mar 26 2002
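
A phoneme stream like the one st3f suggests can be packed quite tightly. The sketch below assumes the 39-symbol ARPAbet set used by the CMU Pronouncing Dictionary (stress markers dropped), which fits in 6 bits per phoneme. The packing layout, with a one-byte count followed by the packed bits, is an invented illustration and not an existing format, and it carries no pitch or timing information.

```python
# 39 ARPAbet phonemes (CMU Pronouncing Dictionary set, without stress digits).
PHONEMES = ("AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
            "OW OY P R S SH T TH UH UW V W Y Z ZH").split()
INDEX = {p: i for i, p in enumerate(PHONEMES)}
BITS = 6  # 2**6 = 64 codes, enough for 39 phonemes


def pack(phonemes):
    """Pack up to 255 phonemes: one count byte, then 6 bits per phoneme."""
    if len(phonemes) > 255:
        raise ValueError("this toy format holds at most 255 phonemes")
    value = 0
    for p in phonemes:
        value = (value << BITS) | INDEX[p]
    n_bytes = (BITS * len(phonemes) + 7) // 8
    return bytes([len(phonemes)]) + value.to_bytes(n_bytes, "big")


def unpack(payload):
    """Reverse pack(): read the count, then peel off 6-bit codes."""
    count = payload[0]
    value = int.from_bytes(payload[1:], "big")
    return [PHONEMES[(value >> shift) & 0x3F]
            for shift in range(BITS * (count - 1), -1, -BITS)]


# Usage: "hello", hand-transcribed as HH AH L OW.
packed = pack(["HH", "AH", "L", "OW"])
print(len(packed), "bytes:", unpack(packed))
```

Four phonemes take 24 bits, so the payload here is three bytes plus the count byte.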

       I'm picturing a sort of verbal MIDI.
waugsqueke, Mar 26 2002

       I thought of this myself one day. Very interesting idea, I must say. Probably quite doable. Actually, I think it's partially baked already. There is software that does voice recognition, types into chat programs, and reads the messages aloud to you. Like that T2 head.
ironfroggy, Mar 27 2002

       Set it up so you could use it as one of those Internet phones; the compression would still be fast enough on a 56k modem. Like me! :)
kurtynlsn, Feb 21 2004

       A 56K modem is fast enough for traditional VoIP phones.
Acme, Feb 21 2004

       Very intriguing idea. I like it. Maybe it could be used for podcasts if it is too slow to be done in real time.
vmaldia, Aug 01 2006

       Would the metadata not be best sent as inflection marks in the text itself, much like the timing information supplied in sheet music? I would prefer symbolising all that is required in the message rather than breaking the message down into phonetics; no one reads phonetically.

       Would give a [+] for setting out the symbols required to encode the metadata in the input speech!
madness, Aug 02 2006
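
One way to picture inline inflection marks is as small bracketed tags embedded in the text and stripped out by the receiver before synthesis. The notation below (`[p-50]` for a pitch change in Hz, `[r90]` for a speech rate in percent) is purely hypothetical, invented for this sketch; nobody in the thread proposed these particular symbols.

```python
import re

# Hypothetical marks: [p<int>] = pitch shift in Hz, [r<int>] = speech rate in percent.
MARK = re.compile(r"\[(p|r)([+-]?\d+)\]")


def parse(marked: str):
    """Split marked-up text into plain text and (offset, kind, value) cues."""
    plain_parts = []
    cues = []
    pos = 0      # read position in the marked-up string
    out_len = 0  # length of the plain text emitted so far
    for m in MARK.finditer(marked):
        chunk = marked[pos:m.start()]
        plain_parts.append(chunk)
        out_len += len(chunk)
        kind = "pitch_hz" if m.group(1) == "p" else "rate_percent"
        cues.append((out_len, kind, int(m.group(2))))
        pos = m.end()
    plain_parts.append(marked[pos:])
    return "".join(plain_parts), cues


text, cues = parse("Well, [p-50][r90]I suppose so.")
print(text)
print(cues)
```

The offsets refer to positions in the plain text, so the marked-up string stays readable by a person while a synthesizer can still apply each cue at the right point.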


