Halfbakery: TXT Player

Product: Book: E-Book
TXT Player (+17, -1)
Portable text-to-speech

Audiobooks generally take about a dozen CDs' worth of storage. Even if you MP3 them they take about a CD and a half. This is quite bulky compared to the few hundred KB of data required when stored in text form. If there were an MP3 player with a built-in text-to-speech chip, you could store an entire library's worth of audiobooks on the same amount of media as one in the traditional format.

Disadvantages:
Mechanical-sounding voice (certainly not a huge problem)
Copy protection issues of having raw text available of major books. They've solved this for e-books, so I'm sure this isn't a deal breaker either.
-- Worldgineer, May 18 2005

Audrey http://hegel.resear...3d6adafe5a840fa.wav
ATT Lab's female UK accent [waugsqueke, May 21 2005]

The Weather Channel "Local on the 8's" http://www.twcclass...b05/Jay%20Hogan.wmv
For central Connecticut, Sunday evening, Feb 20, 2005. (8 MB .wma file). I remember this storm. [waugsqueke, May 21 2005]

Kurzweil Educational Systems http://www.kurzweiledu.com/products.asp
Products and Services. Many accessory products to the basic $1000 model 1000. [reensure, May 22 2005]

Victor Reader Stream http://www.humanwar...rreader_stream.html
[JesusHChrist, Apr 01 2009]

Different coding schemes to listen to http://www.data-com...ion.com/speech.html
Scroll way down [loonquawl, Apr 01 2009]

Nice. It reminds me of eighteen rpm records. If the text was stored as characters but using an ASCII version of the phonetic alphabet, it could be even simpler. In fact, it could probably have been done twenty years ago or more with floppies and a BBC Micro style voice synthesiser. You might even be able to make one at home.
-- nineteenthly, May 18 2005

An audiobook encoded with a 2400bps CELP vocoder could be 740 hours long and still fit on one CD. The audio quality would be a little robotic, but probably still better than from a text-to-speech system.
-- supercat, May 19 2005
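
A quick way to check the 2400bps figure above is to work out how many hours fit on a disc at a constant bitrate. This is only a rough sketch: the function name `hours_on_disc` is made up for illustration, the capacities are assumed to be decimal megabytes, and container or filesystem overhead is ignored.

```python
def hours_on_disc(bitrate_bps, capacity_bytes):
    """Hours of audio that fit at a constant bitrate, ignoring any overhead."""
    bytes_per_second = bitrate_bps / 8
    return capacity_bytes / bytes_per_second / 3600

# Assumed disc capacities in decimal megabytes (an assumption, not from the thread).
for label, capacity in [("650 MB", 650e6), ("700 MB", 700e6), ("800 MB", 800e6)]:
    print(f"{label}: {hours_on_disc(2400, capacity):.0f} hours")
```

Under these assumptions a 700 MB disc holds roughly 648 hours at 2400bps, and the 740-hour figure lines up with a capacity of about 800 MB, so the exact number depends on which disc size is assumed.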

Sounds like a good idea to me. [+]
-- contracts, May 19 2005

Nah, not worth it. Storage is plentiful and cheap, audio compression is, as supercat points out, good enough to fit many books on a single data CD, and a good reader makes a *huge* difference.
-- jutta, May 19 2005

[jutta], i understand that speech compression is one of your areas of expertise. Exactly how good is it compared to text?
-- nineteenthly, May 19 2005

Hm! It would be nice to have a storage format that has audio accompanied by searchable text on a frame-by-frame level.
-- jutta, May 19 2005

[super], [jutta] Good points, though I'm not sure that's a better practical solution. I've run audiobooks through maximum-bearable compression in an MP3 format, and they take more than a CD. It also takes hours of effort (ok, a few minutes of effort every 20 minutes or so) to convert. I guess you could market pre-compressed audio files, and a special player that understands a better encoding than MP3.

However, having this read text-based data allows you to listen to much more than the books available in this format. Find an interesting article online? Copy it to your TXT player. Haven't had time to read your e-mail? TXT player. More or less the whole Internet that you just don't have time for will be accessible.
-- Worldgineer, May 19 2005

If text was used, it would be possible to scan data in from text on paper, which could then be read, like a highlighter pen being run along lines. This would make the algorithm for reading it out more complex because it wouldn't be stored phonetically and there would clearly have to be OCR. It would also be possible to convert it to audio internally but i'm inclined to believe it would be bigger than text.
-- nineteenthly, May 19 2005

This might sound funny; let's expand this idea further. In situations where a specific person's voice is not mandatory, we could use a really fast writer (like IBM ViaVoice) to convert the voice into text, send packets of this text, and use a fast reader (like Windows Narrator) on the other side. If this can be done seamlessly, we can save a lot of data transfer.
-- concept, May 21 2005

The more natural you want the text player to sound, the more dedicated chip and linguistic hardware the player needs, and therefore the faster its processor and the greater its energy consumption: in other words, PC characteristics. I don't know that portability would be a main asset of that. I hope technology miniaturization will make it possible in the future. Nice idea though [+]
-- sweet, May 21 2005

[concept] and add tonality info on the way, for each word. imagine the moment you talk to your employee and try to tell him he's fired without hurting his feelings. for an experiment try pasting this to the word reader "... therefore, your future activity is no longer needed in our company"
-- sweet, May 21 2005

There are already inexpensive small digital voice recorders. All this needs to do, in its simplest form, is to associate a character with a sample of a related phoneme, assuming the text is stored phonetically rather than using conventional spelling in the appropriate language. In some languages, such as Finnish, Czech or Swahili, the text can be more or less as spelt and it would still work, but in English, French, Portuguese and other languages there would need to be some alteration of the text. There is then the rather trivial problem of storing one phoneme per character, and an overestimate of the number of bytes used to store a single phoneme at an unnecessarily high sampling rate in an unsuitable format, i.e. MP three, is around eight hundred bytes. With the fewer than fifty phonemes in the dialect of English i speak, excluding allophones, that adds up to less than forty kilobytes. This is not difficult to achieve and doesn't need much processing power by today's standards, and that assumes that an algorithm optimised for speech is not used. It would not be at all demanding to do this.
-- nineteenthly, May 21 2005
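
The one-sample-per-phoneme scheme described above can be sketched in a few lines. This is a toy illustration under loud assumptions: the phoneme count of 49 is just one value consistent with "fewer than fifty", the samples are placeholder bytes rather than real recordings, and all function names are hypothetical. It only shows the table size and the concatenation step, not real speech synthesis.

```python
BYTES_PER_PHONEME = 800   # the deliberate overestimate quoted in the post
PHONEME_COUNT = 49        # assumption: any count "fewer than fifty" would do

def table_size_bytes(phoneme_count=PHONEME_COUNT,
                     bytes_per_phoneme=BYTES_PER_PHONEME):
    """Total storage for one sample per phoneme."""
    return phoneme_count * bytes_per_phoneme

def build_dummy_table(symbols, bytes_per_phoneme=BYTES_PER_PHONEME):
    """Stand-in for real recordings: each symbol maps to placeholder bytes."""
    return {s: bytes([i % 256]) * bytes_per_phoneme
            for i, s in enumerate(symbols)}

def speak(phonetic_text, table):
    """Concatenate the stored sample for each character; skip unknown ones."""
    return b"".join(table[ch] for ch in phonetic_text if ch in table)

table = build_dummy_table("katsi")    # five made-up phoneme symbols
audio = speak("kat", table)           # three phonemes, 800 bytes each
print(table_size_bytes(), len(audio))
```

With these numbers the full table comes to 39,200 bytes, which is consistent with the "less than forty kilobytes" estimate. Plain concatenation like this says nothing about prosody or about converting English spelling to phonemes.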

OK, let's do the math:

Storage on a CD-R: 700 MB.
One second of telephony-compressed audio, technology publicly available as of ca '94: 1664 bytes.
Seconds per hour: 3600
Storage for one hour of audiobook: 6 MB.
Hours of speech that fit on a single CD-R: 116+.
Hours in the longest unabridged audiobook I've listened to so far: 24.
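
The figures above can be reproduced with a short calculation. This is a sketch that assumes decimal megabytes for the CD-R and ignores filesystem overhead; the variable names are illustrative only.

```python
CD_BYTES = 700 * 10**6       # assumes 700 MB means decimal megabytes
BYTES_PER_SECOND = 1664      # telephony-compressed audio, as quoted above
SECONDS_PER_HOUR = 3600
LONGEST_BOOK_HOURS = 24      # longest audiobook mentioned above

bytes_per_hour = BYTES_PER_SECOND * SECONDS_PER_HOUR
hours_per_cd = CD_BYTES / bytes_per_hour
books_per_cd = int(hours_per_cd // LONGEST_BOOK_HOURS)

print(f"One hour: {bytes_per_hour / 10**6:.2f} MB")
print(f"Hours per CD-R: {hours_per_cd:.1f}")
print(f"24-hour audiobooks per CD-R: {books_per_cd}")
```

One hour comes to just under 6 MB, a single disc holds a little under 117 hours, and four 24-hour audiobooks fit with room to spare.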

I agree that some of the software solutions to this may be inconvenient - you have to understand a bit about codecs and be able to use or perhaps even write converter software to make this happen - but I think that's very much worth it to, where a speaker is available, save you from the monotony of text-to-speech. (Unless, perhaps, the speaker is Brian Emerson, in which case I'd have preferred the text-to-speech.) There are such things as melody, flow, arc, irony, beats, timing, rhythm, and, three hours into the intricacies of US foreign policy in the Middle East, you'll miss them.

If you enjoy text to speech for the novelty value, good for you; if you don't have audio at all, good for you, too; but the problem of not being able to store all that stuff has been solved for a while. (Unsurprisingly - extreme lossy speech compression and speech generation meet somewhere where speech is compressed to elements of a "voice font", not all that different from elements of the phoneme set of a language, roughly the order of magnitude of an alphabet. Nineteenthly, above, is heading there, too.)

All that said, I still think this is an extremely cool idea for the existing mixed-media players. E.g., you can store notes on your iPod, and Apple has long had text-to-speech; why not have a mode where my notes can be read to me? It would cost them next to nothing; all the technology is already there.
-- jutta, May 21 2005

Thanks for the info, [jutta], that's really helpful and interesting. I wish i knew more about speech compression.
Storage is not the only issue though. If it's memory on something like an EPROM or a card (sorry, i don't know the right terminology and i know it's probably out of date) rather than on an optical disc, it can be read directly into the device, it can be smaller and have fewer moving parts, so it would be more durable and reliable, and in those circumstances, less storage would be available for the same price. However, i agree that storage is not really an issue. What could make a difference to this, though, is that if it is stored in text form rather than as compressed audio of some kind, the necessary data can be transferred more quickly, such as on a dial-up internet connection, and also it becomes possible to convert data that were previously only text into audio, such as webpages, viewdata, text files and actual printed text on paper if some kind of scanner is available. I think it would be more difficult to convert text in English than in many other languages written in Latin script because of the way it's spelt, but English spelling is unusual in its poor correspondence between characters and phonemes, and there would have to be much more complex conversion software which would slow things down, take up more storage space (though this may not be an issue) and be potentially buggy. However, if a language with saner spelling is to be converted, the problem is much simpler, although to read actual text printed on paper, OCR would be needed, which for all i know (and i am most ignorant on this matter) is just as difficult to do as converting English text to speech. Even so, there would be advantages to a device which could read in ASCII stuff even if it couldn't do OCR.
-- nineteenthly, May 21 2005

I like the concept, but...

// Mechanical-sounding voice (certainly not a huge problem) //

... here, I disagree. I think you're vastly underestimating the impact of this problem. The voice is the entire interface. If it sucks, the whole experience sucks. And to date, I have not listened to any text-speech synth that I could stand to listen to for more than a couple of minutes*. Imagine listening to Audrey (link) narrate an entire novel.

So a lot of work would be needed in this area to make this idea viable, I think.

(*With the possible exception of the US Weather Channel WeatherStar forecast voice-guy (link again). It's pretty damn good, but it's not true synth. It pulls from a pre-recorded file of phrases.)
-- waugsqueke, May 21 2005

The problem of the mechanical-sounding voice could be solved. The phonetic content of the player could be determined by asking the user to recite a series of sounds or a phrase which could then be chopped up and turned into the voice of the player itself. This would help people who are fond of the sound of their own voices.
-- nineteenthly, May 21 2005

The problem with text-to-speech is not with the sound of the voice per se, so much as it is with the prosody (pacing, word stresses, and intonation). "Assisted" text-to-speech systems, where a human edits the characteristics of the generated speech, can produce results which would be fully acceptable. The amount of effort required to generate such speech, however, exceeds that required to simply say what needs to be said. There are sometimes cases where such a thing can be worthwhile (e.g. many staffers working on a project can generate dictation files that match).

One idea I do find intriguing, though, would be that of using a speech-to-text algorithm to 'line up' the recorded speech with the printed text, allowing someone to simultaneously listen and follow along with the written word.
-- supercat, May 22 2005

I definitely think the problem here is English spelling. For a lot of other languages there would be no problem at all, since there are regular rules for stress, and raising the voice for a question, pausing because of commas and so forth are already built into the punctuation of the text.
-- nineteenthly, May 22 2005

Listening to a bad synthesized voice is a lot like reading text - minimal. Since the computerized voice intonates everything the same, once you get used to the voice - learn it like a language -- you can let your imagination fill in the tone of voice, the same way your imagination fills in the pictures with text. To me this is much different than being spoon fed meaning by human actors.
-- JesusHChrist, May 22 2005

I tend to agree, but i also think the font in which something is printed influences one's impression of a text. If something is in Courier, for example, it would seem to be plain, official and no nonsense, but in Black Letter it would seem pretentious, falsely anachronistic or perhaps Biblical in tone. Some authors insist on the use of a particular font to print their novels.
-- nineteenthly, May 22 2005

JHC: In some ways, I prefer older text-to-speech systems to newer ones. The older ones would generally try to speak mostly in a monotone, except for the last syllable or two of a sentence or question. The newer systems try to add a more natural 'arc' to the speech. Unfortunately, the efforts at adding word stress and intonation patterns sometimes produce results that are just plain wrong. A monotonous voice is much less bad than a voice which uses improper word stresses.
-- supercat, May 22 2005

My head hurts...

I like this idea just for the fact that you can cut & paste text into it. Bonus croissant if you enable it to assign different accents to different portions of speech.

[yawns]
-- energy guy, May 22 2005

Once they solve the problems of synthesized speech, and the problems are big, why stop at a TXT file reader? I'd like a machine into which I could stick a book or magazine, so it could optically read the material and read it out loud to me.

In fact, for short articles, I might be able to tolerate the synthesized voice in its current state of the art.
-- noglider, May 22 2005

Although this product would not be for [waugs] and others with low tolerances for such things, I've personally listened to hours of generated speech without being annoyed. Back in college, when I was out of interesting MP3s to download, I did something like this. I'd find e-texts that I wanted to read, generate sound files with text-to-speech software, convert them to MP3, and upload them to my little MP3 player. The only problem I had with this was that I was limited to 30 minutes or so with my 16 MB card, and it took far too much effort to be useful.
-- Worldgineer, May 22 2005

Are you sure about the 1664 bytes per second, [jutta]? - In the link i posted, the 2400bps is very bad already.

High compression of audio makes the voice more unintelligible to me than any text-to-speech synth, which is quite intuitive.
-- loonquawl, Apr 01 2009

//Are you sure about the 1664 bytes per second, [jutta]? - In the link i posted, the 2400bps is very bad already.//

2400bps is 300 bytes/sec. Even if that level of compression is dreadful, that wouldn't mean 13.3kbits/sec (1664 bytes/sec) would be.
-- supercat, Apr 02 2009

[21] Kindles are huge, compared to an MP3 player.
-- Worldgineer, Apr 04 2009

Has anybody done an idea for a voice-avatar yet?
-- lurch, Dec 03 2010
