h a l f b a k e r y
Getting blown into traffic is never fun.
I just tried to find a short segment in an
hour-long podcast. (If you're curious, it
was a podcast of a show called The
Space Programme, and there was a short
segment on the N-Prize in it.) Even
skipping forward a few seconds at a
time, it took quite a while to locate what I was
looking for.
Why not have a piece of software which
can perform speech-to-text analysis of a
podcast (or any file containing spoken
word), and can then allow the user to
search for a given word or phrase? Yes, I
know that speech recognition software is
not great, but it's not bad (especially if it
can run slowly to "subtitle" the podcast in
realtime or slower). I would then have
searched for "N-prize" or (if that failed,
being an unusual and hard-to-recognize
phrase) "prize", or a few other relevant
words.
If the software were good, it would allow
for fuzziness. For example, in searching
for "prize", it would also search for
similar-sounding words such as "price"
(easily confused by speech
recognition software). It might then show
me the points in the podcast where the
word or phrase was found, and allow me
to click to hear that part of the
podcast.
I suspect there are plenty of speech-to-text
programmes out there, but this would
combine the speech recognition with a
search and playback function.
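A rough sketch of the search half of the idea, in Python. It assumes a recogniser has already produced a list of (start time, word) pairs; the sample data, the `fuzzy_find` name and the 0.75 threshold are all made up for illustration. The fuzziness here is plain string similarity from the standard library, which is only one of several ways it could be done.

```python
from difflib import SequenceMatcher

# Hypothetical recogniser output: (start time in seconds, recognised word).
TRANSCRIPT = [
    (12.0, "would"), (12.4, "hefty"), (12.9, "pay"), (13.2, "the"),
    (13.5, "price"), (14.1, "for"), (14.4, "not"),
    (1835.0, "offered"), (1835.6, "a"), (1835.8, "prize"), (1836.3, "of"),
]

def fuzzy_find(transcript, query, threshold=0.75):
    """Return (time, word, score) for every word that resembles the query."""
    query = query.lower()
    hits = []
    for start, word in transcript:
        # ratio() is 1.0 for identical strings and falls towards 0.0
        # as the two strings share fewer characters in order.
        score = SequenceMatcher(None, query, word.lower()).ratio()
        if score >= threshold:
            hits.append((start, word, round(score, 2)))
    return hits

for start, word, score in fuzzy_find(TRANSCRIPT, "prize"):
    print(f"{start:8.1f}s  {word}  (similarity {score})")
```

With this sample data, a search for "prize" turns up both the mis-recognised "price" at 13.5 s and the exact match at 1835.8 s. A player would then only need to seek to the reported time.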
Looks dead now, had some buzz in 2006. Does anyone know whether it actually worked? [jutta, May 26 2008]
Service formerly known as podzinger
I vaguely remember actually using this to find things in podcasts. Now it's a corporation targeting enterprise markets. [jutta, May 26 2008]
Halfbakery: Speech To Text Processing
I knew we'd been over this before. [jutta, May 26 2008]
||baked by intelligence-gathering agencies the world over
||OK - so I just need to download the
relevant transcripts from the CFBA?
||Jutta - from the looks of it, that
podscope site would be excellent - I
wonder why it ne'er took off?
||What I had in mind was more a tool for
use on your own machine (it would
'index' spoken-word mp3 or wav files,
either on the fly or to create an archive
on your machine). However, a web-
wide search tool would have many
additional applications, especially if it
could drop you into the selected
podcast just before the searched phrase.
||Word spotting as a technology is still very fragile. This is much harder than text to speech, and the results are, well, spotty. (Sometimes doing something very badly is worse than not doing it at all.)
||There are some things in this space I'd love to try - like first translating the user query into phoneme salad, then spotting the phoneme salad; then trying to parse the surrounding sentence with some sort of meaningful grammar. We're on the verge of being able to do this usably well, but it's still not easy.
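To give a feel for the "match on sounds, not spellings" part, here is a toy Python sketch. It does not do real phoneme recognition: it swaps in Soundex, an old letter-based phonetic code, as a very crude stand-in for the phoneme salad, and it works on already-recognised words and not on audio. The `spot` helper is a hypothetical name.

```python
# Soundex digit for each consonant group; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex key, e.g. 'Robert' -> 'R163'."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            key += digit
        if ch not in "hw":  # h and w do not separate two equal codes
            prev = digit
    return (key + "000")[:4]

def spot(words, query):
    """Indexes of the words whose Soundex key equals the query's key."""
    target = soundex(query)
    return [i for i, w in enumerate(words) if soundex(w) == target]

heard = "came as a big sir prize to him".split()
print(spot(heard, "prize"))   # index of "prize" within the mis-heard "sir prize"
print(soundex("prize"), soundex("price"))
```

"prize" and "price" get the same key (P620), so a query for one finds the other. A real word spotter would compare phoneme sequences taken from the audio itself, which is the hard part.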
||I know it's a difficult field. How does
real-time subtitling on TV work? I'd
always assumed it was by speech
recognition, because it makes the sort
of mistakes you'd expect it to make.
||Also, this algorithm doesn't have to be
perfect. Suppose it consistently mis-
hears "prize" as "price"; then I search
for "prize" - the engine will search for
"prize", "price" and anything similar. It
will not be perfect. But, suppose I
search a podcast for "prize" and it
comes up with the following poorly-transcribed matches:
||"would hefty pay the prize for not"
"came as a big sir prize to him"
"offered a price of fifth team hundred
||It would then be very easy for me to see
that the third one was probably what I
was looking for, click on it, and be
taken to part of the podcast that said
"offered a prize of fifteen hundred
||Yeah, your examples are quite close to what podzinger's results felt like.
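A small Python sketch of how a result list like the one discussed above might be put together: each hit is shown with a few words of context, plus a playback position a couple of seconds before the hit. The timestamps, the `snippet` and `format_time` names, and the 2-second lead-in are all assumptions for illustration.

```python
# Hypothetical recogniser output around one hit: (start time in seconds, word).
TRANSCRIPT = [
    (1834.0, "offered"), (1834.5, "a"), (1834.7, "price"), (1835.2, "of"),
    (1835.4, "fifth"), (1835.8, "team"), (1836.1, "hundred"),
]

def snippet(transcript, index, context=3, lead_in=2.0):
    """Return (seek position, context text) for the hit at transcript[index]."""
    lo = max(0, index - context)
    hi = min(len(transcript), index + context + 1)
    text = " ".join(word for _, word in transcript[lo:hi])
    # Start playback slightly early so the listener hears the run-up.
    seek = max(0.0, transcript[index][0] - lead_in)
    return seek, text

def format_time(seconds):
    """Format a position in seconds as mm:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

seek, text = snippet(TRANSCRIPT, 2)
print(f'{format_time(seek)}  "{text}"')
```

The user reads the garbled context, picks the likely line, and the player jumps to the seek position. The transcript only has to be good enough for a human to recognise the right line.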
||This is an existing area of research, but AFAIK, commercial real-time subtitling is done by humans (often stenographers with a little bit of software to translate steno back into normal written language), and the errors you see are human errors.
||(You can tell from the fact that subtitlers quite often summarize, leave out uhms/ahs, rephrase expletives.)
||Hmm. I'm pretty sure I've seen some
errors which I wouldn't expect a human
to make (things like "big sir prize"), but
maybe they are human slips or glitches
in the downstream software.
||According to Mr. Wiki, "Voice
recognition technology has advanced so
quickly in the United Kingdom that
about 50% of all live captioning is
through voice recognition as of 2005.",
but it's also possible that this is done
through a speaker re-voicing in
realtime for clarity - it's not clear
exactly what's done in practice.
||There are certainly different types of
subtitle; some appear in whole phrases,
often colour-coded to the speaker and
often condensed; others appear word by
word, with no obvious condensation,
and look much more computer-
generated to my eye.
||Yeah, respeaking would fit, in that it uses both STT and a human. (And since there's a significant problem with speaker-agnostic STT, this does make a difference.)
Frustratingly, I can't find BBC statistics on any of this - they just say how much they're captioning, not what the quality of the captioning is, or how it's done.
||Same here - it's all precisely vague and
exactly imprecise. But maybe the software
isn't as good as I'd thought. The only hope
for rescuing this might be the fact that the
software needn't operate in realtime.
However, I suspect that speed is not main