March 13, 2007 — A lip-reading computer that could help solve crimes and assist consumers is the goal of a new project at the University of East Anglia in Norwich, England.
When coupled with a speech recognition system, the technology could not only
decipher the words of criminals captured on video but also improve
voice-activated computers in cars or mobile phones.
"There is interest in using lip-reading for all sorts of
human-computer interaction, particularly in noisy environments," said
Richard Harvey, a senior lecturer in the University's School of
Computing Sciences.
"Noisy" can mean that an audio signal is muddled by other sounds, for
example from a car radio or a crowd. But it can also mean that a
visual signal is fuzzy or unclear.
People overcome such communication obstacles by pulling
information from various places — lip movement, facial gestures, body
language — to piece together what's being said. But computers
designed for speech recognition typically focus on speech alone.
In previous experiments, Harvey and his team found that
recognition accuracy improved significantly when a noisy audio signal was
augmented with visual information.
For example, some speech sounds that
are easily confused in the audio domain — "b" and "v," or "m" and
"n" — are distinct in the visual domain. Conversely, some spoken words
look identical in the visual domain, for example, "bat" and "pat."
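
To make the idea concrete, here is a minimal sketch of one common way to combine two recognizers, a weighted "late fusion" of their per-word probabilities. It is not the Norwich team's method, which the article does not describe. The function name `fuse`, the `audio_weight` parameter and all of the probabilities are invented for illustration.

```python
import math

def fuse(audio_probs, visual_probs, audio_weight=0.5):
    """Combine per-word probabilities from two recognizers (hypothetical helper).

    Uses a weighted sum of log-probabilities, then renormalizes.
    All probabilities must be greater than zero.
    """
    visual_weight = 1.0 - audio_weight
    scores = {}
    for word in audio_probs:
        # A noisier audio signal would call for a smaller audio_weight.
        scores[word] = math.exp(
            audio_weight * math.log(audio_probs[word])
            + visual_weight * math.log(visual_probs[word])
        )
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}

# Made-up numbers: the audio recognizer cannot tell "bat" from "vat",
# and the lip-reader cannot tell "bat" from "pat".
audio = {"bat": 0.45, "vat": 0.45, "pat": 0.10}
visual = {"bat": 0.45, "vat": 0.10, "pat": 0.45}

fused = fuse(audio, visual)
best = max(fused, key=fused.get)
print(best)  # prints "bat"
```

Neither source can settle on a word alone, but the only candidate that both rate highly is "bat," so the combined score picks it out.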
The researchers will be working over the next three years to find the best way
to combine audio with video.