Can deep learning help solve lip reading?

New research paper shows AI easily beating humans, but there's still lots of work to be done



Lip reading is a tricky business. Test results vary, but on average, most people recognize just one in 10 words when watching someone’s lips, and the accuracy of self-proclaimed experts tends to vary — there are certainly no lip-reading savants. Now, though, some researchers claim that AI techniques like deep learning could help solve this problem. After all, AI methods that focus on crunching large amounts of data to find common patterns have helped improve audio speech recognition to near-human levels of accuracy, so why can’t the same be done for lip reading?

Much more accurate than humans, but working with very limited data

Researchers from the University of Oxford’s AI lab have made a promising, if crucially limited, contribution to the field, creating a new lip-reading program using deep learning. Their software, dubbed LipNet, was able to outperform experienced lip readers to a significant degree, achieving 93.4 percent accuracy in certain tests, compared to 52.3 percent accuracy from human lip readers. And even in its current, early stages, the software is extremely fast, processing silent video into text transcripts in nearly real time.

However, before we get lost in bad dreams of AI-powered surveillance states and HAL reading lips in 2001: A Space Odyssey, the research from Oxford has some serious limitations. For a start, the system was trained and tested on a research dataset known as GRID. This is a collection of tens of thousands of short, captioned videos of 34 volunteers reading nonsense sentences. Each clip is just three seconds long, and each sentence follows the pattern: command, color, preposition, letter, digit, adverb. Sentences include, for example, "set blue by A four please," and "place red at C zero again." Even the words within these patterns are limited: just four different commands and four different colors are used. This has led some researchers in the field to suggest that the paper's findings have been overblown, especially after one viral tweet linking to the researchers’ video made the sensationalist claim that the work meant there would be "no more secrets."
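
To get a feel for how narrow that sentence pattern is, here is a rough Python sketch of the six-slot grammar. Only the words in the two example sentences come from the article; the remaining word lists (the other commands, colors, prepositions and adverbs, and the choice of 25 letters) are assumptions based on how the GRID corpus is commonly described, and the function names are made up for illustration. This is not the researchers' code.

```python
import random

# Six slots, in the order given in the article:
# command, color, preposition, letter, digit, adverb.
# Words beyond those quoted in the article are ASSUMED, not taken from it.
COMMANDS = ["bin", "lay", "place", "set"]
COLORS = ["blue", "green", "red", "white"]
PREPOSITIONS = ["at", "by", "in", "with"]
LETTERS = [c for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if c != "W"]  # assumed: W left out
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
ADVERBS = ["again", "now", "please", "soon"]

SLOTS = [COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS]


def sentence_count(slots=SLOTS):
    """Number of distinct sentences the grammar can produce."""
    total = 1
    for words in slots:
        total *= len(words)
    return total


def random_sentence(rng, slots=SLOTS):
    """Pick one word per slot and join them into a sentence."""
    return " ".join(rng.choice(words) for words in slots)


def is_valid(sentence, slots=SLOTS):
    """Check that a sentence has one allowed word in each slot, in order."""
    words = sentence.split()
    return len(words) == len(slots) and all(
        word in allowed for word, allowed in zip(words, slots)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    print(sentence_count())
    print(random_sentence(rng))
    print(is_valid("set blue by A four please"))
```

Under these assumed word lists the grammar allows 4 × 4 × 4 × 25 × 10 × 4 = 64,000 possible sentences in total, a tiny space compared with everyday speech, which is the core of the criticism.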

This is certainly not the case. Speaking to The Verge, two of the researchers behind the paper, Yannis Assael and Brendan Shillingford, readily admitted they were working with "restricted vocabulary and grammar," but said this was due to limitations in available data. "The dataset is small but it’s a good indication we could perform just as well with a much bigger dataset," says Assael.

Really, this won't help with surveillance at all

Both Assael and Shillingford are also keen to stress that their work has no application in the world of surveillance, simply because lip reading requires you to see the subject’s tongue, meaning that the video has to be straight on and well-lit to get a good result. "It’s technically impossible or at least very, very difficult" to use any lip-reading software for surveillance, says Assael, adding that frame rate is also a factor, and one that is usually neglected with CCTV. He says: "And if you do have frontal video of someone taken with a very good camera, then you probably have a directional microphone [pointed at them] as well!" (On the subject of surveillance, Assael notes that although one of the paper's supervisors also works with Google's AI division DeepMind, Google itself had no involvement with LipNet's development.)

Instead, the two researchers think that lip-reading AI could help people with hearing disabilities, especially in noisy environments where it’s difficult for computers to isolate speech. For example, someone wearing a camera built into a pair of glasses could get clear, frontal footage of someone they're talking to at a party, and a successor to LipNet could then transcribe the conversation in real time, feeding it into their ear. "Anywhere you have speech recognition and a camera, we can improve that," says Assael. He also mentions silent dictation to Siri or Google Assistant as a possible use case. In the future, then, perhaps those of us who don’t like speaking to our computers can just have them read our lips instead.