Baidu’s new text-to-speech system can master hundreds of accents

And it can do it with just half an hour of audio training

A phone with a recording app installed and running on screen
Photo by Amar Toor / The Verge

There is a renaissance happening in the world of artificial intelligence. Using deep learning, researchers are producing systems that can recognize objects, understand spoken language, and even simulate the human voice. The quality of these systems is advancing at a blistering pace.

Just three months ago, Chinese search giant Baidu showed off Deep Voice, a system for turning text into speech. It could produce speech that was nearly indistinguishable from an actual human voice on first listen, and do it in near real time. But that system could learn only one voice at a time, and it required many hours of audio to build each one. Today the company is rolling out Deep Voice 2. It can learn the nuances of a person’s voice from just half an hour of audio, and a single system can learn to imitate hundreds of different speakers.

Regional accents are now easier to achieve

Remember how long it took for Siri to roll out regional accents? That’s because each new voice required an actual human being to record thousands and thousands of hours of speech. After that, engineers spent a long time hand-tuning the software, teaching it how to speak. Deep Voice 2 takes a different approach: it learns commonalities shared across hundreds of different speakers to build a model of the human voice, then tweaks that model slightly to craft different characters. And the system doesn’t require any manual adjustment from its human creators. “Give it the right data, and it can learn on [its] own what sort of features are important,” says Andrew Gibiansky, a research scientist at Baidu's Silicon Valley AI Lab who works on the Deep Voice project.

Baidu imagines this technology being useful in digital assistants that are controlled through voice commands and respond by speaking to their users. It also sees potential in text-to-speech applications like ebooks. “The ability to quickly synthesize multiple human voices will have a huge effect on products such as personal assistants and eBook readers in the future. For example, each character of your Ebook could have a unique voice when you [listen] to the Ebook,” the company wrote in a blog post.

It’s a crowded field

Baidu isn’t the only tech giant exploring this space. In September of last year, Google’s DeepMind division published research on WaveNet, a vocoder built using deep learning techniques that made huge gains in audio quality over more traditional speech synthesis systems. Startups are also playing in this market. Last month, Lyrebird, a Canadian startup, showed off a system that could imitate the vocal nuances of famous figures based on just one minute of audio data.

As we move into a world where our gadgets are increasingly controlled by our voices and expected to reply in kind, this technology will be used to create all kinds of custom characters for our digital assistants. Would you prefer Siri to sound like Humphrey Bogart, Hulk Hogan, or Lil Kim? We’re taking suggestions in the comments below.