
Google launches more realistic text-to-speech service powered by DeepMind’s AI

OK Google, sing supercalifragilisticexpialidocious

Google is already using DeepMind’s tech to power the voice of Google Assistant.
Photo by Dieter Bohn / The Verge

Google is launching a new AI voice synthesizer as part of its suite of machine learning cloud tools. The service, named Cloud Text-to-Speech, will be available for any developer or business that needs voice synthesis on tap, whether that’s for an app, website, or virtual assistant. But what’s particularly interesting about this news is that Cloud Text-to-Speech is powered by WaveNet, software created by Google’s UK-based AI subsidiary DeepMind.

This is significant for two reasons. First, ever since Google bought DeepMind in 2014, it's been exploring ways to turn the company's AI talent into tangible products. So far, this has meant using DeepMind's algorithms to cut the electricity used for cooling Google's data centers by 40 percent, along with the company's forays into health care. But directly integrating WaveNet into its cloud service is arguably more significant, especially as Google tries to win cloud business away from Amazon and Microsoft by presenting its AI expertise as its differentiating factor.

Second, DeepMind’s AI voice synthesis tech is some of the most advanced and realistic in the business. Most voice synthesizers (including Apple’s Siri) use what’s called concatenative synthesis, in which a program stores individual syllables — sounds such as “ba,” “sht,” and “oo” — and pieces them together on the fly to form words and sentences. This method has gotten pretty good over the years, but it still sounds stilted.
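The stitching-together idea is easy to see in miniature. Here's a toy sketch (the unit inventory and sample values are invented for illustration, not taken from any real TTS engine): prerecorded snippets are looked up and glued end to end to form a word.

```python
# Toy illustration of concatenative synthesis. Each list stands in for a
# short prerecorded audio waveform; a real engine stores thousands of
# such units and also smooths the joins between them -- imperfect joins
# are where the characteristic stilted quality creeps in.
UNITS = {
    "he": [0.1, 0.3, 0.2],
    "llo": [0.4, 0.1, 0.0],
}

def concatenate(unit_names):
    """Stitch stored waveform snippets together on the fly."""
    samples = []
    for name in unit_names:
        samples.extend(UNITS[name])
    return samples

print(concatenate(["he", "llo"]))  # [0.1, 0.3, 0.2, 0.4, 0.1, 0.0]
```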

A GIF showing how DeepMind’s WaveNet model has improved over the years.
Image: DeepMind

WaveNet, by comparison, uses machine learning to generate audio from scratch. It actually analyzes the waveforms from a huge database of human speech and re-creates them at a rate of 24,000 samples per second. The end result includes voices with subtleties like lip smacks and accents. When Google first unveiled WaveNet in 2016, it was far too computationally intensive to work outside of research environments, but it’s since been slimmed down significantly, showing a clear pipeline from research to product.
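The generate-from-scratch approach works autoregressively: each new audio sample is predicted from the samples that came before it. The sketch below shows only that feedback loop; the stand-in "model" is a made-up weighted sum, whereas the real WaveNet uses a deep neural network trained on recorded speech.

```python
# Minimal sketch of sample-by-sample autoregressive audio generation,
# the core idea behind WaveNet. The toy_model function is a hypothetical
# stand-in for the trained network.
def toy_model(context):
    # Predict the next sample from the last three samples.
    return 0.5 * context[-1] - 0.25 * context[-2] + 0.1 * context[-3]

def generate(seed, n_samples):
    """Generate audio one sample at a time, feeding each prediction back
    in as input -- WaveNet does this 24,000 times per second of audio."""
    samples = list(seed)
    for _ in range(n_samples):
        samples.append(toy_model(samples))
    return samples

audio = generate([0.0, 0.1, 0.2], 24)  # 1 ms of audio at 24 kHz
print(len(audio))  # 27
```

Generating audio one sample at a time is exactly why the original 2016 model was so computationally expensive: every second of speech requires tens of thousands of sequential predictions.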

WaveNet was first integrated into Google Assistant last October (although only in Japanese and US English) and is now available for select voices in Cloud Text-to-Speech. Google says the new service offers 32 different voices capable of speaking 12 languages, and users are able to customize factors like pitch and speed. So, be prepared for a wave of new, realistic computer voices to argue with and boss around. You can check out how WaveNet sounds for yourself below.
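For developers, the service is exposed through a REST endpoint; a request along these lines (the voice name and parameter values here are illustrative) picks a WaveNet voice and tunes the pitch and speaking rate Google mentions:

```json
{
  "input": { "text": "Hello from Cloud Text-to-Speech." },
  "voice": {
    "languageCode": "en-US",
    "name": "en-US-Wavenet-A"
  },
  "audioConfig": {
    "audioEncoding": "MP3",
    "speakingRate": 1.0,
    "pitch": 2.0
  }
}
```

The response contains the synthesized audio, base64-encoded, ready to be saved or played back.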

Here’s an industry-leading synthesized voice:

And here’s the same sentence from WaveNet:

Here’s another rival’s voice synthesizer, this time speaking Japanese:

And again, here’s the same sentence from WaveNet: