At the start of this year, Chinese search giant Baidu introduced a new system called DeepVoice. It uses deep learning, a popular artificial intelligence technique, to convert text to speech. The first version was able to produce short sentences that, at least on a cursory listen, were nearly indistinguishable from a real person. That system could learn only one voice at a time, and required hours of data to master each one.
DeepVoice 2, which debuted in May, could imitate a voice with just half an hour of data, and a single system could learn hundreds of different accents. Today, Baidu is introducing the third version of DeepVoice; the company says this version can learn 2,500 voices with just half an hour of data each. Baidu says that “having a system that is able to effectively generate a wide variety of voices opens the door to many use cases that would otherwise not be feasible. For example, each character in an audio book or a video game would have his or her own unique voice for a more enhanced user experience.”
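The trick that lets one model serve many voices is, per Baidu's DeepVoice papers, a learned per-speaker embedding that conditions a single shared network: adding a new voice mostly means learning a new embedding row, which is why each voice needs so little data. Below is a minimal sketch of that idea; all the sizes and the single linear map standing in for the network are illustrative assumptions, not DeepVoice's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
NUM_SPEAKERS = 4   # DeepVoice 3 scales this idea to thousands of voices
EMBED_DIM = 8      # size of the learned per-speaker embedding
TEXT_DIM = 16      # dimensionality of the encoded text features
AUDIO_DIM = 32     # dimensionality of the predicted acoustic features

# One trainable embedding row per speaker; everything else is shared.
speaker_embeddings = rng.normal(size=(NUM_SPEAKERS, EMBED_DIM))

# Stand-in for the shared text-to-acoustics network (a single linear map here).
W = rng.normal(size=(TEXT_DIM + EMBED_DIM, AUDIO_DIM))

def synthesize(text_features: np.ndarray, speaker_id: int) -> np.ndarray:
    """Condition the shared network on one speaker's embedding.

    The same weights W serve every voice; only the embedding row differs.
    """
    conditioned = np.concatenate([text_features, speaker_embeddings[speaker_id]])
    return conditioned @ W

text = rng.normal(size=TEXT_DIM)
out_a = synthesize(text, speaker_id=0)
out_b = synthesize(text, speaker_id=1)

# Same text, same shared weights, different voices:
print(out_a.shape)                # (32,)
print(np.allclose(out_a, out_b))  # False
```

The point of the design is in the last two lines: identical input text yields different acoustic output purely because the speaker embedding changed, so per-voice training data only has to pin down a small vector rather than a whole model.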
The examples of synthesized voices Baidu showed off from DeepVoice 3 don’t sound human in the same way its initial examples did. They are clearly synthetic. The company argues that’s not what it’s aiming for, saying, “If we are generating only one or two voices, as our single speaker sample has shown, our system is already proven to be able to synthesize very natural, human-like voices that can be readily used as a digital assistant.”
What Baidu’s trying to do is craft a system that can master the nuances of a multiplicity of accents or characters. While 2,500 is the current limit, the team says it believes future versions, using a bigger data set, could master 10,000 or more. “This is the initial work showing the possibility of scalability. Our system succeeded in scaling the training to a size and magnitude that’s never been done in previously published text-to-speech models. We believe the quality can be increased substantially in the near future by using large high-quality datasets to train with additional machine learning engineering.”
Baidu isn’t the only search giant working on computer speech synthesis. Google’s DeepMind division has been pursuing a similar project with WaveNet. Its latest version has gotten much better at mastering accents and even produces “lip smacks” that make the voices sound more human. It’s now being used in production to generate voices for the Google Assistant in English and Japanese.