Text-to-speech: why do we still record and splice instead of algorithmically creating voices?
My comment was probably too long to expect a decent reply in a comment thread, so I'm posting it here. I'm curious what people think the limitations are.
I'm a bit curious why it needs to sound like anyone at all. Is human speech so complicated that we couldn't replicate it with an algorithm that generates each sound for its specific context, based on a model of how most human speech sounds are produced?
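For what it's worth, the "generate each sound from a model" idea has a classic form: source-filter (formant) synthesis, where a pitched source is shaped by resonances at formant frequencies. Here is a toy additive sketch of that idea; the formant frequencies and weights are illustrative guesses loosely resembling an "ah" vowel, not measured values.

```python
import math

SAMPLE_RATE = 16000  # samples per second

def synth_vowel(f0=120.0,
                formants=((730, 1.0), (1090, 0.5), (2440, 0.25)),
                bandwidth=80.0, duration=0.3):
    """Toy formant synthesis: sum harmonics of the pitch f0, each
    weighted by Gaussian bumps centred on the formant frequencies.
    All constants here are illustrative, not measured from speech."""
    # Weight each harmonic by its proximity to the formants.
    partials = []
    k = 1
    while k * f0 < SAMPLE_RATE / 2:  # stay below the Nyquist frequency
        freq = k * f0
        amp = sum(a * math.exp(-((freq - f) ** 2) / (2 * bandwidth ** 2))
                  for f, a in formants)
        partials.append((freq, amp))
        k += 1
    # Render the waveform and peak-normalize it to [-1, 1].
    n = int(SAMPLE_RATE * duration)
    wave = [sum(a * math.sin(2 * math.pi * f * i / SAMPLE_RATE)
                for f, a in partials)
            for i in range(n)]
    peak = max(abs(s) for s in wave) or 1.0
    return [s / peak for s in wave]
```

This produces a buzzy, vowel-ish tone, which is exactly the point: getting from "recognizably vowel-like" to "indistinguishable from a person" is where the difficulty lives.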
I understand that there are a lot of variables in the mechanisms that produce human sound, but I can't imagine it wouldn't be more efficient than recording every sound a person makes individually, splicing it up, and reconnecting it to fit our needs.
The current method is obviously necessary for at least some part of the process, but I'd imagine that eventually we would be able not only to generate natural-sounding voices, but also to change the sound of a voice on a whim.
And imagine the use cases for that kind of technology. Game developers could use algorithmic voice generators to create non-main characters' voices, saving money and expanding the range of what could be produced.
Obviously, it would be nearly impossible to express the full range of emotions properly, but you could eventually develop algorithms that adjust for a range of emotions. Imagine a voice modulator with a knob for each pre-configured emotion. You could turn up the distress/panic knob to make a voice sound more winded and high-pitched, or turn the depressed knob to make the voice taper off through each sentence and crack a bit more on its high notes.
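The knob idea can be sketched as a mapping from emotion intensities to prosody parameters (pitch, speaking rate, jitter). Everything below is hypothetical: the parameter names and coefficients are made up for illustration, not taken from any real synthesizer.

```python
def apply_emotion_knobs(prosody, panic=0.0, depressed=0.0):
    """Hypothetical emotion 'knobs': scale neutral prosody parameters
    by intensities in [0, 1]. Panic raises pitch and rate and adds
    jitter; depression lowers pitch and rate. Coefficients are
    illustrative guesses, not measured from real speech."""
    pitch = prosody["pitch_hz"] * (1 + 0.5 * panic) * (1 - 0.15 * depressed)
    rate = prosody["rate_wps"] * (1 + 0.3 * panic) * (1 - 0.25 * depressed)
    jitter = prosody["jitter"] + 0.2 * panic + 0.1 * depressed
    return {"pitch_hz": pitch, "rate_wps": rate, "jitter": min(jitter, 1.0)}
```

A real system would drive many more parameters (breathiness, voice cracks, per-sentence pitch contours), but the interface could look this simple: a neutral voice in, knob positions in, adjusted synthesis parameters out.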
This is where I imagine we'd eventually want to be, and I can see a huge number of use cases for it, particularly in indie gaming and indie animation.