For decades, computer scientists have tried to mimic these abilities and make computers sound more “natural.” Yet, despite incredible progress, artificial speech has struggled to match the qualities of the human voice.
When we first started working on WaveNet, most text-to-speech systems relied on “concatenative synthesis” — a painstaking process of cutting voice recordings into phonetic sounds and recombining them to form new words and sentences.
The resulting voices often sound mechanical, and making changes requires entirely new recordings — an expensive and time-consuming process.
WaveNet addresses these limitations, offering a technology that finally allows people to interact more naturally with the products they use.
Generative models can be trained on images, videos, or sounds and, once trained, should be able to create new, realistic examples based on what they have learned.
For example, if we train a generative model on a dataset of landscape drawings, it should learn to create entirely novel images of landscapes not seen in the dataset.
The more realistic the results, the stronger the evidence that the model has learned the underlying structure of the dataset, rather than simply memorising examples.
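The idea of fitting a distribution to data and then sampling novel examples from it can be illustrated with a deliberately tiny sketch. This toy model only estimates a categorical distribution over discrete items — nothing like the deep networks discussed here — and all names in it are hypothetical:

```python
import random
from collections import Counter

def fit(examples):
    # "Training": estimate the empirical distribution of the data.
    counts = Counter(examples)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}

def sample(model, n, seed=0):
    # "Generation": draw new examples from the learned distribution.
    random.seed(seed)
    items = list(model)
    weights = [model[item] for item in items]
    return random.choices(items, weights=weights, k=n)

model = fit(["sky", "tree", "tree", "hill"])
new_examples = sample(model, 5)
```

A real generative model replaces the frequency table with a neural network, but the contract is the same: learn the distribution of the training data, then sample from it.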
WaveNet generates raw speech waveforms by predicting which sound sample is likely to follow the previous ones. Each waveform is built one sample at a time, at up to 24,000 samples per second of sound.
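That autoregressive loop — predict a distribution over the next sample, draw from it, feed it back in — can be sketched in a few lines. The predictive model below is a stand-in (a trained WaveNet would be a deep convolutional network, and 256 levels correspond to the 8-bit quantisation used in the paper); the function names are hypothetical:

```python
import random

def toy_next_sample_distribution(context):
    # Placeholder for a trained network: returns a probability
    # distribution over 256 quantised amplitude levels, conditioned
    # on the samples generated so far. This toy version simply
    # favours levels near the last sample, to illustrate the interface.
    last = context[-1] if context else 128
    weights = [1.0 / (1 + abs(level - last)) for level in range(256)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples, seed=0):
    random.seed(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples)
        # Sample the next amplitude level and feed it back as context.
        level = random.choices(range(256), weights=probs)[0]
        samples.append(level)
    return samples

audio = generate(100)
```

Generating one sample at a time is what makes the output so natural — and also what made the original model slow, since a single second of audio requires thousands of sequential predictions.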
And because the model learns from human speech, WaveNet automatically incorporates natural-sounding elements left out of earlier text-to-speech systems, such as lip-smacking and breathing patterns.
By including intonation, accents, emotion, and other vital layers of communication overlooked by earlier systems, WaveNet gives computer-generated voices a new richness and depth.
For example, when we first introduced WaveNet, we created American English and Mandarin Chinese voices that narrowed the gap between human and computer-generated voices by 50%.
Using a technique called distillation — transferring knowledge from a large model to a smaller model — we reengineered WaveNet to run 1,000 times faster than our research prototype, creating one second of speech in just 50 milliseconds.
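The core of distillation is training the small model to match the large model's full output distribution rather than hard labels. A minimal sketch of such a distillation loss, assuming softened softmax outputs with a temperature (a common formulation; the exact loss used for the production WaveNet differed in detail):

```python
import math

def softmax(logits, temperature=1.0):
    # Convert raw scores into a probability distribution; a higher
    # temperature "softens" the distribution, exposing more of the
    # teacher's knowledge about relative likelihoods.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and
    # the student's: minimising it pulls the student's predictions
    # towards the teacher's.
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
```

The loss is lowest when the student reproduces the teacher's distribution exactly, so a much smaller — and much faster — network can inherit most of the large model's behaviour.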
In parallel, we also developed WaveRNN — a simpler, faster, and more computationally efficient model that could run on mobile phones rather than in a data centre.
The following year, we partnered with the Google speech team to launch WaveNet as the voice of Google Assistant.
After improving the experience for users in American English and Japanese, WaveNet was rolled out to create dozens of voices in different languages for millions of people using the Assistant through their smart-home and mobile devices.
In another demonstration, WaveNet was used to recreate the voices of two celebrities who featured as cameos on the Assistant. Using only a few hours of speech samples from each celebrity, we integrated the voices of singer John Legend and actress Issa Rae.
On the latest Android devices, WaveRNN also now powers the Assistant voice.
Diagnosed with ALS in 2014, former NFL linebacker Tim Shaw watched as his strength, and his voice, deteriorated. To help, Google AI, the ALS Therapy Institute, and Project Euphonia (a Google program applying AI to help people with atypical speech) developed a service to better understand Shaw’s impaired speech.
WaveRNN was combined with other speech technologies and a dataset of previously recorded media interviews to create a natural-sounding version of Shaw’s voice, empowering him to read aloud a letter written to his younger self.
And, through Google Cloud, businesses can now choose from hundreds of lifelike voices in over 30 languages, or use a WaveRNN service to build a custom voice from only 30 minutes of speech, improving customer service and device interactions.
Extensions of WaveNet are also helping create entirely new product experiences.
For example, WaveNetEQ and Lyra help fill in lost information and improve the quality of calls on weak connections for Google’s video-calling app Duo.
The same technology that made it possible for Tim Shaw to regain a voice lost to a degenerative disease also helps answer some of the one billion queries asked of the Google Assistant every day.
It also has the potential to help millions more people to communicate successfully, translate instantly across multiple languages, expand small businesses with custom audio content, and much more.
WaveNet is helping unlock barriers in communication, culture, and commerce for people around the world every day. And its journey is just beginning.