WaveNet

Finding a voice

WaveNet creates more natural-sounding speech for products used by millions of people around the world.

The challenge

The human voice is a finely tuned instrument. Its tone, tempo, and inflection help us share ideas, communicate needs, and express emotions.

For decades, computer scientists have tried to mimic these abilities and make computers sound more “natural.” Yet, despite incredible progress, artificial speech has struggled to match the qualities of the human voice.

When we first started working on WaveNet, most text-to-speech systems relied on “concatenative synthesis”: a painstaking process of cutting voice recordings into phonetic sounds and recombining them to form new words and sentences.

The resulting voices often sounded mechanical, and making changes required entirely new recordings, which was expensive and time-consuming.

WaveNet addresses these limitations, offering a technology that finally allows people to interact more naturally with the products they use.

“WaveNet rapidly went from a research prototype within our wider effort to understand intelligence to an advanced product used by millions around the world. It offers a glimpse of the enormous range of applications and benefits we believe AI can bring to the world.”

Koray Kavukcuoglu

Vice President of Research

Learning by doing

WaveNet emerged from our team's research in generative models, a type of AI system that can be trained to create new instances of a dataset of interest.

They can be trained on images, videos, or sounds and, once trained, should be able to create new, realistic examples based on what they have learned.

For example, if we train a generative model on a dataset of landscape drawings, it should learn to create entirely novel images of landscapes not seen in the dataset.

The more accurate the results, the more it suggests the model has learned the underlying structure of the dataset, rather than simply memorising examples.
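The idea of training a model and then sampling new examples from it can be illustrated with a deliberately tiny sketch. It assumes the simplest possible generative model, a single Gaussian fitted to a handful of made-up numbers; the function names and the data are hypothetical and have nothing to do with how WaveNet itself is built.

```python
import random
import statistics

def fit_gaussian(data):
    """'Training': estimate the mean and standard deviation of the dataset."""
    return statistics.mean(data), statistics.stdev(data)

def sample(model, count, seed=0):
    """'Generation': draw new values from the fitted distribution."""
    mean, std = model
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(count)]

# Made-up training data, for illustration only.
training_data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

model = fit_gaussian(training_data)
new_examples = sample(model, 5)
print(new_examples)
```

The generated values resemble the training data without copying any of it, which is the distinction between learning structure and memorising examples.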

Introducing WaveNet

WaveNet is a generative model that is trained on speech samples.

It creates the waveforms of speech patterns by predicting which sounds are likely to follow each other. Each waveform is built one sample at a time, with up to 24,000 samples per second of sound.
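Building a waveform one sample at a time can be sketched as a simple loop in which each new sample is drawn from a distribution conditioned on the samples generated so far. The sketch below is only an illustration: `toy_next_sample_distribution` is a hypothetical stand-in for the trained network, and representing each sample as one of 256 discrete levels is an assumption made for this example.

```python
import math
import random

SAMPLE_RATE = 24_000  # samples per second, the upper figure quoted above
NUM_LEVELS = 256      # assumption: each sample is one of 256 discrete levels

def toy_next_sample_distribution(history):
    """Hypothetical stand-in for a trained model.

    Returns one probability per level, favouring levels close to the
    most recent sample so that the output changes smoothly.
    """
    prev = history[-1] if history else NUM_LEVELS // 2
    weights = [math.exp(-abs(level - prev) / 4.0) for level in range(NUM_LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples, seed=0):
    """Generate a waveform autoregressively, one sample per step."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples)
        next_level = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        samples.append(next_level)  # the new sample becomes part of the history
    return samples

# 10 milliseconds of audio at 24,000 samples per second is 240 samples.
waveform = generate(SAMPLE_RATE // 100)
print(len(waveform))
```

Because every sample depends on the ones before it, the loop cannot be trivially parallelised, which is one reason sample-by-sample generation is slow.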

And because the model learns from human speech, WaveNet automatically incorporates natural-sounding elements left out of earlier text-to-speech systems, such as lip-smacking and breathing patterns.

By including intonation, accents, emotion, and other vital layers of communication overlooked by earlier systems, WaveNet brings richness and depth to computer-generated voices.

For example, when we first introduced WaveNet, we created American English and Mandarin Chinese voices that narrowed the gap between human and computer-generated voices by 50%.

“WaveNet is a general-purpose technology that has allowed us, and teams at Google, to unlock a range of new applications, from improving video calls on even the weakest connections to helping people who've lost their ability to speak regain their original voice.”

Zachary Gleicher

Product Manager

Rapid advances

Early versions of WaveNet were time-consuming, taking hours to generate just one second of audio. To be useful for consumer products, we knew WaveNet needed to run much faster.

Using a technique called distillation — transferring knowledge from a large model to a smaller model — we reengineered WaveNet to run 1,000 times faster than our research prototype, creating one second of speech in just 50 milliseconds.
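The general idea of distillation, training a smaller "student" to match the outputs of a larger "teacher", can be sketched in a few lines. This is a generic toy under stated assumptions, not the procedure used for WaveNet: the teacher is just a fixed probability distribution, the student is a vector of logits, and the loss is the KL divergence between the two. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student's q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill(teacher_probs, student_logits, learning_rate=0.5, steps=500):
    """Gradient descent on KL(teacher || student) over the student's logits.

    For a softmax student, the gradient with respect to each logit is
    simply (student probability - teacher probability).
    """
    logits = list(student_logits)
    for _ in range(steps):
        student_probs = softmax(logits)
        logits = [z - learning_rate * (s - t)
                  for z, s, t in zip(logits, student_probs, teacher_probs)]
    return logits

# Hypothetical teacher output over four possible next values.
teacher_probs = softmax([2.0, 0.5, -1.0, 0.0])
initial_logits = [0.0, 0.0, 0.0, 0.0]

before = kl_divergence(teacher_probs, softmax(initial_logits))
trained_logits = distill(teacher_probs, initial_logits)
after = kl_divergence(teacher_probs, softmax(trained_logits))
print(before, after)
```

In a real system the student would be a smaller network trained on many inputs, but the principle is the same: the teacher's predictions, rather than raw data alone, provide the training signal.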

In parallel, we also developed WaveRNN — a simpler, faster, and more computationally efficient model that could run on mobile phones rather than in a data centre.

The power of voice

At its 2016 I/O developer conference, Google introduced an AI-powered virtual assistant designed to answer questions and perform tasks in real-time.

The following year, we partnered with the Google speech team to launch WaveNet as the voice of Google Assistant.

After it improved the experience for users in American English and Japanese, WaveNet was rolled out to create dozens of voices in different languages for millions of people using the Assistant through their smart-home and mobile devices.

In another demonstration, WaveNet was used to recreate the voices of two celebrities who made cameo appearances on the Assistant. Using only a few hours of speech samples from each celebrity, we integrated the voices of singer John Legend and actress Issa Rae.

On the latest Android devices, WaveRNN also now powers the Assistant voice.

Regaining speech

People living with progressive neurological diseases like ALS (amyotrophic lateral sclerosis), Parkinson’s and multiple sclerosis often lose control of their muscles, and ultimately their ability to speak.

Diagnosed with ALS in 2014, former NFL linebacker Tim Shaw watched as his strength, and his voice, deteriorated. To help, Google AI, the ALS Therapy Institute, and Project Euphonia (a Google program applying AI to help people with atypical speech) developed a service to better understand Shaw’s impaired speech.

WaveRNN was combined with other speech technologies and a dataset of previously recorded media interviews to create a natural-sounding version of Shaw’s voice, empowering him to read aloud a letter written to his younger self.

Building blocks

WaveNet and WaveRNN are now crucial components of many of Google’s best known services such as the Google Assistant, Maps, and Search.

And, through Google Cloud, businesses can now choose from hundreds of lifelike voices in over 30 languages or use a WaveRNN service to make a custom voice from only 30 minutes of speech, to improve customer service and device interactions.

Extensions of WaveNet are also helping create entirely new product experiences.

For example, WaveNetEQ and Lyra help fill in lost information and improve the quality of calls on weak connections for Google’s video-calling app Duo.

Looking ahead

Since we published our research in 2016, WaveNet has gone from a research concept to an advanced real-world system used by millions of people around the world.

The same technology that made it possible for Tim Shaw to regain a voice lost to a degenerative disease also helps answer some of the one billion queries asked of the Google Assistant every day.

It also has the potential to help millions more people to communicate successfully, translate instantly across multiple languages, expand small businesses with custom audio content, and much more.

WaveNet is helping unlock barriers in communication, culture, and commerce for people around the world every day. And its journey is just beginning.