
WaveNet

WaveNet is a deep neural network for generating raw audio. It was created by researchers at the London-based AI firm DeepMind. The technique, outlined in a paper in September 2016,[1] generates relatively realistic-sounding, human-like voices by directly modelling waveforms with a neural network trained on recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperformed Google's best existing text-to-speech (TTS) systems, although as of 2016 its synthesised speech was still less convincing than actual human speech.[2] Because WaveNet generates raw waveforms, it can model any kind of audio, including music.[3]

History

Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft's Cortana, Amazon Alexa and the Google Assistant.[4]

Most such systems use a variation of a technique that involves concatenating sound fragments to form recognisable sounds and words.[5] The most common of these is called concatenative TTS.[6] It consists of a large library of speech fragments, recorded from a single speaker, that are then concatenated to produce complete words and sounds. The result sounds unnatural, with an odd cadence and tone.[7] The reliance on a recorded library also makes it difficult to modify or change the voice.[8]

Another technique, known as parametric TTS,[9] uses mathematical models to recreate sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech itself is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural-sounding audio.

Design and ongoing research

Background

 
A stack of dilated causal convolutional layers[10]

WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a softmax (i.e. categorical) distribution over the signal value, which is encoded using a μ-law companding transformation and quantized to 256 possible values.[11]
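The encoding and sampling step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not DeepMind's implementation: the function names (mu_law_encode, mu_law_decode, sample_next) are hypothetical, the input is assumed to be floating-point audio scaled to [-1, 1], and the logits passed to sample_next stand in for the output of a trained network.

```python
import numpy as np

MU = 255  # 256 quantization levels correspond to mu = 255


def mu_law_encode(x, mu=MU):
    """Compand float audio in [-1, 1] and quantize it to integer classes 0..mu."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    # mu-law companding: finer resolution near zero, coarser near +/-1
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] onto the integer range [0, mu], rounding to nearest
    return np.floor((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)


def mu_law_decode(q, mu=MU):
    """Invert mu_law_encode, up to quantization error."""
    q = np.asarray(q, dtype=np.float64)
    compressed = 2.0 * q / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu


def sample_next(logits, rng):
    """Draw one class index from the categorical distribution softmax(logits)."""
    z = logits - np.max(logits)  # subtract the max for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return int(rng.choice(len(probs), p=probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = mu_law_encode(np.sin(np.linspace(0.0, 2.0 * np.pi, 8)))
    print(codes, mu_law_decode(codes))
    print(sample_next(rng.normal(size=256), rng))
```

Treating each sample as one of 256 classes turns waveform generation into a sequence of classification problems, which is why a softmax output can be used instead of a continuous-valued regression.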

Initial concept and results

According to the original September 2016 DeepMind research paper WaveNet: A Generative Model for Raw Audio,[12] the network was fed real waveforms of speech in English and Mandarin. As these pass through the network, it learns a set of rules to describe how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms at 16,000 samples per second. These waveforms include realistic breaths and lip smacks – but do not conform to any language.[13]
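Modelling how a waveform evolves over time requires that each predicted sample depend only on earlier samples, which is the role of the dilated causal convolutions the network is built from. The following NumPy sketch illustrates that property under stated assumptions: a kernel size of 2, a hypothetical dilation schedule of 1, 2, 4, 8, and fixed averaging weights in place of learned ones. It omits the nonlinearities and other components of the real network, and the function names are illustrative only.

```python
import numpy as np

DILATIONS = [1, 2, 4, 8]  # assumed schedule, chosen for illustration only


def causal_dilated_conv(x, weights, dilation):
    """Kernel-size-2 causal convolution.

    y[t] = weights[0] * x[t - dilation] + weights[1] * x[t],
    with zeros substituted for samples before the start of the signal.
    """
    x = np.asarray(x, dtype=np.float64)
    padded = np.concatenate([np.zeros(dilation), x])  # pad on the left only
    return weights[0] * padded[: len(x)] + weights[1] * padded[dilation:]


def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def stack(x, dilations=DILATIONS):
    """Apply one causal layer per dilation, using fixed averaging weights."""
    h = np.asarray(x, dtype=np.float64)
    for d in dilations:
        h = causal_dilated_conv(h, (0.5, 0.5), d)
    return h


if __name__ == "__main__":
    impulse = np.zeros(32)
    impulse[0] = 1.0
    print(receptive_field(DILATIONS))
    print(np.count_nonzero(stack(impulse)))
```

Because the padding is applied only on the left, changing an input sample cannot alter any earlier output. Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of layers grows only linearly.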

WaveNet can accurately model different voices, with the accent and tone of the input correlating with the output. For example, if it is trained with German, it produces German speech.[14] This capability also means that if WaveNet is fed other inputs, such as music, its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce waveforms that sound like classical music.[15]

Content (voice) swapping

According to the June 2018 paper Disentangled Sequential Autoencoder,[16] DeepMind has successfully used WaveNet for audio and voice "content swapping": the network can replace the voice on an audio recording with another, pre-existing voice while preserving the text and other features of the original recording. In the authors' words: "We also experiment on audio sequence data. Our disentangled representation allows us to convert speaker identities into each other while conditioning on the content of the speech." (p. 5) "For audio, this allows us to convert a male speaker into a female speaker and vice versa [...]." (p. 1) According to the paper, tens of hours (c. 50 hours) of existing speech recordings of both the source and the target voice must be fed into WaveNet so that it can learn their individual features before it can convert one voice to the other at satisfactory quality. The authors stress that "[a]n advantage of the model is that it separates dynamical from static features [...]." (p. 8) That is, WaveNet can distinguish between two kinds of information: the spoken text and its mode of delivery (modulation, speed, pitch, mood, etc.), which must be preserved during the conversion, and the basic features of the source and target voices, which must be swapped.

The January 2019 follow-up paper Unsupervised speech representation learning using WaveNet autoencoders[17] describes a method that improves the automatic recognition of, and discrimination between, dynamical and static features, making "content swapping" more reliable, notably when swapping voices on existing audio recordings. Another follow-up paper, Sample Efficient Adaptive Text-to-Speech,[18] dated September 2018 (latest revision January 2019), states that DeepMind has reduced the minimum amount of real-life recordings required to sample an existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results.

WaveNet's ability to clone voices has raised ethical concerns about mimicking the voices of living and dead persons. According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting. They also maintain that voice cloning for purposes such as entertainment would be far less complex, and would use different methods, than what is required to fool forensic methods and electronic ID devices, so that natural voices and voices cloned for entertainment could still be easily told apart by technological analysis.[19]

Applications

At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real-world applications.[20] In October 2017, Google announced a 1,000-fold performance improvement along with better voice quality. WaveNet was then used to generate Google Assistant voices for US English and Japanese across all Google platforms.[21] In November 2017, DeepMind researchers released a research paper detailing a proposed method, called "Probability Density Distillation", of "generating high-fidelity speech samples at more than 20 times faster than real-time".[22] At the annual I/O developer conference in May 2018, it was announced that new Google Assistant voices were available and had been made possible by WaveNet, which greatly reduced the number of audio recordings required to create a voice model by modelling the raw audio of the voice actor samples.[23]

See also

  • 15.ai
  • Deep learning speech synthesis

References

  1. ^ van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499 [cs.SD].
  2. ^ Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  3. ^ Meyer, David (2016-09-09). "Google's DeepMind Claims Massive Progress in Synthesized Speech". Fortune. Retrieved 2017-07-06.
  4. ^ Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg.com. Retrieved 2017-07-06.
  5. ^ Condliffe, Jamie (2016-09-09). "When this computer talks, you may actually want to listen". MIT Technology Review. Retrieved 2017-07-06.
  6. ^ Hunt, A. J.; Black, A. W. (May 1996). "Unit selection in a concatenative speech synthesis system using a large speech database". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings (PDF). Vol. 1. pp. 373–376. CiteSeerX 10.1.1.218.1335. doi:10.1109/ICASSP.1996.541110. ISBN 978-0-7803-3192-1. S2CID 14621185.
  7. ^ Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  8. ^ van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  9. ^ Zen, Heiga; Tokuda, Keiichi; Black, Alan W. (2009). "Statistical parametric speech synthesis". Speech Communication. 51 (11): 1039–1064. CiteSeerX 10.1.1.154.9874. doi:10.1016/j.specom.2009.04.004. S2CID 3232238.
  10. ^ van den Oord, Aäron (2017-11-12). "High-fidelity speech synthesis with WaveNet". DeepMind. Retrieved 2022-06-05.
  11. ^ van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499 [cs.SD].
  12. ^ van den Oord, Aaron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499 [cs.SD].
  13. ^ Gershgorn, Dave (2016-09-09). "Are you sure you're talking to a human? Robots are starting to sounding eerily lifelike". Quartz. Retrieved 2017-07-06.
  14. ^ Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. Retrieved 2017-07-06.
  15. ^ van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. Retrieved 2017-07-06.
  16. ^ Li, Yingzhen; Mandt, Stephan (2018). "Disentangled Sequential Autoencoder". arXiv:1803.02991 [cs.LG].
  17. ^ Chorowski, Jan; Weiss, Ron J.; Bengio, Samy; Van Den Oord, Aaron (2019). "Unsupervised Speech Representation Learning Using WaveNet Autoencoders". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 27 (12): 2041–2053. arXiv:1901.08810. doi:10.1109/TASLP.2019.2938863.
  18. ^ Chen, Yutian; Assael, Yannis; Shillingford, Brendan; Budden, David; Reed, Scott; Zen, Heiga; Wang, Quan; Cobo, Luis C.; Trask, Andrew; Laurie, Ben; Gulcehre, Caglar; van den Oord, Aäron; Vinyals, Oriol; de Freitas, Nando (2018). "Sample Efficient Adaptive Text-to-Speech". arXiv:1809.10460 [cs.LG].
  19. ^ "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07.
  20. ^ "Adobe Voco 'Photoshop-for-voice' causes concern". BBC News. 2016-11-07. Retrieved 2017-07-06.
  21. ^ WaveNet launches in the Google Assistant
  22. ^ van den Oord, Aaron; Li, Yazhe; Babuschkin, Igor; Simonyan, Karen; Vinyals, Oriol; Kavukcuoglu, Koray; van den Driessche, George; Lockhart, Edward; Cobo, Luis C.; Stimberg, Florian; Casagrande, Norman; Grewe, Dominik; Noury, Seb; Dieleman, Sander; Elsen, Erich; Kalchbrenner, Nal; Zen, Heiga; Graves, Alex; King, Helen; Walters, Tom; Belov, Dan; Hassabis, Demis (2017). "Parallel WaveNet: Fast High-Fidelity Speech Synthesis". arXiv:1711.10433 [cs.LG].
  23. ^ Martin, Taylor (May 9, 2018). "Try the all-new Google Assistant voices right now". CNET. Retrieved May 10, 2018.

External links

  • WaveNet: A Generative Model for Raw Audio
