Artificial intelligence tools made a splash last year. In 2023, these tools are expected to become even smarter and more accurate. Microsoft's new AI is testimony to the potential that is hiding in machine learning.
Microsoft's new AI can simulate anyone's voice from just three seconds of audio. On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that simulates a person's voice after being given a short sample of it.
Once VALL-E learns anyone's voice, it can synthesise audio of that person saying virtually anything. Yikes! While we can't help but think of ways in which such tech could be exploited in a dog-eat-dog world, it's still a fascinating step forward in building a smarter world.
VALL-E's creators say that it could be used to create high-quality text-to-speech applications. It could also enable speech editing, allowing users to change what was originally said in a recording.
Microsoft says that VALL-E is a "neural codec language model" built on a technology called EnCodec that was announced by Meta in October 2022. Most text-to-speech models synthesise speech by manipulating waveforms, but VALL-E instead generates discrete audio codec codes from text and acoustic prompts, Ars Technica reported.
VALL-E first analyses how a person sounds, breaks that information into "tokens", and then draws on its training data to deliver results.
"To synthesise personalised speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesise the final waveform with the corresponding neural codec decoder," Microsoft wrote in the VALL-E paper, currently up on the preprint server arXiv.
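To make the three stages in that quote easier to picture, here is a rough toy sketch in Python. It is not VALL-E or EnCodec code: every function name, the codebook size and the arithmetic inside each step are made-up placeholders, and simple maths stands in for the neural networks. It only shows the flow of data: an enrolled recording becomes acoustic tokens, new tokens are generated conditioned on those tokens and a phoneme prompt, and a decoder turns the result back into a waveform.

```python
import random
from typing import List

# Everything here is a hypothetical stand-in, not the real VALL-E or EnCodec API.
CODEBOOK_SIZE = 1024  # assumed number of distinct acoustic tokens


def encode_enrolled_audio(samples: List[float], frame: int = 4) -> List[int]:
    """Toy 'codec encoder': turn each frame of samples into one discrete token."""
    tokens = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        mean = sum(chunk) / len(chunk)
        clipped = max(-1.0, min(1.0, mean))
        # Quantise the frame's mean amplitude from [-1, 1] to a token id.
        tokens.append(int((clipped + 1.0) / 2.0 * (CODEBOOK_SIZE - 1)))
    return tokens


def generate_acoustic_tokens(phonemes: List[str], speaker_tokens: List[int],
                             tokens_per_phoneme: int = 3, seed: int = 0) -> List[int]:
    """Toy 'language model': new tokens depend on both the speaker and the phonemes."""
    rng = random.Random(seed)
    generated = []
    for phoneme in phonemes:
        offset = sum(ord(ch) for ch in phoneme)  # content information
        for _ in range(tokens_per_phoneme):
            base = rng.choice(speaker_tokens)    # speaker information
            generated.append((base + offset) % CODEBOOK_SIZE)
    return generated


def decode_tokens(tokens: List[int]) -> List[float]:
    """Toy 'codec decoder': map token ids back to samples in [-1, 1]."""
    return [t / (CODEBOOK_SIZE - 1) * 2.0 - 1.0 for t in tokens]


# Usage: a fake enrolled recording and a phoneme prompt for the word "hello".
enrolled = [0.0, 0.5, -0.5, 0.25] * 6
speaker_tokens = encode_enrolled_audio(enrolled)
generated = generate_acoustic_tokens(["HH", "AH", "L", "OW"], speaker_tokens)
waveform = decode_tokens(generated)
print(len(speaker_tokens), len(generated), len(waveform))
```

In the real system, each of these steps is a trained neural network; the sketch keeps only the order of operations and the idea that the speaker prompt and the text prompt are separate inputs.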
VALL-E's speech synthesis capabilities were trained on an audio library called LibriLight, which was assembled by Meta. The library contains 60,000 hours of English speech from over 7,000 speakers, mostly sourced from LibriVox public domain audiobooks.
Besides being able to copy a speaker's voice, VALL-E is also able to imitate the "acoustic environment" of the sample audio. For instance, if the sample sounds like a telephone call, the audio VALL-E generates will sound like a telephone call too.
We're not the only ones kind of freaked out by how this tool may be misused in virtually any situation. "Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models," Microsoft wrote in the VALL-E paper.
Microsoft has also published a demo page for VALL-E.
References
Edwards, B. (2023, January 9). Microsoft's new AI can simulate anyone's voice with 3 seconds of audio. Ars Technica. https://arstechnica.com/information-technology/2023/01/microsofts-new-ai-can-simulate-anyones-voice-with-3-seconds-of-audio/