Microsoft's New AI Simulates Anyone's Voice With 3 Seconds Of Sample Audio

On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that simulates a person's voice after being provided a sample

Bharat Sharma Updated on Jan 10, 2023, 17:26 IST

Artificial intelligence tools made a splash last year. In 2023, these tools are expected to become even smarter and more accurate. Microsoft's new AI is a testament to the untapped potential of machine learning.

The model, called VALL-E, can simulate anyone's voice from just three seconds of sample audio.

How does VALL-E work?

Once VALL-E learns anyone's voice, it can synthesise audio of that person saying virtually anything. Yikes! While we can't help but think of ways in which such tech could be exploited in a dog-eat-dog world, it's still a fascinating step forward in building a smarter world.

VALL-E's creators say that it could be used to create high quality text-to-speech applications. It could also enable speech editing, allowing users to change what was originally said.

Microsoft says that VALL-E is a "neural codec language model" built on a technology called EnCodec that was announced by Meta in October 2022. Most text-to-speech methods synthesise speech by manipulating waveforms, but VALL-E instead generates discrete audio codec codes from text and acoustic prompts, Ars Technica reported.

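The first step for VALL-E is to analyse how a person sounds and break that information into discrete components called "tokens". It then draws on its training data to predict how that voice would sound speaking phrases that are not in the three-second sample.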

"To synthesise personalised speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesise the final waveform with the corresponding neural codec decoder," Microsoft wrote in the VALL-E paper, currently up on the preprint server arXiv.
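The flow described in that quote can be pictured with a toy sketch. It is not Microsoft's code and contains no neural network: the functions `encode`, `generate` and `decode`, and the constants `CODEBOOK_SIZE` and `FRAME_SIZE`, are hypothetical stand-ins chosen only to show the order of the steps. A short enrolled recording becomes acoustic tokens, new tokens are generated conditioned on those tokens plus a phoneme prompt, and a decoder turns the new tokens back into a waveform.

```python
from typing import List

CODEBOOK_SIZE = 1024  # hypothetical number of discrete codes (assumption)
FRAME_SIZE = 4        # hypothetical samples per token, tiny for illustration


def encode(samples: List[float]) -> List[int]:
    """Stand-in for a codec encoder: one discrete token per frame of audio."""
    tokens = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE]
        mean = sum(frame) / FRAME_SIZE
        # Map the mean amplitude in [-1, 1] onto a code index.
        index = int((mean + 1.0) / 2.0 * (CODEBOOK_SIZE - 1))
        tokens.append(min(max(index, 0), CODEBOOK_SIZE - 1))
    return tokens


def generate(phonemes: List[str], prompt_tokens: List[int],
             tokens_per_phoneme: int = 2) -> List[int]:
    """Stand-in for the language model: each new token depends on the
    previous token (speaker information) and the phoneme (content)."""
    context = list(prompt_tokens)
    generated = []
    for phoneme in phonemes:
        content = sum(ord(ch) for ch in phoneme)  # toy content signal
        for _ in range(tokens_per_phoneme):
            speaker = context[-1] if context else 0
            token = (speaker + content) % CODEBOOK_SIZE
            generated.append(token)
            context.append(token)
    return generated


def decode(tokens: List[int]) -> List[float]:
    """Stand-in for a codec decoder: expand each token into a frame."""
    samples = []
    for token in tokens:
        value = token / (CODEBOOK_SIZE - 1) * 2.0 - 1.0
        samples.extend([value] * FRAME_SIZE)
    return samples


# Toy "enrolled recording" (a real one would be about 3 seconds of audio).
enrolled = [0.1, 0.2, 0.1, 0.0, -0.3, -0.2, -0.1, -0.2]
prompt_tokens = encode(enrolled)
phonemes = ["HH", "AH", "L", "OW"]  # assumed phoneme prompt for "hello"
new_tokens = generate(phonemes, prompt_tokens)
waveform = decode(new_tokens)
```

In this toy version, changing the enrolled recording changes every generated token, while changing the phonemes changes what is "said". That loosely mirrors the paper's statement that the two inputs constrain speaker and content respectively.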

Microsoft trained VALL-E's speech synthesis capabilities on an audio library called LibriLight, which was assembled by Meta. The library contains 60,000 hours of English speech from more than 7,000 speakers, mostly sourced from LibriVox public domain audiobooks.

Besides being able to copy a speaker's voice, VALL-E is also able to imitate the "acoustic environment" of the sample audio. For instance, if the sample sounds like a telephone call, the audio VALL-E generates will sound like a telephone call too.

We're not the only ones kind of freaked out by how this tool could be misused. "Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models," Microsoft wrote in the VALL-E paper.

Microsoft has also published a demo page for VALL-E with sample audio.

References

Edwards, B. (2023, January 9). Microsoft’s new AI can simulate anyone’s voice with 3 seconds of audio. Ars Technica. https://arstechnica.com/information-technology/2023/01/microsofts-new-ai-can-simulate-anyones-voice-with-3-seconds-of-audio/
