Nvidia’s voice AI sounds just like a human voice

When an AI almost, but not quite, mimics human behavior, the unsettling result is described as the “uncanny valley.” Nvidia’s new voice AI, however, is far more realistic, and levels above anything we’ve seen before. By merging AI synthesis with a human reference recording, it produces a fake voice that sounds almost identical to a real person’s.

The company’s in-house creative team describes the process of achieving accurate voice synthesis. The team compares speech to music: both have complex rhythms, pitches, and timbres that aren’t easy to replicate. Nvidia is creating tools capable of reproducing these intricacies with AI.

A voice just like yours

The company showcased its latest advancements at Interspeech, a technical conference that focuses on the research of speech processing technologies. Nvidia’s voice tools can be used via the open-source NeMo toolkit, and they work well with Nvidia’s GPUs.

The AI voice isn’t just a demo: the company has already switched to an AI narrator for its I Am AI video series, which showcases advancements in machine learning across various industries. The new narration is free of the audio artifacts that typically mark synthesized voices.

Nvidia tackles AI voices using two methods. The first is to train a text-to-speech model on recordings of a human speaker; once training is complete, the model can take any text input and convert it to speech. The second is voice conversion, where the program takes an audio file recorded by a human and converts it to a synthetic voice while preserving the original intonation.
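The text-to-speech path described above is typically a two-stage pipeline: a model first turns text into an intermediate acoustic representation (a spectrogram), and a vocoder then renders that as a waveform. The sketch below is purely conceptual, not Nvidia’s actual code; both stage functions are hypothetical stubs standing in for trained neural models (in NeMo these would be a spectrogram generator and a vocoder), so only the data flow is illustrated.

```python
# Conceptual sketch of a two-stage text-to-speech pipeline.
# In a real system (e.g. Nvidia's NeMo toolkit), each stage is a trained
# neural network; here they are hypothetical stubs to show the data flow.

def text_to_spectrogram(text):
    # Stage 1: a trained model would map text to a mel spectrogram
    # (a sequence of frames). Stub: one fake "frame" per character.
    return [[float(ord(c))] for c in text]

def spectrogram_to_audio(spectrogram):
    # Stage 2: a vocoder would render the spectrogram as a waveform.
    # Stub: flatten the frames into a list of "samples".
    return [sample for frame in spectrogram for sample in frame]

def synthesize(text):
    # Full pipeline: text -> spectrogram -> audio.
    return spectrogram_to_audio(text_to_spectrogram(text))

audio = synthesize("Hi")
print(len(audio))  # one stub "sample" per input character
```

Splitting synthesis into these two stages is what lets the same vocoder be reused across voices, and it is also where voice conversion plugs in: the intermediate representation can be extracted from a human recording instead of from text.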

Nvidia takes AI voices to another level

Nvidia notes that the technology can be used for customer service, and could replace the synthetic voices currently used in smart speakers. The company says the new tech can go even further: “Text-to-speech can be used in gaming, to aid individuals with vocal disabilities or to help users translate between languages in their own voice,” reads the company’s blog post.

Previously, Nvidia showcased an AI model capable of converting a single 2D image of a person into a “talking head” video. Called Vid2Vid Cameo, the deep learning model aims to improve the video conferencing experience.

Disclaimer: The above article has been aggregated by a computer program and summarised by a Steamdaily specialist. You can read the original article at Nvidia.