Nvidia projects are helping AI find its human-like voice

Speech recognition may seem like a relic of another era, useful in customer service, telemarketing and transcription software, but always somewhat lacking in quality. It’s easy to forget that the technology not only opened the door to the artificial intelligence era, but also remains a key interface to AI and keeps improving in quality.

“It could be argued that the modern artificial intelligence revolution started with speech recognition, along with image classification,” said Bryan Catanzaro, vice president of applied deep learning research at Nvidia. “Around the industry there is a lot of work going on at the intersection of AI and speech technologies.”

He added that speech technologies “are driving demand for a lot of artificial intelligence, and are the interface to a lot of AI.” And that is why Nvidia is heavily invested in the continued improvement of speech technologies, including efforts to make speech recognition more accurate and AI speech closer to a real human voice.

Nvidia is highlighting its speech technology work this week at the Interspeech 2021 conference, where it is presenting multiple papers and providing updates on several projects aimed at advancing the field, particularly the automatic speech recognition and text-to-speech capabilities that are important to AI functions.

“We’re interested in having speech technologies become much more widespread and continue to improve in quality,” Catanzaro said.

Nvidia’s speech technology projects include NeMo, an open-source toolkit for GPU-accelerated conversational AI, aimed at developers who are starting to work with speech models and want to include them in their own applications.

Another project on parade is RAD-TTS, a speech synthesis model to help normalize AI speech by teaching it to mimic the emotion, tone and rhythms of a recorded human voice. Catanzaro said Nvidia used this tool itself when it developed an AI voice to be used in marketing presentations.

Among other Nvidia developments being presented at the event are another speech synthesis model called TalkNet 2, and SPGISpeech, a dataset that includes 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition.

Transcription services are just one application for speech technology. New ones, such as interactive chatbots, virtual retail shopping assistants and live captioning for video conferencing, continue to evolve. These apps have something in common besides their core technology: all of them have become increasingly valuable during a pandemic that has limited people's ability to conduct business and daily routines the way they once did.

Changing usage practices are bound to drive development of many more new speech technology applications, and Catanzaro said Nvidia is just trying to help enable those new developments.

“These projects come with code and data and pre-trained models, so they are not just ideas, but tangible things that people can use,” he said.
