Voice user interfaces: Getting more popular, but still technically challenging

When it comes to voice user interfaces, companies like Google, Apple and Amazon dominate with their smart speakers and home assistants. As they and other companies look to expand voice into other areas, Rob Oshana, VP of Software Engineering R&D for NXP, says that it can still be technically challenging to deploy voice user interface products with expected consumer quality levels.

FierceElectronics sat down (virtually speaking) with Oshana to discuss where voice user interface technology is today and some of the bugaboos that challenge design engineers. 

FE: What are some of the other applications for voice that product developers are looking at today, and what types of user interactions, currently handled in other ways, do you think it could replace?

Oshana: From a business perspective, there is a false perception today about low cost, because Google and Amazon are smashing prices with their home assistants, so people think the cost should be lower than it is. But our customers are looking for products that have not yet been built. As such, they are developing something different, including higher performance, better playback quality, and additional features.

Technically, some providers want to develop a fully local voice user interface (entirely local edge processing) to maintain service without internet access or to avoid a subscription.

FE: What applications is voice not good for?

Oshana: Voice authentication is still a challenge, especially for the text-independent approach.

FE: With a GUI, inputs are unambiguous. Since voice input is not, what is needed to reduce inaccuracies, and what is the load on the system?

Oshana: A graphical user interface addresses user feedback well, but for user input, voice is still the most convenient, comfortable, and fastest option. To improve performance, one approach is multi-modal input, which combines sensors working in orthogonal domains. For example, by combining audio and video, the video channel can recover performance in instances where the audio is degraded by noise.
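
Oshana did not spell out a particular fusion method, but the idea can be illustrated with a minimal late-fusion sketch in Python. Everything here is hypothetical: the function names, the SNR-based weighting, and the thresholds are assumptions used purely to show how a video (for example, lip-movement) detector could carry the decision when the audio channel is noisy.

```python
import numpy as np

def fuse_modalities(audio_conf, video_conf, audio_snr_db, snr_floor_db=5.0):
    """Late fusion of two detector confidences (hypothetical weighting scheme).

    When the measured audio SNR drops toward the floor, the audio confidence
    is down-weighted so the video channel can carry the decision.
    """
    # Weight the audio score by how far the SNR sits above the floor (0..1).
    audio_weight = np.clip((audio_snr_db - snr_floor_db) / 20.0, 0.0, 1.0)
    video_weight = 1.0 - audio_weight
    return audio_weight * audio_conf + video_weight * video_conf

# Noisy room: audio detector is unsure (0.4), lip-movement model is confident (0.9).
print(fuse_modalities(audio_conf=0.4, video_conf=0.9, audio_snr_db=3.0))  # 0.9, video carries the decision
```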

FE: How ‘hard’ is it to add a voice user interface feature to an embedded system today? What is the most challenging aspect?

Oshana: While it may not seem hard, from a practical perspective adding a voice user interface is challenging for multiple reasons, listed in priority order below:

  1. Multiple domains are involved: acoustics, mechanics, and signal processing. This increases the complexity of your product at all design stages, from concept, casing design, production, and validation to aging.
  2. Because there is no “one size fits all” solution, each component needs to be carefully chosen. For example, choosing a microphone means weighing analog versus digital, SNR, AOP [acoustic overload point], and so on. Later on, changing one component can break the entire chain. For example, a low-quality speaker can break the AEC [acoustic echo canceller], which will impact voice performance.
  3. A voice interface includes multiple blocks, often coming from different third parties, that still need to work well together (a simplified sketch of such a chain follows this list). Full-system validation is usually done at the latest stage and may reveal severe interaction issues and delay the product launch. For example, does the audio front end remove the noise that the ASR [automatic speech recognition] engine is expecting to be removed?
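
To make the multi-block, multi-vendor point concrete, here is a hedged sketch in Python of how such a chain hangs together. The block implementations and function names are placeholders, not any vendor's actual engines; each hand-off between blocks is a place where the late full-system validation Oshana describes can go wrong.

```python
import numpy as np

# Hypothetical block interfaces for a voice chain; in a real product the AEC,
# noise suppressor, wake-word engine, and ASR may each come from a different vendor.

def acoustic_echo_cancel(mic, playback_ref):
    # Placeholder AEC: subtract a scaled copy of the playback reference.
    # A real AEC adapts a filter instead of using a fixed scale factor.
    return mic - 0.5 * playback_ref

def noise_suppress(frame):
    # Placeholder spectral gate: zero out low-level samples.
    return np.where(np.abs(frame) > 0.01, frame, 0.0)

def wake_word_score(frame):
    # Placeholder detector: an energy threshold standing in for a trained model.
    return float(np.mean(frame ** 2) > 1e-4)

def run_chain(mic, playback_ref):
    """Run one audio frame through the chain and report whether the ASR should
    be triggered. Checking each hand-off like this, block by block, is exactly
    the validation work that tends to get deferred to the end of the project."""
    cleaned = acoustic_echo_cancel(mic, playback_ref)
    denoised = noise_suppress(cleaned)
    return wake_word_score(denoised)

rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16000)   # 1 s of microphone audio at 16 kHz
echo = 0.5 * rng.standard_normal(16000)     # music playing on the device's own speaker
print(run_chain(speech + echo, echo))       # 1.0: the wake-word block would trigger the ASR
```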

FE: What’s the difference (from a systems perspective) between cloud-based and on-device voice activation?

Oshana: In the cloud, you can easily deploy multiple models that improve performance and reduce false positives by confirming whether or not a detection coming from the local solution is valid. These cloud models can also easily benefit from retraining or a bigger database.

The downside of the cloud is that you need to send your voice data (unless it is a complex system), and this could be a security issue. Latency is higher, and an internet connection is of course a hard requirement. A cloud solution may not directly get the microphone signals; it may receive only a single clean channel, which assumes there is good interaction between the local audio front end and the cloud ASR.

A local solution is easier to deploy, and its performance is easier to control; all of the system performance is in your hands. Your voice data is processed locally, so there is no security issue, and latency is optimized. Your system can also benefit from other local information, such as a microphone array in the room. The drawback is that all decisions must be made locally, with limited resources, and updating models can be a challenge.
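
The trade-off Oshana describes can be sketched as a two-stage activation flow. This is an illustration only: it assumes a hypothetical small local model that gates a larger cloud model, and the thresholds and function names are made up for the example.

```python
import numpy as np

LOCAL_THRESHOLD = 0.6

def local_wake_score(frame):
    # Stand-in for a small on-device wake-word model (e.g., running on a Cortex-M).
    return float(np.mean(np.abs(frame)))

def cloud_verify(frame):
    # Stand-in for the larger, regularly retrained cloud model that confirms
    # or rejects the local detection to cut false positives.
    return np.mean(frame ** 2) > 0.01

def on_audio_frame(frame):
    score = local_wake_score(frame)
    if score < LOCAL_THRESHOLD:
        return "idle"                       # nothing leaves the device
    if cloud_verify(frame):                 # audio is uploaded only at this point
        return "wake word confirmed"
    return "local false positive rejected by cloud"

rng = np.random.default_rng(1)
print(on_audio_frame(0.05 * rng.standard_normal(16000)))  # stays idle, no upload
print(on_audio_frame(1.0 * rng.standard_normal(16000)))   # escalates to cloud and is confirmed
```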

FE: What are the most important considerations for embedded developers?

Oshana: The critical part of voice user interface design is understanding the full chain and assessing the performance of each block to ensure the entire chain works effectively. For example, in the case of a non-detection, you have to understand whether it is because the audio front end did not clean the signal enough or because the ASR engine does not work well in that noise condition.
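
One way to read this advice is to instrument each block separately, so a missed detection can be attributed either to the audio front end (AFE) or to the ASR. The sketch below assumes a clean reference recording is available and uses an arbitrary SNR target; both are illustrative assumptions, not a prescribed test method.

```python
import numpy as np

def snr_db(clean, processed):
    """Crude SNR estimate of the AFE output against a clean reference."""
    noise = processed - clean
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def diagnose(clean_ref, afe_output, asr_detected, snr_target_db=15.0):
    # Attribute a miss to whichever block failed to meet its own target.
    afe_ok = snr_db(clean_ref, afe_output) >= snr_target_db
    if asr_detected:
        return "detected"
    if not afe_ok:
        return "missed: AFE left too much noise for the ASR"
    return "missed: ASR failed even though the AFE met its SNR target"

rng = np.random.default_rng(2)
clean = np.sin(np.linspace(0, 200 * np.pi, 16000))          # clean reference utterance
noisy_afe_out = clean + 0.5 * rng.standard_normal(16000)    # AFE output still noisy (~3 dB SNR)
print(diagnose(clean, noisy_afe_out, asr_detected=False))   # blames the AFE, not the ASR
```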

FE: How much of this capability is done in software versus hardware?

Oshana: Most of the solution is software based. We still use hardware for pre-processing, such as a PDM decimator to down-sample the digital microphone signal.
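
For readers unfamiliar with the term, the operation a PDM decimator performs (normally in hardware, as Oshana notes) can be approximated in a few lines of Python. The moving-average filter below is a deliberately crude stand-in; real decimators use CIC and FIR stages, and the rates chosen here are just one common example.

```python
import numpy as np

DECIMATION = 64          # e.g., 3.072 MHz PDM bitstream -> 48 kHz PCM

def pdm_to_pcm(pdm_bits, decimation=DECIMATION):
    bipolar = 2.0 * pdm_bits - 1.0                     # map {0,1} bits to {-1,+1}
    lp = np.ones(decimation) / decimation              # crude moving-average low-pass filter
    filtered = np.convolve(bipolar, lp, mode="same")   # real designs use CIC/FIR stages
    return filtered[::decimation]                      # keep every Nth sample

# Fake 1 kHz tone encoded as a density-modulated 1-bit stream, 10 ms long.
rate = 3_072_000
t = np.arange(rate // 100) / rate
density = 0.5 + 0.4 * np.sin(2 * np.pi * 1000 * t)
pdm = (np.random.default_rng(3).random(t.size) < density).astype(float)

pcm = pdm_to_pcm(pdm)
print(pcm.shape)   # (480,) samples: 10 ms of audio at 48 kHz
```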

FE: Are today’s processors capable enough to handle the memory and CPU power requirements for voice recognition?

Oshana: “Processor” is a very broad term, but high-performance cores like the Cortex-A or HiFi4 are good enough for voice recognition implementations. The trend is to deploy voice on lower-end processors such as the Cortex-M family or a smaller DSP.

FE: What about the power draw of microphones, is that a concern?

Oshana: This is not a concern on a sound bar while the device is playing music; in that case, the microphone current is negligible compared to the rest of the system. However, customers are starting to monitor microphone current consumption in idle mode, when the device is waiting for a wake word. Especially with multiple microphones (a microphone array), the microphones' consumption could become an issue. However, digital microphone providers are specifying low-power modes for such use cases (at a cost of audio quality).

FE: What about embedded voice recognition engines like Truly HandsFree, Snips Commands, or the Alexa Connect Kit? Are those tied to specific processors?

Oshana: The goal of these solutions is to be deployable on all possible targets, so these providers have developed versions with different footprint requirements to address all possible cores, from the small Cortex-M to high-end processor cores.

FE: Finally, are there development boards or kits like the OM13090: LPC54114 Audio and Voice Recognition Kit that can help engineers get started?

Oshana: NXP’s LPC54114 was one of the first platforms to offer an embedded digital microphone interface with a low-power mode, so it is a perfect platform for implementing a low-power voice solution.