Integrate Voice Assistants into Portable Speakers and Smart Headsets

By Majeed Ahmad

Contributed By DigiKey's North American Editors

2019-09-26

Virtual assistants like Amazon’s Alexa, Apple’s Siri, Microsoft’s Cortana, and Google Assistant are driving the creation of smart, voice-enabled devices ranging from Bluetooth headsets paired with smartphones and other mobile devices to smart speakers for home and office automation environments, as well as consumer electronics such as TVs. While voice-enabled services are increasingly being used to control functions such as listening to music, making calls, and running biometric sensors, designers find it a challenge to identify, capture and wirelessly transmit voice in environments that are often both acoustically and electrically noisy.

What’s required are robust noise cancelation techniques and an equally robust wireless interface, all in a packaged solution that developers can experiment with and apply quickly to save both time and cost.

This article introduces several voice capture solutions from Cirrus Logic, XMOS, and Qualcomm that help designers get a fast start on next generation voice-enabled mobile devices and headsets.

Voice capture solution

While companies like Apple and Microsoft started implementing their solutions with smartphones and computers, Amazon launched Alexa with the Echo smart speaker and then began expanding its use into more devices.

However, the Echo has seven microphones, too many for a small handheld device where space, cost, and power are at a premium. Chipmakers like Cirrus Logic are therefore offering simpler design solutions that let designers bring Alexa to a variety of smart devices and other audio system form factors.

Take, for instance, smart home applications employing Alexa Voice Service (AVS) in voice-controlled lighting and appliances, hands-free portable speakers, and networked speakers. Here, voice capture solutions are required to enhance the user experience by suppressing noise and other real-world interference for more accurate and reliable voice interactions.

The implementation of a voice assistant mandates high-accuracy wake-word triggering and command interpretation in noisy environments and during music playback. Echo cancelation is also critical to achieving a superior user experience; it allows the user to interrupt loud music playback and Alexa responses so that new requests can be responded to accurately.
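To make the echo cancelation idea concrete, the following is a minimal sketch of a normalized least-mean-squares (NLMS) adaptive filter, a textbook approach to acoustic echo cancelation. It is not the algorithm used in any of the kits discussed here, and the echo path, tap count, and step size are assumptions chosen only for illustration. The filter learns how the playback signal leaks into the microphone and subtracts that estimate, leaving the residual in which a wake word would be detected.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: step size is normalized by the reference signal energy
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


def power(signal):
    return sum(s * s for s in signal) / len(signal)


random.seed(0)
playback = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Hypothetical echo path from loudspeaker to microphone (assumption)
room = [0.0, 0.6, 0.3, -0.2]
echo = [
    sum(room[k] * playback[n - k] for k in range(len(room)) if n - k >= 0)
    for n in range(len(playback))
]

residual, weights = nlms_echo_cancel(playback, echo)
print("echo power:", power(echo[-500:]))
print("residual power after adaptation:", power(residual[-500:]))
```

In this idealized case the microphone hears only the echo, so the residual power falls toward zero once the filter has adapted. In a real device the residual would also contain the user's speech, which is what makes interrupting loud playback possible.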

A good place to start experimenting with AVS designs is Cirrus Logic’s 598-2471-KIT voice capture development kit for AVS. The kit is aimed at integrating Alexa capability into compact audio devices using acoustically tuned audio processing hardware and software components (Figure 1). It is based on a Raspberry Pi 3 platform and includes a reference board that features Cirrus Logic’s CS47L24-CWZR smart codec, digital MEMS microphones, and SoundClear® algorithms for voice control, noise suppression, and echo cancelation.

Figure 1: Cirrus Logic’s 598-2471-KIT voice capture development kit for AVS-enabled devices allows a voice capture board (upper right) to be attached to a Raspberry Pi 3 (upper left) either via cable or placed as a HAT on top of the Raspberry Pi 3. (Image source: Cirrus Logic)

Voice capture building blocks

The voice capture process starts with the CS47L24 voice processor which combines a dual-core 300 MMAC DSP with an audio hub codec to serve a variety of power efficient, fixed function audio processing blocks (Figure 2). The programmable DSP cores support a range of advanced audio processing features such as multi-mic noise suppression, acoustic echo cancelation (AEC), and voice recognition.

Figure 2: Voice capture on the kit starts with the CS47L24 voice processor which combines a dual-core 300 MMAC DSP with an audio hub codec to serve a variety of power efficient, fixed function audio processing blocks. (Image source: Cirrus Logic)

The CS47L24 smart codec uses an on-chip digital-to-analog converter (DAC) with a 2 watt mono speaker driver to enable high fidelity audio playback. It supports automatic sample rate detection which helps with wideband and narrowband voice-call handover. There are three digital audio interfaces provided on the CS47L24 processor, each supporting a range of standard audio sample rates and serial interface formats.

The CS47L24 is powered from 1.8 volt and 1.2 volt external supplies; its power, clocking, and output driver architectures are all designed for low power in voice, music, and standby modes. The CS47L24 also provides a separate MICVDD input for microphone operation above 1.8 volts.

The digital MEMS microphone IC and the associated SoundClear algorithms for voice control, noise suppression, and echo cancelation provide high-quality audio on the input while lowering microphone power consumption. The IC supports two operational modes: Low Power Mode, which is suitable for always-on voice activity detection, and High Performance Mode, which is optimized for high fidelity recording. The mode is determined by the applied clock frequency.

The microphone incorporates an analog-to-digital converter (ADC) that outputs a single-bit data stream using pulse density modulation (PDM) encoding, which makes it efficient to connect multiple microphones in stereo and array configurations. For designers, it’s important to look for multi-microphone ICs, as these can be optimized to provide aggressive noise reduction and echo cancelation using beamforming techniques to achieve the clearest full-duplex communication and audio capture.
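For readers unfamiliar with PDM, the sketch below shows the principle with a first-order sigma-delta modulator and a crude averaging decoder. This is an illustrative model only: real digital microphones use higher-order modulators, and real codecs use proper decimation filters rather than the block average assumed here. The test tone, its amplitude, and the decimation factor of 64 are arbitrary choices for the example.

```python
import math


def pdm_encode(samples):
    """First-order sigma-delta modulator: samples in [-1, 1] -> list of 0/1 bits."""
    bits = []
    integrator = 0.0
    feedback = 0.0
    for s in samples:
        # Accumulate the difference between the input and the last output level
        integrator += s - feedback
        bit = 1 if integrator >= 0.0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits


def pdm_decode(bits, decimation=64):
    """Crude decoder: the density of ones in each block approximates the amplitude."""
    pcm = []
    for start in range(0, len(bits) - decimation + 1, decimation):
        block = bits[start:start + decimation]
        density = sum(block) / decimation
        pcm.append(2.0 * density - 1.0)  # map density 0..1 back to -1..+1
    return pcm


# Oversampled test tone (assumption: 4096 samples, one cycle per 1024 samples)
tone = [0.5 * math.sin(2.0 * math.pi * n / 1024.0) for n in range(4096)]
bits = pdm_encode(tone)
pcm = pdm_decode(bits, decimation=64)

print("first PDM bits:", bits[:16])
print("first decoded samples:", [round(v, 3) for v in pcm[:4]])
```

The single data line carries only ones and zeros, and the proportion of ones tracks the instantaneous sound pressure, which is why the interface needs so few pins per microphone.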

The MEMS microphone should also offer a wide dynamic range between the noise floor and the acoustic overload point; 100 decibels (dB) is a good starting point. This enables high fidelity audio recording in both quiet and loud environments. For example, it allows low level audio content such as classical music or voice to be recorded without background hiss. At the same time, it ensures that very loud sounds such as rock concerts and wind noise don’t cause distortion in the microphone.
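As a quick worked example of the dynamic range figure, the arithmetic below assumes the common datasheet convention that microphone signal-to-noise ratio (SNR) is specified relative to a 94 dB SPL reference tone. The 65 dB SNR and 130 dB SPL acoustic overload point are hypothetical values, not taken from any part mentioned in this article.

```python
REFERENCE_SPL_DB = 94.0  # 1 Pa reference level commonly used for microphone SNR


def noise_floor_db_spl(snr_db):
    """Equivalent input noise: the reference level minus the specified SNR."""
    return REFERENCE_SPL_DB - snr_db


def dynamic_range_db(snr_db, aop_db_spl):
    """Span between the noise floor and the acoustic overload point (AOP)."""
    return aop_db_spl - noise_floor_db_spl(snr_db)


# Hypothetical microphone specification (assumption)
snr_db = 65.0
aop_db_spl = 130.0

print(noise_floor_db_spl(snr_db))            # 29.0
print(dynamic_range_db(snr_db, aop_db_spl))  # 101.0
```

A part with these hypothetical numbers would just clear the 100 dB starting point.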

To get the most out of the hardware, SoundClear algorithms eliminate noise through processing features such as noise suppression, automatic speech recognition (ASR) Enhance™, and echo cancelation.

Far-field voice capture

Another voice capture solution is XMOS’s XK-VF3500-L33-AVS VocalFusion™ stereo development kit for Amazon AVS. This focuses on far-field use cases such as smart TVs, soundbars, set-top boxes, and digital media adapters. These applications mandate stereo AEC support for "across the room" voice interface solutions and allow users to switch on the TV and adjust table lamps via voice commands.

The far-field voice capture applications demand that AEC reference signals are accurately calibrated, and that the latency is carefully adjusted. By doing so, designers can be sure that the far-field voice accessories they design can hear and accurately capture the user’s voice commands regardless of the volume of content and surrounding environment.
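One generic way to think about the latency adjustment is as a delay estimation problem: find the lag at which the playback reference best matches what the microphone hears, then shift the reference by that amount before it reaches the echo canceler. The sketch below does this with a brute-force cross-correlation. It is an assumption-laden illustration, not the calibration procedure of any specific kit; the 37-sample delay, the 0.5 attenuation, and the function names are all hypothetical.

```python
import random


def estimate_delay(reference, mic, max_lag):
    """Return the lag (in samples) at which mic best correlates with reference."""
    span = len(reference) - max_lag
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(reference[n] * mic[n + lag] for n in range(span))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_reference(reference, delay):
    """Delay the reference so it lines up with the echo seen at the microphone."""
    return [0.0] * delay + reference[:len(reference) - delay]


random.seed(1)
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]

# Hypothetical playback path: 37 samples of latency and 0.5x attenuation
true_delay = 37
mic = [0.0] * true_delay + [0.5 * s for s in reference[:-true_delay]]

delay = estimate_delay(reference, mic, max_lag=100)
aligned = align_reference(reference, delay)
print("estimated delay in samples:", delay)
```

With the reference aligned in this way, an adaptive echo canceler needs far fewer filter taps to cover the remaining acoustic path.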

The VocalFusion kit is a linear microphone array solution that has been qualified by Amazon for far-field performance. It lets designers put Alexa into edge-of-room devices like smart TVs, lighting, and home appliances. The kit is built around the XVF3500-FB167-C voice processor that delivers two-channel, full-duplex AEC to support voice capture in complex acoustic environments (Figure 3). The DSP-enabled AEC capability facilitates dereverberation, automatic gain control, and noise suppression to ensure clear voice interaction even in noisy environments.

Figure 3: The XVF3500 voice processor employs adaptive beamforming to locate the desired speech source and effectively isolate voice commands from the stereo audio while suppressing background noise and room echoes. (Image source: XMOS)
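The beamforming idea can be illustrated with a fixed delay-and-sum beamformer, which is much simpler than the adaptive beamforming performed by the XVF3500 and is shown here only as a stand-in for the concept. The sketch assumes a four-microphone linear array, a hypothetical 33 mm spacing, a 16 kHz sample rate, and whole-sample delays. Signals arriving from the steered direction add coherently, while those from other directions partially cancel.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # meters per second, approximate value in room-temperature air


def steering_delays(num_mics, spacing_m, angle_deg, sample_rate):
    """Whole-sample arrival delays per mic for a plane wave at angle_deg off broadside."""
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND * sample_rate
    delays = [round(i * step) for i in range(num_mics)]
    offset = min(delays)
    return [d - offset for d in delays]


def delay_and_sum(channels, delays):
    """Advance each channel by its arrival delay, then average the channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    count = len(channels)
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / count
        for n in range(length)
    ]


def power(signal):
    return sum(s * s for s in signal) / len(signal)


random.seed(2)
source = [random.uniform(-1.0, 1.0) for _ in range(1000)]

# Simulate a talker at +40 degrees (hypothetical geometry)
arrival = steering_delays(4, 0.033, 40.0, 16000)
channels = [
    [source[n - d] if n >= d else 0.0 for n in range(len(source))]
    for d in arrival
]

on_target = delay_and_sum(channels, arrival)
off_target = delay_and_sum(channels, steering_delays(4, 0.033, -40.0, 16000))
print("steered at talker:", power(on_target))
print("steered away:", power(off_target))
```

An adaptive beamformer goes a step further by updating the steering continuously, so that it follows the talker instead of relying on a fixed look direction.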

The four-microphone VocalFusion kit uses Infineon’s XENSIV™ IM69D130V01XTSA1 MEMS microphones, which provide the raw audio data for the audio signal processing algorithms running on the XVF3500 voice processor. The IM69D130 microphones are designed for far-field and whispered voice pickup, with total harmonic distortion (THD) of less than 1% at sound pressure levels (SPLs) up to 128 dB.

The “barge in” capability provided by the voice capture design allows users to interrupt or pause a device that's playing music, opening new opportunities for Alexa-based designs in stereo home entertainment and wall-mounted AV equipment (Figure 4).

Figure 4: A voice capture processor and microphone work together to create a voice interface for far-field Alexa applications. (Image source: Infineon Technologies)

An example of a real-world implementation is Skyworth’s artificial intelligence (AI)-enabled smart TV that is based on the XVF3500 voice processor. The always-on smart TV wakes up and responds to voice commands with 180° all-dimensional sound-source identification from up to 5 meters (m).

Smart headset design

At the other end of the design spectrum are earbuds and headsets. Paired with smartphones and tablets, these increasingly require voice assistant integration for calendar management, smart home control, streaming music, and weather updates. Like smart speakers, Bluetooth headsets need continuous improvement to transmit quality audio in noisy environments.

The smart headset reference design and development kits for AVS and Google Assistant platforms from Qualcomm are major building blocks that enable developers to get started on voice-activated headsets and hearable designs. Reference boards help developers evaluate the voice assistants, while design kits allow design engineers to move to the full development environment.

Take Qualcomm’s DK-QCC5124-GAHS-A-0 smart headset development kit for the Google Assistant. This supports pushbutton activation for Google’s voice assistant on Android phones that have the Google Assistant app installed. It’s built around a Bluetooth audio chipset from Qualcomm that uses the Qualcomm Clear Voice Capture (cVc™) noise reduction technology to enhance a caller’s voice by reducing the ambient sounds via noise suppression and other audio enhancements.

The cVc 6.0 technology provides packet loss and bit error concealment through a set of noise reduction algorithms for clear phone conversations. Another notable technology is Qualcomm aptX™ HD that facilitates low latencies for robust audio streaming. It’s a high-definition Bluetooth audio codec that has been engineered to improve the signal-to-noise ratio and lower background noise.

Qualcomm’s DK-QCC5124-AVSHS-A-0 smart headset reference design for Amazon AVS also supports both cVc 6.0 noise reduction and aptX HD wireless audio technologies. It supports pushbutton activation for Alexa on mobile phones with the Alexa app installed.

The platform, built around Qualcomm’s QCC5124 Bluetooth transceiver chipset, also supports the Alexa Mobile Accessory (AMA) kit, which allows users to conveniently connect Bluetooth accessories to the Alexa Mobile app on Android and iOS devices (Figure 5). The AMA kit facilitates the communication of voice commands from the headset to Alexa via the phone, while Amazon AVS does the heavy lifting for natural language processing.

Figure 5: The DK-QCC5124-AVSHS-A-0 development board for Amazon AVS has the key building blocks of a smart headset design. (Image source: Qualcomm)

That means two things: first, developers don’t need to oversee the bulk of coding for their Alexa integration; and second, developers don’t have to add any communication hardware beyond Bluetooth connectivity.

At a higher level, the AMA kit enables Amazon AVS to facilitate communication between voice accessories like a smart headset and the Alexa service through a control mechanism operating between the voice accessory and the Alexa Mobile app.

Developers can use an open board development kit after the evaluation. However, programming the open board development kit requires a Transaction Bridge (DK-TRBI200-CE684-1) which is not included in the kit, but can be purchased separately.

Conclusion

For designers looking to integrate voice assistants into their next design, silicon suppliers have already done much of the heavy lifting in terms of wake-word recognition, noise cancelation, and low-power, always-on capability. Using their reference designs and development kits, designers can develop voice capture solutions for a range of intelligent voice control services from smart headsets and smart speakers to full home voice control.
