To deliver better functionality, next-generation smart audio devices need to do more than just listen. Users are coming to expect an experience where smart hubs and voice assistants understand far more than simple voice commands. Devices are expected to pair these commands with other environmental and user data for context, whether it’s background noise or the user’s favorite album on Spotify, and then to interpret and execute the command accurately, in a split second, with respect to that context.
Devices can only gather and use this data if they have contextual awareness.
What is Contextual Awareness?
When a device is context-aware, it incorporates user-specific information, such as location, preferences and data gathered from various sensors, to better understand what the user is asking for. With that situational and user preference context in hand, the device can then execute its functions in response to a given command or prompt. Without context awareness, it’s difficult for smart audio devices such as hearables and voice assistants to deliver the ideal user experience.
Always-listening devices use signal processing technology coupled with machine learning to make sense of natural sounds, everyday noises, a user’s voice and so much more. To do this, devices need to be able to detect what are referred to as “acoustic scenes” and “acoustic events”.
A scene may be a busy cafe at brunch time, a quiet library or a city bus terminal during the morning commute. Events are the specific sounds that are heard in any scene: a car honk, the clattering of dishes or a child laughing. Machine listening should also be able to classify music and voices, such as by the language someone is speaking, their gender or approximate age, the genre of music they’re playing or even the specific artist singing a track.
One study found that adding context-aware sound event recognition to home service robots improved the robots’ effectiveness and accuracy while monitoring elderly people living alone. Contextual awareness opens the door for devices to classify certain acoustic events as “alerts”, meaning that there are real, life-saving benefits that this technology can offer for the health and wellness field. If a device identifies a glass breaking sound or a smoke alarm sound, it can automatically call the relevant emergency services to respond.
Contextual Awareness in Voice Assistants
Voice assistants are one specific application that can benefit greatly from context awareness. Currently, this feature is not integrated in most voice assistants, although Amazon Alexa now uses context awareness in its new Guard feature to improve home security.
For most voice assistants, though, a lack of contextual awareness leads to limits in functionality, user experience and even safety. For example, without context awareness, it’s challenging for a voice assistant to recognize sounds, such as alarms, and react accordingly.
On the flip side, a feature like Alexa Guard works because it gathers context from multiple sensors and sources to figure out when users are home or away, and when it needs to send out a Smart Alert (such as when it picks up the sound of glass breaking or a smoke alarm going off). When a user says, “I’m leaving,” Alexa uses this context to activate the alarm-listening feature, and then uses built-in audio analytics to actually identify the sounds of a smoke alarm, a carbon monoxide alarm and breaking glass.
Similar audio analytics can be used in a host of other applications to filter out false alarms, shorten response times for security officers and first responders and improve safety and security overall. This technology can complement on-site staff and video surveillance systems to provide even more protection in smart cities and educational facilities.
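The context-gated alerting described in this section can be sketched in a few lines. This is a minimal illustration only, not how Alexa Guard is actually implemented: the class name, the event labels, the trigger phrases and the confidence threshold are all assumptions made for the example, and the sound classifier that would produce the labels is left out.

```python
from dataclasses import dataclass, field

# Hypothetical labels that an on-device sound classifier might emit.
ALERT_EVENTS = {"glass_break", "smoke_alarm", "co_alarm"}


@dataclass
class ContextGatedAlerter:
    """Raises alerts for certain sound events only while the user is away."""
    away: bool = False
    min_confidence: float = 0.8  # assumed threshold, not a published value
    alerts: list = field(default_factory=list)

    def on_voice_command(self, text: str) -> None:
        # Very rough phrase matching, for illustration only.
        phrase = text.lower().strip().replace("\u2019", "'")
        if phrase == "i'm leaving":
            self.away = True
        elif phrase == "i'm home":
            self.away = False

    def on_sound_event(self, label: str, confidence: float) -> bool:
        """Return True if this event triggered an alert."""
        if self.away and label in ALERT_EVENTS and confidence >= self.min_confidence:
            self.alerts.append(label)
            return True
        return False


alerter = ContextGatedAlerter()
alerter.on_sound_event("glass_break", 0.95)  # ignored: user is home
alerter.on_voice_command("I'm leaving")      # context switches to "away"
alerter.on_sound_event("glass_break", 0.95)  # alert: user is away
alerter.on_sound_event("dog_bark", 0.99)     # ignored: not an alert event
print(alerter.alerts)
```

The point of the sketch is that the same sound produces a different outcome depending on context: the glass-break event is ignored while the user is home and raised once the device knows the user has left.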
Sensor Fusion as a Contextual Awareness Solution
The answer to the context awareness problem in smart audio and voice assistants is sensor fusion, or sensor processing.
Data from multiple sensors such as microphones and accelerometers can be blended together to give the device the appropriate context needed for machine listening, precise audio classification and sound event recognition.
This process happens in real time and helps account for errors or biases in each individual sensor, without requiring constant (and costly) calibration. Sensor fusion can also combine a user’s personal data with the audio data gathered from a voice command to give better context and elicit a more accurate response from the voice assistant.
There are obvious security concerns with the use of personal data. However, sensor processing can also help improve security by processing the data locally, so that it never has to be transmitted or stored off the device, where it’s more vulnerable.
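To make the idea of one sensor compensating for another's errors concrete, here is a small sketch using a complementary filter, a textbook fusion technique for motion sensors. It is offered as an illustration, not as the method any particular product uses. The simulated signals, the gyroscope bias, the noise levels and the `alpha` weight are all assumed values chosen for the example.

```python
import random


def complementary_filter(gyro_rates, accel_angles, dt, alpha=0.98):
    """Fuse gyro rate (deg/s) and accelerometer tilt (deg) into one angle estimate."""
    angle = accel_angles[0]
    estimates = []
    for rate, accel_angle in zip(gyro_rates, accel_angles):
        # Trust the gyro for short-term change and the accelerometer
        # for the long-term level.
        angle = alpha * (angle + rate * dt) + (1.0 - alpha) * accel_angle
        estimates.append(angle)
    return estimates


# Simulated stationary device tilted at 10 degrees (all values are assumptions).
random.seed(0)
true_angle = 10.0
dt = 0.01          # 100 Hz sampling
n = 2000           # 20 seconds of data
gyro_bias = 0.5    # deg/s of constant bias: the gyro "sees" rotation that is not there

gyro_rates = [gyro_bias + random.gauss(0, 0.05) for _ in range(n)]
accel_angles = [true_angle + random.gauss(0, 2.0) for _ in range(n)]

# Integrating the biased gyro alone drifts steadily away from the true angle.
gyro_only = true_angle + sum(rate * dt for rate in gyro_rates)

# The fused estimate stays close to the true angle.
fused = complementary_filter(gyro_rates, accel_angles, dt)[-1]

print(f"gyro-only error: {abs(gyro_only - true_angle):.2f} deg")
print(f"fused error:     {abs(fused - true_angle):.2f} deg")
```

In this setup the gyroscope alone accumulates roughly 10 degrees of drift over 20 seconds, while the accelerometer alone is too noisy to use sample by sample. The filter bounds the drift without any explicit calibration step, although it leaves a small residual offset rather than removing the bias entirely.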
While contextual awareness will undoubtedly help improve device function, accuracy and user experience, we’re working hard to unlock other opportunities with this technology as well. It has big implications for monetization: voice assistants could serve an ad based on a voice search and the user’s prior shopping behaviors, for example. When used in a business sense, it may also improve team productivity and workplace efficiency with personalized interfaces for users.
(Source: CEVA)
CEVA’s own SenslinQ integrated hardware and software platform aggregates motion sensors, sound and connectivity technologies to create contextual awareness for IoT devices. The platform processes data from multiple sensors inside a device, such as microphones, radars, inertial measurement units (IMUs), environmental sensors and time-of-flight (ToF) sensors. It then filters this data, performs front-end signal processing and applies advanced algorithms to create “context enablers”. These include activity classification, voice and sound detection, and presence and proximity detection.
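The general shape of such a pipeline (filter the raw sensor data, extract simple features, then derive context flags) can be sketched as follows. This is not the SenslinQ API or its algorithms: every function name, the choice of features and all thresholds are hypothetical, picked only to show how raw sensor windows might become coarse "context enablers".

```python
from statistics import mean, pstdev


def moving_average(samples, window=3):
    """Simple smoothing filter standing in for the filtering stage."""
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        out.append(mean(chunk))
    return out


def rms(samples):
    """Root-mean-square level, a basic front-end audio feature."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def context_enablers(mic, accel_magnitude, tof_distance_m):
    """Turn raw sensor windows into coarse context flags (thresholds are made up)."""
    mic_level = rms(mic)
    motion = pstdev(moving_average(accel_magnitude))
    distance = mean(moving_average(tof_distance_m))
    return {
        "sound_detected": mic_level > 0.1,                # voice and sound detection
        "activity": "moving" if motion > 0.5 else "still",  # activity classification
        "user_present": distance < 1.5,                   # presence and proximity
    }


# One short window of made-up readings: audible sound, a device at rest,
# and an object about 0.8 m from the time-of-flight sensor.
mic = [0.3, -0.4, 0.35, -0.3, 0.4, -0.35]
accel_magnitude = [9.8, 9.81, 9.79, 9.8, 9.82, 9.8]
tof_distance_m = [0.8, 0.82, 0.79, 0.81, 0.8, 0.8]

result = context_enablers(mic, accel_magnitude, tof_distance_m)
print(result)
```

A real platform would replace each stage with far more capable signal processing and trained classifiers. The sketch only shows how the stages connect, with several sensors feeding one shared description of the device's surroundings.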
A platform like SenslinQ is an important piece in the development of many contextually aware devices, including smartphones, laptops, AR/VR headsets, hearables and wearables, along with voice assistants. It centralizes the workload for sensor processing and then fuses context enablers, either on-device or wirelessly, to help devices understand and adapt to the surrounding environment.