When you’re watching the Super Bowl between the Rams and Patriots on Sunday, your Echo speakers won’t inadvertently respond to the wake word “Alexa” when someone shouts it out on television — even during Amazon’s splashy new Alexa commercial starring Harrison Ford and Forest Whitaker. It might seem like a small thing, but the science behind getting Amazon’s intelligent assistant to ignore its name is a bit more involved than you might think. (It wasn’t that long ago, after all, that a Burger King ad prompted smart speakers around the country to search for the ingredients in a Whopper sandwich.)
In a blog post this morning, Amazon details acoustic fingerprinting, a technique Alexa AI research scientists at the Seattle company use to “teach” Alexa what individual instances of her name sound like, so that she’ll ignore them. It’s applied on the fly to detect when multiple Alexa-enabled devices around the globe hear the same command at roughly the same time — which Mike Rodehorst, a machine learning scientist in Amazon’s Alexa Speech division, says is critical to preventing Alexa from responding to pranks, references to people named Alexa, and other TV mentions Amazon isn’t made aware of in advance.
“Our approach to matching audio recordings is based on classic acoustic-fingerprinting algorithms like that of Haitsma and Kalker in their 2002 paper ‘A Highly Robust Audio Fingerprinting System,’” he says. “Such algorithms are designed to be robust to audio distortion and interference, such as those introduced by TV speakers, the home environment, and our microphones.”
So how does it work? Fingerprinting involves deriving log filter-bank energies, or LFBEs, for an acoustic signal — cells within a grid that represent the amount of energy in overlapping frequency bands in a series of overlapping time windows. An algorithm steps through the grid in two-by-two blocks, computing 2D gradients for the cells as it goes. The positive (or negative) sign of each result summarizes the values of an individual block in a single bit, and those bits together constitute the acoustic fingerprint.
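The gradient-to-bit step described above can be sketched in a few lines of Python. This is an illustrative reading of the Haitsma-Kalker-style scheme, not Amazon’s actual code: the function name, the grid shape, and the use of a plain NumPy array as a stand-in for real log filter-bank energies are all assumptions.

```python
import numpy as np

def fingerprint(lfbe: np.ndarray) -> np.ndarray:
    """Derive a bit fingerprint from a grid of log filter-bank
    energies (rows = frequency bands, columns = time windows).

    Steps through the grid in two-by-two blocks and keeps only the
    sign of a 2D gradient (the double difference across both the
    frequency and time axes), yielding one bit per block.
    """
    bits = []
    for t in range(lfbe.shape[1] - 1):          # adjacent time windows
        for f in range(lfbe.shape[0] - 1):      # adjacent frequency bands
            # Double difference over the 2x2 block: how the
            # frequency-axis change itself changes over time.
            grad = (lfbe[f, t] - lfbe[f + 1, t]) - (
                lfbe[f, t + 1] - lfbe[f + 1, t + 1]
            )
            bits.append(1 if grad > 0 else 0)
    return np.array(bits, dtype=np.uint8)

# A 5-band x 4-window grid yields (5-1) * (4-1) = 12 fingerprint bits.
demo = fingerprint(np.random.default_rng(0).standard_normal((5, 4)))
```

Keeping only the sign, rather than the gradient’s magnitude, is what makes the fingerprint robust to the volume changes and distortion a TV speaker introduces.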
When the fraction of bits that differ between two fingerprints is small enough, they’re deemed a match, and Alexa ignores the wake word.
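That comparison amounts to a bit error rate check, which might look roughly like this. The threshold value here is purely illustrative — Amazon doesn’t disclose the one it uses.

```python
import numpy as np

def is_match(fp_a: np.ndarray, fp_b: np.ndarray,
             max_diff_fraction: float = 0.25) -> bool:
    """Declare a match when the fraction of differing bits
    (the bit error rate) falls below a threshold.

    max_diff_fraction is an illustrative value, not Amazon's.
    """
    assert len(fp_a) == len(fp_b), "fingerprints must be equal length"
    bit_error_rate = np.count_nonzero(fp_a != fp_b) / len(fp_a)
    return bit_error_rate < max_diff_fraction
```

A low threshold risks missing distorted copies of the same ad audio; a high one risks muting genuine, coincidentally similar requests — the tuning trade-off behind the “increased scrutiny” described below.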
When audio samples are provided in advance, Amazon fingerprints them in their entirety. The results are stored in the cloud; otherwise, Amazon “builds up” acoustic fingerprints piecemeal from audio streaming to the cloud from Alexa-enabled devices, repeatedly comparing them to other fingerprints as they grow. (Clean audio is easier to process than audio with lots of background noise, Amazon says; the latter can still yield a match, but requires more data.)
Every incoming audio request to Alexa that starts with a wake word is checked in two ways. It’s first compared to a database of known fingerprinted instances of “Alexa,” which also make use of the audio that follows the wake word. Then it’s checked against a fraction of other requests coming into Alexa devices around the same time — audio matching that of requests from at least two other customers is identified as a “media event” and given increased scrutiny (and potentially declared a match). This dynamic matching contributes to a small cache of fingerprints, allowing Alexa to continue ignoring wake word requests even when they’re not happening simultaneously.
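The second, cross-device check is essentially a coincidence detector. Here is a minimal sketch of the idea; the data layout, time window, and customer count are assumptions for illustration, and real fingerprints would be compared fuzzily rather than by exact equality.

```python
from collections import defaultdict

def detect_media_events(requests, window=2.0, min_customers=3):
    """requests: list of (customer_id, timestamp, fingerprint) tuples,
    where fingerprint is hashable (e.g., packed fingerprint bits).

    Flags a fingerprint as a likely "media event" when it arrives
    from at least `min_customers` distinct customers (the requester
    plus at least two others) within `window` seconds.
    All parameter values are illustrative, not Amazon's.
    """
    by_fp = defaultdict(list)
    for customer, ts, fp in requests:
        by_fp[fp].append((ts, customer))

    events = set()
    for fp, hits in by_fp.items():
        hits.sort()
        for ts, _ in hits:
            nearby = {c for t, c in hits if abs(t - ts) <= window}
            if len(nearby) >= min_customers:
                events.add(fp)
                break
    return events
```

A fingerprint flagged this way would then be added to the short-lived cache the article describes, so stragglers — devices that hear the same ad a few seconds later — are still silenced.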
These fingerprinting methods together — for which Amazon has patents — will prevent as many as 80 to 90 percent of devices from responding to TV-originated Alexa requests, the company says. And they’re not the only precautionary measure in place.
On most Echo devices, every time the wake word “Alexa” is detected, the audio’s compared to a small set of known instances where Alexa is mentioned in commercials. Rodehorst says that the set is generally restricted to ads the Alexa team expects to be currently airing on TV, due to the limits of the smart speakers’ CPUs.
“Ideally, a device will identify media audio using locally stored fingerprints, so it does not wake up at all,” he says. “If it does wake up, and we match the media event in the cloud, the device will quickly and quietly turn back off.”
Separately, Amazon scientists continue to refine machine learning techniques that help Alexa distinguish sounds produced by TVs and other devices from those produced by people in the flesh. In research published last year, they describe an AI system that learns the frequency characteristics of different types of sounds and analyzes the sounds’ arrival times at multiple microphones within an Echo speaker. This enables it to tell the difference between moving sound sources and stationary ones, for example.
In tests, a system trained using 311 hours of recordings from volunteers improved Alexa’s media audio recognition by 8 percent to 37 percent depending on the audio type, Amazon says. Apparently, it performed best on any combination involving singing.