DolphinAttack: simulated text-to-speech commands at ultrasonic frequencies can take over voice-powered AI assistants
Remember those books and movies we used to devour in our childhood, where a dog could respond to an ultrasonic whistle (which no one else could hear) and save the day?
Well, that was fiction, but how often has life imitated fiction? Just open any Jules Verne book. Basically, whatever he wrote, man went ahead and did it! <tangent/ go MAN, you rock>
The race for AI is on, and while exciting innovations are fundamentally changing our lives, the spectre of security has never loomed larger.
It will only become more critical in the days to come.
Reams of PR headlines have been devoted to voice-powered AI assistants in recent years. Alexa, Siri, anyone? While it is a dream to have these diminutive AI tools act on verbal commands, it seems they might have a very simple but serious back-door flaw.
A team of six researchers from Zhejiang University in China has discovered a vulnerability in voice assistants from Apple, Google, Amazon, Microsoft, Samsung, and Huawei. "It affects every iPhone and MacBook running Siri, any Galaxy phone, any PC running Windows 10, and even Amazon’s Alexa assistant."
In a detailed research paper published here, the team has shown how, with a $3 apparatus, hackers can simulate text-to-speech commands at ultrasonic frequencies inaudible to the human ear to take over voice-powered AI assistants, make unauthorised calls, open malicious websites, change a car's navigation route and so on.
Imagine Siri and Apple Pay?
Read the paper in detail. It is titled "DolphinAttack: Inaudible Voice Commands".
A truly fascinating theory put into practice to prove a hypothesis.
But before we go forward let's try and understand how sound frequencies work with respect to human hearing.
In summary, sound frequencies in the ultrasonic range, above roughly 20 kHz, are inaudible to the normal human ear (but audible to phone microphones; more on that later).
Now let's look at how our voice commands transmit inside a phone and get an AI assistant to execute a command.
The challenge, however, lies in mimicking the voice pattern recorded in the AI memory at a higher frequency during wake-up mode.
That's where DolphinAttack comes in. (To clarify, DolphinAttack is an attack sequence built by the team to prove their hypothesis)
So what exactly is DolphinAttack?
From the paper, "DolphinAttack: Inaudible Voice Commands": the architecture of DolphinAttack.
The researchers used the phenomenal advancement in text-to-speech (TTS) technology to convert text-based commands into speech, which could then be modulated onto carrier frequencies inaudible to the human ear (using a hardware module consisting of a phone, an amplifier and an ultrasonic transducer to activate Siri and issue commands).
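To make that modulation step concrete, here is a minimal, illustrative sketch (not the team's actual code): a stand-in 400 Hz "voice" tone is amplitude-modulated onto a 25 kHz ultrasonic carrier, so the transmitted energy sits entirely above the audible range. The sample rate, tone frequency, carrier frequency and modulation depth are all arbitrary assumptions chosen for the demo.

```python
import math

fs = 192_000            # sample rate high enough to represent ultrasound
n = fs // 10            # 0.1 seconds of samples

def tone(freq, kind=math.sin):
    """A pure tone at `freq` Hz, sampled at fs."""
    return [kind(2 * math.pi * freq * i / fs) for i in range(n)]

command = tone(400)                 # stand-in for an audible voice command
carrier = tone(25_000, math.cos)    # ultrasonic carrier, above ~20 kHz hearing limit

# Standard amplitude modulation: the audible signal rides on the carrier
transmitted = [(1 + 0.8 * c) * k for c, k in zip(command, carrier)]

def magnitude(signal, freq):
    """Magnitude of one DFT bin: how much of `freq` is present in `signal`."""
    re = sum(s * math.cos(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    return math.hypot(re, im) / len(signal)

print(round(magnitude(transmitted, 25_000), 2))  # 0.5  -> energy at the carrier
print(round(magnitude(transmitted, 400), 2))     # 0.0  -> nothing in the audible band
```

The point of the sketch: after modulation, a spectrum check finds the signal only around 25 kHz, with nothing left at 400 Hz, so a human standing next to the speaker hears silence.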
As the research paper explains, applications like Alexa and Siri rely more on vocal tones than on an exact voice-pattern match. Using Google Text-to-Speech, the researchers trained Siri to respond to machine-simulated commands during wake-up mode.
This is the most difficult part, since speech-recognition systems in smartphones only authenticate voice patterns for the wake-up phrase (e.g. "Hey Siri"). They do not authenticate the voice commands that launch applications once the system is activated.
This is aided by the fact that the microphones and software that power voice assistants like Siri, Alexa, and Google Home can pick up inaudible frequencies, specifically frequencies above the 20 kHz limit of human ears.
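A toy model of why the microphone "hears" the command: real microphone hardware is slightly nonlinear, and a small quadratic term in its response demodulates the ultrasonic AM signal back into the audible band, which is the core effect the paper exploits. The numbers below (sample rate, frequencies, the 0.1 nonlinearity coefficient) are illustrative assumptions, not measured device parameters.

```python
import math

fs = 192_000            # sample rate high enough to represent ultrasound
n = fs // 10            # 0.1 seconds of samples

def tone(freq, kind=math.sin):
    return [kind(2 * math.pi * freq * i / fs) for i in range(n)]

command = tone(400)                 # audible "command" tone
carrier = tone(25_000, math.cos)    # ultrasonic carrier
transmitted = [(1 + 0.8 * c) * k for c, k in zip(command, carrier)]  # inaudible on the air

# Toy microphone: mostly linear, plus a small quadratic nonlinearity.
# Squaring the AM signal produces a baseband copy of the command.
recorded = [x + 0.1 * x * x for x in transmitted]

def magnitude(signal, freq):
    """Magnitude of one DFT bin: how much of `freq` is present in `signal`."""
    re = sum(s * math.cos(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    return math.hypot(re, im) / len(signal)

print(round(magnitude(transmitted, 400), 2))  # 0.0  -> no audible content in the air
print(round(magnitude(recorded, 400), 2))     # 0.04 -> the 400 Hz command reappears
```

After the nonlinearity, the assistant's low-pass filtering and speech recognition see an ordinary audible command, even though nothing audible was ever played.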
So, in the scenario of a brute-force attack, a hacker can trigger any control command (e.g. open Google / show me restaurants on 5th street / call 911) using easily available TTS APIs and tools from Google, Baidu and Selvy, once the wake-up sequence has been activated successfully.
This is a classic case of user-friendliness at odds with personal security.
Chromecast and Amazon Dash both use frequencies higher than 18 kHz to pair with phones. The experience is seamless and magical but, based on this research, perhaps at times a bit vulnerable; something we need to be cognizant of.
Automation is a boon, but like most machine-learning systems it has its Achilles heel. Today, browsers collect cookies easily and invisibly. These cookies identify our web behaviour in relentless detail. Individual mobile device IDs can be tracked. Digital ads follow us across the web. Maps have a scary way of knowing our routes home. Our phones back up our photos and contacts to the cloud. Our private lives are at the mercy of hackers with a penchant for voyeurism.
Convenience is necessary, even important, but like all benefits it comes with a price. A hidden cost. Our personal security and vulnerability, for starters.
Unchecked, this can create a world where someone sitting in a room with one laptop and a cloud server can control the transatlantic flight you are on at 40,000 feet, or pay for a purchase you never made using your phone and the credit-card details stored within it.
Let's take a moment to think about that.
Oh, and for the above scenario: just don't leave AI-powered voice assistants always-on in your phone. Use them when you need to, and close the applications down once your need is complete. Unknowingly, we often let 10-15 programs run always-on in the background of our phones. These programs constantly communicate with the network: sending push notifications, updating databases in real time, mapping our coordinates, storing information about us and our behaviours. SILENTLY. RELENTLESSLY.
Let's go out. Listen to some music. Read a book.
Have a lollipop.
1: This blog is an abstract based on the original research paper "DolphinAttack: Inaudible Voice Commands", conducted and written by Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang and Wenyuan Xu of Zhejiang University, China.
Copyright of this work, the study and its findings belong to them.
The research paper is available publicly here.