15 July 2019
The following are some thoughts about the AMBIENT PRIVACY project. I originally undertook this project as a way to examine my part in ‘surveillance capitalism’ - the growing trade in personal information exchanged for digital services.
In this post I won’t analyse the content of the queries themselves - what was asked, and when - but rather what the recordings reveal about the technology itself.
First, a little bit of background on how I actually indexed and documented the full archive.
I downloaded all of my Voice Search & Assistant metadata from Google Takeout under the “My Activity” section and processed the resulting files (mp3 audio files and a JSON metadata file) using two programs I wrote1.
The first was written in R, a data-processing language, which I used to flatten and clean up the metadata file and to generate a waveform for each audio recording.
The second was written in Processing, a creative coding language, which created the individual card designs for printing and provided an interface for auditing individual audio files while I annotated the cards. I printed these designs directly onto standard index cards.
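As a rough sketch of that first flattening step - in Python rather than the R I actually used, and with field names that are my assumptions about the Takeout “My Activity” JSON rather than a documented schema - the idea is simply to turn nested activity entries into one flat record per recording:

```python
import json

# Illustrative sketch only: my actual pipeline used R and Processing.
# The field names ("time", "title", "audioFiles") are assumptions about
# the Takeout "My Activity" JSON, not a documented schema.
def flatten_activity(entries):
    """Flatten Takeout metadata entries into one flat record per recording."""
    records = []
    for entry in entries:
        # An entry may reference zero or more audio files; default to one
        # empty dict so entries without audio still produce a record.
        for audio in entry.get("audioFiles", [{}]):
            records.append({
                "time": entry.get("time"),        # ISO 8601 timestamp
                "title": entry.get("title"),      # e.g. "Said set a timer"
                "audio_file": audio.get("name"),  # matching mp3 filename
            })
    return records

def load_activity(path):
    """Read the Takeout JSON export and return flattened records."""
    with open(path, encoding="utf-8") as f:
        return flatten_activity(json.load(f))
```

From a table like this it’s straightforward to sort by timestamp, join each record to its audio file, and drive the card-generation step.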
Having printed all of the cards, I then spent 5 long days listening to every audio file and annotating the individual cards while assuming a few roles - an engineer (red pen), a designer (blue pen), and a marketeer (green pen) - in an attempt to understand how each would take on the same work. Each role has a taxonomy of sorts, with corresponding stamps of various meanings.
Voice assistant tools such as the Google Home and Alexa (and phones with a voice assistant enabled) are ‘always on’, meaning that they have an always-active microphone which is constantly listening and attempting to parse out the relevant hotword for the device (e.g. “okay Google”). This listening happens within a buffer - a ‘short-term memory’ that discards any audio more than a few seconds old once the device knows it didn’t contain the keyword.2
Once the hotword has been detected, however, the device ‘rewinds’ through those last few seconds of short-term memory and creates an audio file which it uploads to Google’s servers for query parsing and response formation. As best as I can determine, it ends the recording once either a time limit for the total recording is hit or the microphone input has been quiet for a certain amount of time, making it safe to assume the query is finished.
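This buffering behaviour can be modelled as a simple ring buffer. Here’s a toy sketch in Python - the chunk length, buffer size, and the detect() hotword callback are all my assumptions; real devices run an on-device model over a raw audio stream:

```python
from collections import deque

# Toy model of the 'always on' buffer. Chunk length and buffer size are
# assumptions; the true 'short-term memory' length isn't published.
CHUNK_SECONDS = 0.25    # audio assumed to arrive in 0.25 s chunks
BUFFER_SECONDS = 2.0    # assumed length of the rolling buffer

def listen(chunks, detect, pre_roll=1.75):
    """Return the chunks that would begin the uploaded recording.

    Audio older than BUFFER_SECONDS silently falls off the back of the
    buffer until detect() reports the hotword; the device then 'rewinds'
    pre_roll seconds into the buffer and starts the recording there.
    """
    buffer = deque(maxlen=int(BUFFER_SECONDS / CHUNK_SECONDS))
    for chunk in chunks:
        buffer.append(chunk)                     # oldest chunk discarded
        if detect(chunk):
            n = int(pre_roll / CHUNK_SECONDS)    # 1.75 s -> 7 chunks
            return list(buffer)[-n:]
    return None                                  # hotword never heard
```

The key property is that nothing outside the rolling window is ever retained - until the hotword fires, at which point the last pre_roll seconds of ‘discarded-in-waiting’ audio become part of the upload.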
Structurally, the audio file can therefore be segmented as:
- the buffer window preceding the keyword
- the keyword itself
- an optional acknowledgement tone from the device
- the query
- the window after the query is deemed finished
After staring at enough of these waveforms it becomes quite easy to spot some patterns. The keyword is usually easy to spot: 3 staccato waves indicating the stop sounds of the keyword3. There can be an “acknowledgement tone” made by the device itself - a beep that indicates the keyword’s been heard. The query can obviously vary in length and waveform and so is a little more difficult to parse out, especially against ambient noise. The windows before and after the keyword are the interesting segments.
Depending on the device, the first buffer window (before the query) is either 0.75 seconds or 1.75 seconds. For a phone, this seems to be 0.75s. For Google Home devices, this seems to be 1.75s.
In other words, once the keyword has been detected, a Google Home will go back in its audio buffer 1.75s and initialise the audio file that will be uploaded from there.
The short window of audio saved after a query has been deemed ‘finished’ is a little more difficult to determine consistently, but seems to be around 0.75s for all devices.4
This means that, for a voice assistant search that took 4.8 seconds5 in total, up to 2.5s can be audio that isn’t of the keyword or relevant query. For my voice history, of the 42 minutes of recordings of this kind of query, somewhere between 13 and 22 minutes are audio that isn’t of the keyword or search query but of the time before or after.
It is entirely reasonable to ask what could really be gleaned from these short periods of additional surveillance, and whether it is possible for these snatches of recording to contain anything one might consider too private to have uploaded to Google servers and examined by Google staff.
Within my history I found snatches of other conversations, coughs and laughter, and ambient noise and conversations from my friends, family, TV, and work. I was surprised at how much ‘extraneous’ audio was stored, especially in recordings where high levels of ambient noise meant the ‘query’ wasn’t considered finished for many seconds after I’d stopped speaking. In my opinion it is entirely possible for a recording to either ‘backtrack’ to a brief snatch of private conversation, or to hold just slightly too long because of ambient noise and record some private moment.
My more-recent phones have been Google’s own Pixel phones. These have the same always-listening approach as my Google Home.
This always-listening approach has resulted in a number of misfirings - incidents in which some ambient discussion was similar enough to the keywords to trigger the recording. Given that I work in technology, the likelihood of somehow muttering some combination of “okay” or “hey” and “Google”, and triggering the recording, is quite high when compared to the average person.
More concerningly, as best as I can tell, some of these incidents occurred without me having been actively notified that a recording was made. These seem to have happened when the voice assistant was triggered but was then unable to parse out a voice in the ambient noise. Because the keyword was (incorrectly) detected, the audio file was recorded and uploaded to Google’s servers - and once the server failed to parse out a query, the audio simply remained there.
The result is a kind of ambient field recording of business meetings, pub discussions, and chats with friends, silently triggered and recorded, with nothing alerting me that this is happening. Of the 689 recordings there seem to be 5 of these incidents of quiet surveillance.
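Finding these incidents in the archive is a matter of filtering the metadata for recordings that never produced a transcribed query. A rough sketch, where the ‘Said …’ title convention is my own observation from my archive rather than anything documented:

```python
# Illustrative filter for the 'quiet surveillance' incidents described
# above. The record shape and the 'Said ...' title convention are my
# assumptions from inspecting my own Takeout archive.
def find_misfires(records):
    """Return recordings that were uploaded but never parsed into a query."""
    misfires = []
    for rec in records:
        title = (rec.get("title") or "").strip()
        # Transcribed queries appear as 'Said <query>'; anything else
        # (e.g. an entry with no query text) suggests a failed parse.
        if not title.startswith("Said "):
            misfires.append(rec)
    return misfires
```

In my archive this kind of filter, followed by listening to each candidate recording, is what surfaced the 5 incidents above.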
As a result of this finding, one lingering thought has kept occurring to me. Given the prevalence of smartphones within my work and social groups, it’s reasonable to assume that at least one always-on microphone is listening when I’m in a group of 5 or more people. Within larger groups, it’s safe to assume there are multiple microphones. How many have been triggered accidentally, and recorded some snatch of my private or public life, is ultimately unknowable to me. Without constantly interrogating any group of people, it’s impossible to know how many microphones are listening at any point. It is easy to state that one should “just not use this technology if you’re not willing to sacrifice some of your privacy” but the reality is that we are now being surveilled constantly with or without our consent.
Without downloading this archive of my queries, I would never have known how much audio was actually recorded and stored. I would also not have known how many recordings there were of situations in which I didn’t intentionally trigger a query. I do not know who else has listened to these recordings from Google, if anyone, nor for what purposes. I do not know what algorithms have pored over these recordings for the purposes of marketing, profiling, or simply training better algorithms. I do not have any conclusive evidence of exactly how much audio will be stored before and after my queries. These remain just guesses.
Prior to taking on this project, I believed I had a better-than-average understanding of what degree of privacy I was exchanging in order to quickly start a timer or find out the capital of Mongolia. Now I am very much unsure, and I feel anyone taking on the same degree of self-surveillance would come to the same conclusion: surveillance capitalism is reliant on a misunderstanding of the exchange it asks of us: privacy in exchange for convenience. Indeed, this is noted in the opening chapter of Shoshana Zuboff’s excellent book on the subject.
Surveillance capitalism operates through unprecedented asymmetries in knowledge and the power that accrues to knowledge. Surveillance capitalists know everything about us, whereas their operations are designed to be unknowable to us.
Ultimately, while tools like Google’s Takeout service go some way toward providing an understanding of what privacy we are giving up in exchange for convenience, they fall well short of what’s required for anyone to truly understand the nature, value, and prevalence of this surveillance. And this is intentional - it is never in the interests of a company such as Google to provide this information. Surveillance capitalism is, I believe, built on this tension: that it is only sustainable if we don’t truly understand what we are giving up, who is listening, or how it truly works, and so it remains in the interests of corporations to ensure this remains the status quo.
While Google’s Takeout system provides a voluntary access system to my data, I would be very interested to know whether I would receive a more complete set of data if I instead filed a GDPR request. ↩
I’ve been unable to find any statement of exactly how long this buffer is for Google devices - it doesn’t seem to be explicitly stated anywhere. Similarly, I’m unable to say whether it has changed over time - presumably an option with a device update. ↩
I’d be very interested to hear what phoneticists would consider the best keyword from a recognition perspective, as I’d guess it wouldn’t be something people would want to say aloud regularly. ↩
With some sound analysis tools it’d be easier to reverse engineer what the parameters for the query cut-off might be but I didn’t have the time to dive into this. ↩
The average duration of my voice queries that were triggered by a keyword. ↩