Voice-enabled technologies have become incredibly popular in the last few years. They are present in a wide variety of devices such as smartphones, home appliances, and smart speakers, most commonly in the form of voice assistants.
Voice-enabled technologies can genuinely simplify everyday tasks such as checking the weather or your bank account balance. However, it’s important to note that they can also be used for more delicate functions, such as emotion analysis, which aims to detect and recognise feelings (e.g., anger, disgust, fear, happiness) from speech signals, or health condition identification (e.g., flagging possible COVID-19 patients by analysing the sound of their coughs or how long they can sustain certain vowel sounds).
To complete any of the tasks mentioned above (and many others, of course), voice-enabled systems require hundreds of hours of training data in the form of audio recordings, which, most of the time, contain personal data, i.e., information relating to an identified or identifiable person. But where does this information come from? Let’s see!
Most voice-based devices collect data from three major sources:
- The speech/message content
- The user’s voice
- Background sounds
Let’s check each of these sources, separately.
The speech/message content
Every time a user gives his/her voice assistant a spoken command (e.g., “Alexa, find me the nearest train station”), the system converts it into a readable form, i.e., text.
To properly execute commands, voice-enabled devices collect and process the users’ spoken message which often carries personal or sensitive information such as:
- Words or utterances that mention the user’s identity or traits that reveal information about his/her background (e.g., gender, age, ethnic origin, etc.), information concerning his/her health status or medical condition (e.g., “Siri, I want to buy prenatal vitamins”), or other critical information (e.g., “My credit card number is 8908089876. Send my order to Street Magallane 3, 2A”).
- User preferences not revealed by the final dialogue outcome (e.g., a user asks about several hair products from various brands she likes, although she eventually buys only one of them).
- Any voice interaction revealing non-personal, but still sensitive, information to the voice-enabled system provider (e.g., an employee who accidentally activates his/her voice assistant while discussing the company’s upcoming marketing strategy or annual sales).
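To make the risk concrete, here is a minimal, purely illustrative sketch of masking sensitive details in a transcribed voice command before it is stored or shared. The patterns and phrase list are simplified assumptions for this example, not a real PII-detection method used by any vendor.

```python
import re

# Hypothetical example: naive redaction of sensitive content in a transcript.
# The digit pattern and phrase list below are illustrative assumptions only.
CARD_RE = re.compile(r"\b\d{10,16}\b")                   # long digit runs (card-like numbers)
SENSITIVE_PHRASES = ("prenatal vitamins", "blood pressure")  # example health-related phrases

def redact_transcript(text: str) -> str:
    """Replace card-like numbers and known sensitive phrases with placeholders."""
    text = CARD_RE.sub("[NUMBER]", text)
    for phrase in SENSITIVE_PHRASES:
        text = re.sub(re.escape(phrase), "[SENSITIVE]", text, flags=re.IGNORECASE)
    return text

print(redact_transcript("My credit card number is 8908089876."))
# My credit card number is [NUMBER].
print(redact_transcript("Siri, I want to buy prenatal vitamins"))
# Siri, I want to buy [SENSITIVE]
```

Real systems would need far more robust detection (named-entity recognition, checksum validation for card numbers, multilingual phrase lists), but even this toy version shows how much identifying content plain command text can carry.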
The user’s voice
The user’s voice alone (i.e., the voice signal) can reveal a considerable amount of personal information. Vendors and service providers can use this information to profile users and adapt their offers of products and services accordingly, but cybercriminals can also use it to bypass identification controls or commit fraud.
Common types of information revealed by the user’s voice include, but are not limited to:
- General traits of the speaker, such as gender, age, or ethnic origin, inferred through voice features like accent and pitch.
- Physical traits such as weight, height, or even physical strength.
- Mental states or conditions such as stress, relaxation, or depression.
- Physical health indicators such as bronchitis, smoking habits, or intoxication, which may be inferred from persistent coughing, shortness of breath, or other symptoms while the user is speaking.
- Emotional states, e.g., detecting customer dissatisfaction during interactions with call centres.
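As a simple illustration of how such traits can be inferred, the sketch below estimates the fundamental frequency (pitch) of a signal via autocorrelation, one of the basic voice features associated with speaker gender and age. It is a minimal toy example on a synthetic tone, not any vendor’s actual profiling pipeline.

```python
import numpy as np

def estimate_pitch(signal: np.ndarray, sample_rate: int,
                   fmin: float = 60.0, fmax: float = 500.0) -> float:
    """Estimate the fundamental frequency (Hz) via autocorrelation,
    searching only within the typical human pitch range [fmin, fmax]."""
    signal = signal - signal.mean()
    # Autocorrelation for non-negative lags
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(sample_rate / fmax)   # shortest lag = highest pitch
    lag_max = int(sample_rate / fmin)   # longest lag = lowest pitch
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / best_lag

# Synthetic "voice": a 220 Hz tone, roughly a typical adult male speaking pitch.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
print(estimate_pitch(tone, sr))  # roughly 220 Hz
```

Real voice profiling combines many such features (pitch contour, formants, speaking rate) with statistical models, but even this one number already narrows down plausible speaker traits.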
Background sounds
Some voice-enabled devices might be capable of recording background sound. Background sounds can reveal personal information about the user interacting with the device and also provide additional context to his/her message. A few examples are listed below:
- Background conversations taking place when the device is recording (e.g., conversations between members of a family, work colleagues, etc.)
- Content such as music, shows, etc., played from computers, speakers, televisions, radios, or similar devices (e.g., the music played can reveal information about preferences that could be used by vendors and service providers for profiling purposes).
- Sounds such as shouting or slamming doors can provide information about the user’s family situation (e.g., domestic violence).
- Sounds like a crying baby or kids playing can reveal information about the user’s personal life (e.g., he/she has children).
- Sounds produced by trains, aeroplanes, or other types of transport can reveal the user’s location (e.g., the user lives near the city airport or train station).
Personal information, as we have seen throughout this article, can be extracted from multiple sources (i.e., speech content, voice signal, background sounds). Sometimes, a single source does not provide enough information to identify the user or, at least, make him/her identifiable. When combined, however, these sources can complement each other enough to reveal the user’s personal information.
Legal specialist and Project Manager at Rooter
Legal Consultant at Rooter