Is There Anything Else I Should Know About You?

“Tell me what you read and I tell you who you are.” – Pierre de La Gorce

Voice assistants, such as Alexa, Siri, or Google Assistant, are becoming increasingly popular. Some users, however, are worried that their vocal interactions with these devices often get stored in the cloud, together with a textual transcript of every spoken word. But is that just a vague feeling of unease, or is there an actual threat associated with collecting these data? And if so, could such threats be prevented? Let’s have a look!

When companies have access to lots of their users’ dialog data, they can use it to make their voice assistants work better. So that’s great news for the users, isn’t it? Where’s the problem, then? Well, for one, the recordings of a user’s voice could be abused to impersonate them using artificial neural networks. While reputable companies are usually not in the identity-theft business, this becomes a more realistic threat when the stored user recordings are leaked to the wrong people, for instance after a successful hack.

But it’s not just the sound of your voice that is potentially at risk here. What you say also reveals a lot about you. While saying “turn on the living room lights” in a smart home today may be considered rather harmless, future versions of voice assistants will be increasingly capable. They will engage in more involved conversations with the user and assist with complex tasks such as booking your next holiday trip from start to finish, including choosing a hotel, booking the flights, and planning leisure activities. From the user’s perspective, such interactions will likely contain quite a lot of private information.

To illustrate, let’s focus on five simple types of entities that we might find in a conversation with a voice assistant: people, organizations, locations, dates, and times. Here’s an example where all of these appear:

USER: New calendar entry: meeting with Mrs. Norton from Mycom next Tuesday.
SYSTEM: What time?
USER: The whole day. And please look up train connections to Berlin for that day.

This short exchange already contains many details that draw a specific picture of the person speaking. For instance, we can guess with some confidence that the conversation is about a business meeting, which suggests that the speaker works in a position where attending such meetings is part of the job. More specifically, we now know where the speaker will be on a specific day: next Tuesday, he or she will be in Berlin. The speaker’s preferred mode of transportation for getting to Berlin is revealed as well. In addition, we can infer a couple of what we might call negative pieces of information, namely that the speaker is not Mrs. Norton, does not work for Mycom, and does not reside in Berlin. And finally, this small snippet of a voice interaction also allows us to hypothesize some facts with a certain level of probability, even if we cannot be absolutely sure about them: for instance, the current location of the speaker cannot be too far away from Berlin, or else a train ride there would probably be too inconvenient to consider.

If such a short vocal exchange already provides so much information for profiling a speaker, it becomes clear that more, longer, and more elaborate recordings and transcriptions of voice interactions bear even greater potential for abuse. In a slight variation of the quote at the beginning of this blog post, we really have reason to be worried: Tell me what you said and I tell you who you are.

Oh, and by the way: the above example of course contains information not just about the speaker, but also about the other person mentioned. We now know that a Mrs. Norton works for a company called Mycom and will also be in Berlin on Tuesday. Thus, privacy threats in recorded conversations affect not only the people who actually participated in the conversation, but potentially third parties as well.

Identifying privacy threats in dialogues

The EU-funded research project COMPRISE seeks novel ways to protect the privacy of voice assistant users while keeping in mind the legitimate interest in continuously improving the performance of such systems through ongoing data collection. We employ state-of-the-art artificial intelligence and natural language processing methods to protect users’ privacy by transforming the dialogue transcripts before they are uploaded to the cloud. The goal is to “disarm” the conversations by removing any privacy threats, so that even if they accidentally fall into the wrong hands, little to no harm can be done.

So how can this be done?

To stay with our example, the names of people and organizations and the mentions of locations, dates, and times are instances of what is referred to as “Named Entities” in the computational linguistics literature. One approach to improving privacy consists of detecting these Named Entities, or NEs, automatically and either deleting them altogether or replacing them with something else, so that the result is less of a privacy concern to all parties involved.
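To make the “delete or replace” idea concrete, here is a minimal Python sketch. It assumes the entities have already been detected and are given as character offsets with a label; the function name `transform`, the helper `span_of`, and the placeholder format `[PER]` are illustrative choices, not the COMPRISE implementation.

```python
from typing import List, Tuple

# A detected entity: (start, end, label), with character offsets into the
# text and an exclusive end. This representation is an assumption.
Span = Tuple[int, int, str]


def transform(text: str, spans: List[Span], mode: str = "replace") -> str:
    """Replace each span with a placeholder such as [PER], or delete it."""
    if mode not in ("replace", "delete"):
        raise ValueError("mode must be 'replace' or 'delete'")
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        pieces.append(text[cursor:start])   # keep the text before the entity
        if mode == "replace":
            pieces.append(f"[{label}]")     # stand-in for the private value
        cursor = end                        # skip the entity itself
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_of(text: str, phrase: str, label: str) -> Span:
    """Hypothetical helper: locate a phrase by hand for this example."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


utterance = "New calendar entry: meeting with Mrs. Norton from Mycom next Tuesday."
spans = [
    span_of(utterance, "Mrs. Norton", "PER"),
    span_of(utterance, "Mycom", "ORG"),
    span_of(utterance, "next Tuesday", "DATE"),
]

replaced = transform(utterance, spans, mode="replace")
deleted = transform(utterance, spans, mode="delete")
print(replaced)
print(deleted)
```

In replace mode the sentence keeps its overall shape, which may matter if the transformed transcripts are later used to improve the system. Delete mode removes more information but leaves gaps in the text.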

Techniques for detecting NEs have long been studied because they are of great use for many different applications, not just privacy protection. Therefore, without having to reinvent the wheel, we can draw on existing approaches, which today are about as good as people at detecting NEs. State-of-the-art systems for this task employ contextual word embeddings, such as BERT, and modern neural network architectures, such as the Transformer, to reach top performance.

Here’s the same voice interaction as shown above but with the different named entities identified, highlighted and labeled:

USER: New calendar entry: meeting with Mrs. Norton [PER] from Mycom [ORG] next Tuesday [DATE].
SYSTEM: What time?
USER: The whole day [TIME]. And please look up train connections to Berlin [LOC] for that day [DATE].
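The labeled output above can be reproduced with a deliberately tiny Python sketch. Note that this is a toy stand-in: the hand-written regular expressions in `PATTERNS` cover only this one dialogue and are purely hypothetical, whereas a real system would use a trained neural tagger. The sketch is only meant to show what a detector takes as input and what it produces.

```python
import re
from typing import List, Tuple

# Toy, hand-written patterns (an assumption for illustration only).
# They recognize just the entities in the example dialogue.
PATTERNS = [
    ("PER", re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+")),
    ("ORG", re.compile(r"\bMycom\b")),
    ("LOC", re.compile(r"\bBerlin\b")),
    ("DATE", re.compile(
        r"\b(?:next (?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day|that day)\b")),
    ("TIME", re.compile(r"\b[Tt]he whole day\b")),
]


def detect(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans, sorted by position in the text."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def annotate(text: str) -> str:
    """Append ' [LABEL]' after each detected entity, as in the example."""
    pieces = []
    cursor = 0
    # The toy patterns never overlap on this dialogue, so spans are
    # simply processed left to right.
    for start, end, label in detect(text):
        pieces.append(text[cursor:end])
        pieces.append(f" [{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


dialogue = [
    "New calendar entry: meeting with Mrs. Norton from Mycom next Tuesday.",
    "What time?",
    "The whole day. And please look up train connections to Berlin for that day.",
]
for turn in dialogue:
    print(annotate(turn))
```

Swapping `detect` for a learned model would leave `annotate` unchanged, since it relies only on the span offsets and labels.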

Detecting private information in dialogues is only the first step. The question is what to do with it once it has been identified. In COMPRISE, we’ve so far compared three different approaches and studied their pros and cons. Our next blog post will tell you more about these! Stay tuned!

Developer survey: Since you are here and interested in our project, could you please spare a moment to share your concerns and answer 12 questions related to developing voice-enabled apps?