Implementation of GDPR principles in machine learning

Today, voice-enabled technologies rely on a machine learning paradigm called deep learning. Deep learning has led to major improvements in speech-to-text, spoken language understanding, and dialogue management. Before moving on to the GDPR principles, let’s briefly recall the concepts of machine learning and deep learning introduced and detailed in our previous blog posts.

Machine learning can be understood as a technique that aids Artificial Intelligence (AI) by training algorithms so that they can learn how to make decisions and predictions based on large amounts of inputs (data).

On the other hand, deep learning is a specific implementation of machine learning and is based on the use of artificial neural networks. As with humans, algorithms used in deep learning intend to compare new information to known items before making sense of it [Ref 1].
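To make the idea of "learning from data" concrete, here is a minimal, purely illustrative sketch: a single artificial neuron (the basic building block of the neural networks mentioned above) adjusts its weight and bias by gradient descent until it separates two toy classes. All data and names are hypothetical, and real deep learning systems stack millions of such units.

```python
# Minimal sketch of "learning from data": a single artificial neuron
# (logistic unit) trained by gradient descent on toy, illustrative data.
import math

def train(samples, labels, epochs=1000, lr=0.5):
    """Fit a weight and bias so the neuron separates the two classes."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid activation
            grad = pred - y                          # prediction error
            w -= lr * grad * x                       # adjust weight
            b -= lr * grad                           # adjust bias
    return w, b

def predict(w, b, x):
    """Classify a new input with the trained neuron."""
    return 1 if 1 / (1 + math.exp(-(w * x + b))) >= 0.5 else 0

# Toy training data: negative inputs belong to class 0, positive to class 1.
w, b = train([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

The key point for the GDPR discussion that follows is that the learned parameters (`w`, `b`) are derived from the training data, which is why the choice and handling of that data matters legally as well as technically.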

Machine learning and voice-enabled technologies

Speech recognition, also known as speech-to-text, recognises speech and converts it into a readable form. The term “speech recognition” is sometimes used in a broad sense, encompassing natural language understanding (i.e., the ability of a machine/program to receive and interpret dictation or to understand and carry out spoken commands [Ref 1]), which is often based on deep learning.

Voice recognition or speaker recognition, by contrast, aims to identify the person speaking. It works by scanning the aspects of speech that differ from one individual to another (voice physiology), including accent, speaking style and voice pitch [Ref 2].
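In practice, speaker-recognition systems often reduce an utterance to a fixed-length feature vector (a "voice print") and compare it against an enrolled one. The sketch below assumes hypothetical three-dimensional vectors and an illustrative similarity threshold; real systems use high-dimensional embeddings learned by deep networks.

```python
# Illustrative sketch of speaker verification by comparing
# "voice print" feature vectors. All vectors and thresholds are made up.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95  # hypothetical decision threshold

def is_same_speaker(enrolled_print, new_print):
    return cosine_similarity(enrolled_print, new_print) >= THRESHOLD

enrolled = [0.9, 0.1, 0.4]        # stored voice print (illustrative)
sample_same = [0.88, 0.12, 0.41]  # new utterance, same speaker
sample_other = [0.1, 0.9, 0.2]    # utterance from a different speaker
```

Because such a vector can single out an individual, it is precisely the kind of derived information that can make voice an identifier under the GDPR, as discussed below.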

Both speech and voice recognition require large volumes of data to train algorithms. Such amounts of data can be considered as big data, which has important implications in relation to the enforcement of the GDPR, among other EU laws.

In this sense, voice can be considered an identifier if the signal itself, its content, or any information that can be derived from it makes it possible to identify the speaker. But how does this relate to the principles established in the GDPR?

GDPR principles

The principles tied to the processing of personal data, covered in Article 5 of the GDPR, apply to the processing of any information concerning an identified or identifiable natural person. As we shall see below, machine learning techniques may struggle to fulfil the provisions established in these principles.

Lawfully, fairly and in a transparent manner  

According to Article 5.1 a) of the GDPR, “personal data shall be processed lawfully, fairly and in a transparent manner in relation to the data subject”.

Machine learning plays a leading role in speech recognition technologies. When an algorithm is trained with personal data, the resulting model could be incorrect or discriminatory if biased or irrelevant data are used [Ref 3], which would be contrary to the fairness principle stated in the GDPR.

Thus, models must be trained with correct and relevant data and must learn not to emphasize information related to gender, ethnic origin, beliefs, sexual orientation or other characteristics that could lead to discriminatory treatment.
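One simple (and, on its own, incomplete) mitigation is to strip special-category attributes from training records before the model ever sees them. The sketch below assumes hypothetical field names; note that removing these fields does not by itself guarantee fairness, since other features can act as proxies for them.

```python
# Hedged sketch: drop special-category attributes from training records.
# Field names are hypothetical. Stripping fields is NOT sufficient on its
# own, because remaining features may still correlate with (proxy for) them.
SENSITIVE_FIELDS = {"gender", "ethnic_origin", "beliefs", "sexual_orientation"}

def strip_sensitive(record):
    """Return a copy of the record without special-category fields."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

raw = {
    "pitch_hz": 180.0,          # acoustic feature used for training
    "utterance_len_s": 3.2,     # acoustic feature used for training
    "gender": "f",              # special-category attribute to exclude
    "ethnic_origin": "x",       # special-category attribute to exclude
}
clean = strip_sensitive(raw)
```

A fuller approach would also audit the trained model's outputs for disparate impact, not only its inputs.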

Moreover, deep learning may raise challenges on how to satisfy the transparency principle and to properly inform data subjects about the processing of their personal data. This is due to the fact that deep learning works as a “black box”, which means that sometimes it is extremely difficult or impossible to explain how information is correlated and weighted in a particular process.

Collected for specified, explicit and legitimate purposes

According to Article 5.1 b) of the GDPR, “personal data shall be collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes”.

In this respect, when developing AI algorithms, it might be challenging to define the purpose of the processing of personal data, as it may not be possible to predict what the algorithm will learn. Besides, when personal data are used to train an algorithm, it may be difficult to explain the purpose, since on many occasions humans cannot understand the trained model [Ref 3].


Data minimisation

Article 5.1 c) of the GDPR states that “personal data shall be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed”.

Machine learning involves the collection of large amounts of data that, in many cases, remain unused [Ref 4]. As previously explained, black boxes not only make it impossible to predict how algorithms will learn but also complicate the task of defining the purpose of the processing, which may also change as the machine learns. Consequently, it remains unclear which data will or will not be necessary for training the algorithm [Ref 3].

Furthermore, the data minimisation principle also seeks to restrict the extent of the intervention in the privacy of the data subject, avoiding disproportionate interferences. In this sense, the data controller should examine the intended area of application of the model and consider how to achieve the objective in a way that is the least invasive for the data subject [Ref 3].
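In engineering terms, one way to approximate data minimisation is to declare, per processing purpose, which fields are necessary and discard everything else at collection time. The purpose names and field lists below are purely illustrative assumptions, not a legal compliance recipe.

```python
# Sketch of data minimisation at collection time: keep only the fields
# declared necessary for the stated purpose. Purposes and fields are
# illustrative; a real mapping would come from a documented DPIA/policy.
PURPOSE_FIELDS = {
    "speech_to_text": {"audio_id", "audio_samples", "language"},
}

def minimise(record, purpose):
    """Drop every field not declared necessary for this purpose."""
    allowed = PURPOSE_FIELDS[purpose]
    return {k: v for k, v in record.items() if k in allowed}

collected = {
    "audio_id": "a1",
    "audio_samples": [0.1, 0.2, 0.3],
    "language": "en",
    "location": "Berlin",        # not needed for transcription
    "contact_email": "x@y.example",  # not needed for transcription
}
kept = minimise(collected, "speech_to_text")
```

The design choice here is an allow-list rather than a block-list: anything not explicitly justified for the purpose is dropped by default, which maps more naturally onto the "limited to what is necessary" wording.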


Accuracy

Article 5.1 d) of the GDPR states that “personal data shall be accurate and, where necessary, kept up to date”.

When data are processed massively, a certain amount of inaccurate personal data may be tolerated when the model is trying to represent general trends. However, this may become a problem when personal data are processed for profiling purposes, as the use of inaccurate data may lead to wrong predictions that, in some contexts, may have adverse effects on data subjects [Ref 4].

Storage limitation

Lastly, according to Article 5.1 e) “personal data shall be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which personal data are processed”.

In this regard, machine learning’s ability to process large volumes of data may encourage data controllers to keep records beyond the period required for the original purpose, contravening the storage limitation principle established in the GDPR.
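Operationally, storage limitation is often enforced with per-purpose retention periods and a periodic job that flags expired records for deletion or anonymisation. The retention period, purpose name, and record layout below are illustrative assumptions.

```python
# Sketch of a retention check: records kept longer than the retention
# period declared for their purpose are flagged for deletion or
# anonymisation. The 365-day period and record fields are hypothetical.
from datetime import datetime, timedelta

RETENTION = {"model_training": timedelta(days=365)}  # illustrative policy

def expired(record, now):
    """True if the record has exceeded its purpose's retention period."""
    return now - record["collected_at"] > RETENTION[record["purpose"]]

now = datetime(2024, 6, 1)
records = [
    {"id": 1, "purpose": "model_training", "collected_at": datetime(2022, 1, 1)},
    {"id": 2, "purpose": "model_training", "collected_at": datetime(2024, 1, 1)},
]
to_delete = [r["id"] for r in records if expired(r, now)]
```

A real deployment would also cover data already baked into trained models, which is harder: deleting the source records does not by itself remove what the model learned from them.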
