Joren Verspeurt


Making your AI projects GDPR-compliant: how to get started

Disclaimer: the author is not a lawyer. This article does not constitute legal advice.

Why should you think about GDPR?

As an AI agency, we do projects for clients in many different sectors. In these projects, we handle all kinds of data, including personal data. That means we have to comply with the European General Data Protection Regulation, commonly known as the GDPR. Is it possible to cut through the fog of uncertainty and doubt around the GDPR and inspire trust in modern AI solutions at the same time? We think it is!

There are many ways in which compliance is integrated into our ways of working and company culture, including regular education and training, procedures to be followed before, during, and after a project, etcetera. While developing these processes and interacting with clients who come to us for personal data AI projects, we’ve noticed that some aspects of the GDPR and specifically the way it interacts with modern machine learning practices aren’t always as clear as they could be.

What you’ll learn here

There are many resources available about the GDPR in general, but not that many about how it applies to AI projects specifically. That's why we're releasing a series of blog articles about the GDPR and AI.

In this article, we’re going to introduce you to the core principles of the GDPR and how they apply to AI projects. This way, you’ll be better equipped to think about personal data issues when you encounter them during a personal data AI project.

In future articles, we’re going to continue by telling you about purposes and bases, special categories of data and how to treat them, and how to integrate ways of letting data subjects exercise their rights into an Agile AI project.

Your GDPR basics: 7 principles

At an abstract level there are 7 core principles behind the legislation, as laid out in Article 5 of the GDPR:

1) Lawfulness, fairness, and transparency

You must have a lawful basis for processing (more about this in a future article, or read up here). Your processing shouldn’t have any unjustified adverse effects on data subjects, and you should be clear, open, and honest with people about the kind of processing you do.

How does it apply to an AI project? 

For example, if you make a product that includes some kind of facial recognition model as part of its functionality, you should include a high-level description of how facial images are processed and what kind of data your recognition algorithm was trained on, to inform consumers about possible bias. Several approaches to this kind of disclosure are being developed, including Model Cards by Google, but there are no real standards yet.
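To make this concrete, here is a minimal sketch of what such a disclosure record could look like in code. The field names and the `ModelCard` class are our own illustration, not the actual Model Cards schema or any standard:

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """A minimal, hypothetical model-card record for transparency.

    Field names are illustrative, not an established schema.
    """
    model_name: str
    intended_use: str
    training_data_description: str
    known_limitations: list = field(default_factory=list)

    def summary(self) -> str:
        # Render a short, human-readable disclosure for end users.
        lines = [
            f"Model: {self.model_name}",
            f"Intended use: {self.intended_use}",
            f"Trained on: {self.training_data_description}",
        ]
        lines += [f"Limitation: {lim}" for lim in self.known_limitations]
        return "\n".join(lines)


card = ModelCard(
    model_name="face-matcher-v1",
    intended_use="1:1 face verification for account login",
    training_data_description="licensed face images, mostly adults",
    known_limitations=["accuracy not evaluated on children's faces"],
)
```

Even something this simple, published alongside the product, goes a long way towards the "clear, open, and honest" part of the first principle.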

2) Purpose limitation

You can only use data for the purposes you have a legal basis for and this needs to be documented as part of your development process.

How does it apply to an AI project? 

For example, if you received a user’s age and approximate location as part of their sign-up for an application because you need that information to provide the services included in that application, you can’t just use it to analyse which regions have the right user demographic to offer some other services to. Even if you aggregate these statistics to the point where they can’t be considered personal data any more, that processing in itself is not covered by the data’s original purpose. So selling aggregated user data to other companies without users’ knowledge is also out.
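One way to keep this check from depending on people's memory is to make the documented purposes machine-readable and verify them at the point of use. This is a sketch under our own assumptions; the registry contents and the `check_purpose` helper are hypothetical, not part of any library:

```python
# Hypothetical purpose registry: for each personal-data field, the set of
# documented purposes it was collected for. In a real project this would
# come from your processing records, not a hard-coded dict.
ALLOWED_PURPOSES = {
    "age": {"service_provision"},
    "approx_location": {"service_provision"},
}


def check_purpose(fields, purpose):
    """Raise if any field is about to be used for an undocumented purpose."""
    for f in fields:
        if purpose not in ALLOWED_PURPOSES.get(f, set()):
            raise PermissionError(f"{f!r} may not be used for {purpose!r}")
    return True


# Covered by the original purpose: fine.
check_purpose(["age", "approx_location"], "service_provision")

# Market analysis was never documented as a purpose for these fields,
# so this call would raise a PermissionError.
```

Running such a gate in your data pipeline turns purpose limitation from a policy document into something a failing build can enforce.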

3) Data minimization

You should identify the minimum amount of personal data you need for your purpose, as part of your design, and not use more than that.

How does it apply to an AI project? 

As machine learning engineers, we often feel the urge to collect more data than we need "just in case", but this is the wrong approach. Simply throwing more data at a problem doesn't solve it, and in some cases it can even be harmful. Chosen wisely, less data can make your model more robust. That is exactly what this principle motivates us to do: think about which data we need and why, and not collect or keep any more.
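In practice this can be as simple as declaring the required fields up front and dropping everything else at ingestion time. A minimal sketch, with field names invented for illustration:

```python
# Fields the documented purpose actually requires; decided at design time.
REQUIRED_FIELDS = {"age_bucket", "region"}


def minimize(record: dict) -> dict:
    """Keep only the fields the documented purpose requires."""
    return {k: v for k, v in record.items() if k in REQUIRED_FIELDS}


raw = {
    "age_bucket": "25-34",
    "region": "BE",
    "email": "someone@example.com",  # not needed for this purpose
    "device_id": "abc123",           # not needed for this purpose
}
assert minimize(raw) == {"age_bucket": "25-34", "region": "BE"}
```

Filtering at the point of ingestion, rather than during a later cleanup, means the extra fields never enter your systems in the first place.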

4) Accuracy

You should have checks in place to ensure that the personal data you process is not incorrect or misleading. How often you should trigger these checks depends on the type of data and the kind of processing you do.
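Such checks can live directly in your data pipeline. The rules below are purely illustrative; real checks depend on your data and your processing:

```python
def validate_record(record: dict) -> list:
    """Return a list of accuracy problems found in one data record.

    The specific rules here are examples, not a general recipe.
    """
    problems = []
    age = record.get("age", -1)
    if not 0 < age < 130:
        problems.append("age out of plausible range")
    if record.get("collected_at") is None:
        problems.append("missing collection timestamp")
    return problems


# A clean record produces no problems; a broken one lists what's wrong.
assert validate_record({"age": 34, "collected_at": "2020-05-01"}) == []
```

Recording when each value was collected (as the sketch requires) also helps with the "was it correct at the time?" question discussed below.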

How does it apply to an AI project? 

If you use personal data to train some kind of model, accurate training data matters for obvious engineering reasons. Accuracy in the GDPR sense, however, is a bit more subtle: it is about the data correctly describing the data subjects, not about model performance. For example, if you collected data about specific data subjects in a particular year and trained a model that is only supposed to apply to data from that year, it's perfectly fine that the data is no longer current. As long as it was correct at the time, there's no problem.

An interesting question arises when you find out you’ve trained a model that was based on inaccurate data. Are you then forced to retrain that model because of the inaccuracy of that single data point, even if there’s no indication that it would cause the model to make biased predictions or cause any other harm?

Because this is a principle that needs some more explanation and has many interesting consequences for machine learning we’ll go deeper into it in a future article.

5) Storage limitation

You can keep personal data only for as long as you need it for a legitimate purpose, and no longer. The retention time for every piece of data should be finite and set beforehand if possible.

How does it apply to an AI project? 

For example, if you know you’re not going to need training data for a model any more after the final model is trained or you can abstract away some of the details, you have a responsibility to remove what is no longer needed. How long data can be kept is something you should think about before you start training, and it may influence your design and how you plan your projects.
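Setting the retention period beforehand means it can be enforced mechanically. A minimal sketch, assuming a single illustrative retention period and records that carry a `stored_at` timestamp:

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention period, decided before training starts.
RETENTION = timedelta(days=365)


def expired(stored_at: datetime, now: datetime = None) -> bool:
    """True if a record has outlived its retention period."""
    now = now or datetime.now(timezone.utc)
    return now - stored_at > RETENTION


def purge(records: list, now: datetime) -> list:
    # Keep only records still within retention; everything else is dropped.
    return [r for r in records if not expired(r["stored_at"], now)]
```

A scheduled job calling something like `purge` over your training stores is a far more reliable safeguard than remembering to clean up after each project.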

6) Integrity and confidentiality

You must ensure that you have security measures in place to protect the personal data you hold, where the measures are proportionate to the risk the personal data would pose if abused.

How does it apply to an AI project?

Usually, this just means securing the storage for your personal data, but there are numerous other ways in which your models or the APIs you serve them with can be compromised. A whole field of study has sprung up around “attacks” on Machine Learning Models. Attacks in this sense are ways of abusing them to:

  • return an unexpected result, like a facial recognition system not registering a face at all;
  • reveal information about their training data, for example whether an individual was or wasn't included in the dataset.
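One commonly discussed (and deliberately simple) mitigation against the second kind of attack is to limit how much detail your prediction API exposes: returning only the top label instead of the full probability vector gives an attacker less signal to work with. A sketch, with the function name and payload shape our own invention:

```python
def predict_for_api(probabilities: dict) -> dict:
    """Return only the top label, not the full probability vector.

    Limiting output detail is one commonly discussed mitigation against
    model-extraction and membership-inference attacks. It is not a
    complete defence on its own, just one layer.
    """
    top = max(probabilities, key=probabilities.get)
    return {"label": top}


assert predict_for_api({"cat": 0.7, "dog": 0.3}) == {"label": "cat"}
```

Rate limiting, query monitoring, and (where appropriate) differentially private training are further layers; we'll cover the attack landscape properly in the future article mentioned below.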

Because this is also a topic that deserves a bit more attention, we will also come back to this in a future article.

7) Accountability

You need to take responsibility for how you handle personal data and have appropriate measures and records in place to demonstrate your compliance.

How does it apply to an AI project? 

For example, this means that if you make important decisions about how to treat personal data, you should record them somewhere, so you can show them to your client or an auditor later if requested.
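That record can be lightweight. A hypothetical sketch of a decision log, with the entry fields chosen for illustration:

```python
from datetime import datetime, timezone


def record_decision(log: list, decision: str, rationale: str) -> dict:
    """Append a timestamped data-protection decision to an audit log."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "rationale": rationale,
    }
    log.append(entry)
    return entry


decisions = []
record_decision(
    decisions,
    decision="store only coarse location (city level)",
    rationale="full GPS traces not needed for the documented purpose",
)
```

Whether the log lives in a wiki, a ticket system, or a file in the repository matters less than the habit of writing the decision and its rationale down at the moment it's made.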

What exactly does “processing” mean?

Let’s clear up some misconceptions. We’ve mentioned “processing” a couple of times already, a term that’s used quite often in the GDPR context. In principle, it should be well-understood by everyone reading this, but some misconceptions about it still exist.

A common misconception is that you need to be doing some kind of computation on the data for it to count as processing. This is incorrect. Anything you do to the data, including storing it, sending it over a network, converting it between formats, … is considered a form of processing. There are some cases where simple storage is treated differently from other forms of processing, but in general, the same rules apply.

So “we’re just storing this data, we’re not actually doing anything with it yet” isn’t a reason to slack on data protection.

We need to go deeper

Now that you’re up to speed on the basics about the 7 core principles of the GDPR, let’s look at some of them in more detail, applied to AI projects, in the next blog post. We’ll give you a better understanding of Fairness, Accuracy, and Security/Integrity with concrete cases and more detailed explanations. Stay tuned!