Making your AI projects GDPR-compliant: principles explained
Disclaimer: the author is not a lawyer. This article does not constitute legal advice.
This article is the second in a series. It stands on its own, but we recommend reading the previous article first.
Where we left off
In the previous article, we introduced the seven basic principles behind the GDPR. We mentioned that some principles would require a bit more explanation to help you understand what they’re all about. In this article, we’ll do just that, giving you a broader perspective and some practical insights with concrete examples and case studies.
Come for compliance, stay for the model improvements
This is an essential principle for AI and machine learning, and it's more compatible with modern ML than you may think. While some subfields like deep learning have tended towards ever-larger datasets, there is also a countermovement that focuses on using as little data as possible to achieve the same goal. There are good reasons to minimize the amount of data you use, such as increased robustness, as long as you use the right techniques. Many models can even be confused by redundant data.
Decision-tree-based methods like gradient-boosted machines (GBMs), for example, tend not to benefit from ever-larger datasets. This is why it's important to choose which data you include, and which data you never even collect, based on domain expert knowledge. There are even techniques to find out which data contributes the most to the final result.
Most machine learning models offer a way to compute a variable importance score for every feature. This way, you can even remove unnecessary data after the fact.
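As a minimal sketch of this idea, permutation importance is one widely used technique for scoring features after training; a feature whose shuffled values barely change the model's score contributes little and is a candidate for removal. The dataset, model choice, and the 0.01 threshold below are assumptions for illustration, not a recommendation.

```python
# Sketch: use permutation importance to find features that could be
# dropped (dataset, model, and threshold are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance measures how much the test score drops when a
# feature's values are shuffled; near-zero means the feature is unused.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
droppable = [i for i, imp in enumerate(result.importances_mean)
             if imp < 0.01]
print("Candidate features to drop:", droppable)
```

In a real project you would validate such a decision with a domain expert before removing anything, since a low score on one test set doesn't prove the feature is useless everywhere.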
First, do no harm
As we mentioned earlier, to comply with this principle you need to make sure that your processing doesn't unjustly cause harm to a data subject, but also that you don't unjustly treat data subjects differently. A system that discriminates against minorities is unfair even if it doesn't cause any direct harm.
That is why this principle is also related to machine learning bias, which is a topic that is also very important for us and about which we’ve already published some other content. You can check that out here.
The tension between fairness and minimisation
There's an interesting challenge that arises from the need to check for model bias, however. As mentioned in those articles, there are algorithmic ways of checking that your model isn't biased with respect to some sensitive attribute, and even of actively preventing such bias. Still, these methods require you to retain that sensitive attribute in your training and/or test data. What's more, these sensitive attributes tend to belong to the GDPR Special Categories, as mentioned in Article 9! Ethnic origin, religious beliefs, sexual orientation, and medical conditions are typical sources of bias, especially in domains like human resources and hiring, and are all included in the special categories. This means that in most cases, you need consent to process them.
If we think about the principle of data minimisation, we might assume that it's best to take the passive approach and simply not include this data, but this has been shown to be the wrong approach. Much has been written about this subject in the research literature, but in summary, the data may contain so-called proxies for the sensitive attribute: other information the model can use to infer its value, so the result is still biased. It can therefore be necessary to take active measures to prevent bias, such as model fairness metrics. For example, suppose we have a system that classifies individuals as financially responsible. We could check that the gender distribution among people classified as more or less responsible doesn't differ too much from the gender distribution in the general population. Note, however, that if there is a genuine difference in financial responsibility between men and women, this way of reducing bias can itself introduce bias if not used carefully.
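The check described above can be sketched as a demographic parity test: compare the rate of positive classifications across groups. The toy predictions, group labels, and any acceptable-gap threshold below are assumptions for illustration only.

```python
# Sketch of a demographic-parity check: compare the positive-prediction
# rate across groups (data and labels are illustrative assumptions).
import numpy as np

def demographic_parity_gap(y_pred, sensitive):
    """Largest difference in positive-prediction rate between groups."""
    rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
    return max(rates) - min(rates)

# Toy data: 1 = classified as "financially responsible"
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
gender = np.array(["m", "f", "m", "f", "f", "m", "m", "f", "f", "m"])

gap = demographic_parity_gap(y_pred, gender)
print(f"Demographic parity gap: {gap:.2f}")
# A gap close to 0 suggests both groups receive positive
# classifications at similar rates.
```

Note that running such a check requires keeping the sensitive attribute, which is exactly the tension with data minimisation described above.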
An opportunity, not a burden
This principle is the best example of how the GDPR can be an opportunity instead of a burden. Being transparent doesn't mean revealing your important trade secrets; on the contrary, there is no requirement to do so, only to give a high-level explanation of your process and the way you treat personal data. This, in turn, is an excellent opportunity for some positive PR. Communicating about data protection is yet another way to show your social responsibility, to connect with consumers and show them you're an ethical and trustworthy organisation.
Be mindful to use clear and easy-to-understand language when explaining things. All data subjects have the right to be informed about what is done with their personal data, not only those with college degrees. Being inclusive can only help your company or brand develop a positive image.
Well, it depends
This is a tricky one. It's also one of the principles where the name might be misleading, because it means something subtly different in the GDPR context than in general usage. You might, for example, assume that the need for data to be accurate requires it to be kept up-to-date in all cases. This is not true. It depends on the use case: if you require data that was true about a specific person at a certain point in time, then there's no requirement to change it over time. The same goes for accuracy in the numerical sense; there's no obligation to measure something as accurately as possible if a less accurate measurement is sufficient and not misleading or bias-inducing.
If there is a good reason why you should keep training data for a model up-to-date, however, this can cause other interesting problems. An obvious question to ask is what needs to happen with a model that was trained on data that has since been updated. Could the model now also be regarded as "inaccurate"? Do you have an obligation to update it as well? This question becomes especially important when the model in question is used to aid in decision making about people, specifically if decisions are made about the data subjects whose data has been updated. Perhaps whatever the model learned can still be seen as valid "in aggregate", but a negative impact on the data subjects cannot be ruled out with any certainty.
As with most GDPR-related questions, these need to be looked at on a case-by-case basis, and if an important decision is made, it needs to be recorded.
Integrity and Confidentiality
Keep it secret, keep it safe
Most of the work to be done for this principle tends to fall under information security, which we won’t go into here, but there are some other concerns related to this principle that are specific to AI.
There has been a recent explosion in the amount of literature published about attacks on machine learning models. Some attacks aim to deceive the model: a famous example from computer vision is adding well-chosen noise to an image so that an object recognition system classifies it as the wrong kind of object. But it's also possible to make a model reveal something about its training data, and that is where it gets dangerous from a GDPR perspective.
An attack on a model where variables from the training set about specific subjects are recovered using model outputs and some other information about the subjects is called a model inversion attack. For example, Matt Fredrikson and others were able to build a recognizable picture of a victim’s face using a publicly available facial recognition API, knowing only the victim’s name and the fact that their face was included in the training dataset for the facial recognition model.
The inclusion determination
Sometimes just being able to determine whether someone was included in the training dataset at all is a serious problem. For example, if you know that everyone in a specific model's training dataset has a specific medical condition, and some other information about a person allows you to determine that they were in the training set, then you can also determine that they have the medical condition, which is a breach of special category personal data!
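This kind of attack is known in the literature as membership inference. A minimal sketch of the underlying intuition, under the assumption of a deliberately overfit model: models are often noticeably more confident on examples they were trained on, so an attacker who can query prediction confidences can guess membership. The dataset and model below are illustrative assumptions, not a description of any real attack.

```python
# Sketch of the confidence gap that membership inference exploits:
# an overfit model is more confident on its own training points
# (dataset and model choice are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
# "Members" were used for training; "non-members" were not.
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5,
                                            random_state=0)

# Deliberately overfit so that members stand out.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_in, y_in)

def confidence(model, X):
    """Model's confidence in its own prediction for each row."""
    return model.predict_proba(X).max(axis=1)

print("mean confidence, members:    ", confidence(model, X_in).mean())
print("mean confidence, non-members:", confidence(model, X_out).mean())
```

An attacker thresholding on this confidence gap can guess membership better than chance, which is exactly why the gap itself is a privacy signal worth measuring before deployment.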
There are countermeasures to this type of attack. However, as it is still a very active field of academic inquiry, many of these measures are still very new and require specific expertise to implement.
The future of GDPR in AI projects
Meeting in the middle
Applying GDPR to AI is an evolving field, with many exciting discussions still to be had. Nevertheless, we believe that GDPR is an opportunity to do better and more ethical AI, not just another regulatory burden. It may require extra work and increase costs in the beginning, but through the focus on transparency, fairness, and efficiency, the pay-off will be well worth it, both for consumers and companies.
As something to think about, we’d like to leave you with this piece of wisdom from a conversation one of us had with a member of the European Parliament: lawyers and engineers have opposing motivations when it comes to specifying things. For an engineer, a specification needs to be just precise enough, so only unimportant details are left for those implementing them. For a lawyer, a specification needs to be just vague enough, so only the essentials are set in stone, and there is enough room for interpretation by those implementing them. Hopefully, GDPR can serve as an example of something where lawyers and engineers can meet in the middle.
In the next article, we’ll continue covering the fundamentals of GDPR and how they apply to AI projects. We’ll look at how to clearly define for which purposes you’re using personal data, as well as legal bases that cover these purposes. We’ll also learn more about special categories of data and how to treat them.