AI and Data Protection: Training Data – The Invisible Foundation

28 April 2026
Gernot Fritz, Tanja Pfleger

The legal risks of AI systems do not arise at the point of use – they arise much earlier. Not at deployment, not at the prompt or output stage, but at a phase that still receives surprisingly little attention in many projects: training.

This is where the foundations are laid. Which data is used, how it is obtained, and under which assumptions it is processed – all of this shapes not only a model’s performance, but also its regulatory risk. If training data is misunderstood or underestimated, the system is built on a foundation that can hardly be corrected later.

Training as a legal starting point

From a technical perspective, training is the process by which a model identifies patterns in large volumes of data. From a legal perspective, it is something different: a large-scale processing activity that often involves personal data.

The key question is not whether data is “public”, but whether it relates to identifiable individuals. This is where a fundamental tension arises: data that is freely available on the internet does not cease to be personal data. Its use in training is therefore subject to the requirements of the General Data Protection Regulation.

This gap between technical availability and legal classification is not incidental – it is structural. AI systems scale through data. Data protection law limits exactly that scaling.

Inherent risks of training

The legal tensions do not end with the question of whether data is personal. They are embedded in the training process itself – in the way AI systems absorb, weigh, and translate data into models.

Training is not merely about processing. It involves selection and structuring. Which data is included, in what form it exists, and which patterns are derived from it all shape how the system will behave later on. It is at this stage that risks arise which can hardly be corrected in the system’s later lifecycle.

A central example is bias. Training data rarely reflects a neutral reality. It mirrors existing distributions, preferences, and inequalities. Models trained on such data inherit these structures – and may even amplify them. Discrimination is therefore not usually the result of deliberate choices, but of statistical relationships.
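
To make the mechanism concrete, here is a minimal sketch with hypothetical data: a model "trained" on skewed historical decisions does nothing more than estimate the statistical relationships in that history – and so reproduces the skew.

```python
# Minimal sketch (illustrative, hypothetical data): a model trained on
# historical decisions inherits the skew embedded in those decisions.
from collections import Counter

# Hypothetical training records: (group, past_decision)
history = [("A", "approve")] * 80 + [("A", "reject")] * 20 \
        + [("B", "approve")] * 40 + [("B", "reject")] * 60

# "Training" here is simply estimating per-group approval rates --
# the statistical relationship the model will carry forward.
rates = {}
for group in ("A", "B"):
    outcomes = Counter(d for g, d in history if g == group)
    rates[group] = outcomes["approve"] / sum(outcomes.values())

print(rates)  # {'A': 0.8, 'B': 0.4} -- the imbalance survives training
```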

There is also a structural lack of transparency. Individuals are generally unaware whether and to what extent their data has been used in training processes. In practice, they have no meaningful way to influence this use. The processing remains abstract; its effects only become visible later.

From a technical perspective, it also becomes clear that scale alone is not a quality indicator. Models are meant to identify patterns that generalise beyond the specific dataset. Where this fails, the model overfits: what has been learned becomes overly tied to the original data. The system may perform consistently within familiar scenarios but lose reliability when faced with new ones.
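
A short illustration of this failure mode, assuming scikit-learn is available: an unconstrained decision tree fits its training data almost perfectly, yet generalises noticeably worse to held-out data.

```python
# Minimal sketch, assuming scikit-learn: an unconstrained model fits its
# training data almost perfectly but performs worse on unseen data --
# the "overly tied to the original data" failure described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print("train accuracy:", model.score(X_tr, y_tr))  # ~1.0: memorised
print("test  accuracy:", model.score(X_te, y_te))  # noticeably lower: overfit
```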

An additional phenomenon reinforces this concern: information from training data can persist within the model – known as memorisation – and, under certain conditions, reappear in outputs. The assumption that data simply dissolves during training and loses its individuality is therefore misleading. A connection remains – one that cannot be fully controlled.
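
The effect can be demonstrated even with a toy next-word model and an invented e-mail address: verbatim fragments of the training text survive "training" and can resurface in generated output. Large models exhibit an analogous, harder-to-audit behaviour.

```python
# Minimal sketch (toy model, hypothetical data): even a trivial next-word
# model retains verbatim fragments of its training text, so personal data
# can resurface in outputs.
import random
from collections import defaultdict

corpus = ("contact jane.doe@example.com for details . "
          "contact the support team for details .").split()

# "Train": record which word follows which.
follows = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    follows[a].append(b)

# "Generate": sample a continuation from the learned transitions.
random.seed(1)
word, out = "contact", ["contact"]
for _ in range(5):
    word = random.choice(follows[word])
    out.append(word)
print(" ".join(out))  # may reproduce the e-mail address verbatim
```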

All of this shows that training data is more than a technical input. It defines the structure of the system – and with it, the limits of its legal controllability.

The structural conflict: scaling vs. purpose limitation

Training thrives on volume and diversity. Data protection law is built on limitation and purpose specification.

This is not a minor issue, but a fundamental tension: the broader and more heterogeneous the training data, the more powerful the model. At the same time, it becomes increasingly difficult to define a clear purpose and a robust legal basis.

This tension is particularly evident in large-scale web scraping. The assumption that “public data” can be freely used falls short. The real question is whether individuals could reasonably expect their data to be used for AI training. In many cases, the answer will be no.

Legal bases in the training context – and their limits

The traditional range of legal bases is well known. Their practical viability in the training context is far less clear.

Consent typically fails at scale. It requires informed, granular, and revocable agreement – conditions that are difficult to reconcile with large training datasets.

Contractual bases are only helpful where data is used in a targeted way within clearly defined service relationships. For general training purposes, they are usually insufficient.

Legitimate interest therefore often remains the only realistic option. Yet even here, the analysis becomes more demanding: the broader the dataset, the more difficult the balancing exercise. This is especially true where data is processed without a direct link to the individual, but may nonetheless have far-reaching effects through the model.

The result is a paradox: the most powerful training approaches are often the most legally fragile.

The situation becomes even more complex when special categories of personal data are involved. In such cases, the usual balancing mechanisms are no longer available. Processing requires a separate legal basis subject to significantly stricter conditions. Approaches relying on research-related grounds typically require a benefit that goes beyond purely commercial objectives – a threshold that is often difficult to meet in practice.

Anonymisation – a fragile state

A seemingly straightforward solution lies in anonymisation. If training data is no longer personal data, the GDPR no longer applies.

In practice, however, this path is narrower than it appears. Anonymisation is not a fixed state, but a contextual assessment. Data may appear anonymous to one party, while remaining identifiable to another. Particularly with large, linkable datasets, the risk of re-identification cannot be fully excluded.

For training purposes, this means that “anonymised data” is often only stable under certain assumptions. These assumptions must be documented, tested, and, if necessary, defended.
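
One way to test such assumptions is a k-anonymity spot check, sketched below with pandas and hypothetical records. The choice of quasi-identifiers is itself an assumption that must be documented.

```python
# Minimal sketch, assuming pandas: a k-anonymity spot check over assumed
# quasi-identifiers. A low minimum group size flags re-identification risk.
import pandas as pd

df = pd.DataFrame({  # hypothetical "anonymised" training records
    "age_band": ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "postcode_area": ["W1", "W1", "EC2", "EC2", "N1"],
    "label": [1, 0, 1, 1, 0],
})

quasi_identifiers = ["age_band", "postcode_area"]  # an assumption to defend
k = df.groupby(quasi_identifiers).size().min()
print(f"minimum group size k = {k}")  # k == 1 -> a unique, linkable record
```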

Synthetic data – solution or new layer of complexity?

Against this backdrop, synthetic data is gaining importance. Instead of using real datasets, artificial data is generated to replicate the statistical properties of real data without directly relating to identifiable individuals.
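
In its simplest form, the idea looks like this (hypothetical data; real generators are far more sophisticated): new records are sampled from distributions fitted to the real dataset, so the statistics survive even though no row is copied.

```python
# Minimal sketch (hypothetical data): "synthetic" records sampled from the
# marginal distributions of a real dataset -- similar statistics, yet no
# row is a real individual's record. Real generators are more sophisticated.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=[40, 3000], scale=[10, 800], size=(1000, 2))  # age, income

mean, std = real.mean(axis=0), real.std(axis=0)
synthetic = rng.normal(loc=mean, scale=std, size=(1000, 2))  # fitted marginals
print(synthetic.mean(axis=0).round(1))  # tracks the real means
```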

The approach is appealing. It promises scalability without an immediate link to individuals. In practice, however, it often shifts the problem rather than solving it.

Synthetic data is only as “synthetic” as its source. Where it is generated based on personal data, the question remains to what extent those underlying data points continue to be legally relevant. A second issue arises from the fact that synthetic data may still allow inferences about real individuals, depending on the model and generation method.
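
One common heuristic for the second issue is a distance-to-closest-record check: synthetic records that sit suspiciously close to real ones may effectively copy individuals. A minimal sketch with NumPy and hypothetical data:

```python
# Minimal sketch (hypothetical data): a distance-to-closest-record (DCR)
# check, one heuristic for whether synthetic records sit suspiciously close
# to the real individuals they were generated from.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 3))            # stand-in for real records
synthetic = real[:5] + rng.normal(scale=0.01, size=(5, 3))  # leaky generator

for s in synthetic:
    dcr = np.linalg.norm(real - s, axis=1).min()  # distance to closest record
    print(f"DCR = {dcr:.3f}")  # near-zero values suggest copied individuals
```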

Synthetic data is therefore not a free pass, but a tool. Used properly, it can reduce risk. Used incorrectly, it introduces a new layer of opacity.

Training data as a governance issue

The core challenge is therefore less about individual legal questions and more about governance. Training data must be traceable, documented, and controllable.

This includes its origin, its composition, and the assumptions underlying its use. In many organisations, this transparency is lacking. Training data is collected, combined, and reused without a systematic assessment of its legal quality.
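
A minimal sketch of what such an assessment could look like in practice – a provenance record per dataset, with illustrative field names rather than a prescribed standard:

```python
# Minimal sketch: a provenance record per training dataset, so origin,
# composition, and legal assumptions are documented rather than implicit.
# Field names are illustrative, not a prescribed standard.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str
    source: str                      # where the data was obtained
    legal_basis: str                 # the asserted basis for processing
    contains_personal_data: bool
    assumptions: list[str] = field(default_factory=list)  # to test and defend

record = DatasetRecord(
    name="web_corpus_v3",
    source="public web crawl, 2025",
    legal_basis="legitimate interest (Art. 6(1)(f) GDPR)",
    contains_personal_data=True,
    assumptions=["deduplicated", "known PII patterns filtered"],
)
print(record)
```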

This is where the decisive point is reached: decisions made during training can rarely be corrected later. Models carry their data history within them – often invisibly, but with legal consequences.

Conclusion and outlook

Training is not a technical preliminary step, but the central legal starting point of AI systems. It determines whether a system rests on solid foundations or already contains future compliance risks.

Recent developments in case law on the relative concept of personal data point to a more nuanced approach: effective pseudonymisation may result in training data falling outside the scope of the GDPR for the recipient – provided that the separation between datasets and identifying information is robust and permanent. This is not a free pass, but a demanding design task.
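
Technically, such a separation can be approximated with keyed pseudonymisation, sketched below: the key is the identifying information that must be held apart from the training data. The sketch is illustrative, not a guarantee of GDPR-grade pseudonymisation.

```python
# Minimal sketch: keyed pseudonymisation with HMAC. The key is the
# "identifying information" that must be held separately and permanently
# from the training dataset; without it, the tokens are not reversible
# by the recipient.
import hmac, hashlib

SECRET_KEY = b"held-by-the-controller-only"  # never shipped with the data

def pseudonymise(identifier: str) -> str:
    """Deterministic token: stable across records, opaque without the key."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

print(pseudonymise("jane.doe@example.com")[:16])  # opaque, stable token
```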

While training data forms the foundation, risk shifts during operation. The next part will focus on input data – the data users enter into systems. This is where new, often underestimated issues arise, particularly at the intersection of control, purpose limitation, real-time processing, and risk.

Those who want to make data usable for AI must also make it legally manageable – we are happy to support you in doing so.