{"id":52448,"date":"2026-04-28T16:42:22","date_gmt":"2026-04-28T14:42:22","guid":{"rendered":"https:\/\/www.eh.at\/?p=52448"},"modified":"2026-04-28T16:59:21","modified_gmt":"2026-04-28T14:59:21","slug":"ai-and-data-protection-training-data-the-invisible-foundation","status":"publish","type":"post","link":"https:\/\/www.eh.at\/en\/ai-and-data-protection-training-data-the-invisible-foundation\/","title":{"rendered":"AI and Data Protection: Training Data \u2013 The Invisible Foundation"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"52448\" class=\"elementor elementor-52448 elementor-52447\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-e6c0927 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"e6c0927\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-909a2d1\" data-id=\"909a2d1\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-e1b6776 elementor-widget elementor-widget-text-editor\" data-id=\"e1b6776\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>28. 
April 2026<br \/><em><a href=\"https:\/\/www.eh.at\/en\/team\/gernot-fritz\/\">Gernot Fritz<\/a>, <a href=\"https:\/\/www.eh.at\/en\/team\/tanja-pfleger\/\">Tanja Pfleger<\/a><\/em><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e5d8f59 elementor-widget elementor-widget-text-editor\" data-id=\"e5d8f59\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The legal risks of AI systems do not arise at the point of use \u2013 they arise much earlier. Not at deployment, not at the prompt or output stage, but at a phase that still receives surprisingly little attention in many projects: training.<\/p><p>This is where the foundations are laid. Which data is used, how it is obtained, and under which assumptions it is processed \u2013 all of this shapes not only a model\u2019s performance, but also its regulatory risk. If training data is misunderstood or underestimated, the system is built on a foundation that can hardly be corrected later.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-28daf2d elementor-widget elementor-widget-heading\" data-id=\"28daf2d\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Training as a legal starting point<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-31ee048 elementor-widget elementor-widget-text-editor\" data-id=\"31ee048\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>From a technical perspective, training is the process by which a model identifies patterns in large volumes of data. 
From a legal perspective, it is something different: a large-scale processing activity that often involves personal data.<\/p><p>The key question is not whether data is \u201cpublic\u201d, but whether it relates to identifiable individuals. This is where a fundamental tension arises: data that is freely available on the internet does not cease to be personal data. Its use in training is therefore subject to the requirements of the General Data Protection Regulation.<\/p><p>This gap between technical availability and legal classification is not incidental \u2013 it is structural. AI systems scale through data. Data protection law limits exactly that scaling.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d3def13 elementor-widget elementor-widget-heading\" data-id=\"d3def13\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Inherent risks of training<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-61eb72a elementor-widget elementor-widget-text-editor\" data-id=\"61eb72a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The legal tensions do not end with the question of whether data is personal. They are embedded in the training process itself \u2013 in the way AI systems absorb, weigh, and translate data into models.<\/p><p>Training is not merely about processing. It involves selection and structuring. Which data is included, in what form it exists, and which patterns are derived from it all shape how the system will behave later on. It is at this stage that risks arise which can hardly be corrected in the system\u2019s later lifecycle.<\/p><p>A central example is bias. Training data rarely reflects a neutral reality. 
It mirrors existing distributions, preferences, and inequalities. Models trained on such data inherit these structures \u2013 and may even amplify them. Discrimination is therefore not usually the result of deliberate choices, but of statistical relationships.<\/p><p>There is also a structural lack of transparency. Individuals are generally unaware whether and to what extent their data has been used in training processes. In practice, they have no meaningful way to influence this use. The processing remains abstract; its effects only become visible later.<\/p><p>From a technical perspective, it also becomes clear that scale alone is not a quality indicator. Models are meant to identify patterns that go beyond the specific dataset. Where this fails \u2013 in technical terms, where the model overfits \u2013 what has been learned becomes overly tied to the original data. The system may perform consistently within familiar scenarios but lose reliability when faced with new ones.<\/p><p>An additional phenomenon, known as memorisation, reinforces this concern: information from training data can persist within the model and, under certain conditions, reappear in outputs. The assumption that data simply dissolves during training and loses its individuality is therefore misleading. A connection remains \u2013 one that cannot be fully controlled.<\/p><p>All of this shows that training data is more than a technical input. It defines the structure of the system \u2013 and with it, the limits of its legal controllability.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f27abb8 elementor-widget elementor-widget-heading\" data-id=\"f27abb8\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">The structural conflict: scaling vs. 
purpose limitation<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-94a19d6 elementor-widget elementor-widget-text-editor\" data-id=\"94a19d6\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Training thrives on volume and diversity. Data protection law is built on limitation and purpose specification.<\/p><p>This is not a minor issue, but a fundamental tension: the broader and more heterogeneous the training data, the more powerful the model. At the same time, it becomes increasingly difficult to define a clear purpose and a robust legal basis.<\/p><p>This tension is particularly evident in large-scale web scraping. The assumption that \u201cpublic data\u201d can be freely used falls short. The real question is whether individuals could reasonably expect their data to be used for AI training. In many cases, the answer will be no.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-eb69070 elementor-widget elementor-widget-heading\" data-id=\"eb69070\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Legal bases in the training context \u2013 and their limits<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1cdcdc5 elementor-widget elementor-widget-text-editor\" data-id=\"1cdcdc5\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The traditional legal bases are well known. Their practical viability in the training context is far less clear.<\/p><p>Consent typically fails at scale. 
It requires informed, granular, and revocable agreement \u2013 conditions that are difficult to reconcile with large training datasets.<\/p><p>Contractual bases are only helpful where data is used in a targeted way within clearly defined service relationships. For general training purposes, they are usually insufficient.<\/p><p>Legitimate interest therefore often remains the only realistic option. Yet even here, the analysis becomes more demanding: the broader the dataset, the more difficult the balancing exercise. This is especially true where data is processed without a direct link to the individual, but may nonetheless have far-reaching effects through the model.<\/p><p>The result is a paradox: the most powerful training approaches are often the most legally fragile.<\/p><p>The situation becomes even more complex when special categories of personal data are involved. In such cases, the usual balancing mechanisms are no longer available. Processing requires a separate legal basis subject to significantly stricter conditions. Approaches relying on research-related grounds typically require a benefit that goes beyond purely commercial objectives \u2013 a threshold that is often difficult to meet in practice.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-49d49ad elementor-widget elementor-widget-heading\" data-id=\"49d49ad\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Anonymisation \u2013 a fragile stability<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5d60ef8 elementor-widget elementor-widget-text-editor\" data-id=\"5d60ef8\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>A seemingly straightforward solution lies in anonymisation. 
If training data is no longer personal data, the GDPR no longer applies.<\/p><p>In practice, however, this path is narrower than it appears. Anonymisation is not a fixed state, but a contextual assessment. Data may appear anonymous to one party, while remaining identifiable to another. Particularly with large, linkable datasets, the risk of re-identification cannot be fully excluded.<\/p><p>For training purposes, this means that \u201canonymised data\u201d is often only stable under certain assumptions. These assumptions must be documented, tested, and, if necessary, defended.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b3e80b5 elementor-widget elementor-widget-heading\" data-id=\"b3e80b5\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Synthetic data \u2013 solution or new layer of complexity?<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-225eabd elementor-widget elementor-widget-text-editor\" data-id=\"225eabd\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Against this backdrop, synthetic data is gaining importance. Instead of using real datasets, artificial data is generated to replicate the statistical properties of real data without directly relating to identifiable individuals.<\/p><p>The approach is appealing. It promises scalability without an immediate link to individuals. In practice, however, it often shifts the problem rather than solving it.<\/p><p>Synthetic data is only as \u201csynthetic\u201d as its source. Where it is generated based on personal data, the question remains to what extent those underlying data points continue to be legally relevant. 
A second issue arises from the fact that synthetic data may still allow inferences about real individuals, depending on the model and generation method.<\/p><p>Synthetic data is therefore not a free pass, but a tool. Used properly, it can reduce risk. Used incorrectly, it introduces a new layer of opacity.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-82ae571 elementor-widget elementor-widget-heading\" data-id=\"82ae571\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Training data as a governance issue<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7c87401 elementor-widget elementor-widget-text-editor\" data-id=\"7c87401\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The core challenge is therefore less about individual legal questions and more about governance. Training data must be traceable, documented, and controllable.<\/p><p>This includes its origin, its composition, and the assumptions underlying its use. In many organisations, this transparency is lacking. Training data is collected, combined, and reused without a systematic assessment of its legal quality.<\/p><p>This is where the decisive point is reached: decisions made during training can rarely be corrected later. 
Models carry their data history within them \u2013 often invisibly, but with legal consequences.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9204ce1 elementor-widget elementor-widget-heading\" data-id=\"9204ce1\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Conclusion and outlook<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f72206a elementor-widget elementor-widget-text-editor\" data-id=\"f72206a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>Training is not a technical preliminary step, but the central legal starting point of AI systems. It determines whether a system rests on solid foundations or already contains future compliance risks.<\/p><p>Recent developments in case law on the relative concept of personal data point to a more nuanced approach: effective pseudonymisation may result in training data falling outside the scope of the GDPR for the recipient \u2013 provided that the separation between datasets and identifying information is robust and permanent. This is not a free pass, but a demanding design task.<\/p><p>While training data forms the foundation, risk shifts during operation. The next part will focus on input data \u2013 the data users enter into systems. 
This is where new, often underestimated issues arise, particularly at the intersection of control, purpose limitation, real-time processing, and risk.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-30d20f4 elementor-widget elementor-widget-text-editor\" data-id=\"30d20f4\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><em>Those who want to make data usable for AI must also make it legally manageable \u2013 we are happy to support you in doing so.<\/em><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>28. April 2026 \u2013 Gernot Fritz, Tanja Pfleger. The legal risks of AI systems do not arise at the point of use \u2013 they arise much earlier. Not at deployment, not at the prompt or output stage, but at a phase that still receives surprisingly little attention in many projects: training. 
This is where the foundations are [&hellip;]<\/p>\n","protected":false},"author":20,"featured_media":51841,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"rank_math_lock_modified_date":false,"inline_featured_image":false,"footnotes":""},"categories":[235],"tags":[805,898,285,968,915],"group":[],"area":[],"location":[],"systype":[],"class_list":["post-52448","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-legal-update-en","tag-ai-2","tag-data-protection","tag-ki","tag-training-data","tag-urheberrecht"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/posts\/52448"}],"collection":[{"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/users\/20"}],"replies":[{"embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/comments?post=52448"}],"version-history":[{"count":4,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/posts\/52448\/revisions"}],"predecessor-version":[{"id":52456,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/posts\/52448\/revisions\/52456"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/media\/51841"}],"wp:attachment":[{"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/media?parent=52448"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/categories?post=52448"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/tags?post=52448"},{"taxonomy":"group","embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/group?post=52448"},{"taxonomy":"area","embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/area?post=52448"},{"taxonomy":"location","embeddable":true,"href":"https:\/\/ww
w.eh.at\/en\/wp-json\/wp\/v2\/location?post=52448"},{"taxonomy":"systype","embeddable":true,"href":"https:\/\/www.eh.at\/en\/wp-json\/wp\/v2\/systype?post=52448"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}