AI and Data Protection: Input Data – The Moment of Truth

Alongside training data, which we discussed in our previous article, one of the key data protection risks of AI systems lies in what users enter into them. In practice, input data is a blind spot.

A prompt is written in seconds. A file is uploaded just as quickly. A use case is tested almost casually. But what may look like a trivial technical step is, from a legal perspective, a separate act of data processing – with its own requirements, risks and consequences.

While training data shapes the past of a model, input data defines its present.

Input as a separate act of data processing

Input data is the data that users enter into an AI system. This may be simple text, but also complex documents, personal data, confidential materials or entire datasets. What these data have in common is not their format, but their context: they arise in specific situations, are often current and are frequently far more sensitive than training data.

It is precisely this context-specific nature that makes them legally challenging. Unlike training, which often takes place at an earlier, more abstract stage, the input of data is directly connected to specific individuals, specific business processes and specific expectations. What is entered is rarely neutral. It is usually embedded in existing legal relationships – with customers, employees or business partners.

This requires a shift in perspective: input is not merely a technical intermediate step, but a separate data protection-relevant processing operation that must be justified in its own right.

Who is actually processing what?

When data is entered into an AI system, the traditional allocation of roles starts to shift. The user enters the data, the provider supplies the infrastructure, the model processes the content and generates an output. But that description is too simplistic.

In practice, it is often assumed that the AI provider acts as a processor. That may be correct – but it is not automatic. As soon as the service provider uses prompts, uploads or telemetry data – that is, automatically collected and transmitted usage data – for its own purposes, such as product improvement, security analysis or model training, the classification as mere processing on behalf of another party often becomes difficult to maintain.

A good example is AI-based voice transcription tools. Many of these systems do not limit themselves to pure transcription, for instance to produce meeting minutes, but also use voice data to improve the underlying models, for example through fine-tuning or training. From a data protection perspective, such use cannot simply be classified as processing on behalf of the customer. A provider that uses voice data for its own development or training purposes may itself become a controller or joint controller.

The decisive questions remain the same: Who determines the purposes? Who determines the essential means? And who uses the data for its own interests? This is where the line is drawn between processor status and separate or joint controllership. For companies, this means that the classification of the provider is not a formality, but a central part of the risk assessment.

Purpose limitation under real-time conditions

The data protection principle of purpose limitation faces a particular challenge when it comes to input data. Data is typically collected for a specific purpose – for example, contract performance or internal analysis. If an AI system is then used, the question is whether that use is still covered by the original purpose.

The answer is rarely straightforward. The integration of an AI system can quickly change the context of processing. Data is no longer merely stored or transmitted. It is analysed, transformed and placed into new contexts. At the same time, there is often limited transparency as to what actually happens inside the system.

The result is a gradual functional shift. What began as a supporting use may develop into an independent form of processing. The relevant boundary is not the user interface, but the actual use of the data.

Legal bases in the context of use

Compared with training, the legal basis for processing input data is often more closely connected to existing business relationships. Contractual necessity may play a role, for example where an AI system is used to perform a specific service. But here too, the same principle applies: this legal basis only fits if the processing is objectively necessary to enter into or perform a contract. Merely being practical, useful or convenient is not enough.

Legitimate interests remain an important legal basis, particularly for internal applications. However, the balancing test becomes more demanding. Input data is often directly personal and relates to specific real-life situations. This increases the requirements regarding transparency, reasonable expectations and protective measures.

Consent may be relevant in certain situations, but it quickly reaches its practical limits. In dynamic usage scenarios, it is often difficult to obtain consent that is informed, freely given and valid.

The overall picture therefore remains mixed: the legal instruments exist, but their application in the specific context of use is complex and highly case-specific.

Shadow AI – the real risk

A significant share of the risks surrounding input data does not arise from deliberately controlled processes, but from informal use. Employees turn to freely available tools to work faster, achieve better results or automate routine tasks. In doing so, they enter data that was never intended for those systems.

Customer data, draft contracts, internal analyses or strategic considerations – in practice, all of this can end up in prompts. What is intended as an efficiency gain can quickly become a loss of control. Once entered, this information often leaves the company’s immediate sphere of influence and may be used as training data.

The challenge is less about the individual violation than about the structure behind it. Shadow AI is not an exceptional case, but a systemic phenomenon. It emerges where governance is missing or not actually put into practice. And it shows that the real weak point of many AI systems is not technical, but organisational.

Confidentiality and commercial sensitivity

In addition to data protection, another issue comes to the fore with input data: confidentiality. Much of the information entered into AI systems is not only personal data, but also commercially sensitive. It may be subject to contractual confidentiality obligations or qualify as trade secrets.

The use of external AI systems may therefore lead to unintended disclosure. Even where there is no active onward transfer, the question arises whether processing by the provider already amounts to disclosure. This is not merely a theoretical issue. It determines whether existing confidentiality obligations are being complied with.

Technical reality and legal assumptions

A central problem in dealing with input data lies in the opacity of the systems. Users often do not know whether their inputs are stored, how long they are retained and whether they are used for other purposes. At the same time, legal assessments frequently rely on assumptions about precisely these processes.

This discrepancy creates a structural risk. Decisions are made on the basis of incomplete information. Contracts are concluded without a full understanding of the actual data processing. Compliance is assumed without having been verified.

That is why a sober look at the technical reality is essential. Anyone seeking to assess input data from a legal perspective must understand what actually happens to it – and must secure that understanding through robust contractual arrangements.

Conclusion and outlook

Input data is the point at which the abstract questions of AI regulation turn into concrete risks. This is where data protection, confidentiality and business practice intersect most directly. And this is where it becomes clear whether the use of AI is genuinely controlled – or merely appears to be.

The greatest vulnerability of many AI systems does not lie in their training, but in what is entered into them every day.

While input data shapes the use of AI systems, output raises legal questions of its own. The next part will therefore focus on output data – and on the legal consequences that arise from the results generated by AI systems, as well as the question of who is responsible for them.

AI governance, training, policies and contracts are essential to minimise the legal risks associated with the use of artificial intelligence. We support you in this process.