Introduction
As artificial intelligence (AI) development accelerates, businesses increasingly rely on vast amounts of data to train their AI models. Various laws need to be taken into consideration in the process, in particular when personal data is involved. One of the most relevant legal frameworks applicable to the training of AI is the European Union’s General Data Protection Regulation (GDPR). Even companies outside the EU may find themselves subject to these rules, which impose strict obligations on how personal data can be collected and used.
This article provides an overview of the key requirements businesses should be aware of when (web-)scraping and using personal data for AI training purposes under European data protection laws.
When does the GDPR apply?
The first step in determining whether the GDPR applies to a business is assessing the nature of the data being collected and used. The GDPR only regulates the handling of “personal data”, which refers to any information that relates to an identified or identifiable natural person. Anonymized data falls outside the GDPR’s scope, but only if the individual can no longer be re-identified, taking into account ‘all the means reasonably likely to be used’.
If the data could potentially be re-identified with reasonable effort, it remains subject to the GDPR. This means that even if a dataset appears non-personal, businesses should carefully evaluate the risk of re-identification before assuming they are free from GDPR obligations. The AI to be trained must itself be considered a means ‘reasonably likely to be used’ for re-identification. Some AI systems pose a particular risk because they have access to large amounts of data; for example, training an AI may undo prior randomization, resulting in accidental re-identification.
The GDPR also applies extraterritorially. Even if a company is not established in the EU, it may need to comply with the GDPR if it processes personal data of individuals in the EU.
Legal grounds for web scraping under the GDPR
If a business determines that its data collection activities fall under the GDPR, it must establish a legal basis for processing personal data. The GDPR provides several legal bases, but not all are suitable for AI training. Three legal bases are most likely to be used for scraping personal data for the training of AI:
Consent
The GDPR allows the use of personal data if individuals provide informed, specific, and freely given consent. This may be a feasible option, e.g., if a company only wants to process its own users’ data. However, obtaining valid consent at scale for web scraping is impractical in most cases. Furthermore, consent may be withdrawn at any time with effect for future processing.
Legitimate interests
Businesses can process personal data if they have a legitimate interest in data processing that is not overridden by individuals’ rights and freedoms. This requires companies to conduct a balancing test, weighing their interests in training AI against individuals’ interests in not having their data used for AI training purposes.
Public interest or legal obligations
Some activities may be justified under public interest grounds or legal obligations, though these are rarely applicable to commercial AI development.
Consequently, “legitimate interests” is the most common legal basis for web scraping. Several European Data Protection Authorities have issued guidance emphasizing that businesses can implement safeguards to strengthen their case for legitimate interest by mitigating risks to individuals’ rights and freedoms. These safeguards can include:
- Defining precise data collection criteria.
- Filtering out sensitive categories of data (e.g., health information, political beliefs).
- Pseudonymizing personal data, i.e., processing only parts of personal data that do not allow re-identification without additional information.
- Respecting website policies that prohibit scraping (e.g., via robots.txt or ai.txt files).
- Facilitating the exercise of data subjects’ rights under the GDPR, e.g., by implementing easily accessible mechanisms to access their personal data or to opt out from the collection or processing.
The trained AI model should also be tested and evaluated to mitigate the risk of unintended data memorization.
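To make the safeguards above concrete, the sketch below shows how a scraping pipeline might respect a site’s robots.txt, filter out obviously sensitive records, and pseudonymize direct identifiers. It uses only the Python standard library; the user-agent string, the pattern list, and the helper names are illustrative assumptions, not a complete or legally sufficient implementation.

```python
import hashlib
import re
from urllib import robotparser


def scraping_allowed(robots_txt: str, url_path: str, agent: str = "example-ai-bot") -> bool:
    """Safeguard: respect website scraping policies declared in robots.txt."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url_path)


# Illustrative patterns for special categories of data (health, political views).
SENSITIVE_PATTERNS = [
    re.compile(r"\b(diagnos\w+|prescription|blood type)\b", re.IGNORECASE),
    re.compile(r"\b(votes? for|party member(ship)?)\b", re.IGNORECASE),
]


def drop_sensitive(records: list[str]) -> list[str]:
    """Safeguard: filter out records matching sensitive-category patterns."""
    return [r for r in records if not any(p.search(r) for p in SENSITIVE_PATTERNS)]


def pseudonymize(name: str, secret_salt: bytes) -> str:
    """Safeguard: replace a direct identifier with a salted hash.
    Re-identification requires the salt, which must be stored separately."""
    return hashlib.sha256(secret_salt + name.encode()).hexdigest()[:16]
```

Note that pattern-based filtering is best-effort only; in practice it would be combined with human review and model-level evaluation for memorization.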
Key data protection principles for AI training
Companies must adhere to the GDPR’s fundamental principles at all times during data collection and AI training:
Lawfulness, fairness, and transparency
Personal data must not be processed without a legal basis. Individuals must be informed about how their personal data is used, including the purpose of the processing, the categories of personal data processed, and the sources from which personal data has been collected. Businesses should provide clear privacy notices and allow individuals to exercise their rights, such as accessing, rectifying, or deleting their personal data, or objecting to the collection and processing of their personal data.
Purpose limitation
Personal data collected for one purpose (e.g., social media interaction) cannot automatically be used for another (e.g., AI training) unless there is a valid legal basis for further processing.
Data minimization
Companies should be able to explain why the personal data is necessary for training their AI models. Excessive or irrelevant data should not be included.
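One way to operationalize data minimization is a strict allow-list of fields the training pipeline may ingest, so that everything else is dropped at the point of collection. This is a minimal sketch; the field names are hypothetical.

```python
# Fields documented as necessary for the training purpose (hypothetical names).
ALLOWED_FIELDS = {"post_text", "language", "timestamp"}


def minimize(record: dict) -> dict:
    """Drop every field that is not on the documented allow-list."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
```

Keeping the allow-list in a single reviewed place also gives the company a ready answer when asked why each piece of personal data is necessary.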
Accuracy
Data used for AI training must be accurate and up-to-date to avoid biases and errors.
Storage limitation
Personal data must only be stored as long as it is necessary for the AI training. Once the training is completed and the personal data is no longer needed, it must be erased.
Integrity and confidentiality
Companies must implement appropriate technical and organizational security measures to prevent data leaks and unauthorized access to personal information.
Handling data subject rights
Individuals’ rights under the GDPR also apply when personal data is used in AI training. These include:
- The right to access their personal data and receive a copy.
- The right to request corrections or deletions.
- The right to object to data processing.
- The right to data portability, allowing individuals to transfer their data elsewhere.
To comply, businesses should implement processes such as “erasure request” forms or maintain “blacklists” to ensure that erased data is not reintroduced into training datasets. Additionally, when AI models output personal data, businesses may need to use “machine unlearning” techniques to remove such data from the system.
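A minimal sketch of such a “blacklist”, assuming erased records can be matched by a normalized fingerprint (exact matching is an assumption; real pipelines may need fuzzier matching and would persist the fingerprints):

```python
import hashlib


def record_fingerprint(record: str) -> str:
    """Stable fingerprint used to match records against the erasure blacklist."""
    return hashlib.sha256(record.strip().lower().encode()).hexdigest()


class ErasureBlacklist:
    """Keeps fingerprints of erased records so they are never re-ingested."""

    def __init__(self) -> None:
        self._erased: set[str] = set()

    def register_erasure(self, record: str) -> None:
        # Store only the fingerprint, not the erased personal data itself.
        self._erased.add(record_fingerprint(record))

    def filter_dataset(self, records: list[str]) -> list[str]:
        # Applied before each training run to screen out erased data.
        return [r for r in records if record_fingerprint(r) not in self._erased]
```

Storing only fingerprints rather than the erased data itself avoids retaining the very personal data the individual asked to have deleted.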
The AI Act: Additional obligations for AI training
Beyond the GDPR, businesses developing AI models should also consider the EU AI Act. The AI Act introduces additional obligations for AI developers, particularly those creating “high-risk” AI systems.
Under the AI Act, businesses must implement strict data governance policies for training, validation, and testing datasets, including:
- Ensuring datasets are relevant, representative, and free of errors and biases.
- Conducting quality assessments and bias mitigation strategies.
- Documenting data sources and processing activities.
Most companies covered by the AI Act must comply with these new requirements by August 2026. While the GDPR focuses on privacy and data protection, the AI Act adds broader accountability measures for AI development. Thus, the rules on data governance are but a small part of the EU AI Act’s obligations applicable to training AI.
Conclusion: Practical steps for compliance
For businesses looking to train AI models using online data, compliance with European data protection laws is often essential. To minimize legal risks, companies should:
- Conduct a thorough legal assessment to determine if the GDPR applies to their data collection activities.
- Establish a legal basis for scraping and processing personal data.
- Implement safeguards to respect data protection principles and minimize privacy risks.
- Facilitate data subject rights by setting up clear procedures for access, correction, and erasure requests.
- Stay informed about new AI legislation, such as the EU AI Act, to ensure compliance with evolving legal standards.
By taking these steps, businesses can responsibly train AI models while respecting individuals’ privacy rights and avoiding regulatory pitfalls. Ensuring compliance not only mitigates legal risks but also helps build trust with users and regulators in an increasingly data-driven world.
