v2.4.0

Training Data Copyright Assessment

The training data used to develop AI models, particularly large language models and generative systems, may include copyrighted material. The legal landscape is evolving rapidly, with active litigation in multiple jurisdictions. For high-risk AI systems, the AISDP must document the copyright status of the training data.

The assessment identifies whether the training data includes copyrighted text, images, audio, or other works and documents the legal basis relied upon: licence, consent, the text and data mining exception under Directive (EU) 2019/790, or another basis. It records the measures taken to identify and exclude material where rights holders have exercised an opt-out. Procedures for responding to copyright claims from rights holders are also documented.

For systems incorporating pre-trained models from third parties, the organisation should obtain contractual representations regarding the copyright status of the model’s training data. Where such representations are unavailable or qualified, the AI System Assessor records the risk in the risk register and assesses potential regulatory and reputational impact. The Code of Practice for GPAI providers under Article 56 includes copyright compliance commitments; however, the Code’s content and signatory coverage continue to evolve. The downstream provider should cross-reference any Code of Practice commitments against the information actually received, and should not treat Code of Practice participation as a substitute for direct contractual representations where those can be obtained.

Copyright risk is distinct from data protection risk. A dataset may be GDPR-compliant (no personal data, or personal data processed with lawful basis) yet still infringe copyright. The IP and Licensing Analysis artefact should address both dimensions.

Key outputs

  • Training data copyright assessment
  • Legal basis documentation per data source
  • Opt-out compliance records
  • Rights holder response procedures
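These key outputs lend themselves to a structured, machine-readable record per data source. The following Python sketch is illustrative only: the field names, the `LegalBasis` values, and the example procedure reference are assumptions, not a taxonomy prescribed by this documentation.

```python
from dataclasses import dataclass, field
from enum import Enum

class LegalBasis(Enum):
    # Illustrative values mirroring the bases named above;
    # the actual taxonomy is for the organisation to define.
    LICENCE = "licence"
    CONSENT = "consent"
    TDM_EXCEPTION = "tdm_exception"  # Directive (EU) 2019/790 text and data mining exception
    OTHER = "other"

@dataclass
class CopyrightAssessment:
    """One record per training data source, capturing its copyright status."""
    source_name: str
    contains_copyrighted_works: bool
    legal_basis: LegalBasis
    opt_out_measures: list[str] = field(default_factory=list)  # rights-holder opt-out handling
    claim_response_procedure: str = ""  # reference to the rights-holder response procedure

# Example record for a hypothetical web-crawled text corpus
record = CopyrightAssessment(
    source_name="web-corpus-2024",
    contains_copyrighted_works=True,
    legal_basis=LegalBasis.TDM_EXCEPTION,
    opt_out_measures=["TDM reservation parsing", "domain exclusion list"],
    claim_response_procedure="PROC-IP-01",  # hypothetical procedure identifier
)
```

Keeping one record per source makes the per-source legal basis documentation and the opt-out compliance records auditable side by side.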

Personal Data and Lawful Basis Verification

Where training, validation, or testing data includes personal data, the AISDP must document the lawful basis for processing under GDPR Article 6. Consent under Article 6(1)(a) is one possible basis, though legitimate interests under Article 6(1)(f) or public interest under Article 6(1)(e) may be more appropriate depending on the context. The appropriate lawful basis for AI model training is an area of active regulatory debate across EU member states; data protection authorities have taken divergent positions on whether legitimate interests can support large-scale model training, and enforcement practice continues to evolve. The Legal and Regulatory Advisor and DPO Liaison should monitor developments and revisit the lawful basis determination if regulatory guidance shifts.

The verification should confirm that:

  • the lawful basis is appropriate for the specific processing activity (model training may require a different basis from production inference);
  • data subjects were informed of the processing in accordance with GDPR Articles 13 and 14;
  • any consent obtained meets the GDPR’s requirements of being freely given, specific, informed, and unambiguous; and
  • the organisation can demonstrate compliance (accountability under GDPR Article 5(2)).

For special category data (racial or ethnic origin, political opinions, religious beliefs, trade union membership, genetic data, biometric data, health data, sex life or sexual orientation), the stricter requirements of GDPR Article 9 apply. The AISDP must document the specific exemption relied upon under Article 9(2), which may include explicit consent or the substantial public interest exemption, alongside the safeguards applied.

Where models are trained on data obtained from third parties, the organisation must verify the third party’s data governance, including the lawful basis for collection, the consent mechanisms used, and the data processing agreements in place. Third-party data validation is addressed in detail elsewhere in this documentation.

Key outputs

  • Lawful basis documentation per dataset and processing activity
  • Data subject notification records
  • Third-party data governance verification records
  • Special category data exemption documentation (where applicable)

Residual IP Risk Documentation

After completing copyright assessment, personal data consent verification, and licence compatibility review, residual intellectual property risks may remain. These risks aggregate across the system’s model components, training data, and third-party dependencies.

The IP and Licensing Analysis artefact consolidates these residual risks. For each risk, the document records the source (copyright uncertainty in training data, ambiguous licence terms, provider refusal to disclose training data composition), the potential impact (regulatory sanctions, injunctive relief, reputational damage, deployment restrictions), the mitigations applied (contractual representations, copyright filtering, alternative data sourcing), and the residual risk rating.
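The per-risk fields described above (source, potential impact, mitigations, residual rating) can be captured as a structured register entry. The sketch below is an illustrative assumption: the class, the `Rating` scale, and the rule that high residual risk triggers a deployer notice are examples, not requirements stated by this documentation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Rating(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class ResidualIPRisk:
    """One consolidated entry in the IP and Licensing Analysis artefact."""
    source: str                   # e.g. "copyright uncertainty in training data"
    potential_impact: list[str]   # e.g. ["regulatory sanctions", "injunctive relief"]
    mitigations: list[str]        # e.g. ["contractual representations"]
    residual_rating: Rating
    accepted_by: Optional[str] = None  # AI Governance Lead sign-off, once obtained

    def requires_deployer_notice(self) -> bool:
        # Illustrative rule: high residual risk is surfaced to deployers
        # through the Instructions for Use.
        return self.residual_rating is Rating.HIGH

# Example entry for a provider that declined to disclose training data composition
entry = ResidualIPRisk(
    source="provider refusal to disclose training data composition",
    potential_impact=["regulatory sanctions", "deployment restrictions"],
    mitigations=["contractual representations", "alternative data sourcing"],
    residual_rating=Rating.HIGH,
)
```

Recording entries in this shape keeps the acceptance trail (`accepted_by`) and the deployer communication decision reviewable as the legal landscape shifts.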

Residual IP risk is communicated to the AI Governance Lead for formal acceptance and may also need to be communicated to deployers through the Instructions for Use if it affects the deployer’s own compliance position. The risk register entries for IP risk are subject to periodic review, particularly as the legal landscape around AI training data copyright evolves.

Key outputs

  • IP and Licensing Analysis (consolidated artefact)
  • Residual IP risk register entries
  • AI Governance Lead risk acceptance (where applicable)