Every dataset used in the system’s lifecycle, whether for training, validation, testing, calibration, or fine-tuning, requires provenance documentation that specifies where the data came from and how it was acquired. Provenance must be specific: “data collected from deployers’ ATS platforms between January 2021 and December 2023 under data processing agreements” is acceptable; “data from various sources” is not.
For each dataset, the Technical SME records the original collection methodology, identifying whether the data was collected through direct observation, user interaction, sensor capture, manual entry, or automated scraping. Where the dataset contains personal data, the lawful basis for the collection under GDPR Article 6 must be documented, along with any consent mechanism relied on. Where the data was licensed from a third party, the licensing terms and their compatibility with the intended use are recorded.
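The fields described above can be captured in a structured record so that gaps are detectable automatically. The following is a minimal sketch, not a prescribed schema: the class names, field names, and the `gaps` check are all hypothetical and would need to be adapted to the organisation's own documentation tooling. The enumeration of lawful bases follows the six grounds in GDPR Article 6(1).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CollectionMethod(Enum):
    DIRECT_OBSERVATION = "direct_observation"
    USER_INTERACTION = "user_interaction"
    SENSOR_CAPTURE = "sensor_capture"
    MANUAL_ENTRY = "manual_entry"
    AUTOMATED_SCRAPING = "automated_scraping"


class LawfulBasis(Enum):
    # The six grounds in GDPR Article 6(1), points (a) to (f).
    CONSENT = "6(1)(a)"
    CONTRACT = "6(1)(b)"
    LEGAL_OBLIGATION = "6(1)(c)"
    VITAL_INTERESTS = "6(1)(d)"
    PUBLIC_TASK = "6(1)(e)"
    LEGITIMATE_INTERESTS = "6(1)(f)"


@dataclass
class ProvenanceRecord:
    """Hypothetical per-dataset source and acquisition record."""
    dataset_name: str
    source_description: str
    collection_method: CollectionMethod
    lawful_basis: LawfulBasis
    consent_mechanism: Optional[str] = None
    licence_terms: Optional[str] = None
    licence_compatible_with_use: Optional[bool] = None

    def gaps(self) -> List[str]:
        """Return the documentation gaps found in this record."""
        problems = []
        if not self.source_description.strip():
            problems.append("source description is empty")
        if self.lawful_basis is LawfulBasis.CONSENT and not self.consent_mechanism:
            problems.append("consent is the lawful basis but no consent mechanism is recorded")
        if self.licence_terms and self.licence_compatible_with_use is None:
            problems.append("licence terms recorded without a compatibility assessment")
        return problems
```

A record with an empty `gaps()` result is not proof of compliance; the check only confirms that the fields this section asks for have been filled in consistently.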
The Datasheets for Datasets framework (Gebru et al., 2021) provides a structured template. Its seven sections cover motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. For EU AI Act compliance, the collection process section requires additional depth beyond the standard template, specifically the GDPR lawful basis and the data processing agreements governing cross-organisational transfers.
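A completeness check over the seven sections, plus the two additions to the collection process section, might look like the sketch below. It assumes the datasheet is held as a dictionary; the key names (`collection_process`, `gdpr_lawful_basis`, `data_processing_agreements`) are illustrative choices rather than part of the Datasheets for Datasets framework itself.

```python
from typing import Any, Dict, List

# The seven sections named in the framework, as dictionary keys (hypothetical naming).
DATASHEET_SECTIONS = [
    "motivation",
    "composition",
    "collection_process",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
]

# Additional entries expected inside the collection process section.
# An explicit "not applicable" note counts as an entry.
COLLECTION_EXTRAS = ["gdpr_lawful_basis", "data_processing_agreements"]


def missing_entries(datasheet: Dict[str, Any]) -> List[str]:
    """List the sections and collection-process entries that are absent or empty."""
    missing = [name for name in DATASHEET_SECTIONS if not datasheet.get(name)]
    collection = datasheet.get("collection_process")
    # The extras can only be checked when the section is structured as a mapping.
    extras = collection if isinstance(collection, dict) else {}
    missing += [
        f"collection_process.{key}" for key in COLLECTION_EXTRAS if not extras.get(key)
    ]
    return missing
```

Running this check when a datasheet is created or updated gives an early signal that the collection process section has been left at template depth.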
Dataset documentation is treated as a living artefact. A version bump, whether from new records, modified features, or changed quality rules, triggers a corresponding documentation update. Tools such as OpenMetadata and DataHub support attaching structured documentation to dataset versions with change tracking. For lighter approaches, a Markdown file co-located with the dataset in the versioning system (DVC, Delta Lake) provides version-controlled documentation that evolves alongside the data.
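For the lighter Markdown approach, one way to enforce the rule that a version bump triggers a documentation update is to compare the dataset's version with the version stated in the co-located file. The sketch below assumes a convention, invented here for illustration, that the Markdown file contains a line of the form `dataset_version: 1.4.0`; neither DVC nor Delta Lake prescribes such a line, and how the current dataset version is obtained is left to the surrounding pipeline.

```python
import re
from typing import Optional

# Hypothetical convention: the datasheet declares the version it describes
# on a line of its own, e.g. "dataset_version: 1.4.0".
VERSION_LINE = re.compile(r"^dataset_version:\s*(\S+)\s*$", re.MULTILINE)


def documented_version(markdown_text: str) -> Optional[str]:
    """Return the dataset version the documentation claims to describe, if any."""
    match = VERSION_LINE.search(markdown_text)
    return match.group(1) if match else None


def documentation_is_current(markdown_text: str, dataset_version: str) -> bool:
    """True only when the documentation names exactly the current dataset version."""
    return documented_version(markdown_text) == dataset_version
```

A check of this kind could run in continuous integration and fail the build when the dataset version moves ahead of its documentation. It confirms only that the version label was updated, not that the content of the documentation was revised.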
Key outputs
- Source and acquisition record per dataset
- GDPR lawful basis documentation
- Data processing agreement references (where applicable)