v2.4.0

Transformation Documentation (Pre-Step / Post-Step)

Data lineage requires documenting every data engineering step with a pre-step record (captured before execution) and a post-step record (captured after execution). This methodology creates an audit trail demonstrating that data engineering was deliberate and considered.

The pre-step record includes the input datasets referenced by version identifier, the intended transformation (what the step will do), the rationale for the transformation (what data quality, completeness, or fairness problem it addresses), the expected output characteristics (schema, record count, distribution properties), and the validation criteria to be applied to the output.

The post-step record includes the actual output dataset referenced by version identifier, the actual output characteristics, a comparison against pre-step expectations noting any deviations and their explanation, the impact on data quality metrics, the impact on fairness-relevant distributions, and the identity and date of the person who executed the step.
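
The paired records above can be sketched as simple structured types. This is a minimal illustration of the fields the methodology calls for, not a prescribed schema; all field and class names are assumptions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PreStepRecord:
    """Captured before a data engineering step executes (field names illustrative)."""
    input_dataset_versions: list[str]  # version identifiers of input datasets
    intended_transformation: str       # what the step will do
    rationale: str                     # quality/completeness/fairness problem addressed
    expected_output: dict              # e.g. schema, record count, distribution properties
    validation_criteria: list[str]     # checks to apply to the output

@dataclass
class PostStepRecord:
    """Captured after execution; paired with its PreStepRecord."""
    output_dataset_version: str
    actual_output: dict
    deviations: list[str]              # differences from expectations, with explanation
    quality_impact: dict
    fairness_impact: dict
    executed_by: str
    executed_on: date

pre = PreStepRecord(
    input_dataset_versions=["loans-raw@v3"],
    intended_transformation="Drop records with missing income field",
    rationale="Completeness: income is missing in 2% of records",
    expected_output={"record_count": 98_000},
    validation_criteria=["record_count >= 97_000"],
)
```

In practice these records would be serialised (JSON, YAML) and stored alongside the pipeline run metadata so each step's audit trail travels with the dataset version it produced.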

Data lineage operates at three levels of granularity. Pipeline-level lineage captures the macro view using DAG-based orchestration tools (Airflow, Prefect, Dagster). Transformation-level lineage captures the logic within each step, requiring version-controlled code (dbt for SQL, tracked Python scripts for other transforms). Column-level lineage tracks how each column in the output relates to columns in source datasets, which is essential for proxy variable analysis; OpenLineage and Marquez provide an open standard for emitting and collecting lineage events at all three levels.
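
A lineage event at the transformation level can be expressed as a plain JSON document whose shape follows the OpenLineage event model (event type, run, job, input and output datasets). The helper below is a hedged sketch of that shape, not a use of the official client library; the namespaces, job name, and producer URL are illustrative.

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(job_name: str, inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal OpenLineage-style run event for a completed step.

    Field names mirror the OpenLineage event model; all values here
    (namespaces, producer) are illustrative assumptions."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example-pipeline", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.org/lineage-producer",  # illustrative URL
    }

event = run_event("clean_income", ["loans_raw"], ["loans_clean"])
payload = json.dumps(event, indent=2)  # what would be emitted to a collector such as Marquez
```

Column-level detail is carried in the same events via dataset facets (e.g. a column-lineage facet on the output dataset), which is what makes proxy variable analysis tractable downstream.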

Great Expectations integrates naturally with the pre-step/post-step methodology: an expectation suite defines the expected output characteristics, and the validation result (pass/fail with specifics) serves as the core of the post-step record.
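
The pass/fail mechanics can be shown without tying the example to a particular Great Expectations release. The sketch below implements the same logic generically — compare actual output characteristics against pre-step expectations and return a result with specifics; the expectation forms (`equals`/`min`/`max`) are a deliberate simplification of what a real expectation suite would express.

```python
def validate_output(actual: dict, expectations: dict) -> dict:
    """Compare actual output characteristics against pre-step expectations.

    Returns a pass/fail result with specifics, usable as the core of a
    post-step record. Expectation forms here are a simplification of a
    Great Expectations suite."""
    failures = []
    for key, rule in expectations.items():
        value = actual.get(key)
        if "equals" in rule and value != rule["equals"]:
            failures.append(f"{key}: expected {rule['equals']}, got {value}")
        if "min" in rule and (value is None or value < rule["min"]):
            failures.append(f"{key}: {value} below minimum {rule['min']}")
        if "max" in rule and (value is None or value > rule["max"]):
            failures.append(f"{key}: {value} above maximum {rule['max']}")
    return {"success": not failures, "failures": failures}

result = validate_output(
    actual={"record_count": 96_500, "null_income_rate": 0.0},
    expectations={"record_count": {"min": 97_000}, "null_income_rate": {"max": 0.01}},
)
# result["success"] is False: record_count fell below the expected minimum
```

A failing result like this is exactly the deviation that the post-step record must note and explain.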

Key outputs

  • Pre-step and post-step records per data engineering step
  • Lineage event records (pipeline, transformation, and column level)
  • Deviation analysis documentation

Data Versioning Tooling (DVC, Delta Lake, LakeFS, Manual Snapshots)

Every dataset used in the system’s lifecycle must be versioned with an immutable identifier that allows the exact dataset to be retrieved at any future point. Dataset versions are linked to model versions so that the AISDP can state precisely which data was used to train each model version.

DVC (Data Version Control) is the most widely adopted open-source tool. It extends Git to track large files and datasets, storing the data in a configured backend (S3, GCS, Azure Blob) while Git tracks the metadata and version pointers. DVC enables branch-based data experimentation and ensures that the data used for any model version is reproducible from the Git commit history.

Delta Lake provides ACID transactions on top of data lakes, supporting time-travel queries that retrieve the exact dataset as it existed at any point in time. It is well-suited to Spark-based pipelines and large-scale data environments. LakeFS provides Git-like branching and versioning for data lakes, supporting isolated data experimentation without affecting production datasets. Cloud-native versioning (S3 object versioning, for example) provides a simpler alternative for smaller datasets.

For organisations with manual or semi-automated data workflows, manual snapshots (timestamped copies of the dataset stored in a defined location with a version identifier) provide minimum viable versioning. The AISDP must reference the versioning mechanism, the storage location, the retention policy, and the access controls governing the versioned datasets.
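
Minimum viable versioning can be as small as the sketch below: copy the dataset into a defined store under an identifier built from a UTC timestamp and a content hash, and never overwrite an existing version. The layout and identifier scheme are illustrative assumptions, not a standard.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path

def snapshot(dataset: Path, store: Path) -> str:
    """Manual snapshot versioning: timestamped copy under an immutable identifier.

    Identifier = <name>-<UTC timestamp>-<sha256 prefix>; the content hash makes
    the identifier verifiable against the stored bytes."""
    digest = hashlib.sha256(dataset.read_bytes()).hexdigest()[:12]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    version_id = f"{dataset.stem}-{stamp}-{digest}"
    target = store / version_id / dataset.name
    target.parent.mkdir(parents=True, exist_ok=False)  # immutability: never overwrite
    shutil.copy2(dataset, target)
    return version_id
```

The returned identifier is what the AISDP and the model registry reference; retention policy and access controls then apply to the store directory as a whole.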

The choice of tooling should align with the system’s data infrastructure and the team’s capabilities. The AISDP documents the selected tool, the versioning scheme, and the integration with the model registry (so that model-to-data traceability is maintained).
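
The model-to-data linkage can be a small record in the model registry. The entry below is purely illustrative — field names, version strings, and the tool value are assumptions — but it shows the minimum needed to state which data trained which model version.

```python
# Illustrative model-registry entry linking a model version to the exact
# dataset versions used to produce it (all field names and values are
# assumptions, not a registry schema).
model_registry_entry = {
    "model": "credit-scorer",
    "model_version": "1.7.0",
    "training_data": ["loans-clean-20250301T120000Z-ab12cd34ef56"],
    "validation_data": ["loans-holdout-20250301T120000Z-9f8e7d6c5b4a"],
    "data_versioning_tool": "dvc",   # or delta, lakefs, manual-snapshot
    "git_commit": "d3adb33f",        # commit pinning the version pointers
}
```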

Key outputs

  • Data versioning tool selection and configuration
  • Versioning scheme documentation
  • Model-to-data version linkage specification

Ten-Year Retention Planning — Storage & Lifecycle Policies

Article 18 of the AI Act requires that technical documentation, including information about training data, be retained for ten years after the system is placed on the market or put into service. The GDPR’s storage limitation principle requires that personal data be kept no longer than necessary. Reconciling these obligations is a core data governance challenge.

The resolution lies in retaining the documentation about the data, not necessarily the underlying personal data. In practice, this requires retaining metadata (provenance records, quality metrics, distributional statistics, versioning records, schema documentation, bias assessment results) after the personal data itself has been deleted or anonymised. The data architecture must be designed so that compliance-relevant information about training data can survive deletion of the individual records it describes.

The retention plan specifies, for each data category (training data, validation data, test data, inference inputs, inference outputs, operator interaction logs): the retention period, the justification for the period (regulatory requirement, reproducibility need, audit trail obligation, retraining schedule), the storage tier and cost implications (hot storage for active use, warm for periodic access, cold for archival), and the deletion or anonymisation process at the end of the retention period.
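
A retention plan of this shape can be made executable. The sketch below pairs an illustrative plan (the periods, tiers, and end-of-period actions are assumptions, not prescriptions) with a disposition check; note that documentation *about* training data carries the ten-year Article 18 horizon while the personal data itself gets a shorter, justified period.

```python
from datetime import date

# Illustrative retention plan per data category; periods and tiers are
# assumptions for the sketch, not regulatory prescriptions.
RETENTION_PLAN = {
    "training_metadata": {"years": 10, "tier": "cold", "end_action": "review"},
    "training_data":     {"years": 3,  "tier": "warm", "end_action": "anonymise"},
    "inference_inputs":  {"years": 1,  "tier": "hot",  "end_action": "delete"},
    "operator_logs":     {"years": 2,  "tier": "warm", "end_action": "delete"},
}

def disposition(category: str, created: date, today: date) -> str:
    """Return 'retain' while inside the retention period, otherwise the
    plan's end-of-period action. (Leap-day edge cases ignored in this sketch.)"""
    plan = RETENTION_PLAN[category]
    expiry = created.replace(year=created.year + plan["years"])
    return "retain" if today < expiry else plan["end_action"]
```

Running such a check on a schedule, and logging its outcomes, is one way to produce the technically verified deletion/anonymisation evidence the DPO Liaison reviews.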

The DPO Liaison reviews the retention plan against GDPR requirements, confirming that personal data retention periods are justified and that deletion/anonymisation procedures are technically verified. At system end-of-life, the retention framework faces its most demanding test; the DPO Liaison should review the retention schedule against the decommission circumstances.

Key outputs

  • Data retention plan per data category
  • Storage tier and lifecycle policy
  • DPO Liaison review and confirmation

Third-Party Data Validation — Contracts, Ingestion Checks & Quarantine

Many high-risk AI systems rely on data from external sources: commercial data brokers, GPAI provider training corpora, feature enrichment services. The organisation bears full Article 10 compliance responsibility for this data regardless of its origin. The third-party data governance framework operates on three layers: contractual, technical, and ongoing monitoring.

The contractual layer establishes baseline expectations. Data supplier agreements should address provenance disclosure (collection methodology, lawful basis, populations represented, known biases), data quality specifications (completeness thresholds, accuracy guarantees, timeliness requirements, consistency standards), bias and representativeness warranties (demographic composition statistics to the extent disclosable), change notification (30 to 90 days before material changes), and audit rights (direct inspection or third-party auditor access at a risk-proportionate frequency).

The technical layer validates every delivery regardless of contractual promises. The intake validation pipeline verifies schema compliance, completeness, range and distribution checks against the historical baseline, and anomaly detection. Great Expectations or Soda Core can define a dedicated expectation suite per supplier. Deliveries that fail validation are quarantined: the data sits in a holding area, the supplier is notified, and the data does not enter the training pipeline until the failure is resolved. The quarantine log is retained as Module 4 evidence.
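
The intake-and-quarantine flow can be sketched end to end. The checks below cover only schema and completeness (range/distribution and anomaly checks would be added per supplier), and the quarantine layout, thresholds, and log format are all illustrative assumptions.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def validate_delivery(rows: list[dict], expected_columns: set[str],
                      completeness_threshold: float = 0.99) -> list[str]:
    """Schema and completeness checks on a delivery; returns failure messages."""
    if not rows:
        return ["empty delivery"]
    failures = []
    cols = set(rows[0])
    if cols != expected_columns:
        failures.append(f"schema mismatch: {sorted(cols ^ expected_columns)}")
    for col in expected_columns & cols:
        filled = sum(1 for r in rows if r.get(col) not in (None, ""))
        rate = filled / len(rows)
        if rate < completeness_threshold:
            failures.append(f"{col}: completeness {rate:.2%} below threshold")
    return failures

def ingest(delivery: Path, rows: list[dict], expected_columns: set[str],
           quarantine_dir: Path) -> bool:
    """Quarantine failing deliveries and append to the quarantine log;
    only passing deliveries proceed toward the training pipeline."""
    failures = validate_delivery(rows, expected_columns)
    if failures:
        quarantine_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(delivery), quarantine_dir / delivery.name)
        entry = {"delivery": delivery.name, "failures": failures,
                 "at": datetime.now(timezone.utc).isoformat()}
        with (quarantine_dir / "quarantine.log").open("a") as f:
            f.write(json.dumps(entry) + "\n")
        return False
    return True
```

The append-only `quarantine.log` produced here is the kind of record retained as Module 4 evidence; supplier notification would hang off the same failure path.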

The ongoing monitoring layer detects silent changes. Statistical monitoring of incoming deliveries compares each delivery’s distributional profile against the historical baseline, flagging sudden shifts that may indicate undisclosed methodology changes. Periodic re-assessment (at least annually) evaluates whether the data remains suitable for the system’s intended purpose given evolving deployment populations and available alternatives.
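
One common way to compare a delivery's distributional profile against the historical baseline is the Population Stability Index (PSI) over matched bins. The sketch below uses PSI as an assumed choice of statistic; the 0.2 flag threshold is a widely used convention, not a regulatory figure.

```python
import math

def psi(baseline: list[float], current: list[float]) -> float:
    """Population Stability Index between two distributions expressed as
    bin proportions over the same bins (each summing to ~1).

    PSI above roughly 0.2 is conventionally read as a material shift."""
    total = 0.0
    for b, c in zip(baseline, current):
        b = max(b, 1e-6)  # clamp to avoid log(0) on empty bins
        c = max(c, 1e-6)
        total += (c - b) * math.log(c / b)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]  # historical profile, e.g. income bands
shifted  = [0.10, 0.20, 0.30, 0.40]  # this delivery's profile
flag = psi(baseline, shifted) > 0.2  # True: flag for undisclosed methodology change
```

A flagged delivery would trigger the same supplier-notification path as a hard validation failure, and feed the periodic suitability re-assessment.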

Key outputs

  • Third-party data governance framework documentation
  • Supplier contract provisions summary
  • Intake validation pipeline configuration
  • Quarantine log and resolution records

Special Category Data (Art. 10(5))
