Data Version Control

v2.4.0 | Report Errata

docs development docs development

Data Versioning Tooling (DVC, Delta Lake, LakeFS)

Data versioning ensures that, for any model version, the organisation can retrieve the exact dataset used to train it. Without deliberate versioning, the data that trained last quarter’s model may have been silently overwritten by this quarter’s data pipeline. Three tools address this requirement with different architectural trade-offs.

DVC (Data Version Control) works alongside Git. The dataset is stored in remote storage (S3, GCS, Azure Blob), and DVC creates a small metadata file in the Git repository recording the storage location, content hash, and version. Checking out a Git commit allows DVC to retrieve the corresponding dataset. DVC fits into existing Git workflows and is the most widely adopted option, though it tracks whole files, meaning even small changes require storing a complete new copy.

Delta Lake, built on Apache Spark, provides ACID transactions on data lakes with time-travel capability for querying historical versions. It handles incremental changes efficiently but depends on the Spark ecosystem. LakeFS provides Git-like semantics (branches, commits, merges) directly on object storage, working with any S3-compatible tool. All three tools must support the ten-year retention requirement under Article 18, which has infrastructure implications: durable storage, surviving access credentials, and budgeted storage costs for a decade.

Key outputs

Selected data versioning tool with deployment configuration
Integration with the code repository (cross-referencing data versions to Git commits)
Ten-year retention infrastructure (long-term storage, lifecycle policies)
Module 4 and Module 10 AISDP documentation

Dataset Manifest per Version (Count, Schema, Hash, Creator, Transformations)

Each dataset version must be accompanied by a manifest that records the essential metadata needed for traceability and reproducibility. The manifest serves as the dataset’s identity card, enabling the organisation to verify that the correct dataset was used for a given training run and to detect any unauthorised modifications.

The manifest records the version identifier, the creation date, the creator’s identity, the record count, the column schema (field names, types, and formats), the SHA-256 content hash of each data file, the source description, and any transformations applied since the previous version. The content hash is particularly important: it provides a tamper-evident fingerprint that can be verified at any future point to confirm the dataset has not been altered.

For organisations using DVC, the manifest information is partially captured in the .dvc metadata files. For Delta Lake and LakeFS, the transaction log provides equivalent information. Regardless of tooling, the manifest should be stored alongside the dataset version in a human-readable format (YAML or JSON) and cross-referenced from the model registry entry that used the dataset for training. This cross-reference completes the data-to-model traceability chain.

Key outputs

Dataset manifest template (YAML or JSON)
Content hash computation per data file
Cross-reference to model registry entries
Module 4 and Module 10 evidence

Manual Alternative (Filename Versioning, YAML Manifests, Restricted Access)

For organisations that cannot deploy DVC, Delta Lake, or LakeFS, data versioning reverts to manual snapshot management. Each dataset version becomes a complete copy stored with a naming convention (for example, training_data_v2.4_2026-02-15/) and accompanied by a manifest file in YAML or JSON format as described above.

The manual approach requires strict operational discipline. Storage must have access controls and no-delete policies in place. The model registry entry (or tracking spreadsheet) must cross-reference the dataset version identifier. Each version must be a complete, unmodified snapshot; partial updates or in-place modifications undermine the versioning guarantee.

This approach sacrifices several capabilities that automated tooling provides: incremental storage efficiency (every version is a full copy), automated hash verification on retrieval, and integration with Git for code-data cross-referencing. For datasets above approximately 10GB, the storage cost and manual management burden become substantial, and adopting tooling becomes strongly advisable. DVC is open-source and free; the real cost is the engineering time to integrate it.

Key outputs

Naming convention and directory structure for dataset snapshots
YAML or JSON manifest per version
Access controls and no-delete storage policies
Module 4 and Module 10 documentation

Ten-Year Retention (Long-Term Storage, Lifecycle Policies, Credential Survivability)

Article 18 of the EU AI Act requires that technical documentation be retained for ten years from the date the system is placed on the market. For data versioning, this means that older dataset versions must remain retrievable for the entire period. Many organisations underestimate the infrastructure implications.

The versioning backend’s storage must be durable, with replication and backup to protect against data loss. The access credentials must survive personnel changes; a dataset versioning system that runs on a team’s cloud account and is forgotten when the team reorganises fails the retention test. The storage cost must be budgeted for a decade. Versioned datasets should be stored in the organisation’s long-term compliance storage (S3 Glacier, Azure Archive, or equivalent), with lifecycle policies preventing accidental deletion.

The AI Governance Lead is responsible for ensuring that the ten-year retention obligation is reflected in the organisation’s infrastructure planning and budgeting. Credential survivability means that the credentials needed to access archived data are managed through the organisation’s central secrets management, not held by individuals. Lifecycle policies should prevent both accidental deletion and premature archival to storage tiers that make retrieval impractically slow for incident response purposes.

Key outputs

Long-term storage configuration (S3 Glacier, Azure Archive, or equivalent)
Lifecycle policies preventing accidental deletion
Credential management through central secrets management
Module 4 and Module 10 AISDP documentation