Open-Source Models — Training Data Provenance

v2.4.0


Open-source models from repositories such as Hugging Face, GitHub, or academic publications offer accessibility and community validation. They also introduce training data provenance risks that the AISDP must address. The training data may be unknown or poorly documented. It may include copyrighted material, personal data processed without consent, or data unrepresentative of the intended deployment population. The development process may not have included the bias testing, adversarial evaluation, or governance records that the AI Act requires.

Any organisation using an open-source model as a component of a high-risk system inherits these documentation gaps. Module 3 records which open-source components are incorporated, the due diligence performed on each, the licensing terms and their regulatory compatibility, and the residual risks arising from provenance gaps.
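One way to keep these four items together is a small structured record per component. The sketch below is illustrative only: the class name, field names, and the `needs_review` rule are assumptions made for this example, not a schema prescribed by Module 3.

```python
from dataclasses import dataclass, field


@dataclass
class OpenSourceComponent:
    """Hypothetical per-component record mirroring the four items listed above."""

    name: str                      # which open-source component is incorporated
    source: str                    # where it was obtained (repository, publication)
    licence: str                   # licensing terms as published
    licence_compatible: bool       # outcome of the regulatory compatibility check
    due_diligence: list[str] = field(default_factory=list)   # checks performed
    residual_risks: list[str] = field(default_factory=list)  # open provenance gaps

    def needs_review(self) -> bool:
        # Illustrative rule (an assumption): flag the component when the licence
        # is incompatible or when no due diligence has been recorded yet.
        return not self.licence_compatible or not self.due_diligence


component = OpenSourceComponent(
    name="example-encoder",        # hypothetical component name
    source="public model hub",
    licence="Apache-2.0",
    licence_compatible=True,
    residual_risks=["training data composition undocumented"],
)
```

Here `component.needs_review()` is true until at least one due diligence entry has been recorded, which keeps undocumented components visible.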

Where provenance documentation is unavailable, the organisation must compensate through its own testing and evaluation. Sentinel datasets exercising the risk dimensions that the training data disclosures do not address provide one mechanism. Output filtering and validation layers constraining the model’s outputs to the acceptable range for the intended purpose provide another. Continuous monitoring of the model’s behaviour for drift, with automated alerting when output distributions shift beyond defined tolerances, closes the loop.
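The monitoring step can be sketched as a comparison between a baseline output distribution and a recent window of outputs. The example below is a minimal sketch under stated assumptions: it treats model outputs as categorical labels, uses total variation distance as the drift measure, and uses a tolerance of 0.1. The metric, the tolerance, and all function names are illustrative choices, not requirements from the text above.

```python
from collections import Counter
from typing import Iterable


def label_distribution(labels: Iterable[str]) -> dict[str, float]:
    """Convert a sequence of output labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from no outputs")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the sum of absolute differences; 0 means identical, 1 means disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def check_drift(baseline: Iterable[str], current: Iterable[str],
                tolerance: float = 0.1) -> dict:
    """Compare current outputs with the baseline and flag a shift beyond tolerance.

    The default tolerance is an assumed value; set it per system.
    """
    distance = total_variation_distance(
        label_distribution(baseline), label_distribution(current)
    )
    return {"distance": distance, "alert": distance > tolerance}


baseline_outputs = ["approve"] * 80 + ["reject"] * 20
recent_outputs = ["approve"] * 50 + ["reject"] * 50
result = check_drift(baseline_outputs, recent_outputs)
```

In a deployed system the `alert` flag would feed whatever alerting channel the organisation already operates. For continuous outputs, a different distance measure would be needed.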

The SLSA (Supply-chain Levels for Software Artifacts) framework, originally designed for software supply chain security, adapts well to ML artefacts. Level 2 (a hosted build service that generates authenticated provenance metadata) is the minimum practical target. For models downloaded from public repositories, best practice is to download the artefact once, compute a cryptographic hash of it, and store both in the internal model registry, so that all subsequent references use the internal copy and can be verified against the recorded hash.
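The download-once, hash, and register practice can be sketched with the Python standard library alone. This is a minimal sketch: the directory-per-hash layout, the `manifest.json` file, and the function names are assumptions made for illustration, and a real model registry will have its own storage API. SHA-256 is used as the hash.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_model(downloaded: Path, registry_dir: Path, source_url: str) -> dict:
    """Copy the artefact into the registry under its hash and write a manifest."""
    file_hash = sha256_of_file(downloaded)
    target_dir = registry_dir / file_hash          # assumed layout: one directory per hash
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(downloaded, target_dir / downloaded.name)
    manifest = {
        "sha256": file_hash,
        "source_url": source_url,
        "filename": downloaded.name,
    }
    (target_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_model(registry_dir: Path, file_hash: str) -> bool:
    """Recompute the hash of the registered copy and compare it with the manifest."""
    target_dir = registry_dir / file_hash
    manifest = json.loads((target_dir / "manifest.json").read_text())
    return sha256_of_file(target_dir / manifest["filename"]) == manifest["sha256"]
```

Downstream code would then load the model only from the registry path and call `verify_model` before use, so a modified or substituted file is caught before it reaches the system.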

Key outputs

Open-source model provenance assessment
Due diligence documentation per component
Compensating control specifications for provenance gaps