The composition section of the dataset documentation captures the structural characteristics that enable a reviewer to understand the dataset’s scale, shape, and technical format. For each dataset, the Technical SME records the total record count, the number of features (columns), the storage size, the schema (field names, data types, value formats), and the immutable version identifier.
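A composition record of this kind can be expressed as a small, typed structure. The following is a minimal sketch using Python dataclasses; the field names, dataset name, counts, and version string are all illustrative, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldEntry:
    """One documented schema field (see the schema paragraph below)."""
    name: str
    dtype: str
    description: str
    permitted_values: str          # value range or enumeration, as documented
    personal_data: bool = False
    special_category: bool = False

@dataclass(frozen=True)
class CompositionRecord:
    """Structural characteristics recorded per dataset."""
    dataset_name: str
    record_count: int
    feature_count: int
    storage_bytes: int
    version_id: str                # immutable snapshot identifier
    schema: tuple = ()             # tuple of FieldEntry

# Illustrative values only
record = CompositionRecord(
    dataset_name="claims_training",
    record_count=1_250_000,
    feature_count=42,
    storage_bytes=3_400_000_000,
    version_id="sha256:0f3a9b1c2d4e5f60",
    schema=(FieldEntry("claim_amount", "float64", "Claim value in EUR", ">= 0"),),
)
```

Freezing the dataclasses keeps a completed record immutable, which matches the documentation's intent that composition entries describe a fixed snapshot rather than a live table.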
The version identifier is critical for traceability. It links each dataset snapshot to the model versions trained on it, enabling the AISDP to state precisely which data was used to train each model. Tools such as DVC, Delta Lake, and LakeFS assign immutable version identifiers to dataset snapshots; cloud-native versioning (S3 object versioning, for example) provides an alternative for simpler architectures.
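Tools like DVC derive such identifiers from the data's content, so the identifier changes if and only if the data changes. A minimal stdlib sketch of that content-addressing idea (the function name and the row-based serialization are assumptions for illustration, not any tool's actual scheme):

```python
import hashlib

def dataset_version_id(rows):
    """Derive a content-addressed version identifier from a dataset's rows.

    Identical content always yields the same identifier; any change to
    any row yields a different one, giving an immutable snapshot ID.
    """
    h = hashlib.sha256()
    for row in rows:
        h.update(row.encode("utf-8"))
        h.update(b"\n")  # delimit rows so concatenation is unambiguous
    return "sha256:" + h.hexdigest()[:16]
```

In practice the AISDP would record the identifier emitted by the versioning tool itself; the sketch only shows why such identifiers are stable and tamper-evident.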
Schema documentation should be sufficiently detailed for a technical reviewer to reconstruct the data’s structure without accessing the data itself. Each field entry records the field name, data type, permitted values or value range, a brief description of the field’s meaning, and whether it contains personal data or special category data. Automated schema validation (using Pandera for Pandas DataFrames or dbt’s built-in tests for SQL pipelines) should enforce consistency between the documented schema and the actual data, catching structural drift before it affects model training.
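Pandera and dbt provide this enforcement natively; the shape of the check can be sketched in plain Python. The schema format below (`{field: (type, allowed_values_or_None)}`) is an illustrative simplification, not either tool's API:

```python
def validate_schema(documented, records):
    """Check records (list of dicts) against a documented schema.

    documented maps field name -> (expected Python type, allowed values
    or None when any value of that type is permitted). Returns a list
    of human-readable errors; an empty list means the data matches.
    """
    errors = []
    for i, rec in enumerate(records):
        if set(rec) != set(documented):
            drift = sorted(set(rec) ^ set(documented))
            errors.append(f"row {i}: field mismatch {drift}")
            continue
        for name, (ftype, allowed) in documented.items():
            val = rec[name]
            if not isinstance(val, ftype):
                errors.append(
                    f"row {i}: {name} expected {ftype.__name__}, "
                    f"got {type(val).__name__}"
                )
            elif allowed is not None and val not in allowed:
                errors.append(f"row {i}: {name} value {val!r} not permitted")
    return errors

# Illustrative documented schema and data
SCHEMA = {"claim_id": (int, None), "status": (str, {"open", "closed"})}
good = [{"claim_id": 1, "status": "open"}]
bad = [{"claim_id": "x", "status": "pending"}]
```

Run in CI against a sample of each dataset version, a check like this catches structural drift, renamed columns, type changes, or out-of-range values, before it reaches model training.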
Documentation depth should be proportionate to the dataset’s role. Training datasets for high-risk systems warrant comprehensive treatment; static reference datasets warrant a lighter approach. The AI System Assessor documents the standard applied to each dataset category and the rationale for the proportionality decision.
Key outputs
- Composition record per dataset (count, schema, version)
- Schema validation configuration
- Proportionality rationale for documentation depth