Population Representativeness
Population representativeness asks whether the training, validation, and testing datasets adequately represent the full range of individuals on whom the system will operate. Article 10(3) requires that datasets be “sufficiently representative,” and this requirement demands a structured assessment rather than a general assertion.
The Technical SME defines the deployment population: the specific demographic, geographic, and contextual characteristics of the people the system will serve or affect. The definition is derived from the system’s intended purpose (AISDP Module 1) and the conditions of use. A recruitment screening system intended for use by employers across the EU/EEA has a deployment population spanning all EU/EEA member states and the demographic diversity within them. A medical diagnostic system intended for use in a specific hospital has a narrower deployment population defined by the hospital’s patient demographics.
The representativeness assessment compares the dataset’s composition against the deployment population across every measured dimension: demographic subgroups, geographic regions, temporal periods, and any domain-specific stratification relevant to the system’s purpose. Statistical tests quantify how far the dataset diverges from the deployment population: a chi-squared goodness-of-fit test for categorical distributions and a Kolmogorov-Smirnov test for continuous distributions.
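The two test statistics named above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: the function names, the region categories, and the sample values are all hypothetical, and it computes only the statistics. In practice a statistics library (for example `scipy.stats.chisquare` and `scipy.stats.ks_2samp`) would normally be used to obtain p-values as well.

```python
from bisect import bisect_right


def chi_squared_statistic(observed_counts, expected_proportions):
    """Goodness-of-fit statistic for a categorical dimension.

    observed_counts: category -> count in the dataset.
    expected_proportions: category -> share of the deployment population.
    """
    total = sum(observed_counts.values())
    statistic = 0.0
    for category, proportion in expected_proportions.items():
        expected = total * proportion
        observed = observed_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the two empirical CDFs.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical data: region counts in the dataset vs. population shares.
dataset_regions = {"north": 50, "south": 30, "east": 20}
population_shares = {"north": 0.4, "south": 0.4, "east": 0.2}
chi2 = chi_squared_statistic(dataset_regions, population_shares)

# Hypothetical continuous feature sampled from the dataset and from
# a reference sample of the deployment population.
ks = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A larger statistic indicates a larger divergence between the dataset and the deployment population on that dimension. Whether a given value counts as a gap depends on the significance level and sample sizes chosen for the assessment.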
Where the representativeness assessment reveals gaps, the compensating controls documented in Article 65 (synthetic data augmentation, transfer learning, stratified sampling, deployment restrictions) apply. The assessment results, including the statistical test outputs and the compensating control specifications, are retained as Module 4 evidence.
Key outputs
- Population representativeness assessment
- Statistical comparison of dataset composition against deployment population
- Compensating control specifications for identified gaps
Underrepresented Subgroup Identification
The population representativeness assessment identifies gaps at the aggregate level. Underrepresented subgroup identification examines the specific groups most at risk of inadequate model performance and extends the analysis to intersectional combinations.
The Technical SME examines each protected characteristic subgroup’s representation in the training data, computing the ratio of the subgroup’s dataset proportion to its deployment population proportion. Subgroups with ratios substantially below 1.0 are flagged. The analysis then extends to intersectional combinations: female applicants over 55, disabled applicants from ethnic minority backgrounds, and so on. These intersectional subgroups frequently have critically small cell sizes even when each individual characteristic is adequately represented.
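The ratio calculation described above can be sketched as follows. This is an illustrative example under stated assumptions: the function names, the subgroup counts, and the population shares are hypothetical, and the 0.8 cut-off is only a placeholder for "substantially below 1.0", since the text does not fix a numeric threshold.

```python
def representation_ratios(dataset_counts, population_proportions):
    """Ratio of each subgroup's dataset share to its population share."""
    total = sum(dataset_counts.values())
    ratios = {}
    for group, population_share in population_proportions.items():
        dataset_share = dataset_counts.get(group, 0) / total
        ratios[group] = dataset_share / population_share
    return ratios


def flag_underrepresented(ratios, threshold=0.8):
    """Return subgroups whose ratio falls below the (assumed) threshold."""
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical intersectional cells keyed by (sex, age band).
dataset_counts = {
    ("female", "55+"): 30,
    ("female", "<55"): 470,
    ("male", "55+"): 170,
    ("male", "<55"): 330,
}
population_proportions = {
    ("female", "55+"): 0.10,
    ("female", "<55"): 0.40,
    ("male", "55+"): 0.10,
    ("male", "<55"): 0.40,
}

ratios = representation_ratios(dataset_counts, population_proportions)
flagged = flag_underrepresented(ratios)
```

In this made-up example the dataset matches the population on each characteristic taken alone (50% female, 20% aged 55 or over), yet the intersectional cell of female applicants over 55 has a ratio of roughly 0.3 and is the only cell flagged.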
Cell size thresholds determine when reliable analysis is possible. A common rule of thumb is at least 30 instances for basic performance metrics and 100 or more for fairness metrics with meaningful confidence intervals. Subgroups below these thresholds are documented as data-insufficient, and the AISDP states this limitation rather than reporting unreliable metrics.
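A cell size check along these lines might look like the sketch below. The 30 and 100 thresholds come from the text; the three-tier labels and the function name are assumptions made for illustration, and an organisation may well choose a different scheme.

```python
BASIC_METRICS_MIN = 30      # minimum instances for basic performance metrics
FAIRNESS_METRICS_MIN = 100  # minimum instances for reliable fairness metrics


def classify_cell(count):
    """Classify a subgroup cell by what analysis its size supports.

    The tier labels are hypothetical; only the thresholds are from the text.
    """
    if count < BASIC_METRICS_MIN:
        return "data-insufficient"
    if count < FAIRNESS_METRICS_MIN:
        return "basic-metrics-only"
    return "fairness-metrics-supported"


# Hypothetical intersectional cell sizes.
cell_sizes = {
    "female, 55+": 22,
    "disabled, ethnic minority": 64,
    "male, under 55": 1450,
}
cell_status = {group: classify_cell(n) for group, n in cell_sizes.items()}
```

Cells that are not labelled "fairness-metrics-supported" would be recorded in the AISDP as a stated limitation for the affected metrics, in place of a reported value.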
For each underrepresented subgroup, the documentation records the group definition, the current representation level, the deployment population proportion (where available), the impact on model performance (if measurable with the available data), and the mitigation applied (oversampling, synthetic augmentation, model selection favouring architectures with lower data requirements, or deployment restrictions limiting use for the affected population).
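One way to hold these fields in a structured register is sketched below. The class name, field names, and example values are hypothetical; the fields simply mirror the items listed in the paragraph above, with optional values for the items the text marks as conditional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubgroupRegisterEntry:
    """One row of a hypothetical underrepresented subgroup register."""

    group_definition: str
    dataset_proportion: float
    # Recorded only where a deployment population figure is available.
    deployment_proportion: Optional[float] = None
    # Recorded only if measurable with the available data.
    performance_impact: Optional[str] = None
    mitigations: List[str] = field(default_factory=list)

    def representation_ratio(self) -> Optional[float]:
        """Dataset share divided by population share, if the latter is known."""
        if not self.deployment_proportion:
            return None
        return self.dataset_proportion / self.deployment_proportion


# Hypothetical entry for illustration only.
entry = SubgroupRegisterEntry(
    group_definition="female applicants aged 55 or over",
    dataset_proportion=0.03,
    deployment_proportion=0.10,
    performance_impact=None,  # cell too small to measure reliably
    mitigations=["oversampling", "deployment restriction"],
)
```

Keeping the conditional fields optional makes a missing value explicit in the register, so an absent population figure or an unmeasurable impact is visible as such.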
Key outputs
- Underrepresented subgroup register
- Cell size analysis for intersectional subgroups
- Mitigation strategy per underrepresented subgroup