
The demographic composition of a dataset strongly influences whether the model will perform equitably across the populations on whom it operates. Article 10(2)(f) requires that training, validation, and test data be examined, in light of the intended purpose of the AI system, in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law. The Technical SME presents composition statistics both in aggregate and disaggregated by relevant subgroups.

The documentation records the distribution of protected characteristics within the dataset: age bands, gender, ethnicity, disability status, and any other characteristics relevant to the system’s deployment context. These distributions are compared against the deployment population to identify over-representation and under-representation. Where protected characteristic data is not directly available in the dataset, the documentation records this as a limitation and describes any proxy-based or inferred demographic analysis conducted (with appropriate caveats about the reliability of proxy-based inference).
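The comparison against the deployment population can be sketched as the ratio of each subgroup's share in the dataset to its share in the deployment population. The following is a minimal illustration, not a prescribed method: the function names, the age bands, the deployment shares, and the 0.8 and 1.25 flagging thresholds are all assumptions chosen for the example, and the Technical SME would need to set them for the actual system.

```python
from collections import Counter


def representation_ratios(dataset_values, deployment_shares):
    """Return (dataset share / deployment share) for each subgroup.

    dataset_values: iterable of subgroup labels, one per record.
    deployment_shares: dict mapping subgroup label to its expected share
        of the deployment population (hypothetical reference figures).
    """
    counts = Counter(dataset_values)
    total = sum(counts.values())
    ratios = {}
    for group, expected_share in deployment_shares.items():
        if expected_share <= 0:
            raise ValueError(f"deployment share for {group!r} must be positive")
        observed_share = counts.get(group, 0) / total if total else 0.0
        ratios[group] = observed_share / expected_share
    return ratios


def flag_imbalances(ratios, low=0.8, high=1.25):
    """Label each subgroup; the thresholds are illustrative assumptions."""
    flags = {}
    for group, ratio in ratios.items():
        if ratio < low:
            flags[group] = "under-represented"
        elif ratio > high:
            flags[group] = "over-represented"
        else:
            flags[group] = "within tolerance"
    return flags


# Hypothetical example: age bands recorded in the dataset versus
# assumed shares in the deployment population.
age_bands = ["18-34"] * 45 + ["35-64"] * 45 + ["65+"] * 10
deployment = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

ratios = representation_ratios(age_bands, deployment)
for group, label in flag_imbalances(ratios).items():
    print(f"{group}: ratio {ratios[group]:.2f} ({label})")
```

A ratio of 1.0 means the subgroup appears in the dataset in the same proportion as in the deployment population. A subgroup that is absent from the dataset receives a ratio of 0.0 and is flagged rather than silently skipped.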

The population completeness assessment asks whether the dataset represents the full range of persons and groups on whom the system will operate. Under-representation of a subgroup tends to degrade the model’s performance for that group and creates fairness risk. Where the deployment population spans multiple EU member states with different demographic profiles, the completeness assessment should address each target member state individually.
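A per-member-state completeness check could be sketched as follows. This is an illustrative assumption about how the assessment might be automated, not a required procedure: the record fields `member_state` and `subgroup`, the function name, and the minimum count of 30 records per subgroup are hypothetical and chosen only for the example.

```python
from collections import Counter


def completeness_gaps(records, target_states, expected_subgroups, min_count=30):
    """List subgroups that fall below min_count in each target member state.

    records: iterable of dicts with hypothetical keys "member_state"
        and "subgroup".
    Returns {state: {subgroup: observed_count}} containing only the
    state/subgroup pairs below the (illustrative) minimum.
    """
    counts = Counter((r["member_state"], r["subgroup"]) for r in records)
    gaps = {}
    for state in target_states:
        short = {}
        for group in expected_subgroups:
            observed = counts.get((state, group), 0)
            if observed < min_count:
                short[group] = observed
        if short:
            gaps[state] = short
    return gaps


# Hypothetical example: two subgroups across three target member states.
records = (
    [{"member_state": "DE", "subgroup": "A"}] * 40
    + [{"member_state": "DE", "subgroup": "B"}] * 5
    + [{"member_state": "FR", "subgroup": "A"}] * 35
)
gaps = completeness_gaps(records, ["DE", "FR", "PL"], ["A", "B"])
for state, short in gaps.items():
    print(state, short)
```

Iterating over the target member states, rather than over the states that happen to appear in the data, means a member state with no records at all is reported as a gap for every expected subgroup.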

The demographic composition feeds directly into the pre-training bias assessment. Distributional imbalances identified at this stage inform the Technical SME’s bias mitigation strategy and may influence the model selection decision (for example, a model requiring less training data may be preferable when certain subgroups are under-represented).

Key outputs

  • Demographic composition report with protected characteristic distributions
  • Deployment population comparison
  • Under-representation identification and flagging