v2.4.0

Calibration within groups tests whether the model’s confidence scores carry consistent meaning across protected subgroups. If the model assigns a 70% probability to applicants from one group and those applicants are indeed successful 70% of the time, the model is well-calibrated for that group. If the same 70% probability corresponds to only 55% actual success in another group, the model is poorly calibrated, and operators relying on the confidence score will be systematically misled for that subgroup.
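The 70%-versus-55% example above can be checked directly: condition on cases that received a given confidence score and compare the observed success rate per subgroup. A minimal sketch on synthetic data (the group labels and rates here are illustrative, not drawn from any real system):

```python
import numpy as np

# Synthetic illustration: the model assigns 0.70 confidence to everyone,
# but outcomes for group "B" only materialise 55% of the time.
rng = np.random.default_rng(0)
groups = np.array(["A"] * 5000 + ["B"] * 5000)
true_rate = np.where(groups == "A", 0.70, 0.55)
outcomes = (rng.uniform(size=10000) < true_rate).astype(int)

def observed_rate(outcomes, groups):
    """Observed success frequency per subgroup at a fixed confidence level."""
    return {g: outcomes[groups == g].mean() for g in np.unique(groups)}

rates = observed_rate(outcomes, groups)
# rates["A"] ≈ 0.70 (well-calibrated), rates["B"] ≈ 0.55 (miscalibrated)
```

The gap between the stated 0.70 and the observed rate for group "B" is exactly the miscalibration that misleads operators relying on the score.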

Reliability diagrams are the standard visualisation tool. They plot predicted probability against observed frequency, with a separate curve for each subgroup. A perfectly calibrated model produces a diagonal line. Deviations from the diagonal indicate miscalibration: overconfident predictions (where predicted probability exceeds actual frequency) or underconfident predictions (the reverse).
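The data behind one curve of a reliability diagram can be computed with simple binning; plotting, with one curve per subgroup, is then left to a charting library. This is an illustrative sketch, not Fairlearn's or AIF360's implementation:

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=10):
    """Points for one reliability-diagram curve: mean predicted probability
    and observed positive frequency within each equal-width bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    mean_pred, obs_freq = [], []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            mean_pred.append(y_prob[in_bin].mean())
            obs_freq.append(y_true[in_bin].mean())
    return np.array(mean_pred), np.array(obs_freq)

# For a well-calibrated model the points hug the diagonal: obs_freq ≈ mean_pred.
rng = np.random.default_rng(1)
p = rng.uniform(size=50000)
y = (rng.uniform(size=50000) < p).astype(int)
mp, of = reliability_curve(y, p)
```

To build the per-subgroup diagram described above, call `reliability_curve` once per subgroup mask and overlay the curves against the diagonal.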

Fairlearn and AI Fairness 360 both support per-subgroup calibration analysis. The AISDP includes reliability diagrams as visual evidence, alongside numerical calibration metrics: the Brier score decomposition (reliability, resolution, uncertainty) per subgroup, and the maximum calibration error across all subgroups.
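As a rough sketch of those two metrics (not the Fairlearn or AIF360 implementations), the Murphy decomposition of the Brier score and a binned maximum calibration error taken over all subgroups might look like:

```python
import numpy as np

def brier_decomposition(y_true, y_prob, n_bins=10):
    """Murphy decomposition of the Brier score into (reliability,
    resolution, uncertainty), using equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    n, base = len(y_true), y_true.mean()
    rel = res = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            rel += m.sum() * (y_prob[m].mean() - y_true[m].mean()) ** 2
            res += m.sum() * (y_true[m].mean() - base) ** 2
    return rel / n, res / n, base * (1 - base)

def max_calibration_error(y_true, y_prob, groups, n_bins=10):
    """Largest per-bin |mean predicted - observed| gap over all subgroups."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    worst = 0.0
    for g in np.unique(groups):
        p, y = y_prob[groups == g], y_true[groups == g]
        idx = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
        for b in range(n_bins):
            m = idx == b
            if m.any():
                worst = max(worst, abs(p[m].mean() - y[m].mean()))
    return worst
```

Up to within-bin variance, reliability − resolution + uncertainty recovers the Brier score, so a low reliability term for every subgroup is the numerical counterpart of each curve sitting on the diagonal.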

Calibration is particularly important for systems where the confidence score is presented to operators as part of the oversight interface. If operators use the confidence score to decide how much scrutiny to apply to a recommendation, miscalibration for specific subgroups means those groups receive inappropriate levels of oversight, undermining the Article 14 human oversight framework.

Key outputs

  • Reliability diagrams per protected subgroup
  • Calibration metrics (Brier score decomposition, maximum calibration error)
  • Calibration impact assessment for operator oversight