Regression Tests — Golden Dataset with Per-Subgroup Cases

v2.4.0


A golden dataset of historical inputs with known correct outputs serves as the regression baseline. Every candidate release is evaluated against this dataset to detect behavioural regression. The golden dataset is distinct from the training or evaluation datasets; it is a curated collection of cases selected specifically for regression detection.

The golden dataset must include cases drawn from each protected characteristic subgroup, so that a regression which disproportionately affects a vulnerable population can be detected. A candidate model that maintains overall accuracy but degrades accuracy for a specific demographic group would pass a naive, aggregate-only regression test but fail a subgroup-aware one. The per-subgroup structure of the golden dataset makes this degradation visible.
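The difference between an aggregate check and a subgroup-aware check can be illustrated with a minimal sketch. The `GoldenCase` record, the function names, and the `tolerance` parameter are hypothetical, chosen for illustration only; accuracy stands in for whatever metric the validation gates actually use.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class GoldenCase:
    """One golden case (hypothetical schema, not a prescribed format)."""
    case_id: str
    subgroup: str      # protected characteristic subgroup label
    model_input: Any
    expected: Any      # known correct output


def overall_accuracy(cases: Iterable[GoldenCase],
                     predict: Callable[[Any], Any]) -> float:
    """Aggregate accuracy: the naive regression metric."""
    cases = list(cases)
    hits = sum(predict(c.model_input) == c.expected for c in cases)
    return hits / len(cases)


def accuracy_by_subgroup(cases: Iterable[GoldenCase],
                         predict: Callable[[Any], Any]) -> Dict[str, float]:
    """Accuracy computed separately for each subgroup."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.subgroup] += 1
        if predict(case.model_input) == case.expected:
            correct[case.subgroup] += 1
    return {group: correct[group] / total[group] for group in total}


def subgroup_regressions(baseline: Dict[str, float],
                         candidate: Dict[str, float],
                         tolerance: float = 0.0
                         ) -> Dict[str, Tuple[float, float]]:
    """Return subgroups whose accuracy dropped by more than `tolerance`.

    A subgroup missing from the candidate results is treated as 0.0,
    so lost coverage is reported as a regression rather than ignored.
    """
    regressions = {}
    for group, base_acc in baseline.items():
        cand_acc = candidate.get(group, 0.0)
        if base_acc - cand_acc > tolerance:
            regressions[group] = (base_acc, cand_acc)
    return regressions
```

In this sketch a release gate would compare `accuracy_by_subgroup` for the current production model against the candidate and block the release if `subgroup_regressions` returns anything, even when `overall_accuracy` is unchanged.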

The golden dataset is version-controlled and expanded over time as new edge cases are discovered through production operation, incident investigation, or user feedback. Cases that previously caused errors, near-misses, or fairness concerns should be added to the golden dataset to prevent recurrence. The regression test results, including per-subgroup breakdowns, are retained as Module 5 evidence and feed into the model validation gates.
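One way to make dataset growth auditable is to validate each new case on entry and to record a content fingerprint alongside every set of regression results, so the evidence shows exactly which dataset version was tested. The sketch below assumes cases are stored as plain dictionaries; the field names, the `source` values, and both function names are hypothetical and not part of any prescribed schema.

```python
import hashlib
import json
from typing import Any, Dict, List

# Hypothetical required fields for a golden case record.
REQUIRED_FIELDS = {"case_id", "subgroup", "input", "expected", "source"}

# Origins named in the text above: production operation,
# incident investigation, user feedback (plus the initial curation).
ALLOWED_SOURCES = {"initial", "production", "incident", "user_feedback"}


def add_case(dataset: List[Dict[str, Any]],
             case: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return a new dataset with `case` appended, after basic validation.

    The input list is not modified, so the previous version stays intact.
    """
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        raise ValueError(f"case is missing fields: {sorted(missing)}")
    if case["source"] not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {case['source']!r}")
    if any(existing["case_id"] == case["case_id"] for existing in dataset):
        raise ValueError(f"duplicate case_id: {case['case_id']!r}")
    return dataset + [case]


def dataset_fingerprint(dataset: List[Dict[str, Any]]) -> str:
    """SHA-256 over a canonical JSON form, independent of case order."""
    ordered = sorted(dataset, key=lambda c: c["case_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint with each per-subgroup regression report ties the retained evidence to a specific dataset state, in addition to whatever commit identifier the version control system provides.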

Key outputs

Golden dataset with per-subgroup case coverage
Version-controlled dataset expanded over time with discovered edge cases
Per-subgroup regression analysis for each candidate release
Module 5 AISDP evidence