Bias detectability asks whether fairness metrics can be computed at the subgroup level, whether the model can be interrogated for proxy variable effects, and whether the architecture supports fairness-aware training or post-hoc calibration.
The assessment determines whether the candidate architecture supports feature attribution methods (SHAP, LIME, integrated gradients) that can identify proxy variable effects. Models that output calibrated probability scores are more amenable to fairness analysis than models producing only ranked outputs or categorical labels, because probability scores enable threshold-based fairness metrics such as equalised odds and calibration within groups.
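To make the threshold-based point concrete, the following is a minimal numpy sketch of an equalised-odds gap computed from probability scores, disaggregated by subgroup. The function name, synthetic scores, and group labels are illustrative stand-ins for a real model's outputs, not part of any particular toolkit.

```python
import numpy as np

def equalised_odds_gap(y_true, y_prob, group, threshold=0.5):
    """Largest between-group gaps in TPR and FPR at a decision threshold.

    This analysis requires probability scores: a model emitting only
    categorical labels fixes the threshold implicitly and hides it.
    """
    y_pred = (y_prob >= threshold).astype(int)
    tprs, fprs = [], []
    for g in np.unique(group):
        m = group == g
        pos = y_true[m] == 1
        tprs.append(y_pred[m][pos].mean() if pos.any() else np.nan)
        fprs.append(y_pred[m][~pos].mean() if (~pos).any() else np.nan)
    return np.nanmax(tprs) - np.nanmin(tprs), np.nanmax(fprs) - np.nanmin(fprs)

# Hypothetical scores for two subgroups, with a small group-dependent shift.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
group = rng.integers(0, 2, 200)
y_prob = np.clip(y_true * 0.4 + group * 0.1 + rng.uniform(0, 0.5, 200), 0, 1)
tpr_gap, fpr_gap = equalised_odds_gap(y_true, y_prob, group)
```

Sweeping `threshold` over the score range shows how the same model can satisfy or violate equalised odds depending on the operating point, which is why scored outputs score higher on detectability than hard labels.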
For ensemble methods, SHAP values provide strong proxy variable detection at the individual prediction level. For deep neural networks, feature attribution is possible through KernelSHAP or DeepSHAP, though with lower precision. For LLMs, bias detection typically relies on benchmarking across demographic categories in the test set rather than per-prediction feature attribution; the assessment specifies which bias detection methodologies are applicable and their limitations.
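As an illustration of proxy variable detection on an ensemble model: SHAP itself requires the `shap` package, so the sketch below uses scikit-learn's permutation importance as a dependency-light stand-in for the same idea. The protected attribute, the proxy feature, and all data are synthetic and hypothetical; the point is that attribution surfaces a proxy the model has learned even though the protected attribute was never a training feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
protected = rng.integers(0, 2, n)           # protected attribute, excluded from training
proxy = protected + rng.normal(0, 0.3, n)   # e.g. a postcode-like feature tracking it
noise = rng.normal(0, 1, n)                 # an uninformative feature
y = (protected + rng.normal(0, 0.5, n) > 0.5).astype(int)  # outcome driven by protected attr

X = np.column_stack([proxy, noise])
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Held-out importance: permuting the proxy should cost far more accuracy
# than permuting noise, flagging it as the channel for the protected attribute.
imp = permutation_importance(model, Xte, yte, n_repeats=10, random_state=0)
```

SHAP values give the same diagnosis at the individual-prediction level rather than in aggregate, which is why per-prediction attribution scores higher on detectability than global importance alone.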
The score reflects the combined strength of proxy variable detection, disaggregated fairness evaluation, and the availability of fairness-aware training or post-hoc calibration methods for the candidate architecture.
Key outputs
- Bias detectability score per candidate model
- Applicable fairness evaluation methodology