Counterfactual fairness is the most direct test of whether a model uses a protected characteristic in its decisions. For an individual prediction, the protected characteristic is changed while all other features are held constant, and the model’s output is observed. If flipping gender from male to female changes the prediction, the model is using gender directly in the decision; proxy features, which are held constant under this manipulation, must be probed separately.
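The per-instance flip can be sketched as follows. This is a minimal illustration, not the assured methodology: the scoring function, feature layout, and `GENDER_IDX` column are all hypothetical stand-ins for a real tabular model.

```python
import numpy as np

# Hypothetical feature layout: column 2 holds a binary-encoded
# protected characteristic (0 = male, 1 = female).
GENDER_IDX = 2

def predict(x: np.ndarray) -> float:
    """Stand-in scoring model returning a probability of a positive outcome.
    Note the deliberately non-zero weight on the gender column."""
    weights = np.array([0.4, 0.3, 0.25, 0.05])
    return float(1.0 / (1.0 + np.exp(-(x @ weights - 0.5))))

def counterfactual_flip(x: np.ndarray, idx: int = GENDER_IDX) -> np.ndarray:
    """Copy of x with the protected attribute flipped, all other features held constant."""
    x_cf = x.copy()
    x_cf[idx] = 1.0 - x_cf[idx]
    return x_cf

x = np.array([0.8, 0.2, 0.0, 0.5])  # original instance (gender = 0)
delta = predict(counterfactual_flip(x)) - predict(x)
print(f"prediction shift under gender flip: {delta:+.4f}")
```

A non-zero `delta` here is direct evidence that the model's output depends on the protected input itself.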
The Technical SME applies counterfactual testing to a representative sample of the evaluation dataset. The proportion of predictions that change under counterfactual manipulation is reported, along with the direction and magnitude of the changes. Alibi Explain’s counterfactual explanations are particularly useful for this analysis, as they answer the question “what would need to be different for the outcome to change?” in concrete, per-instance terms.
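Over a sample, the reported quantities reduce to a flip-rate summary. A sketch of that aggregation, using synthetic data and a hypothetical scoring function in place of the evaluation dataset and production model:

```python
import numpy as np

rng = np.random.default_rng(0)
GENDER_IDX = 2    # hypothetical protected-attribute column
THRESHOLD = 0.5   # decision threshold on the predicted probability

def predict_proba(X: np.ndarray) -> np.ndarray:
    """Stand-in batch scoring function."""
    w = np.array([0.4, 0.3, 0.25, 0.05])
    return 1.0 / (1.0 + np.exp(-(X @ w - 0.5)))

# Synthetic "representative sample" with a binary protected attribute.
X = rng.random((1000, 4))
X[:, GENDER_IDX] = rng.integers(0, 2, size=1000)

# Counterfactual copies: protected attribute flipped, everything else constant.
X_cf = X.copy()
X_cf[:, GENDER_IDX] = 1.0 - X_cf[:, GENDER_IDX]

p, p_cf = predict_proba(X), predict_proba(X_cf)
flipped = (p >= THRESHOLD) != (p_cf >= THRESHOLD)  # did the decision change?
shift = p_cf - p                                    # signed score change

print(f"proportion of changed decisions: {flipped.mean():.3f}")
print(f"mean shift (direction):          {shift.mean():+.4f}")
print(f"mean |shift| (magnitude):        {np.abs(shift).mean():.4f}")
```

The three printed figures correspond to the proportion, direction, and magnitude reported in the analysis; per-instance explanations of *why* a decision changed are where a tool such as Alibi Explain adds value beyond this aggregate view.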
Counterfactual testing is computationally tractable for tabular models where the protected characteristic is a discrete input feature. For models where the protected characteristic is entangled with other features (in text or image data where gender may be expressed through language patterns or visual features, not as a discrete input), the testing methodology becomes more complex. The AISDP documents the testing methodology, its applicability to the model architecture, and any limitations.
The results are interpreted in context. A small proportion of changed predictions may be acceptable if the changes are small in magnitude and the model’s overall fairness profile (as measured by the other metrics) is satisfactory. A large proportion of changed predictions, or changes concentrated in specific subgroups, indicates that the model is relying on protected characteristics and requires mitigation.
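The subgroup concentration check can be sketched as a grouped flip-rate comparison. The per-instance flip flags and subgroup labels below are randomly generated stand-ins; in practice they come from the counterfactual test results and a real attribute such as an age band.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Stand-ins for real test outputs: which instances flipped, and a
# hypothetical three-level subgroup label for each instance.
flipped = rng.random(n) < 0.05
subgroup = rng.integers(0, 3, size=n)

overall_rate = flipped.mean()
print(f"overall flip rate: {overall_rate:.3f}")
for g in np.unique(subgroup):
    mask = subgroup == g
    print(f"subgroup {g}: flip rate {flipped[mask].mean():.3f} (n={mask.sum()})")
```

A flip rate in one subgroup well above the overall rate indicates that reliance on the protected characteristic is concentrated there, which strengthens the case for mitigation even when the overall proportion looks acceptable.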
Key outputs
- Counterfactual test methodology documentation
- Proportion and direction of changed predictions
- Subgroup-level analysis of counterfactual sensitivity