Pre-processing mitigations modify the training data before the model encounters it. They are the most accessible class of bias mitigation techniques and are documented in the AISDP with their trade-offs.
Oversampling creates additional copies or synthetic examples of underrepresented subgroups. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic examples by interpolating between existing minority records, which reduces overfitting risk compared to simple duplication. ADASYN (Adaptive Synthetic Sampling) concentrates synthetic generation in boundary regions where the classifier struggles. The main risk of any oversampling is overfitting to the minority class, and simple duplication amplifies that risk, which is why interpolation-based methods are generally preferred.
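The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the core mechanism, not the full library algorithm; in practice you would use an established implementation such as imbalanced-learn's `SMOTE`.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, rng=None):
    """Generate synthetic minority samples by interpolating between a
    randomly chosen minority record and one of its k nearest minority
    neighbours -- the core idea behind SMOTE (simplified sketch)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                       # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]                 # k nearest neighbours per record
    base = rng.integers(0, n, size=n_new)             # pick a base record
    neigh = nn[base, rng.integers(0, k, size=n_new)]  # pick one of its neighbours
    gap = rng.random((n_new, 1))                      # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Example: a 2-D minority class with 5 records, expanded by 10 synthetic points.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_oversample(X_min, n_new=10, k=3, rng=42)
```

Because each synthetic point lies on a line segment between two real minority records, the generated data stays inside the region the minority class already occupies, rather than duplicating exact records.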
Undersampling removes records from overrepresented subgroups to balance the dataset. The risk is discarding potentially useful data, which can degrade the model’s overall performance. Both oversampling and undersampling are validated by comparing the model’s performance on an unaltered holdout set to ensure the technique has not degraded generalisation.
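Random undersampling reduces each class to the size of the rarest one. A minimal sketch (the balanced holdout comparison described above would use the original, unaltered data; production code would typically use imbalanced-learn's `RandomUnderSampler`):

```python
import numpy as np

def undersample(X, y, rng=None):
    """Randomly undersample every class down to the size of the rarest
    class (simplified illustration of random undersampling)."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()                      # size of the rarest class
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=target, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Example: a 90/10 imbalance reduced to a balanced 10/10 sample.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)
X_bal, y_bal = undersample(X, y, rng=0)
```

Note that 80 of the 90 majority-class records are discarded here, which is exactly the information-loss trade-off the AISDP should document.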
Reweighting assigns higher training weights to underrepresented subgroups, ensuring each subgroup contributes equally to the loss function. AI Fairness 360’s reweighting preprocessor computes optimal weights automatically. Reweighting preserves all data while adjusting the model’s attention, making it generally preferable to undersampling for high-risk systems.
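The principle can be shown with a frequency-based weighting sketch. Note this is a simplification: AI Fairness 360's `Reweighing` preprocessor computes weights over the joint distribution of protected attribute and outcome, whereas the sketch below weights by subgroup frequency alone.

```python
import numpy as np

def subgroup_weights(groups):
    """Per-record weights inversely proportional to subgroup frequency,
    normalised so total weight equals the number of records; each
    subgroup then contributes equally to a weighted loss
    (simplified, frequency-only sketch)."""
    groups = np.asarray(groups)
    values, counts = np.unique(groups, return_counts=True)
    freq = dict(zip(values, counts))
    return np.array(
        [len(groups) / (len(values) * freq[g]) for g in groups], dtype=float
    )

# Example: an 80/20 split between subgroups A and B.
groups = ["A"] * 80 + ["B"] * 20
w = subgroup_weights(groups)
```

With an 80/20 split, each A record gets weight 0.625 and each B record 2.5, so both subgroups contribute a total weight of 50 to the loss while every record is retained.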
Synthetic data augmentation uses generative techniques (SDV, Gretel.ai, MOSTLY AI) to create additional training examples for underrepresented subgroups. The AISDP documents the generation algorithm, the validation of synthetic data against real distributions, the proportion of synthetic data in the final training set, and the risks of over-reliance on synthetic data (which may not capture real-world complexity).
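The distribution-validation step can be illustrated with a crude marginal check: how far the per-feature mean and standard deviation of the synthetic data drift from the real data, scaled by the real standard deviation. This is a minimal sketch only; production fidelity evaluation would use richer metrics, such as the quality reports shipped with SDV.

```python
import numpy as np

def marginal_drift(real, synth):
    """Absolute difference in per-feature mean and std between real and
    synthetic data, scaled by the real std. Values near 0 indicate the
    synthetic marginals track the real ones (crude sketch; it ignores
    correlations between features)."""
    real, synth = np.asarray(real, float), np.asarray(synth, float)
    scale = real.std(axis=0) + 1e-12          # guard against zero variance
    mean_gap = np.abs(real.mean(axis=0) - synth.mean(axis=0)) / scale
    std_gap = np.abs(real.std(axis=0) - synth.std(axis=0)) / scale
    return mean_gap, std_gap

# Example: a synthetic generator with a slight mean shift on 3 features.
rng = np.random.default_rng(7)
real = rng.normal(0.0, 1.0, size=(1000, 3))
synth = rng.normal(0.05, 1.0, size=(1000, 3))
mean_gap, std_gap = marginal_drift(real, synth)
```

Recording thresholds for such gaps in the AISDP, alongside the proportion of synthetic records, makes the "validation against real distributions" requirement auditable.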
Key outputs
- Pre-processing technique selection and rationale
- Trade-off analysis (fairness improvement vs accuracy impact)
- Validation results on unaltered holdout set