v2.4.0

Adversarial Examples — Attack Vectors & Controls (Adversarial Training, Input Validation, Ensemble Methods)

Adversarial examples are inputs crafted with imperceptible perturbations that cause the model to produce incorrect outputs with high confidence. This threat affects image classification, speech recognition, and other perceptual AI systems, as well as tabular data models through feature manipulation. A loan applicant who slightly modifies their reported income to cross a decision boundary is executing a real-world adversarial attack.

Controls include adversarial training, which incorporates adversarial examples in the training data to improve the model’s robustness to perturbations. Input validation detects out-of-distribution inputs that may indicate adversarial manipulation, flagging inputs whose feature distributions fall outside the training data’s range. Ensemble methods, which aggregate predictions from multiple models, are more robust to adversarial perturbations than single models because an adversarial example crafted for one model is unlikely to fool all models in the ensemble.
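The mechanics of both the perturbation and the input-validation control can be sketched on a toy model. The example below uses an invented logistic-regression scorer (the weights, epsilon, and feature ranges are illustrative, not taken from any real system): FGSM shifts each feature by a small step in the direction of the input gradient's sign, and a simple range check flags inputs that fall outside the training data's observed bounds.

```python
import numpy as np

# Hypothetical logistic-regression credit model; w and b are illustrative.
w = np.array([0.8, -0.5, 0.3])
b = -0.2

def predict_proba(x):
    """Probability of the positive class (e.g. 'approve')."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def fgsm_perturb(x, epsilon=0.15):
    """Fast Gradient Sign Method: nudge each feature by epsilon in the
    direction that increases the positive-class score. For logistic
    regression the input gradient of the logit is simply w."""
    return x + epsilon * np.sign(w)

def in_training_range(x, feat_min, feat_max):
    """Input validation: flag inputs outside the observed training range."""
    return bool(np.all(x >= feat_min) and np.all(x <= feat_max))

x = np.array([0.4, 0.6, 0.1])   # benign applicant features (scaled to [0, 1])
x_adv = fgsm_perturb(x)         # adversarially shifted features

print(predict_proba(x), predict_proba(x_adv))  # adversarial score is higher
```

Adversarial training, in this framing, simply means generating such perturbed inputs during training and fitting the model on them alongside the clean data.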

Regular adversarial testing as part of the CI pipeline robustness gate provides ongoing verification. Module 9 documents the adversarial robustness evaluation methodology, the attack types tested (FGSM, PGD, and C&W for neural networks; feature perturbation for tabular models), the model's measured robustness, and the residual risk for attack types where full robustness cannot be achieved. Module 5 should include adversarial robustness metrics alongside standard accuracy metrics.
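A robustness gate of the kind described above can be wired into CI as a simple threshold check. The function name, the 0.70 robust-accuracy floor, and the 0.15 clean-vs-robust gap below are illustrative choices, not values mandated by any framework.

```python
def robustness_gate(clean_acc: float, robust_acc: float,
                    min_robust: float = 0.70, max_gap: float = 0.15) -> bool:
    """Pass only if accuracy under attack is acceptable in absolute terms
    and has not fallen too far below clean accuracy."""
    return robust_acc >= min_robust and (clean_acc - robust_acc) <= max_gap

# Hypothetical measurements from an adversarial evaluation run:
print(robustness_gate(0.92, 0.81))  # True  -> gate passes
print(robustness_gate(0.92, 0.55))  # False -> pipeline fails
```

In a real pipeline the two accuracies would come from the evaluation harness, and a `False` result would fail the build.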

Key outputs

  • Adversarial training integration (where applicable)
  • Input validation for out-of-distribution detection
  • CI pipeline adversarial testing (robustness gate integration)
  • Module 5 and Module 9 AISDP evidence

Model Inversion — Controls (Output Granularity Restriction, Differential Privacy, Probing Monitoring)

Model inversion attacks use the model’s outputs, including confidence scores and probability distributions, to reconstruct information about the training data. In a classification model, inversion can recover representative examples of each class. For models trained on personal data, this can expose sensitive information about individuals in the training set.

Restricting the granularity of output information is the most effective countermeasure. Returning only the top prediction or a coarsened confidence band, rather than full probability distributions, reduces the information available to an attacker. For classification systems, returning a binary decision (approve/reject) with a broad confidence category (high/medium/low) rather than a precise probability score significantly limits the inversion attack surface.
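A coarsening step of this kind might look as follows; the band edges and labels are illustrative choices, not prescribed values.

```python
def coarsen(prob: float, threshold: float = 0.5):
    """Map a raw probability to a binary decision plus a broad confidence
    band, instead of exposing the full score to the consumer."""
    decision = "approve" if prob >= threshold else "reject"
    margin = abs(prob - threshold)  # distance from the decision boundary
    if margin >= 0.35:
        band = "high"
    elif margin >= 0.15:
        band = "medium"
    else:
        band = "low"
    return decision, band

print(coarsen(0.97))  # ('approve', 'high')
print(coarsen(0.58))  # ('approve', 'low')
```

Because many distinct raw scores collapse into the same (decision, band) pair, an attacker reconstructing training data from outputs has far less signal to work with.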

Differential privacy during training provides a formal guarantee that the model’s outputs do not reveal disproportionate information about any individual training record. Monitoring output patterns for signs of systematic probing, where a consumer submits inputs designed to explore the model’s decision boundary, supports early detection. Module 9 captures the model inversion threat, the output granularity restrictions in place, and any differential privacy parameters applied during training.
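Probing detection can be as simple as tracking, per consumer, the fraction of queries that score near the decision boundary. The monitor below is a heuristic sketch; the boundary margin and thresholds are invented for illustration.

```python
from collections import defaultdict

class ProbingMonitor:
    """Flag consumers whose queries cluster near the decision boundary,
    a common signature of systematic boundary exploration."""

    def __init__(self, boundary=0.5, margin=0.05,
                 min_queries=20, max_ratio=0.5):
        self.boundary, self.margin = boundary, margin
        self.min_queries, self.max_ratio = min_queries, max_ratio
        self.total = defaultdict(int)   # queries per consumer
        self.near = defaultdict(int)    # near-boundary queries per consumer

    def record(self, consumer_id: str, score: float) -> None:
        self.total[consumer_id] += 1
        if abs(score - self.boundary) <= self.margin:
            self.near[consumer_id] += 1

    def is_suspicious(self, consumer_id: str) -> bool:
        n = self.total[consumer_id]
        return n >= self.min_queries and self.near[consumer_id] / n > self.max_ratio
```

A flagged consumer would feed into the alerting path rather than being blocked automatically, since legitimate traffic can occasionally concentrate near the boundary.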

Key outputs

  • Output granularity restriction policy (coarsened confidence bands)
  • Differential privacy parameters (if applied)
  • Probing pattern monitoring and alerting
  • Module 9 AISDP documentation

Federated/Distributed Training Risks (Poisoned Gradients, Data Inference, Aggregation Manipulation)

Organisations using federated learning or distributed training across multiple data holders face threats specific to these architectures. Malicious participants can submit poisoned gradient updates that corrupt the global model, infer information about other participants’ data from gradient exchanges, or exploit the aggregation protocol to manipulate the training outcome.

Controls include secure aggregation protocols that prevent the central coordinator from seeing individual gradient updates; differential privacy applied to gradient updates to limit information leakage from any single participant; Byzantine-robust aggregation methods that detect and exclude anomalous gradient updates; and participant authentication and access controls ensuring that only authorised parties contribute to the training process.
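Byzantine-robust aggregation can be illustrated with a coordinate-wise median, one common robust aggregator; the participant updates below are invented for the sketch.

```python
import numpy as np

def median_aggregate(updates):
    """Coordinate-wise median of participant gradient updates. Unlike a
    plain mean, a minority of poisoned updates cannot drag the aggregate
    arbitrarily far from the honest consensus."""
    return np.median(np.stack(updates), axis=0)

honest = [np.array([0.10, -0.20]),
          np.array([0.12, -0.18]),
          np.array([0.09, -0.21])]
poisoned = np.array([100.0, 100.0])  # a malicious participant's update

agg = median_aggregate(honest + [poisoned])
print(agg)  # stays close to the honest updates; a mean would not
```

Other robust aggregators (trimmed mean, Krum) follow the same principle: bound the influence any single participant can exert on the global model.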

Audit trails must record every gradient exchange and aggregation step to support forensic investigation. Organisations using federated learning should document the architecture, the security controls, the trust model (which participants are trusted, what verification mechanisms are in place), and the residual risks in both Module 5 (Architecture) and Module 9 (Cybersecurity). If the system does not use federated or distributed training, this threat category is documented as not applicable in the threat model.

Key outputs

  • Federated/distributed training architecture documentation (if applicable)
  • Secure aggregation, differential privacy, and Byzantine-robust aggregation controls
  • Participant authentication and audit trail implementation
  • Module 5 and Module 9 AISDP evidence