v2.4.0

Data Poisoning — Attack Vectors (Targeted, Untargeted, Label Flipping, Backdoor)

An attacker who can manipulate the training data can introduce biases, backdoors, or performance degradation into the trained model. Data poisoning manifests in several forms. Targeted poisoning affects the model’s behaviour for specific inputs (for example, causing a specific individual’s application to always be approved) whilst leaving general performance intact. Untargeted poisoning degrades overall model performance, making the system unreliable.

Label flipping changes the correct labels on training examples, teaching the model incorrect associations. Backdoor insertion embeds a hidden trigger in the training data; the model performs normally on clean inputs but produces attacker-controlled outputs when the trigger is present. For RAG-based systems, adversarial document injection into the knowledge base is a form of poisoning that can influence the model’s outputs without modifying the model itself.
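Label flipping is easy to illustrate concretely. The sketch below simulates the attack on a binary classification dataset: a random fraction of training labels is reassigned to a different class, so the model learns from records whose labels no longer match reality. The function name and parameters are illustrative, not from any particular tool.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_labels(y, fraction, n_classes, rng):
    """Simulate a label-flipping attack: reassign a random
    fraction of labels to a different (incorrect) class."""
    y = y.copy()
    n_flip = int(len(y) * fraction)
    idx = rng.choice(len(y), size=n_flip, replace=False)
    # Shift each chosen label by 1..n_classes-1, guaranteeing a new class
    y[idx] = (y[idx] + rng.integers(1, n_classes, size=n_flip)) % n_classes
    return y, idx

y_clean = rng.integers(0, 2, size=1000)
y_poisoned, flipped = flip_labels(y_clean, fraction=0.05, n_classes=2, rng=rng)
print((y_clean != y_poisoned).sum())  # 50 records now carry wrong labels
```

Even at 5% corruption the poisoned dataset is statistically almost indistinguishable from the clean one, which is why label flipping is hard to catch with distribution-level checks alone.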

The threat assessment should evaluate which poisoning vectors are relevant to the specific system. Systems that retrain on production data (where outputs are labelled by deployers or affected persons) are more vulnerable to label flipping than systems trained on curated, internally labelled datasets. Systems using RAG are vulnerable to document injection. The assessment informs the controls described in the following section.
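The mapping from data ingestion architecture to per-vector likelihood can be expressed as a simple decision table. The sketch below is a hypothetical scoring helper, not part of any assessment framework; the three boolean inputs and the low/medium/high scale are assumptions chosen to match the architectural distinctions described above.

```python
VECTORS = ["targeted", "untargeted", "label_flipping",
           "backdoor", "document_injection"]

def score_vectors(retrains_on_production_data,
                  labels_from_affected_persons,
                  uses_rag):
    """Coarse low/medium/high likelihood per poisoning vector,
    keyed on the system's data ingestion architecture (illustrative)."""
    scores = {v: "low" for v in VECTORS}
    if retrains_on_production_data:
        # Production feedback loops expose the pipeline to broad corruption
        scores["untargeted"] = "medium"
        scores["backdoor"] = "medium"
    if labels_from_affected_persons:
        # Affected persons have an incentive to mislabel their own outcomes
        scores["label_flipping"] = "high"
        scores["targeted"] = "medium"
    if uses_rag:
        # Knowledge-base writes bypass model training entirely
        scores["document_injection"] = "high"
    return scores

print(score_vectors(retrains_on_production_data=True,
                    labels_from_affected_persons=True,
                    uses_rag=True))
```

A real assessment would weigh more architectural traits than three booleans, but the output shape — one likelihood per vector — is what feeds into the overall threat model.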

Key outputs

  • Assessment of relevant poisoning vectors (targeted, untargeted, label flipping, backdoor, document injection)
  • Likelihood scoring based on the system’s data ingestion architecture
  • Integration with the overall threat model
  • Module 4 and Module 9 AISDP documentation

Data Poisoning — Controls (Provenance Tracking, Anomaly Detection, Sentinel Testing, Access Controls)

Four control layers address the data poisoning vectors described above. Data provenance tracking, using the data lineage infrastructure described in Section 4 and the version control described in Section 6, enables detection of unauthorised data modifications. Every modification to training data is logged with the modifier’s identity, timestamp, description, and rationale. DVC and Delta Lake provide version-controlled data storage where every change is attributed.
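The required log fields — modifier, timestamp, description, rationale — can be captured alongside a content hash so that any unattributed change to the data is detectable. The sketch below is illustrative only: in practice DVC or Delta Lake commits carry this metadata, and the function, field names, and JSON Lines log format here are assumptions.

```python
import datetime
import hashlib
import json
from pathlib import Path

def log_data_modification(data_path, modifier, description, rationale,
                          log_path="provenance_log.jsonl"):
    """Append an attributed modification record for a training-data file.
    The SHA-256 digest pins the exact post-modification content, so a
    later hash mismatch reveals an unlogged (unauthorised) change."""
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "modifier": modifier,
        "description": description,
        "rationale": rationale,
        "sha256": digest,
    }
    with open(log_path, "a") as f:  # append-only log
        f.write(json.dumps(entry) + "\n")
    return entry
```

Verification is the inverse operation: recompute the hash of the current data file and confirm it matches the most recent log entry before training begins.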

Statistical anomaly detection on the data pipeline (Great Expectations, Evidently AI) identifies suspicious records or distributional shifts before training. Isolation forests and distributional tests flag unusual data points. The challenge is that sophisticated poisoning attacks may modify only a small fraction of records, keeping the overall distribution within normal bounds. For high-risk systems, periodic manual review of random training record samples provides a complementary human verification layer.
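A minimal isolation-forest check over a feature matrix might look like the following, using scikit-learn rather than the pipeline tools named above; the data here is synthetic (a small cluster of injected records far from the clean distribution) and the contamination value is an assumed tuning choice.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 990 clean records near the origin, 10 injected records far away
clean = rng.normal(0.0, 1.0, size=(990, 4))
injected = rng.normal(8.0, 1.0, size=(10, 4))
X = np.vstack([clean, injected])

# contamination = expected fraction of suspicious records to flag
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(X)      # -1 = flagged, 1 = normal
flagged = np.where(labels == -1)[0]   # indices routed to manual review
print(flagged)
```

Note how this illustrates the limitation stated above: these injected records are flagged only because they sit far outside the clean distribution. A careful attacker who keeps poisoned records inside normal bounds would pass this check, which is why manual sampling remains a complementary layer for high-risk systems.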

Sentinel input testing after each retraining cycle checks for unexpected changes in outputs for known inputs, detecting whether the model’s behaviour has been altered by poisoned data. Access controls on training data repositories restrict modification to named individuals with documented business needs, with every access event logged in an immutable audit trail. The training data integrity controls belong jointly in Module 4 and Module 9.
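Sentinel testing amounts to a regression suite over fixed inputs: record reference outputs at the last approved release, then diff the retrained model's outputs against them. The helper below is a hypothetical sketch (not a specific tool's API), shown for a scalar-output model with a configurable drift tolerance.

```python
def run_sentinel_suite(model_predict, sentinels, tolerance=0.0):
    """Compare a retrained model's outputs on fixed sentinel inputs
    against reference outputs recorded at the last approved release.
    Returns the list of sentinels whose outputs drifted beyond tolerance."""
    failures = []
    for x, expected in sentinels.items():
        actual = model_predict(x)
        if abs(actual - expected) > tolerance:
            failures.append({"input": x, "expected": expected, "actual": actual})
    return failures

# Usage: a stand-in "retrained model" whose behaviour changed for one sentinel
reference = {1.0: 0.2, 2.0: 0.9, 3.0: 0.1}
retrained = lambda x: 0.95 if x == 3.0 else reference[x]

drift = run_sentinel_suite(retrained, reference, tolerance=0.05)
print(drift)  # [{'input': 3.0, 'expected': 0.1, 'actual': 0.95}]
```

A non-empty failure list after retraining is exactly the signal the control is designed to produce: the model's behaviour on known inputs changed, so the new training data warrants investigation before deployment.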

Key outputs

  • Data provenance tracking with immutable modification logs
  • Statistical anomaly detection on the data pipeline
  • Sentinel input testing after each retraining cycle
  • Access controls with audit logging on training data repositories