
Information Disclosure — Attack Vectors (Memorisation, Membership Inference, Property Inference)

The model may leak sensitive information from its training data through its outputs. This risk manifests in three forms. Memorisation occurs when the model has memorised specific training examples and can be prompted to reproduce them; this is most acute for large language models, which can reproduce verbatim passages including personal information.

Membership inference allows an attacker to determine whether a specific individual’s data was included in the training set, violating that individual’s privacy even if the model does not reveal their specific data. Property inference enables an attacker to deduce aggregate properties of the training data (such as the proportion of a specific demographic group) that the organisation intended to keep confidential.
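To make the membership inference threat concrete, the sketch below shows the simplest form of the attack: a confidence-threshold decision rule. An overfit model tends to be more confident on records it trained on, so the attacker guesses "member" when the model's confidence on a record's true label is high. The confidence values here are synthetic and purely illustrative, not output from any real model.

```python
def membership_guess(confidence, threshold=0.9):
    """Attacker's decision rule: high confidence on a record's true
    label is taken as evidence the record was in the training set."""
    return confidence >= threshold

# Hypothetical per-record confidences for illustration only:
# members tend to receive higher confidence than non-members.
members = [0.97, 0.95, 0.99, 0.88, 0.96]
non_members = [0.72, 0.91, 0.65, 0.80, 0.55]

true_positives = sum(membership_guess(c) for c in members)
false_positives = sum(membership_guess(c) for c in non_members)
accuracy = (true_positives + (len(non_members) - false_positives)) / (
    len(members) + len(non_members))
# On these synthetic numbers the attack is right 80% of the time,
# well above the 50% a random guess would achieve.
```

Real attacks (shadow models, likelihood-ratio tests) are more sophisticated, but all exploit the same member/non-member behavioural gap.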

For high-risk AI systems processing personal data, information disclosure has direct GDPR implications. A model that leaks personal data from its training set may constitute an unauthorised disclosure under GDPR Article 5(1)(f). The threat assessment should evaluate which disclosure vectors are relevant to the specific system, considering the volume and sensitivity of personal data in the training set, the model architecture’s propensity for memorisation, and the access patterns of the system’s consumers.

Key outputs

  • Assessment of memorisation, membership inference, and property inference risks
  • Sensitivity analysis based on training data contents
  • GDPR impact assessment for identified disclosure risks
  • Module 4 and Module 9 AISDP documentation

Information Disclosure — Controls (Differential Privacy, Output Filtering, Membership Inference Testing)

Differential privacy techniques during training (OpenDP, TensorFlow Privacy, Opacus) limit the model’s memorisation of individual training records. Differentially private stochastic gradient descent (DP-SGD) clips per-example gradients and adds Gaussian noise during training, providing a mathematical guarantee parameterised by epsilon (ε). Lower epsilon provides stronger privacy at the cost of model utility; the chosen epsilon, the resulting accuracy trade-off, and the rationale are documented in AISDP Module 6.
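The mechanics of a DP-SGD step can be sketched in plain Python. This is a minimal illustration of the clip-then-noise pattern, not a substitute for Opacus or TensorFlow Privacy, which additionally handle per-microbatch vectorisation and track the cumulative (ε, δ) budget with a privacy accountant; the function name and parameters are this sketch's own.

```python
import math
import random

def dp_sgd_step(weights, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip each example's gradient to L2 norm
    clip_norm, sum the clipped gradients, add Gaussian noise with
    standard deviation noise_multiplier * clip_norm, then apply the
    averaged, noised update."""
    d = len(weights)
    summed = [0.0] * d
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # clip
        for i in range(d):
            summed[i] += grad[i] * scale
    n = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(summed[i] + rng.gauss(0.0, sigma)) / n for i in range(d)]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]
```

Clipping bounds any single record's influence on the update; the noise then masks that bounded contribution, which is what yields the ε guarantee.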

Output filtering (Microsoft Presidio, spaCy NER) provides a runtime defence for generative models. Before generated text is returned, a PII detection pipeline scans for personal names, addresses, phone numbers, email addresses, and national ID numbers. Detected PII is redacted or replaced with placeholder tokens. The Technical SME monitors the false positive rate to balance privacy protection against system utility.
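The redact-before-return flow can be sketched as below. This regex-only stand-in covers just two entity types; a production pipeline such as Presidio combines pattern recognizers with NER models and checksum validators, and the pattern set and placeholder format here are illustrative assumptions.

```python
import re

# Simplified pattern set standing in for a full PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d\b"),
}

def redact(text):
    """Replace each detected PII span with a placeholder token
    before the generated text is returned to the consumer."""
    for entity, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text

print(redact("Contact jane.doe@example.com or +44 20 7946 0958."))
# → Contact <EMAIL> or <PHONE>.
```

Broad patterns like the phone matcher above are exactly where false positives arise (e.g. matching invoice numbers), which is why the false positive rate needs monitoring.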

Membership inference testing (ML Privacy Meter) evaluates the model’s susceptibility by training an attack model on a shadow dataset and assessing its ability to distinguish training members from non-members. If the attack performs significantly better than random (a recommended starting threshold is an attack AUC-ROC of 0.55; results above this indicate the model is leaking membership information), further controls are required. Data minimisation in training reduces the volume of sensitive data available for the model to memorise.
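The pass/fail metric for this test can be computed as below: AUC-ROC is the probability that a randomly chosen member receives a higher attack score than a randomly chosen non-member (ties count half), so 0.5 means the attack is no better than guessing. This pairwise formulation is a minimal sketch; ML Privacy Meter computes the metric internally, and the function name here is this sketch's own.

```python
def attack_auc(member_scores, non_member_scores):
    """AUC-ROC of a membership inference attack, computed as the
    fraction of (member, non-member) pairs the attack ranks correctly."""
    pairs = 0
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            pairs += 1
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / pairs

# Illustrative scores: an attack that perfectly separates members
# from non-members scores AUC 1.0; compare the result to the
# acceptance threshold (e.g. require AUC below 0.55).
auc = attack_auc([0.9, 0.8], [0.1, 0.2])
```

For large score sets, a sort-based (rank-sum) computation avoids the O(n·m) pairwise loop, but the definition is the same.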

Key outputs

  • Differential privacy implementation with documented epsilon and accuracy trade-off
  • Output filtering pipeline (Presidio or equivalent) with false positive monitoring
  • Membership inference testing results against defined thresholds
  • Module 4 and Module 9 AISDP evidence