Prompt Injection — Attack Vectors (Direct, Indirect, Multi-Turn)
Prompt injection is the most widely discussed threat for systems incorporating large language models. Attackers craft inputs that cause the model to deviate from its intended behaviour, ignore its instructions, or execute actions beyond its authorised scope. The attack manifests in three primary vectors.
Direct prompt injection occurs when the attacker provides malicious input directly through the system’s user interface or API. The input is designed to override the system prompt’s instructions, causing the model to produce outputs outside its intended scope. Indirect prompt injection is more insidious: the attacker plants malicious content in data sources that the model consults during retrieval-augmented generation. When the model retrieves the poisoned content, it follows the embedded instructions rather than the system prompt. Multi-turn injection exploits conversational context to gradually erode the system prompt’s constraints over successive interactions.
For high-risk AI systems, prompt injection is a compliance risk: an injected prompt that causes the system to produce outputs inconsistent with its declared intended purpose effectively changes the system’s behaviour without any authorised modification. The threat model must document the injection vectors relevant to the specific system, assess their likelihood and impact, and map them to the controls described above.
Key outputs
- Assessment of direct, indirect, and multi-turn injection vectors
- Likelihood and impact scoring per vector
- Documentation of system-specific injection risk factors
- Module 9 AISDP evidence
Prompt Injection — Controls (Sanitisation, Validation, Privilege Separation, Anchoring, Monitoring)
Five control categories address the prompt injection vectors described above. Input sanitisation filters or escapes known injection patterns before they reach the model. This includes stripping or encoding control characters, detecting known jailbreak patterns, and validating input structure against expected formats. Sanitisation is a necessary but insufficient defence; novel injection techniques will bypass pattern-based filters.
Output validation verifies that the model’s response falls within the expected output space. If the system is designed to produce structured classification outputs, any response that deviates from the expected format is flagged and blocked. Privilege separation ensures that the LLM component cannot access resources or execute actions beyond its documented scope; even if injection succeeds, the damage is limited. Instruction anchoring uses system prompt design techniques (clear delimiters, repeated instructions, explicit refusal patterns) to make the prompt more resistant to override.
Input-output monitoring provides detection rather than prevention: it flags anomalous patterns that may indicate injection attempts, enabling investigation and response even when other controls do not prevent the injection. The combination of these five layers provides defence in depth. Module 9 captures the specific controls deployed, the testing performed (including adversarial prompt injection testing), and the residual risk.
Key outputs
- Five-layer control implementation (sanitisation, validation, privilege separation, anchoring, monitoring)
- Adversarial prompt injection test results
- Residual risk documentation
- Module 9 AISDP evidence