v2.4.0

Model DoS — Attack Vectors & Controls (Rate Limiting, Timeouts, Cost Caps)

An attacker submits inputs designed to consume excessive computational resources, degrading or denying service to legitimate users. AI systems are particularly vulnerable because individual inference requests can be computationally expensive; a large transformer model may require seconds of GPU time per request. A volumetric attack that would be trivially absorbed by a web server can exhaust an inference service.

Three control layers address this threat. Rate limiting on inference endpoints enforces a maximum request rate per client, identified by API key, IP address, or authenticated identity. Kong, NGINX, and cloud API gateways all support configurable rate limiting. The rate limit should accommodate legitimate peak usage with a margin; excess requests receive an HTTP 429 response. For neural networks, input complexity analysis can detect and reject inputs designed to trigger pathological computation paths.
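The per-client rate limiting described above can be sketched as a token bucket keyed by client identity. This is a minimal illustration, not a substitute for gateway-level limiting (Kong, NGINX, cloud API gateways); the `rate` and `burst` parameters are hypothetical values that would be tuned to legitimate peak usage:

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Token bucket: `rate` tokens refill per second, up to `capacity` burst."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    last: float = field(init=False)

    def __post_init__(self):
        self.tokens = self.capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with HTTP 429


# One bucket per client identity (API key, IP address, or authenticated user).
_buckets: dict[str, TokenBucket] = {}


def check_rate_limit(client_id: str, rate: float = 5.0, burst: float = 10.0) -> bool:
    bucket = _buckets.setdefault(client_id, TokenBucket(rate, burst))
    return bucket.allow()
```

A bucket with capacity 2 admits two back-to-back requests and rejects the third until tokens refill, which is the desired behavior against volumetric bursts.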

Inference timeout enforcement sets a maximum execution time per request, terminating any request that exceeds it. The timeout should be set above the p99 latency for legitimate requests and below the threshold where a single request materially impacts other users. Autoscaling with cost caps provides the third layer: the system scales up to handle increased load but will not exceed a defined cost ceiling, preventing sustained attacks from generating unbounded cloud bills. Module 9 records the rate limiting configuration, timeout thresholds, and autoscaling boundaries. Module 5 states the system’s expected throughput under normal and adversarial load conditions.
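A per-request timeout of the kind described above can be sketched with a worker pool and a wall-clock budget. This is a simplified illustration: the worker thread is not forcibly killed here, and a production deployment would instead cancel at the serving layer (for example via gRPC deadlines or the model server's own timeout setting). The `timeout_s` value is an assumed placeholder for a threshold set above p99 latency:

```python
import concurrent.futures


def run_with_timeout(fn, timeout_s: float, *args):
    """Run one inference call with a wall-clock budget; raise if it exceeds it.

    NOTE: this abandons, rather than kills, the worker thread on timeout;
    real enforcement belongs in the serving infrastructure.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            future.cancel()
            raise TimeoutError(f"inference exceeded {timeout_s}s budget")
```

Requests that finish within the budget return normally; a request that overruns it surfaces as a `TimeoutError` the endpoint can translate into an error response.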

Key outputs

  • Rate limiting configuration per client identity type
  • Inference timeout enforcement (above p99, below impact threshold)
  • Autoscaling with cost caps
  • Module 9 and Module 5 AISDP documentation
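The cost-cap layer listed above can be sketched as a scaling decision that is clamped by remaining budget. All parameters here (`cost_per_replica_hour`, `daily_cost_cap`) are hypothetical; a real deployment would read spend from the cloud provider's billing or budget APIs:

```python
from dataclasses import dataclass


@dataclass
class CostCappedScaler:
    """Scale toward demand, but never beyond what the remaining budget allows."""
    cost_per_replica_hour: float  # assumed per-replica hourly cost
    daily_cost_cap: float         # assumed spend ceiling for the day
    replicas: int = 1

    def max_affordable_replicas(self, hours_remaining: float, spent_today: float) -> int:
        budget_left = self.daily_cost_cap - spent_today
        if budget_left <= 0 or hours_remaining <= 0:
            # Budget exhausted: freeze at the current replica count.
            return self.replicas
        return max(1, int(budget_left / (self.cost_per_replica_hour * hours_remaining)))

    def scale_to(self, desired: int, hours_remaining: float, spent_today: float) -> int:
        self.replicas = min(desired, self.max_affordable_replicas(hours_remaining, spent_today))
        return self.replicas
```

Under a sustained attack, demand-driven `desired` grows without bound, but the replica count is capped by what the remaining daily budget can fund, so the attack cannot generate an unbounded cloud bill.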