Rate Limiting

v2.4.0

Rate limiting on inference endpoints mitigates denial-of-service and model extraction attacks. A reference nginx configuration is provided with two layers: per-consumer rate limiting (keyed to the API key header) and global rate limiting across all consumers. Both layers allow short bursts whilst enforcing sustained rate ceilings.
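The two-layer behaviour described above can be illustrated with a minimal token-bucket sketch in Python. This is not the reference nginx configuration: nginx's `limit_req` module describes its own method as a leaky bucket, and the token bucket here is a related formulation chosen because it makes the burst and sustained-rate parameters explicit. The class names `TokenBucket` and `TwoLayerLimiter`, and all numeric values, are hypothetical.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Permits bursts of up to `burst` requests; refills at `rate` tokens per second."""

    def __init__(self, rate: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate          # sustained ceiling, requests per second
        self.burst = burst        # bucket capacity, i.e. the largest short burst
        self.tokens = burst       # start full so an idle consumer may burst
        self.updated = now

    def available(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        return self.tokens >= 1.0

    def consume(self) -> None:
        self.tokens -= 1.0


class TwoLayerLimiter:
    """A request is admitted only if both the per-consumer and global buckets allow it."""

    def __init__(self, consumer_rate: float, consumer_burst: float,
                 global_rate: float, global_burst: float) -> None:
        self._consumer_rate = consumer_rate
        self._consumer_burst = consumer_burst
        self._consumers: Dict[str, TokenBucket] = {}
        self._global = TokenBucket(global_rate, global_burst)

    def allow(self, api_key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        bucket = self._consumers.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self._consumer_rate, self._consumer_burst, now)
            self._consumers[api_key] = bucket
        # Check both layers before consuming, so a rejection costs no tokens.
        if bucket.available(now) and self._global.available(now):
            bucket.consume()
            self._global.consume()
            return True
        return False
```

In this sketch a consumer that exhausts its own burst is rejected without drawing down the global allowance, and a global rejection does not penalise the individual consumer. Passing `now` explicitly makes the behaviour deterministic for testing; in normal use it defaults to the monotonic clock.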

The rate limits should be calibrated to the system’s legitimate usage patterns. The per-consumer limit should accommodate the consumer’s expected peak request rate with a reasonable margin; the global limit should accommodate the expected total request rate across all consumers. Anti-extraction measures extend beyond simple rate limiting: application-level middleware can track unique input patterns per consumer per hour, flagging consumers who submit systematically varied inputs that suggest extraction behaviour.
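The application-level tracking of unique input patterns per consumer per hour might look like the following Python sketch. It is an illustration under stated assumptions, not the project's middleware: the class name `ExtractionMonitor`, the whitespace and case normalisation, and the threshold of 1000 unique inputs per hour are all hypothetical, and a simple unique-count threshold is only one possible signal of systematic variation.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Set, Tuple


class ExtractionMonitor:
    """Counts distinct inputs per consumer per clock hour and flags heavy variation."""

    def __init__(self, max_unique_per_hour: int = 1000) -> None:
        # Hypothetical threshold; calibrate against legitimate usage patterns.
        self.max_unique_per_hour = max_unique_per_hour
        self._seen: Dict[Tuple[str, int], Set[str]] = defaultdict(set)

    @staticmethod
    def _fingerprint(payload: str) -> str:
        # Normalise case and whitespace so trivial reformatting is not counted
        # as a new input, then hash so raw inputs are not retained in memory.
        normalised = " ".join(payload.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def record(self, api_key: str, payload: str, timestamp: float) -> bool:
        """Record one request; return True if the consumer should be flagged."""
        hour = int(timestamp // 3600)
        seen = self._seen[(api_key, hour)]
        seen.add(self._fingerprint(payload))
        return len(seen) > self.max_unique_per_hour

    def prune(self, timestamp: float) -> None:
        """Drop counters for hours before the one containing `timestamp`."""
        current = int(timestamp // 3600)
        for key in [k for k in self._seen if k[1] < current]:
            del self._seen[key]
```

A flag from `record` is a prompt for review rather than proof of extraction: a legitimate batch workload can also produce many distinct inputs. Calling `prune` periodically keeps memory bounded.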

Rate limit configuration is version-controlled as infrastructure-as-code and documented in Module 9. The limits are reviewed periodically and adjusted as usage patterns evolve. Rate limit enforcement is tested as part of the denial-of-service testing described above.

Key outputs

Per-consumer and global rate limiting on inference endpoints
Anti-extraction monitoring for systematic input variation
Version-controlled rate limit configuration
Module 9 AISDP documentation