Resource Utilisation & Capacity

The engineering team tracks CPU, GPU, memory, and storage utilisation against capacity limits. AI inference workloads can exhibit sudden load spikes driven by batch processing cycles, deployer activity patterns, or seasonal demand. When utilisation approaches capacity limits, the system may queue requests, increase latency, or drop requests entirely. Capacity monitoring provides sufficient lead time (weeks, not hours) for infrastructure scaling decisions.

The PMM plan defines warning thresholds at 70–80% of capacity and critical thresholds at 90% or above, with documented response procedures for each; a minimal threshold check is sketched after the key outputs list. Response procedures range from automated scaling (for cloud-native deployments) to manual capacity planning requests (for on-premises infrastructure).

Storage utilisation monitoring is particularly relevant for systems that accumulate monitoring data, inference logs, and evidence artefacts. The ten-year retention requirement (Article 18) means storage demand grows continuously and must be planned for; a simple growth projection is also sketched below.

Key outputs
- CPU, GPU, memory, and storage utilisation monitoring
- Warning (70–80%) and critical (90%+) thresholds
- Capacity scaling lead time of weeks
- Storage growth planning for ten-year retention
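To make the threshold bands concrete, here is a minimal sketch of how a resource's utilisation could be classified against the warning (from 70%) and critical (from 90%) levels described above. The `Severity`, `Thresholds`, and `classify_utilisation` names are illustrative assumptions, not identifiers defined in the PMM plan.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class Thresholds:
    warning: float = 0.70   # warning band starts at 70% of capacity
    critical: float = 0.90  # critical at 90% or above


def classify_utilisation(used: float, capacity: float,
                         t: Thresholds = Thresholds()) -> Severity:
    """Classify utilisation against the warning/critical bands."""
    ratio = used / capacity
    if ratio >= t.critical:
        return Severity.CRITICAL
    if ratio >= t.warning:
        return Severity.WARNING
    return Severity.OK


# Example: a GPU pool using 31 TB of 40 TB memory (77.5%) sits in the warning
# band, leaving weeks of lead time to raise a scaling request before the
# 90% critical mark is reached.
print(classify_utilisation(31, 40))  # Severity.WARNING
```

In practice the WARNING branch would feed the documented response procedure (automated scaling or a capacity planning request), while CRITICAL would trigger escalation.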
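Storage growth under the ten-year retention requirement can be projected with simple arithmetic. The sketch below assumes linear accumulation and a hypothetical daily growth figure, purely to illustrate the planning calculation; actual growth rates depend on the system's logging and evidence volumes.

```python
def projected_storage_tb(daily_growth_gb: float, retention_years: int = 10) -> float:
    """Project storage held on disk if artefacts accumulate linearly and
    nothing is deleted before the retention period ends."""
    days = retention_years * 365
    return daily_growth_gb * days / 1024  # GB -> TB


# Example: 5 GB/day of monitoring data, inference logs, and evidence artefacts
# implies roughly 17.8 TB retained at the end of the ten-year window.
print(round(projected_storage_tb(5.0), 1))  # 17.8
```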