Inference Latency

Latency monitoring tracks mean response times and tail latencies (95th and 99th percentiles). High-percentile latency spikes can cause timeouts in downstream systems that depend on the AI system’s output. When latency exceeds the timeout threshold, the downstream system may receive no output or a default fallback value; if the deployer’s process assumes a valid AI output is always available, these timeout-driven fallbacks become silent failures.

The Technical SME defines latency thresholds in the PMM plan, calibrated to the deployer’s integration architecture, and monitors them separately for each inference endpoint. Latency degradation may indicate resource contention, model complexity growth (after a retrained model is deployed), or infrastructure issues. Latency metrics should be correlated with throughput metrics: latency degradation under increasing load is expected, but latency degradation at constant load suggests an infrastructure or model problem.

Key outputs
- Mean, P95, and P99 latency monitoring per endpoint
- Thresholds calibrated to deployer integration architecture
- Timeout-driven silent failure detection
- Latency-throughput correlation analysis
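The monitoring described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the endpoint name, threshold values, and function names are hypothetical, and real thresholds would come from the PMM plan as calibrated by the Technical SME.

```python
import statistics

# Hypothetical per-endpoint thresholds in milliseconds; actual values are
# defined in the PMM plan against the deployer's integration architecture.
THRESHOLDS_MS = {"/v1/score": {"p95": 250.0, "p99": 500.0}}

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

def summarize(latencies_by_endpoint):
    """Mean/P95/P99 per endpoint, flagging any threshold breaches."""
    report = {}
    for endpoint, samples in latencies_by_endpoint.items():
        stats = {
            "mean": statistics.fmean(samples),
            "p95": percentile(samples, 95),
            "p99": percentile(samples, 99),
        }
        limits = THRESHOLDS_MS.get(endpoint, {})
        stats["breaches"] = [
            k for k in ("p95", "p99") if k in limits and stats[k] > limits[k]
        ]
        report[endpoint] = stats
    return report

def same_load_latency_drift(prev, curr, load_tolerance=0.10):
    """Latency-throughput correlation check.

    prev/curr: (request_count, mean_latency_ms) for two equal-length windows.
    Rising latency while request volume is roughly constant points at an
    infrastructure or model problem rather than ordinary load growth.
    """
    n_prev, lat_prev = prev
    n_curr, lat_curr = curr
    comparable_load = abs(n_curr - n_prev) <= load_tolerance * n_prev
    return comparable_load and lat_curr > lat_prev
```

A scheduler would feed `summarize` a rolling window of per-endpoint latency samples and raise an alert on any non-empty `breaches` list, while `same_load_latency_drift` compares adjacent windows to separate expected load-driven degradation from degradation at constant load.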