Accuracy Metrics

The Technical SME computes the system’s core accuracy metrics continuously on production data. The metric set includes AUC-ROC (discrimination ability across all thresholds), F1 score (harmonic mean of precision and recall), precision (proportion of positive predictions that are correct), recall (proportion of actual positives correctly identified), Brier score (calibration accuracy for probabilistic predictions), and calibration error (agreement between predicted probabilities and observed frequencies). These metrics are computed against ground truth labels where available.

The specific metrics reported depend on the system’s task: classification systems report the full set; ranking systems may substitute NDCG or MAP; regression systems report RMSE, MAE, and R-squared. The AISDP declares the primary metrics and the minimum acceptable thresholds; PMM monitors compliance with these declarations. All accuracy metrics are computed on production data using the same methodology as the validation gate, ensuring comparability between pre-deployment and post-deployment performance.

Key outputs
- Core accuracy metric set computed continuously on production data
- Metric selection aligned with system task type
- Comparison against AISDP-declared thresholds
- Methodology consistent with validation gate
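The metric set above can be sketched as a single computation step. This is an illustrative sketch, not the system's actual pipeline: the function names, the 0.5 decision threshold, and the ten-bin calibration error are assumptions; the standard metrics come from scikit-learn, and the calibration error is a simple expected calibration error (ECE) over equal-width bins.

```python
# Illustrative sketch of the core accuracy metric set for a binary
# classifier. Assumes calibrated probability scores `y_prob` and binary
# ground truth `y_true`; threshold and bin count are illustrative.
import numpy as np
from sklearn.metrics import (
    roc_auc_score, f1_score, precision_score, recall_score, brier_score_loss,
)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Mean |predicted probability - observed frequency|, weighted by bin size."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(y_true)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.any():
            ece += mask.sum() / n * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

def accuracy_metrics(y_true, y_prob, threshold=0.5):
    """Compute the declared metric set on one batch of production data."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "auc_roc": roc_auc_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "brier": brier_score_loss(y_true, y_prob),
        "calibration_error": expected_calibration_error(y_true, y_prob),
    }
```

Running the same function at the validation gate and in production is what makes the pre- and post-deployment numbers directly comparable.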
Ground Truth Handling

In many deployment contexts, ground truth labels are not immediately available. A credit scoring system’s true outcome is not known until the borrower repays or defaults, potentially years later. A recruitment screening system’s true outcome (the quality of the hired candidate’s performance) may not be known for months.

Where ground truth is available, accuracy metrics are computed directly. Where ground truth is delayed, the Technical SME defines proxy metrics and leading indicators that provide early warning without waiting for labels to arrive. NannyML’s CBPE (Confidence-based Performance Estimation) method estimates accuracy from the confidence score distribution; Evidently AI computes drift metrics that correlate with performance degradation.

The PMM plan documents the expected ground truth delay for each metric, the proxy metrics used during the delay period, and the process for recomputing metrics once ground truth arrives. The computation pipeline handles late-arriving labels and recomputes the affected metrics, ensuring the historical record is updated.

Key outputs
- Ground truth delay documented per metric
- Proxy metrics and leading indicators for delayed-truth systems
- CBPE estimation for accuracy without ground truth
- Recomputation on late-arriving labels
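The core idea behind CBPE can be illustrated without the NannyML library itself. The sketch below is a simplified illustration of the principle, not NannyML's API: if the model's scores are well calibrated, a score of p means the prediction is correct with probability p when positive and 1 - p when negative, so expected accuracy and precision can be estimated from scores alone while ground truth is still pending. All function names here are assumptions for illustration.

```python
# Simplified CBPE-style estimation: accuracy and precision estimated
# purely from calibrated confidence scores, before labels arrive.
# Estimates are only as good as the calibration assumption.
import numpy as np

def estimated_accuracy(y_prob, threshold=0.5):
    """P(prediction correct) is p for predicted positives, 1 - p otherwise;
    the mean over the batch estimates accuracy without labels."""
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.where(y_prob >= threshold, y_prob, 1.0 - y_prob).mean())

def estimated_precision(y_prob, threshold=0.5):
    """Expected true positives among predicted positives, divided by
    the number of predicted positives."""
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob >= threshold
    return float(y_prob[pos].sum() / pos.sum()) if pos.any() else float("nan")
```

Once the delayed labels arrive, these estimates are replaced by directly computed metrics for the affected period, and the gap between estimate and realized value is itself a useful signal about calibration quality.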
Disaggregated Performance: Subgroup-Specific Degradation

Aggregate performance metrics can mask subgroup-specific degradation. A system whose aggregate accuracy remains stable while its accuracy for a specific subgroup has degraded is experiencing a compliance-relevant change that aggregate monitoring would miss.

The Technical SME computes all performance metrics across protected characteristic subgroups, where data is available and lawful to process under Article 10(5). The same disaggregation structure used during pre-deployment fairness testing is applied in production. Where cell sizes for certain subgroups are too small for statistically meaningful computation, the metric is flagged as inconclusive rather than omitted.

Subgroup-specific degradation may indicate that the data distribution for that subgroup has shifted, that the model’s decision boundary is poorly calibrated for that population, or that an upstream data quality issue disproportionately affects certain groups. Each of these root causes requires a different remediation approach.

Key outputs
- Per-subgroup performance metric computation
- Consistent disaggregation structure with pre-deployment testing
- Inconclusive flagging for insufficient cell sizes
- Root cause differentiation for subgroup-specific degradation
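The disaggregation-with-inconclusive-flagging rule above can be sketched as follows. This is a minimal illustration, not the production implementation: the function name, the record layout, and the cell-size floor of 30 are assumptions (in practice the minimum would come from the PMM plan's power analysis).

```python
# Illustrative sketch: compute a metric per subgroup, flagging cells
# that are too small as "inconclusive" rather than dropping them,
# so reviewers can see *where* the evidence is missing.
import numpy as np

MIN_CELL_SIZE = 30  # assumption; set from the PMM plan in practice

def disaggregated_metric(y_true, y_pred, groups, metric_fn, min_n=MIN_CELL_SIZE):
    """Apply metric_fn within each subgroup; small cells are kept but flagged."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        n = int(mask.sum())
        if n < min_n:
            results[g] = {"n": n, "value": None, "status": "inconclusive"}
        else:
            results[g] = {"n": n,
                          "value": metric_fn(y_true[mask], y_pred[mask]),
                          "status": "ok"}
    return results
```

Keeping the inconclusive cells visible, rather than silently omitting them, matters: a subgroup that is too small to measure is a data-coverage finding in its own right.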
Temporal Stability & Trend Analysis

The Technical SME tracks performance metrics over time with trend analysis. A slow, consistent decline that does not breach the threshold on any single measurement may still represent a significant cumulative degradation over months. A system that loses 0.5% accuracy per month would take ten months to breach a 5% degradation threshold, but after ten months the degradation is substantial.

Trend analysis uses rolling averages, linear regression on metric time series, and change-point detection algorithms. A statistically significant downward trend, even where no individual measurement breaches a threshold, should generate a warning alert for investigation. Seasonal patterns (predictable fluctuations linked to business cycles, academic calendars, or other periodic factors) are documented in the PMM plan and excluded from trend analysis.

Temporal stability monitoring also detects sudden performance shifts that might indicate a data pipeline failure, a deployment error, or an adversarial event. A sudden drop in accuracy followed by a return to normal warrants investigation even if the recovery was spontaneous.

Key outputs
- Rolling averages, trend regression, and change-point detection
- Slow degradation detection below individual threshold breach
- Seasonal pattern documentation and exclusion
- Sudden shift detection and investigation
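The three techniques named above can each be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names, the window of 4, the 0.05 significance level, and the single-split change-point (one step of binary segmentation, standing in for a full change-point algorithm) are all choices made for the example, not the system's actual configuration.

```python
# Illustrative trend-analysis toolkit: rolling average, significance-tested
# trend slope, and a single-split change-point for sudden shifts.
import numpy as np
from scipy.stats import linregress

def rolling_mean(series, window=4):
    """Smooth a metric time series to suppress measurement noise."""
    s = np.asarray(series, dtype=float)
    return np.convolve(s, np.ones(window) / window, mode="valid")

def downward_trend(series, alpha=0.05):
    """Fit metric ~ time; warn on a significant negative slope even when
    no individual measurement breaches its threshold."""
    s = np.asarray(series, dtype=float)
    fit = linregress(np.arange(len(s)), s)
    return {"slope": fit.slope, "p_value": fit.pvalue,
            "warning": fit.slope < 0 and fit.pvalue < alpha}

def change_point(series):
    """Index where the mean difference between the two segments is largest
    (one step of binary segmentation) -- flags sudden level shifts."""
    s = np.asarray(series, dtype=float)
    best_k, best_gap = None, 0.0
    for k in range(2, len(s) - 1):
        gap = abs(s[:k].mean() - s[k:].mean())
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

On the 0.5%-per-month example from the text, `downward_trend` would flag the decline within a few measurements, long before the cumulative 5% threshold breach at month ten; `change_point` targets the opposite failure mode, a sudden level shift such as a pipeline break.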