Level 1: Technical Monitoring

v2.4.0 | Report Errata

docs operations docs operations

Level 1: Technical Monitoring — Personnel & Function Level 1 of the oversight pyramid comprises ML engineers, platform engineers, and site reliability engineers responsible for continuous automated monitoring of the system’s technical health. This level provides real-time visibility into inference latency, error rates, throughput, and infrastructure utilisation through dashboards and automated alerting. The engineering team must have the ability to diagnose and remediate technical failures, access to the system’s logging infrastructure for root cause analysis, and the capacity to execute emergency actions when metric threshold breaches indicate service degradation. Level 1 is the first line of detection for technical failures that may have compliance implications. Level 1 monitoring operates continuously, including outside business hours, through on-call rotation with defined response SLAs. The monitoring infrastructure is independent of the AI system it monitors, ensuring that a failure in the AI system does not disable the monitoring capability. Key outputs

Continuous automated monitoring by engineering team
Real-time dashboards and automated alerting
On-call rotation with defined SLAs
Module 7 and Module 12 AISDP documentation

Level 1: Authority — Emergency Rollback Without Prior Approval Level 1 personnel have authority to execute emergency rollbacks without prior approval from the AI Governance Lead, with immediate post-hoc notification. This authority exists because requiring senior approval before addressing an active technical failure introduces delay that may increase harm or extend downtime. The scope of this authority is limited to technical remediation actions: rolling back to a previous model version, reverting a configuration change, scaling infrastructure, or activating a fallback service. Actions that alter the system’s intended purpose, change its decision logic, or modify compliance-relevant parameters require AI Governance Lead approval. Every emergency action is logged with the identity of the person who acted, the timestamp, the action taken, the rationale, and the post-hoc notification to the AI Governance Lead. This log is retained as Module 7 evidence and reviewed at the quarterly oversight meeting. Key outputs

Emergency rollback authority without prior approval
Scope limited to technical remediation
Post-hoc notification to AI Governance Lead
Action logging retained as Module 7 evidence

Level 1: Escalation Triggers Level 1 escalation triggers include infrastructure failures affecting system availability, performance degradation beyond defined SLA thresholds, security alerts from runtime monitoring, and anomalous patterns in system logs. These triggers are defined in the PMM plan and implemented as automated alert rules. When a trigger fires, the Level 1 team follows the triage process to determine whether the issue is operational or model-related. Operational issues within Level 1’s remediation authority are addressed directly. Issues that may have compliance implications, indicate model degradation, or suggest a potential serious incident are escalated to Level 2 (operators) and Level 4 (compliance) simultaneously. The escalation triggers are reviewed quarterly at the oversight meeting. Triggers that generate excessive false positives are recalibrated; triggers that fail to detect genuine issues are tightened. Key outputs

Four categories of escalation trigger (infrastructure, performance, security, anomalies)
Automated alert implementation
Dual escalation to Level 2 and Level 4 for compliance-relevant issues
Quarterly trigger review and recalibration