v2.4.0

Hallucination Detection

For generative AI systems that produce factual claims, hallucination monitoring compares generated claims against source documents. Three approaches are common. Entailment scoring: an NLI model checks whether the source actually supports the generated claim. Citation verification: the system checks whether generated citations exist and contain the claimed information. Consistency checking: the system flags cases where the same query produces contradictory answers on different occasions. For RAG systems, RAGAS provides automated evaluation of faithfulness (whether the answer follows from retrieved documents), answer relevance, and context relevance. TruLens offers a similar framework with customisable feedback functions. For non-RAG systems, NLI-based detection relies on comparing outputs against a reference corpus. These detectors are imperfect: they miss subtle hallucinations and occasionally flag correct statements. Monitoring therefore combines automated detection with periodic human evaluation: a random sample of outputs is reviewed by domain experts, who rate factual accuracy, relevance, and safety. The Technical SME tracks hallucination rates as a PMM metric with defined thresholds.

Key outputs

  • Three automated detection approaches (entailment, citation, consistency)
  • RAGAS and TruLens for RAG system evaluation
  • Combined automated and human evaluation
  • Hallucination rate tracked as PMM metric with thresholds
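The entailment-scoring approach above can be sketched as follows. This is a minimal illustration, not RAGAS or TruLens code: `entailment_fn` stands in for any NLI scorer (for example, a cross-encoder returning the probability that the premise entails the hypothesis), and the 0.8 threshold is a hypothetical starting point to be calibrated against human review.

```python
from typing import Callable

def verify_claims(
    claims: list[str],
    source: str,
    entailment_fn: Callable[[str, str], float],  # hypothetical NLI scorer: (premise, hypothesis) -> P(entailment)
    threshold: float = 0.8,  # illustrative cut-off, to be calibrated against human evaluation
) -> dict:
    """Flag generated claims that the source does not entail above the threshold."""
    unsupported = [c for c in claims if entailment_fn(source, c) < threshold]
    return {
        "total": len(claims),
        "unsupported": unsupported,
        "hallucination_rate": len(unsupported) / len(claims) if claims else 0.0,
    }
```

The per-batch hallucination rate this returns is the quantity the Technical SME would track against the PMM thresholds.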

Safety Monitoring

For systems with safety constraints (content policies, behavioural boundaries, use-case restrictions), monitoring tracks policy violation rates over time. A rising violation rate may indicate that the model's safety alignment has degraded, that users have discovered bypass techniques, or that the system is encountering input patterns it was not designed to handle. Lakera Guard scans model inputs and outputs for prompt injection attempts, PII leakage, toxic content, and other safety violations. NVIDIA NeMo Guardrails enforces conversational guardrails (topic boundaries, response format constraints, safety filters) at the application layer. Llama Guard provides a safety classifier applicable to model outputs. Safety monitoring runs on every output in production, with violations logged, counted, and reported in the PMM report. Prompt injection detection is particularly relevant for high-risk systems: an adversary who can manipulate the system's behaviour through crafted inputs can cause the system to produce outputs that violate its intended purpose or harm affected persons. Prompt injection rates should be tracked and reported alongside content safety metrics.

Key outputs

  • Policy violation rate tracking over time
  • Lakera Guard, NeMo Guardrails, Llama Guard integration
  • Every-output safety monitoring in production
  • Prompt injection detection and rate tracking
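The violation-rate tracking described above can be sketched as a sliding-window monitor. The class name, window size, and alert threshold are illustrative assumptions; in practice the per-output violation flag would come from a scanner such as Lakera Guard or Llama Guard, and alerts would feed the PMM report rather than a boolean return value.

```python
from collections import deque

class ViolationRateMonitor:
    """Track the policy violation rate over a sliding window of recent outputs."""

    def __init__(self, window: int = 1000, alert_threshold: float = 0.02):
        self.window = deque(maxlen=window)   # most recent violation flags
        self.alert_threshold = alert_threshold

    def record(self, violated: bool) -> bool:
        """Record one output's violation flag; return True if the rate exceeds the threshold."""
        self.window.append(violated)
        return self.rate() > self.alert_threshold

    def rate(self) -> float:
        """Current violation rate over the window (0.0 when no outputs seen yet)."""
        return sum(self.window) / len(self.window) if self.window else 0.0
```

A windowed rate rather than a lifetime average makes a recent rise (for example, a newly discovered bypass technique) visible quickly.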

Prompt/Response Distribution Monitoring

The Technical SME monitors the distribution of incoming prompts for shifts that indicate the system is being used outside its intended purpose. Topic classification of incoming prompts, using BERTopic or custom embedding-based clustering, detects usage drift. A sudden shift in the topic distribution (for example, a large increase in prompts about a topic the system was not designed for) may indicate misuse, a change in the user population, or an adversarial probing campaign. Output characteristics (length, sentiment, topic, confidence indicators) are similarly monitored for shifts that might indicate model degradation or adversarial manipulation. Baseline topic and output distributions are established at deployment and tracked over time. Monitoring alerts when the topic distribution diverges significantly from the baseline, and an investigation determines whether the shift represents legitimate evolution of the user population (requiring an intended purpose review) or problematic usage (requiring corrective action).

Key outputs

  • Prompt topic classification via BERTopic or embedding clustering
  • Output characteristic distribution monitoring
  • Baseline establishment at deployment with ongoing tracking
  • Intended purpose review trigger for significant shifts
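Divergence between the baseline and current topic distributions can be quantified in several ways; a Jensen-Shannon divergence over topic frequencies is one common choice, though the text does not mandate a specific measure. The sketch below assumes both distributions are over the same, aligned set of topics, and the 0.1 alert threshold is a hypothetical value; the base-2 JS divergence is bounded in [0, 1], which makes a fixed threshold convenient.

```python
import math

def js_divergence(p: list[float], q: list[float]) -> float:
    """Base-2 Jensen-Shannon divergence between two aligned discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for fully disjoint ones.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a: list[float], b: list[float]) -> float:
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return (kl(p, m) + kl(q, m)) / 2

def topic_drift_alert(baseline: list[float], current: list[float],
                      threshold: float = 0.1) -> bool:
    """Alert when the current topic distribution has drifted past the threshold."""
    return js_divergence(baseline, current) > threshold
```

In practice `baseline` would be the topic frequencies captured at deployment and `current` the frequencies over a recent window of prompts.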

Annotation Platforms

Automated monitoring cannot capture all dimensions of LLM output quality. A regular human evaluation programme provides qualitative assessment that complements automated metrics. Argilla, Label Studio, and Prodigy provide annotation platforms for structuring this evaluation. The AI Governance Lead defines the human evaluation cadence in the PMM plan. A common approach evaluates a random sample of 100–500 outputs weekly, rated on a structured rubric covering accuracy, relevance, safety, and explanation quality. The evaluation results feed into the PMM report and provide the ground truth against which automated quality metrics are calibrated. The annotation platform should support structured rubrics (ensuring consistency across evaluators), inter-annotator agreement measurement (ensuring evaluation quality), and integration with the monitoring pipeline (feeding results back into the metric computation layer).

Key outputs

  • Weekly human evaluation of 100–500 output samples
  • Structured rubric (accuracy, relevance, safety, explanation quality)
  • Inter-annotator agreement measurement
  • Results integrated into PMM metrics and reporting
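For categorical rubric ratings from two annotators, inter-annotator agreement is commonly measured with Cohen's kappa, which corrects raw agreement for chance; the text does not prescribe a specific metric, so this is one standard choice. A minimal sketch:

```python
def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items, in the same order.

    1.0 is perfect agreement; 0.0 is agreement no better than chance.
    """
    assert len(a) == len(b) and a, "need matched, non-empty label lists"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)
```

Annotation platforms such as Argilla and Label Studio can export paired labels in a form suitable for this kind of computation; low kappa values would prompt rubric clarification or evaluator retraining before the ratings are used to calibrate automated metrics.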