v2.4.0

Model outputs pass through a filtering layer before reaching the consumer. For classification models, confidence scores below a minimum threshold trigger a “low confidence” flag rather than a definitive classification, routing the decision to human review. For generative models, output filters detect and redact personally identifiable information, detect content that falls outside the system’s intended purpose, and enforce output length limits.
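The confidence-based routing described above can be sketched as follows. This is a minimal illustration, not the document's actual implementation; the threshold value and the function and field names (`CONFIDENCE_THRESHOLD`, `route_classification`) are assumptions.

```python
# Hypothetical sketch of confidence-based routing for a classifier.
# The threshold is policy-defined in practice; 0.85 is an assumed value.
CONFIDENCE_THRESHOLD = 0.85

def route_classification(label: str, confidence: float) -> dict:
    """Return a definitive classification, or a low-confidence flag
    that routes the decision to human review."""
    if confidence < CONFIDENCE_THRESHOLD:
        # Below threshold: emit a flag instead of a classification.
        return {"status": "low_confidence", "route": "human_review"}
    return {"status": "classified", "label": label, "route": "consumer"}
```

A caller consuming this result would deliver `"classified"` outputs downstream and queue `"low_confidence"` ones for a reviewer.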

The output filtering layer implements the “untrusted output” principle described in : all model outputs are treated as potentially containing content that could cause harm if consumed without validation. The filtering logic is implemented as a dedicated middleware or service on the inference output path, architecturally separate from the model itself. This separation ensures that filtering cannot be bypassed and that changes to the filtering logic are visible as discrete, reviewable events.
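As a sketch of the architectural separation described above, the filtering logic can live in a wrapper on the output path so that callers never receive raw model output. The filter rules here (an email regex, a fixed length cap) are illustrative assumptions, not the document's actual rules.

```python
import re

MAX_OUTPUT_CHARS = 2000  # assumed output length limit
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative PII pattern

def filter_output(text: str) -> str:
    """Apply PII redaction and length enforcement to raw model output."""
    text = EMAIL_RE.sub("[REDACTED]", text)  # redact email-style PII
    return text[:MAX_OUTPUT_CHARS]           # enforce the length limit

def serve(model, prompt: str) -> str:
    # Every output passes through filter_output before reaching the
    # consumer, so the filtering layer cannot be bypassed.
    return filter_output(model(prompt))
```

Because `serve` is the only path to the model, a change to `filter_output` is a discrete, reviewable code change rather than an opaque model-behavior shift.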

The output filtering configuration is version-controlled and subject to the same governance as other configuration changes. Changes that alter which content is filtered or how filtering decisions are made should be assessed against the substantial modification thresholds. The filtering logic, its configuration, and the filtering rates are documented in Module 9.
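A version-controlled filtering configuration might look like the following. This is a hypothetical sketch; the keys, values, and schema are illustrative and not taken from the document.

```yaml
# filtering-config.yaml -- illustrative example, not the actual schema
classification:
  confidence_threshold: 0.85        # below this, route to human review
  low_confidence_route: human_review
generative:
  pii_redaction: enabled
  scope_filter: enabled             # reject out-of-purpose content
  max_output_tokens: 1024
```

Keeping a file like this under version control means a change to, say, `confidence_threshold` appears as a reviewable diff that can be checked against the substantial modification thresholds before deployment.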

Key outputs

  • Confidence-based routing for low-confidence outputs
  • PII redaction and content scope filtering for generative models
  • Dedicated filtering middleware on the inference output path
  • Module 9 AISDP documentation