v2.4.0

Model outputs pass through a filtering layer before reaching the consumer. For classification models, confidence scores below a minimum threshold trigger a “low confidence” flag rather than a definitive classification, routing the decision to human review. For generative models, output filters detect and redact personally identifiable information, detect content that falls outside the system’s intended purpose, and enforce output length limits.
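The confidence-based routing described above can be sketched as follows. This is a minimal illustration, not the document's actual implementation; the threshold value and the function and field names (`CONFIDENCE_THRESHOLD`, `route_classification`) are assumptions.

```python
# Hypothetical sketch of confidence-based routing for a classifier.
# The threshold is policy-defined in practice; 0.85 is an assumed value.
CONFIDENCE_THRESHOLD = 0.85

def route_classification(label: str, confidence: float) -> dict:
    """Return a definitive classification, or a low-confidence flag
    that routes the decision to human review."""
    if confidence < CONFIDENCE_THRESHOLD:
        # Below threshold: emit a flag instead of a classification.
        return {"status": "low_confidence", "route": "human_review"}
    return {"status": "classified", "label": label, "route": "consumer"}
```

A caller consuming this result would deliver `"classified"` outputs downstream and queue `"low_confidence"` ones for a reviewer.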

The output filtering layer implements the “untrusted output” principle described in : all model outputs are treated as potentially containing content that could cause harm if consumed without validation. The filtering logic is implemented as a dedicated middleware or service on the inference output path, architecturally separate from the model itself. This separation ensures that filtering cannot be bypassed and that changes to the filtering logic are visible as discrete, reviewable events.
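As a sketch of the architectural separation described above, the filtering logic can live in a wrapper on the output path so that callers never receive raw model output. The filter rules here (an email regex, a fixed length cap) are illustrative assumptions, not the document's actual rules.

```python
import re

MAX_OUTPUT_CHARS = 2000  # assumed output length limit
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # illustrative PII pattern

def filter_output(text: str) -> str:
    """Apply PII redaction and length enforcement to raw model output."""
    text = EMAIL_RE.sub("[REDACTED]", text)  # redact email-style PII
    return text[:MAX_OUTPUT_CHARS]           # enforce the length limit

def serve(model, prompt: str) -> str:
    # Every output passes through filter_output before reaching the
    # consumer, so the filtering layer cannot be bypassed.
    return filter_output(model(prompt))
```

Because `serve` is the only path to the model, a change to `filter_output` is a discrete, reviewable code change rather than an opaque model-behavior shift.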

The output filtering configuration is version-controlled and subject to the same governance as other configuration changes. Changes that alter which content is filtered or how filtering decisions are made should be assessed against the substantial modification thresholds. The filtering logic, its configuration, and the filtering rates are documented in Module 9.
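A version-controlled filtering configuration might look like the following. This is a hypothetical sketch; the keys, values, and schema are illustrative and not taken from the document.

```yaml
# filtering-config.yaml -- illustrative example, not the actual schema
classification:
  confidence_threshold: 0.85        # below this, route to human review
  low_confidence_route: human_review
generative:
  pii_redaction: enabled
  scope_filter: enabled             # reject out-of-purpose content
  max_output_tokens: 1024
```

Keeping a file like this under version control means a change to, say, `confidence_threshold` appears as a reviewable diff that can be checked against the substantial modification thresholds before deployment.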

Key outputs

  • Confidence-based routing for low-confidence outputs
  • PII redaction and content scope filtering for generative models
  • Dedicated filtering middleware on the inference output path
  • Module 9 AISDP documentation