Schema Contracts & Quality Specification
The data ingestion layer is the system’s first contact with external data, and schema contracts are the primary mechanism for ensuring that only conforming data enters the pipeline. Every data source must have a defined contract specifying a schema (field names, types, and formats), a quality specification (acceptable missing-value rates, value ranges, and distributional properties), and a freshness requirement.
The ingestion pipeline enforces these contracts before data enters the system. For batch ingestion, tools such as Great Expectations or Soda Core run expectation suites against each incoming dataset. For streaming ingestion, Apache Kafka’s Schema Registry enforces schema validation on every message. Records that fail validation are rejected with a logged error rather than being silently coerced, ensuring that malformed or out-of-distribution data does not corrupt the training or inference pipeline.
This control addresses the risk of intent drift at the data source. Upstream systems change: a CRM vendor may modify a field’s enumeration values, a data provider may alter a date encoding, or a partner organisation may change how it computes a derived field. Without boundary validation, these changes enter the pipeline undetected. With it, they are caught, quarantined, and investigated. The investigation log becomes Module 4 evidence in the AISDP.
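The contract-then-validate flow described above can be sketched in plain Python. The field names, types, and ranges below are hypothetical, and a production pipeline would delegate this to Great Expectations, Soda Core, or a schema registry rather than hand-rolling it; the sketch only illustrates the reject-with-logged-error behaviour.

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

@dataclass
class SchemaContract:
    """Schema and quality specification for one data source (illustrative)."""
    required_fields: dict                             # field name -> expected Python type
    value_ranges: dict = field(default_factory=dict)  # field name -> (min, max)

def validate(record: dict, contract: SchemaContract) -> list:
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for name, expected_type in contract.required_fields.items():
        if name not in record or record[name] is None:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"type mismatch: {name}")
    for name, (lo, hi) in contract.value_ranges.items():
        v = record.get(name)
        if isinstance(v, (int, float)) and not (lo <= v <= hi):
            errors.append(f"out of range: {name}={v}")
    return errors

# Hypothetical contract for a 'transactions' source.
contract = SchemaContract(
    required_fields={"txn_id": str, "amount": float},
    value_ranges={"amount": (0.0, 1_000_000.0)},
)

good = {"txn_id": "t-1", "amount": 42.0}
bad = {"txn_id": "t-2", "amount": -5.0}
assert validate(good, contract) == []
for err in validate(bad, contract):
    # Rejected with a logged error, never silently coerced.
    log.error("rejected record %s: %s", bad["txn_id"], err)
```

Note that a conforming record passes untouched while a non-conforming one is rejected and logged; the quarantine workflow then picks up the rejected record for investigation.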
Key outputs
- Schema contract per data source (field names, types, formats, value ranges)
- Quality specification per data source (missing-value thresholds, distributional properties)
- Validation pipeline configuration (Great Expectations, Soda Core, or Kafka Schema Registry)
- Investigation and resolution logs for rejected records
Freshness Requirements
Each data source feeding the AI system must have a documented freshness requirement: the maximum acceptable age of records at the point of ingestion. This requirement forms part of the schema contract described in Article 110 and ensures that the system does not make decisions based on stale information.
Freshness requirements vary by data source and use case. A real-time fraud detection system may require transaction data no older than seconds; a quarterly workforce planning model may accept data that is weeks old. The Technical SME defines the freshness threshold based on the system’s intended purpose and the rate at which the underlying data changes. Records that exceed the freshness threshold are flagged or rejected at the ingestion layer.
The freshness requirement also has implications for data distribution monitoring. If the temporal profile of incoming data shifts (for example, because a batch feed is delayed), the ingestion layer should detect this and raise an alert. Stale data that passes schema validation but falls outside the expected temporal window may introduce subtle biases, particularly if the staleness affects some subgroups more than others.
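A freshness check at the ingestion layer can be as simple as comparing record age against the documented threshold. The 24-hour threshold below is a hypothetical value for a daily batch feed; the Technical SME sets the actual figure per source.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold for a daily batch feed: records older than
# 24 hours at the point of ingestion are flagged or rejected.
FRESHNESS_THRESHOLD = timedelta(hours=24)

def is_fresh(record_ts: datetime, now: datetime,
             threshold: timedelta = FRESHNESS_THRESHOLD) -> bool:
    """True if the record's age at ingestion is within the threshold."""
    return (now - record_ts) <= threshold

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 6, 1, 0, 0, tzinfo=timezone.utc)    # 12 hours old
stale = datetime(2024, 5, 28, 12, 0, tzinfo=timezone.utc)  # 4 days old
assert is_fresh(fresh, now)
assert not is_fresh(stale, now)   # flag or reject at the ingestion layer
```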
Key outputs
- Freshness threshold per data source, documented in the schema contract
- Alerting configuration for freshness violations
- Evidence of Technical SME rationale for each threshold
Boundary Validation Tooling (Great Expectations, Soda Core, Kafka Schema Registry)
Boundary validation tooling implements the schema contracts and quality specifications described in Article 110 as automated, repeatable checks. Three tools are identified, each suited to a different ingestion pattern.

Great Expectations is an open-source framework for defining data quality expectations as code. Expectation suites declare what properties a dataset must satisfy (column types, value ranges, uniqueness constraints, distributional bounds) and are executed against each incoming batch. Results are logged, and failures trigger quarantine workflows. Soda Core offers similar capabilities with a SQL-based syntax that integrates well with warehouse-centric architectures.
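As an illustration of the declare-properties-as-code approach, the fragment below sketches a Soda Core (SodaCL) checks file for a hypothetical `transactions` dataset. The dataset name, columns, and thresholds are assumptions for illustration; consult the Soda Core documentation for the authoritative syntax.

```yaml
# Illustrative SodaCL checks for a hypothetical 'transactions' dataset.
checks for transactions:
  - row_count > 0
  - missing_count(txn_id) = 0
  - duplicate_count(txn_id) = 0
  - invalid_percent(amount) < 1%:
      valid min: 0
```

An equivalent Great Expectations suite would express the same constraints as expectations such as column non-nullity, uniqueness, and value-range bounds, executed against each incoming batch.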
For streaming architectures, Apache Kafka’s Schema Registry validates every message against a registered Avro, Protobuf, or JSON schema before it is committed to a topic. Messages that fail validation are rejected to a dead-letter queue. The dead-letter queue is not a discard mechanism; it is an investigation queue. Records routed there must be examined, the root cause identified, and the resolution documented.
The choice of tooling depends on the system’s ingestion pattern (batch, streaming, or hybrid) and the organisation’s existing data infrastructure. Regardless of which tool is selected, the validation layer must be automated, version-controlled alongside the pipeline code, and produce audit-grade logs of every validation run.
Key outputs
- Configured validation tooling (Great Expectations, Soda Core, and/or Kafka Schema Registry)
- Expectation suites or schema definitions version-controlled in the repository
- Dead-letter queue configuration and investigation procedures
- Validation run logs as Module 4 evidence
Dead-Letter Queue for Non-Conforming Records
When the ingestion layer rejects a data record for failing schema validation, range checks, or freshness requirements, the record must not be silently discarded. It is routed to a dead-letter queue: a holding area for non-conforming records that preserves them for investigation.
The dead-letter queue serves two compliance functions. First, it provides evidence that the validation controls are operating correctly; a queue that is never populated may indicate that validation rules are too permissive. Second, it creates an investigation trail. For each record in the queue, the data engineering team documents what failed, why, and how the issue was resolved (correction and re-ingestion, permanent rejection with rationale, or escalation to the data source owner).
Investigation records from the dead-letter queue feed into AISDP Module 4 as evidence of data governance diligence. They also support post-market monitoring by revealing patterns in data quality issues. A sustained increase in dead-letter queue volume from a particular source, for instance, may indicate an upstream change that requires contractual or technical remediation.
Key outputs
- Dead-letter queue configuration in the ingestion pipeline
- Investigation and resolution procedures for queued records
- Periodic summary reports on queue volume and root causes
- Module 4 evidence records
Intent Drift Control — Source Change Detection & Quarantine
Intent drift at the data source is one of the most insidious risks to a high-risk AI system. Upstream systems change without notice: a CRM vendor modifies enumeration values, a data provider alters a field format, or a partner organisation changes how it computes a derived metric. Each of these changes alters the data the model receives, potentially degrading performance or shifting fairness profiles without triggering any visible error.
The ingestion layer’s boundary validation catches structural changes such as schema violations and out-of-range values. Source change detection goes further by monitoring for subtler distributional shifts that pass schema validation. The ingestion layer computes real-time summary statistics (mean, variance, quantile distributions) for incoming data and compares them against the training data baseline. Statistically significant shifts are reported to the monitoring layer.
When a source change is detected, the affected data is quarantined pending investigation. The quarantine prevents potentially compromised data from entering the training or inference pipeline. The investigation determines whether the change is benign (a natural evolution in the underlying population), material (requiring model revalidation or retraining), or indicative of a data quality failure at the source. The resolution is documented in the AISDP and feeds into the post-market monitoring record.
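One common way to quantify such distributional shifts is a population-stability-index (PSI) style comparison of incoming bin proportions against the training baseline. The bin proportions and the 0.2 alert threshold below are illustrative assumptions (0.2 is a common rule of thumb); a production system might instead use Kolmogorov-Smirnov or chi-squared tests, with thresholds set per source.

```python
import math

def psi(expected_props: list, actual_props: list, eps: float = 1e-6) -> float:
    """Population Stability Index between baseline and incoming bin proportions."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected_props, actual_props)
    )

# Hypothetical per-bin proportions for one feature.
baseline = [0.25, 0.25, 0.25, 0.25]          # training data profile
incoming_ok = [0.24, 0.26, 0.25, 0.25]       # minor fluctuation
incoming_shifted = [0.05, 0.15, 0.30, 0.50]  # pronounced shift

ALERT_THRESHOLD = 0.2  # illustrative; set per source in practice

assert psi(baseline, incoming_ok) < ALERT_THRESHOLD       # no alert
assert psi(baseline, incoming_shifted) > ALERT_THRESHOLD  # quarantine + investigate
```

A shift above the threshold triggers the quarantine-and-investigate path described above; a shift below it is logged as part of routine monitoring.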
Key outputs
- Source change detection configuration (statistical tests, thresholds, baseline definitions)
- Quarantine procedures for flagged data
- Investigation and resolution log template
- Integration with post-market monitoring alerting