Phase 4: Development & Testing — Owner & Outputs
Phase 4 runs during Weeks 6 to 18. The Engineering Team, led by the Technical Owner, owns this phase.
The objective is to build the system in accordance with the approved architecture, generating compliance evidence as a natural byproduct of the engineering workflow. All code, model, and data artefacts are version-controlled. The CI/CD pipeline enforces quality gates at every commit: static analysis (including AI-specific rules), unit testing for every component type, contract testing between services, dependency and licence scanning, and secret detection.
Data engineering follows the pre-step/post-step capture methodology, documenting each transformation before execution and verifying it afterwards. Dataset documentation is maintained continuously as datasets are assembled, cleaned, and transformed. Model training, validation, and testing follow the documented methodology, with performance, fairness, robustness, and calibration metrics computed and recorded. The model validation gate blocks promotion of any model that fails AISDP-declared thresholds.
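The pre-step/post-step capture described above can be sketched as a decorator that records a declaration of the transformation before it runs and verifies a postcondition afterwards. This is an illustrative sketch only: the function names (`captured_step`, `drop_unlabelled`) and the log format are assumptions, not a prescribed API.

```python
import hashlib
import json
import logging
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data-lineage")

def _fingerprint(rows):
    """Stable short hash of a dataset's rows, evidencing before/after state."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def captured_step(description, postcondition):
    """Document a transformation before execution, verify it afterwards."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(rows):
            # Pre-step: declare the intended transformation and the input state.
            log.info("PRE-STEP %s: %s (input=%s, %d rows)",
                     fn.__name__, description, _fingerprint(rows), len(rows))
            result = fn(rows)
            # Post-step: verify the declared outcome actually holds.
            if not postcondition(result):
                raise ValueError(f"post-step check failed for {fn.__name__}")
            log.info("POST-STEP %s: verified (output=%s, %d rows)",
                     fn.__name__, _fingerprint(result), len(result))
            return result
        return wrapper
    return decorator

@captured_step("drop records with missing labels",
               postcondition=lambda rows: all(r["label"] is not None for r in rows))
def drop_unlabelled(rows):
    return [r for r in rows if r["label"] is not None]
```

Because every transformation is declared and then verified, the log itself becomes dataset lineage evidence rather than a separate documentation task.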
The human oversight interface is developed with automation bias countermeasures, mandatory review workflows, and override capability. The explainability layer is implemented with fidelity validation. Cybersecurity testing runs throughout: SAST and DAST in the CI pipeline, dependency scanning, container image scanning, and infrastructure-as-code scanning. Adversarial ML testing covers adversarial examples, data poisoning simulations, and prompt injection testing where applicable.
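One simple form of the robustness testing mentioned above is a perturbation probe: check that predictions are stable when inputs are nudged by small random amounts. The sketch below is a minimal, stdlib-only illustration, not a substitute for a full adversarial ML suite; the function name and parameters are assumptions.

```python
import random

def perturbation_robustness(predict, inputs, epsilon=0.01, trials=20, seed=0):
    """Fraction of inputs whose prediction stays unchanged under small
    random perturbations of each numeric feature."""
    rng = random.Random(seed)
    stable = 0
    for x in inputs:
        base = predict(x)
        ok = True
        for _ in range(trials):
            noisy = [v + rng.uniform(-epsilon, epsilon) for v in x]
            if predict(noisy) != base:
                ok = False   # one flipped prediction marks the input unstable
                break
        stable += ok
    return stable / len(inputs)

# Example: a trivial threshold model on the first feature.
model = lambda x: int(x[0] > 0.5)
stable_fraction = perturbation_robustness(model, [[0.1], [0.9], [0.501]])
```

Inputs far from the decision boundary score as stable; inputs near it reveal fragility, which is the behaviour a robustness gate would threshold on.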
Key outputs
- Version-controlled code, model artefacts, and dataset versions
- Automated test reports (unit, integration, regression, fairness, robustness)
- Auto-generated model cards
- Data quality reports and training pipeline logs
- Cybersecurity scan results and remediation records
- Feature registry with proxy variable assessments
Phase 4: Governance Gate (Sprint-Level Compliance Review)
Phase 4 employs multiple governance gates operating at different cadences rather than a single end-of-phase gate. An automated model validation gate blocks any model that fails performance, fairness, or robustness thresholds. A manual security review gate applies for the first deployment. The integration test suite must pass before any promotion.
The recommended approach is to embed compliance activities in the sprint cadence itself. Each sprint should include:
- updating the relevant AISDP modules for any design decisions made during the sprint
- running the full test suite (including fairness and robustness gates) as part of the sprint's definition of done
- reviewing any new risks identified during development and adding them to the risk register
- updating the evidence pack with artefacts produced during the sprint
The sprint retrospective should include a compliance dimension: what evidence was generated, what gaps remain, and what risks were introduced.
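The sprint-level definition of done can be made machine-checkable. The sketch below is one possible encoding, assuming a sprint record is a simple dictionary of completion flags; the item names are illustrative, not a mandated schema.

```python
# Compliance items required before a sprint can close (illustrative names).
SPRINT_DOD = [
    "aisdp_modules_updated",    # design decisions reflected in the AISDP
    "test_suite_passed",        # includes fairness and robustness gates
    "risk_register_reviewed",   # new risks captured during development
    "evidence_pack_updated",    # sprint artefacts filed as evidence
]

def compliance_done(sprint_record):
    """Return outstanding compliance items; an empty list means done."""
    return [item for item in SPRINT_DOD if not sprint_record.get(item)]
```

A retrospective can then report the returned list directly as the sprint's compliance gap summary.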
This approach ensures that the AISDP is assembled incrementally throughout development. Module 1 (System Identity) is completed during Phase 1. Module 6 (Risk Management) is drafted during Phase 2 and updated continuously. Module 3 (Architecture) is populated during Phase 3 and refined as the architecture evolves. Module 4 (Data Governance) grows as the data engineering work progresses. By the time Phase 5 arrives, the AISDP should be substantially complete, requiring only final review and consistency checking.
Feature flags that enable new model versions, data sources, or decision pathways are themselves system changes that the AI System Assessor assesses against the substantial modification thresholds. Feature flag configurations are version-controlled, and activation events are logged in the deployment ledger.
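Logging flag activations to the deployment ledger might look like the following. The class name, field names, and the `assessment_ref` link back to the Assessor's substantial-modification check are all assumptions for illustration.

```python
import json
from datetime import datetime, timezone

class DeploymentLedger:
    """Append-only log of feature-flag activation events (illustrative)."""

    def __init__(self):
        self.entries = []

    def record_activation(self, flag, enabled, actor, assessment_ref):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "flag": flag,
            "enabled": enabled,
            "actor": actor,
            # Reference to the substantial-modification assessment for this flag.
            "assessment_ref": assessment_ref,
        }
        self.entries.append(entry)
        return json.dumps(entry)  # one JSON line per event, easy to ship to a log store
```

Because each entry carries an assessment reference, an auditor can trace any activation back to the modification decision that authorised it.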
Key outputs
- Sprint-level compliance review records
- Incrementally assembled AISDP modules
- Feature flag configuration and activation logs
Phase 4: CI/CD Gates & Incremental AISDP Population
The CI/CD pipeline is the mechanism through which compliance evidence is produced as a byproduct of development. For AI systems, CI/CD extends beyond traditional software pipelines to operate on multiple artefact types (code, data, models, configurations) with multiple interconnected build processes.
A compliance-grade pipeline defines discrete, auditable stages. Data preparation ingests from documented sources, applies quality checks, and produces a versioned dataset. Feature engineering transforms the dataset with lineage captured at each step. Model training records all metadata: duration, resource consumption, convergence metrics, random seed. Model evaluation computes all performance, fairness, robustness, and calibration metrics declared in the AISDP; the stage fails if any metric breaches its declared threshold.
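Recording training metadata as a byproduct of the run can be as simple as a wrapper around the training call. This is a minimal sketch; the field names (`duration_s`, `convergence`) are assumptions chosen to mirror the metadata listed above.

```python
import random
import time

def train_with_metadata(train_fn, seed):
    """Run a training function and capture duration, seed, and convergence
    metrics alongside the resulting model artefact."""
    random.seed(seed)                  # fix the random seed and record it
    start = time.monotonic()
    model, convergence = train_fn()    # train_fn returns (artefact, metrics)
    metadata = {
        "duration_s": round(time.monotonic() - start, 3),
        "random_seed": seed,
        "convergence": convergence,
    }
    return model, metadata
```

The metadata dictionary is then versioned with the model artefact so the run is reproducible and auditable.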
Four model validation gates enforce compliance boundaries. Gate 1 (Performance) verifies that accuracy, precision, recall, and other metrics meet AISDP-declared thresholds. Gate 2 (Fairness) evaluates selection rate ratios, equalised odds, and calibration across protected characteristic subgroups. Gate 3 (Robustness) tests resilience to adversarial examples and input perturbation. Gate 4 (Drift) compares the candidate model’s behaviour against the production and baseline models. Any gate failure halts the pipeline.
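The four gates above can be expressed as a single check against AISDP-declared thresholds, with any failure halting the pipeline. The metric names and threshold keys below are illustrative assumptions; a real system would read both from the versioned AISDP.

```python
def run_validation_gates(candidate, thresholds):
    """Evaluate candidate-model metrics against declared thresholds.
    Raises on any gate failure so the pipeline halts before promotion."""
    gates = {
        # Gate 1: performance against declared minima.
        "performance": (candidate["accuracy"] >= thresholds["min_accuracy"]
                        and candidate["recall"] >= thresholds["min_recall"]),
        # Gate 2: fairness across protected-characteristic subgroups.
        "fairness": candidate["selection_rate_ratio"]
                    >= thresholds["min_selection_rate_ratio"],
        # Gate 3: robustness under adversarial perturbation.
        "robustness": candidate["adversarial_accuracy"]
                      >= thresholds["min_adversarial_accuracy"],
        # Gate 4: drift relative to the production/baseline model.
        "drift": abs(candidate["accuracy"] - candidate["baseline_accuracy"])
                 <= thresholds["max_accuracy_drift"],
    }
    failed = [name for name, ok in gates.items() if not ok]
    if failed:
        raise RuntimeError(f"validation gates failed: {failed}")
    return gates
```

The returned gate results double as pass/fail evidence for the audit trail.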
The pipeline definition itself is a compliance artefact, version-controlled alongside code and configuration. Changes to the pipeline definition constitute changes to the development process documented in AISDP Module 2. Each pipeline stage should be idempotent (same inputs produce same outputs), observable (emitting structured logs and metrics), and recoverable (resumable from the failed stage without re-executing completed stages). Pipeline orchestration tools such as Apache Airflow, Kubeflow Pipelines, Dagster, or Prefect manage dependencies and sequencing.
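The idempotent/observable/recoverable properties can be sketched without any orchestrator: key each stage's output by a hash of its input, skip stages whose output is already cached, and log every decision. This is a toy illustration of the pattern, not a replacement for the orchestration tools named above.

```python
import hashlib
import json
import logging

log = logging.getLogger("pipeline")

def run_pipeline(stages, payload, cache):
    """Run (name, fn) stages in order. A stage whose (name, input-hash)
    key is cached is skipped, so a re-run resumes from the first stage
    that has not completed -- recoverability via idempotent stages."""
    for name, fn in stages:
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest()
        key = (name, digest)
        if key in cache:
            payload = cache[key]          # skip already-completed work
            log.info("stage %s: cache hit", name)
            continue
        payload = fn(payload)             # same input -> same output
        cache[key] = payload
        log.info("stage %s: completed", name)  # observable: structured logs
    return payload
```

After a mid-run failure, calling `run_pipeline` again with the same cache re-executes only the stages that never recorded an output.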
Key outputs
- Pipeline execution records and metadata
- Gate pass/fail evidence
- Auto-generated model cards and evaluation reports
- Versioned pipeline definition