Four Scoring Dimensions
Risks are scored using a likelihood-impact matrix. Impact is assessed against four dimensions: health and safety, fundamental rights, operational integrity, and reputational exposure. Each dimension has a calibrated five-point rubric.
Each dimension's rubric runs from negligible (1) to catastrophic (5):
- Health and safety: from no measurable consequence (1) to irreversible harm to life or safety affecting a large and vulnerable population (5).
- Fundamental rights: from no discernible Charter right effect (1) to large-scale or irreversible infringement, or infringement of a right of particular sensitivity such as human dignity or non-discrimination (5).
- Operational integrity: from no effect on availability or accuracy (1) to total system failure or integrity compromise (5).
- Reputational exposure: from internal awareness only (1) to sustained public attention, political scrutiny, and regulatory enforcement (5).
Likelihood is scored separately on a five-point scale: rare (1), unlikely (2), possible (3), likely (4), and almost certain (5). Each score must be accompanied by a written rationale citing specific evidence.
Key outputs
- Four-dimension impact assessment (health/safety, rights, operational, reputational)
- Five-point calibrated rubrics per dimension
- Separate likelihood scoring with evidence-based rationale
- Module 6 AISDP documentation
Composite Scoring & Documented Weighting Rationale
The composite risk score is the product of the likelihood rating and the highest impact rating across the four dimensions. This “worst-case dimension” approach ensures that a risk with low operational impact but catastrophic fundamental rights impact is not diluted by averaging. The AI System Assessor records all four impact ratings; the composite score drives treatment priority, but the individual dimension scores inform the type of mitigation required.
Risks scoring above the organisation’s defined threshold (typically 12 or above on a 25-point scale) require specific, documented mitigation measures. Those scoring below the threshold may be accepted, with the acceptance recorded and signed by the AI Governance Lead. The threshold itself should be documented with its rationale and reviewed periodically; a threshold set too high leaves material risks unmitigated, while one set too low creates an unmanageable mitigation burden.
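The worst-case composite calculation and threshold check can be sketched as follows. This is an illustrative sketch only: the class name, field names, and the example ratings are assumptions, not part of the methodology; the threshold of 12 matches the typical value cited above.

```python
from dataclasses import dataclass

@dataclass
class RiskScore:
    """Illustrative record of one risk's ratings (names are assumptions)."""
    likelihood: int          # 1 (rare) .. 5 (almost certain)
    health_safety: int       # each impact dimension rated 1 .. 5
    rights: int
    operational: int
    reputational: int

    def composite(self) -> int:
        # Worst-case dimension approach: likelihood x highest impact rating,
        # so a catastrophic rights impact is never diluted by averaging.
        return self.likelihood * max(
            self.health_safety, self.rights,
            self.operational, self.reputational)

TREATMENT_THRESHOLD = 12  # documented with rationale, reviewed periodically

# Low operational impact but catastrophic rights impact still scores high.
risk = RiskScore(likelihood=3, health_safety=2, rights=5,
                 operational=1, reputational=3)
print(risk.composite())                         # 3 x 5 = 15
print(risk.composite() >= TREATMENT_THRESHOLD)  # True: mitigation required
```

Note that all four dimension ratings stay on the record; only the composite is derived, so the type of mitigation can still be read from the individual scores.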
The weighting rationale (why the worst-case dimension approach was chosen over averaging or other composite methods) is documented in Module 6. This rationale enables an assessor to understand the scoring methodology and evaluate its appropriateness for the system’s risk profile.
Key outputs
- Composite score = Likelihood × Highest Impact Dimension
- Treatment threshold documented with rationale
- All four dimension scores retained alongside the composite
- Module 6 AISDP documentation
Calibration Workshops — Reference Scenarios & Anchors
Scoring is inherently subjective. Calibration workshops present assessors with five to ten reference scenarios drawn from published enforcement actions, the AI Incident Database, or internal near-miss events. Assessors score the scenarios independently, then compare results. Divergences are discussed and the group agrees on reference scores for each scenario; these become calibration anchors.
When scoring a new risk, assessors compare it to the anchored scenarios, grounding their scoring in concrete reference points rather than abstract rubric definitions. Systematic divergences, such as one assessor consistently scoring likelihood higher than another, are identified by the AI Governance Lead and addressed through shared reference cases and discussion.
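Detecting systematic divergence from calibration results can be as simple as comparing each assessor's average against the group. A minimal sketch, with hypothetical scores and an illustrative 0.5-point cut-off:

```python
from statistics import mean

# Hypothetical calibration results: likelihood scores (1-5) that each
# assessor gave to the same five reference scenarios.
scores = {
    "assessor_a": [2, 3, 3, 4, 2],
    "assessor_b": [3, 4, 4, 5, 3],  # consistently about one point higher
    "assessor_c": [2, 3, 4, 4, 2],
}

group_mean = mean(s for ratings in scores.values() for s in ratings)

# Flag assessors whose average deviates from the group mean by more than
# half a point: a sign of systematic bias rather than scenario-specific
# disagreement. The 0.5 cut-off is an illustrative choice, not a standard.
for name, ratings in scores.items():
    bias = mean(ratings) - group_mean
    if abs(bias) > 0.5:
        print(f"{name}: systematic divergence of {bias:+.2f}")
```

In this sample only assessor_b is flagged; the others' deviations fall within scenario-level noise.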
Calibration workshops should precede each assessment cycle. New assessors complete a calibration exercise before conducting their first live assessment. Where the organisation has multiple high-risk systems, cross-system calibration ensures that a “Significant” rating carries the same meaning across the portfolio, enabling meaningful portfolio-level risk reporting. The calibration results are retained as Module 6 compliance evidence.
Key outputs
- Calibration workshops with 5–10 reference scenarios before each assessment cycle
- Calibration anchors agreed by the assessor group
- Cross-system calibration for portfolio consistency
- Module 6 AISDP evidence
Semi-Quantitative Bayesian Scoring
For high-uncertainty risks where the team cannot confidently distinguish between likelihood levels, semi-quantitative Bayesian scoring offers a more defensible approach than forcing a single point estimate. Each assessor provides a probability distribution across the five likelihood levels, for example: 10% rare, 30% unlikely, 40% possible, 15% likely, 5% almost certain.
The distributions are aggregated across assessors, and the resulting expected value and confidence interval are reported alongside the risk. This makes uncertainty visible rather than concealing it behind a point estimate. A risk with a narrow confidence interval around “Possible” represents a different level of confidence than a risk with a wide distribution spanning “Unlikely” to “Almost Certain,” even if both have the same expected value.
Most GRC platforms do not natively support distributional scoring. Implementation may require a custom tool: a Python script or a simple web form that collects distributions and computes aggregates. Semi-quantitative Bayesian scoring is recommended for the system's top ten risks and for any risk where assessors disagree by more than one point on the standard scale. The distributions and aggregation methodology are documented in Module 6.
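The aggregation step can be sketched in a few lines. The distributions below are hypothetical, the element-wise averaging is a simple linear opinion pool (one of several defensible aggregation choices), and the 90% interval is read off the pooled cumulative distribution:

```python
from statistics import mean

LEVELS = ["rare", "unlikely", "possible", "likely", "almost certain"]

# Hypothetical elicited distributions: each assessor assigns probabilities
# to the five likelihood levels (index 0-4 maps to score 1-5).
assessor_dists = [
    [0.10, 0.30, 0.40, 0.15, 0.05],
    [0.05, 0.20, 0.35, 0.30, 0.10],
    [0.15, 0.35, 0.30, 0.15, 0.05],
]

# Linear opinion pool: average the distributions element-wise.
pooled = [mean(col) for col in zip(*assessor_dists)]

# Expected likelihood score on the 1-5 scale.
expected = sum((i + 1) * p for i, p in enumerate(pooled))

def quantile(dist, q):
    """Smallest score whose pooled cumulative probability reaches q."""
    cum = 0.0
    for i, p in enumerate(dist):
        cum += p
        if cum >= q:
            return i + 1
    return len(dist)

low, high = quantile(pooled, 0.05), quantile(pooled, 0.95)
print(f"expected score {expected:.2f}, 90% interval [{low}, {high}]")
```

Here the wide interval is the point: the expected score sits near "Possible", but the pooled distribution spans the whole scale, which a single point estimate would conceal.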
Key outputs
- Probability distributions across likelihood levels per assessor
- Aggregated expected values and confidence intervals
- Uncertainty made visible for high-uncertainty risks
- Module 6 AISDP evidence
Written Rationale per Score
Every risk score must be accompanied by a written rationale citing specific evidence. “Medium likelihood” is insufficient; the assessor must explain why medium rather than high, citing the frequency of a particular failure mode observed during testing, the exposure of the affected population, comparable incidents in similar systems, or the maturity of the mitigations in place.
The written rationale serves two functions. During the assessment, it forces the assessor to ground their judgement in evidence rather than intuition. During conformity assessment or regulatory inspection, it enables a reviewer to evaluate whether the score is defensible and to challenge it if the evidence does not support the conclusion.
Scoring patterns across the register are reviewed by the AI Governance Lead to identify systematic inconsistencies before the assessment is finalised. A register where all risks cluster at the same score suggests that the rubric is not being applied with sufficient granularity.
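A clustering check of this kind is straightforward to automate before sign-off. A minimal sketch, with a hypothetical register and an illustrative 60% cut-off:

```python
from collections import Counter

# Hypothetical register of composite scores awaiting Governance Lead review.
register = [12, 12, 12, 15, 12, 12, 12, 12, 12, 12]

counts = Counter(register)
score, freq = counts.most_common(1)[0]

# If most risks share one score, the rubric is probably not being applied
# with sufficient granularity. The 60% cut-off is an illustrative choice.
if freq / len(register) > 0.6:
    print(f"clustering: {freq}/{len(register)} risks scored {score}")
```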
Key outputs
- Written rationale per risk score citing specific evidence
- Evidence grounding for both likelihood and impact dimensions
- AI Governance Lead review of scoring patterns for consistency
- Module 6 AISDP evidence