
AI QE Maturity Assessment Checklist

Score your AI quality engineering practice across 40 dimensions and find your three highest-ROI moves.


What you’ll take away

Map your AI QE practice against five maturity stages using 40 concrete, checkable dimensions drawn from the EU AI Act, ISO/IEC 42001, and the NIST AI RMF.
Identify which stage you are at today — and avoid the common trap of optimising stage-two practices when your risk profile demands stage four.
Pinpoint the three highest-ROI moves for your current maturity stage rather than spreading effort across every gap at once.
Understand how agentic AI systems, LLM red-teaming, and synthetic test data each shift your QE obligations and when to introduce them.
Use the scoring guide and prioritisation matrix to build a credible, time-bound AI QE improvement roadmap.

Why Maturity Scoring Matters Before You Fix Anything

Most AI quality engineering programmes fail not because teams lack effort but because they apply the wrong fixes at the wrong stage. A team spending engineering cycles on automated drift detection before it has documented its model acceptance criteria is building the roof before the foundation is set. Conversely, a team still doing manual spot-checks on a model that processes millions of regulated decisions per day is carrying silent, compounding risk.

This checklist gives you a structured way to score where you actually are across 40 dimensions, calibrate that score against observable benchmarks from comparable enterprise teams, and surface the three moves that will return the most assurance value for your current stage.

The five stages used here align with the capability progression implied by ISO/IEC 42001 (AI management systems), the risk-tiered obligations of the EU AI Act, and the four core functions of the NIST AI Risk Management Framework (Govern, Map, Measure, Manage). A stage is not a grade; it is a description of what your organisation can reliably do today.

---

The Five Maturity Stages at a Glance

Stage 1 — Ad Hoc: Testing happens, but it is uncoordinated, undocumented, and person-dependent.
Stage 2 — Defined: Core test processes exist on paper; some are followed consistently.
Stage 3 — Managed: Metrics are collected, gates are enforced, and QE is integrated into the ML lifecycle.
Stage 4 — Optimised: Risk-based test strategies adapt dynamically; assurance evidence feeds compliance workflows.
Stage 5 — Continuous Assurance: Real-time quality signals, automated governance, and post-deployment monitoring operate as a single loop.

In practice, most enterprise teams assessed across BFSI (banking, financial services, and insurance) and healthcare sit at Stage 2 to early Stage 3. A meaningful minority carry Stage 4 or Stage 5 aspirations without the foundational Stage 2 documentation to support them.

---

Dimension Group A — Governance and Risk Classification (8 items)

This group maps to the NIST AI RMF "Govern" function and the EU AI Act's risk-tier obligations. Score each item, here and in every group that follows: 0 = not in place, 1 = partially in place, 2 = fully in place.

A1. Every AI system in production has a documented risk classification (high-risk, limited-risk, or minimal-risk) using explicit, agreed criteria.
A2. There is a named accountable owner (not just a team) for each AI system's quality and compliance posture.
A3. AI use cases are re-classified when the deployment context changes — not only at initial launch.
A4. A conformance checklist tied to applicable regulation (EU AI Act Annex III categories, sector-specific rules) exists and is version-controlled.
A5. QE involvement is mandated at project intake, not added after model training is complete.
A6. Third-party or vendor model components are subject to the same risk classification process as internally built models.
A7. There is a documented escalation path when a model fails a quality gate — with defined decision rights.
A8. Board or executive-level reporting includes at least one AI quality or risk metric on a regular cycle.

Maximum for Group A: 16 points.
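
One way to make A1 to A3 checkable rather than aspirational is to keep a small registry record per AI system and flag gaps automatically. The sketch below is illustrative only: the `AISystemRecord` fields and the `governance_gaps` helper are hypothetical names, and the three tiers are simply the ones listed in A1.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Tiers as named in item A1 of this checklist.
RISK_TIERS = {"high-risk", "limited-risk", "minimal-risk"}


@dataclass
class AISystemRecord:
    """Hypothetical registry entry for one AI system in production."""
    name: str
    risk_tier: Optional[str]           # A1: documented classification
    accountable_owner: Optional[str]   # A2: a named person, not a team
    last_classified: Optional[date]    # A3: when the tier was last reviewed
    context_changed: Optional[date] = None  # last change of deployment context


def governance_gaps(record: AISystemRecord) -> List[str]:
    """Return the Group A items (A1 to A3) this record fails."""
    gaps = []
    if record.risk_tier not in RISK_TIERS:
        gaps.append("A1: no documented risk classification")
    if not record.accountable_owner:
        gaps.append("A2: no named accountable owner")
    stale = record.context_changed is not None and (
        record.last_classified is None
        or record.last_classified < record.context_changed
    )
    if stale:
        gaps.append("A3: not re-classified since the deployment context changed")
    return gaps
```

Running `governance_gaps` over every registry entry gives a quick list of systems that would score 0 on these items; it does not judge whether the classification itself is correct.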

---

Dimension Group B — Test Strategy and Coverage (8 items)

This group measures whether your test approach is designed for AI's specific failure modes — distributional shift, label noise, adversarial inputs, emergent behaviour — rather than adapted from traditional software testing alone.

B1. Functional correctness tests, fairness/bias tests, and robustness tests are treated as separate, tracked test types — not bundled into a single pass/fail.
B2. Test coverage explicitly addresses edge cases derived from the model's deployment population, not only the training distribution.
B3. Negative testing (inputs the model should refuse, reject, or flag) is defined and executed before each release.
B4. There is a documented strategy for testing LLM outputs across dimensions of factual accuracy, instruction-following, and safety/harm avoidance.
B5. For agentic AI systems (multi-step, tool-calling), the test plan covers action sequencing errors, tool misuse, and goal misalignment — not just individual output quality.
B6. Regression tests are maintained for previously identified failure modes and run automatically on every model version change.
B7. Data quality checks (completeness, distribution drift, label validity) are part of the test pipeline — not a separate, optional audit.
B8. Synthetic test data is used in at least one domain where live PII data cannot be used without privacy risk, and the synthetic data's statistical fidelity is validated.

Maximum for Group B: 16 points.
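
To illustrate what B7 can look like inside a test pipeline, the sketch below computes a completeness ratio and a population stability index (PSI) between a reference sample and a current sample of one numeric feature. PSI is one commonly used drift statistic, chosen here as an assumption; the checklist does not prescribe a specific measure. The function names, the equal-width binning, and the smoothing constant `eps` are also illustrative choices.

```python
import math


def missing_fraction(values):
    """Share of entries that are None (a basic completeness check)."""
    return sum(v is None for v in values) / len(values)


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two numeric samples, using equal-width bins on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A frequently quoted rule of thumb treats PSI values above roughly 0.2 as a meaningful shift. That is a convention rather than a standard, so the threshold you gate on should be agreed and documented per feature.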

---

Dimension Group C — LLM and Generative AI Assurance (6 items)

If your organisation has no generative AI or LLM in production or pre-production, score this group as N/A and exclude it from your total. If LLMs are present, this group becomes disproportionately important for risk.

C1. Red-teaming (structured adversarial prompting to surface jailbreaks, prompt injection, and harmful output) is conducted by a dedicated function or external party — not informally by developers.
C2. There is a defined taxonomy of LLM failure modes relevant to your domain (e.g., hallucination in medical advice, biased credit commentary, PII leakage in generated documents).
C3. Output evaluation uses a combination of automated metrics (e.g., ROUGE, BERTScore, custom classifiers) and human expert review — neither alone is treated as sufficient.
C4. System prompts, retrieval configurations (for RAG architectures), and tool definitions are version-controlled and change-managed as first-class test artefacts.
C5. There is an explicit test protocol for model updates from the foundation model provider — including behaviour regression when the underlying model is updated without your initiation.
C6. LLM assurance evidence (red-team reports, evaluation logs) is stored in a retrievable format suitable for regulatory inspection.

Maximum for Group C: 12 points.
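
Items C4 and C5 both depend on knowing when the tested configuration has changed. A minimal sketch, assuming you can read the provider's reported model version as a string: hash the system prompt, retrieval configuration, tool definitions, and model version together, and re-run the behaviour regression suite whenever the fingerprint differs from the last one that passed. The function and parameter names are hypothetical.

```python
import hashlib
import json


def config_fingerprint(system_prompt, retrieval_config, tool_definitions, model_version):
    """Stable SHA-256 fingerprint of everything that shapes LLM behaviour."""
    payload = json.dumps(
        {
            "system_prompt": system_prompt,
            "retrieval_config": retrieval_config,
            "tool_definitions": tool_definitions,
            "model_version": model_version,
        },
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_regression_run(current_fingerprint, last_tested_fingerprint):
    """True when the live configuration differs from the last one that was tested."""
    return current_fingerprint != last_tested_fingerprint
```

Storing the fingerprint alongside each evaluation log also supports C6, because every piece of evidence can be tied to the exact configuration it was produced against. A provider can still change behaviour without changing the reported version, so scheduled regression runs remain necessary.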

---

Dimension Group D — Compliance and Audit Readiness (10 items)

This group maps to ISO/IEC 42001 Clause 9 (performance evaluation) and the EU AI Act's technical documentation obligations and, for teams processing personal data, to India's Digital Personal Data Protection (DPDP) Act.

D1. A technical documentation package (model card or equivalent) exists for every high-risk AI system and is kept current with each model version.
D2. Logging is sufficient to reconstruct model inputs, outputs, and decision context for any inference event within the regulatory retention window.
D3. Human oversight mechanisms are tested — not assumed. There is evidence that override and review workflows function correctly under realistic load.
D4. Bias and fairness evaluation results are documented, with explicit statements about which demographic dimensions were tested and which were out of scope.
D5. There is a defined and tested incident response process for AI-specific failures (harmful output, unexpected behaviour, data leakage) distinct from the generic IT incident process.
D6. Post-market monitoring (production performance tracking, user complaint analysis) feeds back into the QE test backlog with a defined cadence.
D7. The organisation can demonstrate, with evidence, that test data used for evaluation was not part of the training corpus (preventing leakage-inflated metrics).
D8. For DPDP-relevant systems: data minimisation and purpose limitation are verified as part of the AI test process, not only during privacy impact assessments.
D9. Change management records link each model version change to a corresponding QE sign-off — creating a traceable audit chain.
D10. At least one AI system has been through a simulated or actual third-party conformity assessment, and findings were acted upon.

Maximum for Group D: 20 points.
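
Evidence for D7 can start with something as simple as an exact-match overlap check between training and evaluation records. The sketch below assumes text examples and hypothetical function names; it normalises case and whitespace, hashes each record, and reports evaluation examples that also appear in training.

```python
import hashlib


def _fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_leaked_examples(training_texts, evaluation_texts):
    """Return evaluation examples whose normalised form also occurs in training."""
    train_hashes = {_fingerprint(t) for t in training_texts}
    return [t for t in evaluation_texts if _fingerprint(t) in train_hashes]
```

This only catches exact duplicates after normalisation. Paraphrases and near-duplicates need a separate similarity-based check, and an empty result here should be recorded as "no exact overlap found", not as proof of no leakage.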

---

Dimension Group E — Tooling, Automation, and Culture (8 items)

E1. There is a dedicated AI QE toolchain (evaluation frameworks, test orchestration, monitoring dashboards) — not a repurposed software testing stack alone.
E2. Model performance metrics and quality gate thresholds are defined in code or configuration — not only in documents.
E3. CI/CD pipelines for AI systems include automated quality gates that can block a deployment without a human manually checking a report.
E4. QE engineers assigned to AI projects have received structured training on ML concepts, fairness metrics, and AI-specific test design — not only traditional software testing.
E5. There is a documented process for retiring or sunsetting AI models, including QE sign-off criteria for decommission.
E6. Synthetic data generation is automated and reproducible — a new synthetic dataset can be generated on demand for a new test scenario without a multi-week effort.
E7. Post-deployment monitoring alerts are triaged by QE (or a joint QE-MLOps function) — not only by platform operations.
E8. A blameless retrospective or equivalent process exists to analyse AI quality failures and feed learnings back into the QE process — not only into the model.

Maximum for Group E: 16 points.
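
Items E2 and E3 come down to thresholds that live in configuration and a gate that returns a failing exit code. The following is a minimal sketch under stated assumptions: the metric names and limits are invented for illustration, and `evaluate_gate` and `gate_exit_code` are hypothetical helpers, not part of any particular CI product.

```python
# Illustrative thresholds; real values belong in version-controlled configuration.
GATE_THRESHOLDS = {
    "accuracy": ("min", 0.90),
    "demographic_parity_gap": ("max", 0.05),
    "negative_test_refusal_rate": ("min", 0.95),
}


def evaluate_gate(metrics, thresholds=GATE_THRESHOLDS):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric fails the gate rather than passing silently.
            failures.append(f"{name}: metric missing")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} is below minimum {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} is above maximum {limit}")
    return failures


def gate_exit_code(metrics):
    """Print failures and return 1 to block deployment, 0 to allow it."""
    failures = evaluate_gate(metrics)
    for failure in failures:
        print("GATE FAILED:", failure)
    return 1 if failures else 0
```

In a pipeline step, the script would end with `sys.exit(gate_exit_code(metrics))`, so that a non-zero exit blocks the deployment without anyone reading a report. Treating a missing metric as a failure is a deliberate design choice: it stops a gate from passing because an evaluation job silently did not run.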

---

Scoring Your Maturity Stage

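Add your group scores. If Group C was N/A, total the remaining four groups (maximum 68 points), convert that total to a percentage of 68, and read your stage from the percentage equivalents of the bands below: up to 30% is Stage 1, up to 50% is Stage 2, up to 70% is Stage 3, up to 85% is Stage 4, and above 85% is Stage 5.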

With Group C included (maximum 80 points):

0–24 points: Stage 1 — Ad Hoc
25–40 points: Stage 2 — Defined
41–56 points: Stage 3 — Managed
57–68 points: Stage 4 — Optimised
69–80 points: Stage 5 — Continuous Assurance
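
The arithmetic above can be sketched in a few lines. This example assumes that, when Group C is N/A, the percentage is read against the same bands expressed as shares of 80 points (24/80 = 30%, 40/80 = 50%, 56/80 = 70%, 68/80 = 85%); a percentage that falls between two bands is assigned to the higher one. The function name and return shape are illustrative.

```python
GROUP_MAX = {"A": 16, "B": 16, "C": 12, "D": 20, "E": 16}

# Upper bound of each band as a percentage of the maximum (derived from points / 80).
STAGE_BANDS = [
    (30.0, 1, "Ad Hoc"),
    (50.0, 2, "Defined"),
    (70.0, 3, "Managed"),
    (85.0, 4, "Optimised"),
    (100.0, 5, "Continuous Assurance"),
]


def maturity_stage(group_scores):
    """Map group scores to a stage. Pass None for Group C if it is N/A."""
    scored = {}
    for group, score in group_scores.items():
        if score is None:
            if group != "C":
                raise ValueError("only Group C may be scored as N/A")
            continue
        if not 0 <= score <= GROUP_MAX[group]:
            raise ValueError(f"Group {group} score must be 0 to {GROUP_MAX[group]}")
        scored[group] = score

    total = sum(scored.values())
    maximum = sum(GROUP_MAX[group] for group in scored)
    percentage = 100 * total / maximum
    for upper, stage, name in STAGE_BANDS:
        if percentage <= upper:
            return {
                "total": total,
                "maximum": maximum,
                "percentage": round(percentage, 1),
                "stage": stage,
                "name": name,
            }
```

For example, `maturity_stage({"A": 10, "B": 8, "C": None, "D": 8, "E": 10})` scores the four remaining groups against a maximum of 68 rather than 80.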

Note on benchmarking: Based on patterns observed across enterprise AI teams in regulated industries, the median score tends to fall in the upper Stage 2 to lower Stage 3 range. Teams that self-report as "mature" frequently cluster at Stage 3 on Groups A and E but remain at Stage 1 on Group C and Group D. This divergence — strong tooling, weak compliance evidence — is the single most common AI QE risk pattern in regulated enterprises.

---

The Three Highest-ROI Moves by Stage

Stage 1 — Ad Hoc
Your highest ROI comes from structure, not sophistication.
Move 1: Implement A1 and A2 immediately. You cannot prioritise risk you have not classified. A simple, two-hour workshop to classify every AI system in production by EU AI Act risk tier, with a named owner, yields disproportionate governance return.
Move 2: Create a minimum viable test plan template (covering B1's three test types) and make it mandatory at project initiation. Even an imperfect standard is transformative at this stage.
Move 3: Begin logging model inputs and outputs in production (D2). Without this, every future compliance effort is built on an evidence gap you cannot retrospectively fill.

Stage 2 — Defined
Your highest ROI comes from closing the gap between documented and enforced.
Move 1: Convert your test thresholds from documents into automated gate configurations (E2, E3). The single most common Stage 2 failure is a team that knows the right thresholds but bypasses them under release pressure.
Move 2: Execute one structured red-team exercise (C1) on your highest-risk LLM or AI system. This surfaces more real failure modes per hour than any other QE activity at this stage.
Move 3: Establish post-market monitoring with a QE feedback loop (D6, E7). Closing the training-production gap is the fastest way to improve your real-world model quality.

Stage 3 — Managed
Your highest ROI comes from evidence quality and compliance alignment.
Move 1: Build a complete technical documentation package for each high-risk system (D1, D9). At Stage 3, most teams have the underlying test data; they simply have not assembled it into a form that would satisfy an audit.
Move 2: Implement synthetic test data at scale for privacy-sensitive domains (B8, E6). This directly removes the constraint that prevents deeper, more adversarial testing of production-representative data.
Move 3: Run a simulated conformity assessment against EU AI Act or ISO/IEC 42001 requirements (D10). Third-party eyes at this stage reliably surface the specific compliance gaps that internal teams normalise over time.

Stage 4 — Optimised
Your highest ROI comes from agentic and emergent-behaviour coverage.
Move 1: Extend your test strategy explicitly to agentic system failure modes (B5). As enterprises deploy multi-step AI agents, traditional output-level testing creates a significant assurance blind spot.
Move 2: Introduce human oversight testing under realistic production load (D3). Oversight that works in a demo environment but fails under peak throughput is not functioning oversight under the EU AI Act's meaning.
Move 3: Automate the evidence chain from test execution to compliance artefact (D9, C6). The goal at this stage is that a regulatory inspection produces structured, machine-readable evidence without a manual assembly effort.

Stage 5 — Continuous Assurance
Your highest ROI is in cross-system and supply-chain assurance.
Move 1: Extend classification and assurance coverage to all third-party and foundation model components (A6, C5). Most Stage 5 gaps reside in vendor dependencies, not internal systems.
Move 2: Publish internal AI quality benchmarks and use them to drive supplier and procurement quality standards.
Move 3: Contribute to and align with emerging sector-specific AI assurance standards; your maturity positions you to shape norms rather than react to them.

---

Building Your Roadmap

Once you have your score and your three priority moves, convert them to a time-bound plan using this structure:

Immediate (0–30 days): Items scoring 0 in Groups A and D that apply to a system already in production.
Short-term (30–90 days): Automation and tooling changes from Group E that enforce existing documented standards.
Medium-term (90–180 days): Evidence assembly, red-teaming, and synthetic data investments from Groups B, C, and D.
Ongoing: Post-deployment monitoring, feedback loop operation, and regulatory alignment reviews.

Review your maturity score every six months, or immediately after a significant model change, a regulatory update, or an AI quality incident.

---

A Note on Why Assurance Is the Work

AI quality engineering is sometimes treated as a compliance cost — a function that produces reports for auditors. The teams that advance fastest treat it differently: as the mechanism that makes it possible to deploy AI at scale without accumulating hidden risk. Every gap in this checklist represents a class of failure that has happened in a production system somewhere in a regulated industry. The purpose of scoring yourself honestly against these 40 dimensions is not to achieve a number — it is to see clearly where your organisation is exposed and to act on that clarity with the precision that serious AI deployment demands.
