Classical QA Cannot Validate What an AI Agent Will Do Next
Traditional test automation validates deterministic outputs; agentic AI systems produce probabilistic, multi-step actions that demand an entirely different assurance approach — especially under EU AI Act Article 9.

Key takeaways
- Classical test automation assumes deterministic input-output mappings; agentic AI systems operate probabilistically across multi-step decision chains, so conventional test coverage metrics say little about what the agent actually did.
- Agentic QE is not more automation — it is a discipline built around three distinct capabilities: behavioral simulation, emergent-path coverage, and continuous runtime audit.
- EU AI Act Article 9 requires a risk-management system that runs continuously across the lifecycle of a high-risk AI system; organizations that test only outputs, not intermediate agent decisions, leave a structural gap in that system.
- NIST AI RMF Measure 2.5 asks that a deployed AI system be demonstrated to be valid and reliable, with limits on generalizability beyond its development conditions documented; static test scripts, written before deployment, cannot capture context-dependent behavior that only appears in production.
- Readiness for the shift to agentic QE depends on governance maturity, not just tooling; organizations need a defined risk owner, a living test ontology, and an audit-grade evidence store before they scale.
Why Classical QA Breaks When AI Agents Act Autonomously
Every quality engineering practice in regulated financial services was built on a shared assumption: given the same input, the system will produce the same output. That assumption underlies test case design, regression suites, pass/fail thresholds, and the audit trails that satisfy model-risk reviewers. It is a reasonable assumption for a rules engine, a batch settlement process, or a deterministic underwriting scorecard. It is the wrong assumption for an AI agent.
An AI agent does not execute a fixed instruction set. It reasons across a sequence of steps, selects tools, calls external APIs, forms intermediate conclusions, and then acts — sometimes all within a single user interaction. The path it takes between receiving a prompt and producing an outcome is not predetermined. It varies with context, with the state of connected systems, with the phrasing of prior turns, and occasionally with factors that neither the developer nor the tester anticipated. This is not a bug in the agent design. It is the architectural point of AI agents. And it is precisely why a test suite that validates outputs at the boundary of the system tells you almost nothing about what the agent actually did.
For a CTO or VP Engineering in banking, financial services, or insurance (BFSI), this creates an immediate accountability problem. Your delivery team may be reporting green builds and passing acceptance tests. Your model-risk function may be satisfied with output-layer sampling. But if the agent is making consequential intermediate decisions, such as retrieving customer records, interpreting policy clauses, or escalating or suppressing a claim flag, none of those decisions is being tested in any meaningful sense. You have coverage on the last sentence the agent wrote. You have no coverage on the reasoning chain that produced it.
The failure mode is not dramatic. It is quiet. A claims automation agent routes a borderline case incorrectly not because it returned a wrong word, but because it misread the sequence of prior context turns and weighted a policy exclusion incorrectly. Your output test says pass. Your regulator, reviewing the decision record under EU AI Act Article 9, asks for evidence of how intermediate risk controls were exercised. You have none.
What Agentic QE Actually Means — Three Defining Capabilities
Agentic quality engineering (agentic QE), an approach to testing AI agents in regulated industries, is not test automation with a larger prompt. It is a rebuilt discipline organized around three capabilities that classical QA structurally cannot provide.
The first is behavioral simulation across multi-turn, multi-tool scenarios. Rather than testing the agent against a fixed input, behavioral simulation constructs realistic interaction trajectories — sequences of turns, tool invocations, and state changes that approximate production conditions. This is closer to scenario-based penetration modeling than to conventional test scripting. The goal is not to enumerate every possible path, which is computationally intractable, but to exercise the decision boundaries that carry the highest risk consequence: edge cases in eligibility logic, ambiguous instructions where the agent must interpret intent, and adversarial inputs designed to reveal misalignment between agent behavior and policy.
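A behavioral simulation scenario of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: `Scenario`, `stub_agent`, `run_scenario`, and the action labels are hypothetical names, and the stub stands in for a real agent that would reason over the conversation history.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Scenario:
    """A multi-turn trajectory plus the actions policy forbids within it."""
    name: str
    turns: tuple[str, ...]
    forbidden_actions: frozenset[str]

# Hypothetical agent interface: conversation history in, action label out.
Agent = Callable[[list[str]], str]

def stub_agent(history: list[str]) -> str:
    # Deliberately naive stand-in: reacts to a keyword in the latest turn only.
    if "flood" in history[-1].lower():
        return "escalate_to_adjuster"
    return "continue_intake"

def run_scenario(agent: Agent, scenario: Scenario) -> list[str]:
    """Replay the turns in order and collect every forbidden action taken."""
    history: list[str] = []
    violations: list[str] = []
    for turn in scenario.turns:
        history.append(turn)
        action = agent(history)
        if action in scenario.forbidden_actions:
            violations.append(f"turn {len(history)}: {action}")
    return violations

# Ambiguous claimant input: the second turn retracts the flood claim.
ambiguous = Scenario(
    name="retracted_flood_claim",
    turns=("My basement has water damage.",
           "Actually it was a burst pipe, not a flood."),
    forbidden_actions=frozenset({"escalate_to_adjuster"}),
)
violations = run_scenario(stub_agent, ambiguous)
```

Here the stub escalates on the keyword "flood" even though the claimant is retracting it, so the scenario records a violation at turn 2. That is the kind of decision-boundary failure an output-only test would not surface.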
The second capability is emergent-path coverage. Because an agent's execution graph is not fixed, your quality framework must be able to observe and classify paths that were not in any test plan. This requires runtime instrumentation — the ability to trace which tools were called, in what order, with what parameters, and how each intermediate output shaped the next step. Without this layer, you are not testing an agentic system. You are testing a black box and hoping the outside looks right.
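As a rough sketch of that instrumentation layer, the wrapper below records each tool invocation with its order, parameters, and output, then compares the executed path against the paths a test plan already covers. `PathTracer`, the two tools, and `KNOWN_PATHS` are hypothetical names chosen for illustration, not part of any specific agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolCall:
    step: int
    tool: str
    params: dict[str, Any]
    output: Any

@dataclass
class PathTracer:
    """Records tool invocations in the order they happen."""
    calls: list[ToolCall] = field(default_factory=list)

    def wrap(self, name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        def traced(**params: Any) -> Any:
            output = fn(**params)
            self.calls.append(ToolCall(len(self.calls) + 1, name, params, output))
            return output
        return traced

    def path(self) -> tuple[str, ...]:
        return tuple(call.tool for call in self.calls)

# Hypothetical tools an intake agent might call.
def lookup_policy(policy_id: str) -> dict[str, Any]:
    return {"policy_id": policy_id, "exclusions": ["flood"]}

def check_exclusion(cause: str, exclusions: list[str]) -> bool:
    return cause in exclusions

tracer = PathTracer()
traced_lookup = tracer.wrap("lookup_policy", lookup_policy)
traced_check = tracer.wrap("check_exclusion", check_exclusion)

policy = traced_lookup(policy_id="P-100")
excluded = traced_check(cause="flood", exclusions=policy["exclusions"])

# Paths already covered by the test plan (assumed for this example).
KNOWN_PATHS = {("lookup_policy", "check_exclusion")}
is_emergent = tracer.path() not in KNOWN_PATHS
```

A path that is missing from `KNOWN_PATHS` is not necessarily wrong; it is simply one that no test plan anticipated and that should be classified and reviewed.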
The third capability is continuous runtime audit with evidence generation. In a regulated environment, a test result is not evidence. An audit-grade record of what the agent decided, why, and under what conditions is evidence. Agentic QE must produce structured decision logs that are queryable, tamper-evident, and mappable to the risk controls your governance team has registered. This is not a reporting feature bolted onto a test tool. It is a design requirement for the quality architecture itself.
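One common way to make a decision log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below assumes that technique; the article does not prescribe a mechanism, and the field names and `RC-…` control identifiers are hypothetical.

```python
import hashlib
import json
from typing import Any

GENESIS = "0" * 64  # placeholder hash preceding the first entry

def _digest(prev_hash: str, record: dict[str, Any]) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append_entry(log: list[dict[str, Any]], record: dict[str, Any]) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev_hash": prev_hash,
                "hash": _digest(prev_hash, record)})

def verify_chain(log: list[dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

decision_log: list[dict[str, Any]] = []
append_entry(decision_log, {"step": 1, "action": "lookup_policy", "control_id": "RC-01"})
append_entry(decision_log, {"step": 2, "action": "flag_claim", "control_id": "RC-02"})
intact = verify_chain(decision_log)

# Simulate after-the-fact tampering with a stored decision.
decision_log[0]["record"]["action"] = "suppress_flag"
tamper_detected = not verify_chain(decision_log)
```

A chain like this only reveals tampering if the latest hash is also anchored somewhere the log writer cannot rewrite, such as a separate write-once store; that anchoring is outside the scope of this sketch.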
The Governance Layer: Mapping to EU AI Act Article 9 and NIST AI RMF Measure 2.5
EU AI Act Article 9 requires that a risk-management system be established, implemented, documented, and maintained for every high-risk AI system, and that it run as a continuous, iterative process across the system's entire lifecycle, not just at deployment. The obligation includes identifying and analyzing known and reasonably foreseeable risks, adopting risk-management measures, and keeping documentation sufficient to demonstrate compliance. Critically, Article 9 does not limit risk management to final outputs. It covers the behavior of the system across its operational lifecycle, which for an AI agent includes every intermediate decision in every reasoning chain.
Organizations that audit only the output layer therefore fall short of Article 9 by design. Their compliance surface is narrower than their risk surface. The agent is making decisions that are not under observation, and those unobserved decisions are precisely what a supervisory authority will ask about when something goes wrong.
NIST AI RMF Measure 2.5 adds a complementary dimension. It asks that a deployed AI system be demonstrated to be valid and reliable, and that limits on its generalizability beyond the conditions under which it was developed be documented. For an AI agent, those limits show up as context-dependent and emergent behavior: behavior that was not present or visible during pre-deployment testing but that appears under production conditions. A suite of static test scripts cannot demonstrate validity for that behavior because it was, by definition, written before the behavior existed. What is needed is a measurement architecture that can detect and characterize novel behavioral patterns in production, feed them back into the risk register, and trigger re-evaluation when patterns exceed defined thresholds. This is agentic QE in its governance mode: not a gate before deployment, but a continuous assurance loop after it.
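The threshold-triggered part of that loop can be illustrated with a simple metric: the share of production runs whose tool-call path was never seen before deployment. This is only a sketch; the metric, the 5% threshold, and the names `novel_path_rate` and `needs_reevaluation` are assumptions made for the example, not values taken from NIST or the EU AI Act.

```python
Path = tuple[str, ...]

def novel_path_rate(observed: list[Path], known: set[Path]) -> float:
    """Fraction of observed runs whose path is absent from the known set."""
    if not observed:
        return 0.0
    novel = sum(1 for path in observed if path not in known)
    return novel / len(observed)

def needs_reevaluation(observed: list[Path], known: set[Path],
                       threshold: float) -> bool:
    """True when novel behavior exceeds the threshold set by the risk owner."""
    return novel_path_rate(observed, known) > threshold

# Paths exercised before deployment (assumed for this example).
KNOWN: set[Path] = {("lookup_policy", "check_exclusion")}

# One monitoring window of production runs; the last path is new.
window: list[Path] = [
    ("lookup_policy", "check_exclusion"),
    ("lookup_policy", "check_exclusion"),
    ("lookup_policy", "check_exclusion"),
    ("check_exclusion", "lookup_policy", "check_exclusion"),
]

rate = novel_path_rate(window, KNOWN)
trigger = needs_reevaluation(window, KNOWN, threshold=0.05)
```

With one novel path in four runs the rate is 0.25, which exceeds the illustrative threshold, so the window would be flagged for re-evaluation and the new path added to the risk register for classification.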
A Before/After Walkthrough — Claims Automation at a Mid-Size Insurer
Consider a mid-size property and casualty insurer that has deployed an AI agent to handle first-notice-of-loss intake and initial coverage assessment. Before adopting an agentic QE approach, the quality process looked like this: a team wrote acceptance test cases against the API boundary, validated that the agent returned structured JSON within schema, sampled a percentage of real outputs monthly for human review, and logged pass rates. The model-risk team signed off on output accuracy metrics from a held-out evaluation set. Governance artifacts consisted of a model card and a performance summary.
After adopting an agentic AI testing framework aligned with Article 9 obligations, the picture changes substantially. Behavioral simulation scenarios are built from real claims trajectories, including adversarial variants where claimants provide ambiguous or conflicting information. The agent's tool-call sequence — which external data sources it consulted, in what order, how it weighted conflicting signals — is traced and stored in a structured audit log for every test run. Emergent paths that appear in production but were not in any simulation scenario trigger automatic review cycles. The risk-management system registers not just output accuracy but decision-path consistency: whether the agent applied the same reasoning pattern to materially similar cases.
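Decision-path consistency can be approximated by grouping materially similar cases and measuring how often runs in each group follow that group's most common tool-call path. The sketch below assumes cases have already been assigned a group label; how "materially similar" is defined is a policy decision, and the labels and function name here are hypothetical.

```python
from collections import Counter, defaultdict

Path = tuple[str, ...]

def path_consistency(runs: list[tuple[str, Path]]) -> dict[str, float]:
    """For each case group, the share of runs that took the modal path."""
    by_group: dict[str, list[Path]] = defaultdict(list)
    for group, path in runs:
        by_group[group].append(path)

    scores: dict[str, float] = {}
    for group, paths in by_group.items():
        _, top_count = Counter(paths).most_common(1)[0]
        scores[group] = top_count / len(paths)
    return scores

# Hypothetical traced runs: (case group, tool-call path).
runs: list[tuple[str, Path]] = [
    ("water_damage_minor", ("lookup_policy", "check_exclusion", "assess")),
    ("water_damage_minor", ("lookup_policy", "check_exclusion", "assess")),
    ("water_damage_minor", ("lookup_policy", "assess")),  # skipped exclusion check
    ("theft_minor", ("lookup_policy", "verify_report", "assess")),
    ("theft_minor", ("lookup_policy", "verify_report", "assess")),
]

scores = path_consistency(runs)
```

In this example the theft group is fully consistent, while one in three water-damage runs skipped the exclusion check. A low score does not prove an error, but it points reviewers at the runs worth reading first.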
The governance artifacts produced are no longer limited to a model card. They are a living decision audit trail, a behavioral deviation register, and a risk-control mapping document that links each intermediate agent action to a named control in the Article 9 risk-management system. When regulators ask how coverage decisions were made, the answer is not a statistical summary. It is a traceable record.
How to Know If Your Organization Is Ready to Make the Shift
Readiness for the transition to agentic QE is a governance question before it is a tooling question. Three indicators separate organizations that can make the shift successfully from those that will acquire tooling they cannot operationalize.
The first indicator is whether you have a named risk owner for agent behavior — not for model performance in the aggregate, but for the behavioral envelope of each deployed agent specifically. If that accountability sits ambiguously between engineering, model risk, and compliance, the assurance function will have no authority to mandate remediation when a behavioral deviation is detected.
The second indicator is whether your team can define, in writing, what behavioral boundaries the agent must not cross — not in terms of output format or accuracy thresholds, but in terms of decision logic. If you cannot articulate the policy, you cannot write the simulation scenarios. If you cannot write the simulation scenarios, you are not doing agentic QE — you are doing sampling.
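Once boundaries are written down as decision logic, they can be expressed as checks over an agent's trace. The two rules below are invented for illustration only; the boundary identifiers, action names, and trace shape are assumptions, and a real policy would come from the risk owner.

```python
from typing import Callable

Trace = list[dict]                   # each step is a dict with an "action" key
Boundary = Callable[[Trace], bool]   # returns True when the boundary is respected

def decision_requires_prior_policy_lookup(trace: Trace) -> bool:
    """A coverage decision must never precede the policy lookup."""
    seen_lookup = False
    for step in trace:
        if step["action"] == "lookup_policy":
            seen_lookup = True
        elif step["action"] == "decide_coverage" and not seen_lookup:
            return False
    return True

def never_suppress_fraud_flag(trace: Trace) -> bool:
    """The agent may raise a fraud flag but must not clear one."""
    return all(step["action"] != "suppress_fraud_flag" for step in trace)

# Hypothetical boundary register keyed by identifier.
BOUNDARIES: dict[str, Boundary] = {
    "B-01": decision_requires_prior_policy_lookup,
    "B-02": never_suppress_fraud_flag,
}

def violated_boundaries(trace: Trace) -> list[str]:
    return sorted(bid for bid, check in BOUNDARIES.items() if not check(trace))

bad_trace: Trace = [{"action": "decide_coverage"}, {"action": "lookup_policy"}]
result = violated_boundaries(bad_trace)
```

The same register can serve both sides of the process: simulation scenarios are written to probe each boundary, and production traces are checked against it.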
The third indicator is whether your organization has an audit-grade evidence store — a system that can receive structured decision logs from an agent at production scale, retain them for the periods your regulatory obligations require, and make them queryable against specific risk-control identifiers. Many organizations have monitoring dashboards. Very few have evidence stores. They are not the same thing.
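The difference between a dashboard and an evidence store shows up in the query: an auditor should be able to ask for every decision exercised under a specific risk control. The sketch below uses an in-memory SQLite database purely for illustration; the table layout, column names, and `RC-…` identifiers are hypothetical, and a production store would also need retention and integrity guarantees that are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decision_log (
        id          INTEGER PRIMARY KEY,
        agent_id    TEXT    NOT NULL,
        run_id      TEXT    NOT NULL,
        step        INTEGER NOT NULL,
        action      TEXT    NOT NULL,
        control_id  TEXT    NOT NULL,
        recorded_at TEXT    NOT NULL
    )
    """
)
# Index the column auditors filter on.
conn.execute("CREATE INDEX idx_decision_control ON decision_log (control_id)")

rows = [
    ("fnol-agent", "run-1", 1, "lookup_policy", "RC-01", "2025-01-10T09:00:00Z"),
    ("fnol-agent", "run-1", 2, "flag_claim",    "RC-02", "2025-01-10T09:00:02Z"),
    ("fnol-agent", "run-2", 1, "lookup_policy", "RC-01", "2025-01-10T09:05:00Z"),
    ("fnol-agent", "run-2", 2, "suppress_flag", "RC-02", "2025-01-10T09:05:03Z"),
]
conn.executemany(
    "INSERT INTO decision_log "
    "(agent_id, run_id, step, action, control_id, recorded_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

def decisions_for_control(db: sqlite3.Connection, control_id: str) -> list[tuple]:
    """Return (run_id, step, action) for every decision tied to one control."""
    cursor = db.execute(
        "SELECT run_id, step, action FROM decision_log "
        "WHERE control_id = ? ORDER BY run_id, step",
        (control_id,),
    )
    return cursor.fetchall()

rc02_decisions = decisions_for_control(conn, "RC-02")
```

Querying by `control_id` returns the individual decisions, not an aggregate, which is what distinguishes evidence from a monitoring metric.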
The discipline of agentic QE exists precisely because regulated enterprises cannot afford the gap between what classical testing certifies and what AI agents actually do in production. Closing that gap is not optional under the regulatory frameworks that now apply to high-risk AI in financial services — it is the minimum condition for operating with a defensible governance posture.
“If your test suite only checks what the agent returned, you have not tested the agent — you have tested its last sentence.”