Qapitol
← All insights
Agentic QEJune 21, 2026·8 min read

Five QE Assumptions Your AI Agent Just Broke — and What Replaces Them

Knowing how to test agentic AI systems in regulated industries means accepting that five foundational QE assumptions no longer hold — and building a new assurance stack from scratch.

📥 Featured researchThe Agentic QE Maturity Model
Get the report →

Key takeaways

  • Agentic AI systems invalidate at least five classical QE assumptions: deterministic outputs, stateless execution, single-model scope, human-in-the-loop checkpoints, and bounded failure domains.
  • Non-determinism, tool-chaining, and memory persistence are not edge cases to be handled by existing test frameworks — they are structural properties that demand purpose-built assurance layers.
  • NIST's 2025 agentic AI guidance and the OWASP LLM Top 10 together provide a compliance-compatible vocabulary for characterising agentic risk, but neither gives you a test plan.
  • In regulated industries, the failure modes that matter most are not functional bugs but compounding behavioral drifts that cross compliance boundaries across multi-step agent chains.
  • A maturity checklist built around observable evidence — not test pass rates — is the most defensible starting point for QE teams inheriting agentic AI ownership with no established playbook.

The Handoff Nobody Prepared You For

You have been given ownership of AI agent testing. There is no established playbook. Your CTO is fielding questions from risk and compliance about autonomous system deployments, and the pressure is landing on your desk. If that description fits, this article is written for you — specifically for QE leads and AI engineering managers in BFSI and insurance GCCs who are 8 to 15 years into their careers and are now being asked to assure systems that behave in ways classical quality engineering was never designed to handle.

The core problem is not a skills gap. It is an assumptions gap. The frameworks, tooling, and mental models that have served you well across functional testing, regression coverage, and UAT were built for a different class of system. Agentic AI — systems that plan, reason across steps, invoke external tools, maintain memory across sessions, and act with varying degrees of autonomy — breaks those frameworks at the foundation, not at the edges.

What Classical QE Assumed — and What Agentic AI Changed

NIST's 2025 guidance on agentic AI and the OWASP LLM Top 10 together make a hard truth explicit: agentic systems invalidate at least five foundational QE assumptions. Understanding each one is not academic. It determines what you test, what you instrument, and what you can defend to an auditor.

The first assumption is deterministic outputs. Classical testing is built on the premise that a given input produces a predictable output. Agentic systems, by design, do not work this way. The same prompt under the same conditions can produce structurally different responses across runs. Coverage defined as input-output pairs becomes meaningless. What replaces it is behavioral envelope testing — characterising the space of acceptable outputs and asserting that the system stays within it, rather than asserting it produces a specific result.

The second assumption is stateless execution. Traditional test cases assume each execution starts clean. Agentic systems carry memory — short-term context windows, long-term memory stores, session histories — that actively shape subsequent behavior. A test that passes in isolation may fail after three prior interactions have primed the agent's context. Test isolation, as you have practiced it, no longer holds.

The third assumption is single-model scope. Most QE teams were scoped to a single model or a single service boundary. Agentic architectures chain models, tools, APIs, and retrieval systems together. A failure in one component propagates through the chain in ways that no individual component test will surface. The unit of assurance must shift from the model to the agent system as a whole, and eventually to multi-agent compositions.

The fourth assumption is human-in-the-loop checkpoints. Classical regulated-industry QA relies on human review gates at defined points. Agentic systems are specifically designed to reduce or eliminate those gates. When the agent decides which tool to invoke, which data to retrieve, and which action to take, the human checkpoint that your compliance team assumed was present may no longer exist in any meaningful sense.

The fifth assumption is bounded failure domains. In deterministic systems, the blast radius of a failure is generally knowable in advance. In agentic systems, failures compound. A hallucinated intermediate output becomes an input to a downstream tool call, which updates a record, which triggers a downstream process. By the time a compliance-relevant error surfaces, it may have propagated across multiple systems and sessions. Failure isolation is a design requirement, not a test strategy.

Three Assurance Layers You Have to Build

Given that the classical stack does not transfer, the practical question is what to build instead. Three distinct assurance layers address the structural properties that make agentic systems different: non-determinism, tool-chaining, and memory persistence.

The non-determinism layer is not about eliminating variance. It is about characterising and bounding it. This means defining behavioral contracts — what properties must hold across all valid outputs regardless of specific content — and running high-volume, varied prompt executions to probe the distribution of outputs rather than a single path. For regulated industries, this layer must also assert compliance-relevant properties: the agent must never produce a output that violates a disclosure requirement, a fair lending constraint, or a data minimisation obligation, regardless of which output path it takes. OWASP LLM Top 10 entries covering prompt injection, insecure output handling, and model denial of service all live at this layer.

The tool-chaining layer asserts correctness and safety across multi-step sequences. Each tool invocation — a database read, an API call, a file write, a downstream model call — must be tested not just for its own output but for its effect on the subsequent chain. This is where NIST's agentic AI guidance is most practically useful: it introduces concepts of task decomposition verification, tool call validation, and sandboxed execution monitoring that map directly to what a QE team needs to instrument. Concretely, this layer requires execution trace logging at the step level, not just at the session level, so that any intermediate failure can be localised and reproduced.

The memory persistence layer is the one most QE teams discover late and regret. Agent memory — whether it is in-context, in a vector store, or in an external database — creates test state that persists across sessions and users. In a regulated industry, this raises immediate data governance questions: is personal data persisting in a memory store beyond its intended retention window? Is the agent's prior interaction with one customer contaminating its behavior toward another? Testing this layer means building explicit memory state probes, asserting that memory is scoped correctly, and verifying that purge and isolation mechanisms actually work under load.

The Classical vs. Agentic QE Comparison

To make the shift concrete, it is worth placing the two paradigms side by side across the dimensions that matter most in practice.

📊 Related research

The Agentic QE Maturity Model

A five-level framework governing AI quality engineering from ad-hoc testing to production-grade governance—defining the technical controls, organizational structures, and staged investments regulated enterprises need to deploy autonomous agents safely.

Get the report →

On test oracle definition: classical QE uses expected outputs as the oracle. Agentic QE uses behavioral contracts and property assertions as the oracle, because expected outputs do not exist in a non-deterministic system.

On execution scope: classical QE tests components in isolation or in defined integration slices. Agentic QE must test the full agent execution graph, including all tool invocations and memory accesses, as the minimum meaningful scope.

On reproducibility: classical QE assumes a failing test can be reproduced deterministically. Agentic QE must treat non-reproducible failures as structurally valid failure signals, not as noise, and instrument for probabilistic failure rates rather than binary pass/fail.

On compliance assertion: classical QE defers compliance checks to a UAT gate or a separate compliance review. Agentic QE must embed compliance assertions into every layer of the test execution because there is no single gate through which all agent behavior passes.

On failure attribution: classical QE can attribute a failure to a specific line, service, or configuration. Agentic QE failure attribution requires trace analysis across the full execution chain, because the observable failure and the causal failure are often in different components.

A Maturity Checklist for QE Teams Starting From Zero

If you are inheriting agentic AI ownership with no established playbook, the following checklist gives you a defensible starting position. It is organised around observable evidence rather than test pass rates, because a future auditor will ask for evidence, not metrics.

First, establish behavioral baseline documentation. Before you run a single automated test, document what the agent is supposed to do, what tools it is authorised to invoke, what data it is authorised to access, and what decisions it is explicitly prohibited from making autonomously. Without this, you cannot define a test oracle.

Second, instrument execution traces at the step level. Ensure that every tool call, every memory read and write, and every model invocation is logged with enough fidelity to reconstruct any execution path post-hoc. This is not optional in regulated industries — it is the audit trail.

Third, build property-based tests for compliance-critical constraints. Identify the constraints that must hold regardless of output path — data minimisation, disclosure requirements, access controls — and encode them as executable assertions that run across every test execution.

Fourth, run adversarial scenario suites against the tool-chaining layer. Specifically test for prompt injection at tool boundaries, for unauthorised data access via retrieval mechanisms, and for state pollution across sessions. These are the failure modes that OWASP LLM Top 10 and NIST's agentic guidance both flag as highest-priority.

Fifth, establish a memory isolation verification procedure. Prove, with evidence, that no user's memory state bleeds into another user's session, and that memory retention does not exceed defined governance windows.

Sixth, define an escalation threshold for non-reproducible failures. Agree with your risk and compliance stakeholders on the failure rate at which a non-deterministic failure triggers a formal review, rather than waiting for a threshold that nobody defined.

Why This Cannot Wait for the Next Planning Cycle

Regulatory frameworks including the EU AI Act, RBI's model risk guidance, and emerging ISO 42001 implementation requirements are converging on a common expectation: that high-risk and autonomous AI systems are continuously assured, not periodically tested. The assurance gap for agentic systems is not a future problem. In most regulated enterprises deploying AI agents today, it is a current and unacknowledged exposure.

The question is not whether your agent passed the test suite. The question is whether your test suite was ever designed for a system that reasons, remembers, and acts across tool boundaries. Answering that question honestly — and building the assurance stack that follows from the answer — is the most important technical risk decision a QE leader in a regulated enterprise will make in the next twelve months.

The question is not whether your agent passed the test suite. The question is whether your test suite was ever designed for a system that reasons, remembers, and acts across tool boundaries.

Go deeper — gated research

The Agentic QE Maturity Model

A five-level framework governing AI quality engineering from ad-hoc testing to production-grade governance—defining the technical controls, organizational structures, and staged investments regulated enterprises need to deploy autonomous agents safely.

By Qapitol· AI assurance & governance

Related insights

Enjoyed this? There’s more every two weeks.

Join 3,000+ readers of The Control Layer Brief.