AI Evaluation · June 23, 2026 · 7 min read

Your AI Outperforms Humans on Accuracy. SR 11-7 Still Won't Let You Deploy It.

Benchmarking AI against a human baseline means more than measuring accuracy — SR 11-7 and EU AI Act Article 9 require documented equivalence across error distribution, consistency, and edge-case handling.

Key takeaways

Aggregate accuracy is a necessary but insufficient metric for regulatory sign-off — regulators require error distribution, decision consistency, and edge-case handling to be mapped and explained separately.
SR 11-7 and EU AI Act Article 9 share a common logic: the model must demonstrably match or exceed the human process it replaces across multiple performance dimensions, not just headline accuracy.
A human baseline is not simply a benchmark number — it is a documented process artifact showing how human reviewers performed on the same case population used to evaluate the AI.
False-positive rate parity matters as much as accuracy in high-stakes decisions: an AI that is more accurate overall but concentrates errors on a specific demographic or product segment will fail model risk review.
Documentation produced during benchmarking — including limitation logs, edge-case inventories, and reviewer disagreement rates — is what regulators actually examine during audit, not the model's internal evaluation metrics.

The problem that aggregate accuracy creates

When a model validation team signs off on an AI system for production deployment, the question it is answering for regulators is not whether the model is good in the abstract. It is whether the model is at least as reliable as the human process it is replacing, and whether that equivalence can be demonstrated, documented, and defended under examination. Benchmarking the model against a human baseline before deployment is therefore not a technical nicety; it is the evidentiary core of the sign-off. The complication is that most teams start and finish with accuracy, and accuracy alone is not sufficient evidence.

SR 11-7, the Federal Reserve and OCC's model risk management guidance, expects model performance to be validated against the specific use case and population the model will serve, with limitations documented and compensating controls identified. EU AI Act Article 9 requires a risk management system for high-risk AI systems that operates throughout the system lifecycle. The article does not name human baselines explicitly, but where a system replaces human activity, prior human performance is the natural reference point for evaluating its risks. Neither framework accepts a single summary metric as evidence of equivalence. Both expect you to show your work: across multiple performance dimensions, on a case population that is representative of live conditions, with documented treatment of cases the model handles differently from how a human reviewer would.

What the benchmarking process actually involves

Step one is establishing the human baseline as a formal artifact, not an assumption. This means running a structured retrospective or shadow exercise in which human reviewers process a defined case population under conditions that can be recorded — decision outcome, time on case, escalation rate, disagreement rate where more than one reviewer sees the same case. This population must be large enough to cover the tail of the distribution your AI will encounter in production. If the human process you are replacing handled edge cases in a particular way, that handling needs to appear in the baseline dataset. A baseline reconstructed from aggregate historical statistics is not acceptable for regulatory purposes because it cannot be interrogated at the case level.
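As a rough illustration of what "interrogable at the case level" can mean, the sketch below stores one record per reviewer per case and derives the escalation and disagreement rates from those records. The `ReviewRecord` fields, the `baseline_summary` helper, and the sample data are all hypothetical; a real baseline would carry whatever fields the replaced process actually recorded.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRecord:
    """One reviewer's handling of one case (field names are illustrative)."""
    case_id: str
    reviewer_id: str
    decision: str            # e.g. "approve" or "decline"
    minutes_on_case: float
    escalated: bool


def baseline_summary(records):
    """Summarise a case-level baseline from individual review records."""
    by_case = defaultdict(list)
    for record in records:
        by_case[record.case_id].append(record)

    # Disagreement is only defined where more than one reviewer saw the case.
    multi = [rs for rs in by_case.values() if len(rs) > 1]
    disagreed = [rs for rs in multi if len({r.decision for r in rs}) > 1]

    return {
        "cases": len(by_case),
        # Escalation rate is computed per review record, not per case.
        "escalation_rate": sum(r.escalated for r in records) / len(records),
        "multi_reviewed_cases": len(multi),
        "disagreement_rate": len(disagreed) / len(multi) if multi else None,
    }


# Made-up records, for illustration only.
records = [
    ReviewRecord("c1", "r1", "approve", 4.0, False),
    ReviewRecord("c1", "r2", "decline", 6.5, True),
    ReviewRecord("c2", "r1", "approve", 3.0, False),
    ReviewRecord("c2", "r3", "approve", 2.5, False),
    ReviewRecord("c3", "r2", "decline", 5.0, False),
]
summary = baseline_summary(records)
```

Because every figure in `summary` is derived from individual records, an examiner can trace any rate back to the cases behind it, which a baseline built from aggregate historical statistics cannot offer.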

Step two is defining the metric set before the AI is evaluated, not after. The choice of metrics must be anchored to what the human process was actually trying to achieve. For a credit decisioning model replacing a manual underwriter, accuracy across the full population matters, but so do the false-positive rate on declined applications, the false-negative rate on defaults, and the consistency of decisions across equivalent case profiles submitted at different times or by different business units. Deciding which metrics matter after you have seen the AI's scores creates selection bias that a competent model risk officer or external auditor will identify immediately.
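One lightweight way to evidence that the metric set was fixed before evaluation is to record a fingerprint of the framework at registration time and check it again when results are reported. The sketch below is an assumption about how a team might do this, not a regulatory requirement; the framework contents and the `fingerprint` helper are hypothetical.

```python
import hashlib
import json

# Hypothetical metric framework for a credit decisioning use case.
metric_framework = {
    "use_case": "credit decisioning (hypothetical)",
    "metrics": [
        {"name": "accuracy", "direction": "higher_is_better"},
        {"name": "false_positive_rate", "direction": "lower_is_better"},
        {"name": "false_negative_rate", "direction": "lower_is_better"},
        {"name": "decision_consistency", "direction": "higher_is_better"},
    ],
}


def fingerprint(framework):
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(framework, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Record this value (with a timestamp) before the AI is evaluated.
registered = fingerprint(metric_framework)

# At reporting time, the framework used must still match what was registered.
framework_unchanged = fingerprint(metric_framework) == registered
```

Any later edit to the metric list changes the fingerprint, so a metric added after the AI's scores were seen is visible rather than silent.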

Step three is running the AI against the same case population used to establish the human baseline, under equivalent conditions. This sounds obvious but is frequently compromised by data leakage — cases in the AI's training set appearing in the evaluation population — or by population mismatch, where the baseline was collected over a different time window than the evaluation sample and market conditions shifted between the two. Both issues undermine the validity of the comparison and must be controlled for and documented. The evaluation population should be held out from training and, where possible, audited by a party independent of the model development team.
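The two failure modes named above can each be checked mechanically. The sketch below assumes every case has a stable identifier and a decision date; the function names, identifiers, and dates are hypothetical.

```python
from datetime import date


def leaked_case_ids(training_ids, evaluation_ids):
    """Return case IDs present in both the training set and the evaluation set."""
    return sorted(set(training_ids) & set(evaluation_ids))


def cases_outside_window(case_dates, window_start, window_end):
    """Return IDs of evaluation cases dated outside the baseline collection window."""
    return sorted(
        case_id
        for case_id, decided_on in case_dates.items()
        if not (window_start <= decided_on <= window_end)
    )


# Hypothetical identifiers and dates, for illustration only.
training_ids = ["c001", "c002", "c003"]
evaluation_dates = {
    "c003": date(2025, 2, 10),   # also in training: leakage
    "c101": date(2025, 3, 5),
    "c102": date(2025, 9, 1),    # after the baseline window closed
}

leaked = leaked_case_ids(training_ids, evaluation_dates)
out_of_window = cases_outside_window(
    evaluation_dates, date(2025, 1, 1), date(2025, 6, 30)
)
```

Both lists should be empty before the comparison is run. If they are not, the offending cases and the decision taken about them belong in the evaluation dataset specification.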

Step four is the comparison itself. The table below illustrates the structure regulators expect to see — not a single headline number, but a side-by-side view across the dimensions that matter for the use case.

Human Baseline vs AI Model — Illustrative Metric Comparison:

Accuracy (overall): Human baseline figure established from the retrospective exercise; AI figure from held-out evaluation population; Regulatory threshold: AI must meet or exceed human figure with documented explanation of any shortfall.

False-positive rate: Human baseline captures the proportion of legitimate cases incorrectly flagged; AI figure from same evaluation population; Regulatory threshold: Equal or lower rate required, with demographic and segment breakdowns documented.

Decision consistency: Human baseline established by presenting equivalent case profiles to multiple reviewers and measuring agreement rate; AI figure measured by presenting equivalent case profiles at different time points or with minor presentation variation; Regulatory threshold: AI consistency must be documented; higher variance than human baseline requires compensating controls.

Edge-case handling: Human baseline documents escalation rate and outcome for non-standard cases; AI figure documents the rate at which the model produces low-confidence outputs or routes to human review; Regulatory threshold: AI must have a defined and documented response to edge cases; silence or confident-but-wrong outputs are not acceptable.
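A minimal sketch of how the accuracy and false-positive rows might be computed, assuming each case carries a ground-truth outcome, the human decision, the AI decision, and a segment label. The field names and the six sample cases are made up, and `True` stands for "flagged".

```python
from collections import defaultdict


def rates(pairs):
    """Accuracy and false-positive rate from (predicted, actual) boolean pairs."""
    pairs = list(pairs)
    correct = sum(pred == actual for pred, actual in pairs)
    false_pos = sum(pred and not actual for pred, actual in pairs)
    negatives = sum(not actual for _, actual in pairs)
    return {
        "accuracy": correct / len(pairs),
        "fpr": false_pos / negatives if negatives else None,
    }


def compare(cases):
    """Human vs AI metrics, overall and per segment, on the same cases."""
    groups = defaultdict(list)
    for case in cases:
        groups["overall"].append(case)      # assumes no segment is named "overall"
        groups[case["segment"]].append(case)
    return {
        name: {
            "human": rates((c["human"], c["truth"]) for c in group),
            "ai": rates((c["ai"], c["truth"]) for c in group),
        }
        for name, group in groups.items()
    }


# Made-up cases, for illustration only.
cases = [
    {"segment": "A", "truth": True,  "human": True,  "ai": True},
    {"segment": "A", "truth": False, "human": False, "ai": False},
    {"segment": "A", "truth": False, "human": True,  "ai": False},
    {"segment": "A", "truth": True,  "human": False, "ai": True},
    {"segment": "B", "truth": False, "human": False, "ai": True},
    {"segment": "B", "truth": True,  "human": True,  "ai": True},
]
table = compare(cases)
```

In this toy data the AI has higher overall accuracy and the same overall false-positive rate as the human reviewers, yet its false-positive rate in segment B is worse. That is why the segment breakdown sits alongside the headline row.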

Step five is the limitation log. Every dimension on which the AI performs differently from the human baseline — even where it performs better — must be recorded with an explanation. A model that achieves higher overall accuracy by concentrating errors in a low-volume segment has a limitation. A model that is more consistent than human reviewers because it ignores contextual signals that reviewers correctly weight for unusual cases also has a limitation. Limitations are not disqualifying; unexplained limitations are.
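The rule that every difference is logged, including improvements, can be sketched as follows. The `Limitation` record, the two helpers, and the metric values are hypothetical, and the exact-equality test is a simplification; a real process would define a tolerance per metric.

```python
from dataclasses import dataclass


@dataclass
class Limitation:
    """One limitation log entry (fields are illustrative)."""
    dimension: str
    human_value: float
    ai_value: float
    explanation: str = ""
    compensating_control: str = ""


def draft_log(human_metrics, ai_metrics):
    """Open an entry for every dimension where the AI differs, better or worse."""
    return [
        Limitation(name, human_metrics[name], ai_metrics[name])
        for name in human_metrics
        if ai_metrics[name] != human_metrics[name]
    ]


def unexplained(log):
    """Dimensions whose entry still lacks an explanation."""
    return [entry.dimension for entry in log if not entry.explanation.strip()]


# Made-up figures, for illustration only.
human = {"accuracy": 0.90, "false_positive_rate": 0.08, "escalation_rate": 0.05}
ai = {"accuracy": 0.93, "false_positive_rate": 0.08, "escalation_rate": 0.02}

log = draft_log(human, ai)
log[0].explanation = "Gain concentrated in high-volume segment; see segment table."
open_items = unexplained(log)
```

Here the accuracy improvement still gets an entry, and the lower escalation rate remains an open item until someone explains whether the model is handling edge cases or simply not routing them.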

What SR 11-7 and EU AI Act Article 9 require you to document

The documentation obligations under both frameworks overlap more than practitioners often realize, and the benchmarking process should be designed from the start to produce the artifacts each requires.

SR 11-7 expects model validators to document the conceptual soundness of the model, the quality and representativeness of the data used for validation, the testing performed including outcome analysis, and the ongoing monitoring plan. For a deployment that replaces a human process, the human baseline dataset and the methodology used to construct it are part of the evidence base for conceptual soundness. The comparison table is part of outcome analysis. The limitation log feeds directly into the model's documented risk inventory and determines what compensating controls the business must operate.

EU AI Act Article 9 requires that a risk management system be established and maintained for a high-risk AI system as an iterative process throughout the lifecycle: identifying and analyzing known and reasonably foreseeable risks, evaluating those risks, and adopting appropriate risk management measures. Where the system replaces human activity, comparison against prior human performance is a defensible way to carry out that evaluation. The benchmarking process described above is one way to satisfy the evaluation obligation. The artifacts it produces (baseline dataset, metric framework, comparison results, limitation log) are what Article 9 compliance looks like in practice.

Regulators expect to find the following documentation artifacts in a sign-off package:

The human baseline methodology document, describing how cases were selected, how reviewers were instrumented, and what controls prevented gaming of the baseline.
The held-out evaluation dataset specification, including provenance, time window, population coverage, and leakage controls.
The pre-registered metric framework, showing the performance dimensions and thresholds defined before AI evaluation began.
The side-by-side comparison results at the case level as well as in summary, with segment and demographic breakdowns where the use case involves consumer-facing decisions.
The limitation log, with each identified limitation, its probable cause, its operational significance, and the compensating control or monitoring trigger assigned to it.
The reviewer disagreement analysis, showing where human consensus was itself weak, because AI performance in those regions requires additional scrutiny rather than comparison against a noisy human signal.
The ongoing monitoring specification, defining how human-baseline equivalence will be reassessed after deployment as population characteristics shift.

Why the failure mode that passes accuracy still gets rejected

The scenario regulators are most alert to is an AI model that improves on the human baseline overall while performing significantly worse on a specific sub-population or case type that the business regards as low-volume and therefore low-risk. Aggregate accuracy masks this. False-positive rate at the segment level reveals it. This is not a hypothetical concern: it is a recurring pattern behind model risk management findings against AI deployments in credit, claims, and AML screening, and it is the pattern that a properly structured benchmarking process is designed to surface before the model reaches production rather than after.
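The arithmetic behind this masking effect is easy to reproduce. The figures below are invented purely to show the mechanism: a model that is better overall and much worse in a segment holding 5% of the cases.

```python
# Invented counts, for illustration only.
segments = {
    "high_volume": {"n": 9500, "human_correct": 8550, "ai_correct": 8835},
    "low_volume": {"n": 500, "human_correct": 450, "ai_correct": 350},
}

total = sum(s["n"] for s in segments.values())
human_overall = sum(s["human_correct"] for s in segments.values()) / total
ai_overall = sum(s["ai_correct"] for s in segments.values()) / total

# Segments where the AI is less accurate than the human baseline.
# Cross-multiplied comparison of the two rates; trivial here since n is shared.
ai_worse_in = [
    name
    for name, s in segments.items()
    if s["ai_correct"] * s["n"] < s["human_correct"] * s["n"]
]

print(f"human overall: {human_overall:.4f}")   # 0.9000
print(f"AI overall:    {ai_overall:.4f}")      # 0.9185
print(f"AI worse in:   {ai_worse_in}")         # ['low_volume']
```

The AI gains nearly two points overall while dropping from 90% to 70% accuracy in the low-volume segment. A headline comparison would report only the gain.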

The investment in a structured human baseline — conducted rigorously, documented completely, and preserved as a living comparator for ongoing monitoring — is not primarily a compliance activity. It is the mechanism by which a Head of Model Risk can answer the question a regulator will eventually ask: how do you know the AI is performing as well as the process it replaced, and how will you know if that changes? Without a structured baseline, that question has no credible answer. With one, it has a precise and auditable one.

“An AI that outperforms humans on aggregate accuracy can still fail regulatory sign-off if its failure modes are not mapped, explained, and documented before cutover.”

By Qapitol · AI assurance & governance
