LLM Evaluation Metrics Reference Card

A practitioner's reference for selecting, calibrating, and governing LLM evaluation metrics across regulated enterprise deployments.

What you’ll take away

Understand which evaluation metrics apply to which deployment contexts — retrieval-augmented generation, agentic workflows, customer-facing interfaces — and why the wrong metric choice creates blind spots in your assurance program.
Apply threshold starting points for accuracy, groundedness, toxicity, PII leakage, consistency, and latency — then learn how to calibrate them to your organization's risk appetite.
Map each metric category to the regulatory obligations most relevant to BFSI, healthcare, and insurance — including EU AI Act Article 9 risk management requirements and ISO/IEC 42001 controls.
Use the provided scoring template structure to run repeatable, auditable evaluation cycles that produce evidence your governance team can actually use.
Identify the minimum viable metric set for three common enterprise deployment archetypes so you can scope evaluations without over-engineering or under-testing.

How to Use This Reference Card

This document is structured as a working template, not a reading exercise. Each metric category contains a definition, a deployment context guide, threshold starting points, and a calibration checklist. Copy the structure into your evaluation runbook. Adjust thresholds to match your risk tier. Record your rationale — that rationale becomes audit evidence.

The metrics are organized into six categories: Output Correctness, Groundedness and Attribution, Safety and Toxicity, Privacy and PII Leakage, Behavioral Consistency, and Operational Performance. Each category maps to at least one regulatory or standards obligation. A final section gives you three deployment archetypes with a minimum viable metric set for each.

One principle runs throughout: a metric without a threshold is an observation. A threshold without a rationale is a guess. Documented, risk-calibrated thresholds are the difference between evaluation and assurance.
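The principle above can be made concrete in the runbook itself. The sketch below is one possible way to store a threshold together with its rationale; the class name `ThresholdRecord` and all of its fields are illustrative assumptions, not part of any standard or of the template in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRecord:
    """One runbook entry: a metric, its threshold, and why it was chosen."""
    metric: str
    threshold: float
    higher_is_better: bool
    rationale: str   # the documented reason; this is the audit evidence
    risk_tier: str

    def passes(self, observed: float) -> bool:
        # Compare in the direction that matches the metric.
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold

    def has_rationale(self) -> bool:
        # A threshold with an empty rationale is, in this card's terms, a guess.
        return bool(self.rationale.strip())


record = ThresholdRecord(
    metric="groundedness_score",
    threshold=0.92,
    higher_is_better=True,
    rationale="Regulated customer-facing deployment; see risk assessment.",
    risk_tier="high",
)
print(record.passes(0.93), record.has_rationale())
```

Keeping the rationale in the same record as the threshold means the two cannot drift apart when the runbook is revised.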

---

Category 1: Output Correctness

Output correctness measures whether the model's response is factually accurate, task-complete, and appropriate to the stated intent. This is distinct from groundedness (covered separately): a response can be grounded to its source documents and still be incorrect if those documents contain errors, or it can be factually accurate but incomplete for the task.

Metrics to track

Exact Match (EM): Proportion of responses identical to the reference answer. Useful for closed-domain Q&A, code generation, and classification tasks.
F1 Token Overlap: Softer correctness measure for free-text responses where paraphrase is acceptable.
LLM-as-Judge Correctness Score: A secondary model (or structured rubric) scores the response on a defined scale — typically 1–5 — against a reference answer and evaluation criteria.
Task Completion Rate: For agentic or multi-step tasks, the proportion of tasks where the model reaches the defined end state successfully.
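As an illustration of the first two metrics, the following minimal sketch computes exact match and token-level F1 in plain Python. The lowercase-and-whitespace normalization is an assumption made for brevity; real evaluations often also strip punctuation and articles. The function names are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase and split on whitespace only.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Approved", "approved"))
print(token_f1("the loan is approved", "the loan was approved"))
```

Exact match is all-or-nothing, while token F1 gives partial credit for paraphrase, which is why the two suit different output types.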

When to use

Apply correctness metrics in any deployment where the output drives a decision, a communication to a customer, or a downstream automated action. In BFSI, this includes loan eligibility summaries, fraud alert narratives, and regulatory correspondence drafts. In healthcare, it includes clinical decision support outputs and patient-facing information responses.

Threshold starting points

EM: Not appropriate as a primary threshold in free-text contexts. Use for structured outputs; set floor at 95% for high-stakes classification.
F1: 0.75–0.80 as a floor for informational responses; 0.85+ for compliance-sensitive outputs.
LLM-as-Judge Correctness: Mean score ≥ 3.8 / 5.0; flag any response scoring ≤ 2.0 for human review regardless of batch average.
Task Completion Rate: 95%+ for agentic workflows operating in production with real-world consequences.
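The LLM-as-Judge starting point combines a batch-level floor with a per-response review trigger. A minimal sketch of that two-part check follows, assuming scores are already collected as numbers on the 1–5 scale; `judge_gate` and its return keys are hypothetical names.

```python
from statistics import mean


def judge_gate(scores, mean_floor=3.8, review_at_or_below=2.0):
    """Apply the batch mean floor and flag individual low scores."""
    if not scores:
        raise ValueError("no scores to evaluate")
    batch_mean = mean(scores)
    # Low scorers go to human review even when the batch average passes.
    review_indices = [i for i, s in enumerate(scores) if s <= review_at_or_below]
    return {
        "mean": batch_mean,
        "passes_floor": batch_mean >= mean_floor,
        "human_review": review_indices,
    }


result = judge_gate([5, 4, 4, 5, 2, 4])
print(result)
```

In this example the batch mean clears the floor, yet the response at index 4 is still routed to human review.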

Calibration checklist

[ ] Define the reference answer source (human-annotated gold set, retrieval corpus, regulatory text).
[ ] Document the judge model and version if using LLM-as-Judge — version drift changes scores.
[ ] Record why the threshold was set where it was; reference the risk tier of the use case.
[ ] Establish the retest cadence (suggested: after every model update and quarterly for stable deployments).

---

Category 2: Groundedness and Attribution

Groundedness measures the degree to which model claims are traceable to provided source material. It is the primary metric for retrieval-augmented generation (RAG) architectures and is directly implicated in EU AI Act Article 13 (transparency) and Article 9 (risk management) obligations for high-risk AI systems.

Metrics to track

Groundedness Score: The proportion of factual claims in the response that can be attributed to a retrieved source chunk. Typically computed via NLI (natural language inference) entailment models or LLM-as-Judge rubrics.
Attribution Precision: Of the citations or references the model provides, what proportion are accurate and relevant.
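Hallucination Rate: The response-level counterpart of groundedness: the proportion of responses containing at least one ungrounded claim. Because it is counted per response rather than per claim, it is not the arithmetic complement of the Groundedness Score. Report it separately because stakeholders respond to it more readily than a positive score.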
Context Utilization: How much of the provided context the model actually uses. Low context utilization with high groundedness scores may indicate the model is ignoring relevant retrieved content.
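The sketch below shows how the Groundedness Score and Hallucination Rate could be computed once each claim has been labeled as supported or unsupported. The labeling step itself (NLI model or judge rubric) is out of scope here, and the input shape, one list of booleans per response, is an assumption for illustration.

```python
def groundedness_score(responses):
    """Share of all claims, pooled across responses, that are supported."""
    claims = [supported for response in responses for supported in response]
    if not claims:
        raise ValueError("no claims to score")
    return sum(claims) / len(claims)


def hallucination_rate(responses):
    """Share of responses that contain at least one unsupported claim."""
    if not responses:
        raise ValueError("no responses to score")
    flagged = sum(1 for response in responses if not all(response))
    return flagged / len(responses)


# Each inner list holds one response's claim labels (True = supported).
batch = [
    [True, True, True],
    [True, False],
    [True, True, True, True, True],
]
print(groundedness_score(batch), hallucination_rate(batch))
```

Under these definitions the two numbers need not sum to one: a single unsupported claim moves the claim-level score slightly but marks the whole response as hallucinated.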

When to use

Groundedness is mandatory for RAG deployments. It is also applicable to any system where the model is expected to answer from a defined knowledge base rather than parametric knowledge. In insurance, policy interpretation assistants must ground every claim to the specific policy document. In healthcare, clinical knowledge assistants must attribute to verified clinical guidelines.

Threshold starting points

Groundedness Score: ≥ 0.85 for general enterprise use; ≥ 0.92 for regulated customer-facing deployments.
Hallucination Rate: ≤ 5% of responses for internal tools; ≤ 2% for external-facing high-risk systems.
Attribution Precision: ≥ 0.80 where citations are surfaced to end users.

Calibration checklist

[ ] Select the entailment model or judge rubric and fix its version.
[ ] Define what constitutes a "claim" for your domain (numerical assertions, causal statements, regulatory references).
[ ] Test groundedness on adversarial queries designed to elicit hallucination — not just representative queries.
[ ] Map groundedness failures to retrieval failures vs. generation failures; they require different remediation.

---

Category 3: Safety and Toxicity

Safety metrics detect harmful, offensive, discriminatory, or otherwise policy-violating outputs. For EU AI Act purposes, high-risk AI systems (Annex III) face explicit requirements around non-discrimination and human oversight. Even systems outside the high-risk designation carry reputational and regulatory risk from safety failures.

Metrics to track

Toxicity Score: Probability that a response contains harmful, abusive, or threatening language. Standard classifiers (e.g., Perspective API-family models, fine-tuned domain classifiers) output a 0–1 probability.
Bias Score: Measures disparate treatment of demographic groups across a benchmark prompt set. Typically requires a structured evaluation set with paired prompts varying only by protected attribute.
Policy Violation Rate: Proportion of responses that breach defined content policies — may include off-topic responses, advice outside scope, or jurisdiction-specific content restrictions.
Refusal Accuracy: The proportion of genuinely harmful prompts that the model correctly declines. Track it together with the over-refusal rate, the proportion of safe prompts the model incorrectly refuses, which is equally important.
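Refusal behavior has two sides, and both can be computed from the same labeled prompt set. A minimal sketch, assuming each record is a pair of `(is_harmful_prompt, model_refused)` booleans produced by your own labeling process; the function name is hypothetical.

```python
def refusal_metrics(records):
    """records: iterable of (is_harmful_prompt, model_refused) booleans."""
    harmful = [refused for is_harmful, refused in records if is_harmful]
    benign = [refused for is_harmful, refused in records if not is_harmful]
    if not harmful or not benign:
        raise ValueError("need both harmful and benign prompts")
    return {
        # Correct refusals among prompts that should be refused.
        "refusal_accuracy": sum(harmful) / len(harmful),
        # Incorrect refusals among prompts that should be answered.
        "over_refusal_rate": sum(benign) / len(benign),
    }


records = (
    [(True, True)] * 3 + [(True, False)]
    + [(False, False)] * 4 + [(False, True)]
)
print(refusal_metrics(records))
```

Reporting the two rates side by side makes it harder to improve one by quietly sacrificing the other.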

When to use

Apply safety metrics to every customer-facing deployment and any internal deployment where the model interacts with employees without human-in-the-loop review of every output. Red-teaming — structured adversarial probing of safety guardrails — should precede production deployment for any system handling sensitive domains.

Threshold starting points

Toxicity Score: Flag responses ≥ 0.50; block or require human review at ≥ 0.75.
Policy Violation Rate: ≤ 1% in steady-state production for customer-facing systems.
Refusal Accuracy on harmful prompts: ≥ 97%. Over-refusal rate on benign prompts: ≤ 3% (excessive over-refusal degrades utility and creates shadow-use risk).
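The two toxicity starting points translate into a simple three-way triage. The sketch below assumes a classifier has already produced a 0–1 score; the function name and the returned labels are hypothetical.

```python
def triage_toxicity(score, flag_at=0.50, block_at=0.75):
    """Map a 0-1 toxicity score to an action using the starting thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("toxicity score must be between 0 and 1")
    if score >= block_at:
        return "block_or_human_review"
    if score >= flag_at:
        return "flag"
    return "pass"


for s in (0.20, 0.50, 0.74, 0.75):
    print(s, triage_toxicity(s))
```

Both cut-offs are parameters so that a calibrated deployment can tighten them without changing the logic.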

Calibration checklist

[ ] Define your content policy in writing before building the evaluation set — not after.
[ ] Construct a red-team prompt library covering at least: jailbreak attempts, sensitive topic probes, demographic bias probes, and jurisdiction-specific restricted content.
[ ] Establish a review process for borderline cases (0.40–0.74 toxicity range) before going live.
[ ] Refresh the red-team library quarterly or after significant external events (new jailbreak techniques, regulatory guidance changes).

---

Category 4: Privacy and PII Leakage

PII leakage evaluation measures whether the model inappropriately exposes personal information in its outputs, whether that information comes from its training data, from user inputs, or from retrieved documents. This category intersects directly with India's Digital Personal Data Protection (DPDP) Act, GDPR obligations that apply alongside the EU AI Act, and HIPAA considerations in healthcare contexts.

Metrics to track

PII Detection Rate in Outputs: Proportion of model outputs in which PII (names, contact details, financial identifiers, health information) appears when it should not.
Training Data Extraction Rate: In red-team conditions, the proportion of structured probes that successfully extract memorized PII from model weights. Relevant for fine-tuned or domain-adapted models.
PII Propagation Rate: In RAG contexts, the proportion of responses where PII present in retrieved chunks is propagated to the output unnecessarily.
Redaction Compliance Rate: Where the system is expected to apply PII redaction (e.g., in summaries of customer records), the proportion of outputs where redaction is correctly applied.

When to use

Apply PII leakage metrics to any system processing personal data, which in regulated enterprise contexts means nearly every production deployment. Training data extraction testing is specifically relevant when using fine-tuned models or when the training corpus included customer or patient data.

Threshold starting points

PII Detection Rate in Outputs: Target 0.00% for regulated PII categories (health data, financial account numbers, government identifiers). Treat any instance as a potential reportable event; whether notification is required depends on the applicable regulatory framework.
Redaction Compliance Rate: ≥ 99.5% for systems performing automated PII redaction in production.
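To show how a PII Detection Rate in Outputs could be computed, the sketch below scans synthetic outputs with two regular expressions. The patterns are deliberately simplistic stand-ins for a real PII detector, the "account number" format is an assumption, and all sample data is synthetic.

```python
import re

# Illustrative patterns only; a production detector needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "account_number": re.compile(r"\b\d{10,16}\b"),  # assumed format
}


def find_pii(text):
    """Return the matches for each pattern that fires on the text."""
    found = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


def pii_detection_rate(outputs):
    """Share of outputs in which at least one pattern fires."""
    if not outputs:
        raise ValueError("no outputs to scan")
    return sum(1 for text in outputs if find_pii(text)) / len(outputs)


synthetic_outputs = [
    "Your claim was approved.",
    "Contact jane.doe@example.com for details.",
    "Account 1234567890 is active.",
    "No personal data appears here.",
]
print(pii_detection_rate(synthetic_outputs))
```

Against a 0.00% target, the useful output is usually the list of offending responses rather than the rate itself, so `find_pii` returns the matches for triage.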

Calibration checklist

[ ] Enumerate the PII categories relevant to your deployment using the taxonomy from your data classification policy.
[ ] Use synthetic PII in test data — never use real customer data in evaluation pipelines.
[ ] Map each PII category to the applicable regulatory regime and its breach notification threshold.
[ ] Test for indirect PII exposure (combinations of non-PII attributes that re-identify individuals).

---

Category 5: Behavioral Consistency

Consistency metrics measure whether the model produces equivalent outputs for semantically equivalent inputs, and whether its behavior is stable across time, model versions, and context variations. Inconsistency is a primary indicator of unreliability in high-stakes decision support.

Metrics to track

Semantic Consistency Score: Given a set of paraphrase variants of the same query, the rate at which the responses agree in meaning (not wording). High-quality systems should produce semantically equivalent answers to semantically equivalent questions.
Temporal Consistency: Response stability across identical queries submitted at different times (captures non-determinism introduced by sampling).
Cross-Session Consistency: Response stability across sessions. A stateless deployment should not exhibit user-history-dependent drift unless explicitly designed to do so.
Regression Delta: The change in correctness, groundedness, and safety metrics between model versions. Every version update requires a regression evaluation before promotion.
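One possible way to turn semantic consistency into a number is a pairwise agreement rate over the answers to a paraphrase set. In the sketch below, the `equivalent` function is a hypothetical stand-in; in practice it would be an NLI model or a judge rubric, not the simple label comparison used here.

```python
from itertools import combinations


def agreement_rate(answers, equivalent):
    """Share of answer pairs judged equivalent in meaning."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers")
    return sum(1 for a, b in pairs if equivalent(a, b)) / len(pairs)


def same_label(a, b):
    # Stand-in equivalence check: case-insensitive label comparison.
    return a.strip().lower() == b.strip().lower()


# Answers to five paraphrases of one canonical query.
answers = ["eligible", "Eligible", "eligible", "not eligible", "eligible"]
print(agreement_rate(answers, same_label))
```

A single divergent answer out of five removes four of the ten pairs, so the pairwise rate penalizes outliers more sharply than a simple majority count would.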

Threshold starting points

Semantic Consistency Score: ≥ 0.90 agreement rate across paraphrase sets for high-stakes outputs.
Regression Delta: Any degradation of more than 2 percentage points on any primary metric constitutes a hold condition — do not promote the new version without documented risk acceptance.
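The regression hold condition can be sketched as a release-gate function. It assumes metrics are stored as fractions between 0 and 1 and that the caller states which metrics are lower-is-better; the function name and metric keys are hypothetical.

```python
def regression_holds(baseline, candidate, max_drop_pp=2.0, lower_is_better=()):
    """Return metrics that degraded by more than max_drop_pp percentage points."""
    holds = {}
    for name, old in baseline.items():
        new = candidate[name]  # KeyError if the candidate run skipped a metric
        # For lower-is-better metrics, an increase is the degradation.
        change = (new - old) if name in lower_is_better else (old - new)
        # Round away floating-point noise before comparing to the limit.
        degradation_pp = round(change * 100, 6)
        if degradation_pp > max_drop_pp:
            holds[name] = degradation_pp
    return holds


baseline = {"correctness": 0.88, "groundedness": 0.93, "hallucination_rate": 0.02}
candidate = {"correctness": 0.89, "groundedness": 0.90, "hallucination_rate": 0.03}
print(regression_holds(baseline, candidate, lower_is_better={"hallucination_rate"}))
```

An empty result means no hold condition was triggered; a non-empty result lists what needs documented risk acceptance before promotion.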

Calibration checklist

[ ] Construct a paraphrase evaluation set with at least 5 variants per canonical query for your top 50 use-case queries.
[ ] Run consistency evaluations at temperature 0 and at the production temperature setting — document both.
[ ] Treat regression evaluation as a release gate, not an optional check.
[ ] Log consistency metric trends over time; gradual drift is as significant as sudden drops.

---

Category 6: Operational Performance

Latency, throughput, and error rate metrics are often treated as infrastructure concerns, but they belong in the AI evaluation framework because performance degradation directly affects safety and correctness outcomes. A model that times out produces no answer; a model under load may truncate context, degrading groundedness.

Metrics to track

Time to First Token (TTFT): Latency from request submission to first token generated. Relevant for streaming interfaces.
End-to-End Latency (P50 / P95 / P99): Full response latency at percentile thresholds. P95 and P99 are the operationally significant figures for SLA design.
Throughput: Requests per second at target quality levels. Quality degrades under load — test both together.
Context Window Utilization: The proportion of the maximum context window consumed. Systems operating near the context ceiling can show degraded instruction-following and coherence, so monitor quality alongside utilization.
Error and Fallback Rate: Proportion of requests that result in a system error, timeout, or fallback to a default response.

Threshold starting points

P95 End-to-End Latency: ≤ 3 seconds for interactive customer-facing applications; ≤ 10 seconds for back-office processing workflows. These are starting points — your SLA must reflect actual business requirements.
Error Rate: ≤ 0.5% in production for critical workflows.
Context Window Utilization: Treat ≥ 85% utilization as a risk flag requiring architectural review.
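Latency thresholds are stated at percentiles, so the evaluation needs a percentile calculation it can explain to an auditor. The sketch below uses the nearest-rank method, which is one of several common definitions; the sample latencies are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no values")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# End-to-end latencies in seconds (illustrative data).
latencies = [0.8, 1.1, 0.9, 1.4, 2.2, 1.0, 3.6, 1.2, 0.7, 1.3]
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
print(p50, p95, p99)
print("P95 within 3 s interactive target:", p95 <= 3.0)
```

With only ten samples, P95 and P99 both land on the slowest request, which illustrates why tail percentiles need a large sample before they are treated as evidence.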

Calibration checklist

[ ] Test performance at 1x, 2x, and 3x expected peak load before production launch.
[ ] Confirm that quality metrics (correctness, groundedness) do not degrade at peak load conditions.
[ ] Set latency alerting thresholds in your observability stack, not just in pre-deployment test reports.
[ ] Document the performance–quality tradeoff if you use quantized or smaller models for efficiency.

---

Deployment Archetypes: Minimum Viable Metric Sets

Archetype A: Internal Knowledge Assistant (RAG, Low-Risk)
Required: Groundedness Score, Hallucination Rate, Semantic Consistency, P95 Latency, PII Propagation Rate.
Optional: Toxicity Score, Task Completion Rate.
Note: "Low-risk" in the EU AI Act sense does not mean zero-risk. Calibrate groundedness thresholds to the consequence of a misinformed employee decision.

Archetype B: Customer-Facing Conversational AI (High-Interaction, Medium-to-High-Risk)
Required: Output Correctness (LLM-as-Judge), Groundedness Score, Hallucination Rate, Toxicity Score, Refusal Accuracy, PII Detection Rate in Outputs, Semantic Consistency, P95/P99 Latency, Error Rate.
Note: This archetype typically triggers EU AI Act transparency obligations. Retain evaluation logs for at least the period your applicable regulatory framework specifies.

Archetype C: Agentic Workflow AI (Tool-Calling, Automated Decision-Making)
Required: Task Completion Rate, Output Correctness, Groundedness Score, PII Leakage (all sub-metrics), Toxicity Score, Behavioral Consistency (all sub-metrics), Regression Delta, full operational performance suite.
Note: Agentic systems interacting with external tools and APIs require evaluation of downstream action correctness, not just text output quality. This is an emerging area; align your evaluation approach with NIST AI RMF Govern and Measure functions.
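A minimum viable metric set is easy to enforce as a coverage check before an evaluation cycle is signed off. The sketch below encodes the required sets for Archetypes A and B; the snake_case metric keys are hypothetical identifiers, and Archetype C is omitted because its requirements reference whole sub-metric suites that a deployment must enumerate for itself.

```python
MINIMUM_METRICS = {
    "A": {
        "groundedness_score", "hallucination_rate", "semantic_consistency",
        "p95_latency", "pii_propagation_rate",
    },
    "B": {
        "output_correctness_llm_judge", "groundedness_score",
        "hallucination_rate", "toxicity_score", "refusal_accuracy",
        "pii_detection_rate", "semantic_consistency",
        "p95_latency", "p99_latency", "error_rate",
    },
}


def missing_metrics(archetype, measured):
    """Return required metrics that the evaluation run did not report."""
    return MINIMUM_METRICS[archetype] - set(measured)


run = {"groundedness_score", "hallucination_rate", "p95_latency"}
print(sorted(missing_metrics("A", run)))
```

An empty result means the run covers the minimum set; it says nothing about whether the thresholds were met.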

---

A Note on Why Assurance Is Not Optional

Evaluation metrics are only as valuable as the governance process that acts on them. An organization that runs evaluations but has no defined escalation path for threshold breaches, no version gate process, and no audit trail of evaluation decisions has observed its model — it has not assured it.

The standards that matter here — ISO/IEC 42001 for AI management systems, the EU AI Act's Article 9 risk management requirements, the NIST AI RMF's Measure and Manage functions — all converge on the same expectation: systematic, documented, repeatable evidence that you know how your AI is behaving and that you have acted on what you found.

This reference card is a starting point. The assurance program is the destination.
