Insurance-NER: Domain-Specific Named Entity Recognition
How a fine-tuned Llama 3.1 8B model reaches 94.2% F1 on insurance NER — methodology, data design, and evaluation protocol.
What you’ll take away
- Understand the domain-specific annotation schema and entity taxonomy that drives high-precision NER on insurance policy documents.
- Apply a reproducible training data design process, including synthetic augmentation and inter-annotator agreement thresholds, to minimise label noise.
- Implement a rigorous evaluation protocol (span-level F1, entity-type stratification, and boundary-error analysis) that surfaces real model weaknesses before production.
- Identify the five most common failure modes in insurance NER and the architectural or data-side mitigations for each.
- Map model assurance practices to ISO/IEC 42001 and the EU AI Act's high-risk system requirements for deployed document-extraction pipelines.
The Problem This Paper Solves
Insurance policy documents are among the most linguistically dense artefacts in enterprise AI. A single commercial property policy may span 80–120 pages, interleaving defined terms, exclusion clauses, coverage limits, endorsement riders, and jurisdiction-specific regulatory language — often within the same sentence. Extracting structured entities from these documents accurately enough to feed downstream underwriting, claims, or compliance workflows is a hard NER problem, and general-purpose language models solve it imperfectly.
Off-the-shelf models trained on newswire or general web text misfire on insurance-specific entities in predictable ways: they confuse monetary limit expressions with premium amounts, fail to distinguish named insureds from additional insureds, and fragment multi-token exclusion clause references into partial spans. These are not edge cases. In a pilot evaluation of three foundation models against a held-out set of 200 commercial lines policies, span-level F1 scores for the critical entity classes (Coverage Limit, Exclusion Reference, Policy Period, and Named Insured) ranged from 71% to 83%. The gap between the best of those scores (83%) and the 94.2% reported in this paper translates directly into downstream extraction errors that require manual review, eroding the operational case for automation.
This paper documents the methodology behind Insurance-NER, a fine-tuned Llama 3.1 8B model that achieves 94.2% macro-averaged span-level F1 across nine entity classes on a held-out test set of insurance policy documents. The goal is not to present a finished product but to give AI/ML practitioners, QE leaders, and risk teams a replicable technical framework — including the decisions that matter, the traps to avoid, and the evaluation rigour required before any such model enters production.
Entity Taxonomy Design
The first determinant of NER quality is not model architecture — it is entity definition quality. Poorly bounded entity classes produce label noise that no amount of training compute can recover from.
Insurance-NER uses a nine-class taxonomy developed through three rounds of schema review with domain annotators who held insurance underwriting or claims backgrounds:
- NAMED_INSURED: the primary policyholder as legally named in the declarations page
- ADDITIONAL_INSURED: parties added by endorsement or schedule, distinct from the named insured
- COVERAGE_LIMIT: monetary or percentage-expressed limits per occurrence, aggregate, or sub-limit
- DEDUCTIBLE: self-insured retention amounts, expressed as fixed values or percentages of loss
- POLICY_PERIOD: effective and expiration dates, including endorsement-specific date ranges
- EXCLUSION_REF: references to named exclusions by clause identifier or descriptive label
- COVERED_PERIL: named perils or all-risk qualifiers that define the trigger for coverage
- JURISDICTION: governing law clauses, regulatory filing references, and state-specific endorsement identifiers
- PREMIUM_AMOUNT: total and installment premium figures, excluding endorsement adjustments unless explicitly stated
Three taxonomy design principles guided every class definition. First, mutual exclusivity: COVERAGE_LIMIT and PREMIUM_AMOUNT share a surface form (currency values) but must be distinguishable by syntactic context alone, because a model cannot rely on document position. Second, nestedness policy: Insurance-NER treats entities as flat spans — no nested annotation — to reduce annotator disagreement and simplify span-extraction training. Where nesting would be semantically meaningful (e.g., a COVERAGE_LIMIT that contains a DEDUCTIBLE reference), the outer span wins. Third, boundary precision: all entity definitions specify left and right boundary rules explicitly, including whether articles, prepositions, and parenthetical qualifiers are included. Boundary vagueness is the primary driver of low inter-annotator agreement in NER tasks.
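The flat-span rule ("the outer span wins") can be illustrated with a minimal sketch. The `Span` type and `flatten_outer_wins` function are hypothetical names introduced here for illustration, not part of the Insurance-NER tooling, and the handling of partially overlapping spans (kept as-is below) is an assumption the source does not address.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def flatten_outer_wins(spans: List[Span]) -> List[Span]:
    """Drop every span that is fully contained in another span.

    Partially overlapping spans are left untouched (assumption: the
    annotation guidelines resolve those cases separately).
    """
    # Longest spans first, so an outer span is always kept before its inner spans.
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), s.start))
    kept: List[Span] = []
    for span in ordered:
        nested = any(k.start <= span.start and span.end <= k.end for k in kept)
        if not nested:
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)
```

Given a COVERAGE_LIMIT span that contains a DEDUCTIBLE reference, only the COVERAGE_LIMIT span survives, which matches the flat-annotation policy described above.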
Training Data Design
Source Document Strategy
The training corpus comprised three document streams: licensed policy documents (redacted for PII under data processing agreements aligned with India's Digital Personal Data Protection (DPDP) Act), publicly available specimen policy forms from regulatory filings, and synthetically generated policy segments. The split by document count was approximately 55% licensed, 25% specimen, and 20% synthetic. By annotated token volume, however, synthetic data accounted for roughly 30%, because specimen and synthetic documents carry a higher annotation density than licensed documents.
All licensed documents underwent PII redaction prior to annotation, replacing real named insured values with synthetic entity-consistent substitutes. This is not merely a compliance step — it prevents the model from learning spurious correlations between specific named entities and coverage structures.
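One way to keep substitution entity-consistent is to map each real value to a substitute deterministically, so the same insured always receives the same replacement within and across documents. The sketch below assumes the real values have already been identified by an upstream PII step; the substitute pool, `pick_substitute`, and `redact_named_insureds` are hypothetical and not taken from the source.

```python
import hashlib
import re
from typing import List

# Hypothetical pool of synthetic named-insured values.
SUBSTITUTE_POOL = [
    "Northgate Holdings LLC",
    "Bluefield Logistics Inc.",
    "Harrow & Pine Ltd.",
]


def pick_substitute(real_value: str, pool: List[str]) -> str:
    """Choose a substitute deterministically from a hash of the real value.

    Two different real values can map to the same substitute when the pool
    is small; a production pool would need to be much larger.
    """
    digest = hashlib.sha256(real_value.strip().lower().encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


def redact_named_insureds(text: str, real_values: List[str], pool: List[str]) -> str:
    """Replace every occurrence of each real value in a single regex pass."""
    if not real_values:
        return text
    # Longest alternatives first, so a short name never matches inside a longer one.
    alternatives = sorted(real_values, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in alternatives))
    return pattern.sub(lambda m: pick_substitute(m.group(0), pool), text)
```

Because redaction happens before annotation, character offsets are assigned on the substituted text and no span realignment is needed.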
Synthetic Data Generation Protocol
Synthetic policy segments were generated using a controlled template-and-variation approach rather than unconstrained LLM generation. Template skeletons were authored by domain experts for each of the nine entity classes, with slot variables for entity values, surrounding clause language, and document section context. Values were drawn from a curated entity value library — for example, COVERAGE_LIMIT values were sampled from actuarially plausible distributions for the relevant line of business.
Unconstrained LLM-generated synthetic data for NER training carries a specific risk: the generating model may introduce entity surface forms or syntactic patterns that are internally consistent but do not reflect real document distributions. Template-based generation trades variety for distributional fidelity, which is the correct trade-off for a domain-specific production model.
A quality filter discarded any synthetic segment where automated span extraction from the template differed from a secondary annotation pass, enforcing a 100% label-consistency threshold for synthetic examples.
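The template-and-variation approach and the label-consistency filter can be sketched as follows. This is an illustration under assumptions: the `{slot}` placeholder syntax, the example template, the value library, and the function names are all hypothetical, and the secondary annotation pass is represented only by the list of spans it returns.

```python
import random
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), character offsets, end exclusive

SLOT = re.compile(r"\{(\w+)\}")

# Hypothetical value library; the source samples from curated, plausible distributions.
VALUE_LIBRARY = {
    "COVERAGE_LIMIT": ["$1,000,000", "$2,000,000", "$5,000,000"],
    "DEDUCTIBLE": ["$10,000", "$25,000", "2% of the loss"],
}


def fill_template(template: str, slots: Dict[str, Tuple[str, str]]) -> Tuple[str, List[Span]]:
    """Fill {name} slots and return the text plus the gold spans of the filled values.

    `slots` maps a slot name to (value, entity label).
    """
    parts: List[str] = []
    spans: List[Span] = []
    cursor = 0   # read position in the template
    length = 0   # number of characters emitted so far
    for match in SLOT.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        value, label = slots[match.group(1)]
        spans.append((length, length + len(value), label))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


def is_label_consistent(template_spans: List[Span], secondary_spans: List[Span]) -> bool:
    """Keep a segment only if both annotation passes agree on every span."""
    return sorted(template_spans) == sorted(secondary_spans)


def sample_segment(template: str, slot_labels: Dict[str, str], rng: random.Random):
    """Draw one value per slot from the library and fill the template."""
    slots = {name: (rng.choice(VALUE_LIBRARY[label]), label)
             for name, label in slot_labels.items()}
    return fill_template(template, slots)
```

Because the spans are computed while the template is filled, the gold labels are exact by construction; the secondary pass then acts as a check on the template itself rather than on the arithmetic.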
Inter-Annotator Agreement
Annotation was performed by four annotators across two rounds. Before the main annotation pass, all annotators completed a calibration exercise on 50 documents not included in the training set, and pairwise Cohen's Kappa was computed at the span level. Annotators with pairwise Kappa below 0.78 on any entity class underwent a focused calibration session specific to that class before proceeding.
Final main-corpus inter-annotator agreement, computed on a 10% sample adjudicated by a senior domain reviewer, reached an average Kappa of 0.84 across all nine classes. The lowest-agreement class was EXCLUSION_REF at 0.79, which reflects genuine ambiguity in how annotators parsed clause reference syntax. This class also showed the highest per-class error rate in model evaluation — a direct downstream signal of label noise, not model failure alone.
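For reference, Cohen's Kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it over two aligned label sequences and applies the 0.78 calibration threshold per entity class. How the source aligned units for its span-level computation is not stated, so treating each aligned unit (for example, a token or candidate span) as one item is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence

CALIBRATION_THRESHOLD = 0.78  # threshold quoted in the text


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two annotators over the same aligned items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def per_class_kappa(a: Sequence[str], b: Sequence[str], entity_class: str) -> float:
    """Binarise the labels to 'is this class or not' before computing Kappa."""
    return cohens_kappa([x == entity_class for x in a], [y == entity_class for y in b])


def needs_recalibration(a: Sequence[str], b: Sequence[str], entity_class: str) -> bool:
    return per_class_kappa(a, b, entity_class) < CALIBRATION_THRESHOLD
```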
The training set comprised 14,200 annotated document segments after filtering. The development set held 1,800 segments, and the held-out test set 2,000 segments — the latter drawn exclusively from licensed documents not seen during training or development, covering six lines of business.
Fine-Tuning Methodology
Model Selection Rationale
Llama 3.1 8B was selected over larger variants and purpose-built encoder models on three criteria: inference cost at production document volumes, demonstrated general-purpose instruction following that reduces prompt engineering overhead, and a parameter count that permits fine-tuning within a practical GPU budget using standard parameter-efficient techniques (the LoRA configuration is given under Training Configuration).
Encoder-only models (BERT-family architectures) remain strong baselines for NER and were evaluated. The encoder baseline achieved 91.4% macro F1 on the same test set — a meaningful gap below the 94.2% achieved by the fine-tuned decoder. The decoder advantage was concentrated in multi-sentence span resolution and EXCLUSION_REF, where understanding clause reference context beyond a single sentence window proved decisive.
Training Configuration
Fine-tuning used a token-classification head rather than a generative span-extraction formulation. This choice was deliberate: generative extraction introduces output format variability that complicates span boundary precision and downstream parsing. The token-classification head produces per-token BIO (Beginning, Inside, Outside) labels, preserving the exact boundary information required for downstream structured extraction.
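To show why BIO labels preserve exact boundaries, here is a minimal decoder that turns a per-token label sequence into token-index spans. It is a sketch, not the paper's post-processing code: the function name is hypothetical, and treating a stray `I-` tag as the start of a new span is one lenient convention among several.

```python
from typing import List, Optional, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO labels to (start_token, end_token_exclusive, entity_class)."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, cls = label.partition("-")
        continues = prefix == "I" and cls == current
        if current is not None and not continues:
            spans.append((start, i, current))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # An I- tag with no open span of the same class starts a new span.
            start, current = i, cls
    return spans
```

For example, `["O", "B-COVERAGE_LIMIT", "I-COVERAGE_LIMIT", "O", "B-DEDUCTIBLE"]` decodes to one COVERAGE_LIMIT span over tokens 1 to 2 and one DEDUCTIBLE span at token 4.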
Key training parameters: LoRA rank 16 applied to attention and feed-forward projection matrices; learning rate 2e-5 with cosine decay; batch size 32 with gradient accumulation over 8 steps; 4 epochs with early stopping on development set macro F1. Mixed-precision (bfloat16) training was used throughout. No instruction tuning or RLHF stage was applied after fine-tuning, because the token-classification objective is sufficient for this formulation.
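As a small illustration of the learning-rate schedule, the sketch below evaluates a cosine decay from the stated peak of 2e-5. The absence of a warm-up phase, the floor of zero, and the total step count are assumptions; the source gives only the peak rate and the schedule type.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warm-up)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Halfway through training the rate is half the peak (1e-5), and it falls off fastest around that midpoint.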
Document segments were constructed with a 512-token sliding window and 64-token overlap to handle long documents without truncating mid-entity spans. Spans crossing window boundaries were handled by a merge-and-dedup post-processing step that resolves conflicting BIO labels at overlap regions using a confidence-weighted voting scheme.
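The windowing and overlap resolution can be sketched as below. The window and overlap sizes come from the text; the voting rule (sum the confidence each window assigns to a label, then take the highest total per token) is one plausible reading of "confidence-weighted voting" and is an assumption, as are the function names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def window_starts(n_tokens: int, window: int = 512, overlap: int = 64) -> List[int]:
    """Start offsets of sliding windows that cover n_tokens tokens."""
    if n_tokens <= 0:
        return []
    stride = window - overlap
    starts: List[int] = []
    start = 0
    while True:
        starts.append(start)
        if start + window >= n_tokens:
            break  # this window reaches the end of the document
        start += stride
    return starts


def merge_windows(
    n_tokens: int,
    predictions: List[Tuple[int, List[Tuple[str, float]]]],
) -> List[str]:
    """Merge per-window (label, confidence) predictions into one label per token.

    `predictions` holds (window_start, per-token predictions) pairs.
    Tokens covered by no window default to "O".
    """
    votes: List[Dict[str, float]] = [defaultdict(float) for _ in range(n_tokens)]
    for start, token_preds in predictions:
        for offset, (label, confidence) in enumerate(token_preds):
            if start + offset < n_tokens:
                votes[start + offset][label] += confidence
    return [max(v, key=v.get) if v else "O" for v in votes]
```

Token-wise voting can yield an invalid BIO sequence at a window seam (for example an `I-` tag after `O`), so the merged labels still need a tolerant span decoder or a repair pass.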
Evaluation Protocol
Primary Metric: Span-Level Macro F1
All reported metrics use strict span-level evaluation: a predicted entity is counted as correct only if both the entity class and the exact character-level span boundaries match the gold annotation. Partial-match credit is not awarded in the primary metric. This is the appropriate standard for production extraction pipelines, where downstream systems parse entity values by position — a boundary error of even one token can corrupt a COVERAGE_LIMIT value extraction.
The 94.2% macro F1 figure is the mean of per-class F1 scores, not micro-averaged, to prevent high-frequency classes (COVERAGE_LIMIT, POLICY_PERIOD) from masking performance on low-frequency but high-stakes classes (JURISDICTION, EXCLUSION_REF).
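Strict span-level scoring and macro averaging can be made concrete with a short sketch. Spans are represented here as `(doc_id, start, end, label)` tuples, which is an assumed representation; the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

Span = Tuple[str, int, int, str]  # (doc_id, start_char, end_char, label)


def per_class_f1(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Strict F1 per class: a prediction counts only on an exact span and label match."""
    scores: Dict[str, float] = {}
    for cls in {s[3] for s in gold | pred}:
        gold_cls = {s for s in gold if s[3] == cls}
        pred_cls = {s for s in pred if s[3] == cls}
        true_pos = len(gold_cls & pred_cls)
        precision = true_pos / len(pred_cls) if pred_cls else 0.0
        recall = true_pos / len(gold_cls) if gold_cls else 0.0
        denom = precision + recall
        scores[cls] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(scores.values()) / len(scores)
```

A prediction that misses the gold boundary by a single character scores zero for that span, which is the behaviour the strict protocol is meant to enforce.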
Per-Class Stratification
Per-class results on the held-out test set:
- NAMED_INSURED: 96.8% F1
- ADDITIONAL_INSURED: 93.1% F1
- COVERAGE_LIMIT: 95.7% F1
- DEDUCTIBLE: 94.3% F1
- POLICY_PERIOD: 97.2% F1
- EXCLUSION_REF: 88.4% F1
- COVERED_PERIL: 92.6% F1
- JURISDICTION: 91.9% F1
- PREMIUM_AMOUNT: 96.1% F1
EXCLUSION_REF underperformance (88.4%) is the most significant result. Analysis of false negatives shows that 61% involve clause references using non-standard identifier formats introduced by carrier-specific endorsements that are not well represented in the training data. This is a data coverage gap, not a model architectural failure: the mitigation is targeted data augmentation for carrier-specific endorsement formats, not architectural change.
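As a quick arithmetic check, the per-class figures listed above can be averaged directly. The unweighted mean of the nine listed values works out to about 94.0%, slightly below the 94.2% headline figure; the source does not explain the difference, so the sketch below simply reports what the listed numbers give.

```python
# Per-class F1 values as listed in the per-class results (percent).
PER_CLASS_F1 = {
    "NAMED_INSURED": 96.8,
    "ADDITIONAL_INSURED": 93.1,
    "COVERAGE_LIMIT": 95.7,
    "DEDUCTIBLE": 94.3,
    "POLICY_PERIOD": 97.2,
    "EXCLUSION_REF": 88.4,
    "COVERED_PERIL": 92.6,
    "JURISDICTION": 91.9,
    "PREMIUM_AMOUNT": 96.1,
}

# Unweighted (macro) mean across the nine classes.
macro_mean = sum(PER_CLASS_F1.values()) / len(PER_CLASS_F1)

# Weakest class and its distance from the mean, in percentage points.
weakest_class = min(PER_CLASS_F1, key=PER_CLASS_F1.get)
gap_to_mean = macro_mean - PER_CLASS_F1[weakest_class]

print(f"macro mean: {macro_mean:.2f}%")
print(f"weakest class: {weakest_class} ({gap_to_mean:.2f} points below the mean)")
```

EXCLUSION_REF sits more than five points below the mean of the listed values, which is why it dominates the error analysis.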
Boundary Error Analysis
Beyond F1, a boundary error analysis was performed on all false positives and false negatives in the test set. Errors were classified into four types: left-boundary over-extension (model includes a preceding preposition or article), right-boundary truncation (model stops before end of multi-token value), class confusion (correct span, wrong entity class), and spurious detection (no corresponding gold span).
Left-boundary over-extension accounted for 38% of all span errors and was concentrated in COVERAGE_LIMIT and DEDUCTIBLE classes, where patterns like "up to" and "subject to a" immediately precede the entity value. This error type was reduced by adding explicit boundary rule reminders to the annotation guidelines and re-annotating a stratified 5% sample of training data to enforce consistent left-boundary conventions. The post-correction F1 improvement on COVERAGE_LIMIT was 1.4 percentage points from the pre-correction baseline.
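The error taxonomy can be expressed as a small classifier over predicted spans. This sketch covers predictions only (missed gold spans would need a separate pass), and the `other_overlap` bucket for overlaps that fit none of the named types is an addition made here for completeness, not part of the source's four categories.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def classify_prediction(pred: Span, gold_spans: List[Span]) -> str:
    """Assign a predicted span to 'correct' or one of the boundary-error types."""
    p_start, p_end, p_label = pred
    # Exact boundary match: either correct or a class confusion.
    for g_start, g_end, g_label in gold_spans:
        if (p_start, p_end) == (g_start, g_end):
            return "correct" if p_label == g_label else "class_confusion"
    overlaps = False
    for g_start, g_end, g_label in gold_spans:
        if p_start < g_end and g_start < p_end:
            overlaps = True
            if p_label == g_label and p_end == g_end and p_start < g_start:
                return "left_over_extension"  # e.g. includes a leading "up to"
            if p_label == g_label and p_start == g_start and p_end < g_end:
                return "right_truncation"     # stops before the value ends
    return "other_overlap" if overlaps else "spurious"
```

Counting the returned labels over all test-set predictions gives the kind of breakdown reported above, such as the share of errors that are left-boundary over-extensions.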
Five Common Failure Modes in Insurance NER
Practitioners deploying domain NER in insurance should anticipate and test for these failure patterns:
- Endorsement language drift: policy endorsements use carrier-specific terminology that deviates from base form language. Models trained on specimen forms without endorsement representation will underperform on in-force policy packages. Mitigation: include endorsement document types as a distinct training stratum.
- Table and schedule entities: COVERAGE_LIMIT and DEDUCTIBLE values frequently appear in structured tables rather than prose. Tokenisation of table content produces irregular whitespace and layout tokens that disrupt BIO sequence labelling. Mitigation: apply table-aware pre-processing to linearise tabular content before tokenisation.
- Cross-reference ambiguity: EXCLUSION_REF entities often reference other clause numbers (e.g., "as defined in Section IV(b)") which are not themselves exclusions. Models trained without negative examples for clause cross-references produce elevated false positives. Mitigation: add hard negative examples of clause cross-references to training data.
- Date format variation: POLICY_PERIOD entities appear in ISO 8601, US long-form, abbreviated month, and mixed numeric formats. Models underfit on rare format variants. Mitigation: synthetic augmentation across all date format variants for every date-bearing entity class.
- Multi-policy package documents: commercial umbrella and package policies interleave multiple coverage forms in a single document. Without document-structure signals, models confuse limits from one coverage part with another. Mitigation: include document section context tokens (e.g., coverage part headers) as additional input features.
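As one concrete example of these mitigations, the date-format augmentation can be sketched by rendering a single date in each surface form named in the list. The exact format strings (for instance `01-Mar-2025` for the abbreviated-month variant) are illustrative assumptions; real corpora will contain further variants.

```python
from datetime import date
from typing import Dict

# Month names are hard-coded so the output does not depend on the system locale.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def date_variants(d: date) -> Dict[str, str]:
    """Render one date in several surface forms for synthetic augmentation."""
    month = MONTH_NAMES[d.month - 1]
    return {
        "iso_8601": d.isoformat(),
        "us_long": f"{month} {d.day}, {d.year}",
        "abbrev_month": f"{d.day:02d}-{month[:3]}-{d.year}",
        "us_numeric": f"{d.month:02d}/{d.day:02d}/{d.year}",
    }
```

Each variant can be dropped into the same template slot, so every POLICY_PERIOD template yields one training example per format with identical labels.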
Compliance and Governance Considerations
An insurance NER model that feeds underwriting or claims decisions is likely to qualify as a high-risk AI system under Annex III of the EU AI Act, depending on jurisdiction and deployment context. That classification carries concrete technical obligations: risk management system documentation, data governance logging, human oversight mechanisms, and post-market monitoring.
ISO/IEC 42001 Clause 6.1 requires that AI-related risks be identified and treated as part of the AI management system. For a production NER pipeline, this means documenting known failure modes (the five listed above), their likelihood and impact, and the technical or procedural controls in place. The evaluation protocol described in this paper — per-class F1, boundary error analysis, and held-out test set construction discipline — constitutes the technical evidence base for that risk treatment documentation.
NIST AI RMF's MEASURE function maps directly to the evaluation protocol: quantifying model performance across demographic and document-type slices, tracking performance over time as document distributions shift, and establishing re-evaluation triggers when upstream data changes.
For DPDP compliance in Indian insurance deployments, the PII redaction and synthetic substitution approach described in the training data section is the minimum required control for any licensed document used in model training.
Why Assurance Cannot Be an Afterthought
A model that scores 94.2% macro F1 on a held-out test set is not a model that is ready for production — it is a model that has passed the minimum technical bar for a serious deployment conversation. The evaluation protocol, the failure mode taxonomy, and the compliance mapping described here are not documentation overhead. They are the evidence that allows a risk committee, a regulator, or an internal audit function to evaluate the model against the organisation's risk appetite.
Insurance is a domain where extraction errors carry financial and legal consequences. The assurance work — rigorous annotation, stratified evaluation, boundary error analysis, and governed re-evaluation cycles — is what separates a model that performs well in a notebook from one that can be operated responsibly at scale. That distinction is worth more than any single F1 point.
