The Perimeter Is Not the Control: What Financial Services Gets Wrong About Air-Gapped LLMs
Air-gapped LLM deployment in financial services can satisfy data-residency policies and reduce DORA third-party risk exposure, but it removes every cloud-side guardrail and shifts the full assurance burden inward.

Key takeaways
- Air-gapped LLM deployment is technically achievable today across three distinct patterns — NIM on-prem, Ollama in a hardened VLAN, and Azure Government disconnected regions — each with different security and operational trade-offs.
- EU AI Act Article 9 mandates a risk management system; Articles 11 and 12 separately require technical documentation and automatic event logging. All three obligations must be satisfied inside the perimeter when inference runs on-prem.
- DORA Article 17 requires an ICT-related incident management process, including records of all ICT-related incidents and significant cyber threats; PCI-DSS Requirement 10 requires audit logs to be protected from modification. Both must be supported by internal tooling once cloud-provider logging is removed from scope.
- Five technical requirements — model artifact signing, inference audit log immutability, no-egress endpoint enforcement, no-call-home licence controls, and GPU driver supply chain integrity — must be independently verified after every model update.
- Air-gapping eliminates continuous model monitoring by the cloud provider; regulated firms that have not staffed or tooled for internal AI evaluation will find the perimeter itself becomes the largest unaudited risk surface.
Why Regulated Firms Are Pulling Inference Inside the Firewall
Air-gapped LLM deployment in financial services is no longer a theoretical architecture choice. Infrastructure architects at Tier-1 and Tier-2 banks, and risk officers at regulated insurers, are actively evaluating on-premise inference in response to three converging pressures. The first is DORA's ICT third-party risk requirements, which demand demonstrable control over critical technology dependencies. The second is PCI-DSS Requirements 3 and 10, which govern how stored account data is protected and how access to it is logged. The third is internal data-residency policy that precludes sending customer data to any external API endpoint, however well-contracted. The question these buyers are now asking is not whether air-gapped deployment is legally desirable (in many scenarios it clearly is) but whether it is operationally viable without creating a new category of unauditable risk.
The short answer is: yes, it is viable. The longer answer is that viability and safety are not the same thing. Pulling inference inside your perimeter eliminates every cloud-side guardrail — content filtering, output monitoring, model versioning controls, and provider-side abuse detection — and transfers the entire assurance burden to internal teams. Most regulated firms have not yet staffed or tooled for that transfer. This article walks through three reference deployment patterns, a technical-requirements checklist, and the assurance gap that even a well-executed air-gap leaves open.
Three Deployment Patterns Worth Understanding
The first pattern is NVIDIA NIM (NVIDIA Inference Microservices) deployed on-premises on dedicated GPU infrastructure. NIM packages a validated model alongside its runtime, optimisation libraries, and an OpenAI-compatible API surface into a container that can be pulled once from NVIDIA's registry and then operated entirely offline. For a bank running its own GPU cluster — or leasing bare-metal GPU capacity inside a sovereign colocation facility — this pattern offers enterprise-grade throughput, hardware-optimised quantisation, and signed container images. The operational cost is real: you own the GPU fleet, the Kubernetes layer, the ingress/egress controls, and the update pipeline. NIM does not call home once deployed in disconnected mode, but licence entitlement verification must be handled through an internal NVIDIA licence proxy rather than the public endpoint, which requires upfront procurement and configuration work.
The second pattern is Ollama running inside a hardened VLAN. Ollama is an open-source model server that can pull and serve open-weight models (Llama, Mistral, Phi, and others) from local storage with no ongoing internet dependency once the model artifact is cached. For a team that needs to stand up internal inference quickly, without a GPU cluster, Ollama on CPU or a modest GPU node inside an isolated network segment is low-friction to deploy. The risks are proportionally higher: Ollama has minimal built-in access control and no native audit logging, and model artifact integrity relies entirely on what the operator puts in place. In a PCI-DSS scope, this pattern requires wrapping Ollama with a reverse proxy that enforces mTLS, an external audit log shipper pointed at your SIEM, and explicit no-egress firewall rules at the VLAN boundary. It is a viable pattern for internal tooling or developer environments; it requires significant hardening before it touches regulated workloads.
The third pattern is Azure Government in a disconnected region configuration. Azure Government regions offer physical and logical separation from Microsoft's commercial infrastructure, and the disconnected or "sovereign" configuration allows an agency or regulated firm to operate Azure services, including Azure OpenAI Service in Government configurations, without traffic routing through public Azure endpoints. Azure Government itself is restricted to US government entities and their eligible partners, so a UK or EU bank with an existing Azure relationship would need to confirm eligibility or use an equivalent sovereign-cloud configuration; where one is available, it can satisfy data-residency requirements while preserving managed infrastructure. The trade-off is that you are still inside a hyperscaler's control plane, which DORA's third-party risk provisions require you to assess and document. It is not genuinely air-gapped in the traditional sense; it is a managed sovereignty boundary. Firms with strict no-hyperscaler policies will not find this pattern acceptable, but for those whose residency requirement is jurisdictional rather than a matter of absolute physical isolation, it is a practical middle path.
Technical Requirements: A Checklist for Architects
Regardless of which pattern you choose, five technical requirements must be independently satisfied inside the perimeter. First, model artifact signing. Every model weight file, tokeniser, and configuration artifact must carry a cryptographic signature that is verified at load time. Without this, a compromised build pipeline or storage volume can substitute a modified model and the inference service will serve it without complaint. NVIDIA NIM handles this through signed container layers; Ollama and bare-metal deployments require an operator-implemented signing and verification step, typically using a tool such as Sigstore or in-house PKI.
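As a rough illustration of the load-time check described above, the following Python sketch compares every file in a model directory against a manifest of SHA-256 digests. It assumes the manifest itself has already been authenticated by a separate signature check (for example with Sigstore or an in-house PKI), which the sketch does not implement. The names sha256_of, verify_artifacts, and the manifest layout are hypothetical, not part of NIM or Ollama.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-gigabyte weight files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(model_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every artifact matched.

    `manifest` maps file names to expected hex digests. Assumption: its own
    signature was verified before this function is called.
    """
    problems = []
    for name, expected in manifest.items():
        target = model_dir / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif not hmac.compare_digest(sha256_of(target), expected.lower()):
            problems.append(f"digest mismatch: {name}")
    # Files on disk that the manifest does not list are also treated as findings.
    for extra in sorted(p.name for p in model_dir.iterdir() if p.is_file()):
        if extra not in manifest:
            problems.append(f"unlisted: {extra}")
    return problems
```

A serving wrapper would typically refuse to start the inference process unless verify_artifacts returns an empty list, and would rerun the check after every model update.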
Second, inference audit log immutability. Every prompt and completion must be written to a log store that cannot be altered after the fact. This is not optional under DORA Article 17, which requires firms to record all ICT-related incidents and significant cyber threats and therefore depends on trustworthy underlying event records, nor under PCI-DSS Requirement 10, which requires audit logs for in-scope systems to be protected from modification. Write-once object storage, append-only log streams forwarded to a SIEM with integrity verification, or hardware security module-anchored log signing are all acceptable mechanisms. What is not acceptable is a mutable log file that lives only on the inference host itself.
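One way to make an inference log tamper-evident is a hash chain, in which each record commits to the hash of the record before it. The Python sketch below is a minimal in-memory illustration under stated assumptions: the record fields, the GENESIS placeholder, and the function names are hypothetical, and timestamps, key-based signing, and the write-once storage backend are all left out.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: list[dict], prompt: str, completion: str, user: str) -> dict:
    """Append one prompt/completion pair, linked to the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    payload = {"seq": len(chain), "user": user, "prompt": prompt, "completion": completion}
    record = {**payload, "prev": prev_hash, "hash": _digest(prev_hash, payload)}
    chain.append(record)
    return record


def verify_chain(chain: list[dict]) -> Optional[int]:
    """Return the index of the first broken record, or None if the chain is intact."""
    prev_hash = GENESIS
    for index, record in enumerate(chain):
        payload = {k: record[k] for k in ("seq", "user", "prompt", "completion")}
        if (
            record["seq"] != index
            or record["prev"] != prev_hash
            or record["hash"] != _digest(prev_hash, payload)
        ):
            return index
        prev_hash = record["hash"]
    return None
```

A hash chain only makes alteration detectable. It does not prevent someone with write access from rebuilding the whole chain, which is why the latest hash still needs to be anchored somewhere off the inference host, such as write-once storage or the SIEM.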
Third, no-egress endpoint enforcement. Every network path from the inference host to the public internet must be explicitly blocked at the firewall, not merely unconfigured. LLM runtimes — and their dependencies — can make outbound calls for telemetry, model updates, or licence validation unless those paths are affirmatively closed. This must be verified after every software update, because dependency upgrades can reintroduce egress behaviour silently. Regular egress scanning of the inference host's network activity should be a standing operational control, not a one-time deployment check.
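The standing egress check can be approximated with a small probe that runs on the inference host and reports any outbound connection that succeeds. This Python sketch is illustrative only: the probe targets are placeholder public addresses, and find_open_egress and tcp_connect are hypothetical names. It tests TCP reachability only, so DNS, UDP, and proxy paths would need their own checks.

```python
import socket
from typing import Callable, Iterable

# Hypothetical probe targets; replace with endpoints your runtimes are known to contact.
DEFAULT_TARGETS = [("1.1.1.1", 443), ("8.8.8.8", 53)]


def tcp_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def find_open_egress(
    targets: Iterable[tuple[str, int]],
    connect: Callable[[str, int], bool] = tcp_connect,
) -> list[tuple[str, int]]:
    """Return every target that was reachable; an empty list is the passing result."""
    return [(host, port) for host, port in targets if connect(host, port)]
```

Calling find_open_egress(DEFAULT_TARGETS) from a scheduled job, and alerting on any non-empty result, turns the firewall rule into a control that is re-verified after each software update. The connect parameter exists so the logic can be tested without touching the network.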
Fourth, no-call-home licence controls. Several model runtimes and enterprise LLM platforms include licence verification that phones home to a vendor endpoint. If that endpoint is unreachable — as it will be in a true air-gap — the software may degrade, refuse to serve, or log errors that create audit noise. Confirm in writing, and test in practice, that the deployment mode you have chosen supports fully offline licence operation. NVIDIA NIM's disconnected licence proxy is one example of a supported mechanism; others exist but must be explicitly configured and tested before go-live.
Fifth, GPU driver and firmware supply chain integrity. The GPU driver stack is a privileged software layer that sits below the container runtime. A compromised or unsigned driver update can undermine the integrity of everything above it. Establish a controlled update pipeline for GPU drivers and firmware — separate from the general OS patch stream — with signature verification and a documented rollback procedure. This is an often-overlooked element of the supply chain that becomes entirely the operator's responsibility in an air-gapped environment.
EU AI Act and DORA: Where the Obligations Actually Sit
It is worth being precise about regulatory attribution, because conflation of articles is a common error in architecture reviews. EU AI Act Article 9 requires high-risk AI systems to have a functioning risk management system — meaning documented identification, estimation, and mitigation of risks, implemented throughout the system's lifecycle. Article 9 does not itself specify technical documentation or logging. Those obligations sit in Article 11, which requires providers to draw up and maintain technical documentation demonstrating compliance before a system enters service, and in Article 12, which mandates that high-risk systems be designed to automatically generate logs enabling traceability and post-hoc review. An air-gapped deployment must satisfy all three articles independently; the perimeter does not discharge the Article 12 logging obligation, it simply moves where the logs must be generated and stored.
On the DORA side, the relevant provision is Article 17 of Regulation (EU) 2022/2554, which requires financial entities to establish an ICT-related incident management process and to record all ICT-related incidents and significant cyber threats. Reconstructing an incident for supervisory review depends on detailed activity logs, which is why inference logging supports this obligation in practice. This is distinct from Article 28, which governs ICT third-party risk management and is the article most often cited in the context of cloud concentration risk. Both are relevant to air-gapped LLM deployments: Article 28 is part of the motivation for pulling inference on-premise; Article 17 is the obligation you must now satisfy without the cloud provider's logging infrastructure.
The Assurance Gap: The Unsolved Problem Inside the Perimeter
Here is what air-gapping does not solve, and what most architecture reviews underestimate. Cloud-hosted LLM services come with continuous model monitoring: output quality checks, safety classifiers running on completions, drift detection, and provider-side red-teaming of new model versions before they are promoted. When you move inference inside your own perimeter, all of that disappears. The model you deploy on day one may behave differently in production on day ninety, not because the weights changed, but because the distribution of inputs from your users shifts over time. Without continuous evaluation, you will not know how the model's observed behaviour has changed until a failure surfaces.
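Input drift of this kind can be tracked with a simple distribution-comparison statistic. The Python sketch below uses the population stability index (PSI), one common choice among several, computed over a single numeric feature of each prompt. Which feature to use (prompt length is assumed here), the bucket edges, and the function names are all illustrative assumptions.

```python
import math
from bisect import bisect_right
from typing import Sequence


def bucket_shares(values: Sequence[float], edges: Sequence[float], floor: float = 1e-6) -> list[float]:
    """Share of values in each bucket defined by sorted `edges`.

    Empty buckets are floored at a small positive value so the logarithm
    in the PSI formula stays defined.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return [max(count / len(values), floor) for count in counts]


def population_stability_index(
    baseline: Sequence[float], current: Sequence[float], edges: Sequence[float]
) -> float:
    """PSI = sum over buckets of (actual - expected) * ln(actual / expected)."""
    expected = bucket_shares(baseline, edges)
    actual = bucket_shares(current, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

In use, baseline would hold the feature values logged during an initial validation window and current the values from the most recent window. A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but any threshold should be calibrated against the firm's own data rather than taken as given.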
Cloud providers — including those offering managed LLM APIs — operate safety systems at a scale and cadence that most individual financial institutions cannot match internally. This is not a criticism of internal engineering teams; it is a structural observation about the investment required to run continuous AI evaluation at production depth. Red-teaming a model for a regulated financial use case — testing it against adversarial prompts, out-of-distribution queries, and domain-specific failure modes — requires both tooling and domain expertise that sits outside the traditional QA or InfoSec function.
Firms that air-gap without building internal evaluation capability are trading one category of risk — third-party data exposure — for another: an unmonitored model running in production with no systematic way to detect degradation or misuse. The perimeter stops data from leaving. It does not stop the model from drifting, hallucinating, or behaving in ways that no one inside the firewall is equipped to detect.
The practical implication is that an air-gapped LLM deployment should be treated as the beginning of an assurance programme, not the end of a compliance exercise. Artifact signing, audit log immutability, and egress controls satisfy the infrastructure controls layer. Continuous evaluation — scheduled red-teaming, behavioural regression testing across model versions, and inference log analysis tied back to business outcomes — is the layer that sits above it. That second layer is where most regulated firms currently have a gap, and where the real compliance exposure will emerge as supervisory scrutiny of AI systems in financial services intensifies under both DORA and the EU AI Act's enforcement timeline.
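Behavioural regression testing across model versions can start from something as small as a fixed set of prompts with pass/fail checks, run against both the current and the candidate model. The Python sketch below shows that shape. The Case structure, run_suite, and regressions are hypothetical names, and generate stands in for whatever function calls the firm's local inference endpoint.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the completion is acceptable


def run_suite(generate: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run every case through one model version and record pass/fail per case."""
    return {case.name: case.check(generate(case.prompt)) for case in cases}


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Names of cases that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )
```

A model update would then be blocked from promotion whenever regressions returns a non-empty list. Simple string checks like these are only a starting point; domain-specific failure modes usually need checks written with the risk and compliance teams.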