Catching a passenger-data leak before a leading airline’s AI bots went live

a leading airline · Aviation · AI Assurance

  • 94 scenarios executed
  • 46% passed on both platforms
  • 54% trust/safety/UX gaps surfaced
  • 1 critical security vulnerability (Base64-encoded PNR leak on WhatsApp)
The context

The airline’s Web and WhatsApp AI assistants handle bookings, cancellations and refunds for millions of users. Before full-scale rollout, an independent evaluation was needed to surface hidden risks rather than discover them in production.

The challenge
  • Bots gave contradictory answers across Web and WhatsApp, eroding trust at booking moments.
  • They fabricated prices and assumed flight dates, directly impacting booking accuracy.
  • They risked exposing passenger data or having their guardrails bypassed by adversarial inputs, a compliance landmine for an airline.
  • No structured testing across real user personas (first-timers, frequent flyers, policy-bypass attempts).
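
The passenger-data risk above is the kind of exposure an automated check can flag. The sketch below is illustrative only and is not the evaluation's actual tooling: it scans a bot reply for Base64 runs that decode to a booking reference the bot should not reveal. The function name `find_encoded_pnrs`, the six-character PNR format, and the idea of supplying a set of known PNRs are all assumptions made for the example.

```python
import base64
import binascii
import re

# Assumption: a PNR is six uppercase letters or digits; real formats vary by airline.
PNR_PATTERN = re.compile(r"\b[A-Z0-9]{6}\b")
# Candidate Base64 runs: 8+ characters of the standard alphabet, optional padding.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{8,}={0,2}")


def find_encoded_pnrs(reply: str, known_pnrs: set[str]) -> list[str]:
    """Return known PNRs that appear Base64-encoded inside a bot reply."""
    hits: list[str] = []
    for match in BASE64_RUN.finditer(reply):
        token = match.group(0)
        try:
            # validate=True rejects tokens that are not well-formed Base64.
            decoded = base64.b64decode(token, validate=True).decode("ascii")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            continue  # ordinary words and malformed runs are skipped
        hits.extend(p for p in PNR_PATTERN.findall(decoded) if p in known_pnrs)
    return hits


if __name__ == "__main__":
    # Hypothetical reply: "QUJDMTIz" is the Base64 encoding of "ABC123".
    reply = "Your booking ref: QUJDMTIz is confirmed"
    print(find_encoded_pnrs(reply, {"ABC123"}))
```

Matching against a set of known test PNRs keeps false positives low, since many ordinary words happen to be valid Base64. A production check would likely also need to handle URL-safe Base64 and tokens embedded in longer strings.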
What we did

We ran a 7-step evaluation: journey selection, contextual grounding, adversarial enrichment, persona coverage, structured prompts via Qapitol’s IP Accelerator, cross-platform execution, and scoring across correctness, safety and consistency.

  • 94 structured scenarios across 7 high-impact journeys, run on both Web and WhatsApp.
  • 20+ personas weighted toward red-team (34%) and gray-zone stress (28%) tests.
  • Scored instruction-following, context retention, hallucination, safety and business-rule adherence.
  • Quantified platform drift — 46% passed on both channels, 29% WhatsApp-only, 14% Web-only, 11% failed both — exposing divergent logic and guardrails.
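
The platform-drift breakdown in the last bullet can be sketched as a simple bucketing of per-scenario results. This is a minimal illustration, not the scoring pipeline used in the engagement: the `ScenarioResult` type, the bucket names, and the sample data are hypothetical, and it assumes "WhatsApp-only" and "Web-only" mean the scenario passed on that channel alone.

```python
from collections import Counter
from dataclasses import dataclass

BUCKETS = ("passed_both", "whatsapp_only", "web_only", "failed_both")


@dataclass(frozen=True)
class ScenarioResult:
    scenario_id: str
    web_passed: bool
    whatsapp_passed: bool


def drift_bucket(result: ScenarioResult) -> str:
    """Classify one scenario by which channels it passed on."""
    if result.web_passed and result.whatsapp_passed:
        return "passed_both"
    if result.whatsapp_passed:
        return "whatsapp_only"  # assumption: passed on WhatsApp, failed on Web
    if result.web_passed:
        return "web_only"  # assumption: passed on Web, failed on WhatsApp
    return "failed_both"


def drift_summary(results: list[ScenarioResult]) -> dict[str, float]:
    """Return the percentage of scenarios in each drift bucket."""
    if not results:
        raise ValueError("no scenario results to summarise")
    counts = Counter(drift_bucket(r) for r in results)
    return {b: round(100 * counts[b] / len(results), 1) for b in BUCKETS}


if __name__ == "__main__":
    # Hypothetical sample data, not the engagement's real results.
    sample = [
        ScenarioResult("refund-01", True, True),
        ScenarioResult("refund-02", False, True),
        ScenarioResult("booking-01", True, False),
        ScenarioResult("booking-02", False, False),
    ]
    print(drift_summary(sample))
```

Any scenario outside the `passed_both` bucket points to logic or guardrails that differ between channels, which is why the split is more informative than a single pass rate.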
Draft — pending client approval
The evaluation gave us what a gut call never could — clear evidence of where our assistants were solid and where they’d quietly fail a customer.
Product lead, AI assistants
Stack & tooling
Qapitol IP Accelerator · Web bot · WhatsApp bot
