Catching a passenger-data leak before a leading airline’s AI bots went live
a leading airline · Aviation · AI Assurance
- 94 scenarios executed
- 46% passed on both platforms
- 54% surfaced trust, safety or UX gaps
- 1 critical security vulnerability (Base64-encoded PNR leak on WhatsApp)
The context
The airline’s Web and WhatsApp AI assistants handle bookings, cancellations and refunds for millions of users. Before full-scale rollout, an independent evaluation was needed to surface hidden risks rather than discover them in production.
The challenge
- Bots gave contradictory answers across Web and WhatsApp, eroding trust at booking moments.
- They fabricated prices and assumed flight dates, directly impacting booking accuracy.
- They risked exposing passenger data or having their guardrails bypassed by adversarial inputs, a compliance landmine for an airline.
- There was no structured testing across realistic user personas (first-timers, frequent flyers, users attempting policy bypass).
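To make the data-exposure risk above concrete, here is a minimal sketch of how a bot response could be scanned for a known booking reference hidden inside a Base64 string. It is an illustration only, not the tooling used in this engagement: the function name `find_encoded_leaks`, the example PNR and the 8-character minimum token length are all assumptions.

```python
import base64
import binascii
import re

# Candidate Base64 runs: 8+ characters from the standard alphabet, optional padding.
# The minimum length of 8 is an arbitrary assumption to skip short ordinary words.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{8,}={0,2}")


def find_encoded_leaks(response: str, secrets: list[str]) -> list[str]:
    """Return the secrets that appear Base64-encoded inside a bot response."""
    leaked: list[str] = []
    for token in B64_TOKEN.findall(response):
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            continue  # not valid Base64 (wrong length or padding), so skip it
        decoded = raw.decode("utf-8", errors="ignore")
        for secret in secrets:
            if secret in decoded and secret not in leaked:
                leaked.append(secret)
    return leaked


if __name__ == "__main__":
    # Hypothetical PNR and response text, for illustration only.
    token = base64.b64encode(b"PNR=ABC123").decode("ascii")
    reply = f"Here is your itinerary link, ref {token}"
    print(find_encoded_leaks(reply, ["ABC123"]))
```

A check like this only catches whole, standalone Base64 tokens; an encoded value embedded in a longer string or using the URL-safe alphabet would need additional handling.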
What we did
We ran a 7-step evaluation: journey selection, contextual grounding, adversarial enrichment, persona coverage, structured prompts via Qapitol’s IP Accelerator, cross-platform execution, and scoring across correctness, safety and consistency.
- 94 structured scenarios across 7 high-impact journeys, run on both Web and WhatsApp.
- 20+ personas weighted toward red-team (34%) and gray-zone stress (28%) tests.
- Scored instruction-following, context retention, hallucination, safety and business-rule adherence.
- Quantified platform drift: 46% of scenarios passed on both channels, 29% passed on WhatsApp only, 14% passed on Web only, and 11% failed on both, exposing divergent logic and guardrails between the two bots.
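The platform-drift breakdown in the last bullet can be computed from per-scenario pass/fail results. The sketch below shows one way to do it, under stated assumptions: the `ScenarioResult` type, the bucket names and the sample data are hypothetical, and "WhatsApp only" is taken to mean the scenario passed on WhatsApp but failed on Web.

```python
from collections import Counter
from dataclasses import dataclass

BUCKETS = ("both_passed", "whatsapp_only", "web_only", "both_failed")


@dataclass(frozen=True)
class ScenarioResult:
    scenario_id: str
    web_passed: bool
    whatsapp_passed: bool


def drift_bucket(result: ScenarioResult) -> str:
    """Classify one scenario by which channels it passed on."""
    if result.web_passed and result.whatsapp_passed:
        return "both_passed"
    if result.whatsapp_passed:
        return "whatsapp_only"
    if result.web_passed:
        return "web_only"
    return "both_failed"


def drift_breakdown(results: list[ScenarioResult]) -> dict[str, float]:
    """Return the percentage of scenarios that fall in each drift bucket."""
    if not results:
        raise ValueError("at least one scenario result is required")
    counts = Counter(drift_bucket(r) for r in results)
    return {bucket: 100 * counts[bucket] / len(results) for bucket in BUCKETS}


if __name__ == "__main__":
    # Toy data for illustration, not the results from this evaluation.
    sample = [
        ScenarioResult("s1", web_passed=True, whatsapp_passed=True),
        ScenarioResult("s2", web_passed=False, whatsapp_passed=True),
        ScenarioResult("s3", web_passed=True, whatsapp_passed=False),
        ScenarioResult("s4", web_passed=False, whatsapp_passed=False),
    ]
    print(drift_breakdown(sample))
```

Everything outside the `both_passed` bucket counts as a surfaced gap, which is how a 46% both-channel pass rate corresponds to gaps in the remaining 54% of scenarios.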
“The evaluation gave us what a gut call never could — clear evidence of where our assistants were solid and where they’d quietly fail a customer.”
Stack & tooling
Qapitol IP Accelerator · Web bot · WhatsApp bot