Aviation / AI Assurance & Evaluation

Catching a passenger-data leak before a leading airline’s AI bots went live

a leading airline · Aviation · AI Assurance

scenarios executed

46%

passed on both platforms

54%

trust/safety/UX gaps surfaced

critical security vulnerability (Base64-encoded PNR leak on WhatsApp)

The context

The airline’s Web and WhatsApp AI assistants handle bookings, cancellations and refunds for millions of users. Before full-scale rollout, an independent evaluation was needed to surface hidden risks rather than discover them in production.

The challenge

Bots gave contradictory answers across Web and WhatsApp, eroding trust at booking moments.
They fabricated prices and assumed flight dates, directly impacting booking accuracy.
They risked exposing passenger data or being bypassed by adversarial inputs, a compliance landmine for an airline.
No structured testing existed across real user personas (first-timers, frequent flyers, policy-bypass attempts).
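
One way to probe the data-exposure risk is to scan bot responses for passenger identifiers hidden behind an encoding, as in the Base64-encoded PNR leak called out in the summary. The sketch below is illustrative only, not the tooling used in the engagement. The function name `find_encoded_pnrs` and the idea of seeding test bookings with known PNRs are assumptions.

```python
import base64

import re

# Candidate Base64 runs: 8+ characters from the Base64 alphabet, optional padding.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{8,}={0,2}")


def find_encoded_pnrs(response_text, known_pnrs):
    """Return the known PNRs that appear Base64-encoded in a bot response.

    `known_pnrs` is a hypothetical set of record locators seeded into test
    bookings, so any match indicates a leak rather than a coincidence.
    """
    leaked = set()
    for match in B64_CANDIDATE.finditer(response_text):
        token = match.group(0)
        try:
            # validate=True rejects tokens that are not well-formed Base64.
            decoded = base64.b64decode(token, validate=True).decode("ascii")
        except ValueError:
            # Covers binascii.Error and UnicodeDecodeError: not a readable payload.
            continue
        leaked.update(pnr for pnr in known_pnrs if pnr in decoded)
    return leaked
```

A check like this would run over every captured response on both channels. Plain-text matching alone would miss an identifier that only appears after decoding.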

What we did

We ran a 7-step evaluation: journey selection, contextual grounding, adversarial enrichment, persona coverage, structured prompts via Qapitol’s IP Accelerator, cross-platform execution, and scoring across correctness, safety and consistency.

94 structured scenarios across 7 high-impact journeys, run on both Web and WhatsApp.
20+ personas weighted toward red-team (34%) and gray-zone stress (28%) tests.
Scored instruction-following, context retention, hallucination, safety and business-rule adherence.
Quantified platform drift: 46% of scenarios passed on both channels, 29% passed on WhatsApp only, 14% passed on Web only, and 11% failed on both, exposing divergent logic and guardrails between the two bots.
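
The platform-drift split is a four-way tally of per-scenario pass/fail pairs. The three buckets other than "passed on both" (29% + 14% + 11%) add up to the 54% headline figure. A minimal sketch of that tally follows, assuming each scenario yields one boolean verdict per channel. The function name `drift_breakdown` and the bucket labels are hypothetical, and the sample data in the usage note is made up for illustration.

```python
from collections import Counter

# Map (web_passed, whatsapp_passed) to a drift bucket.
BUCKETS = {
    (True, True): "passed_both",
    (False, True): "whatsapp_only",
    (True, False): "web_only",
    (False, False): "failed_both",
}


def drift_breakdown(results):
    """Return the percentage of scenarios in each drift bucket.

    `results` maps a scenario id to a (web_passed, whatsapp_passed) pair.
    Percentages are rounded to whole numbers, so they may not sum to exactly 100.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no scenario results to summarise")
    counts = Counter(BUCKETS[(bool(web), bool(wa))] for web, wa in results.values())
    return {name: round(100 * counts[name] / total) for name in BUCKETS.values()}
```

For example, ten illustrative scenarios with five passing on both channels, three on WhatsApp only, one on Web only, and one failing on both would give 50, 30, 10 and 10 percent respectively.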

Draft — pending client approval

“The evaluation gave us what a gut call never could — clear evidence of where our assistants were solid and where they’d quietly fail a customer.”

— Product lead, AI assistants

Stack & tooling

Qapitol IP Accelerator · Web bot · WhatsApp bot
