A deployment blueprint that de-risked multilingual AI voice agents — and stopped 3 critical failures before go-live

a leading conversational-AI platform · Conversational AI · AI Assurance

  • 72+ configurations ranked
  • 3 critical failures stopped pre-launch
  • 90% Golden Path on the recommended stack
  • 6 languages, 1 reusable blueprint
The context

Nurix was building NuPlay — an enterprise AI calling system for regional Indian markets across 6 languages and two agents (bank collections, electrical support). It needed an independent evaluation to find the optimal AI stack before production.

The challenge
  • 70+ configuration permutations (6 languages × 2 agents × STT/LLM/TTS) with no benchmark or playbook.
  • PII and compliance exposure in BFSI voice calling.
  • Compounding infrastructure cost from deploying an unvalidated stack at scale.
  • No reusable method — every new language meant starting from scratch.
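To make the size of that search space concrete, here is a minimal sketch of how such a configuration grid could be enumerated. Every name in it (the language labels, agent identifiers, and component options) is a placeholder chosen for illustration, not the set of components evaluated in the engagement, so the printed count depends entirely on these placeholder lists and is not meant to reproduce the 70+ figure.

```python
from itertools import product

# All values below are hypothetical placeholders for illustration only.
LANGUAGES = [f"lang_{i}" for i in range(1, 7)]        # 6 languages
AGENTS = ["bank_collections", "electrical_support"]   # 2 agents
STT_OPTIONS = ["stt_a", "stt_b"]                      # assumed speech-to-text options
LLM_OPTIONS = ["llm_a", "llm_b"]                      # assumed LLM options
TTS_OPTIONS = ["tts_a", "tts_b"]                      # assumed text-to-speech options

KEYS = ("language", "agent", "stt", "llm", "tts")


def enumerate_configs():
    """Return every (language, agent, stt, llm, tts) combination as a dict."""
    grid = product(LANGUAGES, AGENTS, STT_OPTIONS, LLM_OPTIONS, TTS_OPTIONS)
    return [dict(zip(KEYS, combo)) for combo in grid]


configs = enumerate_configs()
print(len(configs))   # size of the placeholder grid
print(configs[0])     # first combination in the grid
```

Even with only two options per component, the grid grows multiplicatively with each new language or agent, which is why validating every permutation by hand does not scale.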
What we did

We ran a 6-stage evaluation. It included an agentic config-ranking step that automatically surfaced the highest-potential combinations, with human-in-the-loop (HITL) scoring via Open Codes.

  • Catalogued and baselined every STT/LLM/TTS variable; pre-validated voices.
  • Validated only the top-ranked configs (not all 70+) across 3 scenario types.
  • Synthesised two costed deployment paths and a language-by-language readiness matrix.
  • Found that 1 in 5 calls entered an error flow with no audit trail, and that logic loops affected 4 of the 6 languages.
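The "validate only the top-ranked configs" step can be illustrated with a small sketch. It assumes each configuration already carries a single numeric score and simply keeps the best k per language and agent pair. The scores, stack names, and the `shortlist` helper are all hypothetical, and this plain sort stands in for the agentic ranking step described above, whose internals are not covered here.

```python
from collections import defaultdict


def shortlist(scored_configs, k=2):
    """Keep the k highest-scoring configs for each (language, agent) pair."""
    groups = defaultdict(list)
    for cfg in scored_configs:
        groups[(cfg["language"], cfg["agent"])].append(cfg)
    return {
        key: sorted(items, key=lambda c: c["score"], reverse=True)[:k]
        for key, items in groups.items()
    }


# Hypothetical scores for illustration only; not results from the evaluation.
scored = [
    {"language": "lang_1", "agent": "bank_collections", "stack": "stack_a", "score": 0.62},
    {"language": "lang_1", "agent": "bank_collections", "stack": "stack_b", "score": 0.81},
    {"language": "lang_1", "agent": "bank_collections", "stack": "stack_c", "score": 0.74},
    {"language": "lang_2", "agent": "electrical_support", "stack": "stack_a", "score": 0.55},
]

top = shortlist(scored, k=2)
for (language, agent), items in top.items():
    print(language, agent, [c["stack"] for c in items])
```

Only the shortlisted stacks would then go through full scenario testing, which keeps the validation effort proportional to k rather than to the size of the whole grid.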
Draft — pending client approval
Seventy-plus configurations with no way to choose — they turned it into a ranked, evidence-backed blueprint, and caught three failures we’d otherwise have shipped.
Engineering lead, Nurix
Stack & tooling
Sarvam Saras V3 · Azure GPT-4.1 Mini · OpenAI GPT-5.2 · Sarvam Bulbul V3 · Google TTS · HITL + Open Codes
