Conversational AI / AI Voice Evaluation
A deployment blueprint that de-risked multilingual AI voice agents — and stopped 3 critical failures before go-live
a leading conversational-AI platform · Conversational AI · AI Assurance
72+ configurations ranked
3 critical failures stopped pre-launch
90% Golden Path on the recommended stack
6 languages, 1 reusable blueprint
The context
Nurix was building NuPlay — an enterprise AI calling system for regional Indian markets across 6 languages and two agents (bank collections, electrical support). It needed an independent evaluation to find the optimal AI stack before production.
The challenge
- 70+ configuration permutations (6 languages × 2 agents × STT/LLM/TTS) with no benchmark or playbook.
- PII and compliance exposure in BFSI voice calling.
- Compounding infrastructure cost from deploying an unvalidated stack at scale.
- No reusable method — every new language meant starting from scratch.
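To give a sense of scale, the configuration space in the first bullet can be enumerated as a Cartesian product. The sketch below is illustrative only: the component names are taken from the stack listed in this case study, the language labels are placeholders, and the assignment of each component to an STT, LLM or TTS role is an assumption. With only these named components the grid comes to 48 combinations, so the 70+ figure presumably reflects further options that are not itemised here.

```python
from itertools import product

# Placeholder labels: the six languages are not named in this case study.
LANGUAGES = [f"language_{i}" for i in range(1, 7)]
AGENTS = ["bank_collections", "electrical_support"]

# Role assignment below is an assumption made for illustration.
STT_MODELS = ["Sarvam Saras V3"]
LLM_MODELS = ["Azure GPT-4.1 Mini", "OpenAI GPT-5.2"]
TTS_MODELS = ["Sarvam Bulbul V3", "Google TTS"]


def build_config_grid():
    """Return every language/agent/STT/LLM/TTS combination as a dict."""
    return [
        {"language": lang, "agent": agent, "stt": stt, "llm": llm, "tts": tts}
        for lang, agent, stt, llm, tts in product(
            LANGUAGES, AGENTS, STT_MODELS, LLM_MODELS, TTS_MODELS
        )
    ]


if __name__ == "__main__":
    grid = build_config_grid()
    print(f"{len(grid)} candidate configurations")
```

Adding one more option to any component list multiplies the grid rather than adding to it, which is why exhaustive validation becomes impractical so quickly.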
What we did
We ran a 6-stage evaluation. An agentic config-ranking step auto-surfaced the highest-potential combinations, and human-in-the-loop reviewers scored the results via Open Codes.
- Catalogued and baselined every STT/LLM/TTS variable; pre-validated voices.
- Validated only the top-ranked configs (not all 70+) across 3 scenario types.
- Synthesised two costed deployment paths and a language-by-language readiness matrix.
- Surfaced that 1 in 5 calls entered an error flow with zero audit trail, with logic loops affecting 4 of 6 languages.
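The idea of validating only a top-ranked shortlist can be sketched as follows. This is not the agentic ranking step used in the engagement; it swaps in a plain weighted score to show the shape of the approach. The metric names, the weights and the `ConfigScore` type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigScore:
    """Hypothetical baseline measurements for one configuration."""
    name: str
    accuracy: float      # 0..1, higher is better
    latency_s: float     # seconds, lower is better
    cost_per_min: float  # currency units, lower is better


# Illustrative weights only; a real evaluation would set these with the client.
WEIGHTS = {"accuracy": 0.6, "latency": 0.25, "cost": 0.15}


def _normalise(values, higher_is_better):
    """Min-max scale values to 0..1 so that 1.0 is always the best."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_configs(configs, top_k):
    """Return the names of the top_k configurations by weighted score."""
    acc = _normalise([c.accuracy for c in configs], higher_is_better=True)
    lat = _normalise([c.latency_s for c in configs], higher_is_better=False)
    cost = _normalise([c.cost_per_min for c in configs], higher_is_better=False)
    scored = [
        (
            WEIGHTS["accuracy"] * a + WEIGHTS["latency"] * l + WEIGHTS["cost"] * c,
            cfg.name,
        )
        for a, l, c, cfg in zip(acc, lat, cost, configs)
    ]
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]
```

Only the names returned by `rank_configs` would then go forward to scenario testing, which keeps the expensive human-scored validation focused on a small shortlist.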
“Seventy-plus configurations with no way to choose — they turned it into a ranked, evidence-backed blueprint, and caught three failures we’d otherwise have shipped.”
Stack & tooling
Sarvam Saras V3 · Azure GPT-4.1 Mini · OpenAI GPT-5.2 · Sarvam Bulbul V3 · Google TTS · HITL + Open Codes