Conversational AI / AI Voice Evaluation

A deployment blueprint that de-risked multilingual AI voice agents — and stopped 3 critical failures before go-live

a leading conversational-AI platform · Conversational AI · AI Assurance

72+

configurations ranked

3

critical failures stopped pre-launch

90%

Golden Path on the recommended stack

6

languages, 1 reusable blueprint

The context

Nurix was building NuPlay, an enterprise AI calling system for regional Indian markets, covering 6 languages and 2 agents (bank collections and electrical support). Nurix needed an independent evaluation to identify the best-performing AI stack before production.

The challenge

70+ configuration permutations (6 languages × 2 agents × STT/LLM/TTS) with no benchmark or playbook.
PII and compliance exposure in BFSI voice calling.
Compounding infrastructure cost from deploying an unvalidated stack at scale.
No reusable method — every new language meant starting from scratch.
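
To make the size of that search space concrete, the permutation count can be sketched as a Cartesian product. This is an illustration only: the language and agent counts come from the text above, while the component names and the number of STT, LLM, and TTS options are hypothetical placeholders, chosen so the total lands in the 70+ range.

```python
from itertools import product

# From the text: 6 languages and 2 agents.
LANGUAGES = [f"lang_{i}" for i in range(1, 7)]        # placeholder names
AGENTS = ["bank_collections", "electrical_support"]

# ASSUMPTION: option counts per layer are hypothetical, not the evaluated set.
STT_OPTIONS = ["stt_a"]
LLM_OPTIONS = ["llm_a", "llm_b", "llm_c"]
TTS_OPTIONS = ["tts_a", "tts_b"]


def enumerate_configs():
    """Return every language x agent x STT x LLM x TTS combination."""
    return [
        {"language": lang, "agent": agent, "stt": stt, "llm": llm, "tts": tts}
        for lang, agent, stt, llm, tts in product(
            LANGUAGES, AGENTS, STT_OPTIONS, LLM_OPTIONS, TTS_OPTIONS
        )
    ]


configs = enumerate_configs()
print(len(configs))  # 6 * 2 * 1 * 3 * 2 = 72
```

Adding one more option to any layer multiplies the total, which is why exhaustive validation quickly becomes impractical.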

What we did

A 6-stage, intelligence-driven evaluation including an agentic config-ranking step that auto-surfaced the highest-potential combinations, with human-in-the-loop scoring via Open Codes.

Catalogued and baselined every STT/LLM/TTS variable; pre-validated voices.
Validated only the top-ranked configs (not all 70+) across 3 scenario types.
Synthesised two costed deployment paths and a language-by-language readiness matrix.
Surfaced that 1 in 5 calls entered an error flow with zero audit trail, and that logic loops affected 4 of 6 languages.
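
The idea of ranking configurations and validating only the top few can be sketched as below. This is a minimal illustration under stated assumptions, not the scoring method used in the engagement: the metric names, weights, and sample values are all hypothetical.

```python
from heapq import nlargest

# ASSUMPTION: hypothetical weights; the source does not state how configs were scored.
WEIGHTS = {"accuracy": 0.5, "latency": 0.3, "cost": 0.2}


def score(metrics):
    """Weighted sum of metrics normalised to [0, 1], where higher is better."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


def shortlist(candidates, k):
    """Return the k highest-scoring (config_id, metrics) pairs, best first."""
    return nlargest(k, candidates, key=lambda item: score(item[1]))


# Hypothetical baseline measurements for three candidate configurations.
candidates = [
    ("cfg_a", {"accuracy": 0.9, "latency": 0.6, "cost": 0.5}),
    ("cfg_b", {"accuracy": 0.7, "latency": 0.9, "cost": 0.9}),
    ("cfg_c", {"accuracy": 0.6, "latency": 0.5, "cost": 0.4}),
]

top = shortlist(candidates, k=2)
print([config_id for config_id, _ in top])  # ['cfg_b', 'cfg_a']
```

Only the shortlisted configurations would then go through full scenario validation and human-in-the-loop scoring.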

Draft — pending client approval

“Seventy-plus configurations with no way to choose — they turned it into a ranked, evidence-backed blueprint, and caught three failures we’d otherwise have shipped.”

— Engineering lead, Nurix

Stack & tooling

Sarvam Saras V3 · Azure GPT-4.1 Mini · OpenAI GPT-5.2 · Sarvam Bulbul V3 · Google TTS · HITL + Open Codes
