How to build an AI voice agent that doesn't sound like a robot
The 2026 playbook for AI voice agents that callers actually enjoy talking to — latency, voice selection, interruption handling, and the production tricks nobody writes about.
- #voice AI
- #Vapi
- #Retell
- #ElevenLabs
- #Claude
The single biggest lie in the AI voice space right now: “it sounds just like a human.”
Most agents you hear on TikTok demos are tuned for a 40-second clip. They fall apart after the caller interrupts twice or asks something off-script. In production — with real customers, real networks, and real accents — “just like a human” takes actual engineering.
Here is how we build voice agents that callers do not hang up on.
1. Latency is the product
Everything else is a rounding error compared to response time. Under 800ms from end-of-speech to the first audible word of the reply feels natural. Over 1.5s feels like a dropped call. Over 3s, callers start saying “hello? hello?” into an agent that is working fine but just hasn’t started speaking.
The latency budget comes from four places:
- Speech-to-text — Deepgram’s Nova-3 model is where we default. It streams partial transcripts, and we can act on a finalized utterance before the user fully pauses.
- LLM inference — Claude Haiku for routing and quick answers, Claude Sonnet for reasoning, Claude Opus only when we really need it. Streaming first token is mandatory.
- Text-to-speech — ElevenLabs Turbo v2 or Cartesia Sonic-2. Both stream. Both handle interruption.
- Network — We co-locate the voice platform, the LLM, and the TTS in the same region whenever possible. It is worth an engineering hour.
Target budget we actually hit in production: STT 150ms + LLM first-token 350ms + TTS first-word 150ms = 650ms, plus network overhead, for ≈ 700ms end-to-end.
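A minimal sketch of how that budget could be tracked per call, assuming you already measure each stage in milliseconds. The stage names, the 50ms network figure (simply what is left of 700ms after the three listed stages), and the "noticeable" label for the 800ms to 1.5s band are illustrative assumptions, not measurements from the platforms above.

```python
# Stage targets from the budget above. The network figure is an assumption:
# it is what remains of ~700 ms after STT, LLM, and TTS.
BUDGET_MS = {
    "stt": 150,
    "llm_first_token": 350,
    "tts_first_word": 150,
    "network": 50,
}

NATURAL_MS = 800    # under this, the reply feels natural
DROPPED_MS = 1500   # over this, it feels like a dropped call


def total_latency(stages: dict[str, float]) -> float:
    """Sum per-stage latencies into one end-to-end figure."""
    return sum(stages.values())


def classify(total_ms: float) -> str:
    """Map an end-to-end latency onto the thresholds described above."""
    if total_ms < NATURAL_MS:
        return "natural"
    if total_ms <= DROPPED_MS:
        return "noticeable"  # hypothetical label for the in-between band
    return "dropped-call"


def over_budget(measured: dict[str, float]) -> dict[str, float]:
    """Return each stage that exceeded its target, with the overshoot in ms."""
    return {
        stage: measured[stage] - target
        for stage, target in BUDGET_MS.items()
        if measured.get(stage, 0) > target
    }
```

Logging `over_budget(...)` for every turn makes it obvious which stage to fix first when the total drifts past 800ms.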
2. Voice selection matters more than model selection
You can have the best Claude prompt in the world. If the voice is wrong — cadence, emotional range, accent, pacing — no one hears the intelligence behind it.
Our defaults:
- English: ElevenLabs Adam or Bella for neutral American; Josh or Charlie when the brand is warmer; a custom-cloned voice when the client has an in-house brand voice.
- Arabic: ElevenLabs multilingual v2 with the “Rachel” or “Sam” variants, carefully evaluated per dialect.
- Spanish: Lola or Mateo tuned to neutral LATAM or Castilian depending on market.
We audition voices on actual scripts the agent will use. Not one-line demos — full realistic conversations, including objections and silences. The voice that sounds best in a demo often sounds wrong when it apologizes.
3. Interruption handling is where most demos lie
If I interrupt an agent mid-sentence and it keeps talking for 2 more seconds, the illusion breaks forever. If I interrupt it and it says “I’m sorry, what was that?” every time, the illusion also breaks forever.
What actually works:
- Voice Activity Detection tuned for the caller’s environment (loud call centers need different thresholds than a home office).
- The agent stops within 100ms of detecting the interruption.
- The agent does NOT apologize or backtrack unless the interruption is clearly a correction.
- The agent keeps the context of what it was saying and intelligently decides to continue, summarize, or drop it.
In our tests, Claude is genuinely better than other frontier models at this because it follows nuanced instructions like “if the user interrupts you while you are confirming a booking, do not re-apologize, just ask what they want to change.”
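The post-interruption decision can be sketched as a small policy function. This is an illustration under stated assumptions, not how any particular platform implements barge-in: the `BargeIn` fields, the backchannel word list, and the 0.8 cutoff are all hypothetical, and `is_correction` is assumed to come from a separate classifier or from the LLM itself.

```python
from dataclasses import dataclass
from enum import Enum


class Resume(Enum):
    CONTINUE = "continue"    # pick the interrupted sentence back up
    SUMMARIZE = "summarize"  # restate the unfinished part briefly, later
    DROP = "drop"            # abandon the rest of the utterance


@dataclass
class BargeIn:
    spoken_fraction: float  # share of the agent's utterance already played (0..1)
    caller_text: str        # transcript of what the caller said over the agent
    is_correction: bool     # assumed to come from an upstream classifier


# Hypothetical list of short acknowledgements that are not real interruptions.
BACKCHANNELS = {"ok", "okay", "yeah", "right", "uh huh", "mm hmm"}


def plan_after_interrupt(event: BargeIn) -> tuple[Resume, bool]:
    """Return (what to do with the interrupted utterance, whether to apologize)."""
    text = event.caller_text.strip().lower().rstrip(".!?,")
    if event.is_correction:
        # Only a clear correction earns an apology and a restart.
        return Resume.DROP, True
    if text in BACKCHANNELS:
        # "Okay" while the agent is talking is agreement, not a takeover.
        return Resume.CONTINUE, False
    if event.spoken_fraction >= 0.8:
        # Nearly finished anyway: nothing worth repeating.
        return Resume.DROP, False
    return Resume.SUMMARIZE, False
```

Audio playback should stop as soon as the interruption is detected regardless of the plan. This function only decides what happens to the unfinished content afterwards.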
4. Don’t ask the LLM to do STT’s job
The single most expensive mistake we see teams make: having the LLM do its own disambiguation on messy transcripts.
STT errors flow straight into the LLM as input. The LLM hallucinates context to fill the gaps. The caller gets an answer that does not match what they said. Trust collapses.
The fix:
- Use word-level confidence scores from the STT.
- If a key field (date, phone number, order ID) has low confidence, the agent repeats it back before acting on it.
- Names get spelled out letter by letter. Always.
- Expected vocabulary boosting for product names, brand terms, and industry jargon.
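The read-back rule for low-confidence key fields can be sketched like this. The `Word` shape is a stand-in for whatever word-level output your STT returns, and the 0.85 threshold and prompt wording are assumptions to tune against your own call data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    confidence: float  # word-level confidence from the STT, 0..1


# Assumed threshold; tune it per field type and per STT model.
KEY_FIELD_THRESHOLD = 0.85


def needs_confirmation(words: list[Word], threshold: float = KEY_FIELD_THRESHOLD) -> bool:
    """A key field is only as trustworthy as its least confident word."""
    return any(w.confidence < threshold for w in words)


def confirmation_prompt(field_name: str, words: list[Word]) -> Optional[str]:
    """Return a read-back question, or None when the field can be used as-is."""
    if not needs_confirmation(words):
        return None
    value = " ".join(w.text for w in words)
    return f"Just to confirm, your {field_name} is {value}. Is that right?"
```

The point is that the decision to confirm happens in plain code before the LLM sees the field, so the model never has to guess whether a shaky transcript is right.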
5. The agent needs real tools, not just a prompt
An agent that can only talk is a better IVR. A useful agent takes actions mid-call:
- Check calendar availability and book.
- Look up an order or property.
- Send an SMS with a confirmation link.
- Take a payment (with proper compliance).
- Escalate with full context to a human.
We build this with the Claude Agent SDK or Vapi/Retell function calling. The key discipline is making every tool call idempotent, observable, and graceful when it fails. A tool that fails silently is worse than no tool at all.
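A minimal sketch of those three properties in one wrapper, independent of any SDK. `ToolRunner`, the idempotency key format, and the in-memory result store are hypothetical. A production version would likely persist results somewhere durable so that retries survive a process restart.

```python
import logging
from typing import Any, Callable

log = logging.getLogger("voice_agent.tools")


class ToolRunner:
    """Wraps tool calls so they are idempotent, logged, and never fail silently."""

    def __init__(self) -> None:
        # In-memory store for illustration only.
        self._results: dict[str, Any] = {}

    def call(self, key: str, tool: Callable[[], Any], fallback: str) -> dict:
        # Idempotent: a repeated key replays the stored result
        # instead of running the side effect (booking, payment) again.
        if key in self._results:
            log.info("tool replay key=%s", key)
            return {"ok": True, "result": self._results[key], "replayed": True}
        try:
            result = tool()
        except Exception:
            # Observable: the failure is logged with its traceback.
            log.exception("tool failed key=%s", key)
            # Graceful: the agent gets something to say instead of silence.
            return {"ok": False, "say": fallback}
        self._results[key] = result
        log.info("tool ok key=%s", key)
        return {"ok": True, "result": result, "replayed": False}
```

Failures are deliberately not cached, so the same key can be retried once the downstream system recovers.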
6. Evals are not optional
Every production voice agent we ship has:
- A golden-set eval — 50–200 real call recordings (or synthetic calls that mirror real ones) run against every prompt or model change.
- A hallucination detector — the agent should never invent information not in its knowledge base.
- A “did we solve it?” classifier — run post-call against every transcript, rolled up into a dashboard.
Without this, the agent quietly degrades every time someone tweaks the system prompt. With it, you ship weekly with confidence.
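As a rough illustration of the golden-set idea, the harness below replays scripted caller turns through an agent function and checks the reply for required and forbidden phrases. `GoldenCase`, the substring checks, and the `agent` callable signature are simplifying assumptions. Real evals on call recordings usually need an LLM or human grader on top of string matching.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GoldenCase:
    caller_turns: list[str]
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)  # e.g. invented facts


def run_golden_set(agent: Callable[[list[str]], str], cases: list[GoldenCase]) -> dict:
    """Run every case through the agent and report the pass rate and failures."""
    failures = []
    for index, case in enumerate(cases):
        reply = agent(case.caller_turns).lower()
        missing = [p for p in case.must_contain if p.lower() not in reply]
        forbidden = [p for p in case.must_not_contain if p.lower() in reply]
        if missing or forbidden:
            failures.append({"case": index, "missing": missing, "forbidden": forbidden})
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}
```

Running this in CI against every prompt or model change, and failing the build when `pass_rate` drops, is one way to catch the quiet degradation described above.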
7. Humans are still part of the system
The best voice AI knows when to hand off. We instrument every agent with at least:
- A confidence threshold that triggers handoff.
- A hard-escalation keyword list (“manager”, “complaint”, “lawsuit”) that jumps the caller straight to a human.
- Full conversation context delivered to the human agent before they say hello — so the caller never repeats themselves.
Callers are dramatically more forgiving of an AI that escalates quickly than one that loops.
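A sketch of the handoff triggers, using the keyword list from above. The 0.6 confidence floor, the reason strings, and the shape of the handoff packet are assumptions for illustration. How `agent_confidence` is produced is left to your stack.

```python
import re

# Keywords from the list above; extend per client.
HARD_ESCALATION = ("manager", "complaint", "lawsuit")

# Assumed floor below which the agent hands off instead of guessing.
CONFIDENCE_FLOOR = 0.6

_KEYWORDS = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in HARD_ESCALATION) + r")\b",
    re.IGNORECASE,
)


def should_escalate(caller_text: str, agent_confidence: float) -> tuple[bool, str]:
    """Return (escalate?, reason). Keywords win over the confidence check."""
    match = _KEYWORDS.search(caller_text)
    if match:
        return True, f"keyword:{match.group(1).lower()}"
    if agent_confidence < CONFIDENCE_FLOOR:
        return True, "low_confidence"
    return False, ""


def handoff_packet(transcript: list[tuple[str, str]], reason: str) -> dict:
    """Bundle the context the human agent should see before saying hello."""
    return {
        "reason": reason,
        "turns": len(transcript),
        "transcript": [f"{speaker}: {text}" for speaker, text in transcript],
    }
```

The keyword check runs on every caller turn before the LLM is consulted, so a caller asking for a manager is never stuck waiting on a model decision.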
The result
When you get all of the above right, the illusion stops being an illusion. The agent is not pretending to be a human — it is genuinely doing the work a human would do, faster, cheaper, and with infinite patience. Callers notice the speed and the consistency, not the absence of a person.
That is the bar we hold ourselves to. If you are building voice AI and still in the “sounds slightly robotic” phase, let’s talk. That gap closes faster than you think.