Voice AI architecture for production deployments in 2025 is a solved problem at the component level and an unsolved problem at the integration level. Every layer — STT, LLM, TTS, telephony, orchestration — has mature, cheap, capable tooling available. What kills SME voice agent deployments isn't the choice of model; it's the latency budget that wasn't designed, the state management that wasn't built, and the QA system that never got shipped.
This guide is the reference architecture we've settled on after building and operating AI voice agents for UK SMEs across financial services, property management, field services, and commercial cleaning. It covers every layer of the stack with the reasoning behind each choice — not a product recommendation list, but the tradeoffs that drove the decisions.
How to decide in 30 seconds
- Is this an outbound campaign (appointment booking, follow-up)?
  - YES → managed broker (Retell/Vapi) + Twilio + GPT-4o-mini. Continue.
  - NO → inbound flow?
- Is this an inbound handler (missed calls, triage, support)?
  - YES → same broker stack + tighter barge-in settings. Continue.
  - NO → is this a multi-turn document/knowledge Q&A use case?
    - YES → consider voice-RAG hybrid, see the Document RAG guide.
    - NO → clarify the use case before choosing the stack.
- Are call volumes > 10,000 minutes/month?
  - YES → evaluate SIP trunk peering alongside Twilio retail.
  - NO → Twilio retail is fine. Simplicity beats cost at SME volumes.
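The decision tree above can also be written as a small function, which is useful when the choice needs to be recorded in a scoping tool or a project README. This is a minimal sketch: the field names (`outbound`, `inboundHandler`, `knowledgeQa`, `minutesPerMonth`) and the returned strings are illustrative assumptions, not part of any vendor API.

```javascript
// Sketch of the 30-second decision tree. All field names are hypothetical.
function chooseStack({ outbound, inboundHandler, knowledgeQa, minutesPerMonth }) {
  let stack;
  if (outbound) {
    stack = 'managed broker (Retell/Vapi) + Twilio + GPT-4o-mini';
  } else if (inboundHandler) {
    stack = 'managed broker (Retell/Vapi) + Twilio + GPT-4o-mini, tighter barge-in';
  } else if (knowledgeQa) {
    // Knowledge Q&A takes a different path, so the volume check is skipped.
    return { stack: 'voice-RAG hybrid', telephony: null };
  } else {
    return { stack: null, telephony: null, action: 'clarify the use case first' };
  }

  // The volume threshold only affects the telephony layer.
  const telephony = minutesPerMonth > 10000
    ? 'evaluate SIP trunk peering alongside Twilio retail'
    : 'Twilio retail';
  return { stack, telephony };
}

console.log(chooseStack({ outbound: true, minutesPerMonth: 2000 }));
```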
Reference Stack
The stack we deploy most often, chosen for UK availability, documented latency characteristics, and the ability to swap individual layers without rebuilding the whole system:
- Telephony: Twilio Programmable Voice with UK DID numbers; SIP trunk to WebRTC bridge for custom deployments at scale.
- STT: Deepgram Nova-2 for English (lowest latency at quality); Google STT as fallback for accented speech or non-English calls. VAD enabled on both; use Deepgram's endpointing API, not Twilio's default.
- LLM: GPT-4o-mini / GPT-4.1-mini for latency-sensitive turns (qualification, booking, objection handling). Reserve GPT-4o for the turns that need reasoning — escalation decisions, complex objection handling.
- TTS: ElevenLabs Flash (lowest latency; good quality for UK voices) or Azure Neural TTS (better cost at volume; quality slightly below ElevenLabs). Cache all deterministic utterances — greetings, confirmations, opt-out scripts.
- Agent runtime / broker: Retell AI for fast deployment; custom Node broker when full control over barge-in and state routing is needed.
- Orchestration: n8n or Make for CRM, calendar, and notification ops; a dedicated Node or Python service for the hot path (booking, CRM create, Slack alert) where latency matters.
- State: Redis with call-SID-keyed JSON; TTL = max call duration + 5 minutes. Write on every state transition.
- Persistence: Postgres for call events, structured outcomes, and audit trail; S3 / GCS for recordings and transcripts.
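The LLM split in the list above (a small model for latency-sensitive turns, a larger one for reasoning-heavy turns) amounts to a lookup on turn type. A minimal sketch follows; the turn-type labels are assumptions made for illustration, and the model identifiers simply mirror the names used in the list.

```javascript
// Route each turn to a model by turn type. Turn-type labels are hypothetical.
const FAST_MODEL = 'gpt-4o-mini';
const REASONING_MODEL = 'gpt-4o';
const REASONING_TURNS = new Set(['escalation_decision', 'complex_objection']);

function pickModel(turnType) {
  // Anything not explicitly flagged as reasoning-heavy stays on the fast model.
  return REASONING_TURNS.has(turnType) ? REASONING_MODEL : FAST_MODEL;
}

console.log(pickModel('booking'), pickModel('escalation_decision'));
```

Defaulting unknown turn types to the fast model keeps the latency budget as the safe default; the slower model is opt-in per turn type.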
Latency Budget
The target for natural turn-taking is under 1.6 seconds end-to-end from the moment the caller stops speaking to the moment the agent begins its next utterance. Breaking down where that time goes:
| Layer | Component | Target range | Optimisation lever |
|---|---|---|---|
| Transport | RTP jitter + media pipeline | 50–120ms | Edge PoP selection; low-latency carrier |
| STT | Partial transcripts | 150–350ms | Deepgram streaming with endpointing |
| STT | Endpointing decision | 300–600ms | Tune VAD aggressively (see below) |
| LLM | First token (streamed) | 250–500ms | GPT-4o-mini; short system prompt |
| LLM | Full response (streamed) | +200–400ms | Keep output under 30 tokens per turn |
| TTS | First audio chunk | 80–200ms | ElevenLabs Flash; cache static utterances |
| Total | End-to-end | 930–1,570ms | Target < 1.6s for natural pacing |
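A budget like this is only useful if measured turns are checked against it. The sketch below sums per-stage timings for one turn and names the slowest stage. The stage names and sample values are assumptions chosen from within the ranges in the table, and partial transcripts are left out on the assumption that they stream while the caller is still speaking. Adding stages sequentially ignores overlap in a streamed pipeline, so the total is an upper bound.

```javascript
const TARGET_MS = 1600;

// Sum the per-stage timings of one turn and report the slowest stage.
function checkTurnBudget(stageMs, targetMs = TARGET_MS) {
  const entries = Object.entries(stageMs);
  const totalMs = entries.reduce((sum, [, ms]) => sum + ms, 0);
  const [slowestStage] = entries.reduce((a, b) => (b[1] > a[1] ? b : a));
  return { totalMs, overBudget: totalMs > targetMs, slowestStage };
}

// Hypothetical measurements for a single turn, in milliseconds.
const sampleTurn = {
  transport: 90,
  endpointing: 500,
  llmFirstToken: 400,
  llmRemainder: 300,
  ttsFirstChunk: 150
};

console.log(checkTurnBudget(sampleTurn));
```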
The single biggest gain available to most SME deployments is caching TTS utterances. Greetings, confirmations, and opt-out scripts are deterministic: pre-generate them as audio files at deploy time and serve them from a CDN. This saves 200–400ms on the turns that matter most for first impressions. The second biggest gain is reducing LLM output length: a voice agent that says "Great, I've got Tuesday 10:30 or Wednesday 2:00. Which works for you?" has a 0.7s generation time; the same agent that generates a paragraph-long explanation has a 2.4s generation time.
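One way to picture the caching rule: an utterance is served from cache only if it has no per-caller placeholders and was pre-rendered at deploy time, otherwise it goes to live synthesis. This is a sketch under assumptions: the `{{placeholder}}` convention is taken from the call-flow templates in this guide, while the voice ID, cache shape, and CDN URL are hypothetical.

```javascript
// A template with {{placeholders}} varies per caller, so it cannot be pre-rendered.
function isDeterministic(text) {
  return !/\{\{.*?\}\}/.test(text);
}

function cacheKey(voiceId, text) {
  return `${voiceId}:${text}`;
}

// cache is a Map of cacheKey -> audio URL, filled at deploy time.
function resolveAudio(cache, voiceId, text) {
  if (isDeterministic(text)) {
    const url = cache.get(cacheKey(voiceId, text));
    if (url) return { source: 'cache', url };
  }
  // Fall through to live TTS for dynamic or not-yet-rendered utterances.
  return { source: 'synthesize', text };
}

const handoffLine = 'Let me put you through to a member of our team.';
const ttsCache = new Map([
  [cacheKey('voice-1', handoffLine), 'https://cdn.example.com/tts/handoff.mp3']
]);

console.log(resolveAudio(ttsCache, 'voice-1', handoffLine));
console.log(resolveAudio(ttsCache, 'voice-1', 'Hi, is this {{first_name}}?'));
```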
VAD and Endpointing
Endpointing — deciding when the caller has finished speaking — is the layer most SME deployments under-tune and the one with the largest impact on perceived naturalness. Set it too sensitive and the agent interrupts callers mid-sentence. Set it too loose and there's an awkward half-second silence before every agent turn.
The settings that work in production for UK English outbound qualification calls:
- Min speech duration: 180–250ms. Below this, background noise triggers false positives. Above 300ms, short replies such as "yes" or "no" risk being dropped.
- Max pause before endpointing: 450–650ms. The standard 800ms feels slow to UK callers; 350ms interrupts. 500ms is the median that works across accents and phone conditions.
- Noise gate: enabled. UK PSTN background noise (office hum, traffic, hold music bleed) generates enough signal to trigger false positives without it.
- Barge-in: allowed at all times. Suppress current TTS playback immediately on human speech; crossfade and resume STT. Callers who need to interrupt are high-intent callers — don't penalise them with a forced wait.
- TTS playback gate: hold agent audio until confident end-of-user-turn. The 50–80ms buffer before playback is inaudible but eliminates a class of double-talk artefacts.
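To make the interaction between minimum speech duration and maximum pause concrete, the sketch below runs a simplified endpointing rule over a sequence of VAD frames. It is an illustration of the logic, not how any particular STT provider implements it: the frame shape (`{ speech, ms }`) and function names are assumptions.

```javascript
// Returns the time (ms from start) at which end-of-turn fires, or null.
function findEndOfTurn(frames, { minSpeechMs = 200, maxPauseMs = 500 } = {}) {
  let clockMs = 0;      // elapsed time at the end of the current frame
  let speechRunMs = 0;  // length of the current uninterrupted speech run
  let silenceMs = 0;    // silence accumulated since the last confirmed speech
  let heardSpeech = false;

  for (const frame of frames) {
    clockMs += frame.ms;
    if (frame.speech) {
      speechRunMs += frame.ms;
      // Only a run of at least minSpeechMs counts as real speech.
      if (speechRunMs >= minSpeechMs) {
        heardSpeech = true;
        silenceMs = 0;
      }
    } else {
      // A run that ended before reaching minSpeechMs was noise: count it as silence.
      if (speechRunMs > 0 && speechRunMs < minSpeechMs) silenceMs += speechRunMs;
      speechRunMs = 0;
      silenceMs += frame.ms;
      if (heardSpeech && silenceMs >= maxPauseMs) return clockMs;
    }
  }
  return null;
}

// Helper: build a run of 50ms frames.
const run = (speech, totalMs, frameMs = 50) =>
  Array.from({ length: totalMs / frameMs }, () => ({ speech, ms: frameMs }));

// 600ms speech, 300ms pause (too short to end the turn), 400ms speech, then silence.
const utterance = [...run(true, 600), ...run(false, 300), ...run(true, 400), ...run(false, 600)];
console.log(findEndOfTurn(utterance));
```

With a 500ms maximum pause, the 300ms mid-sentence pause does not end the turn; the endpoint fires 500ms into the final silence.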
Recommended VAD configuration
{
  "vad": {
    "min_speech_ms": 200,
    "max_pause_ms": 500,
    "noise_gate_db": -42,
    "barge_in": true,
    "tts_gate_ms": 60
  },
  "endpointing": {
    "silence_duration_ms": 500,
    "utterance_end_ms": 1000,
    "speech_final": true
  }
}
These settings need per-carrier tuning. BT PSTN calls have different noise floors than VoIP-originated calls. Carrier ringback tones are a common false-positive source for answering machine detection (AMD); see the Voicemail Detection guide for the hybrid heuristic that handles ringback and carrier variation.
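Per-carrier tuning is easier to manage as a base configuration plus small overrides than as separate full configs. The sketch below shows that pattern. The base values are the ones recommended above; the override values and the carrier labels (`pstn`, `voip`) are placeholders for illustration only and should be replaced with numbers measured on real calls.

```javascript
const baseVad = {
  min_speech_ms: 200,
  max_pause_ms: 500,
  noise_gate_db: -42,
  barge_in: true,
  tts_gate_ms: 60
};

// Hypothetical overrides: not measured values.
const carrierOverrides = {
  pstn: { noise_gate_db: -38 },
  voip: { noise_gate_db: -45 }
};

function vadConfigFor(carrierType) {
  // Unknown carrier types fall back to the base configuration unchanged.
  return { ...baseVad, ...(carrierOverrides[carrierType] || {}) };
}

console.log(vadConfigFor('pstn'));
```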
Call Flow Design
The biggest architectural mistake in AI voice agent deployments is treating the call flow as a linear script rather than a state machine. Linear scripts break on any deviation — a confused caller, a wrong party, a mid-call connection drop. State machines recover gracefully because every transition is explicit.
The production call flow as a JSON state machine:
{
  "states": {
    "greeting": {
      "say": "Hi, this is Nova from Quantum. Is this {{first_name}}?",
      "on": { "yes": "qualify", "no": "wrong_party", "no_speech": "reprompt", "voicemail": "leave_voicemail" }
    },
    "qualify": {
      "ask": "Are you currently evaluating AI for appointment setting?",
      "capture": ["intent", "timeline", "budget_signal"],
      "on": { "positive": "book", "negative": "objection", "unclear": "clarify" }
    },
    "book": {
      "action": "cal.com:propose_slots",
      "say": "I can book 15 minutes with our team — {{slot_1}} or {{slot_2}}?",
      "on": { "slot_selected": "confirm", "no_slots": "alternative_times", "decline": "objection" }
    },
    "confirm": {
      "action": "cal.com:create_booking",
      "action2": "crm:create_meeting",
      "say": "Booked — you'll get a calendar invite at {{email}}. See you {{slot}}.",
      "on": { "success": "close", "failure": "handoff" }
    },
    "objection": {
      "say": "Understood. Would a short demo video be useful instead?",
      "on": { "yes": "send_video", "no": "opt_out" }
    },
    "wrong_party": {
      "ask": "Apologies — is {{first_name}} available on this number?",
      "on": { "available": "transfer", "not_available": "capture_callback", "unknown": "close" }
    },
    "handoff": { "action": "transfer_to_human", "say": "Let me put you through to a member of our team." },
    "opt_out": { "action": "crm:add_suppression", "say": "No problem — I've removed you from our call list." }
  }
}
Every state that can continue the call has an "on" map that explicitly handles its branches, including failure paths; handoff and opt_out are terminal and carry only an action. Targets such as reprompt, leave_voicemail, clarify, and close are referenced but not defined in the JSON above, and each needs its own entry before the machine will run. The no_speech and voicemail branches in the greeting state are the ones most deployments miss on first build. They're common, and without them the agent either loops or crashes. The out-of-hours path that funnels into an inbound lead routing queue is equally important: route the morning handoff to a chat-native queue, not a shared email inbox no one reads before 10am. The handoff state's transfer-to-human logic (warm whisper, context delivery, CRM write) deserves its own design consideration before you wire the telephony.
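Because every transition is explicit, a state machine of this shape can be checked mechanically before it takes a call. The sketch below lists transition targets that have no matching state definition and resolves the next state for an event. Both function names are hypothetical, and falling back to `handoff` for an unhandled event is an assumed design choice, not something the JSON above specifies. The sample machine is a trimmed copy of the greeting and qualify states.

```javascript
// List transition targets that are referenced but never defined.
function findUndefinedTargets(machine) {
  const defined = new Set(Object.keys(machine.states));
  const missing = new Set();
  for (const state of Object.values(machine.states)) {
    for (const target of Object.values(state.on || {})) {
      if (!defined.has(target)) missing.add(target);
    }
  }
  return [...missing].sort();
}

// Resolve the next state; unhandled events go to a fallback state (assumed: handoff).
function nextState(machine, current, event, fallback = 'handoff') {
  const state = machine.states[current];
  if (!state) throw new Error(`Unknown state: ${current}`);
  const on = state.on || {};
  return Object.prototype.hasOwnProperty.call(on, event) ? on[event] : fallback;
}

const sampleMachine = {
  states: {
    greeting: { on: { yes: 'qualify', no: 'wrong_party', no_speech: 'reprompt', voicemail: 'leave_voicemail' } },
    qualify: { on: { positive: 'book', negative: 'objection', unclear: 'clarify' } },
    handoff: {}
  }
};

console.log(findUndefinedTargets(sampleMachine));
console.log(nextState(sampleMachine, 'greeting', 'yes'));
```

Running the first check in CI turns a missing branch into a failed build instead of a looping call.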
Webhooks and Functions
The orchestration layer — the code that fires when the agent needs to book a slot, create a CRM record, or send a Slack alert — is where most of the business logic lives. Keep it out of the LLM prompt and in deterministic code.
The three functions every appointment-booking voice agent needs:
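// Assumes Express with JSON body parsing. hubspot, cal, db, and slack are the
// project's own client wrappers (not vendor SDK signatures), as are the helpers
// isWithinWorkingHours and formatSlotForSpeech.
const express = require('express');
const app = express();
app.use(express.json());

// Wrap async handlers so a rejected promise returns a 500 instead of leaving
// the agent waiting on a request that never completes.
const safe = (handler) => (req, res) =>
  handler(req, res).catch((err) => {
    console.error('handler failed', { path: req.path, error: err.message });
    if (!res.headersSent) res.sendStatus(500);
  });

// 1. Booking webhook → CRM + calendar
app.post('/webhooks/booking', safe(async (req, res) => {
  const { attendee, start, end, uid, call_sid, transcript_id } = req.body;
  // The three writes are independent, so they run in parallel.
  // Retries are only safe if the wrappers deduplicate on uid (assumed here).
  await Promise.all([
    hubspot.createMeeting({
      subject: 'Voice booking',
      start, end,
      attendees: [attendee.email],
      notes: `Transcript: ${transcript_id}`
    }),
    cal.createBooking({ eventType: '30min', attendee, start, end }),
    db.insert('bookings', { uid, attendee_email: attendee.email, start, end, call_sid })
  ]);
  res.sendStatus(200);
}));

// 2. Slot availability → agent response
app.get('/api/slots', safe(async (req, res) => {
  const { owner_id, timezone } = req.query;
  // Query-string values arrive as strings, so parse the horizon explicitly.
  const days = Number(req.query.horizon_days) || 7;
  const slots = await cal.getAvailability({ owner_id, from: new Date(), days });
  // Offer only two options so the spoken turn stays short.
  const filtered = slots
    .filter(s => isWithinWorkingHours(s.start, timezone))
    .slice(0, 2);
  res.json({ slots: filtered.map(formatSlotForSpeech) });
}));

// 3. Lead capture → CRM + Slack alert
app.post('/api/ingest/lead', safe(async (req, res) => {
  const { name, email, phone, intent, source, call_sid } = req.body;
  const [contact] = await hubspot.upsertContact({ email, phone, name });
  // call_sid ties the CRM activity back to the call record.
  await hubspot.createActivity({ contact_id: contact.id, type: 'voice_lead', source, call_sid });
  await slack.post({
    channel: '#inbound-leads',
    text: `New voice lead: ${name} (${intent}) — <${hubspot.contactUrl(contact.id)}|View in HubSpot>`
  });
  res.json({ success: true, contact_id: contact.id });
}));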
The pattern that matters: each function is idempotent (safe to retry), returns within 2 seconds (or the agent pauses awkwardly), and writes to a structured event log alongside every external call so failures are debuggable. Function schemas passed to the LLM should be as narrow as possible — don't give the agent a function that can do 10 things when it only ever needs to do one.
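The idempotency and 2-second requirements can be enforced with two small wrappers around any function the agent calls. This is a sketch, not a production implementation: `runOnce` and `withTimeout` are hypothetical names, and the in-memory `Map` would need to be replaced by a shared store once more than one process handles webhooks.

```javascript
// In-memory store for the sketch; a shared store is needed across processes.
const inFlightOrDone = new Map();

// Run fn at most once per key. Concurrent retries share the same promise.
async function runOnce(key, fn) {
  if (inFlightOrDone.has(key)) return inFlightOrDone.get(key);
  const pending = Promise.resolve().then(fn);
  inFlightOrDone.set(key, pending);
  try {
    return await pending;
  } catch (err) {
    // Forget failed attempts so a later retry can run again.
    if (inFlightOrDone.get(key) === pending) inFlightOrDone.delete(key);
    throw err;
  }
}

// Reject if the promise has not settled within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A booking call would then look like `withTimeout(runOnce(uid, () => createBooking(payload)), 2000)`, where `createBooking` stands in for whatever function performs the write.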
State Management
State in a production voice agent lives in three places: the LLM's context window (short-term, 8–32k tokens), Redis (in-call state, millisecond access), and Postgres (durable event log). Each has a different role.
The LLM context window holds the conversation history and the current turn's reasoning. It's not a reliable state store — it can be truncated, it has no transactions, and it doesn't survive a service restart or a call transfer. Everything that matters about the current call — captured intent, confirmed slot, prospect contact data, caller consent status — should be written to Redis at the moment of capture.
Redis call state shape:
{
  "call_sid": "CA...",
  "state": "qualify",
  "turns": 3,
  "captured": {
    "name": "Alex Rivera",
    "email": "[email protected]",
    "intent": "demo",
    "timeline": "this_quarter"
  },
  "slot_proposed": "2026-05-21T10:30:00Z",
  "consent_given": true,
  "opted_out": false,
  "transfer_attempted": false
}
Write on every transition. Read at the start of every LLM generation. Expire after max_call_duration + 5 minutes. The Postgres event log captures the full state at each transition and is the audit trail for UK compliance requirements — call recording, consent, opt-out events must all be durable and queryable.
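The write-on-transition rule and the TTL calculation can be sketched against a minimal store interface. Everything named here is an assumption for illustration: the `call:` key prefix, the `set(key, value, ttlSeconds)` / `get(key)` interface, and the 30-minute maximum call duration in the usage line. With Redis, the `set` call would map onto `SET key value EX ttl`; the in-memory store below exists only so the sketch runs on its own.

```javascript
const GRACE_SECONDS = 5 * 60;

function callStateKey(callSid) {
  return `call:${callSid}`;
}

// TTL = max call duration + 5 minutes, so state cannot expire mid-call.
function stateTtlSeconds(maxCallSeconds) {
  return maxCallSeconds + GRACE_SECONDS;
}

// Persist the new state on every transition and refresh the TTL.
async function saveTransition(store, state, nextStateName, maxCallSeconds) {
  const updated = { ...state, state: nextStateName };
  await store.set(callStateKey(state.call_sid), JSON.stringify(updated), stateTtlSeconds(maxCallSeconds));
  return updated;
}

// In-memory stand-in for Redis, used only to make the sketch runnable.
function memoryStore() {
  const data = new Map();
  return {
    async set(key, value, ttlSeconds) {
      data.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
    },
    async get(key) {
      const entry = data.get(key);
      return entry && entry.expiresAt > Date.now() ? entry.value : null;
    }
  };
}

console.log(stateTtlSeconds(30 * 60)); // TTL in seconds for a 30-minute maximum call
```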
Testing and QA
The difference between a voice agent proof-of-concept and a production deployment is a test harness. Without it, every change to the prompt, the state machine, or an external API is a manual regression test across every call path.
The minimum viable test suite for a booking agent:
- Unit tests: each state machine transition, each webhook handler, each slot-formatting function. Jest or pytest with mocked external calls.
- Integration tests: end-to-end booking path with a real (sandboxed) Cal.com and HubSpot. Catches schema drift and API key rotation issues.
- Conversation replay tests: a library of 20–30 transcripts covering the main call paths (positive booking, neutral, objection, wrong party, voicemail, handoff). Run the LLM against each transcript and assert the expected state transition. Catches prompt regressions before they reach production.
- Load tests: simulate 50 concurrent calls to validate that Redis TTL settings, Postgres connection pool sizing, and Twilio rate limits don't create bottlenecks at peak campaign volume.
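A conversation replay test from the list above reduces to a loop: feed each recorded utterance to the classifier, look up the resulting transition, and compare it with the expected state. In the sketch below the classifier is a keyword stub standing in for the real LLM call, and the case format (`name`, `state`, `utterance`, `expected`) is an assumption made for illustration.

```javascript
// Stand-in for the LLM: maps an utterance to an event name.
function stubClassify(utterance) {
  const text = utterance.toLowerCase();
  if (/\b(yes|sure|speaking)\b/.test(text)) return 'yes';
  if (/\b(no|wrong number)\b/.test(text)) return 'no';
  return 'unclear';
}

// Replay each case and record whether the resulting transition matches.
function replay(cases, transitions, classify) {
  return cases.map((c) => {
    const event = classify(c.utterance, c.state);
    const actual = (transitions[c.state] || {})[event] || null;
    return { name: c.name, expected: c.expected, actual, pass: actual === c.expected };
  });
}

const transitions = { greeting: { yes: 'qualify', no: 'wrong_party' } };
const replayCases = [
  { name: 'positive greeting', state: 'greeting', utterance: 'Yes, speaking', expected: 'qualify' },
  { name: 'wrong party', state: 'greeting', utterance: 'No, wrong number', expected: 'wrong_party' }
];

const replayResults = replay(replayCases, transitions, stubClassify);
console.log(replayResults.filter((r) => !r.pass));
```

Swapping `stubClassify` for a function that calls the production prompt turns the same loop into a prompt regression test.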
The grading system for production calls — the QA scorecard that tracks consent, clarity, and booking rate across hundreds of live calls — is covered in detail in the Voice Agent QA Scorecards guide.
What Changed in 2025–2026
Sub-second latency is now achievable. ElevenLabs Flash, Deepgram Nova-2 streaming, and GPT-4o-mini's sub-300ms first-token time combine to deliver natural-feeling conversations below 1 second in optimal conditions. Two years ago, 1.8–2.5 seconds was the realistic floor. The quality gap between AI and human voice has effectively closed for qualification calls — prospects on cold outbound no longer reliably detect the agent as AI.
Managed brokers matured significantly. Retell AI, Vapi, and Bland AI all shipped production-quality barge-in, endpointing, and call recording in 2024–2025. For most SME deployments, the build-vs-buy decision now clearly favours managed brokers — the engineering cost of building the broker layer from scratch is 3–6 weeks versus 1–2 days on a managed platform. The exception is custom telephony integrations (direct SIP trunk, carrier-specific features) or very high volume where per-minute broker costs justify the overhead.
LLM costs dropped 10x. GPT-4o-mini and Claude Haiku now run outbound qualification calls at £0.003–0.008 per call for the LLM layer — well under £0.01 for a 3-minute booking conversation. Total cost of a fully instrumented outbound call (telephony + STT + LLM + TTS + broker) is typically £0.06–0.15 at SME volumes. This makes AI voice agents cost-competitive with human SDR first-touch calls at any scale above a few hundred calls per month.
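The per-call range above scales linearly, so a monthly estimate is simple arithmetic. The sketch below uses the £0.06–0.15 total from the paragraph, held in pence to avoid floating-point drift; the 1,000-call volume is an arbitrary example.

```javascript
// Total per-call cost range from the text, in pence.
const COST_PER_CALL_PENCE = { low: 6, high: 15 };

function monthlyCostPounds(callsPerMonth) {
  return {
    low: (callsPerMonth * COST_PER_CALL_PENCE.low) / 100,
    high: (callsPerMonth * COST_PER_CALL_PENCE.high) / 100
  };
}

console.log(monthlyCostPounds(1000)); // { low: 60, high: 150 }
```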
Voice-RAG became production-viable. The combination of sub-1.6s retrieval from pgvector and streamed LLM generation means a voice agent can answer questions about a specific customer's account, their service history, or a policy document while remaining within conversational latency budgets. This opens up inbound support and account management flows that previously required human agents for anything beyond scripted FAQ responses.
Good / Bad / Ugly
Good. State machine call flows with explicit failure branches. Redis for in-call state, Postgres for the audit trail. Cached TTS for deterministic utterances. Sub-30-token LLM outputs per turn. Barge-in always enabled. Narrow function schemas. Load testing before production launch. QA scorecards from day one.
Bad. Long-form LLM replies in a single turn. Treating the context window as a state store. AMD settings copied from a blog post without carrier-specific tuning. No endpointing configuration, relying on provider defaults instead. Shipping without an integration test suite. Webhooks that call external APIs synchronously and add latency to the agent turn.
Ugly. Voicemail false-positives that deliver the full sales pitch to a beep. DTMF misfires that book the wrong slot because the agent heard "two" and the DTMF signal was "3". Timezone mismatches that book a 10am slot in the system but confirm "10am" to a prospect in a different timezone. A Redis TTL shorter than the longest possible call, causing the agent to lose its state mid-conversation. A production deployment with no test harness that nobody dares change after month two.
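The timezone mismatch is avoidable if slots are stored in UTC and converted to the caller's timezone only at the moment they are spoken. The sketch below does that with the standard `Intl.DateTimeFormat` API. The function name and return shape are hypothetical, and it assumes the caller's IANA timezone is already known.

```javascript
// Convert a UTC slot into the weekday and 24-hour time the caller should hear.
function slotInCallerZone(isoUtc, timeZone) {
  const parts = new Intl.DateTimeFormat('en-GB', {
    timeZone,
    weekday: 'long',
    hour: '2-digit',
    minute: '2-digit',
    hourCycle: 'h23'
  }).formatToParts(new Date(isoUtc));
  const get = (type) => parts.find((p) => p.type === type).value;
  return { weekday: get('weekday'), time: `${get('hour')}:${get('minute')}`, timeZone };
}

// The same stored slot reads differently depending on where the prospect is.
console.log(slotInCallerZone('2026-05-21T10:30:00Z', 'Europe/London'));
console.log(slotInCallerZone('2026-05-21T10:30:00Z', 'America/New_York'));
```

Confirming the time together with its zone in the spoken confirmation and in the calendar invite gives the prospect a chance to catch a mismatch before the meeting.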