Quantum Automations
Guide · Architecture

Voice AI Architecture: A 2025 Implementation Guide

Published May 2026
Topic Architecture · Voice AI
Reading time 12 min
For UK SMEs
On this page
  1. How to decide in 30 seconds
  2. Reference Stack
  3. Latency Budget
  4. VAD and Endpointing
  5. Call Flow Design
  6. Webhooks and Functions
  7. State Management
  8. Testing and QA
  9. What Changed in 2025–2026
  10. Good / Bad / Ugly
  11. FAQ

Voice AI architecture for production deployments in 2025 is a solved problem at the component level and an unsolved problem at the integration level. Every layer — STT, LLM, TTS, telephony, orchestration — has mature, cheap, capable tooling available. What kills SME voice agent deployments isn't the choice of model; it's the latency budget that wasn't designed, the state management that wasn't built, and the QA system that never got shipped.

This guide is the reference architecture we've settled on after building and operating AI voice agents for UK SMEs across financial services, property management, field services, and commercial cleaning. It covers every layer of the stack with the reasoning behind each choice — not a product recommendation list, but the tradeoffs that drove the decisions.

How to decide in 30 seconds

Is this an outbound campaign (appointment booking, follow-up)?
   YES → managed broker (Retell/Vapi) + Twilio + GPT-4o-mini. Continue.
   NO  → inbound flow?

Is this an inbound handler (missed calls, triage, support)?
   YES → same broker stack + tighter barge-in settings. Continue.
   NO  → is this a multi-turn document/knowledge Q&A use case?
         YES → consider voice-RAG hybrid, see the Document RAG guide.
         NO  → clarify the use case before choosing the stack.

Are call volumes > 10,000 minutes/month?
   YES → evaluate SIP trunk peering alongside Twilio retail.
   NO  → Twilio retail is fine. Simplicity beats cost at SME volumes.
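
The decision tree above can be sketched as a small function. This is only an illustration: the function name, the use-case labels, and the return shape are assumptions, not part of any broker's API.

```javascript
// Hypothetical helper mirroring the 30-second decision tree.
// Labels such as 'outbound' and 'knowledge_qa' are illustrative assumptions.
function chooseStack({ useCase, minutesPerMonth }) {
  const brokerStack = { broker: 'Retell or Vapi', llm: 'GPT-4o-mini' };
  let plan;
  switch (useCase) {
    case 'outbound': // appointment booking, follow-up
      plan = { ...brokerStack, notes: [] };
      break;
    case 'inbound': // missed calls, triage, support
      plan = { ...brokerStack, notes: ['tighten barge-in settings'] };
      break;
    case 'knowledge_qa': // multi-turn document or knowledge Q&A
      plan = { notes: ['consider a voice-RAG hybrid'] };
      break;
    default:
      return { decided: false, notes: ['clarify the use case before choosing the stack'] };
  }
  // Above 10,000 minutes/month, SIP trunk peering is worth evaluating.
  const telephony = minutesPerMonth > 10000
    ? 'Twilio retail, with SIP trunk peering under evaluation'
    : 'Twilio retail';
  return { decided: true, ...plan, telephony };
}
```

Calling `chooseStack({ useCase: 'inbound', minutesPerMonth: 2500 })` returns the broker stack with the barge-in note and plain Twilio retail.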

Reference Stack

The stack we deploy most often, chosen for UK availability, documented latency characteristics, and the ability to swap individual layers without rebuilding the whole system:

  • Telephony: Twilio Programmable Voice with UK DID numbers; SIP trunk to WebRTC bridge for custom deployments at scale.
  • STT: Deepgram Nova-2 for English (lowest latency at quality); Google STT as fallback for accented speech or non-English calls. VAD enabled on both; use Deepgram's endpointing API, not Twilio's default.
  • LLM: GPT-4o-mini / GPT-4.1-mini for latency-sensitive turns (qualification, booking, objection handling). Reserve GPT-4o for the turns that need reasoning — escalation decisions, complex objection handling.
  • TTS: ElevenLabs Flash (lowest latency; good quality for UK voices) or Azure Neural TTS (better cost at volume; quality slightly below ElevenLabs). Cache all deterministic utterances — greetings, confirmations, opt-out scripts.
  • Agent runtime / broker: Retell AI for fast deployment; custom Node broker when full control over barge-in and state routing is needed.
  • Orchestration: n8n or Make for CRM, calendar, and notification ops; a dedicated Node or Python service for the hot path (booking, CRM create, Slack alert) where latency matters.
  • State: Redis with call-SID-keyed JSON; TTL = max call duration + 5 minutes. Write on every state transition.
  • Persistence: Postgres for call events, structured outcomes, and audit trail; S3 / GCS for recordings and transcripts.
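
The LLM split in the list above (a small model for latency-sensitive turns, a larger one for reasoning-heavy turns) could be expressed as a simple router. A minimal sketch, assuming the broker tags each turn with a type; the turn-type labels are hypothetical.

```javascript
// Turn types that justify the slower, stronger model (labels are assumptions).
const REASONING_TURNS = new Set(['escalation_decision', 'complex_objection']);

// Returns the model identifier to use for a given turn type.
function pickModel(turnType) {
  return REASONING_TURNS.has(turnType) ? 'gpt-4o' : 'gpt-4o-mini';
}
```

Keeping the routing in deterministic code means a prompt change cannot silently move every turn onto the slower model.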

Latency Budget

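The target for natural turn-taking is under 1.6 seconds end-to-end, measured from the moment the caller stops speaking to the moment the agent begins its next utterance. The table below shows where that time goes. The stages stream into one another, so they overlap and the total is lower than the straight sum of the rows.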

| Layer | Component | Target range | Optimisation lever |
|---|---|---|---|
| Transport | RTP jitter + media pipeline | 50–120ms | Edge PoP selection; low-latency carrier |
| STT | Partial transcripts | 150–350ms | Deepgram streaming with endpointing |
| STT | Endpointing decision | 300–600ms | Tune VAD aggressively (see below) |
| LLM | First token (streamed) | 250–500ms | GPT-4o-mini; short system prompt |
| LLM | Full response (streamed) | +200–400ms | Keep output under 30 tokens per turn |
| TTS | First audio chunk | 80–200ms | ElevenLabs Flash; cache static utterances |
| Total | | 930–1,570ms | Target < 1.6s for natural pacing |
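
One way to put the budget to work is to compare measured stage timings against the upper bounds in the table. A minimal sketch, assuming your tracing already records per-stage durations and a wall-clock total per turn; the stage keys and the function name are hypothetical.

```javascript
// Upper bounds in milliseconds, taken from the latency table.
const STAGE_CEILING_MS = {
  transport: 120,
  stt_partial: 350,
  endpointing: 600,
  llm_first_token: 500,
  llm_full_response: 400,
  tts_first_chunk: 200,
};
const TURN_TARGET_MS = 1600;

// measured: { stageName: ms }, totalMs: wall-clock time for the whole turn.
// The total is passed separately because the stages overlap when streaming.
function auditTurn(measured, totalMs) {
  const overBudget = Object.entries(measured)
    .filter(([stage, ms]) => stage in STAGE_CEILING_MS && ms > STAGE_CEILING_MS[stage])
    .map(([stage, ms]) => ({ stage, ms, ceiling: STAGE_CEILING_MS[stage] }));
  return { withinTarget: totalMs < TURN_TARGET_MS, overBudget };
}
```

A turn can land inside the 1.6s target while one stage is still over its ceiling, which is usually the earliest warning that the budget is about to slip.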

The single biggest gain available to most SME deployments is caching TTS utterances. Greetings, confirmations, and opt-out scripts are deterministic — pre-generate them as audio files at deploy time and serve from CDN. This saves 200–400ms on the turns that matter most for first impressions. The second biggest gain is reducing LLM output length: a voice agent that says "Great — I've got Tuesday 10:30 or Wednesday 2:00. Which works for you?" has a 0.7s generation time; the same agent that generates a paragraph-long explanation has a 2.4s generation time.
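
A cache for deterministic utterances might look like the sketch below. It assumes a `synthesise(text, voiceId)` function that you supply (hypothetical here, wrapping whichever TTS provider you use) and keeps audio in memory; a real deployment would more likely write files at deploy time and serve them from a CDN, as described above.

```javascript
const crypto = require('crypto');

// Cache key: voice plus whitespace-normalised text, hashed with SHA-256.
function ttsCacheKey(text, voiceId) {
  const normalised = text.trim().replace(/\s+/g, ' ');
  return crypto.createHash('sha256').update(`${voiceId}\n${normalised}`).digest('hex');
}

class TtsCache {
  // synthesise: async (text, voiceId) => audio buffer. Supplied by the caller.
  constructor(synthesise) {
    this.synthesise = synthesise;
    this.store = new Map();
  }

  // Returns a promise for the audio; concurrent requests share one synthesis call.
  get(text, voiceId) {
    const key = ttsCacheKey(text, voiceId);
    if (!this.store.has(key)) {
      const pending = Promise.resolve(this.synthesise(text, voiceId)).catch((err) => {
        this.store.delete(key); // do not cache failures
        throw err;
      });
      this.store.set(key, pending);
    }
    return this.store.get(key);
  }

  // Pre-generate greetings, confirmations, and opt-out scripts at deploy time.
  warm(utterances, voiceId) {
    return Promise.all(utterances.map((u) => this.get(u, voiceId)));
  }
}
```

Including the voice ID in the key matters: the same greeting in a different voice is a different audio file.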

VAD and Endpointing

Endpointing — deciding when the caller has finished speaking — is the layer most SME deployments under-tune and the one with the largest impact on perceived naturalness. Set it too sensitive and the agent interrupts callers mid-sentence. Set it too loose and there's an awkward half-second silence before every agent turn.

The settings that work in production for UK English outbound qualification calls:

  • Min speech duration: 180–250ms. Below this, background noise triggers false positives. Above 300ms, short replies from fast callers (a quick "yes") get dropped.
  • Max pause before endpointing: 450–650ms. The standard 800ms feels slow to UK callers; 350ms interrupts. 500ms is the median that works across accents and phone conditions.
  • Noise gate: enabled. UK PSTN background noise (office hum, traffic, hold music bleed) generates enough signal to trigger false positives without it.
  • Barge-in: allowed at all times. Suppress current TTS playback immediately on human speech; crossfade and resume STT. Callers who need to interrupt are high-intent callers — don't penalise them with a forced wait.
  • TTS playback gate: hold agent audio until confident end-of-user-turn. The 50–80ms buffer before playback is inaudible but eliminates a class of double-talk artefacts.

Recommended VAD configuration

{
  "vad": {
    "min_speech_ms": 200,
    "max_pause_ms": 500,
    "noise_gate_db": -42,
    "barge_in": true,
    "tts_gate_ms": 60
  },
  "endpointing": {
    "silence_duration_ms": 500,
    "utterance_end_ms": 1000,
    "speech_final": true
  }
}
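
To make the two core thresholds concrete, here is a minimal endpointing sketch driven by `min_speech_ms` and `max_pause_ms`. It assumes an upstream VAD already labels each audio frame as speech or silence; the class is illustrative and is not how any particular provider implements endpointing.

```javascript
class Endpointer {
  constructor({ min_speech_ms = 200, max_pause_ms = 500 } = {}) {
    this.minSpeechMs = min_speech_ms;
    this.maxPauseMs = max_pause_ms;
    this.reset();
  }

  // Call after each detected end-of-turn to start listening afresh.
  reset() {
    this.speechMs = 0;
    this.silenceMs = 0;
    this.heardSpeech = false;
  }

  // Feed one classified frame. Returns true once the caller's turn has ended.
  push(isSpeech, frameMs) {
    if (isSpeech) {
      this.speechMs += frameMs;
      this.silenceMs = 0;
      if (this.speechMs >= this.minSpeechMs) this.heardSpeech = true;
      return false;
    }
    this.silenceMs += frameMs;
    // A burst shorter than min_speech_ms is treated as noise and discarded.
    if (!this.heardSpeech) this.speechMs = 0;
    return this.heardSpeech && this.silenceMs >= this.maxPauseMs;
  }
}
```

With 20ms frames, a 100ms noise burst never ends a turn, while 300ms of speech followed by 500ms of silence does. The sketch resets the speech counter on any silent frame before the minimum is reached, which is stricter than a production VAD with hangover smoothing would be.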

These settings need per-carrier tuning. BT PSTN calls have different noise floors than VoIP-originated calls. Carrier ringback tones are a common false-positive source for AMD — see the Voicemail Detection guide for the hybrid heuristic that handles ringback and carrier variation.

Call Flow Design

The biggest architectural mistake in AI voice agent deployments is treating the call flow as a linear script rather than a state machine. Linear scripts break on any deviation — a confused caller, a wrong party, a mid-call connection drop. State machines recover gracefully because every transition is explicit.

The production call flow as a JSON state machine:

{
  "states": {
    "greeting": {
      "say": "Hi, this is Nova from Quantum. Is this {{first_name}}?",
      "on": { "yes": "qualify", "no": "wrong_party", "no_speech": "reprompt", "voicemail": "leave_voicemail" }
    },
    "qualify": {
      "ask": "Are you currently evaluating AI for appointment setting?",
      "capture": ["intent", "timeline", "budget_signal"],
      "on": { "positive": "book", "negative": "objection", "unclear": "clarify" }
    },
    "book": {
      "action": "cal.com:propose_slots",
      "say": "I can book 15 minutes with our team — {{slot_1}} or {{slot_2}}?",
      "on": { "slot_selected": "confirm", "no_slots": "alternative_times", "decline": "objection" }
    },
    "confirm": {
      "action": "cal.com:create_booking",
      "action2": "crm:create_meeting",
      "say": "Booked — you'll get a calendar invite at {{email}}. See you {{slot}}.",
      "on": { "success": "close", "failure": "handoff" }
    },
    "objection": {
      "say": "Understood. Would a short demo video be useful instead?",
      "on": { "yes": "send_video", "no": "opt_out" }
    },
    "wrong_party": {
      "ask": "Apologies — is {{first_name}} available on this number?",
      "on": { "available": "transfer", "not_available": "capture_callback", "unknown": "close" }
    },
    "handoff": { "action": "transfer_to_human", "say": "Let me put you through to a member of our team." },
    "opt_out": { "action": "crm:add_suppression", "say": "No problem — I've removed you from our call list." }
  }
}

Every conversational state has an on map that explicitly handles all branches, including failure paths; only the terminal states (handoff, opt_out) omit it. The blueprint above is abridged: targets such as reprompt, leave_voicemail, clarify, and close are referenced but not shown. The no_speech and voicemail branches in the greeting state are the ones most deployments miss on first build. They are common, and without them the agent either loops or crashes. The out-of-hours path that funnels into an inbound lead routing queue is equally important: route the morning handoff to a chat-native queue, not a shared email inbox no one reads before 10am. The handoff state's transfer-to-human logic (warm whisper, context delivery, CRM write) deserves its own design pass before you wire the telephony.
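
A blueprint like this is easy to lint before it ships. The sketch below assumes the flow is loaded as a plain object with the shape shown above; both function names and the `handoff` fallback are assumptions for illustration.

```javascript
// Lists every transition target that has no matching state definition.
function findUndefinedTargets(flow) {
  const defined = new Set(Object.keys(flow.states));
  const missing = new Set();
  for (const state of Object.values(flow.states)) {
    for (const target of Object.values(state.on ?? {})) {
      if (!defined.has(target)) missing.add(target);
    }
  }
  return [...missing].sort();
}

// Resolves the next state; unknown events or dangling targets go to the fallback.
function nextState(flow, current, event, fallback = 'handoff') {
  const target = flow.states[current]?.on?.[event];
  return target && flow.states[target] ? target : fallback;
}
```

Run against the abridged blueprint above, `findUndefinedTargets` would report the eight states it references but does not define (alternative_times, capture_callback, clarify, close, leave_voicemail, reprompt, send_video, transfer). In CI, a non-empty result should fail the build.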

Webhooks and Functions

The orchestration layer — the code that fires when the agent needs to book a slot, create a CRM record, or send a Slack alert — is where most of the business logic lives. Keep it out of the LLM prompt and in deterministic code.

The three functions every appointment-booking voice agent needs:

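// Assumed setup: Express with JSON body parsing. hubspot, cal, db and slack are
// in-house client wrappers (hypothetical module paths), not vendor SDK calls.
const express = require('express');
const { hubspot, cal, db, slack } = require('./clients');
const { isWithinWorkingHours, formatSlotForSpeech } = require('./slots');

const app = express();
app.use(express.json());

// 1. Booking webhook → CRM + calendar
app.post('/webhooks/booking', async (req, res) => {
  const { attendee, start, end, uid, call_sid, transcript_id } = req.body;
  try {
    // Idempotency guard: a retried webhook with the same uid must not double-book.
    // db.findOne is an assumed helper that resolves to null when no row matches.
    if (await db.findOne('bookings', { uid })) return res.sendStatus(200);
    await Promise.all([
      hubspot.createMeeting({
        subject: 'Voice booking',
        start, end,
        attendees: [attendee.email],
        notes: `Transcript: ${transcript_id}`
      }),
      cal.createBooking({ eventType: '30min', attendee, start, end }),
      db.insert('bookings', { uid, attendee_email: attendee.email, start, end, call_sid })
    ]);
    res.sendStatus(200);
  } catch (err) {
    // Log with the call SID so the failure is traceable; a 5xx lets the sender retry.
    console.error('booking webhook failed', { uid, call_sid, error: err.message });
    res.sendStatus(502);
  }
});

// 2. Slot availability → agent response
app.get('/api/slots', async (req, res) => {
  const { owner_id, timezone } = req.query;
  // Query-string values arrive as strings, so coerce the horizon to a number.
  const days = Number(req.query.horizon_days) || 7;
  const slots = await cal.getAvailability({ owner_id, from: new Date(), days });
  const filtered = slots
    .filter(s => isWithinWorkingHours(s.start, timezone))
    .slice(0, 2); // two options per turn keeps the spoken offer short
  res.json({ slots: filtered.map(formatSlotForSpeech) });
});

// 3. Lead capture → CRM + Slack alert
app.post('/api/ingest/lead', async (req, res) => {
  const { name, email, phone, intent, source, call_sid } = req.body;
  // Upserting by email keeps a retried request from creating a duplicate contact.
  const [contact] = await hubspot.upsertContact({ email, phone, name });
  await hubspot.createActivity({ contact_id: contact.id, type: 'voice_lead', source });
  await slack.post({
    channel: '#inbound-leads',
    text: `New voice lead: ${name} (${intent}) - <${hubspot.contactUrl(contact.id)}|View in HubSpot>`
  });
  res.json({ success: true, contact_id: contact.id });
});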

The pattern that matters: each function is idempotent (safe to retry), returns within 2 seconds (or the agent pauses awkwardly), and writes to a structured event log alongside every external call so failures are debuggable. Function schemas passed to the LLM should be as narrow as possible — don't give the agent a function that can do 10 things when it only ever needs to do one.
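
The idempotency and 2-second rules can be enforced with two small wrappers. This is a sketch under assumptions: the in-memory Map stands in for a durable store (Redis or a unique key in Postgres), and both helper names are hypothetical.

```javascript
// Wraps a handler so repeated calls with the same key share one execution.
function idempotent(handler, store = new Map()) {
  return (key, payload) => {
    if (!store.has(key)) {
      // The executor runs the handler immediately and captures synchronous throws.
      const pending = new Promise((resolve) => resolve(handler(payload))).catch((err) => {
        store.delete(key); // allow a retry after a failure
        throw err;
      });
      store.set(key, pending);
    }
    return store.get(key);
  };
}

// Rejects if the wrapped promise has not settled within the deadline.
function withDeadline(promise, ms = 2000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`deadline of ${ms}ms exceeded`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A booking call would then look like `withDeadline(book(uid, payload))`, with the booking `uid` as the idempotency key. When the deadline fires, the agent can say a holding phrase or hand off instead of going silent.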

State Management

State in a production voice agent lives in three places: the LLM's context window (short-term; in practice 8–32k tokens of working history), Redis (in-call state, millisecond access), and Postgres (durable event log). Each has a different role.

The LLM context window holds the conversation history and the current turn's reasoning. It's not a reliable state store — it can be truncated, it has no transactions, and it doesn't survive a service restart or a call transfer. Everything that matters about the current call — captured intent, confirmed slot, prospect contact data, caller consent status — should be written to Redis at the moment of capture.

Redis call state shape:

{
  "call_sid": "CA...",
  "state": "qualify",
  "turns": 3,
  "captured": {
    "name": "Alex Rivera",
    "email": "alex.rivera@example.com",
    "intent": "demo",
    "timeline": "this_quarter"
  },
  "slot_proposed": "2026-05-21T10:30:00Z",
  "consent_given": true,
  "opted_out": false,
  "transfer_attempted": false
}

Write on every transition. Read at the start of every LLM generation. Expire after max_call_duration + 5 minutes. The Postgres event log captures the full state at each transition and is the audit trail for UK compliance requirements — call recording, consent, opt-out events must all be durable and queryable.
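
The write, read, and expire rules can be sketched as two helpers. The example assumes a client with node-redis v4 style `set(key, value, { EX })` and `get(key)` methods; the `call:` key prefix and the function names are assumptions.

```javascript
const TTL_BUFFER_SECONDS = 5 * 60; // the 5-minute buffer on top of max call duration

function callStateKey(callSid) {
  return `call:${callSid}`;
}

function callStateTtl(maxCallSeconds) {
  return maxCallSeconds + TTL_BUFFER_SECONDS;
}

// Call on every state transition; each write refreshes the TTL.
async function saveCallState(client, state, maxCallSeconds) {
  await client.set(callStateKey(state.call_sid), JSON.stringify(state), {
    EX: callStateTtl(maxCallSeconds),
  });
}

// Call at the start of every LLM generation; returns null if the key has expired.
async function loadCallState(client, callSid) {
  const raw = await client.get(callStateKey(callSid));
  return raw ? JSON.parse(raw) : null;
}
```

Deriving the TTL from the maximum call duration in one place avoids the failure mode where someone raises the call limit and forgets the expiry.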

Testing and QA

The difference between a voice agent proof-of-concept and a production deployment is a test harness. Without it, every change to the prompt, the state machine, or an external API is a manual regression test across every call path.

The minimum viable test suite for a booking agent:

  • Unit tests: each state machine transition, each webhook handler, each slot-formatting function. Jest or pytest with mocked external calls.
  • Integration tests: end-to-end booking path with a real (sandboxed) Cal.com and HubSpot. Catches schema drift and API key rotation issues.
  • Conversation replay tests: a library of 20–30 transcripts covering the main call paths (positive booking, neutral, objection, wrong party, voicemail, handoff). Run the LLM against each transcript and assert the expected state transition. Catches prompt regressions before they reach production.
  • Load tests: simulate 50 concurrent calls to validate Redis TTL settings, Postgres connection pool sizing, and Twilio rate limits don't create bottlenecks at peak campaign volume.
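
A conversation replay test can be as small as the sketch below. The keyword-based `stubClassifier` is a stand-in for the real LLM call (which would be asynchronous), and the case format is an assumption; the point is the shape of the assertion, expected next state versus actual.

```javascript
// Stand-in for the LLM: maps (state, caller utterance) to an event name.
function stubClassifier(state, utterance) {
  const text = utterance.toLowerCase();
  if (state === 'greeting') {
    if (text === '') return 'no_speech';
    return /\b(yes|speaking)\b/.test(text) ? 'yes' : 'no';
  }
  if (state === 'qualify') {
    if (/\b(yes|we are|looking)\b/.test(text)) return 'positive';
    if (/\b(no|not interested)\b/.test(text)) return 'negative';
  }
  return 'unclear';
}

// Replays each case through the classifier and the flow; returns the mismatches.
function replay(flow, cases, classify) {
  const failures = [];
  for (const c of cases) {
    const event = classify(c.state, c.utterance);
    const actual = flow.states[c.state]?.on?.[event] ?? null;
    if (actual !== c.expected) failures.push({ name: c.name, expected: c.expected, actual });
  }
  return failures;
}
```

An empty result means every recorded path still lands on the expected state. Swapping `stubClassifier` for the production prompt turns this into the regression gate described above.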

The grading system for production calls — the QA scorecard that tracks consent, clarity, and booking rate across hundreds of live calls — is covered in detail in the Voice Agent QA Scorecards guide.

What Changed in 2025–2026

Sub-second latency is now achievable. ElevenLabs Flash, Deepgram Nova-2 streaming, and GPT-4o-mini's sub-300ms first-token time combine to deliver natural-feeling conversations below 1 second in optimal conditions. Two years ago, 1.8–2.5 seconds was the realistic floor. The quality gap between AI and human voice has effectively closed for qualification calls — prospects on cold outbound no longer reliably detect the agent as AI.

Managed brokers matured significantly. Retell AI, Vapi, and Bland AI all shipped production-quality barge-in, endpointing, and call recording in 2024–2025. For most SME deployments, the build-vs-buy decision now clearly favours managed brokers — the engineering cost of building the broker layer from scratch is 3–6 weeks versus 1–2 days on a managed platform. The exception is custom telephony integrations (direct SIP trunk, carrier-specific features) or very high volume where per-minute broker costs justify the overhead.

LLM costs dropped 10x. GPT-4o-mini and Claude Haiku now run outbound qualification calls at £0.003–0.008 per call for the LLM layer — well under £0.01 for a 3-minute booking conversation. Total cost of a fully instrumented outbound call (telephony + STT + LLM + TTS + broker) is typically £0.06–0.15 at SME volumes. This makes AI voice agents cost-competitive with human SDR first-touch calls at any scale above a few hundred calls per month.

Voice-RAG became production-viable. Fast retrieval from pgvector combined with streamed LLM generation means a voice agent can answer questions about a specific customer's account, their service history, or a policy document while keeping the whole turn inside the 1.6s conversational latency budget. This opens up inbound support and account management flows that previously required human agents for anything beyond scripted FAQ responses.

Good / Bad / Ugly

Good. State machine call flows with explicit failure branches. Redis for in-call state, Postgres for the audit trail. Cached TTS for deterministic utterances. Sub-30-token LLM outputs per turn. Barge-in always enabled. Narrow function schemas. Load testing before production launch. QA scorecards from day one.

Bad. Long-form LLM replies within a single turn. Treating the context window as a state store. AMD settings copied from a blog post without carrier-specific tuning. No endpointing configuration, relying on provider defaults instead. Shipping without an integration test suite. Webhooks that call external APIs synchronously and add latency to the agent turn.

Ugly. Voicemail false-positives that deliver the full sales pitch to a beep. DTMF misfires that book the wrong slot because the agent heard "two" and the DTMF signal was "3". Timezone mismatches that book a 10am slot in the system but confirm "10am" to a prospect in a different timezone. A Redis TTL shorter than the longest possible call, causing the agent to lose its state mid-conversation. A production deployment with no test harness that nobody dares change after month two.
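
The timezone mismatch is the cheapest of these to prevent: store slots in UTC and render the spoken time in the prospect's zone with the zone named. A minimal sketch using the built-in `Intl.DateTimeFormat`; the function name `speakableSlot` and the default zone are assumptions.

```javascript
// Renders a UTC ISO timestamp as a spoken-style string in the given IANA timezone.
function speakableSlot(isoUtc, timeZone = 'Europe/London') {
  const parts = new Intl.DateTimeFormat('en-GB', {
    timeZone,
    weekday: 'long',
    hour: '2-digit',
    minute: '2-digit',
    hourCycle: 'h23',
    timeZoneName: 'short',
  }).formatToParts(new Date(isoUtc));
  // Pick parts out by type so the result does not depend on locale punctuation.
  const get = (type) => parts.find((p) => p.type === type)?.value;
  return `${get('weekday')} at ${get('hour')}:${get('minute')} ${get('timeZoneName')}`;
}
```

For the `slot_proposed` value in the Redis example, `2026-05-21T10:30:00Z`, a London prospect should hear Thursday at 11:30 because of British Summer Time, not 10:30.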

FAQ

What is the minimum latency achievable with a production voice AI stack?

With optimised STT (Deepgram Nova-2), streamed LLM generation (GPT-4o-mini, with a first token in roughly 250–500ms), and TTS streaming (ElevenLabs Flash), end-to-end turn latency of 800–1,200ms is achievable for short responses. Anything under 1.6s feels natural to most callers. The biggest gains come from caching recurring TTS utterances (greetings, confirmations) and keeping per-turn LLM output under 30 tokens.

Should I use Retell AI or build my own voice agent broker?

For most UK SMEs, Retell AI or a similar managed broker (Vapi, Bland AI) is the right starting point: it handles barge-in, endpointing, SIP/WebRTC negotiation, and call recording out of the box. Build your own only when you need custom endpointing algorithms, unusual carrier integrations, or per-call costs that justify the engineering overhead. The managed broker path ships in days rather than weeks and is easier to QA.

How do I handle call state across multiple turns?

Redis with a TTL of your max call duration (typically 30–60 minutes) plus a 5-minute buffer. Store the conversation state as a JSON object keyed by call SID. Write on every state transition, read at the start of each LLM generation. Don't rely on the LLM's context window as the sole state store: it can be truncated, and you need the state to survive function calls, transfers, and recovery paths.

What telephony provider works best for UK outbound voice agents?

Twilio is the default for good reason: UK DID numbers, Programmable Voice, AMD with the DetectMessageEnd mode, and reliable SIP trunk peering. For cost at volume, a direct SIP trunk via BT Wholesale, KCOM, or a dedicated VoIP provider (Voxbone/Bandwidth) cuts per-minute costs by 40–60% over Twilio retail. The engineering investment for SIP trunk management is roughly 3–5 days, which is worth it above about 10,000 minutes/month.

Related Reading

Call-Flow Design for Voice Agents: JSON Blueprints That Ship

How to design, test, and version-control the call-flow JSON driving your voice agent

TTS Caching for Voice Agents: Cutting Latency Below 200ms

How to cache TTS audio for voice agents — chunk strategies, cache keying, CDN vs Redis, and production trade-offs for UK

Voicemail Detection Settings That Actually Work

Production AMD/VAD settings and hybrid heuristics for reliable voicemail detection across UK carriers.

Voice Agent QA Scorecards

How to grade conversations at scale with weighted rubrics and LLM-assisted grading.

© 2026 Quantum Automations Group Ltd