How much does prompt length actually affect latency?

In practice, every 100 tokens of extra prompt costs 15–25ms on GPT-4o-mini and 20–35ms on Claude Haiku at typical traffic levels. That sounds small, but an 800-token system prompt versus a 400-token one adds roughly 60–140ms per turn depending on the model, nearly a full round-trip on a cold Deepgram stream. Trim ruthlessly: if a sentence doesn't change agent behaviour, cut it.

Should we use a smaller model for latency and a larger one for complex turns?

Yes, this is the right instinct. A model-routing layer that sends simple confirmations to GPT-4o-mini (or equivalent) and escalates complex objections to GPT-4o reduces average turn latency by 30–45% without degrading conversion on complex calls. The complication is routing latency itself: add a lightweight classifier and you often trade 40ms of model savings for 40ms of classification overhead. Log turn complexity on live calls first, then tune the routing rule against that data.

Does streaming completions actually help for voice agents?

Yes, but only if your TTS layer can consume partial tokens. Streaming from the LLM to TTS in chunks of 20–30 tokens means the first audio byte starts playing 300–400ms earlier than waiting for the full completion. Retell and VAPI both support streaming LLM-to-TTS pipelines natively. On a custom Twilio stack you wire it yourself — buffer the stream, detect sentence boundaries, flush to ElevenLabs.

What temperature setting works best for voice agent turns?

Temperature 0.0–0.3 for structured turns (booking confirmation, qualification questions, objection responses). Temperature 0.5–0.7 for freeform conversation. Higher temperature costs nothing in latency, but greedy decoding (temperature 0) can be marginally faster on some inference providers. The bigger gain is predictability: low-temperature outputs tend to be shorter, which means lower TTS synthesis time.

Prompt Engineering for Voice Agents: Sub-Second Turn Latency

We cut 380ms from a voice agent's average turn latency last autumn without touching the STT model, the TTS provider, or the telephony stack. The fix was the system prompt.

The agent was deployed for a UK mortgage broker — 200 outbound calls a day, qualifying remortgage leads. It was fast enough: median turn latency of 1.2 seconds, which felt acceptable in testing. In production, callers were hanging up at a 22% rate before the second agent turn. We pulled the recordings and measured: 1.2 seconds felt fine in a quiet office. On a mobile, commuting, it felt like the line had died.

We stripped the system prompt from 740 tokens to 360 tokens, rewrote the function call schema, and trimmed the turn history window. Median turn latency dropped to 820ms. Hang-up rate dropped to 11% in the following week. Same LLM. Same voice. Same telephony.

The lesson is one most teams learn expensively: LLM latency is the dominant variable in the voice agent latency budget, and the prompt is the fastest lever to pull before you start shopping for faster infrastructure.

Why LLM latency dominates the end-to-end budget

A typical voice agent turn breaks down like this:

Stage	Typical UK latency
STT (Deepgram Nova-3, streaming)	80–140ms
LLM inference (GPT-4o-mini, 400-token prompt)	350–550ms
TTS synthesis (ElevenLabs turbo v2.5)	150–280ms
Network round-trips (UK datacenter)	40–80ms
Total	620–1050ms

STT is largely fixed: streaming models start decoding before the speaker finishes. TTS has a floor determined by audio synthesis time. Network latency is infrastructure cost. LLM inference is the only component the prompt author directly controls. On GPT-4o-mini, each additional 100 tokens of context (system prompt plus turn history) adds approximately 18–25ms per inference call at typical traffic loads, a figure we measure directly in our production observability dashboards.

For a detailed breakdown of where to claw back TTS latency specifically, see our TTS caching guide. The voice AI architecture reference covers the full stack in context.

That means a 600-token bloated prompt versus a 300-token tight one costs you 54–75ms per turn, before you even count that longer prompts tend to produce longer completions, which cascade into longer TTS synthesis.
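As a rough worked version of that arithmetic, the sketch below assumes latency scales linearly with prompt tokens over this range, which is only an approximation. The function name is illustrative and the default figures are the 18–25ms range quoted above.

```python
def extra_latency_ms(tight_tokens: int, bloated_tokens: int,
                     ms_per_100_low: float = 18.0,
                     ms_per_100_high: float = 25.0) -> tuple[float, float]:
    """Estimate the added per-turn latency (low, high) of the longer prompt.

    Assumes cost grows linearly with prompt tokens; measure on your own
    stack before relying on the result.
    """
    extra_hundreds = (bloated_tokens - tight_tokens) / 100
    return extra_hundreds * ms_per_100_low, extra_hundreds * ms_per_100_high


low, high = extra_latency_ms(300, 600)
print(f"{low:.0f}-{high:.0f} ms per turn")  # prints "54-75 ms per turn"
```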

Token counting: every 100 unnecessary tokens cost ~20ms

The first audit of any voice agent prompt should be a token count and a line-by-line read-through asking: "does this sentence change what the agent says?"

Common token waste patterns:

Preamble that adds no constraint. "You are a helpful, friendly, professional AI assistant working for..." — the LLM doesn't need its job description. Replace with the first actual constraint: "Qualify callers against these three criteria: budget ≥£150k, property owned, no defaults in last 3 years."

Repeated instructions. "Never say we can do something we can't. Always be honest." If the agent has been instructed not to invent facts elsewhere, delete duplicates. Every repetition is tokens without behavioural delta.

Tone instructions that don't change anything. "Speak naturally and conversationally." The model already does this. Delete it.

Full company backstory. A caller doesn't need the agent to know that the company was founded in 2018 or has offices in Leeds and London unless those facts are relevant to handling objections.

Run your prompt through the tiktoken tokeniser before deploying. Anything above 400 tokens for a qualification or booking agent is worth a line-by-line audit.
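One way to run that audit is sketched below. The counter here is a crude stand-in (roughly four characters per token) so the example runs without dependencies; in a real audit you would pass a tokeniser-backed counter instead. The function names and the 400-token default budget are assumptions for illustration.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in: about 4 characters per token. Swap in a real tokeniser."""
    return max(1, round(len(text) / 4)) if text.strip() else 0


def audit_prompt(prompt: str,
                 count: Callable[[str], int] = rough_token_count,
                 budget: int = 400) -> dict:
    """Return per-line token counts, heaviest first, plus an over-budget flag.

    Per-line counts summed together only approximate the whole-prompt count.
    """
    lines = [ln for ln in prompt.splitlines() if ln.strip()]
    per_line = sorted(((count(ln), ln) for ln in lines), reverse=True)
    total = sum(n for n, _ in per_line)
    return {"total": total, "over_budget": total > budget, "lines": per_line}


prompt = """You are a helpful, friendly, professional AI assistant.
Qualify callers: budget >= 150k, property owned, no defaults in 3 years.
Speak naturally and conversationally."""

report = audit_prompt(prompt)
for tokens, line in report["lines"]:
    print(f"{tokens:>4}  {line}")
```

Sorting heaviest-first puts the most expensive sentences at the top of the read-through, which is where the "does this change what the agent says?" question pays off fastest.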

System prompt architecture for low-latency inference

Structure your system prompt as a decision tree, not a policy document. The model reads the entire prompt at inference time either way, but a tightly structured prompt tends to produce shorter completions, and shorter completions finish sooner.

Here's the pattern we use:

{
  "role": "Inbound qualification agent for [Company]. Call is being recorded.",
  "qualify_on": ["budget_gte_150k", "property_owned", "no_defaults_3yr"],
  "disqualify_action": "end_politely",
  "qualify_action": "book_slot_or_transfer",
  "tone": "plain, direct",
  "forbidden": ["make_promises", "quote_rates", "discuss_competitors"]
}

We pass this JSON as the system message, not prose. JSON prompts are 30–40% shorter than equivalent prose, which means fewer tokens to process on every call, and they produce more structured completions. The tradeoff is that they're harder for non-technical stakeholders to read; for internal tools that's fine.

For conversational agents where naturalness matters more than length, keep prose but use an outline format:

ROLE: [One sentence max]
GOAL: [One sentence max]
QUALIFY IF: [Bullet list, each ≤10 words]
IF NOT QUALIFIED: [One instruction]
IF QUALIFIED: [One instruction]
FORBIDDEN: [Bullet list]

Everything else — backstory, brand values, tone advice — goes in a separate "knowledge base" system message that's only injected on turns where the agent needs it. On most calls, it never needs it.
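A minimal sketch of that conditional injection follows. The trigger phrases and the keyword-matching approach are assumptions for illustration; a production agent might decide on injection differently.

```python
# Hypothetical phrases that suggest the caller is asking about the company.
KB_TRIGGERS = ("who are you", "where are you based", "how long have you", "regulated")


def build_messages(core_prompt: str, knowledge_base: str,
                   history: list[dict], user_text: str) -> list[dict]:
    """Send the core prompt every turn; add the knowledge base only on demand."""
    messages = [{"role": "system", "content": core_prompt}]
    if any(trigger in user_text.lower() for trigger in KB_TRIGGERS):
        messages.append({"role": "system", "content": knowledge_base})
    return [*messages, *history, {"role": "user", "content": user_text}]


plain = build_messages("ROLE: ...", "Founded 2018. Offices in Leeds and London.",
                       [], "Yes, I own the property.")
asked = build_messages("ROLE: ...", "Founded 2018. Offices in Leeds and London.",
                       [], "Are you regulated?")
print(len(plain), len(asked))  # prints "2 3"
```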

Context window hygiene: trimming turn history mid-call

The second biggest source of prompt bloat during a live call is turn history. Most voice frameworks inject the full conversation history as additional messages on every LLM inference call. By turn 6, the agent is sending 1,200 tokens it doesn't need.

The fix is a sliding window with summarisation:

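MAX_HISTORY_TOKENS = 400  # budget for recent turns kept verbatim
SUMMARISE_AT = 600        # total history size that triggers summarisation

def build_context(history: list[dict], system_prompt: str) -> list[dict]:
    # count_tokens() and summarise_turns() are your own helpers: a tokeniser
    # wrapper and a single, cached LLM call respectively.
    system = {"role": "system", "content": system_prompt}
    if count_tokens(history) <= SUMMARISE_AT:
        return [system, *history]
    # Keep the newest turns that fit the verbatim budget (always at least
    # the latest turn) and summarise everything older.
    split = len(history) - 1
    while split > 0 and count_tokens(history[split - 1:]) <= MAX_HISTORY_TOKENS:
        split -= 1
    old, recent = history[:split], history[split:]
    if not old:
        return [system, *history]  # nothing older to summarise
    summary = summarise_turns(old)
    return [
        system,
        {"role": "system", "content": f"[Earlier in call: {summary}]"},
        *recent,
    ]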

The summarise call is itself an LLM inference, so run it asynchronously against the prior two turns while the TTS is playing — it's ready before the caller responds. On a 10-turn qualification call, this shaves 150–200ms from turns 6 onwards.
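The overlap can be sketched with asyncio as below. Both coroutines are hypothetical stand-ins that only sleep: a real summarise_turns would call the LLM and a real play_tts would stream audio to the caller.

```python
import asyncio


async def summarise_turns(turns: list[dict]) -> str:
    """Hypothetical stand-in for a single summarisation LLM call."""
    await asyncio.sleep(0.05)  # simulate inference time
    return f"{len(turns)} earlier turns covered"


async def play_tts(text: str) -> None:
    """Hypothetical stand-in for streaming audio to the caller."""
    await asyncio.sleep(0.1)  # simulate playback time


async def speak_and_summarise(reply: str, old_turns: list[dict]) -> str:
    # Start the summary in the background, then play audio; the two overlap.
    summary_task = asyncio.create_task(summarise_turns(old_turns))
    await play_tts(reply)
    return await summary_task  # usually finished before playback ends


summary = asyncio.run(speak_and_summarise(
    "Great, and do you own the property?",
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}],
))
print(summary)  # prints "2 earlier turns covered"
```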

Function call design: single-intent tools vs multi-intent tools

Function calls are the most common source of latency surprises in voice agent deployments. The agent chooses which function to call, assembles the arguments, and the LLM inference cost scales with how many tools are in scope and how complex each schema is.

Don't do this:

{
  "name": "handle_lead",
  "parameters": {
    "action": {"enum": ["qualify", "disqualify", "book", "transfer", "send_sms", "log_note", "end_call"]},
    "reason": "string",
    "slot_time": "string",
    "note_text": "string"
  }
}

A seven-value enum and multiple optional parameters mean the model must reason about which combination is appropriate. That's slower and more error-prone.

Do this instead:

[
  {"name": "qualify_lead", "parameters": {}},
  {"name": "disqualify_lead", "parameters": {"reason": "string"}},
  {"name": "book_slot", "parameters": {"slot_iso": "string"}},
  {"name": "transfer_to_human", "parameters": {}}
]

Single-intent tools with minimal parameters. On a qualification agent, 3–4 tools is the right number. The model picks faster, the completion is shorter, and the JSON it returns is smaller — which means faster parsing in your webhook handler.
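On the handler side, single-intent tools reduce to a plain lookup table. The sketch below reuses the four tool names above; the handler bodies and the shape of the tool_call dict (a name plus an arguments mapping) are assumptions, so adapt them to whatever your platform actually delivers.

```python
from typing import Callable


def qualify_lead() -> str:
    return "qualified"


def disqualify_lead(reason: str) -> str:
    return f"disqualified: {reason}"


def book_slot(slot_iso: str) -> str:
    return f"booked {slot_iso}"


def transfer_to_human() -> str:
    return "transferring"


# One handler per tool name; no branching on an "action" enum.
HANDLERS: dict[str, Callable[..., str]] = {
    fn.__name__: fn
    for fn in (qualify_lead, disqualify_lead, book_slot, transfer_to_human)
}


def dispatch(tool_call: dict) -> str:
    """Route a parsed tool call to its handler; unknown names fail loudly."""
    handler = HANDLERS.get(tool_call["name"])
    if handler is None:
        raise ValueError(f"unknown tool: {tool_call['name']}")
    return handler(**tool_call.get("arguments", {}))


print(dispatch({"name": "book_slot", "arguments": {"slot_iso": "2026-03-03T14:00:00Z"}}))
```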

Model and temperature choices that trade quality for speed

The right model for voice agents in 2026 is not the most capable model — it's the fastest model that handles the complexity of your call flows without hallucinating.

For qualification and booking flows (structured, bounded decisions):

GPT-4o-mini or Claude Haiku 4.5: 300–450ms median inference at typical traffic. Handles structured function calls well.
Temperature 0.1–0.2: Reduces output variance, tends to produce shorter completions.
max_tokens 150–200: Cap completions hard. Voice turns are short. A 500-token response to "are you the homeowner?" is a bug.

For objection handling and complex conversations, escalate in-call:

def choose_model(turn_complexity: str) -> str:
    if turn_complexity == "simple":
        return "gpt-4o-mini"
    return "gpt-4o"  # escalate; accept latency hit

Log turn types in production. We've found that for a typical qualification call, 78% of turns are "simple" (confirmations, single-question answers, slot booking) and only 22% require complex reasoning. Routing simple turns to a faster model cuts average latency by 34% on those deployments.
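A rule-based classifier is one way to label turns without paying for an extra model call. The sketch below is a starting point only: the hint phrases and the eight-word threshold are assumptions that should be replaced with patterns from your own logged calls. It returns the same "simple" / "complex" labels that choose_model expects.

```python
import re

# Hypothetical markers of an objection or a question that needs reasoning.
COMPLEX_HINTS = re.compile(
    r"\b(but|why|however|not sure|too expensive|already have|compare)\b",
    re.IGNORECASE,
)


def classify_turn(user_text: str, max_simple_words: int = 8) -> str:
    """Label a caller turn as 'simple' or 'complex' with cheap string checks."""
    is_short = len(user_text.split()) <= max_simple_words
    if is_short and not COMPLEX_HINTS.search(user_text):
        return "simple"
    return "complex"


print(classify_turn("Yes, that's right"))                     # prints "simple"
print(classify_turn("I'm not sure, my current rate is fine"))  # prints "complex"
```

Because it is only a regex and a word count, the overhead is negligible compared with the classifier cost discussed in the FAQ, at the price of cruder decisions.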

Measuring latency in production: what to instrument and what to ignore

You can't tune what you don't measure. Most voice agent deployments rely on the platform's headline "response time" metric — which typically measures time from end-of-speaker-turn to start-of-agent-audio. That's useful but insufficient.

The metrics worth instrumenting per turn:

latency_event = {
    "call_id": call_id,
    "turn_number": turn_n,
    # All timestamps are milliseconds taken from the same clock
    "stt_final_ms": stt_end - speaker_end,        # How long STT took
    "llm_first_token_ms": first_token - stt_end,  # Time to first LLM token
    "llm_full_ms": llm_end - stt_end,             # Total LLM time
    # Measured from the first token, not llm_end: with streaming, audio
    # starts before the completion finishes, so llm_end would go negative
    "tts_first_audio_ms": audio_start - first_token,  # Buffering + TTS first chunk
    "total_turn_ms": audio_start - speaker_end,   # End-to-end
    "prompt_tokens": prompt_token_count,
    "completion_tokens": completion_token_count,
    "model": model_id,
    "function_called": function_name or None,
}

When streaming is enabled, llm_first_token_ms is the LLM's share of the latency callers actually perceive; total_turn_ms is the whole of it. The llm_full_ms figure does not affect perceived latency if the first sentence of audio is playing before the completion finishes. Track both: first-token latency is your streaming health check; full completion time is your billing and context-size indicator.
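Capturing those timestamps consistently is easier with a small helper. The class below is a sketch, not part of any voice framework: TurnTimer and its method names are made up for illustration, and the clock is injectable so the example can run with fixed values.

```python
import time
from typing import Callable


class TurnTimer:
    """Record named events for one turn and report spans in milliseconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # monotonic, so wall-clock adjustments can't skew spans
        self.marks: dict[str, float] = {}

    def mark(self, event: str) -> None:
        self.marks[event] = self._clock() * 1000  # seconds -> milliseconds

    def span_ms(self, start: str, end: str) -> float:
        return self.marks[end] - self.marks[start]


# Fixed clock values (in seconds) stand in for a real call.
ticks = iter([0.0, 0.11, 0.43, 0.62])
timer = TurnTimer(clock=lambda: next(ticks))
for event in ("speaker_end", "stt_end", "first_token", "audio_start"):
    timer.mark(event)

print(round(timer.span_ms("stt_end", "first_token")))      # LLM first token
print(round(timer.span_ms("speaker_end", "audio_start")))  # end-to-end
```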

Log prompt_tokens per turn. Plot it over call duration. If you see prompt tokens growing linearly (from 400 tokens on turn 1 to 1,200 tokens on turn 8), your context hygiene isn't working. The plot should be flat after the first 2–3 turns once your sliding window is active.
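The same check can be automated rather than eyeballed from a plot. This sketch computes average growth per turn after the window should have settled; the function name, the three-turn settle period and the sample figures are illustrative assumptions.

```python
def prompt_growth_per_turn(prompt_tokens: list[int], settle_turns: int = 3) -> float:
    """Average prompt-token growth per turn once the window should be active."""
    tail = prompt_tokens[settle_turns:]
    if len(tail) < 2:
        return 0.0  # call too short to judge
    return (tail[-1] - tail[0]) / (len(tail) - 1)


leaky = [400, 510, 620, 740, 850, 960, 1080, 1200]  # no trimming in effect
flat = [400, 520, 640, 690, 700, 695, 705, 700]     # sliding window working

print(prompt_growth_per_turn(leaky))  # prints 115.0
print(prompt_growth_per_turn(flat))   # prints 2.5
```

A value near zero means the window is holding; anything close to the size of a typical turn means history is still accumulating.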

Weekly latency review: look at P95 turn latency, not just median. The median masks the tail — and it's the tail (1.8+ seconds) that causes hang-ups. Any P95 above 1.5 seconds is worth a prompt audit before anything else.
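To make the median-versus-tail point concrete, here is a small nearest-rank percentile over per-turn latencies. The sample data is invented for illustration: eighteen healthy turns and two slow ones.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


turn_ms = [780, 810, 790, 805, 820, 760, 830, 795, 815, 800,
           770, 825, 785, 840, 775, 850, 765, 835, 1900, 2100]

print(percentile(turn_ms, 50))  # prints 805
print(percentile(turn_ms, 95))  # prints 1900
```

The median looks comfortably sub-second while the P95 is above the 1.5-second audit threshold, which is exactly the pattern a median-only dashboard hides.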

What changed in 2025–2026: streaming completions and speculative decoding

Two developments in the past 12 months materially change the prompt engineering calculus.

Streaming completions to TTS — all major inference providers now support token-by-token streaming. Retell and VAPI wire this through natively; on custom stacks you buffer the stream and flush to ElevenLabs when you hit a sentence boundary. This doesn't reduce inference time, but it moves the first-audio-byte metric from "completion finished" to "first sentence completed", cutting perceived latency by 300–500ms on longer responses. For voice agents, this is now the default; batch completion is an antipattern.
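For custom stacks, the buffer-and-flush step might look like the sketch below. It is provider-agnostic: the input is any iterable of text chunks and the output is whole sentences ready to hand to a TTS client. The regex is a simplification that will mis-split abbreviations such as "Mr. Smith", and a sentence is only released once the next chunk arrives or the stream ends.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


stream = ["Thanks", " for", " confirming", ".", " Do", " you",
          " own", " the", " property", "?"]
for sentence in sentences_from_stream(stream):
    print(sentence)  # each sentence would be sent to TTS here
```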

Speculative decoding is available natively in some open-source inference servers. It uses a small "draft" model to propose tokens that the larger model then verifies in parallel, cutting wall-clock inference time by 20–40% without changing output quality. As of mid-2026 this is available in llama.cpp for self-hosted stacks and beginning to appear in hosted APIs. Worth evaluating before reaching for a full stack change.
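The draft-then-verify loop is easier to see in a toy. The sketch below implements the greedy variant with two stand-in "models" that are plain functions; real inference servers verify all draft tokens in a single batched forward pass and use a probabilistic acceptance rule, neither of which is shown here. All names are illustrative.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[str]], str]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: list[str], max_new: int, k: int = 4) -> list[str]:
    """Greedy speculative decoding: output equals target-only greedy decoding."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: list[str] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each position; keep its token either way and
        #    stop at the first disagreement.
        for guess in proposal:
            correct = target(out)
            out.append(correct)
            if guess != correct:
                break
    return out[len(prompt):]


PHRASE = "thanks for confirming do you own the property".split()


def target_model(context: Sequence[str]) -> str:
    return PHRASE[len(context) % len(PHRASE)]


def draft_model(context: Sequence[str]) -> str:
    # Agrees with the target except at every fifth position.
    n = len(context)
    return "um" if n % 5 == 4 else PHRASE[n % len(PHRASE)]


tokens = speculative_decode(target_model, draft_model, prompt=[], max_new=8)
print(" ".join(tokens))  # prints "thanks for confirming do you own the property"
```

The speed-up comes from how many draft tokens are accepted per verification round: the better the draft model matches the target, the fewer rounds are needed.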

A counterpoint worth reading: Fireworks AI's analysis of speculative decoding overhead suggests the gains disappear under concurrent load unless the spec model is carefully matched to the target model — their benchmark showed reversal at 50+ concurrent sessions.

Good / Bad / Ugly: prompt patterns from production deployments

Good: short, structured, function-driven. System prompt under 350 tokens, 3–4 single-intent tools, temperature 0.1. Turn history capped at 400 tokens with async summarisation. Streaming enabled. Median turn latency 780ms on GPT-4o-mini from UK region.

Bad: long policy doc, multi-intent functions, no context trim. System prompt 900 tokens of brand guidelines and tone advice, one handle_everything function with 12 parameters, full turn history injected every call. Median turn latency 1.6 seconds. Callers hang up; the team blames the voice or the script.

Ugly: wrong model for the task. A team running Claude Opus for a booking agent because "it sounds smarter." Opus is exceptional for complex reasoning tasks. For "is Tuesday at 2pm okay?" it's 4× the cost and 3× the latency of Haiku on the same task. We've seen this more than once, including a similar pattern on the voice AI and document analysis deployment. It is usually inherited from a pilot that used the best available model to demonstrate quality, then went to production without benchmarking alternatives.

The principle is: match the model to the complexity of the decision, not the prestige of the deployment. A voice agent booking appointments for a plumber doesn't need the same model as one handling complex financial objections.

Start with the prompt. Measure token count before and after every change. Log per-turn latency in production. That loop — measure, trim, measure again — is cheaper than a hardware upgrade and faster than a model switch.

Related Reading

ElevenLabs vs Cartesia vs PlayHT for UK Voice Agents

Deepgram vs Whisper vs AssemblyAI for UK Voice Agents

TTS Caching for Voice Agents: Cutting Latency Below 200ms

Voice AI Architecture: A 2025 Implementation Guide
