On the third day of a financial services outbound campaign, we noticed something odd in the call recordings. The agent's greeting — "Hi, this is Nova with Quantum Automations" — had a 340ms silence before the first word on every cold call. Warm calls (where the number had been called before) had no gap at all. The difference was the TTS cache: warm calls hit a Redis entry we'd pre-generated at deploy time; cold calls waited for ElevenLabs to synthesise on demand.
Three hundred and forty milliseconds is nothing in human time. In phone conversation time, it's the gap between "this feels like a real call" and "this is a robot." We saw it in the data: the cold-call answer rate was 4.1 percentage points lower than the warm-call answer rate on the same list, same agent, same time of day. The only variable was whether the greeting audio was pre-cached.
TTS caching is one of the highest-ROI engineering decisions in a voice agent stack. It's also one of the least discussed — most tutorials focus on LLM latency because that's where the hard problems are. But the TTS layer is entirely deterministic for a large fraction of what an agent says, and deterministic outputs are cacheable. Here's how we do it.
Why 300ms of avoidable TTS latency is a conversion problem
A typical latency target for a voice agent turn is roughly 800–1,200ms end to end, split across STT endpointing (300–600ms), LLM generation (250–800ms streamed), and TTS generation (120–400ms). Those ranges sum to 670–1,800ms, so a turn only stays inside the target when most stages land near their lower bound. The conversation feels natural when it's under 1,000ms; it starts to feel mechanical at 1,200ms; it's clearly robotic at 1,500ms+.
The problem with TTS is that the 120–400ms range isn't fixed. A short, common phrase like "Got it, let me check that for you" generates in around 120ms on ElevenLabs with a fast model. A longer, unusual phrase with punctuation may take 380ms. For the agent's opening line — the moment that determines whether the prospect stays on the call — this variance is unacceptable.
Cache the greeting and the first 15 or so fixed phrases the agent uses, and you cut that variance to near-zero. The audio is already in memory. You're paying for a Redis GET, not an API call.
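To make the budget arithmetic concrete, here is a minimal sketch that sums the per-stage ranges quoted above and compares them with a turn whose TTS stage is served from cache. The 1–4ms cached figure is an assumption based on a typical Redis GET, and the function and field names are illustrative only.

```javascript
// Per-stage latency ranges in milliseconds, taken from the figures quoted above.
const stages = {
  stt: { min: 300, max: 600 },
  llm: { min: 250, max: 800 },
  tts: { min: 120, max: 400 },
};

// Assumed cost of serving cached audio: a Redis GET instead of a TTS API call.
const cachedTts = { min: 1, max: 4 };

// Sum the best-case and worst-case latency across all stages of a turn.
function turnBudget(stageMap) {
  return Object.values(stageMap).reduce(
    (total, s) => ({ min: total.min + s.min, max: total.max + s.max }),
    { min: 0, max: 0 }
  );
}

const uncached = turnBudget(stages);
const cached = turnBudget({ ...stages, tts: cachedTts });

console.log(`uncached: ${uncached.min}-${uncached.max}ms`);
console.log(`cached:   ${cached.min}-${cached.max}ms`);
```

The worst case is where the cache earns its keep: removing up to 400ms of TTS time is the difference between a turn that overshoots the target badly and one that overshoots it slightly.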
How TTS generation actually works (and where the cache slot fits)
When an agent turn completes — as laid out in our voice AI architecture reference stack — the pipeline looks like:
STT → endpointing → LLM generates text → TTS API call → audio streamed back → sent to RTP
The TTS call is a synchronous HTTP request (or WebSocket stream) to ElevenLabs, Azure TTS, or equivalent. Latency depends on the API provider's inference queue, your network path, and the length of the text. Pre-generation replaces that API call with a local memory lookup.
The cache slot sits between LLM output and audio delivery:
LLM output → cache lookup → hit: serve cached audio
LLM output → cache lookup → miss: call TTS API → cache the result → serve audio
For deterministic phrases — anything the LLM is instructed to say verbatim under specific conditions — this hit rate can reach 70–80% of all TTS calls in an outbound campaign. That's the bulk of greetings, qualifications, confirmations, and error recovery phrases.
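The hit/miss flow above can be sketched as a single get-or-generate function. This is an illustration under stated assumptions, not our production code: a Map stands in for Redis so the example runs on its own, and `synthesise` is a hypothetical placeholder for whichever TTS SDK you use.

```javascript
// In-memory stand-in for Redis so the sketch runs without external services.
const cache = new Map();

// Hypothetical TTS call: replace with your provider's SDK. Returns audio bytes.
async function synthesise(text) {
  return Buffer.from(`audio:${text}`);
}

// Return cached audio on a hit; on a miss, synthesise, store, then return.
async function getAudio(key, text, tts = synthesise) {
  const hit = cache.get(key);
  if (hit !== undefined) return { audio: hit, source: "cache" };
  const audio = await tts(text);
  cache.set(key, audio);
  return { audio, source: "tts" };
}
```

Returning the `source` alongside the audio is a small design choice that makes hit-rate instrumentation trivial to add later.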
Chunk-level vs phrase-level vs full-turn caching
Three caching granularities, roughly ordered from safest to most aggressive:
Phrase-level (recommended for most deployments). Cache entire pre-defined utterances. "Hi, this is Nova with Quantum Automations. Good time for a quick chat?" is one cache entry. High hit rate, zero stitching complexity, no audio artefacts.
Chunk-level. Break phrases into sub-phrases: "Hi, this is Nova" + "with Quantum Automations" + "Good time for a quick chat?" Each chunk is cached independently. Better theoretical reuse if phrases share sub-phrases, but audio stitching creates a slight pause at joins. Avoid unless your phrase library is enormous and the overlap justifies the complexity.
Full-turn caching. Cache the audio for an entire dynamically generated turn. Low hit rate (the LLM output varies), high storage cost, and stale audio risk if the agent's persona changes. Don't do this unless your agent is scripted end-to-end with no dynamic generation.
For most UK SME deployments — 8–20 fixed utterances per agent — phrase-level caching is the right choice.
Cache keying: what makes a good cache key for voice audio
A naive cache key might be just the text string. That's wrong in production, and here's why: if you upgrade your ElevenLabs voice model or change the voice ID, the text hash still hits the old entry and serves audio that sounds different from the current voice. The prospect hears a different voice mid-campaign.
A correct cache key includes three components:
{
  "key_schema": "tts:{voice_id}:{model_version}:{text_sha256}",
  "example": "tts:pNInz6obpgDQGcFmaJgB:eleven_turbo_v2_5:a3f7c2e1b8d94f56",
  "ttl_seconds": 604800,
  "encoding": "opus",
  "sample_rate": 24000
}
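The text_sha256 is the SHA-256 hash of the exact text string, lowercased and whitespace-normalised, written as a 64-character hex digest (the example above is shortened to 16 characters). Apply the same normalisation at lookup time, or the runtime key will never match the pre-generated one. If your voice pronounces some words differently depending on case (acronyms are the usual suspects), skip the lowercasing step. Include speed and stability parameters from your ElevenLabs VoiceSettings object in the hash if you tune them per campaign.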
Set TTL to 7 days (604,800 seconds) for stable phrases. Greetings and confirmations don't change; a week of cache freshness is fine. For phrases that reference the campaign (e.g., product names) reduce to 24 hours and build a cache-bust call into your deployment pipeline.
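A minimal sketch of the key construction described above, using Node's built-in crypto module. The function names (`normalise`, `buildCacheKey`, `ttlSeconds`) and the way settings are folded into the hash are assumptions for illustration; the only fixed part is the key schema itself.

```javascript
const { createHash } = require("node:crypto");

// Lowercase, trim, and collapse runs of whitespace so equivalent strings share a key.
function normalise(text) {
  return text.toLowerCase().trim().replace(/\s+/g, " ");
}

// Build a key following the tts:{voice_id}:{model_version}:{text_sha256} schema.
// `settings` (e.g. speed, stability) is optional and folded into the hash when present.
function buildCacheKey({ voiceId, modelVersion, text, settings = null }) {
  const hash = createHash("sha256").update(normalise(text), "utf8");
  if (settings) {
    // Sort keys so { speed, stability } and { stability, speed } hash identically.
    const sorted = Object.keys(settings).sort().map((k) => `${k}=${settings[k]}`);
    hash.update(`|${sorted.join(",")}`);
  }
  return `tts:${voiceId}:${modelVersion}:${hash.digest("hex")}`;
}

// TTL choice from the text above: 7 days for stable phrases, 24 hours for campaign-specific ones.
function ttlSeconds(isCampaignSpecific) {
  return isCampaignSpecific ? 86400 : 604800;
}
```

Because the voice ID and model version sit in the key prefix rather than the hash, you can also find or delete every entry for a retired model with a simple prefix scan.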
Redis vs CDN vs in-process: where to put the cache
| Layer | Latency | Max size | Best for |
|---|---|---|---|
| In-process (Node Map) | <1ms | ~100 MB | Top 5–10 most-hit phrases, single-instance |
| Redis | 1–4ms | Unlimited (RAM cost) | All cached phrases, multi-instance |
| CDN (CloudFront, Cloudflare R2) | 5–50ms (cold), 1–3ms (warm) | Unlimited (cheap) | Pre-generated library, infrequent updates |
For most voice agent deployments running on a single Node process, start with an in-process Map for the top 10 phrases and Redis for the rest. CDN caching makes sense when you have a large pre-generated phrase library (100+ utterances) and want to share it across multiple agent servers without Redis memory cost.
Don't overthink the hierarchy. Redis + phrase-level caching cuts TTS latency by 300–380ms for the phrases that matter. That's the result, and you can get there in an afternoon.
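For the Map-plus-Redis arrangement, a two-tier lookup might look like the sketch below. The class name, the `remote` interface (any object with async `get` and `set`), and the in-memory fake are all assumptions so the example runs without a Redis server. Note that eviction here is plain insertion order, which is simpler than tracking the most-hit phrases.

```javascript
// Two-tier lookup: a small in-process Map in front of a shared store.
// `remote` is any object with async get(key) and set(key, value); in production
// that would wrap your Redis client.
class TieredAudioCache {
  constructor(remote, maxLocalEntries = 10) {
    this.local = new Map();
    this.remote = remote;
    this.maxLocalEntries = maxLocalEntries;
  }

  // Check the local tier first, then the remote tier, promoting remote hits.
  async get(key) {
    if (this.local.has(key)) return { audio: this.local.get(key), tier: "local" };
    const audio = await this.remote.get(key);
    if (audio == null) return { audio: null, tier: "miss" };
    this.promote(key, audio);
    return { audio, tier: "remote" };
  }

  // Write through to the remote tier and keep a local copy.
  async set(key, audio) {
    await this.remote.set(key, audio);
    this.promote(key, audio);
  }

  // Keep the local tier bounded by evicting the oldest inserted entry.
  promote(key, audio) {
    if (!this.local.has(key) && this.local.size >= this.maxLocalEntries) {
      const oldest = this.local.keys().next().value;
      this.local.delete(oldest);
    }
    this.local.set(key, audio);
  }
}

// In-memory fake standing in for Redis.
function createFakeRemote() {
  const store = new Map();
  return {
    get: async (key) => store.get(key) ?? null,
    set: async (key, value) => { store.set(key, value); },
  };
}
```

With a 15-phrase library you could simply set `maxLocalEntries` high enough to hold everything and treat Redis as the warm-start source for new instances.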
Pre-generating the high-hit phrases
The highest-value move is pre-generating fixed phrases at deploy time, before any calls run. This warms the cache and means the first call of the day doesn't pay TTS latency either. If you're running an outbound campaign, the 2025 appointment playbook covers how script structure and phrase reuse interact — the phrases you pre-generate should match your call-flow script segments.
Build a pre-generation script that reads the agent's phrase library (a simple YAML or JSON config), calls TTS for each phrase with your production voice settings, and writes the result to Redis with the correct key schema:
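import { createHash } from "node:crypto";

// Assumes `redis`, `elevenLabs`, VOICE_ID and MODEL_VERSION are set up elsewhere.
// Lowercase and collapse whitespace so runtime lookups build the same key.
const normalise = (text) => text.toLowerCase().trim().replace(/\s+/g, " ");
const sha256 = (text) => createHash("sha256").update(text, "utf8").digest("hex");

const phrases = [
  "Hi, this is Nova with Quantum Automations — I'm an AI assistant.",
  "Is now a good time for a quick 30-second chat?",
  "Got it, let me grab that for you.",
  "I'll have someone from the team follow up today.",
  "No problem at all — should I call back at a better time?",
  "Perfect, I've booked that in — you'll get a confirmation by email.",
  "Thanks for your time. Have a good one.",
];

for (const phrase of phrases) {
  const hash = sha256(normalise(phrase));
  const key = `tts:${VOICE_ID}:${MODEL_VERSION}:${hash}`;
  // Skip phrases that are already cached so redeploys stay cheap.
  const existing = await redis.get(key);
  if (!existing) {
    // Request the model named in the key so key and audio cannot drift apart.
    // `modelId` is an assumed option on this wrapper; map it to your SDK's parameter.
    const audio = await elevenLabs.textToSpeech({
      text: phrase,
      voiceId: VOICE_ID,
      modelId: MODEL_VERSION,
    });
    await redis.set(key, audio, { ex: 604800 }); // 7-day TTL
    console.log(`Pre-generated: ${phrase.slice(0, 40)}…`);
  }
}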
Run this as part of your CI/CD pipeline or agent deployment script. It takes under 30 seconds for a 20-phrase library and means every cold-start begins with a warm cache.
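One operational detail that's easy to miss: store the pre-generated audio as Opus-encoded bytes, not MP3. Opus only supports 8, 12, 16, 24, and 48 kHz sample rates, so use 24 kHz rather than 22,050 Hz. Opus has better compression at the same perceptual quality: a 4-second greeting at 24kbps Opus is roughly 12 KB (24,000 bits/s × 4 s ÷ 8), while the same clip at 128kbps MP3 is 64 KB. At 500 calls per day, each serving 12 cached phrases, the storage difference is negligible, but the Redis GET latency is marginally better for smaller payloads. One caveat for Twilio: Media Streams carries 8 kHz µ-law audio (audio/x-mulaw), not Opus, so Opus entries have to be transcoded before they reach the call. If Twilio is your only transport, it is simpler to cache the µ-law bytes directly (ElevenLabs offers a ulaw_8000 output format) and skip the per-call transcode. Either way: pre-encode once, serve efficiently.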
Also worth instrumenting: cache hit rate per phrase. Set up a simple counter increment on hit and miss. After two weeks you'll know which phrases are actually benefiting from the cache and which ones (agent persona-specific slots, dynamic data inserts) are never hitting. This lets you prune the cache to the phrases that earn their storage and stops you wasting pre-generation time on phrases the agent rarely says in a given campaign configuration.
A practical addition: log the cache miss latency alongside the hit latency. If your Redis GET takes 1ms and your ElevenLabs call takes 320ms, a cache hit is 320× faster. Report this ratio monthly to justify the Redis hosting cost — which at SME call volumes (200–600 calls/day) is usually a £10–20/month Redis Cloud instance. The ROI case writes itself from the latency delta and the reduction in ElevenLabs API spend, since every cache hit is a TTS call you don't pay for.
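The hit-rate counter and the latency comparison can share one small recorder. The sketch below keeps everything in memory and uses made-up names (`CacheMetrics`, `record`, `report`); in practice you would likely push the same numbers to whatever metrics system you already run.

```javascript
// Per-phrase hit/miss counters plus latency totals, kept in memory for the sketch.
class CacheMetrics {
  constructor() {
    this.byPhrase = new Map();
  }

  // Record one lookup: `hit` is a boolean, `latencyMs` the time taken to obtain the audio.
  record(phraseId, hit, latencyMs) {
    const m = this.byPhrase.get(phraseId) ?? { hits: 0, misses: 0, hitMs: 0, missMs: 0 };
    if (hit) {
      m.hits += 1;
      m.hitMs += latencyMs;
    } else {
      m.misses += 1;
      m.missMs += latencyMs;
    }
    this.byPhrase.set(phraseId, m);
  }

  // Hit rate and mean latencies for one phrase; null where there is no data yet.
  report(phraseId) {
    const m = this.byPhrase.get(phraseId);
    if (!m) return null;
    const total = m.hits + m.misses;
    const avgHitMs = m.hits ? m.hitMs / m.hits : null;
    const avgMissMs = m.misses ? m.missMs / m.misses : null;
    return {
      hitRate: m.hits / total,
      avgHitMs,
      avgMissMs,
      // Ratio of miss latency to hit latency, e.g. 320 for 320ms vs 1ms.
      speedup: avgHitMs && avgMissMs ? avgMissMs / avgHitMs : null,
    };
  }
}
```

Phrases whose `hitRate` stays near zero after two weeks are the ones to drop from the pre-generation list.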
What changed in 2025–2026: streaming TTS and partial audio delivery
The caching calculus shifted slightly in 2025 when ElevenLabs released streaming WebSocket delivery with first-chunk latency under 100ms, and when Cartesia launched Sonic, their latency-optimised TTS model targeting 90ms to first audio byte.
For genuinely novel utterances — dynamic slot-fills, names, specific numbers — streaming TTS now delivers the first audio chunk fast enough that caching those utterances isn't worth the complexity. The practical implication: cache deterministic phrases (greetings, confirmations, error handling), stream dynamic ones. Don't cache dynamic content hoping for a hit rate that won't materialise.
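One way to express the "cache deterministic, stream dynamic" rule is a router that checks each utterance against the fixed phrase library. This is a sketch with assumed names (`createRouter`, `route`); it only decides which path to take and leaves the cache lookup and the streaming call to the rest of your pipeline.

```javascript
// Lowercase, trim, and collapse whitespace so lookups match the cache key normalisation.
function normalise(text) {
  return text.toLowerCase().trim().replace(/\s+/g, " ");
}

// Build a router from the list of verbatim phrases you pre-generate.
function createRouter(phraseLibrary) {
  const fixedPhrases = new Set(phraseLibrary.map(normalise));
  // Returns "cache" for known fixed phrases and "stream" for everything else.
  return function route(utterance) {
    return fixedPhrases.has(normalise(utterance)) ? "cache" : "stream";
  };
}

const route = createRouter([
  "Got it, let me grab that for you.",
  "Thanks for your time. Have a good one.",
]);

console.log(route("Got it, let me grab that for you."));
console.log(route("Your appointment is at 3pm on Tuesday."));
```

Anything containing a name, date, or number falls through to streaming automatically, because it will never match a library entry exactly.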
The counterpoint worth acknowledging: Resemble AI's research argues that aggressive audio pre-buffering (pre-generating multiple plausible next utterances and discarding all but the one the LLM selects) can cut latency below streaming for even dynamic content. It works, but it's complex to implement and wastes TTS API budget. Phrase-level caching is simpler and achieves the same result for the 70–80% of TTS calls that are deterministic.
Good / Bad / Ugly
Good: Pre-generating the top 15 fixed phrases at deploy time. In our flagship voice AI and document analysis deployment, caching the greeting and confirmation phrases across 10 simultaneous outbound agents dropped average first-word latency from 380ms to 40ms per agent. Near-zero latency for every greeting, confirmation, and recovery phrase. Trivial to implement, major perceptual improvement, no ongoing engineering overhead.
Bad: Caching full dynamic turns. Hit rate drops to under 5%, storage cost goes up, and when the LLM varies its output by one word (which it will), you serve stale audio. You spend more on Redis and still pay TTS for most turns.
Ugly: Forgetting to include voice model version in the cache key, then upgrading ElevenLabs models mid-campaign. Every cached entry now serves audio from the old model. The agent sounds like it has two voices. Customers notice. You spend a day debugging why calls suddenly sound different before tracing it to the cache. This is the single most common cache-correctness error we see in voice agent QA. If you're running voice agent QA scorecards and noticing audio quality regression mid-campaign, a stale TTS cache after a model version bump is a prime candidate.
Want a voice stack where the first word hits before the prospect even thinks to hang up? Book a 30-minute audit and we'll show you exactly where your latency is going.