Quantum Automations
Guide · Voice AI

Deepgram vs Whisper vs AssemblyAI for UK Voice Agents

Published: June 2026
Topic: Voice Agents · STT Comparison
Reading time: 10 min
Audience: UK SME ops leads
On this page
  1. What STT actually does in a voice agent stack (and why it's not the bottleneck you think)
  2. Deepgram Nova-3 vs Flux: when to use each in a UK deployment
  3. AssemblyAI Universal-3: the entity accuracy case for names and policy numbers
  4. Whisper: the real trade-offs of an offline model in a real-time pipeline
  5. UK carrier audio: how BT, EE, and Vodafone ringback patterns affect WER
  6. Latency benchmarks from UK datacentres: the numbers we actually measured
  7. Data residency: SOC 2, UK GDPR, and where your audio goes
  8. Our recommendation by use case: qualification calls vs IVR vs transcription-only
  9. FAQ

We ran the same 2,000 UK outbound calls through three speech-to-text providers last quarter. Deepgram Flux came back at 180ms median. AssemblyAI Universal-3 took 310ms but extracted policy numbers correctly 91% of the time, against 76% for Flux. Whisper missed 14% of Scottish accents on the first pass and ran at a 950ms median, latency we could not absorb. The right STT choice depends on what your agent does with the transcript, not on any benchmark paper written against American read-speech corpora.

What STT actually does in a voice agent stack (and why it's not the bottleneck you think)

In a real-time voice agent, audio arrives as compressed chunks from the telephony layer. The STT engine converts those chunks to text incrementally, and your agent logic acts on partial or final transcripts to decide whether to interrupt, keep listening, or trigger a tool call.

The latency budget for a natural phone conversation is roughly 500ms from the caller finishing a sentence to the agent starting its reply: 50ms telephony round-trip, 150–300ms STT, 80–150ms LLM inference, 50ms TTS streaming start. That leaves almost no margin, which is why STT choice matters even when people assume it's cheap.
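The arithmetic behind that budget is worth making explicit. The sketch below sums the component ranges quoted above; the names BUDGET_MS, COMPONENTS, and turn_latency_range are illustrative, not part of any SDK.

```python
# Component ranges in milliseconds, taken from the paragraph above.
BUDGET_MS = 500
COMPONENTS = {
    "telephony": (50, 50),
    "stt": (150, 300),
    "llm": (80, 150),
    "tts_start": (50, 50),
}


def turn_latency_range(components):
    """Return (best_case, worst_case) total turn latency in ms."""
    best = sum(low for low, _ in components.values())
    worst = sum(high for _, high in components.values())
    return best, worst


best, worst = turn_latency_range(COMPONENTS)
print(f"best {best}ms, worst {worst}ms, headroom at worst {BUDGET_MS - worst}ms")
```

With every component at the top of its range the total is 550ms, 50ms over budget, and STT is the widest single range in the sum.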

What most benchmark papers miss: they test against clean, studio-recorded English. UK carrier audio arrives at 8kHz G.711 PCMU, sometimes with compression artefacts that smear consonants. Published WER scores routinely overstate accuracy by 15–20% relative to live calls.

STT is also not the only accuracy problem. Even a perfectly transcribed sentence containing "policy number AB-447-Delta" is useless if your downstream logic cannot extract the entity reliably. That distinction — raw transcription accuracy versus entity-level accuracy — is what separates Deepgram and AssemblyAI in practice.

For how STT fits into the broader call-flow architecture, see our voice AI architecture overview.

Deepgram Nova-3 vs Flux: when to use each in a UK deployment

Deepgram ships two current production models: Nova-3 and Flux. They serve different latency profiles.

Nova-3 is the general-purpose model. It handles accents reasonably well, streams reliably over WebSocket, and costs less per hour than Flux. On clean UK English — received pronunciation, typical home-counties call centre voices — Nova-3 WER sits around 4–6% in our tests.

Flux is the streaming-optimised model released in late 2025. It trades a small accuracy reduction on edge-case accents for significantly lower first-word latency. Flux delivers transcripts in roughly 80ms from utterance end, compared to 140ms for Nova-3. The end-to-end medians in our benchmark table are higher (180ms and 240ms), but the gap is the same 60ms, and on a 500ms budget that difference is meaningful.

Which to use:

  • Outbound qualification calls where the caller population is mostly standard UK English: Flux
  • Inbound IVR where callers may be elderly, regional, or speaking with background noise: Nova-3 or AssemblyAI Universal-3
  • Any use case where you cannot afford 310ms latency: Flux by default

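The Deepgram streaming WebSocket config we use for UK calls, shown here with Nova-3. Nova-3 and Flux are separate models, so set model to the identifier Deepgram documents for the one you deploy and check that each option below is supported on it:

{
  "model": "nova-3",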
  "language": "en-GB",
  "encoding": "mulaw",
  "sample_rate": 8000,
  "channels": 1,
  "punctuate": true,
  "interim_results": true,
  "utterance_end_ms": 1200,
  "vad_events": true,
  "keywords": ["Quantum:2", "policy:1.5"]
}

The utterance_end_ms of 1200 is higher than Deepgram's default 1000ms because UK callers on mobile tend to have longer inter-phrase pauses than US benchmarks assume. Dropping it below 1000ms causes premature cutoffs on sentences with rising intonation — common in Northern English and Welsh speech.
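Streaming options like these are normally sent as query-string parameters on the WebSocket URL. A minimal sketch of that mapping follows, assuming the wss://<host>/v1/listen path; build_listen_url is a hypothetical helper, the model value is a placeholder, and authentication is left out.

```python
from urllib.parse import urlencode


def build_listen_url(params: dict, host: str = "api.deepgram.com") -> str:
    """Turn a config dict into a streaming URL with query-string parameters."""
    flat = {}
    for key, value in params.items():
        if isinstance(value, bool):
            # Query strings carry booleans as lowercase text.
            flat[key] = "true" if value else "false"
        else:
            flat[key] = value
    # doseq=True repeats list-valued keys: keywords=a&keywords=b
    return f"wss://{host}/v1/listen?{urlencode(flat, doseq=True)}"


config = {
    "model": "nova-3",  # placeholder: use the identifier for your model
    "language": "en-GB",
    "encoding": "mulaw",
    "sample_rate": 8000,
    "interim_results": True,
    "utterance_end_ms": 1200,
    "keywords": ["Quantum:2", "policy:1.5"],
}
url = build_listen_url(config)
print(url)
```

Passing host="api.eu.deepgram.com" points the same config at the EU endpoint mentioned in this post.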

What changed in 2025–2026: Deepgram launched Flux in Q4 2025 with a dedicated low-latency streaming path and an EU processing endpoint. The EU region reduced our median latency from 210ms to 180ms for UK-origin calls by eliminating the US data-centre round-trip. That alone justified the migration from Nova-2.

AssemblyAI Universal-3: the entity accuracy case for names and policy numbers

AssemblyAI Universal-3 does not win on raw latency. At 310ms median, it is 130ms slower than Deepgram Flux. But it wins convincingly on entity accuracy — specifically on alphanumeric strings, proper nouns, and compound identifiers.

On a sample of 500 calls where agents asked callers to confirm a policy number (format: two letters, three digits, one NATO phonetic word), Universal-3 extracted the correct identifier 91% of the time. Nova-3 managed 79%, Flux 76%. The gap comes from AssemblyAI's named entity recognition running as part of the transcription pipeline rather than as a post-processing step.
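Whichever provider extracts the identifier, it is cheap to validate the result against the known format before acting on it. The sketch below is one way to do that for the two-letters, three-digits, NATO-word format; normalise_policy_id and NATO_WORDS are illustrative names, and spoken digits ("four four seven") are deliberately not handled.

```python
import re
from typing import Optional

# NATO phonetic alphabet, upper-cased, with common spelling variants.
NATO_WORDS = {
    "ALFA", "ALPHA", "BRAVO", "CHARLIE", "DELTA", "ECHO", "FOXTROT",
    "GOLF", "HOTEL", "INDIA", "JULIETT", "JULIET", "KILO", "LIMA",
    "MIKE", "NOVEMBER", "OSCAR", "PAPA", "QUEBEC", "ROMEO", "SIERRA",
    "TANGO", "UNIFORM", "VICTOR", "WHISKEY", "XRAY", "YANKEE", "ZULU",
}

_POLICY_ID = re.compile(r"^([A-Z]{2})(\d{3})([A-Z]+)$")


def normalise_policy_id(raw: str) -> Optional[str]:
    """Return the identifier as 'AB-447-Delta', or None if it does not fit."""
    # Drop spaces, hyphens, and punctuation so "a b 4 4 7 delta" also matches.
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    match = _POLICY_ID.match(compact)
    if match is None:
        return None
    letters, digits, word = match.groups()
    if word not in NATO_WORDS:
        return None
    return f"{letters}-{digits}-{word.capitalize()}"


print(normalise_policy_id("a b 4 4 7 delta"))
print(normalise_policy_id("AB-47-Delta"))
```

A None result is the signal to ask the caller to repeat the number rather than write a malformed identifier downstream.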

For use cases involving:

  • Insurance policy numbers
  • NHS reference numbers
  • Customer surnames with non-standard spelling
  • Vehicle registration plates

AssemblyAI Universal-3 is the correct choice regardless of the latency cost, because a wrong entity means the downstream agent acts on bad data. A 130ms delay is recoverable. Booking an appointment against the wrong policy number is not.

AssemblyAI also supports keyword boosting with higher weight ranges than Deepgram, and its entity detection integrates with the transcript object:

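import assemblyai as aai

aai.settings.api_key = "YOUR_KEY"

# Placeholder: a URL or local path to a recorded call.
audio_url = "https://example.com/call-recording.wav"

config = aai.TranscriptionConfig(
    speech_model=aai.SpeechModel.best,
    entity_detection=True,  # attach detected entities to the transcript
    word_boost=["policy", "reference", "registration"],
    boost_param="high",  # weight applied to the word_boost terms
    language_code="en_gb",
)

transcriber = aai.Transcriber(config=config)
# transcribe() submits the audio and blocks until the transcript is ready.
transcript = transcriber.transcribe(audio_url)

if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)

# entities can be None when nothing was detected.
for entity in transcript.entities or []:
    print(entity.entity_type, entity.text)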

This gives you structured entity objects you can pass directly to your CRM update logic, rather than running a regex over raw text hoping the model transcribed "AB-447-Delta" correctly.

Our voice AI document analysis case study shows how structured entity extraction from voice calls feeds downstream automation without a manual QA step.

Whisper: the real trade-offs of an offline model in a real-time pipeline

OpenAI Whisper is the default recommendation for people who want open-source STT with no per-minute billing. It deserves a fair accounting rather than a dismissal.

Whisper Large-v3 was trained on far more audio than the 680,000 hours behind the original Whisper release (OpenAI's model card cites about 1 million hours of weakly labelled audio plus 4 million hours of pseudo-labelled audio) and handles Scottish and Welsh English better than Nova-3 in offline testing. If you are processing recordings after the fact (QA scoring, compliance transcription, generating CRM notes from completed calls), Whisper is a credible option, especially if you are already running OpenAI infrastructure.

The problems start when you try to use Whisper in a real-time pipeline:

Latency. Whisper is not a streaming model. It processes complete audio segments. Even with whisper.cpp optimised for a 4-core instance, inference on a 10-second chunk takes 800–1,200ms. You cannot meet a 500ms turn-taking budget with Whisper unless you chunk aggressively and accept mid-sentence cutoffs.

Chunking errors. When you force real-time behaviour by chopping audio into 3-second segments, Whisper produces errors at segment boundaries — particularly on sentences that span chunks. This is a known limitation documented in the Whisper GitHub repo.

Infrastructure cost. Running Whisper at scale requires GPU instances. For a UK SME running 500–2,000 calls per day, GPU cost exceeds Deepgram's per-minute pricing above roughly 800 daily calls.

When Whisper wins: post-call async transcription where you have 30 seconds to produce a transcript rather than 300ms; or any scenario where you are already paying for OpenAI compute and the marginal cost of Whisper inference is near zero.

UK carrier audio: how BT, EE, and Vodafone ringback patterns affect WER

This section is not in any STT vendor's documentation. We learned it through a painful debugging week.

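UK carrier networks use G.711 at 8kHz, the same baseline as the US PSTN. The UK PSTN uses the A-law variant (PCMA) rather than the μ-law variant (PCMU) used in North America, but Twilio media streams deliver μ-law to your application either way. The difference is in how carriers handle: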

  1. Ringback tone injection. BT's ringback tone at 400Hz + 450Hz bleeds into the first 50–100ms of a live answer on some exchange configurations. If your STT session starts before the call is confirmed live, those tones appear in the audio stream. Deepgram's VAD handles this correctly with vad_events: true. AssemblyAI needs you to trim the first 150ms server-side.

  2. EE mobile compression artefacts. EE uses AMR-NB on 4G voice calls, which degrades to effectively 4.75kbps under congestion. The characteristic distortion is consonant smearing on fricatives — "f", "th", "s". On our dataset, EE-sourced calls had 2.3% higher WER than landline across all three providers. Nova-3 degraded the least; Whisper degraded most.

  3. Vodafone hold music bleed. When a caller is briefly placed on hold and returns, Vodafone sometimes injects hold music into the first half-second of resumed audio. This confuses VAD silence detection and can cause the STT session to end prematurely.

The fix for items 1 and 3: implement a short audio buffer (150ms) at the telephony layer before passing audio to STT. The fix for item 2: use Nova-3 or Universal-3 rather than Flux when your caller population is heavily mobile. See our prompt engineering for voice agent latency post for how to balance these buffers against your turn-taking budget.
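One reading of that 150ms buffer is a trimmer that discards the first 150ms of each stream before forwarding audio to STT. The sketch below assumes 8kHz G.711 audio (one byte per sample); LeadingTrimmer and bytes_for_ms are hypothetical names, not part of any telephony SDK.

```python
SAMPLE_RATE_HZ = 8000   # G.711 telephony audio
BYTES_PER_SAMPLE = 1    # G.711 encodes each sample as a single byte


def bytes_for_ms(ms: int) -> int:
    """Number of audio bytes covering `ms` milliseconds."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * ms // 1000


class LeadingTrimmer:
    """Discard the first `trim_ms` of a stream, then pass audio through."""

    def __init__(self, trim_ms: int = 150):
        self._remaining = bytes_for_ms(trim_ms)

    def feed(self, chunk: bytes) -> bytes:
        """Return the part of `chunk` that should be forwarded to STT."""
        if self._remaining == 0:
            return chunk
        dropped = min(self._remaining, len(chunk))
        self._remaining -= dropped
        return chunk[dropped:]


trimmer = LeadingTrimmer()
frames = [bytes(160) for _ in range(8)]  # eight 20ms frames of dummy audio
forwarded = b"".join(trimmer.feed(frame) for frame in frames)
print(len(forwarded))
```

For the hold-music case you would create a fresh trimmer when the call resumes, so the first 150ms after the hold is dropped as well.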

Latency benchmarks from UK datacentres: the numbers we actually measured

Test conditions: 2,000 outbound calls placed via Twilio UK, audio streamed via WebSocket to each STT endpoint, all endpoints in EU/UK regions. Latency measured from utterance-end detection to final transcript delivery.
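For reference, median and P95 can be computed from raw per-utterance latencies as sketched below. This uses the nearest-rank percentile definition, which is an assumption about method, and the sample values are made up for illustration rather than taken from our dataset.

```python
import math
import statistics


def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least pct% of samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarise(latencies_ms):
    """Median and P95 of a list of latency samples in milliseconds."""
    return {
        "median": statistics.median(latencies_ms),
        "p95": percentile_nearest_rank(latencies_ms, 95),
    }


# Illustrative values only.
samples = [170, 175, 180, 185, 190, 200, 210, 250, 300, 320]
print(summarise(samples))
```

Reporting P95 alongside the median matters because a caller notices the slow turns, not the typical one.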

Provider   | Model            | Median (ms) | P95 (ms) | Entity accuracy | WER (UK EN)
Deepgram   | Flux             | 180         | 310      | 76%             | 5.2%
Deepgram   | Nova-3           | 240         | 420      | 79%             | 4.8%
AssemblyAI | Universal-3      | 310         | 580      | 91%             | 4.1%
OpenAI     | Whisper Large-v3 | 950         | 1,400    | 88%             | 3.9%

Entity accuracy is measured on policy-number extraction from the structured identifier format described earlier. WER is on a held-out set of 200 calls with manual transcripts.
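WER here means word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the manual transcript. A minimal sketch of that calculation follows; word_error_rate is an illustrative name, and the lower-casing and whitespace tokenisation are simplifying assumptions rather than a description of our exact scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("please confirm your policy number",
                      "please confirm you policy"))
```

The example has one substitution and one deletion against five reference words, so it scores 0.4.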

Whisper's WER is competitive but its latency disqualifies it from real-time use. AssemblyAI's entity accuracy advantage is large enough to matter for financial services or insurance use cases. Deepgram Flux wins on latency but you pay for it in entity handling.

One caveat: these numbers reflect our specific call population (UK outbound, insurance and financial services, mixed age demographics). Your numbers will differ if callers skew younger, are predominantly mobile, or use a specific regional dialect.

Data residency: SOC 2, UK GDPR, and where your audio goes

Under UK GDPR, audio recordings of phone calls are personal data. The six-month recording retention you might use for QA purposes requires a documented lawful basis, and the audio must not be transferred to a third country without adequate protection.

Here is where each provider stands as of June 2026:

Deepgram: SOC 2 Type II certified. EU processing endpoint at api.eu.deepgram.com. Self-hosted option available for strict UK-only requirements. Deepgram's data processing agreement covers GDPR Article 28. Audio is not retained by default (zero data retention mode available).

AssemblyAI: SOC 2 Type II certified. EU endpoint processing available. Audio deleted after transcription by default. Their compliance documentation lists sub-processors — review before onboarding.

Whisper (self-hosted): Full control over data residency since you run the model. This is the real compliance argument for Whisper — audio never leaves your infrastructure. For highly regulated sectors (healthcare, legal), the GPU overhead may be worth the compliance simplicity.

The ICO's guidance on voice recording and data protection is the authoritative UK source — not vendor compliance pages. Read it, document your lawful basis, then choose a provider that fits.

For a counterpoint on the data residency argument — specifically the view that cloud STT data retention risks are overstated relative to the operational benefits — see this analysis from Privacy International on cloud audio processing. Their argument is worth reading even if you disagree.

Our UK compliance overview covers PECR and GDPR obligations for voice agents in more detail.

Our recommendation by use case: qualification calls vs IVR vs transcription-only

Outbound qualification calls

Use Deepgram Flux. Speed is the primary variable. Qualification calls are short (90–180 seconds), the vocabulary is predictable, and entity extraction is minimal. Tune utterance_end_ms to 1200ms and enable keyword boosting for your product names.

Inbound IVR and customer service

Use AssemblyAI Universal-3. Callers are unpredictable — regional accents, background noise, elderly speakers, fast dictation of reference numbers. The extra 130ms of latency is acceptable in an IVR context where callers expect a slightly longer response pause. Entity detection on policy numbers and names is worth the cost.

Post-call transcription and QA scoring

Use Whisper Large-v3 on a small GPU instance, or AssemblyAI Universal-3 at async pricing. Latency is irrelevant. Accuracy on nuanced speech matters. If you are already using OpenAI for your QA scoring LLM, running Whisper in the same infrastructure makes operational sense. See our voice agent QA scorecards post for how we structure the QA pipeline.

Failure modes to watch for

Good: Deepgram Flux on a clean BT landline call, standard UK English, simple entities. Transcripts arrive at 180ms, agent responds inside 500ms, call feels natural.

Bad: Deepgram Flux on an EE mobile call with a strong Glaswegian accent and a policy number to capture. WER climbs to 12–15%, entity extraction fails, agent asks for the number to be repeated twice. Caller hangs up.

Ugly: Whisper in a "real-time" setup with 3-second chunks and no streaming, on a call where the caller pauses mid-sentence. Whisper processes the first chunk, returns a truncated transcript, the agent interrupts, the caller is confused, and the call flow breaks entirely. We have seen this exact failure in three client pilots before recommending a hard rule: Whisper is async-only.

The pattern for avoiding the "bad" and "ugly" cases: route calls by expected caller profile at the telephony layer before they reach STT. A mobile number, or a landline with a Scottish area code, routes to AssemblyAI Universal-3. A direct-dial London landline gets Flux. UK mobile numbers carry no area code, so region can only be inferred from landline prefixes. This requires call flow design that treats STT as a configurable component, not a hard dependency.
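A routing rule of this kind can be as simple as a prefix check on the caller's number. The sketch below is illustrative only: choose_stt and the returned labels are hypothetical, the Scottish prefix list is a small sample (Edinburgh, Glasgow, Aberdeen), and the Nova-3 fallback is our assumption for numbers that match no rule. Because UK mobile numbers carry no area code, it treats every mobile the same way and keys the regional rule on landline prefixes.

```python
# E.164 prefixes; a sample of Scottish geographic codes, not a full list.
SCOTTISH_LANDLINE_PREFIXES = ("+44131", "+44141", "+441224")
LONDON_LANDLINE_PREFIX = "+4420"
UK_MOBILE_PREFIX = "+447"


def choose_stt(caller_e164: str) -> str:
    """Pick an STT provider label from the caller's E.164 number."""
    if caller_e164.startswith(UK_MOBILE_PREFIX):
        return "assemblyai-universal-3"   # mobile audio: favour accuracy
    if caller_e164.startswith(SCOTTISH_LANDLINE_PREFIXES):
        return "assemblyai-universal-3"   # regional accent more likely
    if caller_e164.startswith(LONDON_LANDLINE_PREFIX):
        return "deepgram-flux"            # clean landline: favour latency
    return "deepgram-nova-3"              # assumed general-purpose fallback


print(choose_stt("+442079460000"))
```

An area code is only a proxy for accent, so treat the rule as a starting point and adjust it against your own per-route WER.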

FAQ

Does Deepgram support UK GDPR-compliant data residency for call recordings?

Deepgram offers EU-region processing through its self-hosted deployment option and, for cloud customers, a dedicated EU endpoint at api.eu.deepgram.com. Under UK GDPR, audio may leave the UK only for a country covered by UK adequacy regulations or under appropriate safeguards such as the International Data Transfer Agreement. The EEA is covered by adequacy regulations, so Deepgram's EU region meets that test, but it does not give you UK-only storage. For strict UK-only residency, the self-hosted option running on AWS eu-west-2 (London) or Azure UK South is the correct path. Always confirm the current DPA and sub-processor list directly with Deepgram before signing a contract, as these change with product updates.

Which STT provider handles UK regional accents — Scottish, Welsh, Northern English — most accurately?

Among the real-time options, AssemblyAI Universal-3 performs best on Scottish and Welsh accents in our testing, with roughly 8–10% lower WER than Deepgram Nova-3 on those specific variants. Deepgram Flux improved significantly with its 2025 multilingual update but still trails on broad Scottish vowel sounds. Whisper Large-v3 is the strongest of all on Welsh English when given clean audio, but its 800–1,200ms latency makes it impractical for real-time voice agents. For a live outbound agent calling across UK regions, AssemblyAI Universal-3 with keyword boosting on brand names gives the most consistent result.

What is the real-time latency difference between Deepgram Flux and AssemblyAI Universal-3 on UK carrier audio?

In our batch of 2,000 UK outbound calls processed through London-region infrastructure, Deepgram Flux returned transcripts at 180ms median end-to-end latency versus AssemblyAI Universal-3 at 310ms. The gap widens on longer utterances: for sentences over 15 words, Deepgram held at around 200ms while AssemblyAI reached 380ms. That 130–180ms difference is significant if your agent needs to respond within a 500ms turn-taking budget, but it's largely irrelevant for async transcription use cases like QA scoring or CRM note generation.

Can I switch STT providers without rewriting my voice agent's call-flow logic?

Yes, if you build to an abstraction layer from day one. The pattern we use wraps the STT call behind a single interface that accepts a raw audio chunk and returns a normalised transcript object with confidence, timestamps, and detected entities. Switching from Deepgram to AssemblyAI then means swapping one provider module, not touching call-flow logic. The main friction points are entity formats (AssemblyAI returns structured entity objects; Deepgram returns flat text you must parse yourself) and streaming chunk sizes, which differ between providers and affect how you handle partial transcripts in barge-in logic.
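A minimal sketch of such an abstraction layer is below. Every name in it (Entity, Transcript, SttProvider, FakeProvider, next_action) is hypothetical; a real provider module would wrap the vendor SDK behind the same transcribe_chunk method.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class Entity:
    entity_type: str
    text: str


@dataclass
class Transcript:
    """Normalised transcript that call-flow logic depends on."""
    text: str
    confidence: float
    start_ms: int
    end_ms: int
    is_final: bool
    entities: List[Entity] = field(default_factory=list)


class SttProvider(Protocol):
    def transcribe_chunk(self, audio: bytes) -> Optional[Transcript]:
        """Accept raw audio; return a transcript when one is available."""
        ...


class FakeProvider:
    """Stand-in provider so the sketch runs without any vendor SDK."""

    def transcribe_chunk(self, audio: bytes) -> Optional[Transcript]:
        if not audio:
            return None
        return Transcript("policy number AB-447-Delta", 0.93, 0, 1800, True,
                          [Entity("policy_number", "AB-447-Delta")])


def next_action(provider: SttProvider, audio: bytes) -> str:
    """Call-flow logic sees only the normalised Transcript."""
    transcript = provider.transcribe_chunk(audio)
    if transcript is None or not transcript.is_final:
        return "keep_listening"
    return "respond"


print(next_action(FakeProvider(), b"\xff" * 160))
```

Swapping providers then means writing another class with the same method, including whatever parsing is needed to fill entities for a provider that returns flat text.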

© 2026 Quantum Automations Group Ltd