Why does Twilio AMD produce so many false positives on UK numbers?

Twilio AMD was trained predominantly on North American carrier patterns. UK voicemail greetings, BT ringback tones, and network announcements have different acoustic signatures, particularly in greeting length distributions and beep characteristics. The DetectMessageEnd mode with a 2.0–2.4s max greeting threshold performs better for UK numbers than the standard Enable mode, but a hybrid AMD + custom VAD + beep detection heuristic outperforms Twilio AMD alone by 15–25 percentage points on UK PSTN calls in our measurements.

What should a voice agent do when it detects voicemail?

The best outcomes come from a short, specific voicemail message (under 20 seconds) that names the prospect, states one concrete reason for calling, and gives a callback number. Generic 'we'd love to connect' messages have near-zero callback rates. After leaving a voicemail, wait at least 4 hours before the next contact attempt on a different channel (SMS or email) — same-channel retry within 2 hours typically reads as harassment.

How many retry attempts are compliant under PECR?

PECR doesn't set an explicit maximum, but ICO enforcement guidance and common sense both point to 3 attempts per campaign (voice + voicemail or missed call) across different days as a reasonable limit for cold outreach. For warm leads (submitted form, prior engagement), 5–6 attempts with appropriate spacing is generally defensible. The key test is whether a reasonable person would consider the pattern harassment. Keep attempt counts and spacing in your audit log.

Should I leave voicemails with an AI voice or a human recording?

Current ElevenLabs Flash and Azure Neural TTS voices are indistinguishable from human speech to most listeners on a voicemail recording. The compliance answer is the same regardless: the message must identify itself as being from your company and the purpose must be stated. Using a consistent AI voice persona (Nova, Alex) with proper disclosure is compliant. Pre-recorded human voicemails with no personalisation typically underperform AI-generated ones that include the prospect's name and a specific detail — personalisation is the variable that matters most for callback rate.

Voicemail Detection Settings That Actually Work

Voicemail detection — the process of determining whether an outbound call connected to a human or a voicemail system — is the most under-tuned component in most UK voice agent deployments, and the failure modes are among the most visible: your agent delivers a full 90-second sales pitch to a beep, or worse, greets a real prospect with a pre-recorded "Please hold" and they hang up.

The default AMD (Answering Machine Detection) settings from Twilio, Retell, and other providers are trained on North American carrier patterns. UK voicemail behaviour, ringback tones, and carrier announcements are meaningfully different. A voice agent running North American defaults on a UK outbound campaign will typically misclassify 15–30% of calls — roughly one in five calls either delivered to the wrong state or handled incorrectly before the agent begins speaking.

This guide is the detection configuration we've settled on after measuring false positive and false negative rates across BT, EE, Vodafone, and O2 consumer and business numbers on UK PSTN and VoIP-originated calls.

Why Voicemail Detection Matters

The economics are simple: a 20% misclassification rate on 500 calls/day means 100 calls per day either wasted (pitch delivered to a voicemail system that records it) or broken (a human hears something that makes no sense and hangs up). At £0.10 per connected call, that's £10/day in direct waste — but the compounding effect on speed-to-lead and campaign conversion is larger. Misclassified calls don't retry correctly, the CRM shows a "contacted" status that isn't accurate, and the human review queue fills with false positives.
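The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name and the choice to work in pence are illustrative assumptions, and the inputs are the example figures from the paragraph, not measured data.

```javascript
// Direct daily cost of misclassified calls, using the example figures above.
// Working in pence keeps the currency arithmetic in whole numbers.
const dailyMisclassificationCost = ({ callsPerDay, misclassificationRate, costPerCallPence }) => {
  const misclassifiedCalls = Math.round(callsPerDay * misclassificationRate);
  const wastePence = misclassifiedCalls * costPerCallPence;
  return { misclassifiedCalls, wastePounds: wastePence / 100 };
};

const example = dailyMisclassificationCost({
  callsPerDay: 500,
  misclassificationRate: 0.2,
  costPerCallPence: 10,
});
console.log(example); // { misclassifiedCalls: 100, wastePounds: 10 }
```

The helper covers only the direct call cost; the knock-on effects on retries and CRM accuracy described above are not modelled.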

There are two failure modes with opposite impacts:

False positive (human misclassified as voicemail): the agent plays a pre-recorded message or hangs up while a real prospect is on the line. This is the more damaging failure: you've lost the call, potentially irritated the prospect, and the CRM records a "voicemail" outcome. The callback rate from prospects who heard a pre-recorded message when they answered is effectively zero.
False negative (voicemail misclassified as human): the agent begins its conversational flow and delivers it to a voicemail system. The voicemail receives a confusing, interactive-style message. Lower damage than a false positive, but wastes call cost and corrupts the CRM record.

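The asymmetry matters for tuning: in most B2B outbound campaigns, false positives are more expensive than false negatives, so when the signal is genuinely ambiguous you should err toward "treat as human" over "leave a voicemail", and reserve the voicemail classification for cases with a clear machine signal such as a beep or a long greeting.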

AMD vs VAD: What Each Does

AMD (Answering Machine Detection) is a telephony-layer classifier that analyses the first 1–4 seconds of audio after a call connects and attempts to determine whether it's a human or a machine. Twilio's AMD uses a machine learning model that was trained on US carrier patterns. It outputs HUMAN, MACHINE, FAX, or UNKNOWN. The DetectMessageEnd variant waits for the greeting to finish before making a final decision — more accurate but adds 2–4 seconds of latency before the agent begins speaking.
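In practice Twilio reports the result in the AnsweredBy callback parameter, and the values are lowercase and more granular than the four labels used here: to the best of our knowledge they are human, fax, unknown, machine_start, and (in DetectMessageEnd mode) machine_end_beep, machine_end_silence, and machine_end_other. A minimal sketch of normalising them follows; the function name and the uppercase labels are this guide's convention, not part of Twilio's API, and the value list should be checked against Twilio's current documentation.

```javascript
// Map Twilio's AnsweredBy callback value onto the four labels used in this guide.
// HUMAN / MACHINE / FAX / UNKNOWN are our own labels, not Twilio's.
const normaliseAnsweredBy = (answeredBy) => {
  const value = String(answeredBy ?? '').toLowerCase();
  if (value === 'human') return 'HUMAN';
  if (value === 'fax') return 'FAX';
  // Covers machine_start (Enable mode) and machine_end_* (DetectMessageEnd mode).
  if (value.startsWith('machine')) return 'MACHINE';
  // Anything else, including a missing value, is treated as UNKNOWN.
  return 'UNKNOWN';
};
```

Collapsing the machine_end_* variants into MACHINE loses the beep/silence distinction; keep the raw value in your logs if you want to use it for tuning later.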

VAD (Voice Activity Detection) is a signal-processing classifier that detects the presence of human speech in an audio stream. Unlike AMD, it doesn't try to classify what is on the other end of the line; it only reports whether active human speech is present. VAD is fast (20–50ms decision time) and operates continuously throughout the call, making it useful for detecting whether the human has finished speaking (endpointing) as well as for initial machine detection.
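To make the idea concrete, here is a deliberately simple energy-based VAD sketch. It is not the VAD used in our stack and a production VAD would use a trained model; the frame size, RMS threshold, and function names are all assumptions chosen for illustration. It reports when speech-like energy first persists, which is the kind of value the humanSpeechStart field in the decision function further down would carry.

```javascript
// Energy-based VAD sketch: flags 20ms frames whose RMS exceeds a fixed threshold.
// Input: an array or Float32Array of PCM samples in the range [-1, 1].
const FRAME_MS = 20;

const frameRms = (samples, start, end) => {
  let sumSquares = 0;
  for (let i = start; i < end; i++) sumSquares += samples[i] * samples[i];
  return Math.sqrt(sumSquares / (end - start));
};

// Returns the time in ms at which energy first stays above the threshold
// for at least `minSpeechMs`, or null if that never happens.
const firstSpeechStartMs = (samples, sampleRate, { rmsThreshold = 0.02, minSpeechMs = 60 } = {}) => {
  const frameLen = Math.floor((sampleRate * FRAME_MS) / 1000);
  const framesNeeded = Math.ceil(minSpeechMs / FRAME_MS);
  let run = 0; // consecutive active frames seen so far
  for (let f = 0; (f + 1) * frameLen <= samples.length; f++) {
    const active = frameRms(samples, f * frameLen, (f + 1) * frameLen) >= rmsThreshold;
    run = active ? run + 1 : 0;
    if (run === framesNeeded) return (f + 1 - framesNeeded) * FRAME_MS;
  }
  return null;
};
```

An energy threshold cannot tell speech from music or tones, which is one reason VAD alone is not enough for machine detection.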

Neither AMD nor VAD alone is reliable for UK carrier calls. The hybrid approach described below combines both signals with a custom beep-detection heuristic to achieve false positive and false negative rates under 5% on UK PSTN.

Recommended Baselines

The settings that work across most UK carrier combinations, as a starting point before carrier-specific tuning:

Greeting length threshold: 1.6–2.4s. Greetings longer than this are most likely voicemail greetings. Human answers in the UK typically fall within 0.5–1.4s of connection; voicemail greetings run 2.0–4.5s.
Beep detection: 500–1,200Hz band emphasis, −28dB threshold. UK voicemail beeps are typically 1,000Hz tone for 0.3–0.8s. Enable with a 100ms minimum beep duration to avoid false positives from touch-tone sounds.
Post-greeting silence window: 600–900ms before the first agent utterance. This window gives the VAD time to confirm no human speech is incoming before the agent speaks. Too short and the agent interrupts the end of a human greeting; too long and the call feels dead.
DTMF presence: if early DTMF is detected (IVR menu, call screening), route to a separate "IVR" state rather than either HUMAN or VOICEMAIL. Don't attempt to navigate IVR systems automatically — route to human handoff or retry.
Twilio AMD mode: use DetectMessageEnd (not Enable) with a 2.0–2.4s max greeting. This catches the "long human greeting" false positive that plagues the standard Enable mode on UK numbers.
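The baselines above can be kept in one configuration object. The sketch below is illustrative: the object shape and function name are assumptions, and while MachineDetection and MachineDetectionSpeechThreshold are parameter names from Twilio's create-call API, mapping this guide's "max greeting" onto the speech threshold is our assumption and should be verified against Twilio's documentation.

```javascript
// Baseline values taken from the list above; field names are illustrative.
const UK_BASELINE = {
  greetingThresholdSec: { min: 1.6, max: 2.4 },
  beep: { bandHz: [500, 1200], thresholdDb: -28, minDurationMs: 100 },
  postGreetingSilenceMs: { min: 600, max: 900 },
  twilio: { mode: 'DetectMessageEnd', maxGreetingMs: 2400 },
};

// Build the AMD-related parameters for a Twilio create-call request.
// Assumption: "max greeting" corresponds to MachineDetectionSpeechThreshold (ms).
const buildTwilioAmdParams = (baseline = UK_BASELINE) => ({
  MachineDetection: baseline.twilio.mode,
  MachineDetectionSpeechThreshold: baseline.twilio.maxGreetingMs,
});

console.log(buildTwilioAmdParams());
```

Keeping the values in data rather than scattered through the handler makes the carrier-specific overrides discussed later easier to apply.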

Hybrid Heuristic

The production decision function that outperforms any single signal:

const classifyCall = (amd, vad, beep, dtmf, tGreeting) => {
  // tGreeting is the greeting length in seconds; vad.humanSpeechStart is in ms.

  // Definite voicemail signals
  if (beep.detected && tGreeting > 1.7) return 'VOICEMAIL';
  if (amd.result === 'MACHINE' && vad.noHumanSpeechFirstSecond && tGreeting > 1.8)
    return 'VOICEMAIL';

  // Early DTMF means an IVR menu or call screening: separate state, not HUMAN
  if (dtmf.early) return 'IVR';

  // Definite human signal: speech starts within 1.2s of connection.
  // The null check stops a missing VAD reading from comparing as 0.
  if (vad.humanSpeechStart != null && vad.humanSpeechStart < 1200) return 'HUMAN';

  // Weaker voicemail signals: unconfirmed AMD MACHINE, or UNKNOWN with a long greeting
  if (amd.result === 'MACHINE') return 'VOICEMAIL';
  if (amd.result === 'UNKNOWN' && tGreeting > 2.5) return 'VOICEMAIL';

  // Default: treat as human if AMD says HUMAN or UNKNOWN with short greeting
  return 'HUMAN';
};

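The key design decision is in the ambiguous case: when AMD returns UNKNOWN and greeting length is above 2.5s, we classify as VOICEMAIL rather than HUMAN. A greeting that long sits well outside the typical range for a human answer, so for B2B outbound this rule reduces false negatives at the cost of a small increase in false positives. AMD HUMAN, or UNKNOWN with a shorter greeting, still falls through to HUMAN, which is the safer default when false positives are the costlier error. Flip this for inbound calls, where a false positive means dropping a call from a real prospect.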

Running the heuristic in your own call event handler rather than relying entirely on the telephony provider's AMD gives you full control over the decision logic. If the thresholds are read from configuration rather than hard-coded, you can also tune them without a redeployment. The latency overhead is under 5ms, well within your latency budget.
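One way to make the thresholds tunable is to build the decision function from a thresholds object. The sketch below mirrors the decision function above under that assumption; makeClassifier, the threshold names, and the 3.0s override are hypothetical, and how the overrides are loaded (config store, feature flag, environment) is left out. It includes a null check on humanSpeechStart so a missing VAD reading is not treated as speech at 0ms.

```javascript
// Defaults match the numbers in the decision function above.
const DEFAULT_THRESHOLDS = {
  beepMinGreetingSec: 1.7,
  machineMinGreetingSec: 1.8,
  humanSpeechMaxMs: 1200,
  unknownMaxGreetingSec: 2.5,
};

// Returns a classifier closed over the merged thresholds.
const makeClassifier = (overrides = {}) => {
  const t = { ...DEFAULT_THRESHOLDS, ...overrides };
  return (amd, vad, beep, dtmf, tGreeting) => {
    if (beep.detected && tGreeting > t.beepMinGreetingSec) return 'VOICEMAIL';
    if (amd.result === 'MACHINE' && vad.noHumanSpeechFirstSecond && tGreeting > t.machineMinGreetingSec) {
      return 'VOICEMAIL';
    }
    if (dtmf.early) return 'IVR';
    if (vad.humanSpeechStart != null && vad.humanSpeechStart < t.humanSpeechMaxMs) return 'HUMAN';
    if (amd.result === 'MACHINE') return 'VOICEMAIL';
    if (amd.result === 'UNKNOWN' && tGreeting > t.unknownMaxGreetingSec) return 'VOICEMAIL';
    return 'HUMAN';
  };
};

// Example: AMD returns UNKNOWN on a 2.8s greeting with no speech, beep, or DTMF.
const classify = makeClassifier();
const lenient = makeClassifier({ unknownMaxGreetingSec: 3.0 }); // hypothetical override
const sample = [
  { result: 'UNKNOWN' },
  { noHumanSpeechFirstSecond: true, humanSpeechStart: null },
  { detected: false },
  { early: false },
  2.8,
];
console.log(classify(...sample)); // VOICEMAIL
console.log(lenient(...sample)); // HUMAN
```

The same call flips from VOICEMAIL to HUMAN purely through configuration, which is the property that makes per-carrier tuning practical.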

UK Carrier Differences

UK carriers have meaningfully different voicemail behaviour that requires per-carrier tuning. Based on measurements across 10,000+ calls:

Carrier	Typical greeting length	Beep characteristics	Ringback behaviour
EE / BT Mobile	3.5–5.0s	1,000Hz, 0.4s, clean	Standard UK ringback
Vodafone	2.8–4.2s	1,050Hz, 0.5s, with fade	Music-on-hold sometimes present
O2	3.0–4.5s	900Hz, 0.6s, double beep	Standard UK ringback
Three (3)	2.5–3.8s	1,000Hz, 0.3s, fast	Standard UK ringback
BT Landline	4.0–6.0s	Tone + silence + tone	Callminder greeting longer than mobile VMs
Sky/Virgin	3.5–5.5s	Variable (outsourced VM)	Ring patterns differ from BT
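For per-carrier tuning it can help to hold the table as data. The sketch below transcribes only the rows that give a single beep frequency; the structure, function names, and the 1,000Hz fallback are illustrative assumptions, and the figures are the typical values from the table rather than guarantees.

```javascript
// Typical values transcribed from the table above. BT Landline and Sky/Virgin
// are omitted because the table gives no single beep frequency for them.
const CARRIER_PROFILES = new Map([
  ['EE / BT Mobile', { greetingSec: [3.5, 5.0], beepHz: 1000, beepSec: 0.4 }],
  ['Vodafone', { greetingSec: [2.8, 4.2], beepHz: 1050, beepSec: 0.5 }],
  ['O2', { greetingSec: [3.0, 4.5], beepHz: 900, beepSec: 0.6 }],
  ['Three', { greetingSec: [2.5, 3.8], beepHz: 1000, beepSec: 0.3 }],
]);

// Beep frequency to look for; falls back to an assumed 1,000Hz for unlisted carriers.
const beepTargetHz = (carrier, fallbackHz = 1000) =>
  CARRIER_PROFILES.get(carrier)?.beepHz ?? fallbackHz;

// True/false if the greeting length falls in the carrier's typical voicemail
// range, or null when the carrier is not in the table.
const greetingInVoicemailRange = (carrier, greetingSec) => {
  const profile = CARRIER_PROFILES.get(carrier);
  if (!profile) return null;
  const [min, max] = profile.greetingSec;
  return greetingSec >= min && greetingSec <= max;
};
```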

O2's double-beep pattern is the most common source of false negatives. To avoid triggering on the first of a double-beep sequence, the hybrid heuristic's beep detection needs a minimum beep duration of 250ms rather than the 100ms baseline. Vodafone's music-on-hold is the most common source of "UNKNOWN" classifications from Twilio AMD. The 2.5s UNKNOWN threshold handles this correctly in most cases, but Vodafone enterprise numbers may need a slightly higher threshold.
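A minimum beep duration can be enforced with a run-length filter over a single-frequency tone detector. The sketch below uses the Goertzel algorithm for the tone test; it is an illustration rather than our production detector, and the function names, 20ms block size, and tone-ratio threshold are assumptions. The 250ms default is the minimum duration mentioned above.

```javascript
const BLOCK_MS = 20;

// Goertzel power of one target frequency within a block of samples.
const goertzelPower = (samples, start, length, targetHz, sampleRate) => {
  const coeff = 2 * Math.cos((2 * Math.PI * targetHz) / sampleRate);
  let s1 = 0;
  let s2 = 0;
  for (let i = start; i < start + length; i++) {
    const s0 = samples[i] + coeff * s1 - s2;
    s2 = s1;
    s1 = s0;
  }
  return s1 * s1 + s2 * s2 - coeff * s1 * s2;
};

// Returns [{ startMs, durationMs }] for tone runs at targetHz lasting at least minBeepMs.
const findBeeps = (samples, sampleRate, { targetHz = 1000, minBeepMs = 250, minToneRatio = 0.5 } = {}) => {
  const blockLen = Math.floor((sampleRate * BLOCK_MS) / 1000);
  const totalBlocks = Math.floor(samples.length / blockLen);
  const beeps = [];
  let runStart = -1;
  // The extra iteration (b === totalBlocks) closes a run that reaches the end.
  for (let b = 0; b <= totalBlocks; b++) {
    let isTone = false;
    if (b < totalBlocks) {
      const start = b * blockLen;
      let energy = 0;
      for (let i = start; i < start + blockLen; i++) energy += samples[i] * samples[i];
      const power = goertzelPower(samples, start, blockLen, targetHz, sampleRate);
      // For a pure tone at targetHz, power is about energy * blockLen / 2,
      // so the ratio is near 1; other content pushes it toward 0.
      isTone = energy > 1e-6 && power / ((energy * blockLen) / 2) >= minToneRatio;
    }
    if (isTone && runStart < 0) runStart = b;
    if (!isTone && runStart >= 0) {
      const durationMs = (b - runStart) * BLOCK_MS;
      if (durationMs >= minBeepMs) beeps.push({ startMs: runStart * BLOCK_MS, durationMs });
      runStart = -1;
    }
  }
  return beeps;
};
```

Because the function returns every qualifying run, a caller handling a double-beep system could wait for the last entry before starting the message; that policy is left to the call handler.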

Voicemail Message Strategy

When VOICEMAIL is confirmed, the choice is between leaving a message, hanging up silently, or playing a tone and hanging up. Leaving a well-crafted message is almost always the right choice for B2B outbound:

Short and specific (under 20 seconds). The callback rate on voicemails drops sharply above 20 seconds. Name the prospect, state the company, give one specific reason for calling, and provide a direct callback number. "Hi Alex, this is Nova from Quantum Automations — I'm calling because you enquired about AI appointment booking last week. Our number is 020 7946 0000 — happy to call back at a time that suits." This format outperforms generic messages by 4–6x on callback rate in our data.

Personalise the one detail that matters. A voicemail that references the specific thing the prospect expressed interest in (web form topic, LinkedIn connection, a recent post they made) converts better than one that doesn't. The incremental cost of generating a personalised TTS voicemail per call is negligible; the lift in callback rate is material.

Do not use the agent's full qualification script as a voicemail. A 90-second message asking qualifying questions to an empty phone is the most common voicemail failure mode. Keep a separate voicemail script that's 15–20 seconds, with a clear CTA and a callback number, entirely separate from the live conversation script.
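The separate voicemail script can be generated from a template and checked against the 20-second limit before it is sent to TTS. This is a rough sketch: the function names are hypothetical, and the 150 words-per-minute speaking rate is an assumption, so the estimate is only a guard rail. Phone numbers are read out digit by digit, so real duration will run longer than a word count suggests.

```javascript
// Assumed speaking rate for estimating duration; real TTS pacing varies by voice.
const ASSUMED_WORDS_PER_MINUTE = 150;
const MAX_VOICEMAIL_SECONDS = 20;

// Template following the example message above.
const buildVoicemail = ({ prospectName, agentName, company, reason, callbackNumber }) =>
  `Hi ${prospectName}, this is ${agentName} from ${company}. ` +
  `I'm calling because ${reason}. ` +
  `Our number is ${callbackNumber}, happy to call back at a time that suits.`;

// Word-count estimate of spoken duration in seconds.
const estimateSeconds = (text, wpm = ASSUMED_WORDS_PER_MINUTE) => {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  return (words / wpm) * 60;
};

const checkVoicemail = (text) => {
  const seconds = estimateSeconds(text);
  return { seconds, withinLimit: seconds <= MAX_VOICEMAIL_SECONDS };
};

const message = buildVoicemail({
  prospectName: 'Alex',
  agentName: 'Nova',
  company: 'Quantum Automations',
  reason: 'you enquired about AI appointment booking last week',
  callbackNumber: '020 7946 0000',
});
console.log(checkVoicemail(message).withinLimit); // true
```

Running the check at generation time stops the live qualification script from being used as a voicemail by accident.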

Retry Logic and Cooldown

The retry strategy after a voicemail or missed call directly affects both campaign performance and PECR compliance. The two failure modes to avoid: retry too quickly (harassment) and retry too slowly (the prospect forgets they wanted your service).

The pattern we ship most often:

Attempt 1: initial call, business hours (9–11am or 2–5pm local time). Voicemail left.
Attempt 2: different time slot, minimum 4 hours after attempt 1. If no VM on attempt 1, leave one now.
Attempt 3: different day, minimum 24 hours after attempt 2. If no answer, SMS or email follow-up (with PECR consent for the channel).
No attempt 4+ without re-engagement signal (email reply, SMS reply, re-submit of form). Continued contact without a response signal moves from outreach into harassment territory under ICO enforcement guidance.
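The time-of-day windows in the pattern above can be checked with the built-in Intl API, which handles the GMT/BST switch. This is a sketch under assumptions: the function names are hypothetical, Europe/London is assumed as the prospect's time zone, and weekends and bank holidays are not handled.

```javascript
// Hour of day (0-23) in the given IANA time zone.
const localHour = (date, timeZone) => {
  const parts = new Intl.DateTimeFormat('en-GB', {
    timeZone,
    hour: 'numeric',
    hourCycle: 'h23',
  }).formatToParts(date);
  return Number(parts.find((p) => p.type === 'hour').value);
};

// Windows from the pattern above: 9-11am and 2-5pm local time.
const CALL_WINDOWS = [[9, 11], [14, 17]];

const inCallWindow = (date, timeZone = 'Europe/London') => {
  const hour = localHour(date, timeZone);
  return CALL_WINDOWS.some(([start, end]) => hour >= start && hour < end);
};

console.log(inCallWindow(new Date('2025-01-15T10:30:00Z'))); // true (10:30 GMT)
console.log(inCallWindow(new Date('2025-07-15T10:30:00Z'))); // false (11:30 BST)
```

A scheduler would call this before dialling and defer the attempt to the next open window when it returns false.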

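// Retry cooldown logic
// contact.lastAttempt is a millisecond timestamp; attempts counts calls already made.
const shouldRetry = (contact) => {
  const { attempts, lastAttempt, optedOut } = contact;
  if (optedOut) return false;
  if (attempts >= 3) return false; // no attempt 4+ without a re-engagement signal
  const hoursSinceLast = (Date.now() - lastAttempt) / (1000 * 60 * 60);
  if (attempts === 1 && hoursSinceLast < 4) return false; // attempt 2: at least 4 hours later
  if (attempts >= 2 && hoursSinceLast < 24) return false; // attempt 3: at least 24 hours later
  return true;
};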

Store attempt counts and timestamps in the CRM. The "attempted" status in your CRM should distinguish between "answered by human", "voicemail left", "no answer / no VM left", and "disconnected/SIT tone". These distinctions matter for campaign analytics and for the audit trail required under PECR compliance.
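A minimal sketch of recording those outcomes follows. The outcome constants and the recordAttempt helper are hypothetical names, and the contact shape simply reuses the fields read by the retry check above (attempts, lastAttempt, lastVoicemail); a real CRM integration would write to its own schema.

```javascript
// Outcome labels from the paragraph above; the constant names are illustrative.
const OUTCOMES = Object.freeze({
  ANSWERED_HUMAN: 'answered_by_human',
  VOICEMAIL_LEFT: 'voicemail_left',
  NO_ANSWER: 'no_answer_no_vm',
  DISCONNECTED: 'disconnected_sit_tone',
});

// Returns a new contact record with the attempt appended to its audit log.
// `at` is a millisecond timestamp.
const recordAttempt = (contact, outcome, at = Date.now()) => {
  if (!Object.values(OUTCOMES).includes(outcome)) {
    throw new Error(`Unknown outcome: ${outcome}`);
  }
  const attemptLog = [...(contact.attemptLog ?? []), { outcome, at }];
  return {
    ...contact,
    attemptLog,
    attempts: attemptLog.length,
    lastAttempt: at,
    lastVoicemail:
      outcome === OUTCOMES.VOICEMAIL_LEFT ? at : (contact.lastVoicemail ?? null),
  };
};
```

Returning a new record instead of mutating the old one keeps every intermediate state available for the audit trail.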

QA Methodology

Voicemail detection QA is different from conversation QA — you're measuring a binary classification against a labelled ground truth, not grading a conversation. The methodology:

Label 200+ calls: for each call, record the actual outcome (human / voicemail / IVR / SIT tone / no answer) separately from the system's classification, by listening to the first 5 seconds of each call recording. Treat this as a ground truth dataset.
Measure FP and FN rates per carrier: aggregate results by carrier (derive from the called number's prefix, bearing in mind that ported numbers keep their original prefix, so a carrier lookup is more reliable where available). A 5% FP rate on EE and a 22% FP rate on O2 are very different problems requiring different solutions.
Record all classification features: for each call, log greeting length, beep detection result, VAD speech start time, DTMF events, and AMD output. You need these to tune the heuristic without re-running calls.
A/B AMD settings monthly: carriers update their voicemail systems regularly. A threshold that worked in Q1 may have drifted by Q3. Run a monthly review of your FP/FN rates against a fresh 100-call sample.
Fail-safe for uncertain classifications: when AMD returns UNKNOWN or the signals disagree, play a short "just a moment" utterance (0.5s) and re-assess VAD for an additional 300ms before committing. This catches the "slow-to-answer human" case that's common on BT landlines.
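The per-carrier measurement step can be sketched as a small aggregation over the labelled calls. The record shape and function name are assumptions, as is the choice of denominators: here the FP rate is taken over calls that were actually human and the FN rate over calls that were actually voicemail.

```javascript
// Each labelled call: { carrier, actual, predicted }, with values such as
// 'HUMAN' or 'VOICEMAIL'. FP = human classified as voicemail;
// FN = voicemail classified as human. Other outcomes are ignored here.
const ratesByCarrier = (calls) => {
  const stats = new Map();
  for (const { carrier, actual, predicted } of calls) {
    const s = stats.get(carrier) ?? { humans: 0, voicemails: 0, fp: 0, fn: 0 };
    if (actual === 'HUMAN') {
      s.humans += 1;
      if (predicted === 'VOICEMAIL') s.fp += 1;
    } else if (actual === 'VOICEMAIL') {
      s.voicemails += 1;
      if (predicted === 'HUMAN') s.fn += 1;
    }
    stats.set(carrier, s);
  }
  const result = {};
  for (const [carrier, s] of stats) {
    result[carrier] = {
      // null means there were no calls of that type to measure against.
      fpRate: s.humans ? s.fp / s.humans : null,
      fnRate: s.voicemails ? s.fn / s.voicemails : null,
    };
  }
  return result;
};
```

Running this over each monthly sample and comparing against the previous month is a simple way to spot threshold drift on a single carrier.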

Good / Bad / Ugly

Good. Hybrid AMD + VAD + beep detection. Carrier-specific threshold tuning. Separate voicemail script (15–20s) from live conversation script. Monthly FP/FN rate review against labelled calls. DTMF routed to separate IVR state. Retry cooldown enforced in code, not policy.

Bad. Relying on AMD alone — drift over time is guaranteed. Using the live conversation script as the voicemail message. A 4+ attempt retry cadence without re-engagement signals. Not logging the classification decision features, making tuning impossible without re-running calls.

Ugly. A double-beep (O2 VM) triggering a HUMAN classification because beep minimum duration wasn't set. Carrier ringback music (Vodafone enterprise) getting classified as "no answer" instead of "connecting" — campaign metrics show 0% connect rate on a carrier where calls are actually ringing. A voicemail left that begins with "Hello? Hello? Are you there?" because the endpointing logic fired too early during the VM greeting. SIT tones (disconnected numbers) routing to the live agent flow because AMD classified the tone as HUMAN speech.
