We A/B-tested two versions of the same outbound qualification agent. Version A had barge-in enabled: VAD threshold at 0.55, endpointing at 380ms. Version B had barge-in disabled, so the agent spoke its full turn before listening. On a 2,000-call sample, version A produced 34% more completed conversations and 41% fewer hang-ups in the first 30 seconds. The script was identical. The voice was identical. The LLM was identical. The only variable was whether the caller could interrupt.
Barge-in is the single largest UX lever in voice agent design, and almost universally the last thing teams tune.
What barge-in actually is (and what it isn't)
Barge-in means the agent stops speaking when it detects the caller talking and begins listening immediately. Without barge-in, the agent speaks its full turn — potentially a 6-second response — before it processes what the caller said. The caller experiences this as being talked at by a system that ignores them.
Barge-in is not the same as interruption tolerance. An agent with barge-in enabled will stop speaking. What it does next — whether it correctly interprets a mid-sentence interruption, preserves its prior context, and generates a sensible response — depends on the endpointing and context-preservation implementation.
The distinction matters because teams often enable barge-in, observe that the agent now stops speaking, declare success, and move on. The harder problem is what happens in the 400ms after the interruption is detected.
The VAD stack: how interruption detection works under the hood
Voice Activity Detection (VAD) is the component that determines whether the audio signal contains speech. In a voice agent context, VAD is running continuously during the agent's turn; when it detects speech above its threshold, it signals an interruption.
The production standard for local VAD is Silero VAD — a compact PyTorch model (~1MB) that runs at under 1ms per 30ms audio chunk on CPU. Most managed voice platforms (Retell, VAPI) use Silero or an equivalent internally, but expose it through configuration parameters rather than direct model access.
The core parameters you control:
vad_config:
  model: silero_v4
  threshold: 0.55          # 0–1; probability above which audio is classified as speech
  min_speech_ms: 200       # minimum speech duration to trigger barge-in
  min_silence_ms: 400      # silence duration required before speech event ends
  noise_suppression: true  # pre-process with NSNet2 before VAD
  window_size_ms: 30       # length of audio analysed per VAD inference
The threshold is the primary tuning variable. The min_speech_ms guard prevents very short sounds (a cough, background TV) from triggering barge-in. The min_silence_ms determines when the caller has finished speaking.
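How `threshold` and `min_speech_ms` interact is easiest to see in code. The sketch below is a minimal illustration, not any platform's implementation: it assumes the VAD emits one speech probability per 30ms window, and the names `BargeInDetector` and `first_trigger_frame` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    """Fires once speech probability stays above threshold for min_speech_ms."""

    threshold: float = 0.55
    min_speech_ms: int = 200
    window_size_ms: int = 30
    _speech_ms: int = field(default=0, init=False, repr=False)

    def push(self, speech_prob: float) -> bool:
        """Feed one window's speech probability; return True when barge-in should fire."""
        if speech_prob >= self.threshold:
            self._speech_ms += self.window_size_ms
        else:
            # Simplifying assumption: any sub-threshold window resets the run.
            self._speech_ms = 0
        return self._speech_ms >= self.min_speech_ms


def first_trigger_frame(probs, detector):
    """Return the index of the window at which barge-in fires, or None."""
    for i, prob in enumerate(probs):
        if detector.push(prob):
            return i
    return None


# A 90ms cough never accumulates 200ms of speech, so it is ignored.
cough = [0.9] * 3 + [0.1] * 5
# Sustained speech fires on the 7th consecutive speech window (7 x 30ms = 210ms).
speech = [0.1] * 2 + [0.9] * 10
```

With the defaults above, `first_trigger_frame(cough, BargeInDetector())` returns `None` and `first_trigger_frame(speech, BargeInDetector())` returns `8`. Raising `min_speech_ms` trades false barge-ins for a slower reaction to genuine interruptions.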
Platform comparison for barge-in configuration:
| Platform | VAD config accessible | Endpointing control | Context preservation | Custom models |
|---|---|---|---|---|
| Retell AI | Yes (via agent config JSON) | Yes | Built-in | No |
| VAPI | Partial (threshold only) | Yes | Configurable | No |
| Bland.ai | Minimal | Limited | Basic | No |
| Custom (Pipecat/LiveKit) | Full control | Full control | Implement yourself | Yes |
| Custom (Vocode) | Full control | Full control | Implement yourself | Yes |
For teams on managed platforms, Retell's configuration surface is currently the most complete. For teams who need custom VAD models or corpus-specific tuning, the Pipecat open-source framework provides full control.
Endpointing configuration: the settings that matter
Endpointing is the complement of VAD: where VAD detects speech starting, endpointing detects speech ending. Getting the delay wrong in either direction creates problems: too short, and the agent cuts the caller off mid-sentence; too long, and the agent pauses awkwardly before responding.
The endpointing delay — how long to wait after the last detected speech before treating the turn as complete — should be calibrated to your call type:
- Outbound qualification calls: 380–420ms. Callers on outbound calls tend to give shorter, more decisive answers.
- Inbound service calls: 480–550ms. Callers describing a problem often pause mid-sentence while collecting their thoughts.
- Complex questions: 600–800ms with semantic endpointing. If the caller is likely to give multi-sentence responses, silence-based endpointing alone is insufficient.
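The calibration above can be sketched as a silence-duration endpointer with one delay per call type. This is an illustrative sketch only: the preset values are single points picked from within the ranges listed, and the names `ENDPOINT_DELAY_MS` and `turn_end_ms` are assumptions, not a platform API.

```python
# Illustrative presets chosen from within the ranges above (assumed values).
ENDPOINT_DELAY_MS = {
    "outbound_qualification": 400,
    "inbound_service": 500,
    "complex_question": 700,
}


def turn_end_ms(probs, call_type, threshold=0.55, window_size_ms=30):
    """Return the time (ms) at which the caller's turn is declared complete.

    probs holds one speech probability per VAD window. Returns None if the
    required trailing silence is never observed.
    """
    delay = ENDPOINT_DELAY_MS[call_type]
    heard_speech = False
    silence_ms = 0
    for i, prob in enumerate(probs):
        if prob >= threshold:
            heard_speech = True
            silence_ms = 0  # speech resumed, restart the silence timer
        elif heard_speech:
            silence_ms += window_size_ms
            if silence_ms >= delay:
                return (i + 1) * window_size_ms
    return None


# 300ms of speech followed by a 450ms pause.
pause = [0.9] * 10 + [0.1] * 15
```

The same 450ms pause ends the turn under the outbound preset (`turn_end_ms(pause, "outbound_qualification")` returns `720`) but not under the inbound preset, where the function returns `None` and the agent keeps waiting.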
Semantic endpointing — using a small model to classify whether the caller has completed their thought, rather than relying purely on silence duration — is now practical in production. ElevenLabs' turn detection uses a transformer-based turn-completion classifier; it reduces premature cut-offs by approximately 30% compared to pure silence-based endpointing on conversational queries.
Context preservation: what to do when the agent is interrupted mid-thought
The hardest part of barge-in is not detecting the interruption — it's deciding what to do with the context that was interrupted.
When a caller interrupts at second 2 of a 6-second TTS response, the agent has committed to a conversational direction. The caller's interruption may be:
- Aligned — "yes, exactly" or completing the agent's point: the agent should acknowledge and continue along the same path
- Redirecting — "actually, can we go back to price?": the agent should abandon the current thread and switch to the redirected topic
- Clarifying — "sorry, what does that mean?": the agent should pause, answer the clarification, then return to its thread
The naive implementation: stop speaking, process the interruption from scratch as a new turn, lose the prior context. This produces an agent that feels forgetful — it was mid-sentence explaining something and now acts as if that explanation never started.
The better implementation: maintain an in-memory conversation state object that records both the agent's intent at the point of interruption and the content that had been delivered. Pass this context to the LLM alongside the caller's interruption:
{
  "system": "You are continuing a conversation. You were interrupted while saying: '{{interrupted_text}}'. The caller said: '{{caller_utterance}}'. Respond to the caller's input, and if appropriate return to the point you were making.",
  "conversation_history": [...],
  "interrupted_context": {
    "agent_intent": "explaining pricing structure",
    "delivered_text": "Our setup fee is £2,400, which covers",
    "remaining_text": "the integration, the agent training, and the first 30 days of optimisation."
  }
}
This context-aware interruption handling is the difference between an agent that feels conversational and one that feels stateless.
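One way to populate the `interrupted_context` object is to split the agent's planned turn at the playback position when TTS stopped. The sketch below assumes you can estimate how many characters had been spoken (for example from TTS word timestamps, if your stack exposes them); `build_interrupted_context` and `chars_spoken` are hypothetical names.

```python
def build_interrupted_context(full_text, chars_spoken, agent_intent):
    """Split an interrupted agent turn into delivered and remaining text.

    chars_spoken is an estimate of the playback position (an assumption:
    how you obtain it depends on your TTS engine). The split backs up to
    the previous word boundary so no word is cut in half.
    """
    cut = max(0, min(chars_spoken, len(full_text)))
    if 0 < cut < len(full_text) and not full_text[cut].isspace():
        # rfind returns -1 when there is no earlier space, which gives cut = 0.
        cut = full_text.rfind(" ", 0, cut) + 1
    return {
        "agent_intent": agent_intent,
        "delivered_text": full_text[:cut].rstrip(),
        "remaining_text": full_text[cut:].lstrip(),
    }


turn = (
    "Our setup fee is £2,400, which covers the integration, "
    "the agent training, and the first 30 days of optimisation."
)
# Playback stopped two characters into the word "the".
context = build_interrupted_context(
    turn, turn.index("the integration") + 2, "explaining pricing structure"
)
```

Backing up to a word boundary means a half-spoken word is counted as undelivered, which errs on the side of the agent repeating a word rather than skipping one.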
False barge-in: background noise, music, and multi-speaker environments
The failure mode that erodes caller trust most quickly: the agent stopping mid-sentence because a car radio triggered the VAD. Two to three false barge-ins in a call and callers disengage.
False barge-in sources, ranked by frequency in UK outbound campaigns:
1. Background music or TV in a home office
2. Car radio during a hands-free call
3. A second person speaking in the caller's environment
4. HVAC noise in a call centre (constant low-frequency noise can trigger poorly-calibrated VAD)
5. Network artefacts on poor mobile connections (packet bursts that sound like brief speech)
Mitigation:
Noise suppression pre-processing: run the incoming audio through a noise suppression model (NSNet2, RNNoise) before the VAD stage. This removes stationary background noise before the VAD sees the signal. Available as a pre-built component in most managed platforms.
min_speech_ms guard: require speech to persist for at least 200ms before triggering barge-in. A radio jingle or a single syllable from background TV rarely sustains for 200ms at a level that would also pass the VAD threshold.
Post-interruption validation: after the VAD triggers and TTS stops, run a second classification pass on the audio that triggered it. If the classifier returns low confidence that the audio was directed at the agent (as opposed to background noise), continue the TTS from where it was interrupted rather than switching to listen mode.
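The post-interruption validation step can be sketched as a small decision function. Everything here is an assumption for illustration: the classifier is passed in as a callable returning a 0 to 1 confidence that the audio was directed at the agent, and the 0.6 cut-off is a placeholder to tune, not a recommended value.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BargeInDecision:
    action: str                      # "listen" or "resume_tts"
    resume_from_char: Optional[int]  # playback position to resume from, if any


def validate_barge_in(
    trigger_audio: bytes,
    directed_at_agent: Callable[[bytes], float],  # hypothetical second-pass classifier
    chars_spoken: int,
    min_confidence: float = 0.6,  # assumed cut-off, tune on your own calls
) -> BargeInDecision:
    """Second pass after VAD has fired and TTS has paused."""
    confidence = directed_at_agent(trigger_audio)
    if confidence >= min_confidence:
        # Treat as a genuine interruption and hand over to the caller.
        return BargeInDecision("listen", None)
    # Likely background noise: pick the TTS back up where it stopped.
    return BargeInDecision("resume_tts", chars_spoken)


# Stub classifiers standing in for a real model.
radio = validate_barge_in(b"...", lambda audio: 0.2, chars_spoken=37)
caller = validate_barge_in(b"...", lambda audio: 0.9, chars_spoken=37)
```

Here `radio.action` is `"resume_tts"` with `resume_from_char` set to 37, while `caller.action` is `"listen"`. The second pass adds latency to every genuine interruption, so the classifier needs to be fast enough that the pause is not noticeable.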
What changed in 2025–2026: turn-taking models and semantic endpointing
The major shift in the past 12 months has been the move from energy-based VAD + silence-duration endpointing to semantic turn-taking models that predict whether a speaker has finished based on linguistic content, not just silence.
The practical difference: a caller who pauses mid-sentence while thinking — "so the price would be... [1.2 seconds]... about three grand a month?" — triggers a silence-duration endpointer and causes the agent to jump in and miss the completion. A semantic model trained on conversational turn-taking recognises the mid-sentence construction and waits.
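A hybrid endpointer along these lines can be sketched as follows. This is not the architecture of any vendor named in this article: the completion classifier is injected as a callable, `toy_completion_prob` is a deliberately crude stand-in for a trained model, and the delay values are assumptions.

```python
from typing import Callable

# Words that rarely end a finished thought. Toy heuristic only, standing in
# for a trained turn-completion classifier.
DANGLING = {"be", "and", "but", "the", "a", "to", "of", "so", "or"}


def toy_completion_prob(transcript: str) -> float:
    """Return a rough probability that the utterance is complete."""
    words = transcript.lower().rstrip(" .?!").split()
    if not words:
        return 0.0
    return 0.1 if words[-1] in DANGLING else 0.9


def should_end_turn(
    silence_ms: int,
    transcript: str,
    completion_prob: Callable[[str], float],
    base_delay_ms: int = 400,      # used when the thought looks complete
    extended_delay_ms: int = 1500, # used when the caller seems mid-sentence
    complete_above: float = 0.7,
) -> bool:
    """Silence-duration endpointing with a semantically chosen delay."""
    looks_complete = completion_prob(transcript) >= complete_above
    delay = base_delay_ms if looks_complete else extended_delay_ms
    return silence_ms >= delay


mid_sentence = should_end_turn(1200, "so the price would be...", toy_completion_prob)
finished = should_end_turn(450, "about three grand a month?", toy_completion_prob)
```

With these settings `mid_sentence` is `False`, so the agent waits through the 1.2-second pause, and `finished` is `True`. The extended delay acts as a ceiling so that a misclassified complete utterance still ends the turn eventually.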
ElevenLabs' Conversational AI platform published a turn-taking architecture in early 2025 that combines VAD with a transformer-based completion classifier. Retell AI implemented a similar hybrid approach in their v3 release. The improvement in perceived conversational quality is significant enough that for any new deployment, semantic endpointing should be the default, not the advanced option.
Testing barge-in before going live
Barge-in behaviour is difficult to evaluate from call logs alone — you need to listen to calls during the first week and classify interruption handling manually. The automated metrics to track:
- False-barge-in rate: percentage of agent turns where TTS was interrupted but no meaningful caller speech followed within 2 seconds. A rate above 8% means your VAD threshold is too sensitive.
- Premature-cutoff rate: percentage of caller turns where the agent began speaking before the caller finished. A rate above 5% means your endpointing delay is too short.
- Context-preservation score: for calls with at least one barge-in, what percentage of agent responses after interruption acknowledged or continued from the pre-interruption context? This requires human annotation but should be tracked for a sample of 50 calls per week during optimisation.
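The first two metrics can be computed directly from per-turn records. The sketch below assumes your call logs can be reduced to the boolean flags shown; the record types and field names are hypothetical, and the 8% and 5% limits are the ones given above.

```python
from dataclasses import dataclass


@dataclass
class AgentTurn:
    tts_interrupted: bool
    caller_speech_within_2s: bool  # meaningful caller speech followed the interruption


@dataclass
class CallerTurn:
    agent_started_early: bool  # agent began speaking before the caller finished


def false_barge_in_rate(turns):
    """Percentage of agent turns interrupted with no caller speech following."""
    if not turns:
        return 0.0
    false_hits = sum(
        t.tts_interrupted and not t.caller_speech_within_2s for t in turns
    )
    return 100 * false_hits / len(turns)


def premature_cutoff_rate(turns):
    """Percentage of caller turns the agent cut into."""
    if not turns:
        return 0.0
    return 100 * sum(t.agent_started_early for t in turns) / len(turns)


agent_turns = (
    [AgentTurn(True, False)] * 9      # false barge-ins
    + [AgentTurn(True, True)] * 21    # genuine interruptions
    + [AgentTurn(False, False)] * 70  # uninterrupted turns
)
caller_turns = [CallerTurn(True)] * 3 + [CallerTurn(False)] * 97

vad_too_sensitive = false_barge_in_rate(agent_turns) > 8
endpointing_too_short = premature_cutoff_rate(caller_turns) > 5
```

In this sample the false-barge-in rate is 9%, which crosses the 8% limit, while the premature-cutoff rate of 3% stays under 5%.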
Deepgram's endpointing documentation provides a useful reference for the relationship between utterance-end detection and barge-in; the STT layer's endpointing configuration interacts directly with the VAD layer above it, and misalignment between the two is a common source of double-cutoffs.
Good / Bad / Ugly: barge-in patterns from production deployments
Good: Retell AI agent, VAD threshold 0.55, min_speech_ms 200ms, semantic endpointing enabled, noise suppression pre-processing on, context preservation implemented in the LLM prompt. False barge-in rate: 3.2% of calls. Premature cut-offs: 1.8%. Caller-reported conversational quality on post-call survey: 4.1/5. Call completion rate: 71%.
Bad: VAD threshold at 0.3 (too sensitive) on an outbound campaign targeting car-commute hours. False barge-in rate: 22%. The agent stops speaking an average of 1.4 times per call due to car radio noise. Callers perceive the agent as broken. Campaign paused after day 3 with a 28% hang-up rate in the first 30 seconds.
Ugly: Barge-in completely disabled on an agent with average TTS turn length of 9 seconds. Callers who want to confirm a booking must wait for the full 9-second response, then speak. Most callers interrupt at second 4, find the agent continues speaking, and either give up or speak over it. The result is a dialogue where the caller and agent are often simultaneously talking, producing a call recording that is unintelligible. Completion rate: 34%. One client described their experience on a forum as "like being trapped in a bad phone tree from 2008."
For the full architecture that surrounds barge-in — STT model selection, LLM configuration, TTS streaming, and telephony stack — see Voice AI Architecture 2025: A Production Implementation Guide. For how TTS latency tuning interacts with barge-in UX, see TTS Caching for Voice Agents: Cutting Latency Below 200ms. The Voice AI and Document Analysis case study shows how barge-in was configured in a 10-agent production deployment.