Reference Stack
- Telephony: Twilio Programmable Voice or SIP trunk → WebRTC
- STT: Deepgram / Google STT; TTS: ElevenLabs / Azure; VAD enabled
- LLM: GPT-4o-mini / GPT-4.1-mini for latency-sensitive turns
- Orchestration: n8n / Make for workflow ops; Node services for hot paths
- Agent runtime: Retell.ai or custom broker with barge-in + endpointing
- Data: Redis (state), Postgres (events), S3 (call recordings, transcripts)
Latency Budget (round-trip)
- RTP jitter + media pipeline: 50–120ms
- STT partials: 150–350ms; endpointing 300–600ms (tune aggressively)
- LLM generation: 250–800ms (streamed), keep tokens small
- TTS synthesis: 120–350ms; cache recurring utterances
- Total target: under 1.6s, ideally under 1.2s, for natural turn-taking
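To see how the per-stage ranges compare with the total target, the arithmetic can be sketched as below. This is a rough check, not a measurement: the stage names are illustrative, and it assumes the stages run strictly back to back, which overstates the total when STT, LLM, and TTS are streamed and overlap.

```javascript
// Per-stage latency ranges in milliseconds, copied from the budget above.
// Assumption: stages run sequentially, so their latencies simply add up.
const stages = {
  media: [50, 120],
  sttPartials: [150, 350],
  endpointing: [300, 600],
  llm: [250, 800],
  tts: [120, 350],
};

// Sum the best-case and worst-case latency across all stages.
function totalLatency(budget) {
  return Object.values(budget).reduce(
    ([lo, hi], [min, max]) => [lo + min, hi + max],
    [0, 0],
  );
}

// Return stage names sorted by worst-case latency, largest first.
function worstOffenders(budget) {
  return Object.entries(budget)
    .sort(([, a], [, b]) => b[1] - a[1])
    .map(([name]) => name);
}

const [best, worst] = totalLatency(stages);
console.log(`best ${best}ms, worst ${worst}ms`); // best 870ms, worst 2220ms
console.log(worstOffenders(stages)[0]); // llm
```

Under this sequential assumption the worst case (2220ms) overshoots the target, so meeting it depends on keeping each stage near the low end of its range or overlapping stages through streaming. LLM generation and endpointing have the largest worst-case figures, which makes them the first places to tune.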
Recommended VAD/Endpointing
- Min speech: 180–250ms; max pause: 450–650ms; noise gate enabled
- Barriers: hold TTS playback until end-of-user-turn is detected with confidence
- Barge-in: allow user interruption; crossfade TTS and resume STT
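The min-speech and max-pause settings above can be read as a small state machine over per-frame VAD decisions. The following is a minimal sketch under stated assumptions: the `Endpointer` class, its option names, and the 20ms frame size are hypothetical, and the input is a plain boolean per frame rather than any particular vendor's VAD output.

```javascript
// Hypothetical endpointer: turns per-frame VAD booleans into turn events.
class Endpointer {
  constructor({ frameMs = 20, minSpeechMs = 200, maxPauseMs = 500 } = {}) {
    this.frameMs = frameMs;
    this.minSpeechMs = minSpeechMs; // voiced audio required before speech counts
    this.maxPauseMs = maxPauseMs;   // silence required to close the turn
    this.speaking = false;
    this.voicedMs = 0;
    this.silenceMs = 0;
  }

  // Feed one VAD decision; returns 'speech_start', 'end_of_turn', or null.
  push(isVoiced) {
    if (isVoiced) {
      this.voicedMs += this.frameMs;
      this.silenceMs = 0; // any voiced frame resets the pause timer
      if (!this.speaking && this.voicedMs >= this.minSpeechMs) {
        this.speaking = true;
        return 'speech_start';
      }
      return null;
    }
    if (!this.speaking) {
      this.voicedMs = 0; // burst shorter than minSpeechMs: treat as noise
      return null;
    }
    this.silenceMs += this.frameMs;
    if (this.silenceMs >= this.maxPauseMs) {
      this.speaking = false;
      this.voicedMs = 0;
      this.silenceMs = 0;
      return 'end_of_turn';
    }
    return null;
  }
}

const ep = new Endpointer();
const frames = [
  ...Array(5).fill(true),   // 100ms burst: below minSpeechMs, ignored
  ...Array(3).fill(false),
  ...Array(15).fill(true),  // 300ms of speech: emits speech_start
  ...Array(25).fill(false), // 500ms of silence: emits end_of_turn
];
const events = frames.map((f) => ep.push(f)).filter(Boolean);
console.log(events); // [ 'speech_start', 'end_of_turn' ]
```

In a design like this, a `speech_start` event that arrives while TTS is playing would be the barge-in signal, and `end_of_turn` would be the point at which held TTS playback is released.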
Call Flow (JSON)
{
  "states": {
    "greeting": { "say": "Hi, this is Nova from Quantum. Is this John?", "on": { "yes": "qualify", "no": "wrong_party", "no_speech": "reprompt" } },
    "qualify": { "ask": "Are you currently evaluating AI for appointment setting?", "capture": ["intent", "timeline"], "on": { "positive": "book", "negative": "objection", "unclear": "clarify" } },
    "book": { "action": "cal.com:create_booking", "on": { "success": "confirm", "failure": "handoff" } },
    "objection": { "say": "Understood. Would a quick demo help decide?", "on": { "yes": "book", "no": "opt_out" } },
    "handoff": { "action": "transfer_to_human" }
  }
}
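A flow definition like the one above can be checked and stepped with very little code. The sketch below is illustrative only: `danglingTargets` and `nextState` are hypothetical helper names, the flow is trimmed to its `on` maps, and staying in the current state on an unhandled event is an assumed policy rather than something the flow specifies.

```javascript
// The call flow from above, trimmed to the fields these checks need.
const flow = {
  states: {
    greeting: { on: { yes: 'qualify', no: 'wrong_party', no_speech: 'reprompt' } },
    qualify: { on: { positive: 'book', negative: 'objection', unclear: 'clarify' } },
    book: { on: { success: 'confirm', failure: 'handoff' } },
    objection: { on: { yes: 'book', no: 'opt_out' } },
    handoff: {},
  },
};

// List every transition target that is not a defined state.
function danglingTargets(flow) {
  const defined = new Set(Object.keys(flow.states));
  const missing = new Set();
  for (const state of Object.values(flow.states)) {
    for (const target of Object.values(state.on ?? {})) {
      if (!defined.has(target)) missing.add(target);
    }
  }
  return [...missing];
}

// Resolve the next state for an event.
// Assumption: an unhandled event leaves the call in its current state.
function nextState(flow, current, event) {
  return flow.states[current]?.on?.[event] ?? current;
}

console.log(danglingTargets(flow));
// [ 'wrong_party', 'reprompt', 'clarify', 'confirm', 'opt_out' ]
console.log(nextState(flow, 'greeting', 'yes')); // qualify
```

The check shows that the flow as written references five states it does not define (`wrong_party`, `reprompt`, `clarify`, `confirm`, `opt_out`). Each would need a definition before the flow is deployed, and running a check like this at load time catches the gap before a live call reaches it.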
Webhooks and Functions
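// Example: booking webhook -> HubSpot create engagement
// Assumes `hubspot` and `db` are app-specific clients initialized elsewhere.
const express = require('express');
const app = express();
app.use(express.json()); // populate req.body from JSON payloads
app.post('/webhooks/booking', async (req, res) => {
  const { attendee, start, end, uid } = req.body ?? {};
  if (!attendee?.email || !uid) return res.sendStatus(400); // reject malformed payloads
  try {
    await hubspot.createMeeting({ subject: 'Voice booking', start, end, attendees: [attendee.email] });
    await db.insert('bookings', { uid, attendee_email: attendee.email, start, end });
    res.sendStatus(200);
  } catch (err) { res.sendStatus(500); } // surface the failure to the sender instead of hanging
});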
Good / Bad / Ugly
- Good: Fast barge-in, tight endpointing, short utterances, deterministic tool-calls.
- Bad: Long-form LLM replies, no caching, unbounded function schemas.
- Ugly: Voicemail false-positives, DTMF misfires, timezone mismatches in bookings.
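One way to read "deterministic tool-calls" versus "unbounded function schemas" is that every tool argument is named, typed, and bounded, and anything else is rejected before the tool runs. The sketch below is a hand-rolled illustration, not a schema library: the `createBookingSchema` fields and the `validateArgs` helper are hypothetical. Requiring an explicit UTC offset in `start` is one possible guard against the timezone mismatches noted above.

```javascript
// Hypothetical argument schema for a booking tool: a closed set of fields,
// each with a type and a bound.
const createBookingSchema = {
  email: { type: 'string', maxLength: 254 },
  start: { type: 'string', maxLength: 40 }, // ISO 8601 timestamp with offset
  durationMinutes: { type: 'number', enum: [15, 30, 60] },
};

// Return a list of problems; an empty list means the arguments are acceptable.
function validateArgs(schema, args) {
  const errors = [];
  for (const key of Object.keys(args)) {
    if (!Object.hasOwn(schema, key)) errors.push(`unexpected field: ${key}`);
  }
  for (const [key, rule] of Object.entries(schema)) {
    const value = args[key];
    if (value === undefined) {
      errors.push(`missing field: ${key}`);
    } else if (typeof value !== rule.type) {
      errors.push(`wrong type for ${key}`);
    } else if (rule.maxLength !== undefined && value.length > rule.maxLength) {
      errors.push(`${key} too long`);
    } else if (rule.enum && !rule.enum.includes(value)) {
      errors.push(`invalid value for ${key}`);
    }
  }
  return errors;
}

const ok = validateArgs(createBookingSchema, {
  email: 'john@example.com',
  start: '2025-01-15T10:00:00-05:00',
  durationMinutes: 30,
});
console.log(ok); // []

const bad = validateArgs(createBookingSchema, {
  email: 'john@example.com',
  start: '2025-01-15T10:00:00-05:00',
  durationMinutes: 45,
  notes: 'free-form text',
});
console.log(bad); // [ 'unexpected field: notes', 'invalid value for durationMinutes' ]
```

Rejecting unknown fields and out-of-range values keeps the model from inventing arguments, and the error list can be fed back to it for a single corrected retry.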