We listened to the first 200 production calls before the voice agent went live for our financial services client. Forty-seven of them had something we'd want to fix. Eleven had a bug. Three were so bad we held the launch. Getting through all 200 took six hours of grading.
That's the moment most SMEs running voice agents discover that QA at scale is a different problem from QA at proof-of-concept. The agent that scored 8.5/10 across a 30-call pilot now handles 200 calls per day in production, and "average score" tells you almost nothing about whether the next call will book a meeting or chase a customer away.
Voice agent QA isn't about hitting an average. It's about building a scorecard system that catches specific failure modes — the script drift, the consent miss, the dropped emotional cue — before they compound into a churned customer or a regulatory complaint. This post is what that system looks like in production. The dimensions worth grading. The auto-grade vs human-grade split. And what changed in 2025–2026 that makes the architecture genuinely different from what it was a year ago.
How to decide in 30 seconds
Are you running >50 production voice agent calls per day?
YES → automated grading is mandatory. Continue.
NO → human grading on every call is fine. Stop.
Do calls have regulated content (consent, recording disclosure, advice)?
YES → 100% grading + human spot-check on every flagged call.
NO → 100% LLM grading + sample-based human review.
Are you A/B testing prompts or voice variants?
YES → side-by-side grading on the same call types.
NO → start A/B testing. The compounding wins are in the test loop.
Why averages lie about voice agents
The average call quality score is the most misleading metric in voice AI. A pipeline that scores 8.5/10 on average can be hiding any of three failure modes:
Bimodal distribution. Most calls land at 9/10, a small tail at 3/10. The 3s are the ones where the agent missed the consent step, transferred to the wrong queue, or hallucinated a number. Averaging hides them.
Grade inflation by easy dimensions. "Did the agent introduce itself" is graded at 99% pass; "did the agent successfully de-escalate an angry customer" is graded at 64% pass. If the rubric is unweighted, the easy wins drown the hard ones.
Drift over time. A scorecard average that holds at 8.5 for six weeks is a lagging indicator if the underlying behaviour shifts — the agent gets slightly worse at edge cases week by week, but the average stays put because the easy cases keep scoring high.
The fix is the same in all three cases: weight the rubric by what you actually care about, track distribution shape (not just mean), and grade the failure-mode dimensions explicitly.
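To make the distribution point concrete, here's a minimal sketch (illustrative numbers, not client data) of how a healthy-looking mean coexists with a 10% failure tail:

```python
import statistics

# Illustrative distribution, not real data: 90 calls at 9/10, 10 calls at 3/10.
scores = [9.0] * 90 + [3.0] * 10

mean = statistics.mean(scores)                        # 8.4, looks healthy
worst_decile = sorted(scores)[: len(scores) // 10]    # the 10 lowest-scoring calls
tail_rate = sum(s <= 5 for s in scores) / len(scores)

print(f"mean={mean:.1f}, worst-decile mean={statistics.mean(worst_decile):.1f}, "
      f"share of calls at or below 5/10: {tail_rate:.0%}")
# mean=8.4, worst-decile mean=3.0, share of calls at or below 5/10: 10%
```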
The dimensions worth grading
A working voice agent scorecard for a sales-qualification flow has six dimensions. The weights matter as much as the dimensions.
| Dimension | What it grades | Weight |
|---|---|---|
| Script adherence | Hit all required steps in order | 1.0 |
| Data capture | Captured the structured fields (name, intent, timeline) | 1.5 |
| Consent + disclosure | Recording notice, AI disclosure, opt-out path | 2.0 |
| Conversational quality | No hallucinations, no awkward turn-taking, recovery from interruption | 1.0 |
| Outcome handling | Booked / handed off / opted-out cleanly | 1.5 |
| Tone match | Appropriate tone for the prospect's signal (frustrated, busy, curious) | 0.5 |
These weights come from real production deployments. Consent and disclosure carry double weight because the cost of getting them wrong is regulatory, not just commercial. Tone match carries half weight because LLM graders are weakest on this dimension and over-weighting it would inject noise.
The aggregate score is a weighted average. The dashboard view is the per-dimension distribution — you want to see the histogram for each dimension, not the single aggregate number.
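Here's a minimal sketch of that aggregation in Python. The weights mirror the table above; the dimension scores are the ones used in the sample scorecard below:

```python
# Sketch of the weighted aggregate. Weights mirror the table above; the input
# shape mirrors the scorecard JSON in the next section.
def weighted_score(dimensions: dict) -> float:
    total = sum(d["score"] * d["weight"] for d in dimensions.values())
    weight_sum = sum(d["weight"] for d in dimensions.values())
    return round(total / weight_sum, 1)

example = {
    "script_adherence":       {"score": 90,  "weight": 1.0},
    "data_capture":           {"score": 100, "weight": 1.5},
    "consent_disclosure":     {"score": 100, "weight": 2.0},
    "conversational_quality": {"score": 85,  "weight": 1.0},
    "outcome_handling":       {"score": 100, "weight": 1.5},
    "tone_match":             {"score": 80,  "weight": 0.5},
}
print(weighted_score(example))  # 95.3
```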
Sample scorecard JSON
What an automated grade looks like for a single call, written by an LLM grader and stored in your QA system:
```json
{
"call_id": "call_abc123",
"agent_version": "v3.2",
"duration_sec": 287,
"scorecard_v": "2025-09",
"dimensions": {
"script_adherence": {"score": 90, "weight": 1.0, "notes": "Skipped pre-qual question 3"},
"data_capture": {"score": 100, "weight": 1.5, "notes": "All fields captured"},
"consent_disclosure": {"score": 100, "weight": 2.0, "notes": "Recording + AI both stated in opening"},
"conversational_quality": {"score": 85, "weight": 1.0, "notes": "Two awkward pauses 0:42, 1:53"},
"outcome_handling": {"score": 100, "weight": 1.5, "notes": "Booked Thu 14:00"},
"tone_match": {"score": 80, "weight": 0.5, "notes": "Slightly formal for chatty prospect"}
},
"weighted_score": 94.4,
"flags": [],
"human_review_required": false,
"human_reviewed": null
}
```
Two design choices in this schema do most of the work. Weights live with the score, so the rubric version is auditable per-call (scorecards drift over time as the agent matures; old grades shouldn't be invalidated). Flags and human-review fields are separate, so the reviewer queue is just WHERE human_review_required = true OR weighted_score < 80.
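A minimal sketch of the flagging side, assuming the scorecard shape above (the specific thresholds and flag names are illustrative, not the only sensible choices):

```python
# Sketch of the flagging rules that populate flags / human_review_required.
# Thresholds and flag names are illustrative assumptions.
REGULATED_DIMENSIONS = {"consent_disclosure"}

def flag_call(scorecard: dict) -> dict:
    flags = []
    for name, dim in scorecard["dimensions"].items():
        if name in REGULATED_DIMENSIONS and dim["score"] < 100:
            flags.append(f"regulated:{name}")
        elif dim["score"] < 70:
            flags.append(f"low:{name}")
    scorecard["flags"] = flags
    scorecard["human_review_required"] = bool(flags) or scorecard["weighted_score"] < 80
    return scorecard
```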
Auto-grading vs human grading
| Approach | Cost per call | Latency | Best for | Failure mode |
|---|---|---|---|---|
| 100% LLM grading | £0.02 (Haiku) | < 30s | Default everywhere | Misses tone, borderline compliance |
| 100% human grading | £2–5 | 1–5 min/call | Pilot phase, regulated content | Doesn't scale past 50 calls/day |
| LLM + biased human spot-check | £0.02 + £2 × 5–10% | < 30s + async | SME production default | Requires good flagging logic |
| Side-by-side LLM vs human (calibration) | £0.02 + £4 | 5 min | Quarterly rubric tuning | Gets skipped unless the session is actually scheduled |
The split most SMEs settle on after 4–6 weeks of running QA: LLM grades 100% of calls, humans review 5–10% — biased toward the calls the LLM flagged low, calls where the agent took an unusual code path, and a random 1% sample for calibration.
A second receipt: a UK property management client running several hundred voice agent calls per week settled on this exact split after starting at 100% human grading (unsustainable) and swinging to 100% LLM (which missed several regulatory edge cases in the first month). The hybrid pattern caught roughly 95% of real issues at a fraction of the human-only review cost.
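The economics are easy to sanity-check. A back-of-envelope sketch, assuming 300 calls/day, the per-call costs from the table (low end of the human range), and a 7.5% spot-check rate:

```python
# Back-of-envelope daily QA cost. Per-call prices mirror the table above;
# 300 calls/day and a 7.5% spot-check rate are illustrative assumptions.
calls_per_day = 300
llm_per_call, human_per_call = 0.02, 2.00   # GBP
spot_check_rate = 0.075                     # midpoint of the 5-10% range

human_only = calls_per_day * human_per_call
hybrid = calls_per_day * (llm_per_call + spot_check_rate * human_per_call)
print(f"human-only: £{human_only:.0f}/day, hybrid: £{hybrid:.2f}/day")
# human-only: £600/day, hybrid: £51.00/day
```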
Sampling strategies
If you're under 50 calls/day, grade them all and skip the rest of this section. Above that volume, the sampling strategy matters more than the grading volume.
Stratified random sampling. Split calls by category (booking, qualification, escalation, refusal, missed-call recovery) and sample proportionally. Catches category-specific drift that uniform random sampling misses.
Importance sampling. Over-sample the categories where errors are expensive — escalations, regulated calls, high-value prospects. A 30% sample of escalations and a 5% sample of routine bookings is honest about where the risk lives.
Trigger-based sampling. The agent itself flags calls for review when its own confidence is low, the prospect used certain words ("complaint", "manager", "lawyer"), or the call duration is unusually long or short. Trigger-based sampling catches what random sampling misses by definition.
The right pattern is all three: stratified base coverage, importance weighting on top, trigger-based flagging as the safety net.
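A minimal sketch of what that combination looks like as a selection function. The category rates, trigger words, and thresholds are illustrative assumptions, not recommendations:

```python
import random

# Sketch combining the three strategies: stratified base rates per category,
# higher rates where errors are expensive, and hard triggers that always pull
# a call into the graded sample. All values here are illustrative assumptions.
BASE_RATE = {"booking": 0.05, "qualification": 0.10, "refusal": 0.10,
             "missed_call_recovery": 0.10, "escalation": 0.30}
TRIGGER_WORDS = {"complaint", "manager", "lawyer"}

def select_for_grading(call: dict) -> bool:
    transcript = call["transcript"].lower()
    if any(word in transcript for word in TRIGGER_WORDS):
        return True                                  # trigger-based safety net
    if call["agent_confidence"] < 0.6:
        return True                                  # the agent flagged itself
    if not 30 <= call["duration_sec"] <= 900:
        return True                                  # unusually short or long
    return random.random() < BASE_RATE.get(call["category"], 0.10)
```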
A subtle sampling failure worth naming: time-of-day bias. Calls handled at 2am have systematically different characteristics from calls at 2pm — older agent prompt versions are still active at unusual hours, prospects are different (insomnia or international callers), and recording quality varies by network conditions. A sample that doesn't stratify by hour-of-day will overweight a few peak hours and miss real drift in off-peak performance. We bucket every QA review by 4-hour blocks and ensure each block has at least two graded calls per week. It's a small overhead and catches the kind of slow regression that 'overall average' will hide for months.
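A sketch of that coverage check, assuming each graded call carries an ISO-format started_at timestamp (the field name is an assumption):

```python
from collections import Counter
from datetime import datetime

# Sketch of the 4-hour-block coverage check: every block should see at least
# two graded calls per week.
MIN_GRADED_PER_BLOCK = 2

def uncovered_blocks(week_of_graded_calls: list[dict]) -> list[int]:
    counts = Counter(
        datetime.fromisoformat(call["started_at"]).hour // 4
        for call in week_of_graded_calls
    )
    return [block for block in range(6) if counts[block] < MIN_GRADED_PER_BLOCK]
```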
A related trap is calibrating sampling to the wrong base rate. If 95% of calls are routine bookings and 5% are escalations, random sampling gives you almost no escalation data unless you over-sample that path explicitly. Decide the minimum number of escalation calls you want to grade per week as a target, work backwards to the sample rate needed, and apply that rate to escalations specifically. The same logic applies to any rare-but-high-cost call category.
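The back-calculation is one line of arithmetic. A sketch with illustrative numbers:

```python
# Work backwards from the target: at least 10 graded escalations per week,
# with escalations at 5% of roughly 1,500 weekly calls (illustrative numbers).
weekly_calls = 1500
escalation_share = 0.05
target_graded_escalations = 10

escalations_per_week = weekly_calls * escalation_share           # 75
required_rate = target_graded_escalations / escalations_per_week
print(f"escalation-specific sample rate: {required_rate:.0%}")   # 13%
```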
Reference architecture for a QA pipeline
- Storage: every call recorded with a structured event log (state transitions, tool calls, transcripts). The same Postgres or S3 layer used for the voice agent itself — don't build a separate QA store.
- Transcription: Deepgram or equivalent for production transcripts; the live STT can sometimes be reused if it includes word-level timestamps and speaker turns.
- Grader: Claude Haiku or GPT-4o-mini for cost-efficient grading. Prompt includes the rubric, weights, and 3–5 graded examples per dimension. Returns the JSON shape above (a sketch of this call follows the list).
- Flagging logic: rules in code that mark calls for human review (low aggregate, specific dimension below threshold, regulated content categories, agent took unusual path).
- Reviewer UI: transcript with the agent's actions inline, the LLM-grade scorecard, audio replay, and one-click adjust score / approve. We build this as a small Next.js or htmx app in a few days; existing tools like Hume AI's call analysis cover the emotion-grading dimension if your rubric needs it.
- Reporting: weekly scorecard dashboard. Distribution per dimension (not just averages), trend lines week-over-week, top failure modes. This becomes the input for the next round of prompt iteration on the appointment booking flow or whatever the agent's primary task is.
- Feedback loop: human-overridden grades feed a "calibration set" used to test new prompt versions. The same calibration set runs against every prompt change before the change ships to production.
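For the grader step flagged above, here's a minimal sketch using the Anthropic Python SDK. The prompt wording, RUBRIC placeholder, and model string are assumptions to adapt, not a drop-in implementation:

```python
import json
import anthropic  # assumes the official Anthropic Python SDK and ANTHROPIC_API_KEY set

# Sketch of the grader step. RUBRIC would hold the dimensions, weights and
# 3-5 graded examples per dimension; the model returns the scorecard JSON.
RUBRIC = "..."  # rubric text, weights, graded examples per dimension

def grade_call(call_id: str, transcript: str) -> dict:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # any Haiku-class model works here
        max_tokens=1024,
        system=(
            "You are a voice agent QA grader. Grade the call against the rubric "
            "and return only the scorecard JSON, no prose.\n\n" + RUBRIC
        ),
        messages=[{"role": "user", "content": f"call_id: {call_id}\n\n{transcript}"}],
    )
    return json.loads(response.content[0].text)
```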
What changed in 2025–2026
LLM grading reached human parity on most rubric dimensions. Empirical studies through 2024–2025 (the Zheng et al. paper cited below is the most useful) demonstrate that LLM-as-judge with a tight rubric and graded examples now agrees with expert humans 85–92% of the time across most dimensions. The exceptions remain tone, rapport, and borderline regulatory cases. This shifted the cost-effective grading split from 30/70 LLM/human to 95/5 in our deployments.
Real-time grading became feasible. Latency for grading a 5-minute call dropped under 10 seconds with Haiku-class models. This means QA can run inline — grade in real-time and surface low-scoring calls within seconds of hangup, while the agent state and prospect context are still fresh. The lag in the post-hoc-grading model breaks coaching loops.
The counterpoint worth tracking. Some operators argue that auto-grading creates a Goodhart problem — agents are tuned to the rubric, the rubric drifts from reality, and quality degrades while metrics improve. The 2023 Judging LLM-as-a-Judge paper by Zheng et al. is the most-cited examination of where LLM judges agree and where they systematically diverge from human reviewers; the failure modes it documents matter operationally — scorecards must be re-derived from raw call review (not just iterated within their own dimensions) every quarter to stay grounded.
Good / Bad / Ugly
Good. Weighted rubric with regulated dimensions weighted up. Distribution-aware reporting (not just averages). LLM grading on 100% with human spot-checks on flags. Calibration sessions every quarter where a human and the LLM grade the same calls and you compare. Trend lines on individual dimensions, not just aggregate.
Bad. Single average score with no distribution. Same weight on every dimension. Grading only a random sample. Manual reviewers seeing transcripts but not the agent's tool calls. Scorecards that never get re-tuned.
Ugly. Performance-managing individual agents (or in our case, individual prompt versions) on their grade. Goodhart-trap rubrics that the agent eventually games. Scorecards that grade the easy dimensions and avoid consent and de-escalation because they're harder to grade. QA reports that go to leadership weekly but never trigger a prompt change.
The financial services client we held the launch for is now running well over a thousand calls per day with a hybrid auto-grade plus ~7% human spot-check, weekly distribution review, and quarterly calibration. Aggregate score has held in the low 90s for the past several weeks. The distribution shape is what we actually monitor — and the failure-mode tail has shrunk from roughly 9% of calls in week one to under 2% now. That's the metric the average never showed.