Glossary
Endpointing
Endpointing is the process by which a voice agent decides that the caller has finished their conversational turn and it is now the agent’s turn to speak. It answers a deceptively hard question — “are they done, or just pausing?” — and it does so under time pressure, because every millisecond spent deciding is added directly to the response latency the caller perceives.
The naive approach is a fixed silence timeout: if voice activity detection reports no speech for, say, 700ms, declare the turn over. The problem is that human speech is full of mid-turn pauses — “my account number is… let me find it… four four two…” — so a short timeout cuts callers off, while a long timeout adds its full duration as dead air to every single turn, even crisp ones like “yes.”
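The fixed-timeout approach can be sketched in a few lines. This is a minimal illustration, not a production endpointer: it assumes the VAD emits one boolean per 20ms frame (the frame size and the function name `fixed_timeout_endpoint` are hypothetical choices for this example).

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed VAD frame duration; real systems vary


def fixed_timeout_endpoint(vad_frames: Iterable[bool],
                           timeout_ms: int = 700) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None.

    vad_frames yields True for frames with speech and False for silence.
    Silence before the caller has said anything is ignored.
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= timeout_ms:
                return (i + 1) * FRAME_MS  # end of the frame that hit the timeout
    return None


# A crisp turn: 200ms of speech followed by silence.
crisp = [True] * 10 + [False] * 50
# A turn with an 800ms mid-turn pause, then more speech, then silence.
paused = [True] * 10 + [False] * 40 + [True] * 10 + [False] * 50
```

With the 700ms timeout, the crisp turn is only declared over 700ms after the speech ends, and the paused turn is cut off inside the pause. Raising the timeout to 1000ms survives the pause but adds a full second of dead air after every turn.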
Semantic endpointing addresses this by combining the acoustic silence signal with the content of the partial transcript. A lightweight model judges whether the utterance is linguistically complete: “What’s my balance?” is a finished turn after even a short pause; “What’s my…” is not, however long the silence lasts. Prosodic cues, such as falling pitch and trailing energy, feed in as well.
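One way to picture the combination is a silence threshold that depends on whether the transcript looks complete. The sketch below is illustrative only: `looks_complete` is a crude keyword heuristic standing in for the lightweight model, the 200ms and 1500ms thresholds are assumed values, and prosodic cues are left out. The long threshold is an assumed safety fallback so the agent does not wait forever on a caller who never finishes the sentence.

```python
# Words that rarely end a finished English utterance (hypothetical, far from exhaustive).
INCOMPLETE_ENDINGS = ("my", "the", "a", "an", "is", "and", "or", "to", "of", "for")


def looks_complete(partial_transcript: str) -> bool:
    """Crude stand-in for a model that judges linguistic completeness."""
    text = partial_transcript.strip().lower()
    if not text or text.endswith(("...", "…")):
        return False  # empty or explicitly trailing off
    words = text.rstrip("?.!").split()
    return bool(words) and words[-1] not in INCOMPLETE_ENDINGS


def should_end_turn(partial_transcript: str, silence_ms: int,
                    short_ms: int = 200, long_ms: int = 1500) -> bool:
    """End the turn quickly on complete utterances, patiently on incomplete ones."""
    threshold = short_ms if looks_complete(partial_transcript) else long_ms
    return silence_ms >= threshold
```

Under these assumptions, `should_end_turn("What's my balance?", 250)` ends the turn after a quarter of a second, while `should_end_turn("What's my…", 700)` keeps waiting even though a fixed 700ms timeout would already have fired.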
Why it matters for voice agents
Endpointing is the most underrated latency lever in voice AI. In a typical pipeline, a fixed 700ms timeout can be the single largest line in the latency budget, larger even than the LLM’s time-to-first-token. Semantic endpointing routinely recovers 300–500ms per turn while also reducing the rate at which the agent talks over callers. It directly reduces time to first audio, the metric that determines whether an agent feels human or robotic.
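The budget argument is simple addition, which a short worked example makes concrete. Only the 700ms timeout comes from the text above; the other stage values (400ms LLM time-to-first-token, 200ms for the first TTS chunk, 300ms average semantic endpointing delay) are placeholder assumptions chosen for illustration, not measurements.

```python
def time_to_first_audio(budget_ms: dict) -> int:
    """Sum the sequential stages between end of caller speech and first agent audio."""
    return sum(budget_ms.values())


# Illustrative budget with a fixed 700ms silence timeout (other values assumed).
fixed = {"endpointing": 700, "llm_ttft": 400, "tts_first_chunk": 200}

# Same pipeline with an assumed 300ms average semantic endpointing delay.
semantic = {**fixed, "endpointing": 300}

largest_stage = max(fixed, key=fixed.get)
saved_ms = time_to_first_audio(fixed) - time_to_first_audio(semantic)
```

In this hypothetical budget, endpointing is the largest single stage, and swapping the endpointer alone takes time to first audio from 1300ms to 900ms without touching the LLM or TTS.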
It also shapes conversation quality beyond speed. Aggressive endpointing trains callers to rush and produces fragmented transcripts that confuse the LLM; cautious endpointing produces the stilted, walkie-talkie rhythm people associate with bad voice bots. Getting it right — fast on complete utterances, patient on incomplete ones — is a large part of what separates production-grade platforms from demos.
Related terms
- Voice Activity Detection (VAD) — the speech/silence signal endpointing builds on
- Barge-in — handling the reverse case, where the caller takes the turn back
- Time to First Audio (TTFA) — the latency metric endpointing feeds into