Glossary
Voice Activity Detection (VAD)
Voice activity detection (VAD) is the technique a voice system uses to determine, in real time, whether an audio stream currently contains human speech — as opposed to silence, background noise, music on hold, or line artifacts. It is the lowest-level perceptual building block of a voice agent: before anything can be transcribed, understood, or interrupted, the system has to know that someone is talking.
Modern VAD is typically a small neural model that runs continuously on incoming audio, scoring each short frame (usually 10–30 ms long) with a speech/non-speech probability. Because it runs on every frame of every call, it has to be extremely cheap computationally while staying robust to the messy reality of phone audio: traffic noise, speakerphones, side conversations, and the 8 kHz narrowband codecs still common on the PSTN.
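To make the frame-by-frame loop concrete, here is a minimal Python sketch. It is not a neural VAD: it substitutes a simple energy threshold for the model so that the example stays self-contained, and the frame length, noise floor, and 0.5 decision threshold are illustrative assumptions rather than recommended values.

```python
import math
import struct

SAMPLE_RATE_HZ = 8000                               # narrowband telephone audio
FRAME_MS = 20                                       # assumed; anywhere in 10-30 ms is typical
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 160 samples per frame


def frames(pcm: bytes, frame_samples: int = FRAME_SAMPLES):
    """Yield consecutive frames of 16-bit little-endian mono PCM as tuples of ints."""
    frame_bytes = frame_samples * 2
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield struct.unpack(f"<{frame_samples}h", pcm[start:start + frame_bytes])


def speech_probability(frame, noise_floor_db: float = -50.0, span_db: float = 30.0) -> float:
    """Map frame energy (dBFS) to a 0..1 score.

    Energy-based stand-in for the probability a neural VAD would emit.
    noise_floor_db and span_db are hypothetical tuning values.
    """
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms == 0:
        return 0.0
    level_db = 20 * math.log10(rms / 32768.0)
    return min(1.0, max(0.0, (level_db - noise_floor_db) / span_db))


def make_tone(duration_ms: int, amplitude: int, freq_hz: float = 200.0) -> bytes:
    """Synthesize test audio: a sine tone (amplitude 0 gives digital silence)."""
    n = SAMPLE_RATE_HZ * duration_ms // 1000
    samples = [int(amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ))
               for i in range(n)]
    return struct.pack(f"<{n}h", *samples)


# 100 ms of silence followed by 100 ms of a loud tone standing in for speech.
audio = make_tone(100, 0) + make_tone(100, 10000)
probs = [speech_probability(f) for f in frames(audio)]
decisions = [p >= 0.5 for p in probs]   # five non-speech frames, then five speech frames
```

A production system would swap `speech_probability` for a trained model, since plain energy cannot tell speech apart from traffic noise or hold music at a similar level.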
Why it matters for voice agents
VAD quality propagates through everything above it. Three functions depend on it directly:
- Turn-taking. VAD provides the raw speech/silence signal that endpointing uses to decide when a caller has finished their turn. A VAD that is slow to register the start or end of speech adds latency; one that mistakes breath sounds for speech keeps resetting the silence timer, so the agent goes on waiting after the caller has already finished.
- Barge-in. To let callers interrupt, the system must run VAD on inbound audio while the agent is talking and reliably separate genuine caller speech from echo of the agent’s own voice.
- Cost and accuracy. Streaming silence into a speech-to-text engine wastes money and can produce hallucinated transcripts. VAD gates what audio is worth transcribing.
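The gating role in the last bullet can be sketched as a small filter that forwards audio to the transcriber only around detected speech. This is an illustrative sketch, not a specific platform's implementation: the function name `gate_frames` and the `pre_roll` and `hangover` parameters are hypothetical. The pre-roll keeps a few frames from before speech onset so the first syllable is not clipped, and the hangover keeps a few trailing frames so word endings survive.

```python
from collections import deque


def gate_frames(frames, is_speech, pre_roll=3, hangover=5):
    """Yield only the frames worth sending to a speech-to-text engine.

    frames:    iterable of audio frames (any objects)
    is_speech: parallel iterable of per-frame VAD decisions (bool)
    pre_roll:  frames before speech onset to include (assumed value)
    hangover:  non-speech frames to keep after speech stops (assumed value)
    """
    recent = deque(maxlen=pre_roll)   # rolling buffer of not-yet-sent frames
    silence_run = None                # None means the gate is closed
    for frame, speech in zip(frames, is_speech):
        if speech:
            if silence_run is None:   # gate opens: flush the pre-roll first
                yield from recent
                recent.clear()
            silence_run = 0
            yield frame
        elif silence_run is not None and silence_run < hangover:
            silence_run += 1          # still inside the hangover window
            yield frame
        else:
            silence_run = None        # gate closed: buffer instead of sending
            recent.append(frame)


# Frames are numbered 0..19; the VAD flags frames 6..9 as speech.
flags = [6 <= i <= 9 for i in range(20)]
sent = list(gate_frames(range(20), flags, pre_roll=2, hangover=3))
# sent == [4, 5, 6, 7, 8, 9, 10, 11, 12]
```

In this toy run only 9 of 20 frames reach the transcriber. Barge-in needs more than this gate, because the inbound audio must first be cleaned of the agent's own echo before the VAD decision can be trusted.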
The failure modes are familiar to anyone who has used a bad voice bot: an agent that interrupts you because it decided your pause for thought was the end of your sentence (over-aggressive VAD/endpointing), or one that leaves three seconds of dead air after you stop because it’s still waiting to be sure (over-cautious). Telephone audio makes this genuinely hard — which is why production-grade voice platforms treat VAD tuning as an engineering discipline rather than a checkbox.
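The two failure modes come down to one tuning knob in the simplest endpointing scheme: how much continuous silence is required before the turn is declared over. The sketch below assumes 20 ms frames and a fixed silence timeout; the function name `end_of_turn_ms` and the timeout values are hypothetical, and real endpointers often use additional signals beyond a fixed timer.

```python
FRAME_MS = 20  # assumed frame length


def end_of_turn_ms(is_speech, silence_timeout_ms):
    """Return the time (ms) at which end of turn is declared, or None.

    End of turn fires once speech has been heard and is then followed by
    silence_timeout_ms of consecutive non-speech frames.
    """
    needed = silence_timeout_ms // FRAME_MS
    heard_speech = False
    silent = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent = 0                      # any speech resets the silence timer
        elif heard_speech:
            silent += 1
            if silent >= needed:
                return (i + 1) * FRAME_MS
    return None


# 1 s of speech, a 600 ms pause for thought, 1 s more speech, then 3 s of silence.
turn = [True] * 50 + [False] * 30 + [True] * 50 + [False] * 150

aggressive = end_of_turn_ms(turn, 400)    # 1400: fires inside the pause, cutting the caller off
balanced = end_of_turn_ms(turn, 800)      # 3400: 800 ms after the caller really stops at 2600
cautious = end_of_turn_ms(turn, 3000)     # 5600: three seconds of dead air
```

The caller actually finishes at 2600 ms. The 400 ms timeout interrupts mid-sentence, while the 3000 ms timeout is safe against pauses but leaves the long gap described above.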
Related terms
- Endpointing — the turn-completion decision built on top of VAD
- Barge-in — interruption handling that depends on always-on VAD
- Time to First Audio (TTFA) — the end-to-end latency metric VAD contributes to