
Voice agents that answer at the speed of conversation

A voice agent is only as good as its pause. Rayvoc is engineered end to end — media, models, and telephony in one stack — to keep voice-to-voice response under a second, consistently, on real phone calls.


<1s voice-to-voice target

~200ms human turn-taking gap

0 hops to a third-party carrier API

Per-call latency breakdown in dashboard

Why latency is the metric that matters

Humans leave about 200–300 milliseconds between conversational turns. When an agent takes two seconds to reply, callers assume the line dropped, talk over the agent, or hang up. Latency isn’t a nice-to-have in voice AI — it is the difference between “that felt like a person” and “that was a robot.”

Most platforms advertise “sub-second” latency but measure only one stage of the pipeline. The honest number is time-to-first-audio: from the moment the caller stops speaking to the first syllable of the response, including the telephone network. That is the number Rayvoc optimizes — and the number we show you on every single call.

Where the milliseconds go

A typical voice agent turn passes through five stages. Each one is an opportunity to stream — or to stall.

Network & media transport: PSTN/SIP leg into the media layer

Speech recognition (or S2S ingest): streaming, partial results

LLM time-to-first-token: the biggest and most variable slice

Speech synthesis (TTFA): streaming synthesis, first audio chunk

Turn detection & endpointing: knowing when the caller finished

Relative share of a typical turn’s latency budget. LLM time-to-first-token dominates — which is why model choice and streaming matter most.

How Rayvoc engineers it out

Streaming at every stage

Recognition emits partial transcripts while the caller is still talking; the LLM starts generating on the first stable tokens; synthesis starts speaking on the first sentence. No stage waits for the previous one to finish.
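To make the idea concrete, here is a minimal sketch of stage-to-stage streaming using Python generators. It is a toy under stated assumptions, not Rayvoc's implementation: `microphone`, `recognize`, `generate`, `synthesize`, and `run_turn` are hypothetical names, each "audio frame" is already a word, and the "LLM" simply upper-cases what it hears. What it does show is the ordering: the first audio chunk is spoken before the last frame has been read.

```python
from typing import Iterable, Iterator, List


def microphone(words: Iterable[str], events: List[str]) -> Iterator[str]:
    """Hypothetical audio source: logs each frame as it is consumed."""
    for word in words:
        events.append(f"heard {word}")
        yield word


def recognize(frames: Iterable[str]) -> Iterator[str]:
    """Stand-in recognizer: emits one partial result per incoming frame."""
    for frame in frames:
        yield frame


def generate(partials: Iterable[str]) -> Iterator[str]:
    """Stand-in LLM: emits a token as soon as each partial arrives."""
    for partial in partials:
        yield partial.upper()


def synthesize(tokens: Iterable[str]) -> Iterator[str]:
    """Stand-in TTS: emits a chunk at every sentence boundary."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if token.endswith((".", "?", "!")):
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield " ".join(buffer)


def run_turn(words: Iterable[str], events: List[str]) -> List[str]:
    """Chain the stages lazily; each one pulls from the stage before it."""
    chunks = []
    for chunk in synthesize(generate(recognize(microphone(words, events)))):
        events.append(f"spoke {chunk}")
        chunks.append(chunk)
    return chunks
```

Calling `run_turn(["hi", "there.", "how", "are", "you?"], events)` with an empty `events` list records the first "spoke" entry before the frame "how" is heard, because every stage is lazy and nothing is buffered beyond one sentence.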

Speech-to-speech when you want it

Rayvoc supports native speech-to-speech models — including Grok’s voice models, with ~0.78s time-to-first-audio in independent testing — which collapse the STT→LLM→TTS chain into a single model. For pipeline setups, you choose every component; for the fastest possible turns, go speech-native. Read more in how it works.

The carrier is in the same stack

Platforms that resell telephony add a network round-trip to a third-party API on every call event. Rayvoc’s media layer and telecom layer are the same system — your call audio never detours through another vendor’s cloud. Using your own carrier? We terminate SIP as close to your trunk as possible.

Turn detection that doesn’t guess

Fixed silence timeouts either cut callers off or add dead air. Rayvoc combines voice activity detection with semantic endpointing so the agent responds the moment the caller is actually done — and supports barge-in so callers can interrupt naturally.
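As a rough illustration of why combining the two signals helps, the sketch below picks a silence threshold based on whether the transcript looks finished. Everything here is an assumption for illustration: the function names, the 200 ms and 1200 ms thresholds, and especially the word-list heuristic, which stands in for whatever semantic model a production system would use.

```python
# Words that usually signal the caller is mid-thought (toy heuristic only).
TRAILING_FILLERS = {"and", "but", "so", "or", "because", "um", "uh"}


def looks_complete(transcript: str) -> bool:
    """Crude semantic check: an utterance ending on a filler is unfinished."""
    words = transcript.lower().rstrip(" .,?!").split()
    return bool(words) and words[-1] not in TRAILING_FILLERS


def should_respond(
    silence_ms: int,
    transcript: str,
    short_ms: int = 200,   # hypothetical threshold for finished utterances
    long_ms: int = 1200,   # hypothetical threshold for unfinished ones
) -> bool:
    """Respond quickly after a complete thought, wait longer otherwise."""
    threshold = short_ms if looks_complete(transcript) else long_ms
    return silence_ms >= threshold
```

With these defaults, 250 ms of silence after "I need to cancel my order." triggers a reply, while the same silence after "My account number is and" does not. A single fixed timeout could not get both cases right.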

Measured, not promised

Every call in the dashboard shows a per-stage latency waterfall: transport, recognition, model time-to-first-token, synthesis. If your bring-your-own LLM is slow, you’ll see it — with data, not vibes.
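For readers who want to reason about a waterfall like this, the sketch below shows one way such a view could be derived from per-stage start and end offsets. The data shape, the function names, and the example numbers are all hypothetical (the numbers are illustrative, not measurements, and this is not the dashboard's export format). The stage names follow the four listed above, and offsets are in milliseconds from the end of the caller's speech.

```python
from typing import Dict, List, Tuple

Spans = Dict[str, Tuple[float, float]]  # stage -> (start_ms, end_ms)


def summarize(spans: Spans) -> Tuple[Dict[str, float], str, float]:
    """Return per-stage durations, the slowest stage, and the overall end time."""
    durations = {stage: end - start for stage, (start, end) in spans.items()}
    slowest = max(durations, key=durations.get)
    total = max(end for _, end in spans.values())
    return durations, slowest, total


def render_waterfall(spans: Spans, width: int = 40) -> List[str]:
    """Draw one text bar per stage, offset by when the stage started."""
    total = max(end for _, end in spans.values())
    lines = []
    for stage, (start, end) in spans.items():
        left = round(start / total * width)
        bar = max(1, round((end - start) / total * width))
        lines.append(f"{stage:<12}{' ' * left}{'#' * bar} {end - start:.0f}ms")
    return lines


# Illustrative numbers only. Stages may overlap because each one streams.
example: Spans = {
    "transport": (0, 60),
    "recognition": (40, 160),
    "model_ttft": (150, 520),
    "synthesis": (520, 640),
}
```

Because the stages overlap, the total is the latest end time rather than the sum of the durations. That is why a per-stage view is more informative than a single headline number.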

Frequently asked questions

What latency should a voice AI agent target?

In human conversation, the natural gap between turns is roughly 200–300 milliseconds. Anything beyond about one second feels like talking to a machine, and callers start interrupting or hanging up. Production voice agents should target under 800ms voice-to-voice, measured as the time from the caller finishing speaking to hearing the first syllable of the response.

How does Rayvoc keep latency low?

Rayvoc combines five techniques: streaming at every stage (no step waits for the previous one to finish), media servers co-located with model inference, support for native speech-to-speech models that skip the STT→LLM→TTS pipeline entirely, smart turn detection, and connection reuse on the telephony leg. Because Rayvoc also runs the telecom layer, there is no extra network hop to a third-party carrier API.

Does using my own LLM increase latency?

It depends on your model host’s time-to-first-token. Rayvoc streams to any OpenAI-compatible endpoint and reports per-stage latency on every call, so you can see exactly what your model contributes and compare providers with real data.
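If you want a rough number before you connect a model, time-to-first-token can be measured against any streaming response. The helper below is a minimal sketch: `time_to_first_token` is a hypothetical name, and it works on any Python iterable of chunks, so it makes no assumptions about a particular client library. Pass it the streaming iterator your own client returns.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, Optional[str], Iterator[str]]:
    """Measure how long the stream takes to produce its first chunk.

    Returns (seconds_waited, first_chunk, remaining_iterator). The first
    chunk is None if the stream ends without producing anything.
    """
    start = clock()
    iterator = iter(stream)
    first = next(iterator, None)  # blocks until the host sends a chunk
    elapsed = clock() - start
    return elapsed, first, iterator
```

The remaining iterator is returned so the rest of the reply can still be consumed after timing. Run it several times per provider, since time-to-first-token tends to vary from call to call.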

What is time-to-first-audio (TTFA)?

TTFA is the time from the end of the caller’s speech until the first audio of the agent’s reply is played back. It is the latency metric that actually matches what callers perceive. See our glossary entry for details.

Hear the difference a second makes

Every account starts with a 14-day free trial — 1 concurrent channel, a real phone number, and full platform access.