I’m building a real-time voice agent with LiveKit Agents, and I’m seeing unexpectedly high end-to-end latency.
I added precise timing logs **directly in my agent code** to measure each core step:
ASR transcription latency: ~100ms
LLM inference latency: ~500ms
TTS synthesis latency: ~200ms
**Total business logic processing time: < 1 second**
However, when testing end-to-end, the time from **when I finish speaking (end of user speech)** to **when I hear the first audio frame of the agent’s reply** is consistently **3-4 seconds**. There is a 2-3 second gap that I cannot explain with my own code.
What I’ve already checked
All three core modules (ASR/LLM/TTS) are fast, total processing <1s
Network ping between client and server is normal
No hardware latency issues with microphone/speaker
TTS is streaming audio chunks to LiveKit immediately as they are generated
Questions
Could you help identify where this extra latency is coming from in the LiveKit Agents pipeline? Specifically:
How much latency does LiveKit’s **turn detection (VAD)** add? Is there any buffering before the agent receives the end-of-speech event?
Does the WebRTC jitter buffer on the client side add significant latency for voice-only agents?
Are there any server-side buffering or queuing delays when sending audio from the agent to the client?
What configuration parameters (client or agent side) can I tune to minimize end-to-end latency for real-time voice interactions?
Typically I would suggest first using Agent Observability to understand which stages of your pipeline contribute to your end-to-end latency.
But since I can’t see a LiveKit account associated with your email, I assume you might be self-hosting LiveKit open source? (If that is the case, you might consider changing the category of your question to get more OSS expert eyes on it.)
Have you checked your endpointing settings? By default, they’re set to minDelay: 500 and maxDelay: 3000 (milliseconds). Once the STT transcript arrives, the system waits for either the minimum or the maximum delay (never a value in between), depending on your turn-handling config, and only then starts the LLM turn.
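To make the min-or-max behavior concrete, here is a minimal Python sketch of the wait described above. It is a simplified model of that description, not LiveKit's implementation: `endpointing_wait_ms`, `time_to_first_audio_ms`, and the `turn_likely_complete` flag are hypothetical names, and the defaults simply mirror the 500 and 3000 values quoted above.

```python
def endpointing_wait_ms(turn_likely_complete: bool,
                        min_delay_ms: int = 500,
                        max_delay_ms: int = 3000) -> int:
    """Return the wait applied after the final transcript (simplified model)."""
    # Assumption: the wait is exactly min or exactly max, never in between.
    return min_delay_ms if turn_likely_complete else max_delay_ms


def time_to_first_audio_ms(wait_ms: int, llm_ms: int, tts_ms: int) -> int:
    """Endpointing wait plus LLM and TTS time, measured from the final transcript."""
    return wait_ms + llm_ms + tts_ms


if __name__ == "__main__":
    for complete in (True, False):
        wait = endpointing_wait_ms(complete)
        print(complete, time_to_first_audio_ms(wait, llm_ms=500, tts_ms=200))
```

With the 500 ms LLM and 200 ms TTS figures from the original post, a turn that falls back to the maximum delay lands at 3700 ms under this model, which is inside the reported 3-4 second range.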
One thing that helped in my setup was making the endpointing dynamic with an alpha of 0.8. This allows the system to adjust the delays based on the caller’s speech patterns, lowering the maximum delay for faster speakers and increasing the minimum delay when longer pauses are common.
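For intuition only, the sketch below treats alpha as an exponential-moving-average weight over observed pause lengths. That interpretation is an assumption, as is every name here (`PauseTracker`, `observe`, `adapted_delay_ms`); check the source to see how LiveKit actually applies alpha.

```python
class PauseTracker:
    """Smooths observed pause lengths with an exponential moving average."""

    def __init__(self, initial_ms: float, alpha: float = 0.8) -> None:
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must be between 0 and 1")
        self.alpha = alpha
        self.estimate_ms = initial_ms

    def observe(self, pause_ms: float) -> float:
        # Assumption: alpha weights the previous estimate,
        # and (1 - alpha) weights the newly observed pause.
        self.estimate_ms = (self.alpha * self.estimate_ms
                            + (1.0 - self.alpha) * pause_ms)
        return self.estimate_ms

    def adapted_delay_ms(self, floor_ms: float, ceiling_ms: float) -> float:
        # Clamp the smoothed pause length between the configured bounds.
        return max(floor_ms, min(ceiling_ms, self.estimate_ms))
```

Under this reading, a high alpha such as 0.8 makes the estimate move slowly, so a single unusually long pause does not immediately stretch the delay.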
Hi Darryn, thanks so much for your quick and helpful reply!
You’re exactly right — I am self-hosting the open-source version of LiveKit Server and LiveKit Agents, so I don’t have access to the official hosted observability tools you linked. That’s a great suggestion to adjust the question category for better help!
Could you please advise which specific category I should re-categorize this question into, to get more visibility and assistance from OSS and self-hosting experts?
I found the alpha param in source code and applied all your tweaks (alpha=0.8, tuned min/max delay, dynamic mode) plus enabled preemptive generation, but only got ~200ms improvement. I’m still seeing ~2s extra latency outside my 1s business logic.
Any other hidden params, self-hosted server-side configs, or turn-taking/VAD source patches I’m missing? Thanks!
@zjh1378805302, the 200ms gain after lowering min/max and enabling dynamic mode suggests endpointing isn’t the whole story, but it’s worth confirming first. Add a log right when STT emits a final transcript and another when the LLM starts. That gap is your endpointing wait. With dynamic mode the system still anchors to min or max, not in between [Turn-taking tuning | LiveKit Documentation], so a turn that picks max can still eat the full max_delay window.
If that gap is your missing 2s, the cleaner move is switching turn_detection to the turn-detector model instead of raw VAD timers [same page].
If the gap is small, the 2s is hiding elsewhere. Most likely TTS: your 200ms is probably first-chunk synthesis time, not first-byte-on-the-wire. Log a third timestamp when the first audio chunk actually leaves the agent (right before the LiveKit publish call), and compare to when the client renders. That isolates network/jitter from agent.
The jitter buffer and WebRTC packetization usually add tens of ms, not seconds, so put the network last on the suspect list.
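A minimal way to capture those timestamps, using only the standard library, might look like the sketch below. `LatencyTrace` and the stage names are hypothetical; where you call `mark` depends on your own agent code, and nothing here is a LiveKit API.

```python
import time
from typing import Callable, Dict, List, Tuple


class LatencyTrace:
    """Records named timestamps for one turn and reports the gaps between them."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # A monotonic clock avoids jumps from wall-clock adjustments.
        self._clock = clock
        self._marks: List[Tuple[str, float]] = []

    def mark(self, stage: str) -> None:
        """Store the current time under a stage name."""
        self._marks.append((stage, self._clock()))

    def gaps_ms(self) -> Dict[str, float]:
        """Return the elapsed milliseconds between consecutive marks."""
        gaps: Dict[str, float] = {}
        for (prev_name, prev_t), (name, t) in zip(self._marks, self._marks[1:]):
            gaps[f"{prev_name} -> {name}"] = (t - prev_t) * 1000.0
        return gaps
```

In use, you would call `trace.mark("stt_final")` when the final transcript arrives, `trace.mark("llm_start")` when the LLM request begins, and `trace.mark("first_audio_out")` right before the first frame is handed to LiveKit, then log `trace.gaps_ms()` once per turn.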
@Muhammad Usman Bashir Thanks for the detailed breakdown! That makes perfect sense.
One catch on my end: I’m running fully self-hosted local ASR/LLM/TTS (not using LiveKit’s managed inference plugins) and don’t have access to LiveKit’s hosted observability tools.
Could you advise how to manually instrument logs to isolate this latency in my self-hosted setup? Specifically:
Where exactly in the agent code should I add log points to measure:
The gap between STT final transcript output and LLM starting (endpointing wait)
The gap between TTS finishing first chunk synthesis and audio being published to LiveKit
Are there any common hidden buffering points in self-hosted agent deployments I should check, outside of endpointing?
@zjh1378805302, you don’t need to instrument manually; the Agents framework already emits the metrics you’re asking about. Subscribe at the session level:
from livekit.agents import metrics, MetricsCollectedEvent

# `session` is the AgentSession created in your entrypoint.
@session.on("metrics_collected")
def _on_metrics(ev: MetricsCollectedEvent):
    metrics.log_metrics(ev.metrics)
The metric types map directly to your two questions. EOUMetrics carries end_of_utterance_delay (your "STT final → LLM start" gap, the endpointing wait), plus transcription_delay and on_user_turn_completed_delay. TTSMetrics carries ttfb (time to first byte, distinct from your “synthesis time”), plus duration and audio_duration. LLMMetrics (ttft) and STTMetrics round out the rest.
For self-hosted hidden buffering points outside endpointing: your STT plugin’s internal finalization timer (some streaming STTs hold the final transcript briefly after speech end before flushing), the TTS plugin’s first-chunk wait (provider-specific), and any custom asyncio queue between your TTS callback and the AudioSource.push_frame call. The metrics above will surface most of these without code changes.
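Once the metrics are logged, a small budget calculation shows how much of the measured end-to-end time they explain. The sketch uses plain numbers in seconds rather than the framework's metric objects. The field names `end_of_utterance_delay`, `ttft`, and `ttfb` come from the metric types above; `explained_latency` and `unexplained_gap` are hypothetical helpers, and treating the sum of those three fields as the agent-side latency is an approximation.

```python
def explained_latency(end_of_utterance_delay: float, ttft: float, ttfb: float) -> float:
    """Approximate agent-side latency: endpointing wait + LLM first token + TTS first byte."""
    return end_of_utterance_delay + ttft + ttfb


def unexplained_gap(end_to_end: float, explained: float) -> float:
    """Portion of the measured end-to-end latency not covered by the metrics."""
    return max(0.0, end_to_end - explained)


if __name__ == "__main__":
    # Illustrative values only, in seconds.
    explained = explained_latency(end_of_utterance_delay=0.5, ttft=0.25, ttfb=0.125)
    print(explained, unexplained_gap(end_to_end=3.0, explained=explained))
```

If the unexplained gap stays large, the remaining time is being spent somewhere these metrics do not observe, such as the buffering points listed above or the path between the agent and the client.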
@Muhammad Usman Bashir Thanks a lot for the metrics tip!
I’ve added the metrics hook and confirmed: all EOU/STT/LLM/TTS metrics add up to less than 1s total, exactly matching my own code measurements. But end-to-end latency is still consistently ~3s, so the extra 2s gap is definitely hiding in the audio transmission/publishing layer.
I found the queue_ms_size parameter for audio frame queuing in the SDK, and I suspect this buffering is where the latency is coming from. Could you advise:
Does queue_ms_size (or similar audio frame queuing parameters) add significant buffering latency for voice-only agents?
What value should I set this to for minimal latency in self-hosted deployments, and are there any other SDK/server-side audio transmission buffering configs I should tune?
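As a rough way to reason about the first question, the arithmetic below converts a queue size in milliseconds into frames and a worst-case added delay. It assumes a simple FIFO drained in real time and a fixed frame duration, which may not match how the SDK's queue actually behaves; all three functions are hypothetical helpers, not SDK calls.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of samples per channel in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


def frames_in_queue(queue_ms: int, frame_ms: int) -> int:
    """How many whole frames fit in a queue of the given length."""
    return queue_ms // frame_ms


def worst_case_queue_delay_ms(queued_frames: int, frame_ms: int) -> int:
    """A frame pushed behind `queued_frames` others waits for all of them to play."""
    return queued_frames * frame_ms


if __name__ == "__main__":
    frames = frames_in_queue(queue_ms=1000, frame_ms=10)
    print(samples_per_frame(48000, 10), frames, worst_case_queue_delay_ms(frames, 10))
```

Under these assumptions the queue only adds delay when audio is already waiting in it, so its capacity is an upper bound on added latency rather than a fixed cost on every frame.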