We’re seeing ~325ms one-way audio latency (Twilio → Agent)
with both sides in US/Ashburn. Both Twilio and our own RTP
analysis show clean network metrics — the delay appears to
be inside LiveKit’s SIP-to-WebRTC pipeline.
Questions:
1. Jitter buffer configuration — what is the buffer size/mode for SIP inbound trunks? A conservative buffer is the most likely cause of the bulk of this delay.
2. Internal pipeline latency breakdown — the time from “RTP received at SIP GW” to “audio delivered to agent participant” for our test call.
3. Tunable parameters — can we reduce jitter buffer size or adjust any SIP trunk settings for lower latency?
Jitter buffer configuration is not exposed publicly. How are you measuring when the audio arrives at the agent? Could some of the delay be attributable to operations on your side? In other words, what are the start and end points of your measurement? It’s not clear from the above.
Thanks for the follow-up. Here’s our measurement methodology:
Measurement Method
Record the same call from both sides:
Twilio: call recording
LiveKit: Agent Insights recording
Open both in Audacity and align the caller’s speech
waveform as the reference point
Measure turn-taking latency on each:
A = caller speech end → agent response start (Twilio recording)
B = caller speech end → agent response start (LiveKit recording)
Difference: A - B = 325ms
Since both recordings include the same agent processing
time (STT + LLM + TTS), subtracting cancels it out.
The 325ms represents the round-trip transport delay
between Twilio and LiveKit (not one-way).
That’s ~162ms one-way, which is still significant for
a same-region path (Twilio Ashburn ↔ LiveKit Ashburn).
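Expressed as a calculation (the interval values here are placeholders; only the difference matters):

```python
# Turn-taking latency, "caller speech end -> agent response start", measured
# in each recording after aligning both on the caller's waveform.
a_twilio_ms = 1450.0    # placeholder value: interval seen in the Twilio recording
b_livekit_ms = 1125.0   # placeholder value: interval seen in the LiveKit recording

# Agent processing (STT + LLM + TTS) is present in both, so it cancels out:
round_trip_ms = a_twilio_ms - b_livekit_ms    # ~325 ms on our test call
one_way_ms = round_trip_ms / 2                # ~162 ms, assuming a symmetric path

print(f"transport round-trip {round_trip_ms:.0f} ms, one-way ~{one_way_ms:.0f} ms")
```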
What this rules out
Agent processing time (cancelled by subtraction)
PSTN/carrier delay (not in LiveKit recording)
Client-side factors (server-side recordings only)
What this includes
Twilio → LiveKit SIP GW: RTP transport + jitter buffer
LiveKit SIP GW → Room → Agent: internal pipeline
Agent → Room → SIP GW → Twilio: return path
Any transcoding or media processing in between
For comparison: Twilio-side round-trip
We performed the same measurement for the Twilio side
(PSTN ↔ Twilio edge):
Twilio RTP RTT: ~105ms (US caller ↔ Taiwan)
Twilio RTP RTT: <10ms (US caller ↔ US)
For our US↔US test call, Twilio’s contribution is <10ms
of the 325ms round-trip. That leaves ~315ms attributable
to the LiveKit SIP pipeline (~157ms one-way).
Could you help us understand what’s contributing to this
within LiveKit’s SIP processing path? Jitter buffer
sizing would be a good starting point.
Room: SCL_KQBpFvPTAUbM
Time: 2026-02-05 12:38:50 UTC
Jitter buffer (media-sdk/jitter, 60ms max) is used in
two places: MediaPort (SIP inbound) and Room (track handler),
both controlled by enable_jitter_buffer config.
Even if enabled, 60ms × 2 = 120ms round-trip, which only
accounts for ~1/3 of our measured 325ms.
We also see PCMU → Opus transcoding (room.go:324) and
resampling in the pipeline. Could these account for the
remaining ~200ms?
Questions:
Is enable_jitter_buffer enabled on LiveKit Cloud?
What is the Room’s internal sample rate and mixer latency?
Is there a way to bypass Opus transcoding for SIP-to-SIP
or SIP-to-Agent paths?
Thank you for confirming. That helps narrow it down.
With jitter buffer disabled, the remaining processing
pipeline is:
PCMU/8kHz → resample to 48kHz → Opus encode → Room →
Opus decode → Agent → (return path) → PCMU/8kHz
We estimate this at ~80-130ms round-trip, but we’re
measuring ~325ms. That leaves ~200ms unaccounted for.
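As a sanity check on that estimate, the algorithmic floor of the codec path alone is much smaller (assumed values: 20 ms Opus frames, the default ~6.5 ms Opus encoder lookahead, and a near-zero resampler delay):

```python
# Back-of-envelope floor of the codec path, per direction (assumed values).
opus_frame_ms = 20.0       # one full frame is buffered before encoding
opus_lookahead_ms = 6.5    # default Opus encoder lookahead
resample_ms = 1.0          # 8 kHz <-> 48 kHz resampling, assumed negligible

per_direction_ms = opus_frame_ms + opus_lookahead_ms + resample_ms   # ~27.5 ms
round_trip_floor_ms = 2 * per_direction_ms                           # ~55 ms

# Our 80-130 ms estimate adds packetization, decode, and room routing on top;
# either way it is nowhere near the ~325 ms we measure.
print(round_trip_floor_ms)
```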
Could you check the internal logs/traces for room
SCL_KQBpFvPTAUbM (2026-02-05 12:38:50 UTC) to see:
What is the actual measured latency from “RTP packet
received at SIP MediaPort” to “audio sample delivered
to agent participant”? Your SIP code logs RoomStats
(gaps, delayed packets, etc.) — any anomalies for
this call?
Is there additional buffering we’re not seeing?
For example, does the Opus encoder accumulate multiple
frames before sending, or does the mixer introduce
frame-level delays?
The Opus encoder frame size — is it 20ms or larger?
At 48kHz with 20ms frames that’s 960 samples per
frame, but if the encoder waits for larger frames
(40ms, 60ms) that adds directly to latency.
We’ve exhausted what we can measure from outside.
The remaining ~200ms must be somewhere in the internal
pipeline, and only your server-side instrumentation
can pinpoint it.
In addition, I wonder whether the recording comparison is accurate enough.
Where in the pipeline does Agent Insights recording
capture audio — at the SIP GW inbound, after Opus
transcoding, or at the agent participant level?
Twilio recording: captured at RTP reception on their
media gateway (post-transmission)
Agent Insights recording: captured by the agent itself
(end of pipeline)
So the 325ms round-trip we measured sits almost entirely
within LiveKit’s SIP→Room→Agent pipeline. Network transport
is <10ms (confirmed by both sides). Twilio has closed their
investigation with clean metrics.
Could you profile the internal pipeline for a test call?
We want to understand how much time is spent in resample,
Opus encode/decode, and room routing.
Room: SCL_KQBpFvPTAUbM
Time: 2026-02-05 12:38:50 UTC
Have you already looked at the PCAP? The RTP frames are there. How does it compare to the one from the Twilio side? I would expect the timestamps to be helpful for seeing where the lag builds up.
You asked about a jitter buffer. I took a look and the JitterBuffer was disabled for the call you referenced.
One thing you can try is to set up two clients that just join a room and ping-pong an audio signal, measure the delay, and see whether it is similar to what you are seeing. You could also measure a phone participant in a room if you wanted. Either way, it is worth measuring yourself so you can have some confidence in the number.
I would expect the latency to be quite low.
For the agent framework, the source code is available, so you should be able to see any delays that may be affecting your agent. Have you already looked at the metrics? To me that is probably a much better way to get at the data you are after than trying to correlate two clocks and two captures that cannot necessarily be aligned.
You asked about how the room works and the codec used. Here is the code for that, if you want to take a look.
You can find the SIP implementation here if you would like to review the code.
I will see if I can identify any issues in our systems, but I doubt we have the data you are looking for.
These are the SIP/WebRTC stats for the call you reference:
Thank you for the suggestions — we followed your advice and set up a ping-pong audio test to measure pure room round-trip latency. Here’s what we found:
Test Setup
We created two participants in the same room: one sends a 1kHz tone burst, the other immediately echoes it back. We measure the time from send to echo detection. No STT/LLM/TTS involved — pure audio echo.
Here’s the agent code we deployed for the in-region test: room_pingpong_agent.py (9.2 KB)
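In case the attachment is inconvenient, the core of the test is roughly this (a condensed sketch, assuming the livekit-rtc Python API for AudioSource/AudioStream and a room each participant has already connected to; the deployed agent adds tone detection details, retries, and logging on top):

```python
import asyncio, math, time
import numpy as np
from livekit import rtc

SAMPLE_RATE, CHANNELS = 48000, 1
FRAME_MS = 20
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000        # 960 samples per frame
QUEUE_SIZE_MS = 10                                    # we swept this: default 1000 vs 10

def tone_frame(freq_hz: int = 1000, amplitude: int = 12000) -> rtc.AudioFrame:
    """One 20 ms frame of a 1 kHz sine tone as int16 PCM."""
    t = np.arange(FRAME_SAMPLES) / SAMPLE_RATE
    pcm = (amplitude * np.sin(2 * math.pi * freq_hz * t)).astype(np.int16)
    frame = rtc.AudioFrame.create(SAMPLE_RATE, CHANNELS, FRAME_SAMPLES)
    np.frombuffer(frame.data, dtype=np.int16)[:] = pcm
    return frame

def is_tone(frame: rtc.AudioFrame, threshold: int = 2000) -> bool:
    """Crude energy detector: any sufficiently loud frame counts as the tone."""
    return np.abs(np.frombuffer(frame.data, dtype=np.int16)).mean() > threshold

async def run_echo(room: rtc.Room) -> None:
    """Echo participant: re-captures every received frame onto its own track."""
    source = rtc.AudioSource(SAMPLE_RATE, CHANNELS, queue_size_ms=QUEUE_SIZE_MS)
    track = rtc.LocalAudioTrack.create_audio_track("echo", source)
    await room.local_participant.publish_track(track, rtc.TrackPublishOptions())

    async def pump(remote_track: rtc.Track) -> None:
        async for ev in rtc.AudioStream(remote_track):
            await source.capture_frame(ev.frame)

    room.on("track_subscribed",
            lambda track, pub, part: asyncio.create_task(pump(track)))

async def run_ping(room: rtc.Room) -> None:
    """Ping participant: primes with silence, sends a 1 kHz burst, waits for the echo."""
    source = rtc.AudioSource(SAMPLE_RATE, CHANNELS, queue_size_ms=QUEUE_SIZE_MS)
    track = rtc.LocalAudioTrack.create_audio_track("ping", source)
    await room.local_participant.publish_track(track, rtc.TrackPublishOptions())

    echo_seen = asyncio.Event()

    async def listen(remote_track: rtc.Track) -> None:
        async for ev in rtc.AudioStream(remote_track):
            if is_tone(ev.frame):
                echo_seen.set()

    room.on("track_subscribed",
            lambda track, pub, part: asyncio.create_task(listen(track)))

    silence = rtc.AudioFrame.create(SAMPLE_RATE, CHANNELS, FRAME_SAMPLES)
    for _ in range(25):                      # lead-in silence before the burst
        await source.capture_frame(silence)

    sent_at = time.monotonic()               # recorded at capture time (see below)
    for _ in range(5):                       # 100 ms tone burst
        await source.capture_frame(tone_frame())

    await echo_seen.wait()
    print(f"RTT ~ {(time.monotonic() - sent_at) * 1000:.0f} ms")
```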
AudioSource(queue_size_ms=…) has a huge impact. The default value of 1000ms added ~400ms to the round-trip. Reducing it to 10ms cut RTT from 803ms → 392ms in-region.
Even with queue_size_ms=10, in-region RTT is still ~392ms. Both participants are deployed in the same US-East region as a LiveKit Cloud Agent, so network latency should be near-zero. Yet we still see
~390ms round-trip for a simple audio echo.
The remaining ~390ms appears to be a floor from the underlying audio pipeline — Opus encode/decode, the native FFI layer, and SFU processing. We can’t reduce it further from the Python SDK side.
We expected in-region room latency to be “quite low” as you mentioned. Is ~390ms RTT expected for a simple audio echo between two participants in the same region?
If not, could there be additional buffering in the native livekit-ffi layer or the SFU mixer that we should look into? We noticed 34 mixer restarts in the call stats you shared earlier, which also seemed
unusually high.
The test code is straightforward — the agent deploys to LiveKit Cloud (US-East), joins the room as an echo participant, and spawns a second ping participant in the same room. Happy to share the full source
if helpful.
I am asking other team members for advice, but I think the problem is that `sent_at` is recorded at push time, while the tone samples sit in the buffer behind ~24,000 samples of silence. A 10ms timer pops only 480 samples per tick, so the tone won’t be delivered via WebRTC until the silence ahead of it drains.
The echo side contributes less because frames arrive from WebRTC at a real-time rate (the jitter buffer paces them), so the echo’s buffer stays relatively shallow regardless of queue_size_ms. The dominant source is the ping side’s burst push.
queue_size_ms controls a FIFO with a fixed-rate 10ms drain. The default 1000ms lets you push ~1 second of audio with no backpressure, which means capture_frame() returns instantly, but the audio doesn’t actually reach WebRTC until the timer drains through the queue. Your 25 silence frames pile up ~500ms deep, and the tone sits behind them. Setting it to 10ms forces frame-by-frame pacing, keeping the buffer shallow (~10-20ms) so audio reaches WebRTC almost immediately after capture.
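Roughly, the arithmetic with your numbers (20 ms frames, 48 kHz mono):

```python
SAMPLE_RATE = 48_000                 # Hz, mono
DRAIN_TICK_MS = 10                   # the AudioSource drain timer pops every 10 ms
SAMPLES_PER_TICK = SAMPLE_RATE * DRAIN_TICK_MS // 1000          # 480 samples per tick

silence_frames, frame_ms = 25, 20
queued_silence = silence_frames * frame_ms * SAMPLE_RATE // 1000  # 24,000 samples

# With queue_size_ms=1000 there is no backpressure: all 25 silence frames are
# accepted instantly and the tone queues up behind them.
ticks_to_drain = queued_silence / SAMPLES_PER_TICK               # 50 ticks
extra_delay_ms = ticks_to_drain * DRAIN_TICK_MS                  # ~500 ms

# With queue_size_ms=10, capture_frame() awaits until the queue has room, so the
# buffer never grows past ~10-20 ms and the tone leaves shortly after capture.
print(extra_delay_ms)   # 500.0
```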
Thanks, that explanation is very clear. Confirmed — reducing queue_size_ms to 10 matches the behavior you described, and your point about the echo side
staying shallow (jitter buffer paces incoming frames) explains the asymmetry we observed.
One correction to our earlier numbers: our sent_at is recorded at capture_frame() call time, not actual WebRTC send time. So part of the measured RTT
improvement from reducing queue_size_ms is a measurement artifact — the queued silence was inflating our timing.
Two additional optimizations we found on the SDK side (a rough sketch of both follows the list):
DTX=False + continuous silence — keeps the receiving jitter buffer converged. Audio RTT dropped from 435ms → 245ms (Taiwan → US-East, pure client
test).
Dedicated audio thread — in production agents, capture_frame() competes with STT/LLM/TTS on the same event loop, causing irregular push timing →
underruns → jitter buffer inflation. Moving it to a dedicated thread: 504ms → 286ms in-region, nearly matching our pure client baseline (275ms).
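Roughly what the two changes look like on our side (a sketch against the livekit-rtc Python API; the dedicated-thread wiring and names are our own, and the TrackPublishOptions field should be double-checked against the SDK version in use):

```python
import asyncio
import queue
import threading
from livekit import rtc

SAMPLE_RATE, CHANNELS = 48000, 1
FRAME_SAMPLES = SAMPLE_RATE // 100                     # 480 samples = 10 ms frames

async def publish_without_dtx(room: rtc.Room) -> rtc.AudioSource:
    """Publish with DTX disabled so silence is still encoded and the receiving
    jitter buffer stays converged."""
    source = rtc.AudioSource(SAMPLE_RATE, CHANNELS, queue_size_ms=10)
    track = rtc.LocalAudioTrack.create_audio_track("agent-audio", source)
    await room.local_participant.publish_track(track, rtc.TrackPublishOptions(dtx=False))
    return source

def start_audio_thread(source: rtc.AudioSource, frames: "queue.Queue[rtc.AudioFrame]") -> None:
    """Run the capture loop on its own event loop in a dedicated thread, so
    STT/LLM/TTS work on the main loop cannot disturb the push cadence."""
    def _run() -> None:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)

        async def pump() -> None:
            silence = rtc.AudioFrame.create(SAMPLE_RATE, CHANNELS, FRAME_SAMPLES)
            while True:
                try:
                    frame = frames.get_nowait()        # real audio, if any is pending
                except queue.Empty:
                    frame = silence                    # otherwise keep the stream continuous
                # With queue_size_ms=10, capture_frame() itself provides the
                # real-time pacing via backpressure.
                await source.capture_frame(frame)

        loop.run_until_complete(pump())

    threading.Thread(target=_run, name="audio-capture", daemon=True).start()
```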
Remaining questions:
Jitter buffer floor: With queue_size_ms=10, DTX off, and in-region deployment, data channel RTT is ~7ms but audio RTT is still ~280ms. The
~137ms/direction gap appears to be libwebrtc’s adaptive jitter buffer. min_playout_delay=0, max_playout_delay=100 had no effect. Is this expected? Any way
to reduce it?
queue_size_ms=0 panics on livekit-ffi v0.12.42. Should we file a bug?
34 mixer restarts (call SCL_KQBpFvPTAUbM) — expected, or worth investigating?
RoomIO queue_size_ms=200 hardcoded in _output.py — plans to make configurable?
It would be good to get a webrtc-internals dump; you can use a tool like https://fippo.github.io/webrtc-dump-importer/ to look at the data. That would show the jitter buffer state.
Questions:
1. …public API to override NetEq’s target…
I am not aware of a public API to adjust this.
2. queue_size_ms=0 panics on livekit-ffi v0.12.42. Should we file a bug?
No need to file an issue. I think that is already fixed on main. What version of RTC are you currently using?
Could you upgrade to rtc ≥ v1.0.25 and re-test queue_size_ms=0? If it still panics on a current version, that’s definitely a bug to file. You’ll need to ensure your frames are exactly 10ms (480 samples at 48kHz), since the fast path enforces this strictly.
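Something like this, for example (a rough sketch; the point is the hard 10 ms frame requirement):

```python
from livekit import rtc

SAMPLE_RATE, CHANNELS = 48000, 1
SAMPLES_10MS = SAMPLE_RATE // 100                 # 480 samples = exactly 10 ms

# queue_size_ms=0 enables the zero-queue fast path (rtc >= 1.0.25).
source = rtc.AudioSource(SAMPLE_RATE, CHANNELS, queue_size_ms=0)

async def push_10ms(pcm_bytes: bytes) -> None:
    # pcm_bytes must be exactly 480 int16 samples (960 bytes) of 48 kHz mono;
    # the fast path rejects any other frame size.
    frame = rtc.AudioFrame(
        data=pcm_bytes,
        sample_rate=SAMPLE_RATE,
        num_channels=CHANNELS,
        samples_per_channel=SAMPLES_10MS,
    )
    await source.capture_frame(frame)
```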
RoomIO queue_size_ms=200 hardcoded in _output.py — plans to make configurable?
I am not aware of a plan to make that configurable.