Subject: 325ms one-way audio latency in SIP inbound trunk — need internal pipeline diagnostics

Hi LiveKit team,

We’re seeing ~325ms one-way audio latency (Twilio → Agent)
with both sides in US/Ashburn. Both Twilio and our own RTP
analysis show clean network metrics — the delay appears to
be inside LiveKit’s SIP-to-WebRTC pipeline.

Our Setup

  • Architecture: Twilio (PSTN, Ashburn/us1) → SIP/UDP →
    LiveKit SIP Inbound Trunk → WebRTC → Voice Agent (Python)
  • TwiML: sip:{room_name}@{project_id}.sip.livekit.cloud;transport=udp
  • Transport: UDP, Digest Auth
  • Codec: PCMU/8000 (both sides, no transcoding)
  • LiveKit SDKs: livekit 1.0.17, livekit-api 1.0.7

End-to-End Latency (Recording Comparison)

  ┌──────────────────────┬─────────────────┐
  │      Condition       │ One-way Latency │
  ├──────────────────────┼─────────────────┤
  │ Agent in JP, SIP TCP │ ~583ms          │
  │ Agent in JP, SIP UDP │ ~457ms          │
  │ Agent in US, SIP UDP │ ~325ms          │
  └──────────────────────┴─────────────────┘

TCP → UDP saved ~126ms. JP → US agent saved ~132ms.
With everything in US, 325ms still persists.

Twilio-Side Analysis (Clean)

  ┌─────────────┬───────────────┐
  │   Metric    │     Value     │
  ├─────────────┼───────────────┤
  │ Codec       │ PCMU          │
  │ Packet Loss │ none          │
  │ Jitter      │ none          │
  │ Edge        │ Ashburn (us1) │
  └─────────────┴───────────────┘

Twilio confirmed clean metrics on their side.

LiveKit-Side RTP Analysis (Our PCAP: SCL_KQBpFvPTAUbM)

We analyzed the PCAP from LiveKit SIP Gateway (10.34.4.75):

  ┌──────────────┬─────────────────────┬──────────────────────┐
  │    Metric    │ Inbound (Twilio→GW) │ Outbound (GW→Twilio) │
  ├──────────────┼─────────────────────┼──────────────────────┤
  │ Packets      │ 2,644               │ 2,743                │
  │ Packet Loss  │ 0                   │ 0                    │
  │ Avg Jitter   │ 0.13ms              │ 0.15ms               │
  │ P95 Jitter   │ 0.57ms              │ 0.56ms               │
  │ Avg Interval │ 20.01ms             │ 20.01ms              │
  │ Codec        │ PCMU/8000           │ PCMU/8000            │
  └──────────────┴─────────────────────┴──────────────────────┘

RTP at the SIP Gateway is perfect. The SIP endpoint
(161.115.179.133) resolves to Oracle Cloud, Ashburn —
same city as Twilio’s Ashburn edge.

Our Conclusion

Network transport is clean on both sides. The 325ms
one-way delay is occurring inside LiveKit’s processing
pipeline:

  Twilio ──→ [SIP GW] ──→ ??? 325ms ??? ──→ [Room/Agent]
  (Ashburn)  (Ashburn)      (internal)       (Ashburn)

What We Need From LiveKit

  1. Jitter buffer configuration — what is the buffer
    size/mode for SIP inbound trunks? A conservative buffer
    is the most likely cause of the bulk of this delay.

  2. Internal pipeline latency breakdown — time from
    “RTP received at SIP GW” to “audio delivered to agent
    participant” for our test call.

  3. Tunable parameters — can we reduce jitter buffer
    size or adjust any SIP trunk settings for lower latency?

Test Call Reference

  • Room: SCL_KQBpFvPTAUbM
  • Time: 2026-02-05 12:38:50 UTC
  • Twilio Call SID: CA0107aab2fa95341da57c75337396c74c
  • LiveKit project: 67ikv72cpml.livekit.cloud

We can provide both Twilio and LiveKit PCAPs if needed.

Jitter buffer configuration is not exposed publicly. How are you measuring when the audio arrives at the agent? Could some of the delay be attributable to agent-side operations? In other words, what are the start and end points for your measurement? It’s not clear from the above.

Thanks for the follow-up. Here’s our measurement methodology:

Measurement Method

  1. Record the same call from both sides:

    • Twilio: call recording
    • LiveKit: Agent Insights recording
  2. Open both in Audacity and align the caller’s speech
    waveform as the reference point

  3. Measure turn-taking latency on each:

    • A = caller speech end → agent response start (Twilio recording)
    • B = caller speech end → agent response start (LiveKit recording)
  4. Difference: A - B = 325ms

Since both recordings include the same agent processing
time (STT + LLM + TTS), subtracting cancels it out.
The 325ms represents the round-trip transport delay
between Twilio and LiveKit (not one-way).

That’s ~162ms one-way, which is still significant for
a same-region path (Twilio Ashburn ↔ LiveKit Ashburn).
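The cancellation can be made explicit with a worked example. Only the 325ms difference comes from our measurements; the agent-processing figure below is made up for illustration:

```python
# Worked version of the subtraction above. Agent processing (STT + LLM + TTS)
# appears in both recordings, so it cancels out of A - B.
agent_processing_ms = 1500                    # illustrative, not measured
transport_rtt_ms = 325                        # the quantity we want to isolate

A = agent_processing_ms + transport_rtt_ms    # gap seen in the Twilio recording
B = agent_processing_ms                       # gap seen in the LiveKit recording

round_trip_ms = A - B                         # processing cancels -> 325
one_way_ms = round_trip_ms / 2                # ~162.5 ms
```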

What this rules out

  • Agent processing time (cancelled by subtraction)
  • PSTN/carrier delay (not in LiveKit recording)
  • Client-side factors (server-side recordings only)

What this includes

  • Twilio → LiveKit SIP GW: RTP transport + jitter buffer
  • LiveKit SIP GW → Room → Agent: internal pipeline
  • Agent → Room → SIP GW → Twilio: return path
  • Any transcoding or media processing in between

For comparison: Twilio-side round-trip

We performed the same measurement for the Twilio side
(PSTN ↔ Twilio edge):

  • Twilio RTP RTT: ~105ms (US caller ↔ Taiwan)
  • Twilio RTP RTT: <10ms (US caller ↔ US)

For our US↔US test call, Twilio’s contribution is <10ms
of the 325ms round-trip. That leaves ~315ms attributable
to the LiveKit SIP pipeline (~157ms one-way).

Could you help us understand what’s contributing to this
within LiveKit’s SIP processing path? Jitter buffer
sizing would be a good starting point.

Room: SCL_KQBpFvPTAUbM
Time: 2026-02-05 12:38:50 UTC

Here is the information from Twilio:

We reviewed your open-source SIP code and found:

  1. Jitter buffer (media-sdk/jitter, 60ms max) is used in
    two places: MediaPort (SIP inbound) and Room (track handler),
    both controlled by enable_jitter_buffer config.

  2. Even if enabled, 60ms × 2 = 120ms round-trip, which only
    accounts for ~1/3 of our measured 325ms.

  3. We also see PCMU → Opus transcoding (room.go:324) and
    resampling in the pipeline. Could these account for the
    remaining ~200ms?

Questions:

  • Is enable_jitter_buffer enabled on LiveKit Cloud?
  • What is the Room’s internal sample rate and mixer latency?
  • Is there a way to bypass Opus transcoding for SIP-to-SIP
    or SIP-to-Agent paths?

Is enable_jitter_buffer enabled on LiveKit Cloud?

For the call you provided, SCL_KQBpFvPTAUbM, I see in the cloud logs that jitter buffer was not enabled.

What is the Room’s internal sample rate and mixer latency?

I can’t find it in the documentation, but that should be 48kHz, I’m not sure about the mixer latency.

Is there a way to bypass Opus transcoding for SIP-to-SIP
or SIP-to-Agent paths?

No, again I can’t find it stated explicitly, but audio tracks are processed through Opus, so transcoding is inevitable.

Thank you for confirming. That helps narrow it down.

With jitter buffer disabled, the remaining processing
pipeline is:

PCMU/8kHz → resample to 48kHz → Opus encode → Room →
Opus decode → Agent → (return path) → PCMU/8kHz

We estimate this at ~80-130ms round-trip, but we’re
measuring ~325ms. That leaves ~200ms unaccounted for.
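For reference, a rough budget behind our ~80-130ms round-trip estimate. Every per-stage number below is our own guess, not a confirmed LiveKit figure:

```python
# Back-of-envelope one-way budget for the pipeline sketched above.
# All stage figures are assumptions on our part.
stage_ms = {
    "resample 8kHz -> 48kHz": 1,
    "Opus encode (20ms frame + lookahead)": 25,
    "room/SFU forwarding": 5,
    "Opus decode": 2,
    "minimal playout buffering": 20,
}
one_way_ms = sum(stage_ms.values())   # ~53 ms
round_trip_ms = 2 * one_way_ms        # ~106 ms, inside our 80-130 ms estimate
```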

Could you check the internal logs/traces for room
SCL_KQBpFvPTAUbM (2026-02-05 12:38:50 UTC) to see:

  1. What is the actual measured latency from “RTP packet
    received at SIP MediaPort” to “audio sample delivered
    to agent participant”? Your SIP code logs RoomStats
    (gaps, delayed packets, etc.) — any anomalies for
    this call?

  2. Is there additional buffering we’re not seeing?
    For example, does the Opus encoder accumulate multiple
    frames before sending, or does the mixer introduce
    frame-level delays?

  3. The Opus encoder frame size — is it 20ms or larger?
    At 48kHz with 20ms frames that’s 960 samples per
    frame, but if the encoder waits for larger frames
    (40ms, 60ms) that adds directly to latency.
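The frame-size arithmetic in question 3, for the common Opus frame durations:

```python
# Samples per Opus frame at 48 kHz. The frame duration is a hard lower bound
# on encoder latency: a full frame must accumulate before encoding can start.
SAMPLE_RATE = 48_000
samples_per_frame = {ms: SAMPLE_RATE * ms // 1000 for ms in (20, 40, 60)}
# 20ms -> 960 samples, 40ms -> 1920, 60ms -> 2880
```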

We’ve exhausted what we can measure from outside.
The remaining ~200ms must be somewhere in the internal
pipeline, and only your server-side instrumentation
can pinpoint it.

In addition, we wonder whether the recording comparison is accurate enough.
Where in the pipeline does the Agent Insights recording capture audio — at
the SIP GW inbound, after Opus transcoding, or at the agent participant level?

Agent Insights recordings are recorded by the agent itself.

Hi LiveKit team,

Quick update on the recording capture points:

  • Twilio recording: captured at RTP reception on their
    media gateway (post-transmission)
  • Agent Insights recording: captured by the agent itself
    (end of pipeline)

So the 325ms round-trip we measured sits almost entirely
within LiveKit’s SIP→Room→Agent pipeline. Network transport
is <10ms (confirmed by both sides). Twilio has closed their
investigation with clean metrics.

Could you profile the internal pipeline for a test call?
We want to understand how much time is spent in resample,
Opus encode/decode, and room routing.

Room: SCL_KQBpFvPTAUbM
Time: 2026-02-05 12:38:50 UTC

Have you already looked at the PCAP? The RTP frames are there. How does that compare to the one from the Twilio side? I would expect the timestamps to be helpful for seeing the delay between the two.

You asked about a jitter buffer. I took a look and the JitterBuffer was disabled for the call you referenced.

One thing you can try is to set up two clients that just open a room and ping-pong an audio signal, to measure the delay and see if it is similar to what you see. You could also measure a phone in a room if you wanted. It should be quite fast, but measuring it yourself will give you some confidence in the numbers.

I would expect the latency to be quite low.

For the agent framework, the source code is available, so you should be able to see any delays that may be affecting your agent. Have you already looked at metrics? To me this is probably a much better way to approach the data you are after than trying to correlate two clocks and two captures that can’t necessarily be aligned.

You asked about how the room works and the codec used. Here is the code for that, if you want to take a look.

You can find the SIP implementation here if you would like to review the code.

I will see if I can identify any issues in the systems, but I doubt we have the data you are looking for.

These are the SIP/WebRTC stats for the call you reference:

Hi Team,

Thank you for the suggestions — we followed your advice and set up a ping-pong audio test to measure pure room round-trip latency. Here’s what we found:

Test Setup

We created two participants in the same room: one sends a 1kHz tone burst, the other immediately echoes it back. We measure the time from send to echo detection. No STT/LLM/TTS involved — pure audio echo.

Here’s the agent code we deployed for the in-region test:
room_pingpong_agent.py (9.2 KB)
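The core of the test logic, stripped of the LiveKit room plumbing (the frame size, amplitude, and detection threshold are our choices; `is_tone` stands in for the echo detector in the deployed agent):

```python
# 1 kHz tone-burst generation and energy-based detection, as used in the
# ping-pong test (simplified sketch; LiveKit plumbing omitted).
import math
import struct

SAMPLE_RATE = 48000          # room audio runs at 48 kHz
SAMPLES_PER_FRAME = 480      # one 10 ms frame
TONE_HZ = 1000
AMPLITUDE = 16000            # int16 amplitude, well below clipping

def make_tone_frame(frame_index: int = 0) -> bytes:
    """One 10 ms frame of a 1 kHz sine as little-endian int16 PCM."""
    base = frame_index * SAMPLES_PER_FRAME
    samples = [
        int(AMPLITUDE * math.sin(2 * math.pi * TONE_HZ * (base + i) / SAMPLE_RATE))
        for i in range(SAMPLES_PER_FRAME)
    ]
    return struct.pack("<%dh" % SAMPLES_PER_FRAME, *samples)

def frame_energy(pcm: bytes) -> float:
    """Mean absolute sample value -- a cheap tone-vs-silence discriminator."""
    n = len(pcm) // 2
    return sum(abs(s) for s in struct.unpack("<%dh" % n, pcm)) / n

def is_tone(pcm: bytes, threshold: float = 1000.0) -> bool:
    return frame_energy(pcm) > threshold

silence_frame = bytes(2 * SAMPLES_PER_FRAME)
```

The ping side records a timestamp when it pushes the first tone frame; the echo side republishes any frame where `is_tone` fires.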

We ran the test in three configurations:

  ┌──────────────────────┬─────────────────────────────────────────┬───────────────────────────┬─────────┬───────┬───────┐
  │         Test         │                Location                 │ AudioSource queue_size_ms │ Avg RTT │  Min  │  Max  │
  ├──────────────────────┼─────────────────────────────────────────┼───────────────────────────┼─────────┼───────┼───────┤
  │ Agent (default)      │ US-East in-region (LiveKit Cloud Agent) │ 1000 (default)            │ 803ms   │ 733ms │ 892ms │
  ├──────────────────────┼─────────────────────────────────────────┼───────────────────────────┼─────────┼───────┼───────┤
  │ Pure rtc.Room client │ Taiwan → US-East                        │ 50                        │ 435ms   │ 252ms │ 532ms │
  ├──────────────────────┼─────────────────────────────────────────┼───────────────────────────┼─────────┼───────┼───────┤
  │ Agent (optimized)    │ US-East in-region (LiveKit Cloud Agent) │ 10                        │ 392ms   │ 314ms │ 474ms │
  └──────────────────────┴─────────────────────────────────────────┴───────────────────────────┴─────────┴───────┴───────┘

Key Findings

  1. AudioSource(queue_size_ms=…) has a huge impact. The default value of 1000ms added ~400ms to the round-trip. Reducing it to 10ms cut RTT from 803ms → 392ms in-region.
  2. Even with queue_size_ms=10, in-region RTT is still ~392ms. Both participants are deployed in the same US-East region as a LiveKit Cloud Agent, so network latency should be near-zero. Yet we still see
    ~390ms round-trip for a simple audio echo.
  3. The remaining ~390ms appears to be a floor from the underlying audio pipeline — Opus encode/decode, the native FFI layer, and SFU processing. We can’t reduce it further from the Python SDK side.

We expected in-region room latency to be “quite low” as you mentioned. Is ~390ms RTT expected for a simple audio echo between two participants in the same region?

If not, could there be additional buffering in the native livekit-ffi layer or the SFU mixer that we should look into? We noticed 34 mixer restarts in the call stats you shared earlier, which also seemed
unusually high.

The test code is straightforward — the agent deploys to LiveKit Cloud (US-East), joins the room as an echo participant, and spawns a second ping participant in the same room. Happy to share the full source
if helpful.

Thanks!

Hi Team,

Following your suggestion to “set up two clients that just open a room and ping-pong an audio signal”, we built two tests:

  1. Audio Ping-Pong: Two participants in a room — one sends a 1kHz tone, the other echoes it back immediately. Measures audio round-trip.
  2. Data Channel Ping-Pong: Same setup, but using publish_data() instead of audio. Measures pure WebRTC transport without codec or jitter buffer.

We deployed both as LiveKit Cloud Agents in US-East so everything runs in-region.

In-Region Results

  ┌─────────────────┬────────────┐
  │      Test       │ Median RTT │
  ├─────────────────┼────────────┤
  │ Data Channel    │ 7ms        │
  ├─────────────────┼────────────┤
  │ Audio Ping-Pong │ 392ms      │
  └─────────────────┴────────────┘

Same region, same room, same two participants. The only difference is audio goes through Opus codec + jitter buffer.

The audio pipeline adds ~385ms of overhead. Per direction ~192ms, of which we estimate ~147ms is the WebRTC adaptive jitter buffer.
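The overhead arithmetic, spelled out. The ~147ms jitter-buffer share rests on an assumed ~45ms per direction for encode/decode and pacing, which is our guess:

```python
# Audio-pipeline overhead derived from the two in-region ping-pong results.
data_channel_rtt_ms = 7                 # pure WebRTC transport
audio_rtt_ms = 392                      # same path, through the audio pipeline

overhead_rtt_ms = audio_rtt_ms - data_channel_rtt_ms     # ~385 ms
overhead_per_dir_ms = overhead_rtt_ms / 2                # ~192.5 ms

codec_and_pacing_ms = 45                # assumed encode+decode+pacing share
jitter_buffer_est_ms = overhead_per_dir_ms - codec_and_pacing_ms   # ~147.5 ms
```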

We also tried min_playout_delay=0, max_playout_delay=100 on room creation — no effect on audio RTT.

Questions

  1. Is there a way to reduce the WebRTC jitter buffer depth for audio tracks?
  2. AudioSource(queue_size_ms=0) panics on livekit-ffi v0.12.42 — known bug?

Thanks!

Edit: This issue will be handled by John (my co-worker).

I am asking for some advice from other team members, but I think the problem is that `sent_at` is recorded at push time, while the tone samples sit in the buffer behind ~24,000 samples of silence. A 10ms timer pops only 480 samples per tick, so the tone won’t be delivered via WebRTC until the silence ahead of it drains.

The echo side contributes less because frames arrive from WebRTC at a real-time rate (the jitter buffer paces them), so the echo’s buffer stays relatively shallow regardless of queue_size_ms. The dominant source is the ping side’s burst push.

queue_size_ms controls a FIFO with a fixed-rate 10ms drain. The default 1000ms lets you push ~1 second of audio with no backpressure, which means capture_frame returns instantly, but the audio doesn’t actually reach WebRTC until the timer drains through the queue. Your 25 silence frames pile up ~500ms deep, and the tone sits behind them. Setting it to 10ms forces frame-by-frame pacing, keeping the buffer shallow (~10-20ms) so audio reaches WebRTC almost immediately after capture.
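A toy model of that drain, assuming 20ms pushed frames and the 10ms / 480-sample drain tick (simplified; the real FFI queue is more involved):

```python
# Toy model of the AudioSource FIFO: burst-pushed frames queue up with no
# backpressure, and a 10 ms timer drains 480 samples per tick, so a tone
# pushed after silence waits for all queued silence to drain.
SAMPLE_RATE = 48000
DRAIN_SAMPLES_PER_TICK = 480     # popped every 10 ms
TICK_MS = 10

def tone_delay_ms(queued_silence_samples: int) -> float:
    """Delay before the first tone sample leaves the queue."""
    ticks = queued_silence_samples / DRAIN_SAMPLES_PER_TICK
    return ticks * TICK_MS

# 25 silence frames of 20 ms (960 samples) each = 24,000 samples ahead of
# the tone -> the ~500 ms pile-up described above.
delay_ms = tone_delay_ms(25 * 960)
```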

Thanks, that explanation is very clear. Confirmed — reducing queue_size_ms to 10 matches the behavior you described, and your point about the echo side
staying shallow (jitter buffer paces incoming frames) explains the asymmetry we observed.

One correction to our earlier numbers: our sent_at is recorded at capture_frame() call time, not actual WebRTC send time. So part of the measured RTT
improvement from reducing queue_size_ms is a measurement artifact — the queued silence was inflating our timing.

Two additional optimizations we found on the SDK side:

  1. DTX=False + continuous silence — keeps the receiving jitter buffer converged. Audio RTT dropped from 435ms → 245ms (Taiwan → US-East, pure client
    test).
  2. Dedicated audio thread — in production agents, capture_frame() competes with STT/LLM/TTS on the same event loop, causing irregular push timing →
    underruns → jitter buffer inflation. Moving it to a dedicated thread: 504ms → 286ms in-region, nearly matching our pure client baseline (275ms).
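A sketch of optimization 2. The `push_frame` callback stands in for `rtc.AudioSource.capture_frame`, and the pacing constants are our choices, not SDK requirements:

```python
# Dedicated capture thread: frames are drained from a queue at a fixed 10 ms
# cadence on their own thread, so event-loop stalls from STT/LLM/TTS work
# cannot delay the pushes.
import queue
import threading
import time

FRAME_SEC = 0.010   # one 10 ms frame per tick

def capture_loop(frames, push_frame, stop):
    next_deadline = time.monotonic()
    while not stop.is_set():
        try:
            frame = frames.get(timeout=FRAME_SEC)
        except queue.Empty:
            continue
        push_frame(frame)
        # Absolute-deadline pacing: processing time doesn't accumulate as drift.
        next_deadline += FRAME_SEC
        sleep_for = next_deadline - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)
        else:
            next_deadline = time.monotonic()   # fell behind; resync the clock

# Usage: the producer (e.g. TTS output) enqueues frames; the thread paces them.
sent = []
frames = queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=capture_loop, args=(frames, sent.append, stop))
worker.start()
for _ in range(10):
    frames.put(b"\x00" * 960)          # one 10 ms mono frame (480 int16 samples)
deadline = time.monotonic() + 1.0
while len(sent) < 10 and time.monotonic() < deadline:
    time.sleep(0.005)
stop.set()
worker.join()
```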

Remaining questions:

  1. Jitter buffer floor: With queue_size_ms=10, DTX off, and in-region deployment, data channel RTT is ~7ms but audio RTT is still ~280ms. The
    ~137ms/direction gap appears to be libwebrtc’s adaptive jitter buffer. min_playout_delay=0, max_playout_delay=100 had no effect. Is this expected? Any way
    to reduce it?
  2. queue_size_ms=0 panics on livekit-ffi v0.12.42. Should we file a bug?
  3. 34 mixer restarts (call SCL_KQBpFvPTAUbM) — expected, or worth investigating?
  4. RoomIO queue_size_ms=200 hardcoded in _output.py — plans to make configurable?

I am curious about the use case you are trying to solve for.

A colleague mentions:

It would be good to get a webrtc-internals dump; you can use a tool like https://fippo.github.io/webrtc-dump-importer/ to look at the data. That would show the jitter buffer state.

Questions:

  1. …public API to override NetEq’s target….

I am not aware of a public API to adjust this.

2. queue_size_ms=0 panics on livekit-ffi v0.12.42. Should we file a bug?

No need to file an issue. I think that is already fixed on main. What version of RTC are you currently using?

PR #778

Try upgrading to rtc ≥ v1.0.25 and re-testing queue_size_ms=0. If it still panics on a current version, that’s definitely a bug to file. You’ll need to ensure your frames are exactly 10ms (480 samples at 48kHz) since the fast path enforces this strictly.

  1. RoomIO queue_size_ms=200 hardcoded in _output.py — plans to make configurable?

I am not aware of a plan to make that configurable.