Publisher connection times out mid-call, then retries ~11×/sec until hangup (agent goes silent)

Setup: voice agent on livekit-agents 1.4.6 / livekit 1.1.2 (Python), SIP, LiveKit Cloud.

Mid-call, the publisher PeerConnection drops and never re-establishes. The SDK then logs this continuously until the caller hangs up:

livekit::rtc_engine::rtc_session:1926 - connection error: could not establish publisher connection: timeout   [ERROR]

A single call produced 1,688 of these ERROR lines over ~2.5 min (00:04:22 → caller hangup 00:06:52, ~11/sec). For that whole window the agent couldn’t publish audio → caller heard silence → hung up. No prior media issue; a few mild event-loop lag warnings (0.2–0.7s) earlier in the call.

What causes the publisher PC to fail to re-establish mid-call, and is there a recommended way to recover it (force ICE restart / reconnect), or to fail the job fast instead of looping until hangup?

IDs (LiveKit Cloud, project p_3tqm7ro6kbs): room RM_KS3CeDnoxDXT, job/call AJ_4xWBxJyCDTR3, 2026-06-03 00:04:22Z. Happy to share more logs.

Is that the right room ID? I’m seeing that room ID lasting for 80 minutes, https://cloud.livekit.io/projects/p_3tqm7ro6kbs/sessions/RM_KS3CeDnoxDXT and I can’t find a room starting at 00:04:22Z (I’m sure I’m missing something)

Typically that error:

livekit::rtc_engine::rtc_session:1926 - connection error: could not establish publisher connection: timeout [ERROR]

doesn’t have a single indicative root cause. I could check the server logs for network related issues if I could find the room. Another thing to check is whether you see any other warnings in the agent logs.

The timeline would be about 1am my time, and the team were not making any changes to the infra at that time.

Thanks! You’re right — 00:04:22Z was misleading; that’s 78 min into the call, where the publisher error started, not the room start. RM_KS3CeDnoxDXT is the correct room - it’s the ~80-min session you’re seeing.

Corrected timeline (UTC):

  • 22:46:53Z - room start, agent connected + talking by 22:46:55Z
  • ~78 min of normal call
  • 00:04:22Z - could not establish publisher connection: timeout begins, repeats ~11/sec (1,688×)
  • 00:06:52Z - caller hangs up (agent was silent the whole final window)
  • 00:06:57Z - shutdown
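
As a quick sanity check on the figures above, the durations and error rate can be recomputed from the timestamps. This is only a sketch: the calendar dates are an assumption (the error window was reported as 2026-06-03, so the room start is taken to be the previous evening).

```python
from datetime import datetime, timezone

# Timestamps from the corrected timeline (UTC).
# Assumption: the room started on 2026-06-02, the evening before the
# 2026-06-03 error window.
room_start = datetime(2026, 6, 2, 22, 46, 53, tzinfo=timezone.utc)
error_start = datetime(2026, 6, 3, 0, 4, 22, tzinfo=timezone.utc)
hangup = datetime(2026, 6, 3, 0, 6, 52, tzinfo=timezone.utc)
error_lines = 1688  # ERROR lines counted in the agent log

# Minutes of normal call before the first publisher error.
healthy_minutes = (error_start - room_start).total_seconds() / 60
# Seconds during which the agent could not publish audio.
silent_seconds = (hangup - error_start).total_seconds()
# Average rate of the repeated ERROR line.
errors_per_second = error_lines / silent_seconds

print(f"healthy: {healthy_minutes:.1f} min")
print(f"silent: {silent_seconds:.0f} s")
print(f"rate: {errors_per_second:.2f} errors/s")
```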

So please focus on the server/network logs on your side for that room.

On agent warnings in that window: only mild event-loop “slow callback” delays (~0.2s) — no signaling/reconnect/ICE warnings on our side, so signaling appears to have stayed up while only the publisher media/ICE path failed to re-establish.

Happy to share further logs if needed to debug further

I see the following warning in the server logs at 00:04:06Z:

write data channel failed

with the accompanying message:

outbound packet larger than maximum message size: 65536

That’s the only suspicious thing I see. This isn’t SIP-related; it looks like something in your app is trying to send large data packets (see “Data packets” in the LiveKit documentation).

@darryncampbell’s finding points to the root cause: something in your agent is sending a data packet larger than 65,536 bytes (64 KiB) over the data channel.

Check anywhere you call publish_data(). LiveKit’s limit is 65536 bytes, and exceeding it can corrupt the data channel and even crash the publisher PC.

Large transcripts or RAG payloads sent through publish_data() are common causes.

Fix: chunk anything you send to under 16KB per packet, or remove large payloads from the data channel entirely.
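
A minimal sketch of the chunking idea, assuming the payload is already encoded as bytes. The helper name chunk_payload, the 15 KiB chunk size (chosen only to stay under the 16KB suggested above), and the (index, total, chunk) framing are all illustrative, not part of the LiveKit SDK. Each chunk would then be handed to publish_data() individually, and the receiver would need to reassemble the chunks before decoding.

```python
from typing import List, Tuple

MAX_MESSAGE_BYTES = 65536   # limit quoted in the server log
CHUNK_BYTES = 15 * 1024     # assumption: a conservative size under 16KB


def chunk_payload(payload: bytes,
                  chunk_bytes: int = CHUNK_BYTES) -> List[Tuple[int, int, bytes]]:
    """Split payload into (index, total, chunk) tuples of at most chunk_bytes.

    Hypothetical helper. Chunks are cut on byte boundaries, so a UTF-8
    character may be split across two chunks; the receiver must join all
    chunks before decoding.
    """
    if not 0 < chunk_bytes < MAX_MESSAGE_BYTES:
        raise ValueError("chunk_bytes must be positive and below the message limit")
    pieces = [payload[offset:offset + chunk_bytes]
              for offset in range(0, len(payload), chunk_bytes)]
    total = len(pieces)
    return [(index, total, piece) for index, piece in enumerate(pieces)]


# Example: a 120,000-byte transcript, well over the 65536-byte limit.
transcript = ("hello world " * 10000).encode("utf-8")
parts = chunk_payload(transcript)

# Every chunk is small enough to send, and joining them restores the original.
assert all(len(chunk) <= CHUNK_BYTES for _, _, chunk in parts)
assert b"".join(chunk for _, _, chunk in parts) == transcript
```

The index and total fields are there so a receiver can detect a missing chunk; how they are serialized alongside the bytes is left open here.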

For the fast-fail question: there is no built-in publisher-failure to job-exit hook in livekit-agents 1.x.

Workaround: monitor repeated RTC errors in your agent and call ctx.room.disconnect() manually once you detect the publisher is stuck.
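
One way to sketch that monitoring is a logging handler that counts the repeated error inside a sliding time window and fires a callback once. This assumes the SDK's error lines are surfaced through Python's logging module (the logger name "livekit" in the usage comment is an assumption), and the class name, threshold, and window values are illustrative. The callback is where the agent would schedule ctx.room.disconnect().

```python
import logging
import time
from collections import deque
from typing import Callable, Deque


class PublisherStuckWatchdog(logging.Handler):
    """Hypothetical helper: calls on_stuck once when a matching ERROR
    repeats at least `threshold` times within `window_s` seconds."""

    def __init__(self,
                 on_stuck: Callable[[], None],
                 needle: str = "could not establish publisher connection",
                 threshold: int = 20,
                 window_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        super().__init__(level=logging.ERROR)
        self._on_stuck = on_stuck
        self._needle = needle
        self._threshold = threshold
        self._window_s = window_s
        self._clock = clock
        self._hits: Deque[float] = deque()
        self._fired = False

    def emit(self, record: logging.LogRecord) -> None:
        # Ignore unrelated messages, and everything after the first trigger.
        if self._fired or self._needle not in record.getMessage():
            return
        now = self._clock()
        self._hits.append(now)
        # Drop hits that have fallen out of the sliding window.
        while self._hits and now - self._hits[0] > self._window_s:
            self._hits.popleft()
        if len(self._hits) >= self._threshold:
            self._fired = True
            self._on_stuck()


# Usage sketch (assumption: SDK errors reach a Python logger named "livekit"):
#   watchdog = PublisherStuckWatchdog(on_stuck=request_disconnect)
#   logging.getLogger("livekit").addHandler(watchdog)
# where request_disconnect is your own function that schedules
# ctx.room.disconnect() on the agent's event loop.
```

At the observed rate of roughly 11 errors per second, the default of 20 hits within 5 seconds would trigger about two seconds after the loop starts, while tolerating an isolated timeout.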

Thanks for looking into the logs, Darryn. We are using the livekit/agents SDK; shouldn’t this be handled in that SDK? From the agent side, I’m not sure we are sending any large packets. The consumer audio did have a large transcript buffer because the audio was being streamed continuously from the consumer end.

@Zaheer_Abbas The continuous audio stream is likely the cause. After 78 minutes, the transcript buffer grows large enough to exceed 65KB when the SDK flushes it through the data channel.

The agent’s SDK should chunk this automatically, but there is an open issue with large transcript payloads during long calls. It is worth filing an issue on livekit/agents with your reproduction case.

Short-term workaround: set a max buffer duration on your STT to force periodic flushes instead of accumulating the full session transcript.
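
The periodic-flush idea can be sketched generically as a buffer that hands off its contents whenever it grows past a size or age limit, rather than holding the whole session. This is not an STT plugin setting: the class name, the limits, and the flush callback are all hypothetical and would need to be wired into wherever your agent accumulates transcript text.

```python
import time
from typing import Callable, List, Optional


class BoundedTranscriptBuffer:
    """Hypothetical helper: accumulates text and flushes it once it reaches
    max_bytes (UTF-8) or has been held for max_age_s seconds."""

    def __init__(self,
                 flush: Callable[[str], None],
                 max_bytes: int = 8 * 1024,
                 max_age_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._clock = clock
        self._parts: List[str] = []
        self._size = 0
        self._started: Optional[float] = None

    def add(self, text: str) -> None:
        # Start the age timer when the first piece of a new batch arrives.
        if not self._parts:
            self._started = self._clock()
        self._parts.append(text)
        self._size += len(text.encode("utf-8"))
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        # Hand the accumulated text to the callback and reset the buffer.
        if not self._parts:
            return
        text = "".join(self._parts)
        self._parts.clear()
        self._size = 0
        self._started = None
        self._flush(text)
```

Note that the age limit is only checked when new text arrives, so a final explicit flush() is still needed at the end of the call.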

@Zaheer_Abbas It’s not an error I can remember seeing before to be honest & I don’t see any previous reports. I presume you don’t see anything else in your agent logs? I would expect something in your agent logs if this was coming from the agent, or maybe your client logs?

We don’t have a client; our use case is telephony-based voice agents, so we don’t have “client”-side logs.

Hi @Zaheer_Abbas , I took another look at this, this morning.

I still don’t see anything obviously wrong on our end, if I look at the SIP participant events, https://cloud.livekit.io/projects/p_3tqm7ro6kbs/sessions/RM_KS3CeDnoxDXT/participants/PA_rFZabmAsboNb, the track isn’t reconnecting.

My best guess is some spurious network issue with the agent. Have you seen this issue repeat, or was it a one-off?

This was a one-off, yes. I’m also unable to pin down the exact root cause.

@Zaheer_Abbas the team did some extensive investigation around this issue, and the following PR should prevent it from recurring: “Reject oversized data messages before they break the data channel” by cnderrauber (Pull Request #1137, livekit/rust-sdks). The PR is still under review, but I wanted to give you an update.

I believe this was the root cause, as you suspected.