Urgent: Intermittent CONNECTION_TIMEOUT / action=Resume causing livekitwebrtcsink publisher pipeline crashes
Hi LiveKit community,
I want to flag an urgent issue with our LiveKit integration that started recently and is actively disrupting customer sessions.
Summary
We use the GStreamer LiveKit plugin, livekitwebrtcsink, in a pipeline to publish a video/audio feed into our LiveKit rooms as a WebRTC participant.
This pipeline has been consistently crashing about 4 minutes after the participant connects.
The logs show the pipeline receiving a LeaveRequest with reason ConnectionTimeout and action=Resume, which appears to tell the client to attempt to reconnect (resume) rather than disconnect. However, the GStreamer LiveKit client does not seem to handle this signal and stops entirely instead.
Key details
- The affected participant only publishes video and audio. It does not subscribe to anything.
- It connects to the LiveKit room using an API key/secret, not a token.
- The issue appeared simultaneously across all of our environments:
  - production
  - staging
  - local machines
- There were no changes on our side when this began. No deploys, updates, or config changes.
- We have been connecting to LiveKit Cloud server version v1.9.11 in staging for over a month, confirmed via logs.
- This issue previously appeared and then stopped on its own:
  - first occurrences started around March 3, 2026
  - they stopped around 2026-03-04 21:30:00 UTC
- New occurrences started again on 2026-03-11 13:30:00 UTC and are currently ongoing.
Because the behavior started and stopped suddenly without any changes on our side, and affected all environments at once, we strongly suspect an infrastructure-side factor is involved.
Possible connection to UDP to TCP fallback / reconnect handling
We suspect this is related to the UDP-to-TCP fallback / reconnect behavior introduced around LiveKit v1.8.5, since that logic can produce this kind of timeout message.
Specifically, we are seeing a LeaveRequest with:
- reason: ConnectionTimeout
- action: Resume
- region candidates included in the payload
But instead of resuming, the GStreamer pipeline stops.
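To make the question concrete, here is a minimal sketch of how a client might act on the fields in the logged LeaveRequest. The types are hypothetical stand-ins that mirror the fields printed in the signaller's `Leave: LeaveRequest { .. }` debug line (see Relevant logs), not the real livekit-protocol or gst-plugins-rs types, and the mapping assumes that `action` takes precedence over the legacy `can_reconnect` flag. That assumption matters for our case: the logged request carries `can_reconnect: false` together with `action: Resume`, so a client that only checked `can_reconnect` would stop instead of resuming. This is an assumption, not something verified against the plugin source.

```rust
// Hypothetical mirror of the fields printed in the signaller's
// `Leave: LeaveRequest { .. }` debug line; NOT the real protocol types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DisconnectReason {
    ConnectionTimeout,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeaveAction {
    Disconnect,
    Resume,
    Reconnect,
}

#[derive(Debug, Clone)]
struct LeaveRequest {
    can_reconnect: bool,
    reason: DisconnectReason,
    action: LeaveAction,
    // Region URLs in the order the server sent them (closest first in our log).
    region_urls: Vec<String>,
}

// What the publisher should do next (hypothetical client-side policy).
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    // Give up: tear the session down.
    Stop,
    // Resume the existing session on the current connection target.
    Resume,
    // Full rejoin, optionally on the first region URL offered by the server.
    Reconnect(Option<String>),
}

fn next_step(req: &LeaveRequest) -> NextStep {
    // Assumption: `action` supersedes the legacy `can_reconnect` flag, so it is
    // checked first; `can_reconnect` only matters when `action` is Disconnect.
    match req.action {
        LeaveAction::Resume => NextStep::Resume,
        LeaveAction::Reconnect => NextStep::Reconnect(req.region_urls.first().cloned()),
        LeaveAction::Disconnect if req.can_reconnect => {
            NextStep::Reconnect(req.region_urls.first().cloned())
        }
        LeaveAction::Disconnect => NextStep::Stop,
    }
}

fn main() {
    // Values from the LeaveRequest in our logs (region list shortened).
    let req = LeaveRequest {
        can_reconnect: false,
        reason: DisconnectReason::ConnectionTimeout,
        action: LeaveAction::Resume,
        region_urls: vec![
            "https://teleo-staging-bk1tdl1w.osaopaulo1b.production.livekit.cloud".to_string(),
        ],
    };
    println!("{:?} -> {:?}", req.reason, next_step(&req));
}
```

With the values from our log, `next_step` returns `NextStep::Resume`, whereas a check on `can_reconnect` alone would lead to a stop.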
Example session
One affected session from a user report:
room-gst-producer receiving Participant left (CONNECTION_TIMEOUT)
Session RM_dDBeHJhEr9ke
Relevant logs
A few seconds before the crash:
0:04:01.527049115 436 0x7aa948007780 DEBUG webrtc-livekit-signaller net/webrtc/src/livekit_signaller/imp.rs:316:gstrswebrtc::livekit_signaller::imp::Signaller::on_signal_event::{{closure}}:<GstLiveKitWebRTCSinkSignaller@0x5cb778f8a250> Connection quality: ConnectionQualityUpdate { updates: [ConnectionQualityInfo { participant_sid: "PA_5NLgLKCgQJMm", quality: Lost, score: 1.252 }] }
At the point where the GStreamer pipeline crashes:
0:04:04.970865139 436 0x7aa948007780 DEBUG webrtc-livekit-signaller net/webrtc/src/livekit_signaller/imp.rs:337:gstrswebrtc::livekit_signaller::imp::Signaller::on_signal_event::{{closure}}:<GstLiveKitWebRTCSinkSignaller@0x5cb778f8a250> Leave: LeaveRequest { can_reconnect: false, reason: ConnectionTimeout, action: Resume, regions: Some(RegionSettings { regions: [RegionInfo { region: "osaopaulo1b", url: "https://teleo-staging-bk1tdl1w.osaopaulo1b.production.livekit.cloud", distance: 89412 }, RegionInfo { region: "ojohannesburg1a", url: "https://teleo-staging-bk1tdl1w.ojohannesburg1a.production.livekit.cloud", distance: 7503272 }, RegionInfo { region: "oashburn1b", url: "https://teleo-staging-bk1tdl1w.oashburn1b.production.livekit.cloud", distance: 7603430 }, RegionInfo { region: "ochicago1b", url: "https://teleo-staging-bk1tdl1w.ochicago1b.production.livekit.cloud", distance: 8335785 }, RegionInfo { region: "omarseille1b", url: "https://teleo-staging-bk1tdl1w.omarseille1b.production.livekit.cloud", distance: 9152002 }, RegionInfo { region: "ophoenix1b", url: "https://teleo-staging-bk1tdl1w.ophoenix1b.production.livekit.cloud", distance: 9312041 }] }) }
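One way to detect this specific failure from captured GST_DEBUG output, for example to trigger a restart or to line incidents up against the time windows in Key details, is to match the signaller's `Leave: LeaveRequest { .. }` line. The sketch below assumes the exact line format shown above, which is Rust Debug formatting rather than a stable interface, so it may break across plugin versions; `parse_leave_line` and `is_timeout_resume` are hypothetical helper names.

```rust
/// Fields pulled out of a signaller `Leave: LeaveRequest { .. }` debug line.
#[derive(Debug, PartialEq, Eq)]
struct LeaveEvent {
    // Pipeline running time at the start of the line, e.g. "0:04:04.970865139".
    running_time: String,
    reason: String,
    action: String,
    can_reconnect: bool,
}

// Returns the token after `key: `, stopping at the next ',' or space.
fn field<'a>(text: &'a str, key: &str) -> Option<&'a str> {
    let pattern = format!("{}: ", key);
    let start = text.find(&pattern)? + pattern.len();
    let rest = &text[start..];
    let end = rest.find(|c: char| c == ',' || c == ' ').unwrap_or(rest.len());
    Some(&rest[..end])
}

// Hypothetical helper: parses a GST_DEBUG line if it is a Leave event.
fn parse_leave_line(line: &str) -> Option<LeaveEvent> {
    // Only search after the marker so other text on the line cannot match.
    let body = &line[line.find("Leave: LeaveRequest {")?..];
    Some(LeaveEvent {
        running_time: line.split_whitespace().next()?.to_string(),
        reason: field(body, "reason")?.to_string(),
        action: field(body, "action")?.to_string(),
        can_reconnect: field(body, "can_reconnect")? == "true",
    })
}

// The failure we care about: a ConnectionTimeout that asks the client to resume.
fn is_timeout_resume(ev: &LeaveEvent) -> bool {
    ev.reason == "ConnectionTimeout" && ev.action == "Resume"
}

fn main() {
    // Shortened copy of the crash-time line from our logs.
    let line = "0:04:04.970865139 436 0x7aa948007780 DEBUG webrtc-livekit-signaller \
                imp.rs:337 Leave: LeaveRequest { can_reconnect: false, \
                reason: ConnectionTimeout, action: Resume, regions: Some(RegionSettings { .. }) }";
    match parse_leave_line(line) {
        Some(ev) if is_timeout_resume(&ev) => println!("timeout/resume at {}", ev.running_time),
        Some(ev) => println!("other leave: {:?}", ev),
        None => println!("not a leave line"),
    }
}
```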
What we have done so far
As a short-term mitigation, we have implemented a pipeline restart when this specific failure occurs.
This reduces the impact, but it is not a real fix, and customers are still seeing disruptions.
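A restart mitigation of this kind can be sketched as a small supervisor loop that reruns the pipeline only for the ConnectionTimeout/Resume exit, with a capped exponential backoff so a persistent outage does not become a tight restart loop. This is an illustrative sketch, not our production code: `PipelineExit`, `supervise`, and the delay values are hypothetical, and in a real application the `run_pipeline` closure would build the GStreamer pipeline, wait for EOS or an error on the bus, and classify how it ended.

```rust
use std::thread::sleep;
use std::time::Duration;

// How one pipeline run ended (hypothetical classification done by the app).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PipelineExit {
    // Normal end of stream: nothing to recover from.
    Eos,
    // The LeaveRequest { reason: ConnectionTimeout, action: Resume } crash.
    TimeoutResume,
    // Any other error: surface it rather than restarting blindly.
    OtherError,
}

// Capped exponential backoff: base * 2^attempt, never above `max`.
fn backoff(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(max)
}

// Reruns the pipeline only after TimeoutResume exits, at most `max_restarts`
// times. Returns the final exit and the number of restarts performed.
fn supervise<F>(mut run_pipeline: F, max_restarts: u32, base: Duration, max: Duration) -> (PipelineExit, u32)
where
    F: FnMut() -> PipelineExit,
{
    let mut restarts = 0;
    loop {
        let exit = run_pipeline();
        if exit != PipelineExit::TimeoutResume || restarts >= max_restarts {
            return (exit, restarts);
        }
        let delay = backoff(restarts, base, max);
        eprintln!("ConnectionTimeout/Resume exit; restarting in {:?}", delay);
        sleep(delay);
        restarts += 1;
    }
}

fn main() {
    // Simulated runs: two timeout/resume crashes, then a clean end of stream.
    let mut outcomes = vec![
        PipelineExit::TimeoutResume,
        PipelineExit::TimeoutResume,
        PipelineExit::Eos,
    ]
    .into_iter();
    let (exit, restarts) = supervise(
        || outcomes.next().unwrap_or(PipelineExit::Eos),
        5,
        Duration::from_millis(10),
        Duration::from_secs(1),
    );
    println!("final exit: {:?} after {} restarts", exit, restarts);
}
```

Because our crashes arrive roughly 4 minutes into an otherwise healthy session, a real supervisor would probably reset the restart counter after a run that stayed up for a while, rather than treating every restart in the process lifetime as consecutive.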
Our questions
We would really appreciate help understanding:
- Whether there is anything on the LiveKit infrastructure side that could be contributing to these ConnectionTimeout/action=Resume events
- Whether livekitwebrtcsink is expected to handle this signal differently
- Whether there were any infrastructure or transport-related changes around:
  - March 3 to March 4, 2026
  - March 11, 2026