Issue: [Python SDK] Network switch causes "stuck-in-reconnecting" and "ghost room" states in livekit-agents

Hey team — hitting a reliable failure on Wi-Fi network switches with livekit-agents (Python). We’ve confirmed this in production on LiveKit Cloud, and I am also able to consistently reproduce it locally by simply switching Wi-Fi networks.

It results in two distinct failure modes from the same trigger:

  • Stuck-in-reconnecting: connection_state_changed fires with state 2 but never returns to 1. A TCP probe to the LK host stays reachable (30–100ms RTT) throughout the 30+ seconds the SDK is wedged. Because the network is demonstrably fine, the reconnect logic seems stuck at the application layer.

  • Ghost-room: connection_state_changed fires with state 1 (claiming a successful reconnect), but remote_participants remains a stale pre-blip snapshot. No audio frames flow, and participant_disconnected never fires for the participant that actually dropped.

The Impact: Users end up with 2+ minutes of a disconnected, unresponsive agent. Since we are using LiveKit Cloud for a production use case, this unresponsiveness is a critical issue for us.

Current Workaround & Ask: We are currently mitigating this by monitoring connection_state_changed and force-evicting via RoomService.RemoveParticipant when the SDK doesn’t recover.

Could you provide guidance on how to handle this state properly? Also, is this a known issue with a fix currently being baked? Happy to share logs and exact repro steps if helpful!

Environment

  • livekit-agents==1.4.1 (also reproduced on 1.5.6)
  • livekit==1.0.25 (Python wrapper around the Rust livekit-ffi)
  • livekit-api==1.1.0
  • Python 3.14
  • macOS (local dev) and LiveKit Cloud (production)
  • RUST_LOG=livekit=debug,livekit_api=debug

Repro

  1. Start an AgentSession worker, have a remote participant join the room and exchange audio normally.
  2. Cause a brief network blip on the agent host — Wi‑Fi network switch on macOS works every time; production
    hits it on natural network blips.
  3. Observe: WS keepalive ping times out, SDK enters Resume strategy, then stalls.

Two observed end-states:

  • Variant A (stuck-in-reconnecting): connection_state_changed → 2 (Reconnecting), never returns to 1. No
    further SDK activity for ~60s.
  • Variant B (ghost-room): connection_state_changed → 1 (Connected) but remote_participants is the stale
    pre‑blip snapshot. No audio frames flow and participant_disconnected never fires. Same SDK-side stall,
    different surface symptom. (Related: livekit/agents#1581.)

Smoking-gun logs

The wedge (Variant A, the 2s window before we killed the process):

18:24:43.187 WARN livekit::rtc_engine - received session close: "signal client closed: "ping timeout"" UnknownReason Resume
18:24:43.188 ERROR livekit::rtc_engine - resuming connection… attempt: 0
<silence — no ICE candidates, no further signaling, no errors, no attempt: 1>

Earlier run where we let the SDK self-recover (same trigger, same attempt: 0 entry, no kill switch):

15:15:55.557 ERROR livekit::rtc_engine - resuming connection… attempt: 0
<silence for ~65s>
15:17:00.319 ERROR livekit::rtc_engine - resuming connection failed: signal failure: ws failure: Connection
closed normally
15:17:00.319 ERROR livekit::rtc_engine - restarting connection… attempt: 1

-> Resume does eventually escalate, but only when the next ping cycle independently times out ~65s later.
That’s the user-facing outage.

The close event never fires during the wedge: the @session.on("close") handler does not run, and
await session.aclose() and await room.disconnect() both never return. We confirmed this with
asyncio.wait_for(…, timeout=2.0), which always trips the timeout in this state.
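
For anyone wanting to run the same check, here is a minimal sketch of the bounded-teardown probe described above. close_with_deadline is a hypothetical helper name, not part of the SDK, and the 2.0s default simply mirrors the timeout quoted above. It takes any zero-argument coroutine function, so it does not depend on a particular SDK version.

```python
import asyncio


async def close_with_deadline(close_fn, timeout=2.0):
    """Await close_fn() for at most `timeout` seconds.

    Returns True if teardown completed, False if it timed out.
    `close_fn` is any zero-argument coroutine function (hypothetical
    hook; in this thread it would be session.aclose or room.disconnect).
    """
    try:
        await asyncio.wait_for(close_fn(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending call at this point.
        return False
```

Usage would look like `ok = await close_with_deadline(session.aclose)`, falling back to a harder kill path when it returns False. One caveat: asyncio.wait_for waits for the cancelled call to actually finish cancelling, so a teardown that ignores cancellation could still block longer than the timeout.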


Why this looks like a bug, not expected behavior

(LiveKit support confirmed this off-thread, paraphrasing their reply:)

▎ A ping timeout followed by resuming connection… attempt: 0 with no ICE restart and no escalation for ~60s
▎ is not consistent with the documented reconnect flow. Given that close never fires, this is a stalled resume
▎ path rather than expected behavior.

Hi Matan,

Firstly, your comment about LiveKit support confirming the issue off-thread prompted me to look for the internal conversation. The only thing I can find about this is a comment from Abayomi Praise in our Slack community, here. Please note that this individual is not a member of LiveKit support. cc @CWilson

Secondly, for your actual issue, is this reproducible with:

These are our two starter apps for agent and Mac client.

Your repro steps seem straightforward: it’s just temporarily (how long?) disabling the network. I would expect the front end to reconnect as detailed here, Connecting to LiveKit | LiveKit Documentation. If this is reproducible with the starter apps, let me know and we can investigate further.

@Matan_Porat, this is a recognizable pattern from production voice agents. connection_state is necessary but not sufficient as a liveness signal on the agent side. For the ghost-room variant specifically, the more reliable signal is audio frame timestamps from the remote participant: if no frames arrive for 5-10s while the state still reads Connected, treat the session as dead and force-recreate.
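
A rough sketch of that frame-timestamp check, under some assumptions: FrameLivenessMonitor and its method names are hypothetical, the 8s threshold is just a value inside the 5-10s range above, and on_frame() is expected to be called from whatever loop consumes remote audio frames in your agent. The clock is injectable so the logic can be tested without real waiting.

```python
import time


class FrameLivenessMonitor:
    """Tracks when the last remote audio frame arrived (hypothetical helper)."""

    def __init__(self, stale_after=8.0, clock=time.monotonic):
        self._stale_after = stale_after  # seconds without frames before "dead"
        self._clock = clock
        self._last_frame = clock()

    def on_frame(self):
        # Call once per received remote audio frame.
        self._last_frame = self._clock()

    def is_stale(self, connected):
        # Only meaningful while the SDK claims to be connected: a silent
        # stream in that state is the ghost-room symptom.
        silent_for = self._clock() - self._last_frame
        return bool(connected) and silent_for > self._stale_after
```

Polling is_stale(connected=True) every second or so and force-recreating the session when it returns True would implement the rule above. A participant who is legitimately muted may stop sending frames too, so the threshold likely needs tuning against real traffic.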

For stuck-in-reconnecting, the application-level hammer is a watchdog on state transitions: if Reconnecting persists beyond your tolerance window (15-20s for production voice), terminate the worker process and let your orchestration layer dispatch fresh. RoomService.RemoveParticipant on the affected participant from the server side, like you’re already doing, handles the room-side cleanup.
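
The state-transition watchdog could be sketched roughly as follows. This is an assumption-laden outline, not SDK code: ReconnectWatchdog and on_expire are hypothetical names, the numeric states (1 = Connected, 2 = Reconnecting) are the ones reported earlier in this thread, and on_state() is assumed to be called from the asyncio event loop, for example from a connection_state_changed handler.

```python
import asyncio

CONNECTED = 1     # state values as reported in this thread
RECONNECTING = 2


class ReconnectWatchdog:
    """Fires on_expire if Reconnecting persists longer than `tolerance` seconds."""

    def __init__(self, tolerance, on_expire):
        self._tolerance = tolerance
        self._on_expire = on_expire  # hypothetical hook: evict + exit worker
        self._timer = None

    def on_state(self, state):
        if state == RECONNECTING and self._timer is None:
            # Arm the timer on entering Reconnecting.
            loop = asyncio.get_running_loop()
            self._timer = loop.call_later(self._tolerance, self._expire)
        elif state != RECONNECTING and self._timer is not None:
            # Any other state means the SDK moved on; disarm.
            self._timer.cancel()
            self._timer = None

    def _expire(self):
        self._timer = None
        self._on_expire()
```

With a 15-20s tolerance, on_expire would be the place to call RoomService.RemoveParticipant and then terminate the worker so the orchestration layer can dispatch a fresh one. Keeping the escalation in a callback keeps the watchdog itself free of SDK calls that might also hang.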

Neither workaround fixes the Rust reconnect stall; they just contain the blast radius. Your off-thread support confirmation that 65s of silence between attempt 0 and attempt 1 isn’t the documented flow is the right signal that the fix lives in the client SDK reconnect path. #1581 covered a related event-firing case but is closed; this stuck-resume pattern probably warrants a fresh issue if you don’t already have one open with the team.