Server-initiated migration fails to resume on agents 1.4.6 — subscriber + publisher PC fail, no recovery, process killed (expected fixed in >1.4.2 per agents #4705)

Setup: voice agents on livekit-agents 1.4.6 / livekit 1.1.2 / livekit-api 1.0.7 (Python 3.11), inbound SIP telephony, LiveKit Cloud (project p_3tqm7ro6kbs). Each call = one agent job; ctx.connect() is invoked at job entry.

Background: we previously reported subscriber PeerConnection failures in livekit/agents#4705, where the guidance was that this should be addressed in >1.4.2. We have since upgraded to 1.4.6 and still hit it — this time clearly triggered by a server-initiated migration, after which the connection never recovers.

What we saw (one inbound call, 2026-06-16 UTC). The call ran normally for ~34s. Then LiveKit Cloud issued a migration and the Resume never recovered:

13:07:10.796  Participant migration — rtc_engine: received session close: "server request to leave"  (reason: Migration / Resume)
13:07:21.805  rtc_session:762   signal_event taking too much time: Answer(SessionDescription { type: "answer", ... a=ice-lite ... a=recvonly ... })
13:07:29.414  rtc_session:1161  Subscriber pc state failed   → resuming connection... attempt: 0
13:07:29.510  rtc_session:919   Wrong packet sequence while retrying: 1046
13:07:47.204  rtc_session:1161  Publisher pc state failed    → resume
13:11:52.443  rtc_session:1161  Subscriber pc state failed   → resume
13:15:19       livekit.agents: process exited with non-zero exit code -9

Sequence: a server migration → during the Resume the SDK spent >10s processing the new subscriber Answer (the signal_event taking too much time watchdog) → the subscriber PeerConnection failed → the publisher PeerConnection failed → the SDK retried Resume ~3× over ~4.5 minutes (Wrong packet sequence while retrying) and never reconnected → the process exited with -9. From the caller’s side the agent went silent at ~13:07:11, mid-conversation — a dropped call.
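To make the timing in that sequence easier to check, here is a minimal sketch that computes how long after the migration each logged event occurred. The timestamps are copied from the log excerpt above; the labels, the `EVENTS` list, and the `offsets` helper are illustrative names, not part of any LiveKit API. The final timestamp has no milliseconds in the log, so it is padded with `.000` here.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above (UTC, 2026-06-16).
# Labels are shortened paraphrases of the log lines.
EVENTS = [
    ("13:07:10.796", "server-initiated migration (session close)"),
    ("13:07:21.805", "signal_event taking too much time (subscriber Answer)"),
    ("13:07:29.414", "Subscriber pc state failed"),
    ("13:07:29.510", "Wrong packet sequence while retrying"),
    ("13:07:47.204", "Publisher pc state failed"),
    ("13:11:52.443", "Subscriber pc state failed"),
    ("13:15:19.000", "process exited with -9"),  # padded: log has no ms
]


def offsets(events):
    """Return (seconds since the first event, label) for each event."""
    parsed = [
        (datetime.strptime(ts, "%H:%M:%S.%f"), label) for ts, label in events
    ]
    start = parsed[0][0]
    return [((t - start).total_seconds(), label) for t, label in parsed]


if __name__ == "__main__":
    for secs, label in offsets(EVENTS):
        print(f"+{secs:8.3f}s  {label}")
```

Run against these entries, the watchdog line lands about 11s after the migration and the process exit about 488s after it.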

IDs (LiveKit Cloud, project p_3tqm7ro6kbs): room RM_viPL8m3AqU8d, job AJ_m4tsujv4U4Fs, worker AW_GnAj5VMfxtbX, window ~13:07:10–13:15:19 UTC on 2026-06-16.

Please help investigate this issue and let us know what mitigation we can put in place.

Happy to share full agent logs for the room/job above.

Reported here too

I think the reconnect hardening is not in 1.1.2 but in 1.1.8, along with this PR in 1.6.x: "fix quick reconnect participant keyerror" by tinalenguyen (Pull Request #5979, livekit/agents). Right?

The PR you linked above seems to be a different issue.

Check issue #4705 (Participant Migration and State Mismatch issues in certain SIP calls, livekit/agents), where I raised the same state mismatch and Subscriber pc state failed. It was mentioned there that this has been fixed in 1.4.2 of livekit/agents, which is 1.2.0 of rtc.

I think your mapping is a little off. Agents 1.4.2 and 1.4.6 both pin `livekit==1.1.2` in their pyproject.toml, and current livekit-agents 1.6.0 pins `livekit==1.1.9`. Being on agents 1.4.6 leaves you on the same rtc 1.1.2 where I believe the bug lives.
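One way to confirm which rtc version an environment actually resolves to is to read the installed package metadata. This is a minimal sketch using only the standard library's `importlib.metadata`; it assumes the rtc SDK is distributed on PyPI as `livekit` and the agents framework as `livekit-agents`, as in the setup described at the top of the thread. The helper names are illustrative.

```python
import re
from importlib.metadata import PackageNotFoundError, requires, version


def installed(pkg):
    """Return the installed version of pkg, or None if it is not installed."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None


def requirement_name(req):
    """Extract the normalized distribution name from a requirement string."""
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req.strip())
    name = match.group(0) if match else ""
    return re.sub(r"[-_.]+", "-", name).lower()


def pinned_requirements(pkg, dep):
    """Return the requirement strings that pkg declares for dep."""
    try:
        reqs = requires(pkg) or []
    except PackageNotFoundError:
        return []
    return [r for r in reqs if requirement_name(r) == requirement_name(dep)]


if __name__ == "__main__":
    for name in ("livekit-agents", "livekit", "livekit-api"):
        print(f"{name}: {installed(name)}")
    # What the installed agents package itself declares for the rtc SDK.
    print("agents requires:", pinned_requirements("livekit-agents", "livekit"))
```

Running this inside the agent's virtualenv shows both the version that is installed and the requirement the agents package declares, which helps when a lockfile or manual override has moved `livekit` away from the default pin.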

It does not seem to be fixed in 1.4.6 as you have demonstrated at the top of this thread.

I think these are the needed PRs:

We may also need this one that is not merged yet:

  • harden reconnect behaviour

Sorry I mentioned 1.2.0 in my earlier message

  1. Check this GH message on issue #4705 (Participant Migration and State Mismatch issues in certain SIP calls · livekit/agents), which was made in February.
  2. And the PR linked to that issue: fix full_reconnect downgrade & don’t ignore Leave messages by theomonnom · Pull Request #893 · livekit/rust-sdks. This PR was released in 1.1.2 of the rtc package.
  3. This message on issue #4705, a comment from February 13, explicitly mentions that this will be fixed in the next livekit/agents release. That release, livekit-agents@1.4.2, was made on Feb 17.

You also mentioned the issue I myself raised and the fix that was merged and released in 1.1.2 of the RTC.

I don’t think my mapping is off here.

This still seems to be an issue even after the fix that was applied in February.

Ok, maybe the agents team will weigh in on that comment you made in the PR. I don’t think it is fixed in 1.4.6, but maybe I am wrong.

If it is still broken in 1.6.0 it won’t be fixed until at least 1.6.1. I will highlight this thread to the team.

If you have a code example that reproduces the issue and the steps they should follow, I can test it with 1.6.0 and see if it works.

The problem is that this is NOT replicable and occurs only intermittently, during production SIP calls. I haven’t been able to replicate it myself under multiple hardened network conditions, and it seems to be an issue on the LiveKit Cloud server side.

I will ask if we can add some extra logging in 1.6.1 and see if we can get something that will help you find it.

From the server logs, it looks like the agent is applying the migration Answer too late (stale), which is the same event the SDK logs as signal_event taking too much time.

The team dug deeper into the server-side logs for the session referenced above. We believe this PR will address the issue. It appears the failure occurs when the resume is somehow unsuccessful, and this PR addresses that case.

Once this is released, I hope you can let us know if the issue is resolved for you.

Thanks for the update @CWilson. Any ETA on when this will be released, and which versions of livekit/agents and the rtc SDKs we will need to upgrade to?

Any way to verify this fixes things before we do the upgrade? We were told the same last time that upgrading to 1.4.x would solve the issue :folded_hands:

In that PR there is a test that reproduces what we saw in the server logs 100%, and once the fix is applied, it works.

See: tests/peer_connection_signaling_test.rs

The only way to know for sure is to give it a try in your test env and verify. If you can run that PR there, that is a good way for you to confirm it is fixed.

There may be a release today, but I am not sure if this will make it in time for that. I am pushing to get it added, but the changes still need to be reviewed before it can be released. I am not sure what the version numbers will be.

I can post in this thread once it is released.

Ok thanks - will test it once it is released using the script under tests folder. Much appreciated