Setup: voice agents on livekit-agents 1.4.6 / livekit 1.1.2 / livekit-api 1.0.7 (Python 3.11), inbound SIP telephony, LiveKit Cloud (project p_3tqm7ro6kbs). On JobContext.connect() the agent starts an AgentSession with RoomIO (it publishes one audio track for the agent’s voice) and a room-composite audio egress.
What happened (one call): the agent connected, then failed to publish its own audio output track: the publish call timed out with no response from the server. The session never recovered. It fell into a "Subscriber pc state failed → resume" loop that repeated every 5 minutes, the agent never produced any audio (the caller heard dead air and hung up), and the job process stayed alive for ≥ 2 h 18 m, leaking memory and running inference slower than realtime, without ever self-terminating.
Root error at join (_init_task):
  File ".../livekit/agents/voice/room_io/room_io.py", line 338, in _init_task
    await self._audio_output.start()
  File ".../livekit/agents/voice/room_io/_output.py", line 65, in _publish_track
    self._publication = await self._room.local_participant.publish_track(...)
  File ".../livekit/rtc/participant.py", line 699, in publish_track
    raise PublishTrackError(cb.publish_track.error)
livekit.rtc.participant.PublishTrackError: room error engine: internal error: track publication timed out, no response received from the server
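A possible application-side stopgap for the "never self-terminates" part is a fail-fast guard around startup. This is only a minimal sketch under assumptions: `start_or_shutdown`, `start`, `shutdown`, and `timeout_s` are hypothetical names, not livekit-agents APIs, and the traceback above does not confirm that the PublishTrackError raised in _init_task propagates out of whatever call gets wrapped.

```python
import asyncio
from typing import Awaitable, Callable


async def start_or_shutdown(
    start: Callable[[], Awaitable[None]],
    shutdown: Callable[[str], None],
    timeout_s: float = 20.0,
) -> bool:
    """Run start(); if it raises or exceeds timeout_s, call shutdown(reason).

    Returns True when startup succeeded, False when shutdown was requested.
    All names here are hypothetical; this is not a livekit-agents API.
    """
    try:
        # Bound the whole startup so a hung publish cannot stall the job forever.
        await asyncio.wait_for(start(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        shutdown(f"startup exceeded {timeout_s:.0f}s")
    except Exception as exc:  # e.g. a publish error surfacing from startup
        shutdown(f"startup failed: {exc!r}")
    return False
```

In an entrypoint, `start` would wrap the session start call and `shutdown` would map to something like the job context's shutdown method. Both mappings are assumptions and would need checking against the installed version. The guard would not help if the failure only ever surfaces inside a background task.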
Timeline (UTC):
22:16:36 Entrypoint / Connected to LiveKit
22:16:38 Audio egress started
22:16:48 _init_task FAILS -> publish_track: "track publication timed out, no response received from the server"
22:16:52 rtc_engine: received session close -> resuming connection... attempt: 0
22:17:03 rtc_session: signal_event taking too much time: Answer(...)
22:17:09 rtc_session: Subscriber pc state failed -> resume ; "Wrong packet sequence while retrying"
22:17:25 Subscriber pc state failed -> resume
22:21:52 Subscriber pc state failed -> "received session close: pc_state failed" (resume)
22:26:52 Subscriber pc state failed ... (repeats EVERY 5 minutes)
...
23:58:37+ "process memory usage is high" (every ~5s) + "inference is slower than realtime"
00:35:03 still logging the same loop (≥ 2h18m after start); process never exited in our capture window
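The intervals quoted in this report can be rechecked from the timestamps in the timeline. The sketch below assumes the 00:35:03 entry falls on 2026-06-20, the day after the start time given with the IDs in this report, and it uses only the three entries stamped :52 to measure the resume period.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def ts(text: str) -> datetime:
    """Parse one UTC timestamp from the timeline."""
    return datetime.strptime(text, FMT)


connected = ts("2026-06-19 22:16:36")
publish_failed = ts("2026-06-19 22:16:48")
resumes = [
    ts("2026-06-19 22:16:52"),
    ts("2026-06-19 22:21:52"),
    ts("2026-06-19 22:26:52"),
]
last_seen = ts("2026-06-20 00:35:03")  # assumption: next calendar day

time_to_failure = publish_failed - connected
resume_gaps = [later - earlier for earlier, later in zip(resumes, resumes[1:])]
alive_for = last_seen - connected

print("connect -> publish failure:", time_to_failure)      # 0:00:12
print("gaps between resumes:", [str(g) for g in resume_gaps])
print("connect -> last captured log:", alive_for)          # 2:18:27
```

The publish failure lands 12 s after connect, the resume entries are exactly 5 minutes apart, and the process was still alive 2 h 18 m 27 s after it connected.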
IDs (LiveKit Cloud, project p_3tqm7ro6kbs):
- room: RM_G3frdtYX4CPj
- agent job: AJ_w93PjKVfhwUK
- start: 2026-06-19 22:16:36 UTC
Please help investigate. My guess is that this is similar to the issue "Server-initiated migration fails to resume on agents 1.4.6 - subscriber + publisher PC fail, no recovery, process killed" (expected to be fixed in >1.4.2 per agents #4705). Is that the case?