Setup: voice agent on livekit-agents 1.4.6 / livekit 1.1.2 / livekit-api 1.0.7 (Python), inbound SIP telephony, LiveKit Cloud. Agents are dispatched to rooms on inbound calls, and we start a room-composite egress (audio_only, DUAL_CHANNEL_AGENT, OGG → S3) right after the agent joins the room.
5 inbound SIP calls arrived near-simultaneously (~18:21:29–36 UTC) and agents were dispatched to all 5 rooms. All 5 agents joined their rooms successfully, and all 5 calls then ran normally end-to-end (conversation / voicemail detection, clean shutdowns).
But for ~21 seconds (18:21:38 → 18:21:59 UTC, 2026-06-04) every StartRoomCompositeEgress call for those rooms failed with:
TwirpError(code=not_found, message=requested room does not exist, status=404)
We retried 5× per room over ~21s — every attempt 404’d, for all 5 rooms — while the rooms were demonstrably live (our agents were connected inside them the whole time, and the calls continued for 30–80+ seconds after). Once retries were exhausted, the calls completed without any recording.
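For illustration, the retry behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the production code: `start_with_retry`, the `start` and `is_not_found` callables, and the delay values are all hypothetical, and the sketch deliberately avoids importing the LiveKit SDK so it can run on its own.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def start_with_retry(
    start: Callable[[], Awaitable[T]],
    is_not_found: Callable[[Exception], bool],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 8.0,
) -> Optional[T]:
    """Call `start` until it succeeds or `attempts` is exhausted.

    Only errors matched by `is_not_found` are retried; any other error
    propagates immediately. Returns None if every attempt hit not_found.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return await start()
        except Exception as exc:
            if not is_not_found(exc):
                raise  # not the "room does not exist" case: do not retry
            if attempt == attempts:
                return None  # retries exhausted, caller proceeds unrecorded
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return None
```

In the real agent, `start` would wrap the StartRoomCompositeEgress call and `is_not_found` would match the TwirpError shown above on its `not_found` code; both are left abstract here because the exact wiring is an assumption.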
Affected rooms (project p_3tqm7ro6kbs), all 2026-06-04 18:21:38–18:21:59 UTC:
| Room ID | Egress attempts window |
|---|---|
| RM_BuEqWgz7JJCJ | → 18:21:54.6 |
| RM_TTNwaCm7GSp4 | → 18:21:58.9 |
| RM_jYgXuCWEvqQr | 18:21:39 → 18:21:58.9 |
| RM_Uuusmm6z8ZCC | 18:21:38 → 18:21:59.0 |
| RM_E79B9F7nrnbN | 18:21:38 → 18:21:59.3 |
Questions:
- Under what conditions does the Egress API return 404 `room does not exist` for a room that is live (participants connected)? Is there room-state propagation between the RTC and egress control planes that can lag?
- Was there a known blip in the egress service around 18:21 UTC on 2026-06-04? The simultaneity across 5 independent rooms suggests something service-side rather than per-room.
- Any recommended client strategy here, e.g., how long is it worth retrying StartEgress on `not_found` for a room we know is live?
Impact for us: 5 completed calls with no audio recording (compliance-relevant). Happy to share request IDs / more logs.