Egress API returned 404 "requested room does not exist" for ~21s on rooms that were live (agents connected)

Setup: voice agent on livekit-agents 1.4.6 / livekit 1.1.2 / livekit-api 1.0.7 (Python), inbound SIP telephony, LiveKit Cloud. Agents are dispatched to rooms on inbound calls, and we start a room-composite egress (audio_only, DUAL_CHANNEL_AGENT, OGG → S3) right after the agent joins the room.

5 inbound SIP calls arrived near-simultaneously (~18:21:29–36 UTC) and agents were dispatched to all 5 rooms. All 5 agents joined their rooms successfully, and all 5 calls then ran normally end-to-end (conversation / voicemail detection, clean shutdowns).

But for ~21 seconds (18:21:38 → 18:21:59 UTC, 2026-06-04) every StartRoomCompositeEgress call for those rooms failed with:

TwirpError(code=not_found, message=requested room does not exist, status=404)

We retried 5× per room over ~21s — every attempt 404’d, for all 5 rooms — while the rooms were demonstrably live (our agents were connected inside them the whole time, and the calls continued for 30–80+ seconds after). Once retries were exhausted, the calls completed without any recording.
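For reference, a minimal sketch of this kind of retry loop (not the exact production code). It assumes the raised error exposes a `code` attribute equal to "not_found", as the logged TwirpError suggests. `NotFoundError`, `start_with_retry`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import asyncio
import random


class NotFoundError(Exception):
    """Hypothetical stand-in for an API error whose code is "not_found"."""

    code = "not_found"


async def start_with_retry(start_egress, *, attempts=5, base_delay=1.0,
                           max_delay=8.0, sleep=asyncio.sleep):
    """Await start_egress() until it succeeds or attempts are exhausted.

    Only errors with code == "not_found" are retried; any other error is
    re-raised immediately. Delays grow exponentially with a little jitter.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return await start_egress()
        except Exception as exc:
            if getattr(exc, "code", None) != "not_found":
                raise
            last_error = exc
            if attempt < attempts - 1:
                delay = min(max_delay, base_delay * 2 ** attempt)
                # Jitter keeps simultaneous calls from retrying in lockstep.
                await sleep(delay + random.uniform(0, delay * 0.1))
    raise last_error
```

Here `start_egress` would be a zero-argument coroutine function wrapping the actual StartRoomCompositeEgress call. With these illustrative defaults the loop gives up after roughly 15 seconds, so a window like the one above would still exhaust it.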

Affected rooms (project p_3tqm7ro6kbs), all 2026-06-04 18:21:38–18:21:59 UTC:

Room ID            Egress attempts window
RM_BuEqWgz7JJCJ    → 18:21:54.6
RM_TTNwaCm7GSp4    → 18:21:58.9
RM_jYgXuCWEvqQr    18:21:39 → 18:21:58.9
RM_Uuusmm6z8ZCC    18:21:38 → 18:21:59.0
RM_E79B9F7nrnbN    18:21:38 → 18:21:59.3

Questions:

  1. Under what conditions does the Egress API return 404 "requested room does not exist" for a room that is live (participants connected)? Is there room-state propagation between the RTC and egress control planes that can lag?
  2. Was there a known blip in the egress service around 18:21 UTC on 2026-06-04? The simultaneity across 5 independent rooms suggests something service-side rather than per-room.
  3. Any recommended client strategy here — e.g., how long is it worth retrying StartEgress on not_found for a room we know is live?

Impact for us: 5 completed calls with no audio recording (compliance-relevant). Happy to share request IDs / more logs.

@Zaheer_Abbas, yeah, 5 simultaneous 404s on 5 rooms basically rules out per-room state and points at something service-side. For compliance recording that is resilient to that timing window, auto egress is the move: configure the recording in CreateRoom so LiveKit Cloud starts it itself, with no separate StartEgress call that can 404.

The shape (S3 tracks example verbatim from docs):

  curl -X POST <host>/twirp/livekit.RoomService/CreateRoom \
    -H "Authorization: Bearer <token>" \
    -H 'Content-Type: application/json' \
    --data-binary '{
      "name": "my-room",
      "egress": {
        "tracks": {
          "filepath": "bucket-path/{room_name}-{publisher_identity}-{time}",
          "s3": {"access_key": "", "secret": "", "bucket": "mybucket", "region": ""}
        }
      }
    }'

For your RoomComposite + audio_only + DUAL_CHANNEL_AGENT stack, you'd use egress.room instead of egress.tracks, mirroring your existing StartRoomCompositeEgress request body into that field. The room-composite-with-S3 shape isn't explicitly demoed in the docs, so verify that your audio_mixing and OGG fields carry over cleanly (see "Auto egress" in the LiveKit documentation).
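As a rough, unverified sketch of what that egress.room body might look like, here is a small Python helper that builds the CreateRoom JSON. The field names (audio_only, audio_mixing, file_outputs, file_type) are assumed to mirror the RoomCompositeEgressRequest fields, the helper name `build_auto_egress_room_body` is hypothetical, and the filepath template is only a placeholder adapted from the example above. Please test it against your project before relying on it.

```python
import json


def build_auto_egress_room_body(room_name, filepath, bucket, region,
                                access_key, secret):
    """Build a CreateRoom JSON body that requests room-composite auto egress.

    Assumption: egress.room accepts the same fields as a
    StartRoomCompositeEgress request body. This is not confirmed by the docs.
    """
    return {
        "name": room_name,
        "egress": {
            "room": {
                "audio_only": True,
                "audio_mixing": "DUAL_CHANNEL_AGENT",
                "file_outputs": [
                    {
                        "file_type": "OGG",
                        "filepath": filepath,
                        "s3": {
                            "access_key": access_key,
                            "secret": secret,
                            "bucket": bucket,
                            "region": region,
                        },
                    }
                ],
            }
        },
    }


if __name__ == "__main__":
    body = build_auto_egress_room_body(
        "my-room", "bucket-path/{room_name}-{time}", "mybucket", "", "", ""
    )
    # Paste the printed JSON into the --data-binary argument of the curl call.
    print(json.dumps(body, indent=2))
```

The output can replace the JSON in the curl example above. Since the room has to be created with this body before the SIP call lands in it, how that fits your inbound dispatch flow is something you would need to check on your side.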

I have asked @Milos_Pesic to look into this further and help work out what happened here. We have been seeing a couple of egress issues lately.