egress_ended webhook not delivered for some Room Composite Egress jobs

We’re seeing a small but persistent number of cases where a Room Composite
Egress is started successfully (we get a 200 back from
StartRoomCompositeEgress and a valid EG_… ID), the room subsequently ends,
but we never receive an egress_ended webhook for that job — not
EGRESS_COMPLETE, not EGRESS_FAILED, nothing. Most jobs (>99%) do deliver
the webhook correctly through the same configuration, so it doesn’t look
like a config or signing-key issue at our end. We’d like help understanding
what’s happening to these specific jobs on LiveKit’s side.

Pattern from our logs

  1. We start a Room Composite Egress against an active room. Egress accepted,
    ID returned.
  2. The call (SIP telephony) completes normally — agent ends the session.
  3. Our agent worker calls DeleteRoom during shutdown.
  4. We expect egress_ended to fire for the egress (after the room is deleted). It doesn’t for these cases.

Affected Rooms

All on production LiveKit Cloud, US region.

LiveKit Project - p_3tqm7ro6kbs

Date (UTC)        Room ID          Egress ID
2026-03-04 13:01  RM_SmT8U5tgEwQ9  EG_aWxMbwSuATFF
2026-03-25 12:01  RM_EEQQXu8mGErj  EG_j8tZRfprtqqD
2026-03-26 12:30  RM_VSLMMmnDzmbV  EG_gZ46Xw4r5eHk
2026-03-26 12:30  RM_zWaAmnDfNbka  EG_XoPRtjYh5nr5
2026-03-27 13:18  RM_reV4RkfMK6Pz  EG_iEvcivBkeSyC
2026-03-31 12:13  RM_hwnBxihmu7Cm  EG_vpcLAn9vbay7
2026-03-31 12:13  RM_bhf5MCboQ2PT  EG_LEuvnVnzdSyX
2026-04-03 14:31  RM_MQTCnvrCYvYb  EG_aQgJApX9eT3N
2026-05-05 14:00  RM_Zz4U7vFtkeLw  EG_givytVfrVDsU
2026-05-05 14:02  RM_5Qn5pd4UBNtU  EG_Pwi6x4nZL7XG

Webhook URL is set on the egress request (not on the project) and points at
our service which signs/verifies with LIVEKIT_API_KEY. For comparison, on
the same day as some of these, hundreds of other egresses delivered
egress_ended to the same URL without issue.

Thanks @Zaheer_Abbas for reporting the issue.

I checked recent recordings and can see that our webhook requests were timing out while trying to reach your endpoint. We do have a retry mechanism to ride out transient errors, but the retries were exhausted as well.
I would recommend checking the webhook best practices article for more info and implementation guidelines:

Thanks @Milos_Pesic - I am looking into this further.

Is it possible to have alerts set up, or some other notification, so we know when a webhook has failed all 3 retries? It would be helpful if we could self-discover this somehow and see where the bottleneck currently is in our system.

A webhook for failed webhooks maybe (jk)

@Zaheer_Abbas, the cleanest self-discovery pattern: pair every start-egress with a periodic terminal-event check rather than waiting on the webhook alone.

Every N minutes, call ListEgress for recent egresses and diff the results against your local store. Any egress that is EGRESS_COMPLETE or EGRESS_FAILED on LiveKit’s side but that you haven’t marked as finished locally is a missed webhook. Route those into your normal post-egress pipeline and emit a metric on the rate; that’s your alert signal, independent of whether LiveKit notifies you when retries are exhausted.
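
A minimal sketch of that diff step, in Python. Everything here is illustrative rather than LiveKit's API: `RemoteEgress`, `find_missed_webhooks`, `reconcile`, and the metric name `egress.missed_webhooks` are hypothetical, and `fetch_remote` stands in for whatever wrapper you write around ListEgress in your SDK. Statuses are compared as plain strings, using only the two terminal values mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

# Terminal statuses named in this thread. Extend the set if your SDK
# version reports other terminal values (assumption: statuses arrive
# here as their string names).
TERMINAL_STATUSES = {"EGRESS_COMPLETE", "EGRESS_FAILED"}


@dataclass(frozen=True)
class RemoteEgress:
    """Hypothetical, trimmed-down view of one ListEgress result."""
    egress_id: str
    status: str


def find_missed_webhooks(
    remote: Iterable[RemoteEgress],
    locally_finished_ids: Set[str],
) -> List[RemoteEgress]:
    """Return egresses that are terminal remotely but not finished locally."""
    return [
        e for e in remote
        if e.status in TERMINAL_STATUSES
        and e.egress_id not in locally_finished_ids
    ]


def reconcile(
    fetch_remote: Callable[[], Iterable[RemoteEgress]],
    locally_finished_ids: Set[str],
    handle_ended: Callable[[RemoteEgress], None],
    emit_metric: Callable[[str, int], None],
) -> int:
    """Run one reconciliation pass and return the number of missed webhooks.

    fetch_remote: your own wrapper around ListEgress (hypothetical).
    handle_ended: the same post-egress pipeline the webhook handler calls.
    emit_metric:  your metrics client; the count is the alert signal.
    """
    missed = find_missed_webhooks(fetch_remote(), locally_finished_ids)
    for egress in missed:
        handle_ended(egress)
        # Mark as finished so the next pass does not process it again.
        locally_finished_ids.add(egress.egress_id)
    emit_metric("egress.missed_webhooks", len(missed))
    return len(missed)
```

Run `reconcile` from whatever scheduler you already have (cron, a background task, and so on). Because the handler is shared with the webhook path, it should be idempotent: a late webhook retry and a reconciliation pass can both arrive for the same egress.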

I don’t know off-hand whether the Cloud dashboard surfaces webhook delivery retry counts directly; worth checking with @Milos_Pesic if that would help close the loop.