LiveKit Outage - Production

Hi LiveKit Team,

We are facing two issues with our agent worker deployment and have attached the logs for reference.

First, agent jobs are failing to connect to the room within the 10 second window after job_entry is called. The process stays alive and memory climbs to around 520MB but the call never connects.

Second, session report uploads to the observability endpoint are returning 401 Unauthorized at session end, and the OTLP trace and log exporters are also failing with an invalid API key error.

Could you please help us understand what might be causing these and how we can resolve them?

livekit_error.txt (41.4 KB)

Three distinct problems in these logs:

  1. ConnectionError: engine is closed
    Race condition. The room disconnected, but the code still called stream_bytes() to send a message after the engine
    had been torn down. This happens when a send is attempted during or after disconnect cleanup.

  2. 401 on LiveKit Cloud + OTLP
    Two separate auth failures:

    • https://…/observability/recordings/v0 → LIVEKIT_API_KEY / LIVEKIT_API_SECRET is invalid or doesn’t have
      observability permissions for that cloud project
    • OTLP exporter 401 → the API key in OTEL_EXPORTER_OTLP_HEADERS is wrong or expired

  3. Signal connection timeouts → RegionError("region fetch timed out")
    LiveKit can’t reach the signal server to negotiate a region, which points to a network/DNS issue from wherever the
    agent is running. This causes the retrying… (1/3) loop, after which it does eventually connect (you can see the
    RoomInputOptions warning afterwards).
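
For the auth failures in item 2, a quick first check is whether the worker process actually sees the variables. The sketch below only uses the variable names mentioned above; the function name `check_credentials` is made up for illustration. It can catch a missing value or stray whitespace, but it cannot tell whether a key is wrong, expired, or lacks permissions.

```python
import os

# Variable names taken from the analysis above; adjust if your deployment differs.
REQUIRED_VARS = ("LIVEKIT_API_KEY", "LIVEKIT_API_SECRET", "OTEL_EXPORTER_OTLP_HEADERS")


def check_credentials(env=None):
    """Return a list of human-readable problems found in the credential variables."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is missing or empty")
        elif value != value.strip():
            # Copy-paste from a secrets manager often adds a newline or space.
            problems.append(f"{name} has leading or trailing whitespace")
    return problems
```

Calling `check_credentials()` at worker startup and logging the result should be enough to rule out the simplest causes before looking at key validity on the LiveKit Cloud side.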

Secondary (non-critical):

  • Memory 517-520MB > 500MB warn threshold → normal if loading large models per process, but worth watching
  • RoomInputOptions/RoomOutputOptions deprecated → migrate to RoomOptions
  • Audio queue overflow → momentary, likely during connection delay

Root cause priority:

  1. Fix API keys (check LIVEKIT_API_KEY, LIVEKIT_API_SECRET, OTLP headers in env/secrets)
  2. Investigate network path from agent host to LiveKit signal endpoint — DNS resolution, firewall rules
  3. Guard the stream_bytes call with a connection check before sending
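
Item 3 could look roughly like the sketch below. It is deliberately SDK-agnostic: `send_if_connected`, `is_connected`, and `send` are hypothetical names, and you would pass in your own room connection check and a closure around the stream_bytes send. Catching the built-in `ConnectionError` is an assumption based on the error name in the logs; adjust it to whatever exception your SDK version raises.

```python
import asyncio
from typing import Awaitable, Callable


async def send_if_connected(
    is_connected: Callable[[], bool],
    send: Callable[[], Awaitable[None]],
) -> bool:
    """Run send() only while is_connected() is true; return whether it completed."""
    if not is_connected():
        return False
    try:
        await send()
    except ConnectionError:
        # Assumption: the SDK raises the built-in ConnectionError here.
        # The room can still close between the check and the send, so the
        # check narrows the race window but does not remove it.
        return False
    return True


if __name__ == "__main__":
    async def fake_send() -> None:
        print("sent")

    # Simulates a room that has already disconnected: fake_send is never called.
    print(asyncio.run(send_if_connected(lambda: False, fake_send)))
```

The check alone is not sufficient because disconnect can happen between the check and the send, which is why the sketch also handles the exception.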

Is this something you are setting up for the first time, or has it been working and you are only now having issues?

Hi @Yethu_Krishnan, we’re currently tracking an issue with connection latency, details here:

Still working to understand if your observability 401 report is part of the same issue.

We are having identical issues that started roughly 45 minutes ago. It appears to be acknowledged by LiveKit on their status page.

Some more info, here are some of the errors we’re seeing when a call comes in:

10:55:49.190 ERROR  livekit            livekit_api::signal_client:161:livekit_api::signal_client -  
                                       unexpected signal error: ws failure: HTTP error: 500         
                                       Internal Server Error                     

10:56:15.581 ERROR opentele…_exporter Failed to export span batch code: 401, reason: invalid API
key+
)type.googleapis.com/google.rpc.BadRequest

10:56:25.646 ERROR opentele…_exporter Failed to export logs batch code: 401, reason: invalid API
key+
)type.googleapis.com/google.rpc.BadRequest
10:56:25.933 ERROR livekit.agents failed to upload the session report to LiveKit Cloud

And here’s the startup sequence which explains why the 10 second timeout is getting flagged:

10:55:29.910 INFO livekit.agents initializing job runner {“tid”: 77360}
10:55:29.986 DEBUG asyncio Using proactor: IocpProactor
10:55:29.987 INFO livekit.agents job runner initialized {“tid”: 77360, “elapsed_time”: 0.08}
10:55:39.990 WARNI… livekit.agents The room connection was not established within 10 seconds
after calling job_entry. This might mean that
job_ctx.connect() was never invoked, or that no
AgentSession with an active RoomIO has been started.
10:55:49.190 ERROR livekit livekit_api::signal_client:161:livekit_api::signal_client -
unexpected signal error: ws failure: HTTP error: 500
Internal Server Error

A bit more info guys: I just cleared my LiveKit log (local runner) and sent a test call in. The terminal was completely blank for at least 5 rings before it finally acknowledged a new job request, whereas normally this happens almost instantly after we route the call from Twilio to LiveKit’s SIP trunk. So maybe something in the agent dispatch system is getting hung up on an API call and failing to dispatch the calls?

Thanks for the update. We are looking into it. Concrete info like session IDs or regions is helpful.

In case useful, here are my recent logs when unsuccessfully starting a session using LiveKit agents (session ID RM_fF8XifWMsNWp, room name room_d2e5e6d3-7778-4c4a-8a9c-68c3e7880f36):

```
{"message":"received job request","level":"INFO","name":"livekit.agents","job_id":"AJ_vThKWNYmPFBK","dispatch_id":"AD_cCM67Z5oNsAd","room":"room_d2e5e6d3-7778-4c4a-8a9c-68c3e7880f36","room_id":"RM_fF8XifWMsNWp","agent_name":"","resuming":false,"enable_recording":false,"timestamp":"2026-05-28T15:19:44.592476+00:00"}
{"message":"initializing process","level":"INFO","name":"livekit.agents","pid":526,"timestamp":"2026-05-28T15:19:44.642874+00:00"}
{"message":"process initialized","level":"INFO","name":"livekit.agents","pid":526,"elapsed_time":0.6,"timestamp":"2026-05-28T15:19:45.238288+00:00"}
{"message":"livekit_api::signal_client:287:livekit_api::signal_client - signal connection failed on v0 path: Timeout(\"signal connection timed out\")","level":"WARNING","name":"livekit","pid":368,"job_id":"AJ_vThKWNYmPFBK","room_id":"RM_fF8XifWMsNWp","timestamp":"2026-05-28T15:19:49.609790+00:00"}
{"message":"livekit::rtc_engine:434:livekit::rtc_engine - failed to connect: Signal(RegionError(\"region fetch timed out\")), retrying… (1/3)","level":"WARNING","name":"livekit","pid":368,"job_id":"AJ_vThKWNYmPFBK","room_id":"RM_fF8XifWMsNWp","timestamp":"2026-05-28T15:19:52.621462+00:00"}
{"message":"The room connection was not established within 10 seconds after calling job_entry. This might mean that job_ctx.connect() was never invoked, or that no AgentSession with an active RoomIO has been started.","level":"WARNING","name":"livekit.agents","pid":368,"job_id":"AJ_vThKWNYmPFBK","room_id":"RM_fF8XifWMsNWp","timestamp":"2026-05-28T15:19:54.598597+00:00"}
{"message":"livekit_api::signal_client:287:livekit_api::signal_client - signal connection failed on v0 path: Timeout(\"signal connection timed out\")","level":"WARNING","name":"livekit","pid":368,"job_id":"AJ_vThKWNYmPFBK","room_id":"RM_fF8XifWMsNWp","timestamp":"2026-05-28T15:19:57.630220+00:00"}
{"message":"livekit_api::signal_client:287:livekit_api::signal_client - signal connection failed on v0 path: Timeout(\"signal connection timed out\")","level":"WARNING","name":"livekit","pid":368,"job_id":"AJ_vThKWNYmPFBK","room_id":"RM_fF8XifWMsNWp","timestamp":"2026-05-28T15:20:02.666510+00:00"}
{"message":"process exiting","level":"INFO","name":"livekit.agents","reason":"room disconnected","pid":368,"job_id":"AJ_vThKWNYmPFBK","room_id":"RM_fF8XifWMsNWp","timestamp":"2026-05-28T15:20:17.419832+00:00"}
```

Here are some session IDs:

RM_xPrQLQEAoeUx

RM_Zh6NHpY7nsia

Region: us-east

The issue has been mitigated. Subscribe to LiveKit Status - Elevated Reports of Participant Connection Latency and Errors In US East Region for further updates.

Verify that your firewall allows outbound connections.
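
One rough way to test the outbound path from the agent host is a plain TCP connect, sketched below. `can_reach` is a hypothetical helper, and the host you pass in would be the hostname from your own LiveKit URL. Port 443 is an assumption based on the signal connection being a secure WebSocket. This only shows that DNS resolves and a TCP connection opens; it does not exercise the WebSocket handshake or media traffic.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if DNS resolves and an outbound TCP connection opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), refused connections, and timeouts.
        return False
```

If this returns False from the agent host but True from another machine, the problem is more likely local firewall or DNS configuration than a service-side incident.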

Check the agent code for ctx.connect(): ensure the entrypoint function includes the connection call.

A workaround: increase the connection timeout. Also, check your LiveKit Server version. The 401 most likely stems from an API key that is not being read or is misread, or from an API secret mismatch. These are all just guesses based on what I have had to face. Big LiveKit Server fan, and I hope you get things worked out.