Inbound SIP call: agent job dispatched ~8s after the INVITE reached LiveKit Cloud — caller had already hung up

Setup: voice agents on livekit-agents 1.4.6 (Python 3.11), inbound SIP via Twilio Elastic SIP Trunking → LiveKit Cloud SIP (project p_3tqm7ro6kbs),
agent dispatch via inbound trunk + dispatch rule. Each call = one agent worker process.

What we saw (2026-06-08, UTC):

  ┌────────────┬────────────────────────────────────────────────────────────────────────────────────────────┐
  │    Time    │                                           Event                                            │
  ├────────────┼────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 14:29:23   │ Twilio sends the INVITE toward LiveKit (Twilio CallSid CA4083e5e125b6065fd9db3941e2b022a4) │
  ├────────────┼────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 14:29:29   │ Caller abandons after ~6s of ringing — Twilio final status no-answer, 0s                   │
  ├────────────┼────────────────────────────────────────────────────────────────────────────────────────────┤
  │ 14:29:31.5 │ Agent job AJ_NWYstzRMnfLS dispatched — our entrypoint fires (session RM_jusR5FARB6hz)      │
  └────────────┴────────────────────────────────────────────────────────────────────────────────────────────┘

So the job was dispatched ~8s after the INVITE — and ~2.5s after the caller had already hung up.
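For anyone re-checking the arithmetic, here is a minimal sketch that recomputes the gaps from the three timestamps in the table. It assumes the table times are exact; they are only given to second (or tenth-of-a-second) precision, so the results are approximate. The helper name `at` is made up for this example.

```python
from datetime import datetime, timezone


def at(hms: str) -> datetime:
    """Parse an HH:MM:SS[.f] time on 2026-06-08 UTC (the date of the incident)."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in hms else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(f"2026-06-08 {hms}", fmt).replace(tzinfo=timezone.utc)


invite = at("14:29:23")      # Twilio sends the INVITE
abandon = at("14:29:29")     # caller hangs up
dispatch = at("14:29:31.5")  # agent entrypoint fires

ring_time = (abandon - invite).total_seconds()
dispatch_delay = (dispatch - invite).total_seconds()
late_by = (dispatch - abandon).total_seconds()

print(f"caller rang for {ring_time:.1f}s")
print(f"job dispatched {dispatch_delay:.1f}s after the INVITE")
print(f"job dispatched {late_by:.1f}s after the caller hung up")
```

With these inputs the dispatch delay comes out at 8.5s, which is the "~8s" quoted above.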

Why we think the delay is dispatch-side: this was 1 of 208 inbound calls on this trunk that day; the other 207 dispatched within 0.5–2.5s of ring start. The entrypoint timestamp is the first line of the job (no work before it), and the same warm workers dispatched neighboring calls within ~1s.

Session events: https://cloud.livekit.io/projects/p_3tqm7ro6kbs/sessions/RM_jusR5FARB6hz/events

Can you please help check what happened here?

What do you see in your agent logs?

{"message": "received job request", "level": "INFO", "name": "livekit.agents", "job_id": "AJ_NWYstzRMnfLS", "dispatch_id": "AD_b4jVwW8jpGMJ", "room": "twilio_fsa_use2_000_inbound_prod_deployment_+13134028862_BLheSpcxKWTQ", "room_id": "RM_jusR5FARB6hz", "agent_name": "prod_inbound", "resuming": false, "enable_recording": false, "timestamp": "2026-06-08T14:29:31.458624+00:00"}
{"levelname": "INFO", "name": "agent_orchestrator", "process": 164371, "event": "Connected to LiveKit", "message": "", "pathname": "/home/app/agent-orchestrator/agent_orchestrator/service/call_service.py", "lineno": 2088, "call_id": "call_AJ_NWYstzRMnfLS", "time_taken": 301, "room_name": "twilio_fsa_use2_000_inbound_prod_deployment_+13134028862_BLheSpcxKWTQ", "tenant_id": "fsa_use2_000", "timestamp": "2026-06-08T14:29:31.778001+00:00"}
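A small sketch of how the two log lines above can be correlated, to show that the agent itself connected quickly once the job request arrived. Only the fields needed for the comparison are copied from the logs; treating `time_taken` as milliseconds is an assumption on our side, since the log line does not state its unit.

```python
import json
from datetime import datetime

# Trimmed copies of the two log records above (only the fields used here).
job_request = json.loads(
    '{"message": "received job request", "job_id": "AJ_NWYstzRMnfLS", '
    '"timestamp": "2026-06-08T14:29:31.458624+00:00"}'
)
connected = json.loads(
    '{"event": "Connected to LiveKit", "call_id": "call_AJ_NWYstzRMnfLS", '
    '"time_taken": 301, "timestamp": "2026-06-08T14:29:31.778001+00:00"}'
)


def ts(record: dict) -> datetime:
    """Parse the ISO 8601 timestamp of a log record."""
    return datetime.fromisoformat(record["timestamp"])


# Wall-clock gap between receiving the job and being connected to the room.
gap_ms = (ts(connected) - ts(job_request)).total_seconds() * 1000

print(f"job request -> connected: {gap_ms:.0f} ms")
print(f"reported time_taken:      {connected['time_taken']} (assumed ms)")
```

The gap is roughly 319 ms, so on this call the time between the INVITE and the job request is where the ~8s went, not the agent's own connect.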

@CWilson - we received the job request at 14:29:31 UTC. The SIP call was received at LiveKit side at 14:29:23 UTC per the session events: https://cloud.livekit.io/projects/p_3tqm7ro6kbs/sessions/RM_jusR5FARB6hz/events

Happy to share all the logs of our agent over Slack. Lmk

Looks like there was no agent capacity, and the server was waiting for agents to report WS_AVAILABLE.

A worker became available at 14:29:31.46 and the parked job was placed immediately.

Thanks for the response.

We have autoscaling, and we spin up resources at 50% CPU / memory capacity so that we can handle bursts of calls quickly.

I believe the default capacity for the worker is 80%. Do you see any logs explaining why no worker was reported as available? We are on version 1.4.6 of the livekit/agents SDK.

What do you have num_idle_processes set to? How was the work distributed throughout your cluster during that time? How long had the “new” instances been up before the dispatch happened? “Cold” instances can take time to warm up.

What was the load before you hit the high load moment?

  1. num_idle_processes - set to default, 4 cores CPU - so 4 processes
  2. Work distribution - 3 worker pods at the time, on AWS EKS (us-east-2).
  3. Load at the timeframe - steady ~300 job dispatches per 5 minutes (~1 new call/sec) across the 3 pods from 14:15–14:40 UTC — no spike at 14:29
  4. Autoscaling - we scale using CPU and memory HPAs on KEDA when either reaches 50%, and a new pod comes up within 1 min, but in this case there were already 3 pods running and ready to serve requests.
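As a rough sanity check on the numbers in the list above, this sketch works out the per-pod arrival rate and compares it with the idle process pool. The warm-up time is a made-up placeholder (`ASSUMED_WARMUP_SECONDS`), not a measured value, and the estimate uses Little's law, so it is an average and says nothing about short bursts.

```python
jobs_per_window = 300        # ~300 job dispatches per 5 minutes
window_seconds = 5 * 60
pods = 3
idle_processes_per_pod = 4   # num_idle_processes at its default on 4 cores

cluster_rate = jobs_per_window / window_seconds  # new calls per second, cluster-wide
per_pod_rate = cluster_rate / pods               # new calls per second, per pod

# Hypothetical: how long a replacement process takes to become warm.
ASSUMED_WARMUP_SECONDS = 5.0


def expected_processes_warming(rate: float, warmup_seconds: float) -> float:
    """Little's law: mean number of replacements in flight = rate * warm-up time."""
    return rate * warmup_seconds


warming = expected_processes_warming(per_pod_rate, ASSUMED_WARMUP_SECONDS)
headroom = idle_processes_per_pod - warming

print(f"cluster rate: {cluster_rate:.2f} calls/s, per pod: {per_pod_rate:.2f} calls/s")
print(f"mean processes warming per pod: {warming:.2f}, idle-pool headroom: {headroom:.2f}")
```

Under that assumed warm-up time the steady-state average leaves headroom in the idle pool, which fits the observation that 207 of 208 calls dispatched quickly. Replace the placeholder with a measured warm-up time before drawing conclusions.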

Are these the same symptoms as this other issue? If not, how do they differ?

No, the “Signal connection times out” problem is a separate issue, which is why I raised it as a separate post.

I dug into the logs (RM_cHCojGmsaoQw and the worker logs). Quick summary of what I’m seeing:

This isn’t a dispatch problem: jobs are getting assigned in ~0.2s. The delay is entirely in the agent joining the room after assignment. In the worst case the agent took ~13s to join, and cloud-sip holds the caller in “180 Ringing” until the agent is in, so the caller just hears extended ringing (and some hang up).

The cause is the worker failing to open new outbound connections to LiveKit Cloud during short bursts. The timing matches the SDK’s own timeouts exactly: the signal WebSocket connect hangs and hits the 5s timeout (“v0 path Timeout”), then the region fetch (an HTTPS GET to your project host) hangs and hits its 3s timeout (“region fetch timed out”), then a retry to the same host succeeds a few seconds later. Two separate connections to the same *.livekit.cloud host stalling back-to-back, in bursts, across 5 different pods.
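To make the timing argument concrete, here is a minimal sketch that adds up the two timeouts named above and compares them with the ~13s worst-case join. The timeout values and the worst-case figure are taken from this reply; the retry is only described as taking "a few seconds", so the remainder below is an inferred budget, not a measurement.

```python
# Stages that stalled back-to-back, with the timeouts quoted in this reply.
stalled_stages = [
    ("signal WebSocket connect (v0 path)", 5.0),
    ("region fetch (HTTPS GET to project host)", 3.0),
]

WORST_CASE_JOIN_SECONDS = 13.0  # approximate worst case quoted above

time_lost_to_timeouts = sum(seconds for _, seconds in stalled_stages)
remaining_for_retry = WORST_CASE_JOIN_SECONDS - time_lost_to_timeouts

for name, seconds in stalled_stages:
    print(f"{name}: waited full {seconds:.0f}s timeout")
print(f"lost to timeouts: {time_lost_to_timeouts:.0f}s")
print(f"left for the retry and the rest of the join: ~{remaining_for_retry:.0f}s")
```

So about 8s of a slow join is pure waiting on timeouts before the retry even starts.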

That points to egress/connection-establishment contention on your side rather than Cloud or the carrier. It’s also rare: 13 slow joins out of ~8,000 calls over 3 hours, clustered in a couple of bursts.

Worth checking for the 15:07–15:16 window:

  • AWS NAT Gateway ErrorPortAllocation / ActiveConnectionCount (SNAT port exhaustion is my top suspect, lots of new sockets per call toward a few destinations)
  • node conntrack usage (nf_conntrack_count vs max)
  • CoreDNS latency
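For the conntrack check, a minimal sketch that reads the two kernel counters and reports utilization. It assumes it runs somewhere the node's `/proc/sys/net/netfilter` is visible (on the node itself or a privileged/host-network pod); the function name and the 80% warning threshold are made up for this example.

```python
from pathlib import Path

# Standard location of the conntrack counters on Linux nodes.
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(base: Path = CONNTRACK_DIR) -> tuple[int, int, float]:
    """Return (current entries, table limit, utilization as a fraction)."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count, maximum, count / maximum


def is_near_limit(base: Path = CONNTRACK_DIR, threshold: float = 0.8) -> bool:
    """True when utilization is at or above the (hypothetical) warning threshold."""
    _, _, utilization = conntrack_usage(base)
    return utilization >= threshold
```

Calling `conntrack_usage()` on each node every few seconds during a burst window should show whether the table gets close to `nf_conntrack_max`. A single reading outside the burst is unlikely to tell you much.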

The CPU/autoscaler fix you’re already making should also help here, since a pegged pod can’t service the connect within the timeout. And since you’re on agents 1.4.6 / an older rtc core, an upgrade is worth testing: newer versions improve the region/connect path.

I hope this helps.

@CWilson - I am a bit confused now - are we saying that this is because of network connectivity issues and NOT due to worker capacity issues?

Is this related to: Signal connection times out on the "v0 path" at agent join, forcing a fallback that adds 0.5–5s of call-setup latency - #5 by Zaheer_Abbas?

I am still investigating the worker capacity issue on my end

This issue is a network issue that is aggravated by the capacity issue.

Ok - let me check more on this. Will get back to you on that