AssignmentTimeoutError when accepting job requests

We’re seeing a flood of failed job request accepts in our production agents today, Wednesday May 13, since around 7:25am PST. The stack trace looks like this:

Traceback (most recent call last):
  File "/app/src/socrait/libs/livekit/v1/livekit_worker.py", line 204, in _try_accept_job
    await req.accept(metadata=metadata)
  File "/app/.venv/lib/python3.12/site-packages/livekit/agents/job.py", line 561, in accept
    await self._on_accept(accept_arguments)
  File "/app/.venv/lib/python3.12/site-packages/livekit/agents/worker.py", line 903, in _on_accept
    raise AssignmentTimeoutError() from None
livekit.agents._exceptions.AssignmentTimeoutError

We have not deployed to production for over a week, nor have we changed any configuration on our end.

The LiveKit status page shows green, our portal doesn’t show any quota overages, etc.

Is anyone else experiencing this? Can anyone from LiveKit confirm whether there is any ongoing outage?

The signature is server-side, not worker-side. accept() sends the assignment ack and waits for the server’s confirmation; AssignmentTimeoutError fires when that confirmation doesn’t arrive within the worker’s timeout window. Nothing in your worker code triggers it; the worker is doing its part.
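To make the ack-then-wait pattern concrete, here is a minimal sketch using plain asyncio. It is not the livekit-agents implementation: `AssignmentTimeout`, `accept_with_timeout`, and `send_ack` are hypothetical names used for illustration only. It also shows that the waiting side sees the same timeout whether the ack was lost on the way out or the confirmation was lost on the way back.

```python
import asyncio


class AssignmentTimeout(Exception):
    """Hypothetical stand-in for livekit.agents' AssignmentTimeoutError."""


async def accept_with_timeout(send_ack, confirmation: asyncio.Future, timeout: float) -> str:
    """Send the accept message, then wait for the server's confirmation."""
    await send_ack()
    try:
        return await asyncio.wait_for(confirmation, timeout)
    except asyncio.TimeoutError:
        # No confirmation arrived inside the window.
        raise AssignmentTimeout() from None


async def demo() -> tuple[str, str]:
    loop = asyncio.get_running_loop()

    async def send_ack() -> None:
        pass  # pretend the accept message was written to the websocket

    # Case 1: the server confirms shortly after the ack.
    confirmed = loop.create_future()
    loop.call_later(0.01, confirmed.set_result, "assigned")
    ok = await accept_with_timeout(send_ack, confirmed, timeout=0.5)

    # Case 2: the confirmation never arrives.
    never = loop.create_future()
    try:
        await accept_with_timeout(send_ack, never, timeout=0.05)
        lost = "assigned"
    except AssignmentTimeout:
        lost = "timeout"
    return ok, lost


if __name__ == "__main__":
    print(asyncio.run(demo()))
```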

The status page can lag actual incidents by minutes to hours, so green doesn’t rule out a current issue. Three concrete moves while waiting on @CWilson / @darryncampbell:

  • Spin up one worker in a different region. If accept-timeouts only happen on your current region, it’s localized; if everywhere, it’s project-wide or global.
  • Pin the agents version. Lock it to a known-good release (livekit-agents==X.Y.Z) and redeploy a single replica to rule out a runtime regression that crept in via an unpinned pip install at image build.
  • Open a Cloud support ticket. Production-blocking + status page green is exactly the case where direct support beats forum cadence.

If you can share the approximate failure rate (1% / 50% / 100%) and which region your project is pinned to, that sharpens the team’s check.
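If the failure rate isn't readily available from your logs, a rolling counter around the accept call is one way to get it. This is a rough sketch using only the standard library; `AcceptStats` and its methods are hypothetical names, not part of livekit-agents.

```python
from collections import deque


class AcceptStats:
    """Rolling outcome record for the last `window` accept attempts."""

    def __init__(self, window: int = 200) -> None:
        # True = accept confirmed, False = accept timed out.
        self._outcomes: deque[bool] = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        self._outcomes.append(ok)

    def failure_rate(self) -> float:
        """Fraction of recorded attempts that failed (0.0 if none recorded)."""
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)


if __name__ == "__main__":
    stats = AcceptStats(window=4)
    for ok in (True, False, False, True, False):
        stats.record(ok)
    # Only the last four outcomes are kept: False, False, True, False.
    print(f"accept failure rate: {stats.failure_rate():.0%}")
```

You would call `record(True)` after a successful `req.accept(...)` and `record(False)` in the handler that catches the timeout, then log `failure_rate()` periodically.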

Do you have a session ID or job ID that failed? I can check the logs for anything on this side.

More of your agent log would be helpful too.

Thank you both for your replies.

After working with support to troubleshoot, it appears to have been a partial network interruption between our self-hosted LiveKit agent, running in a Cloud Run container in GCP, and the LiveKit server. Redeploying the container fixed the issue.

During the outage, incoming job requests reached our worker process fine, which indicates that the websocket connection was at least partially active. However, outgoing job accepts back to LiveKit would not go through and timed out.

The container had been running for several weeks, which is longer than our normal deployment cadence. We theorize that its networking state got corrupted somehow, or that GCP opaquely changed something within its networking infrastructure that affected running containers with active connections.

I speculated to LiveKit support that a two-way acknowledgement heartbeat over the websocket connection between the worker and the server might have allowed the connection to self-heal, i.e. be explicitly reestablished, without the manager process having to be restarted. But for now we are treating it as an infrastructure-related edge case.
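In the meantime, a worker-side fallback could be a small watchdog that counts consecutive accept timeouts and triggers a recycle once a threshold is crossed. This is only a sketch of that idea, not the heartbeat described above and not anything livekit-agents provides; `AcceptWatchdog`, its threshold, and the `on_unhealthy` callback are all hypothetical.

```python
from typing import Callable


class AcceptWatchdog:
    """Calls `on_unhealthy` once after `threshold` consecutive accept timeouts."""

    def __init__(self, threshold: int, on_unhealthy: Callable[[], None]) -> None:
        self._threshold = threshold
        self._on_unhealthy = on_unhealthy
        self._consecutive = 0
        self._tripped = False

    def record_success(self) -> None:
        # Any confirmed accept proves the outbound path works again.
        self._consecutive = 0
        self._tripped = False

    def record_timeout(self) -> None:
        self._consecutive += 1
        if self._consecutive >= self._threshold and not self._tripped:
            self._tripped = True  # fire only once per unhealthy streak
            self._on_unhealthy()


if __name__ == "__main__":
    events: list[str] = []
    watchdog = AcceptWatchdog(threshold=3, on_unhealthy=lambda: events.append("recycle"))
    for _ in range(2):
        watchdog.record_timeout()
    watchdog.record_success()  # streak broken, counter resets
    for _ in range(4):
        watchdog.record_timeout()
    print(events)
```

What `on_unhealthy` does is deployment-specific: it might fail a health check or exit the process so the platform replaces the instance, assuming the platform is configured to do so.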

Here are a few job IDs in case anything interesting pops up in the server logs on your side:

  • AJ_mpQeuXu6dxDP
  • AJ_Nn4DsvYqkPAY
  • AJ_kCoCeABDKzg8
  • AJ_MCY6Qi4LCXh4

If you are working with support, I will leave it to them; there is no need for both of us to dig in.