JobRequest.reject(terminate=False) on Cloud Agents — does it reroute or fail?

Looking for the right pattern for per-replica concurrency control on LiveKit Cloud, and running into what looks like a contradiction in the docs.

What we’re trying to do

Voice agent on Cloud (Ship plan). We want to cap each replica at a known number of concurrent calls so a burst can’t degrade quality on already-active sessions. Since load_fnc and load_threshold are not honored on Cloud, we tried doing this at the application layer with a request_fnc:

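from livekit.agents import JobRequest

MAX_PER_REPLICA = 4  # per-replica cap on concurrent calls
active_jobs = 0      # maintained elsewhere as sessions start and end

async def request_fnc(req: JobRequest):
    if active_jobs >= MAX_PER_REPLICA:
        await req.reject(terminate=False)
        return
    await req.accept()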
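
For anyone trying to reproduce this, here is a rough sketch of how the counter behind that request_fnc might be kept. ReplicaCap and its methods are hypothetical names, not part of the LiveKit SDK, and the request object is only assumed to expose accept() and reject(terminate=...) as in the snippet above. How release() gets called when a session ends is left open, since that depends on where the job runs relative to the request handler.

```python
import asyncio

MAX_PER_REPLICA = 4  # hypothetical cap, matching the per-replica limit discussed in this thread


class ReplicaCap:
    """Counts jobs accepted by this process only; nothing is shared across replicas."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.active = 0

    def try_acquire(self) -> bool:
        # Reserve a slot if one is free; asyncio runs this without preemption.
        if self.active >= self.limit:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        # Free a slot; never drop below zero if called twice by mistake.
        self.active = max(0, self.active - 1)


cap = ReplicaCap(MAX_PER_REPLICA)


async def request_fnc(req) -> None:
    # req is assumed to provide accept() and reject(terminate=...).
    if not cap.try_acquire():
        await req.reject(terminate=False)
        return
    try:
        await req.accept()
    except Exception:
        # Accepting failed, so give the slot back before re-raising.
        cap.release()
        raise
```

Reserving the slot before awaiting accept() keeps two requests that arrive back to back from both seeing a free slot.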

The doc contradiction

  • Server options page: “If the request is rejected, it’s sent to the next available agent server.”
  • Python API reference for JobRequest.reject(): “Reject the job request. The job will not be assigned to another worker.”

These seem to say opposite things. Which one applies on Cloud?

What we observed in a burst test

Sent ~13 simultaneous calls into a deployment running with min_replicas=1:

  1. The worker accepted exactly N jobs (our cap), as expected.
  2. The worker correctly flipped to unavailable via the default load reporter (load: 0.82, threshold: 0.7).
  3. Despite the unavailable status, the scheduler kept offering the overflow jobs back to the same full worker for ~60 seconds. Same job IDs reappeared multiple times.
  4. No second replica was provisioned. Replica count stayed at 1 / 1 / 8 throughout.
  5. Overflow callers got no usable route.

What I’m trying to figure out

  1. On Cloud, does reject(terminate=False) actually trigger a reroute to another replica, or does it just fail the job? The two doc pages disagree.
  2. If a worker reports itself unavailable via the default load mechanism, what’s supposed to happen to incoming jobs? In our test, the scheduler kept dispatching to the unavailable worker rather than scaling up.
  3. Has anyone successfully implemented a per-replica concurrency cap on Cloud Agents that also lets overflow calls reach a new replica? If so, what was the pattern?
  4. Is there a way to raise min_replicas above 1 on Ship, or is that fixed by plan?

Happy to share log snippets if it helps. Mostly trying to confirm whether what we saw is expected behavior or a config issue on our end before we go to production volumes.

Thanks!

On Cloud, reject() follows the server-level behavior: a rejected job is reassigned to the next available agent server, not failed outright, as described in the Request handler section of Server options. The Python API reference wording reflects worker-level semantics, but in a multi-replica deployment the scheduler will attempt reassignment.

However, on LiveKit Cloud you cannot customize load_fnc or load_threshold, and availability is determined by the platform’s managed load reporting. If only one replica is running and it is marked unavailable, there may be no other server to route to, so the same job can be retried against that replica.

If you want to reliably fail a job request, you can do that in the entrypoint function, before calling connect.
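
A minimal sketch of that idea, under stated assumptions: JobContextLike is a hypothetical stand-in for the job context, assumed to expose an awaitable connect() and a shutdown(reason=...) method, and has_capacity is a placeholder for whatever capacity check you use. Check the names against the SDK version you are running before relying on them.

```python
import asyncio
from typing import Callable, Protocol


class JobContextLike(Protocol):
    """Hypothetical stand-in for the job context passed to the entrypoint."""

    async def connect(self) -> None: ...

    def shutdown(self, reason: str = "") -> None: ...


def has_capacity() -> bool:
    """Placeholder capacity check; replace with your own logic."""
    return True


async def entrypoint(
    ctx: JobContextLike,
    capacity_check: Callable[[], bool] = has_capacity,
) -> bool:
    # Bail out before joining the room, so the job ends without the agent
    # ever connecting to the call.
    if not capacity_check():
        ctx.shutdown(reason="replica at capacity")
        return False
    await ctx.connect()
    return True
```

The boolean return value is only there to make the two paths easy to tell apart in a test; a real entrypoint would carry on with session setup after connect().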

Thanks, that clarifies the reject behavior.

The remaining question for us is the autoscaling side. In our test the single replica reported unavailable (load 0.82, threshold 0.7) for ~60 seconds while overflow jobs queued, and no second replica was ever provisioned. Replicas stayed at 1/1/8 throughout. Was that expected behavior, or did the scale-up signal not fire?

For sizing context: we currently cap at 4 concurrent calls per replica based on observed load (the worker reported load 0.82 at 4 active jobs). To safely serve our Ship plan’s 20 concurrent session ceiling at that per-replica capacity, we’d need ~5 warm replicas (ceil(20/4)).
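
The sizing arithmetic, written out so the inputs are easy to change. replicas_needed is just a helper name for this post, and the numbers are the ones quoted above (20 concurrent sessions, 4 calls per replica).

```python
import math


def replicas_needed(max_sessions: int, per_replica_cap: int) -> int:
    """Smallest replica count whose combined cap covers max_sessions."""
    if per_replica_cap <= 0:
        raise ValueError("per_replica_cap must be positive")
    return math.ceil(max_sessions / per_replica_cap)


print(replicas_needed(20, 4))  # 5
```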

A few questions:

  1. Is there a Cloud dashboard or CLI setting to configure a minimum warm replica count for an agent? If not exposed to us directly, can LiveKit configure it on our behalf?

  2. Can min replicas be raised above 1 on Ship, or is that fixed by plan tier?

  3. Do app-level rejected jobs (via request_fnc + reject(terminate=False)) contribute to autoscale decisions, or does the autoscaler only consider managed worker load signals like CPU?

  4. What is the expected autoscale latency from “worker reports unavailable” to “replica #2 ready to accept jobs”? We’re trying to understand whether warm-replica preallocation is the only viable path for sub-second SIP bursts.

  5. For inbound SIP calls when no replica has capacity, is there a way to return SIP 486 Busy or 503 to the caller’s PBX, rather than failing inside the agent process? Ideally the caller’s phone system would know to retry rather than the customer hearing dead air.

Appreciate the help — trying to get to a clear picture of what’s possible on Cloud before going to production volumes.