ai-coustics blocking the audio pipeline (?)

Hi all,

I’m seeing an intermittent but severe failure in a Python LiveKit Agents deployment and wanted to check whether anyone else has run into this.

Setup:

  • LiveKit Agents Python SDK 1.5.2
  • Python worker
  • ai-coustics noise cancellation enabled with the Quail model

What happens:

  • A session is running normally
  • The last normal log is typically something like the user beginning to speak
  • After that, there is a long gap with no STT events, no transcription events, and no downstream activity
  • Eventually the worker is killed by the supervisor with logs like:
    • process is unresponsive, killing process
    • exit code -10

What makes this concerning is that the failure appears to happen before STT is activated, so it looks like the audio/input pipeline is getting blocked upstream rather than a normal STT/LLM/TTS exception. There is no useful traceback when it happens.

Current hypothesis:
I suspect the issue may be related to ai-coustics Quail in the input pipeline, possibly blocking or stalling audio processing under some conditions. I’m removing ai-coustics for now to see whether the issue disappears.

Questions:

  1. Has anyone seen ai-coustics / Quail cause worker hangs or audio pipeline stalls in Python agents?
  2. Are there recommended timeout / watchdog / fallback patterns for the audio input path?
  3. If this is not likely ai-coustics, are there other parts of the pre-STT pipeline I should inspect first?
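
To make question 2 concrete, one generic pattern is an event-loop lag monitor: a small task that measures how late its own `asyncio.sleep` wakes up. The sketch below uses only the Python standard library. It is not a LiveKit Agents or ai-coustics API, and the names `monitor_loop_lag` and `demo` are hypothetical. It assumes the stall happens on the event loop thread, which has not been confirmed here.

```python
import asyncio
import time


async def monitor_loop_lag(interval: float, threshold: float, lags: list) -> None:
    """Record how late each sleep wakes up (hypothetical helper).

    A large lag means something blocked the event loop thread for roughly
    that long. Note the monitor can only report after the loop resumes.
    """
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag > threshold:
            lags.append(lag)


async def demo() -> list:
    lags: list = []
    task = asyncio.create_task(monitor_loop_lag(0.05, 0.1, lags))
    await asyncio.sleep(0.1)  # let the monitor start running
    time.sleep(0.5)           # simulate a blocking call on the loop thread
    await asyncio.sleep(0.1)  # give the monitor a chance to observe the stall
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return lags


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because this monitor runs on the same loop, it only reports a stall once the loop recovers. For a hang that never recovers, something outside the loop (a separate thread or the supervisor) still has to do the detection.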

Any guidance would be really appreciated. This is intermittent but production-impacting because it causes the whole worker handling the session to die.

Hi Alexander,

Thanks for the details - we’re actively investigating and have a local repro setup running. A few things would help us narrow it down:

  1. How long does a session usually run before it hangs? Any pattern around long pauses or inactivity?
  2. Where is the pipeline running (cloud VM, container, or local), and what are the rough CPU and RAM specs?
  3. How many concurrent sessions per worker when it happens?
  4. Any additional logs from just before the “process unresponsive” kill would be very useful.

Will keep you posted. Thanks for your patience!

Hi Pawel,

Thanks for looking into this. Quick update first: since removing ai-coustics from the
input pipeline on Apr 17, we have not seen a single occurrence of the
“process is unresponsive, killing process” signature. For reference, before that
we had several hundred occurrences of that exact issue going back to Feb 26, with the most recent
one on Apr 15 14:05 UTC. So the correlation is strong on our side, though it is not
definitive proof of causation.

Answers to your questions based on what we observed prior to removal:

  1. Session length / pattern before the hang

    • No clean pattern on duration. It was intermittent across short and longer sessions.
    • The most consistent pattern we could find was that the last normal log line
      was often right around the user beginning to speak, followed by a long gap
      (tens of seconds) with no STT / transcription / downstream activity,
      and then the supervisor kill. In other words, it looked like the
      freeze occurred upstream of STT rather than during STT/LLM/TTS.
  2. Where the pipeline runs

    • LiveKit Cloud Agents (managed worker pool), not self-hosted.
    • Python 3.13 worker
    • We don’t have CPU or RAM usage figures for the workers, sorry.
  3. Concurrency per worker

    • Default LiveKit Agents concurrency settings (we have not tuned
      num_idle_processes / job_executor_type away from defaults). Typical observed
      load at time of incidents was relatively low, so this does not look like a high-concurrency overload pattern.
  4. Logs just before the kill

    • This is the most frustrating part: there were essentially no useful
      app-level logs in the window leading up to the kill. The last events were
      normal pipeline events (user state transitions, e.g. user starting to speak),
      then silence, then the supervisor’s unresponsive / kill messages.
    • No Python traceback, no exception, no STT error, no network error in that
      window. That is part of what led us to suspect an upstream audio stage
      blocking the event loop rather than an exception in STT/LLM/TTS.
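
Since the hang leaves no traceback, one standard-library option for seeing where a process is stuck is `faulthandler.dump_traceback_later`, re-armed from a heartbeat task. This is a minimal sketch, not something verified against LiveKit Agents or ai-coustics. The helper name `arm_hang_dump` and the timings are hypothetical, and it assumes the event loop thread is what stalls.

```python
import asyncio
import faulthandler
import tempfile
import time


async def arm_hang_dump(timeout: float, interval: float, file) -> None:
    """Dump all thread tracebacks if the loop stalls for `timeout` seconds.

    Hypothetical helper: each call to dump_traceback_later replaces the
    previous timer, so a dump is written only when the loop fails to get
    back here in time. `file` must stay open while a timer is armed.
    """
    try:
        while True:
            faulthandler.dump_traceback_later(timeout, file=file)
            await asyncio.sleep(interval)
    finally:
        faulthandler.cancel_dump_traceback_later()


async def demo() -> str:
    with tempfile.TemporaryFile(mode="w+") as f:
        task = asyncio.create_task(arm_hang_dump(0.2, 0.05, f))
        await asyncio.sleep(0.1)  # heartbeat is running normally
        time.sleep(0.6)           # simulate a blocking call on the loop thread
        await asyncio.sleep(0.1)
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real worker the file would more likely be `sys.stderr` or a persistent log file, with a timeout shorter than the supervisor's kill threshold so the dump is written before the process is terminated.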

Happy to share any other information we can to assist in the local repro.

Thanks again.