"process is unresponsive, killing process" - happens intermittently on long Gemini 3.1 Flash Live calls, on_session_end shutdown callback often doesn't run

Happens on roughly 3 out of ~20 calls lasting 1-2+ minutes. The call flow looks completely normal up until the end:

deleting room on agent session close (disable via RoomInputOptions.delete_room_on_close=False)

Agent session closed: USER INITIATED

and then, after a delay, the process gets force-killed with “process is unresponsive, killing process”, which (per supervised_proc.py) means the parent received no PongResponse from the child for the full ping_timeout window.
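For anyone following along, this is roughly how we picture that check. It is a toy model under our own assumptions, not the actual supervised_proc.py code: the class name, method names, and the injectable clock are made up for illustration, and only ping_timeout comes from the framework.

```python
import time
from typing import Callable


class PongWatchdog:
    """Toy model of a parent-side liveness check (hypothetical, not LiveKit's code)."""

    def __init__(self, ping_timeout: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ping_timeout = ping_timeout
        self._clock = clock
        # Treat construction time as the moment of the last pong.
        self._last_pong = clock()

    def on_pong(self) -> None:
        # Called whenever the child answers a ping.
        self._last_pong = self._clock()

    def is_unresponsive(self) -> bool:
        # True once no pong has arrived for the whole timeout window.
        return self._clock() - self._last_pong > self._ping_timeout
```

Under this model, anything that stops the child from answering pings for longer than ping_timeout gets the process killed, regardless of which phase of the job it is in.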

What we’ve ruled out:

  • We have a custom on_session_end shutdown callback (registered via ctx.add_shutdown_callback) that does LLM summarization + a few API mutations. We instrumented it heavily and confirmed: in 2 out of 3 failures, on_session_end never even starts - so the hang happens before our callback runs, somewhere in the framework’s own teardown.
  • We also simulated a 120-second blocking call inside on_session_end (via a local mock server replacing our OpenAI call) and the process was not killed in that case - so a long-running sync call inside our own shutdown callback doesn’t appear to be the trigger either.
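The instrumentation mentioned above is along these lines. This is a simplified sketch rather than our production code: the traced decorator and the logger name are illustrative, and the callback body is a placeholder.

```python
import asyncio
import functools
import logging
import time

logger = logging.getLogger("shutdown-trace")


def traced(fn):
    """Log entry, exit and duration of an async callback."""

    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        start = time.monotonic()
        logger.info("%s: started", fn.__name__)
        try:
            return await fn(*args, **kwargs)
        except BaseException:
            # Also catches cancellation, so a cancelled callback leaves a trace.
            logger.exception("%s: raised", fn.__name__)
            raise
        finally:
            logger.info("%s: finished after %.2fs", fn.__name__, time.monotonic() - start)

    return wrapper


@traced
async def on_session_end() -> None:
    # Placeholder body; the real callback does LLM summarization and API mutations.
    await asyncio.sleep(0)
```

The wrapped function is registered via ctx.add_shutdown_callback exactly as before. If the "started" line never appears in the logs, the callback was never invoked.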

Do you have any other blocking calls anywhere in the agent? It feels like the agent is becoming non-responsive for some reason. I think your finding that on_session_end never starts in 2 out of the 3 failures indicates it’s something that happens during agent execution. The fact that you only see it on longer calls makes me think the same (not that 1-2 minutes is very long for an AI agent call in the grand scheme of things).
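One cheap way to look for blocking calls is an event-loop lag monitor. Here is a minimal sketch using plain asyncio, nothing LiveKit-specific; the function name, logger name, thresholds, and the optional lags list are all just illustrative choices.

```python
import asyncio
import logging
import time
from typing import List, Optional

logger = logging.getLogger("loop-lag")


async def monitor_loop_lag(
    interval: float = 0.5,
    threshold: float = 1.0,
    lags: Optional[List[float]] = None,
) -> None:
    """Warn whenever the event loop wakes up much later than requested."""
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        # Extra time beyond the requested sleep means the loop was busy or blocked.
        lag = time.monotonic() - start - interval
        if lag > threshold:
            logger.warning("event loop was blocked for about %.2fs", lag)
            if lags is not None:
                lags.append(lag)
```

You could start it with asyncio.create_task(monitor_loop_lag()) early in your entrypoint and cancel it on shutdown. If warnings show up shortly before the kill, that points at something synchronous stalling the loop.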

Can you reproduce this with a minimal agent, such as livekit-examples/agent-starter-python (the voice AI starter for LiveKit Agents with Python) plus Gemini Live? If so, that would go a long way to isolating the issue to the framework.

Hi,

The issue occurs on random calls longer than 1-2 minutes, which makes it difficult to reproduce. I did try to reproduce it, but did not succeed.

I am getting the “Agent session closed” log, which I assume means the session closed successfully. My guess is that the culprit is what happens after that point: the job process is not shutting down completely. If it were shutting down cleanly, on_session_end would have been called, and any problem would lie in the code inside on_session_end, which is not the case.

If you can share a couple of session IDs (they begin with RM_) where this issue occurs, I’ll see if there’s anything in the server logs that might be a clue, although I think it’s unlikely there’ll be anything there.

Do you have full agent logs?

It would be a very good idea to set up an external logging service to obtain full logs, if you haven’t already done so.

These are the session IDs - RM_RkbUy6JQfAgJ, RM_o5Ya9qrT9wF7, RM_AMgYJBGQqpJk

We have Sentry integrated for external logging, but cannot find any clue there either.

I took a thorough look at the server logs for RM_RkbUy6JQfAgJ and I don’t see anything untoward. From our side it just looks like the caller hung up. I can take a look at your agent logs if you’re happy and able to share them in DM.