session.say() hangs for ~20s+ when using Gemini RealtimeModel (gemini 3.1) as llm together with google.beta.GeminiTTS as tts; session.input.set_audio_enabled() / aec_warmup_duration do not reliably block user interruption in this configuration

Summary

When using llm=google.beta.realtime.RealtimeModel (gemini-3.1-flash-live-preview) together with a separate tts=google.beta.GeminiTTS(gemini-3.1-flash-tts-preview) instance on the same AgentSession, a scripted greeting delivered as two session.say() calls reliably hangs on the second call for ~20 seconds, with no logs, no exception, and no say() completion. This happens even though allow_interruptions=False is set and session.input.set_audio_enabled(False) was called beforehand. The same code path works correctly when tts is swapped for a non-Google provider (e.g. sarvam.TTS), with both calls completing in well under 2 seconds combined.

Separately, even when the hang is not occurring, we’ve observed that session.input.set_audio_enabled(False) and the session’s aec_warmup_duration interruption window do not compose as expected. The AEC-warmup-based interruption suppression appears to be a fixed timer independent of how long the say() calls actually take, so it can expire mid-greeting and let STT-detected noise/echo interrupt the agent before our manually-disabled input window ends.
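To illustrate the suspected timing problem, here is a minimal sketch of the arithmetic. It assumes the warmup window is a fixed timer that starts when the greeting starts, which is our reading of the behaviour and not something confirmed from the library. The helper segment_at_expiry is hypothetical, and the say() durations are illustrative guesses; only the 6 s window and the 1.5 s pause come from the code in this report.

```python
from typing import List, Optional, Tuple


def segment_at_expiry(
    warmup_s: float, segments: List[Tuple[str, float]]
) -> Optional[str]:
    """Return the segment still in progress when warmup_s elapses.

    Returns None if the whole greeting finishes before the window ends.
    """
    elapsed = 0.0
    for name, duration in segments:
        elapsed += duration
        if warmup_s < elapsed:
            return name
    return None


# Illustrative durations in seconds; the two say() values are guesses.
greeting = [
    ("say: Hello (file)", 1.0),
    ("sleep", 1.5),
    ("say: intro (TTS)", 5.0),
]

print(segment_at_expiry(6.0, greeting))
```

With these numbers the 6 s window ends while the second say() is still playing. Any stall in that call, such as the ~20 s hang, only makes the gap larger.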

Environment

  • livekit-agents version: ~=1.5
  • Realtime model: gemini-3.1-flash-live-preview
  • TTS (problem case): google.beta.GeminiTTS, model gemini-3.1-flash-tts-preview, voice Sulafat
  • TTS (working case): sarvam.TTS, model bulbul:v3
  • AgentSession(aec_warmup_duration=6, ...)

Code

python

session = AgentSession[Call_State](
    userdata=call_state,
    llm=realtime_model,  
    tts=tts_model,
    user_away_timeout=user_away_timeout,
    aec_warmup_duration=6,
)

python

session.input.set_audio_enabled(False)
logger.warning("disabling the user input for saying the Hello and first intro!!")

await session.say(
    "Hello",
    audio=audio_frames_from_file(
        file_path=f"src/assets/audio_files_for_hello/{audio_file}",
    ),
)
logger.warning("said hello!!")

await asyncio.sleep(1.5)

await session.say("Hey there, I am an AI calling bot. Can we have a quick chat?")
logger.warning("said second sentence from the script")  # <-- never reached for ~20s

session.input.set_audio_enabled(True)
logger.warning("enabled the user input")

Interesting. I believe this could just be the startup time of the gemini-3.1-flash-tts-preview model rather than anything you or the plugin is doing wrong.

Unfortunately, I need to request a new key before I can look into this further / properly, which will take a few days. I do see some external reports about slow startup time, but nothing internally.

@Jayesh_Shinde, that startup-time read fits, and your own code pins it down: the hang is on the second say() because the first one plays a file. say() with an audio= argument streams those frames directly and never invokes the TTS [livekit/agents voice/agent_activity.py], so your “Hello” never touches GeminiTTS. The second call has no audio=, so it goes through the TTS, making it GeminiTTS’s first real synthesis of the session, exactly where a preview-model cold start would land. Sarvam’s first synthesis presumably just warms up faster.

You can confirm this today without waiting on the key: warm GeminiTTS once at session start, before the first say(). synthesize() just produces frames you drain, so nothing is published to the caller:

async for _ in tts_model.synthesize("warmup"):
    pass

If that call absorbs the ~20s and the live greeting is then fast, it’s GeminiTTS first-request latency and pre-warming hides it. If even the warmup hangs with no completion, it’s not cold start and a minimal repro on livekit/agents is worth filing; I didn’t find an existing one.
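To make that comparison measurable rather than eyeballed, here is a minimal sketch of a timing wrapper around the drain loop. timed_drain and fake_synthesize are hypothetical names for illustration; fake_synthesize only stands in for a TTS stream so the snippet runs on its own, and it assumes nothing about GeminiTTS beyond the stream being async-iterable, as in the loop above.

```python
import asyncio
import time
from typing import Any, AsyncIterable, Tuple


async def timed_drain(stream: AsyncIterable[Any]) -> Tuple[int, float]:
    """Drain an async iterable and return (item_count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    async for _ in stream:
        count += 1
    return count, time.perf_counter() - start


async def fake_synthesize(text: str, first_frame_delay: float = 0.05):
    """Stand-in TTS stream: a slow first frame, then the remaining frames."""
    await asyncio.sleep(first_frame_delay)  # simulated cold-start latency
    for _ in range(3):
        yield b"\x00" * 320  # dummy audio payload


def run_demo() -> Tuple[int, float]:
    """Run the timed drain against the stand-in stream."""
    return asyncio.run(timed_drain(fake_synthesize("warmup")))
```

In the agent you would call `count, elapsed = await timed_drain(tts_model.synthesize("warmup"))` before the first say() and log both values. A large elapsed here with a fast greeting afterwards points at first-request latency. A fast warmup with the second say() still stalling points elsewhere.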

The aec_warmup_duration / set_audio_enabled behavior is a separate code path; I’d pin the say hang first, then isolate that on its own.

No difference.
I expected a lag during the initial warmup:

async for _ in tts_model.synthesize("warmup"):
    pass

But after the first ‘Hello’, the behaviour was the same as before.