Title: 6–12s latency on the first model response after a pre-recorded greeting, even with early session initialization (Gemini 3.1 Flash Live)

Setup:

LiveKit Agents + google.beta.realtime.RealtimeModel (gemini-3.1-flash-live-preview)
Agent greets the user first with a pre-recorded audio file (“Hello”) - not model-generated.
User responds (“Hello”), and this is the first actual turn sent to the Gemini Live session.

Issue:

The response to the user’s first “Hello” takes 6–12 seconds, while every subsequent turn in the same session responds normally (within roughly 1–2s, as expected for a realtime model). This delay is consistent and only ever happens on this first real exchange.

What I’ve already tried:

Initialized the AgentSession / RealtimeModel connection before the call is even placed, hoping to give the WebSocket session and model extra warm-up time ahead of the first user utterance. The delay persists regardless - same 6–12s on the first turn, no improvement.

My hypothesis:

Since the greeting is a static audio file (not produced by Gemini), the Live API session may not be “warmed up” in any meaningful way until it processes its first real input. In other words, the session connection itself might be open, but the model/inference backend may not actually spin up or allocate resources until the first realtimeInput/turn is received, regardless of how early the WebSocket was established.

What I’d like help with:

Is there a known “cold start” cost specific to the first turn of a Live API session that’s separate from WebSocket connection setup - i.e., does merely opening the session and waiting NOT pre-warm the model? Is there a recommended way to pre-warm the actual model/inference path - e.g., sending a dummy/throwaway turn immediately after setupComplete (before the real greeting plays), so the first real user turn doesn’t pay this cost?

Has anyone else measured this first-turn latency specifically with gemini-3.1-flash-live-preview via LiveKit, and confirmed whether it’s a Gemini-side cold start vs something in LiveKit’s session/event handling (e.g., first-time setup of audio pipelines, VAD calibration, etc. on the LiveKit side)?

Any logging/tracing recommendations to pinpoint where the 6-12s is actually spent (network round-trip vs Gemini inference queue vs LiveKit-side buffering)?

6-12s is very long. Is this happening for every session? It honestly feels more like what is described in “Deployment management | LiveKit Documentation”, but I wouldn’t expect that to happen every session.

How are you sending the static audio file? Through a separate TTS?

I haven’t seen this come up before, but as you have probably seen, there are compatibility issues with 3.1 (see “Gemini Live API plugin | LiveKit Documentation”), so quite possibly this is coming from the LLM.

I would say the first thing should be to nail down exactly what is contributing to that 6-12 seconds. Is it agent dispatch? Is it something else? I would start with “Agent insights in LiveKit Cloud | LiveKit Documentation” to try to understand this.
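If it helps alongside the insights view, here is a minimal sketch of a timestamp collector for breaking a single turn into stages. It is plain Python and does not call any LiveKit or Gemini API. The event names used with it (for example `user_speech_end`, `input_sent`, `first_model_audio`) are hypothetical labels you would call from whatever hooks your agent already has.

```python
import time
from typing import Callable, Dict, List, Tuple


class TurnTimeline:
    """Collects named timestamps for one turn and reports the gaps between them."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # A monotonic clock avoids jumps from wall-clock adjustments.
        self._clock = clock
        self._marks: List[Tuple[str, float]] = []

    def mark(self, name: str) -> None:
        # Keep only the first occurrence of each event name within a turn.
        if all(existing != name for existing, _ in self._marks):
            self._marks.append((name, self._clock()))

    def gaps(self) -> Dict[str, float]:
        # Seconds elapsed between each pair of consecutive marks, in order.
        out: Dict[str, float] = {}
        for (a, ta), (b, tb) in zip(self._marks, self._marks[1:]):
            out[f"{a} -> {b}"] = tb - ta
        return out
```

Logging `gaps()` for the first turn and for a later turn of the same session should show which stage absorbs the extra seconds: a large gap before the input leaves the agent points at the LiveKit side, while a large gap between sending input and receiving the first model audio points at the model side.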

Yes, I am sending the static audio file through a separate TTS.

@madhur, on the pre-warm idea specifically: a throwaway generate_reply after setup won’t work on 3.1. The Google realtime plugin sets mutable_chat_context=False for any model with “3.1” in the name (mutable = "3.1" not in model), and generate_reply is gated on that capability, so it returns a RealtimeError on gemini-3.1-flash-live-preview [livekit/agents google/realtime/realtime_api.py]. That’s the same 3.1 incompatibility surface flagged above, and it means the plugin won’t let you force an early inference turn on that model.
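To make the gating concrete, here is a small stand-alone sketch that mirrors the expression quoted above. It is not the plugin's code: `supports_mutable_chat_context`, `guard_generate_reply`, and the local `RealtimeError` class are illustrative stand-ins, and only the `"3.1" not in model` check comes from the description in this thread.

```python
class RealtimeError(Exception):
    """Local stand-in for the plugin's error type (not imported from LiveKit)."""


def supports_mutable_chat_context(model: str) -> bool:
    # Same substring check as quoted above: mutable = "3.1" not in model
    return "3.1" not in model


def guard_generate_reply(model: str) -> None:
    # Hypothetical guard: refuse an explicit reply request when the
    # capability is off, which is the behavior described in this thread.
    if not supports_mutable_chat_context(model):
        raise RealtimeError(f"generate_reply is not available for {model}")
```

Under this check, `gemini-3.1-flash-live-preview` is rejected and `gemini-2.5-flash-native-audio-preview-12-2025` passes, which is why a throwaway warm-up turn can only be tried on the 2.5 model.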

Cleanest way to localize the 6-12s to Gemini vs LiveKit: run the identical setup on gemini-2.5-flash-native-audio-preview-12-2025, where generate_reply is supported. If the first-turn delay disappears, it’s specific to the 3.1 preview backend, not your pipeline or LiveKit’s session handling; if it persists on both, the cost is on the LiveKit/dispatch side and the insights breakdown will show it.
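For the comparison itself, a rough way to put one number on each run is the first-turn latency minus the median of the later turns. The helper below is a sketch in plain Python; `first_turn_penalty` is a hypothetical name, and the input is whatever per-turn latencies (in seconds) you record yourself for each model.

```python
from statistics import median
from typing import Sequence


def first_turn_penalty(latencies_s: Sequence[float]) -> float:
    """First-turn latency minus the median of the remaining turns, in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need the first turn plus at least one later turn")
    # The median keeps one slow later turn from hiding the first-turn gap.
    return latencies_s[0] - median(latencies_s[1:])
```

Compute it once per model over several sessions. A penalty that stays large on 3.1 but drops near zero on 2.5 supports the 3.1-backend explanation, while a similar penalty on both points back at the pipeline.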