Setup:
LiveKit Agents + google.beta.realtime.RealtimeModel (gemini-3.1-flash-live-preview)
Agent greets the user first with a pre-recorded audio file (“Hello”) - not model-generated.
User responds (“Hello”), and this is the first actual turn sent to the Gemini Live session.
Issue:
The response to the user’s first “Hello” takes 6–12 seconds, while every subsequent turn in the same session responds normally (within 1–2 seconds, as expected for a realtime model). The delay is consistent and only ever happens on this first real exchange.
What I’ve already tried:
I initialized the AgentSession / RealtimeModel connection before the call is even placed, hoping to give the WebSocket session and model extra warm-up time ahead of the first user utterance. The delay persists regardless: still 6–12 seconds on the first turn, with no improvement.
My hypothesis:
Since the greeting is a static audio file (not produced by Gemini), the Live API session may not be “warmed up” in any meaningful way until it processes its first real input. In other words, the WebSocket connection might be open, but the model/inference backend may not spin up or allocate resources until the first realtimeInput/turn is received, regardless of how early the connection was established.
What I’d like help with:
Is there a known “cold start” cost specific to the first turn of a Live API session that is separate from WebSocket connection setup? That is, does merely opening the session and waiting fail to pre-warm the model?
Is there a recommended way to pre-warm the actual model/inference path, e.g., sending a dummy/throwaway turn immediately after setupComplete (before the real greeting plays), so the first real user turn doesn’t pay this cost?
Has anyone else measured this first-turn latency specifically with gemini-3.1-flash-live-preview via LiveKit, and confirmed whether it’s a Gemini-side cold start vs something in LiveKit’s session/event handling (e.g., first-time setup of audio pipelines, VAD calibration, etc. on the LiveKit side)?
Any logging/tracing recommendations to pinpoint where the 6-12s is actually spent (network round-trip vs Gemini inference queue vs LiveKit-side buffering)?
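To make the last question concrete, here is a minimal sketch of the kind of per-turn timing I have in mind. It is plain Python with no LiveKit or Gemini calls. The checkpoint names (user_speech_end, input_committed, first_model_audio, first_audio_played) are hypothetical labels of my own, not real event names from either library, and would have to be wired into whatever callbacks or logs the installed versions actually expose.

```python
import time
from typing import Callable, List, Optional, Tuple

Gap = Tuple[str, str, float]


class TurnTimeline:
    """Records named checkpoints for one turn and reports the gaps between them."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # time.monotonic is unaffected by wall-clock adjustments, which
        # matters when measuring intervals of a few seconds.
        self._clock = clock
        self._marks: List[Tuple[str, float]] = []

    def mark(self, name: str) -> None:
        # Keep only the first occurrence so repeated events (for example,
        # many audio chunks) do not overwrite the moment a phase started.
        if all(existing != name for existing, _ in self._marks):
            self._marks.append((name, self._clock()))

    def gaps(self) -> List[Gap]:
        # Order by timestamp so gaps read in the order events happened.
        ordered = sorted(self._marks, key=lambda m: m[1])
        return [(a, b, tb - ta) for (a, ta), (b, tb) in zip(ordered, ordered[1:])]

    def slowest(self) -> Optional[Gap]:
        # The largest gap is the first place to look for the delay.
        gaps = self.gaps()
        return max(gaps, key=lambda g: g[2]) if gaps else None

    def report(self) -> str:
        return "\n".join(f"{a} -> {b}: {dt:.3f}s" for a, b, dt in self.gaps())


if __name__ == "__main__":
    # Hypothetical wiring: each mark() call would live in the matching
    # event handler or log hook, not in straight-line code like this.
    timeline = TurnTimeline()
    timeline.mark("user_speech_end")     # end of speech detected locally
    timeline.mark("input_committed")     # last audio for the turn sent upstream
    timeline.mark("first_model_audio")   # first audio chunk received from the model
    timeline.mark("first_audio_played")  # first frame published to the room
    print(timeline.report())
```

Comparing the report for turn 1 against turn 2 should show which gap absorbs the extra seconds. If it is input_committed -> first_model_audio, the delay is upstream (network or Gemini). If it sits before input_committed or after first_model_audio, it is more likely on the LiveKit side.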