Hi LiveKit team,
I’m testing Gemini 3 Flash Preview for a latency-sensitive voice agent built on LiveKit Agents. I compared Gemini 3 through LiveKit Inference vs direct Google Vertex using the same app workflow and the same scripted dialogue-turn benchmark.
Goal: reduce user-perceived latency, especially the time from the final user transcript to the first assistant text/audio.
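To make the metric concrete, here is a minimal sketch of timing "final transcript to first assistant text" against any streaming response. It is not tied to the LiveKit API: `time_to_first_chunk` and `fake_llm_stream` are hypothetical names, and the fake stream only stands in for a real LLM stream.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def time_to_first_chunk(
    stream: AsyncIterator[str], started_at: float
) -> Tuple[Optional[str], float]:
    """Return the first non-empty chunk and the elapsed ms since started_at."""
    async for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return chunk, (time.perf_counter() - started_at) * 1000.0
    # Stream ended without producing any text.
    return None, (time.perf_counter() - started_at) * 1000.0


async def fake_llm_stream() -> AsyncIterator[str]:
    # Hypothetical stand-in for a streaming LLM response.
    await asyncio.sleep(0.05)
    yield "Hello"
    await asyncio.sleep(0.05)
    yield " there"


async def main() -> Tuple[Optional[str], float]:
    # In a real agent this timestamp would be taken when the final
    # user transcript arrives.
    started_at = time.perf_counter()
    return await time_to_first_chunk(fake_llm_stream(), started_at)


first_chunk, ttft_ms = asyncio.run(main())
print(f"first chunk={first_chunk!r} ttft={ttft_ms:.0f}ms")
```

Taking the start timestamp at the final transcript, rather than at request dispatch, keeps any routing or queueing overhead inside the measured interval.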
Environment:
LiveKit Agents: 1.5.8
Category: Agents
Model: google/gemini-3-flash-preview
Use case: voice rehearsal agent, short dialogue turns
LiveKit Inference config:
from livekit.agents import inference
llm = inference.LLM(
    model="google/gemini-3-flash-preview",
    extra_kwargs={
        "temperature": 0.0,
        "reasoning_effort": "low",
        "max_tokens": 512,
    },
)
Direct Google Vertex config:
from livekit.plugins import google
llm = google.LLM(
    model="gemini-3-flash-preview",
    temperature=0.0,
    max_output_tokens=512,
    vertexai=True,
    location="global",
    thinking_config={
        "thinking_level": "low",
    },
)
On the same 12 normal dialogue turns:
| Route | Behavior pass | Avg TTFT | p50 TTFT | p90 TTFT | Max TTFT |
|---|---|---|---|---|---|
| LiveKit Inference + reasoning_effort="low" | 12/12 | 2339ms | 2098ms | 4134ms | 4447ms |
| Direct Vertex + thinking_level="low" | 12/12 | 1052ms | 988ms | 1288ms | 1326ms |
I expected LiveKit Inference to be comparable or faster, but in this test it had materially higher TTFT and more jitter.
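For anyone reproducing the comparison, here is a minimal sketch of reducing per-turn TTFT samples to the columns in the table above. It assumes nearest-rank percentiles, which may differ slightly from other percentile methods; `summarize_ttft`, `nearest_rank`, and the sample values are illustrative placeholders, not the measured data.

```python
import statistics
from typing import Dict, List


def nearest_rank(ordered: List[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending list (pct is an integer 1..100)."""
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_ttft(samples_ms: List[float]) -> Dict[str, float]:
    """Compute avg, p50, p90, and max over per-turn TTFT samples in ms."""
    ordered = sorted(samples_ms)
    return {
        "avg": statistics.fmean(ordered),
        "p50": nearest_rank(ordered, 50),
        "p90": nearest_rank(ordered, 90),
        "max": ordered[-1],
    }


# Placeholder values for illustration only, not benchmark results.
example_ms = [100.0, 200.0, 300.0, 400.0, 500.0,
              600.0, 700.0, 800.0, 900.0, 1000.0]
summary = summarize_ttft(example_ms)
print(summary)
```

With only 12 turns per route, p90 and max are each set by one or two samples, so the jitter numbers are more sensitive to outliers than the average.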
Relevant logs confirming route selection:
Using LiveKit Inference (Standard Route) for model: google/gemini-3-flash-preview
CONFIRMED USING MODEL: google/gemini-3-flash-preview
Direct Vertex route logs:
Advanced model detected (google/gemini-3-flash-preview): using Direct Google Plugin
Gemini 3 detected: forcing vertexai=True
Gemini 3 detected: setting thinking_level=low
CONFIRMED USING MODEL: google/gemini-3-flash-preview
Questions:
- Is `reasoning_effort="low"` the recommended LiveKit Inference equivalent of Gemini 3 `thinking_level="low"`?
- Is there a way to influence Gemini provider routing/region for LiveKit Inference?
- Should I use `inference_class="priority"` or another option for lower TTFT?
- Are there known Gemini 3 Flash Preview latency differences between LiveKit Inference and direct Vertex?
- Any recommended config for lowest TTFT/jitter on short voice-agent dialogue turns?
I’m happy to share more text logs or a small repro benchmark if helpful.