Gemini 3 Flash Preview via LiveKit Inference has much higher TTFT/jitter than direct Vertex in the same Agents workflow

Hi LiveKit team,

I’m testing Gemini 3 Flash Preview for a latency-sensitive voice agent built on LiveKit Agents. I compared Gemini 3 through LiveKit Inference vs direct Google Vertex using the same app workflow and the same scripted dialogue-turn benchmark.

Goal: reduce user-perceived latency, especially the time from the final user transcript to the first assistant text/audio.
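To make the metric concrete, here is a minimal sketch of timing the first non-empty chunk from a streamed response. It is not the LiveKit API: `measure_ttft` and `fake_llm_stream` are hypothetical names, and the fake stream only stands in for whatever async iterator of text chunks the real route returns.

```python
import asyncio
import time


async def measure_ttft(stream):
    """Return (ttft_seconds, chunks) for any async iterator of text chunks."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    async for chunk in stream:
        # Record the elapsed time once, at the first non-empty chunk.
        if ttft is None and chunk:
            ttft = time.perf_counter() - start
        chunks.append(chunk)
    return ttft, chunks


async def fake_llm_stream(first_delay_s=0.05):
    # Hypothetical stand-in for a real LLM stream: waits, then yields two chunks.
    await asyncio.sleep(first_delay_s)
    yield "Hello"
    yield " there"


ttft, chunks = asyncio.run(measure_ttft(fake_llm_stream()))
print(f"TTFT: {ttft * 1000:.0f}ms, text: {''.join(chunks)!r}")
```

Starting the timer at the point where the final transcript is available, rather than where the request is sent, keeps the number closer to what the user actually perceives.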

Environment:

LiveKit Agents: 1.5.8
Category: Agents
Model: google/gemini-3-flash-preview
Use case: voice rehearsal agent, short dialogue turns

LiveKit Inference config:

from livekit.agents import inference

llm = inference.LLM(
    model="google/gemini-3-flash-preview",
    extra_kwargs={
        "temperature": 0.0,
        "reasoning_effort": "low",
        "max_tokens": 512,
    },
)

Direct Google Vertex config:

from livekit.plugins import google

llm = google.LLM(
    model="gemini-3-flash-preview",
    temperature=0.0,
    max_output_tokens=512,
    vertexai=True,
    location="global",
    thinking_config={
        "thinking_level": "low",
    },
)

On the same 12 normal dialogue turns:

| Route | Behavior pass | Avg TTFT | p50 TTFT | p90 TTFT | Max TTFT |
| --- | --- | --- | --- | --- | --- |
| LiveKit Inference + reasoning_effort="low" | 12/12 | 2339ms | 2098ms | 4134ms | 4447ms |
| Direct Vertex + thinking_level="low" | 12/12 | 1052ms | 988ms | 1288ms | 1326ms |
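
For anyone reproducing this, a minimal sketch of how summary figures like these can be derived from per-turn TTFT samples. The sample values are made-up placeholders, not the measurements reported here, and the nearest-rank percentile is an assumption; other percentile definitions give slightly different p50/p90 values on small samples.

```python
import math
import statistics


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_ttft(samples_ms):
    """Summarize per-turn TTFT samples (milliseconds) as avg, p50, p90 and max."""
    ordered = sorted(samples_ms)
    return {
        "avg": statistics.fmean(ordered),
        "p50": nearest_rank(ordered, 50),
        "p90": nearest_rank(ordered, 90),
        "max": ordered[-1],
    }


# Placeholder values for illustration only; not taken from the benchmark.
example_samples_ms = [900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800]
summary = summarize_ttft(example_samples_ms)
print(summary)
```

With only 12 turns per route, p90 is effectively set by the one or two slowest turns, so a larger sample would give a more stable jitter estimate.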

I expected LiveKit Inference to be comparable or faster, but in this test it had materially higher TTFT and more jitter.

Relevant logs confirming route selection:

Using LiveKit Inference (Standard Route) for model: google/gemini-3-flash-preview
CONFIRMED USING MODEL: google/gemini-3-flash-preview

Direct Vertex route logs:

Advanced model detected (google/gemini-3-flash-preview): using Direct Google Plugin
Gemini 3 detected: forcing vertexai=True
Gemini 3 detected: setting thinking_level=low
CONFIRMED USING MODEL: google/gemini-3-flash-preview

Questions:

  1. Is reasoning_effort="low" the recommended LiveKit Inference equivalent of Gemini 3 thinking_level="low"?
  2. Is there a way to influence Gemini provider routing/region for LiveKit Inference?
  3. Should I use inference_class="priority" or another option for lower TTFT?
  4. Are there known Gemini 3 Flash Preview latency differences between LiveKit Inference and direct Vertex?
  5. Any recommended config for lowest TTFT/jitter on short voice-agent dialogue turns?

I’m happy to share more text logs or a small repro benchmark if helpful.

@Anand_Kumar, a few grounded answers to your five questions:

reasoning_effort vs thinking_level: yes, reasoning_effort="low" is the LiveKit Inference equivalent. It’s the cross-provider knob in ChatCompletionOptions; the mapping to Gemini’s thinking_config happens server-side. You’re using it correctly.

Routing / region: no public knob. inference.LLM on main exposes only model, provider, inference_class, extra_kwargs.

inference_class="priority": try it. Shipped in livekit-agents 1.5.7 specifically for lower-TTFT routing. You’re on 1.5.8 so it’s available, and it’s the documented lever for this case.

Gemini 3 Preview latency: preview models run on whatever capacity Google has allocated to the preview tier, and you also pay for the extra hop through LiveKit Inference. A roughly 2x gap in average TTFT vs direct Vertex (2339ms vs 1052ms in your numbers) on a preview model isn’t surprising. Direct Vertex with location="global" lets Google pick the closest region; Inference doesn’t expose an equivalent setting.

Lowest TTFT: try priority first. If it doesn’t close the gap, direct Vertex stays the right call for latency-critical paths until Gemini 3 Flash is GA.
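
A minimal sketch of that A/B, assuming inference_class is accepted as a keyword argument on inference.LLM as described above; that is worth checking against your installed version. The helper names `inference_llm_kwargs` and `build_inference_llm` are hypothetical, and the model and extra_kwargs values are copied from your config.

```python
def inference_llm_kwargs(use_priority: bool) -> dict:
    """Build keyword arguments for the LiveKit Inference route, with or without priority."""
    kwargs = {
        "model": "google/gemini-3-flash-preview",
        "extra_kwargs": {
            "temperature": 0.0,
            "reasoning_effort": "low",
            "max_tokens": 512,
        },
    }
    if use_priority:
        # Assumption: inference.LLM accepts inference_class="priority".
        kwargs["inference_class"] = "priority"
    return kwargs


def build_inference_llm(use_priority: bool):
    """Construct the LLM; livekit is imported lazily so the kwargs helper works without it."""
    from livekit.agents import inference

    return inference.LLM(**inference_llm_kwargs(use_priority))


standard_kwargs = inference_llm_kwargs(use_priority=False)
priority_kwargs = inference_llm_kwargs(use_priority=True)
```

Running the same 12 scripted turns against both variants keeps everything else constant, so any change in p90 can be attributed to the routing class rather than to the prompt or sampling settings.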