Feature request: Gemini thinkingLevel=minimal for faster voice-agent TTFT

Anand_Kumar · May 20, 2026, 2:42pm

Hi LiveKit team,

I’m testing Gemini 3 / 3.5 Flash for a real-time voice agent where time to first token (TTFT) directly affects user experience.

LiveKit Inference currently exposes reasoning_effort for Gemini thinking-capable models. According to the docs, reasoning_effort=low maps to a fixed thinking token budget. However, Gemini’s native API also exposes thinkingLevel, which includes a minimal setting.

In our benchmarks, thinkingLevel=minimal via direct Vertex is materially faster than LiveKit Inference with reasoning_effort=low, even when using service_tier=priority: roughly 170 ms lower median TTFT and roughly 290 ms lower P90 TTFT.

Same prompt, same dialogue flow, same model family:

| Route | Model | Thinking config | Tier | Median TTFT | P90 TTFT |
|---|---|---|---|---|---|
| Direct Vertex | `gemini-3.5-flash` | `thinkingLevel=minimal` | n/a | ~877 ms | ~980 ms |
| LiveKit Inference | `google/gemini-3.5-flash` | `reasoning_effort=low` | priority | ~1049 ms | ~1272 ms |
| LiveKit Inference | `google/gemini-3.5-flash` | `reasoning_effort=low` | standard | ~1048 ms | ~1484 ms |
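To turn raw per-turn measurements into the median/P90 figures above, and to compare routes in relative terms, a small helper like the one below could be used. This is a minimal sketch, not the benchmark harness behind these numbers; it assumes you have TTFT samples in milliseconds, and the helper names are hypothetical.

```python
import statistics


def summarize_ttft(samples_ms):
    """Return (median, p90) time-to-first-token in milliseconds."""
    if len(samples_ms) < 2:
        # statistics.quantiles needs at least two points on older Pythons.
        raise ValueError("need at least two samples")
    median = statistics.median(samples_ms)
    # quantiles(n=10) returns the 9 decile cut points; index 8 is P90.
    p90 = statistics.quantiles(samples_ms, n=10)[8]
    return median, p90


def pct_slower(candidate_ms, baseline_ms):
    """Relative latency increase of candidate over baseline, in percent."""
    return 100.0 * (candidate_ms - baseline_ms) / baseline_ms


# Comparing the priority-tier Inference row against direct Vertex:
print(f"median: +{pct_slower(1049, 877):.1f}%")  # → median: +19.6%
print(f"p90: +{pct_slower(1272, 980):.1f}%")  # → p90: +29.8%
```

Note that `statistics.quantiles` defaults to the "exclusive" interpolation method; other tools may compute P90 slightly differently, so use the same method on both routes.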

For our workload, many turns are short and stateful rather than open-ended reasoning tasks. In testing, Gemini’s minimal setting appears to provide a better latency/behavior tradeoff.

Would LiveKit consider exposing Gemini-native thinkingLevel directly for Gemini models, or mapping a lower reasoning_effort option to thinkingLevel=minimal?

Something like:

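```python
from livekit.agents import inference

# Proposed behavior: "thinking_level" is not currently passed through
# by LiveKit Inference; this shows what the request would look like.
llm = inference.LLM(
    model="google/gemini-3.5-flash",
    extra_kwargs={
        "thinking_level": "minimal",  # would map to Gemini's native thinkingLevel
        "service_tier": "priority",
    },
)
```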

or:

```python
# Alternative: accept "minimal" as a reasoning_effort value for Gemini models
extra_kwargs={
    "reasoning_effort": "minimal",
}
```
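One way Inference could honor either form is to translate it into a Gemini thinking configuration before forwarding the request. The sketch below is hypothetical: the function name, the set of accepted levels, and the output shape (a `thinkingConfig` dict with `thinkingLevel`) are assumptions for illustration, not LiveKit's implementation or a verified Gemini request payload.

```python
from typing import Any

# Hypothetical set of Gemini-native thinking levels accepted by this sketch.
GEMINI_THINKING_LEVELS = {"minimal", "low", "medium", "high"}


def build_thinking_config(extra_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Translate either proposed kwarg into a Gemini-style thinkingConfig dict.

    An explicit thinking_level wins; otherwise reasoning_effort="minimal" is
    mapped to thinkingLevel="minimal". Other reasoning_effort values are left
    to the existing token-budget mapping, so an empty dict is returned.
    """
    level = extra_kwargs.get("thinking_level")
    if level is None and extra_kwargs.get("reasoning_effort") == "minimal":
        level = "minimal"
    if level is None:
        return {}
    if level not in GEMINI_THINKING_LEVELS:
        raise ValueError(f"unsupported thinking_level: {level!r}")
    return {"thinkingConfig": {"thinkingLevel": level}}


print(build_thinking_config({"thinking_level": "minimal", "service_tier": "priority"}))
# → {'thinkingConfig': {'thinkingLevel': 'minimal'}}
print(build_thinking_config({"reasoning_effort": "low"}))  # → {}
```

Letting an explicit `thinking_level` take precedence keeps the Gemini-native option unambiguous, while the `reasoning_effort="minimal"` path keeps a provider-neutral spelling available.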

This would make LiveKit Inference much more competitive for latency-critical Gemini voice agents while keeping billing and routing inside LiveKit.

Happy to share more benchmark detail if useful.

Muhammad_Usman_Bashir · May 21, 2026, 12:00am

@Anand_Kumar, great work, Sir. I really enjoyed reading the benchmarks.

As a suggestion, note that extra_kwargs is filtered by drop_unsupported_params() in livekit/agents/…/inference/llm.py, so passing thinking_level through directly won’t work as a workaround on your side.
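The effect of that filtering can be illustrated with a simple allowlist. This is a hypothetical sketch of the behavior, not the actual drop_unsupported_params() code; the allowlist contents and the helper name are assumptions.

```python
from typing import Any

# Hypothetical allowlist; the real supported-parameter list lives in
# LiveKit's inference code and may differ.
SUPPORTED_PARAMS = {"temperature", "max_tokens", "reasoning_effort", "service_tier"}


def filter_params(extra_kwargs: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Split kwargs into (kept, dropped) based on the allowlist."""
    kept = {k: v for k, v in extra_kwargs.items() if k in SUPPORTED_PARAMS}
    dropped = sorted(set(extra_kwargs) - SUPPORTED_PARAMS)
    return kept, dropped


kept, dropped = filter_params({"thinking_level": "minimal", "service_tier": "priority"})
print(kept)  # → {'service_tier': 'priority'}
print(dropped)  # → ['thinking_level']
```

Because an unknown key is silently removed rather than rejected, the request still succeeds, just without the minimal thinking setting, which is why the parameter has to be supported on the Inference side.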

darryncampbell · May 21, 2026, 2:17pm

You are probably aware, but we do expose thinking_config through the Gemini plugin (see Google Gemini LLM in the LiveKit documentation), which maps to thinkingLevel on Gemini 3.

I don’t see any reason why this parameter should not be added to Inference. Can you raise an issue or PR against the agents repository? I don’t see any existing requests or submissions for this.

Anand_Kumar · May 21, 2026, 11:36pm

@darryncampbell thank you. Just raised one: Expose Gemini thinkingLevel=minimal in LiveKit Inference (Issue #5802, livekit/agents on GitHub).

darryncampbell · May 26, 2026, 8:49am

I’m not sure how much of the associated comment you can see, since it relates to a private repo, but your issue was closed the same day because it coincided with an identical PR (which was merged). I’m not sure whether that was pure coincidence or whether your issue triggered the internal PR, but thank you.

Anand_Kumar · May 26, 2026, 1:07pm

Thanks for clarifying, no worries at all. Glad the equivalent change landed.

Do you know which Agents release this will be available in? Happy to test it against our real-time voice benchmark once it ships.

darryncampbell · May 26, 2026, 3:35pm

Good question. It won’t follow the Agents release cycle, since the change was made in Inference. It should be available soon (there hasn’t been an Inference release in a few days, but I think that’s because of the Memorial Day weekend).

darryncampbell · May 27, 2026, 6:57am

The Agents team informs me they have just pushed a new Inference release that contains this change.

Anand_Kumar · May 27, 2026, 1:37pm

Excellent! Thanks for letting me know, @darryncampbell!
