Feature request: Gemini thinkingLevel=minimal for faster voice-agent TTFT

Hi LiveKit team,

I’m testing Gemini 3 / 3.5 Flash for a real-time voice agent where TTFT has direct user-experience impact.

LiveKit Inference currently exposes reasoning_effort for Gemini thinking-capable models. From the docs, reasoning_effort=low maps to a fixed thinking token budget. However, Gemini’s native API also exposes thinkingLevel, including minimal.

In our benchmarks, thinkingLevel=minimal via direct Vertex is materially faster than LiveKit Inference with reasoning_effort=low, even when using service_tier=priority.

Same prompt, same dialogue flow, same model family:

| Route | Model | Thinking config | Tier | Median TTFT | P90 TTFT |
| --- | --- | --- | --- | --- | --- |
| Direct Vertex | gemini-3.5-flash | thinkingLevel=minimal | n/a | ~877 ms | ~980 ms |
| LiveKit Inference | google/gemini-3.5-flash | reasoning_effort=low | priority | ~1049 ms | ~1272 ms |
| LiveKit Inference | google/gemini-3.5-flash | reasoning_effort=low | standard | ~1048 ms | ~1484 ms |
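
For anyone who wants to reproduce the median and P90 columns, here is a minimal sketch of one way to summarize per-turn TTFT samples. It is not the exact harness behind the numbers above: the function names (`percentile_nearest_rank`, `summarize_ttft`) are hypothetical, and nearest-rank is only one of several common percentile definitions, so results on small sample sets can differ slightly from other tools.

```python
import statistics


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer 1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not isinstance(pct, int) or not 1 <= pct <= 100:
        raise ValueError("pct must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize_ttft(samples_ms):
    """Summarize time-to-first-token samples (milliseconds) as median and P90."""
    return {
        "median_ms": statistics.median(samples_ms),
        "p90_ms": percentile_nearest_rank(samples_ms, 90),
    }


if __name__ == "__main__":
    # Placeholder values for illustration only, not measurements from this thread.
    example_samples_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    print(summarize_ttft(example_samples_ms))
```

Reporting P90 alongside the median matters for voice agents because occasional slow turns are what users notice, and the table shows the tiers differing mostly in the tail.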

For our workload, many turns are short and stateful rather than open-ended reasoning tasks. In testing, Gemini’s minimal setting appears to provide a better latency/behavior tradeoff.

Would LiveKit consider exposing Gemini-native thinkingLevel directly for Gemini models, or mapping a lower reasoning_effort option to thinkingLevel=minimal?

Something like:

inference.LLM(
    model="google/gemini-3.5-flash",
    extra_kwargs={
        "thinking_level": "minimal",
        "service_tier": "priority",
    },
)

or:

extra_kwargs={
    "reasoning_effort": "minimal"
}

This would make LiveKit Inference much more competitive for latency-critical Gemini voice agents while keeping billing and routing inside LiveKit.

Happy to share more benchmark detail if useful.

@Anand_Kumar, great work, Sir. I really enjoyed reading the benchmarking.

One note: extra_kwargs is filtered by drop_unsupported_params() in livekit/agents/…/inference/llm.py, so passing thinking_level through directly won’t work as a workaround on your side.
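
To illustrate why an unrecognized key never reaches the provider, here is a rough sketch of allowlist-style filtering. This is an assumption about the general pattern, not the actual LiveKit implementation: `SUPPORTED_PARAMS` and `filter_supported_kwargs` are hypothetical names, and the real allowlist and behavior of drop_unsupported_params() may differ.

```python
# Hypothetical allowlist for illustration; the real set of supported
# parameters lives in the LiveKit codebase and is not reproduced here.
SUPPORTED_PARAMS = frozenset({"reasoning_effort", "service_tier"})


def filter_supported_kwargs(extra_kwargs, supported=SUPPORTED_PARAMS):
    """Split kwargs into those on the allowlist and the names that were dropped."""
    kept = {key: value for key, value in extra_kwargs.items() if key in supported}
    dropped = sorted(set(extra_kwargs) - set(kept))
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = filter_supported_kwargs(
        {"thinking_level": "minimal", "service_tier": "priority"}
    )
    print("forwarded:", kept)
    print("dropped:", dropped)
```

Under this pattern, thinking_level is silently removed before the request is built, so the call succeeds but the setting has no effect.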

You are probably aware, but we do expose thinking_config through the Gemini plugin (see Google Gemini LLM | LiveKit Documentation), which maps to thinkingLevel on Gemini 3.

I don’t see any reason why this parameter should not be added to Inference. Can you raise an issue or PR against the agents repository? I don’t see any existing requests or submissions for this.

@darryncampbell thank you. Just raised: Expose Gemini thinkingLevel=minimal in LiveKit Inference · Issue #5802 · livekit/agents · GitHub

I’m not sure how much of the associated comment you can see, since it relates to a private repo, but your PR was closed the same day because it coincided with an identical PR (which was merged). I’m not sure if that was pure coincidence, or if your PR triggered the internal PR, but thank you :slight_smile:

Thanks for clarifying - no worries at all. Glad the equivalent change landed.

Do you know which agents release this will be available in? Happy to test it against our real-time voice benchmark once it ships.

Good question. That wouldn’t be on the Agents release cycle, since it has been added to Inference. It should be soon (there hasn’t been an Inference release in a few days, but I think that’s because it’s been Memorial Day weekend).

The agents team inform me they have just pushed a new Inference release, which contains this change :lk-launch:

Excellent!!! Thanks for letting me know, @darryncampbell! :rocket: