Hi LiveKit team,
I’m testing Gemini 3 / 3.5 Flash for a real-time voice agent where time to first token (TTFT) directly affects the user experience.
LiveKit Inference currently exposes reasoning_effort for Gemini thinking-capable models. From the docs, reasoning_effort=low maps to a fixed thinking token budget. However, Gemini’s native API also exposes thinkingLevel, including minimal.
In our benchmarks, thinkingLevel=minimal via direct Vertex is materially faster than LiveKit Inference with reasoning_effort=low, even when using service_tier=priority.
Same prompt, same dialogue flow, same model family:
| Route | Model | Thinking config | Tier | Median TTFT | P90 TTFT |
|---|---|---|---|---|---|
| Direct Vertex | gemini-3.5-flash | thinkingLevel=minimal | n/a | ~877ms | ~980ms |
| LiveKit Inference | google/gemini-3.5-flash | reasoning_effort=low | priority | ~1049ms | ~1272ms |
| LiveKit Inference | google/gemini-3.5-flash | reasoning_effort=low | standard | ~1048ms | ~1484ms |
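
A minimal sketch of how TTFT figures like these could be collected and summarized. The helper names `measure_ttft_ms` and `summarize` are hypothetical, and `start_stream` is an assumed stand-in for whichever client call opens the streaming response (direct Vertex or LiveKit Inference); nothing here is part of either API.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def measure_ttft_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Milliseconds from issuing the request until the first chunk arrives.

    `start_stream` is a hypothetical zero-argument callable that sends the
    request and returns an iterable of streamed chunks.
    """
    t0 = time.perf_counter()
    for _chunk in start_stream():
        # Stop at the first chunk: that is the time to first token.
        return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream ended without producing a chunk")


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Median and P90 of a list of TTFT samples (needs at least 2 samples)."""
    # quantiles(n=10) returns 9 cut points; index 8 is the 90th percentile.
    deciles = statistics.quantiles(samples_ms, n=10, method="inclusive")
    return {"median": statistics.median(samples_ms), "p90": deciles[8]}
```

Running `measure_ttft_ms` once per turn for each route and passing the collected samples to `summarize` gives comparable median and P90 values, provided the prompt and dialogue flow are held constant across routes.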
For our workload, many turns are short and stateful rather than open-ended reasoning tasks. In testing, Gemini’s minimal setting appears to provide a better latency/behavior tradeoff.
Would LiveKit consider exposing Gemini-native thinkingLevel directly for Gemini models, or mapping a lower reasoning_effort option to thinkingLevel=minimal?
Something like:
    inference.LLM(
        model="google/gemini-3.5-flash",
        extra_kwargs={
            "thinking_level": "minimal",
            "service_tier": "priority",
        },
    )
or:
    extra_kwargs={
        "reasoning_effort": "minimal",
    }
This would make LiveKit Inference much more competitive for latency-critical Gemini voice agents while keeping billing and routing inside LiveKit.
Happy to share more benchmark detail if useful.