Hello,
I'm testing your LiveKit Inference feature for our production voice AI agent and have two questions:
- How should I implement the no-thinking config so that the Gemini model actually honors the setting? Currently the config does not seem to be passed through, and I can still see the model thinking under the hood:
llm_instance = inference.LLM(
    model=llm_model,  # e.g. "google/gemini-2.5-flash"
    provider="google",
    extra_kwargs={
        "temperature": temperature,
        "extra_body": {
            "thinking_config": {
                "thinking_budget": 0,
                "include_thoughts": False,
            }
        },
    },
)
- How stable is the LiveKit Inference API for users on the Scale plan? I'm considering switching from the Vertex AI API because it keeps throttling me, and I need very low latency for my agents without 429 errors - does LiveKit Inference guarantee that?
Kind regards,
Michal
I have also tried this approach, but can you verify whether it is correct for the Gemini models?
llm_instance = inference.LLM(
    model=llm_model,  # "google/gemini-2.5-flash"
    provider="google",
    extra_kwargs=ChatCompletionOptions(
        temperature=temperature,
        reasoning_effort="none",
    ),
)
Thanks in advance!
For Gemini 2.5 Flash via the Google plugin, the supported way to control reasoning is through the thinking_config parameter on the Google LLM itself, not via generic extra_body or reasoning_effort. The Python Google plugin exposes thinking_config directly on the LLM constructor, which is the correct integration point for disabling thinking behavior. See the Google plugin reference for the full parameter list:
Google LLM plugin reference
reasoning_effort is not a Gemini-native setting, so it will not reliably disable internal reasoning for Gemini models.
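As a minimal sketch of that integration point, assuming the plugin's LLM constructor accepts a thinking_config parameter and that ThinkingConfig comes from the google-genai SDK (verify both against your installed plugin version):

```python
from google.genai import types
from livekit.plugins import google

# Sketch: disable thinking for Gemini 2.5 Flash via the Google plugin,
# passing thinking_config directly on the LLM constructor rather than
# through a generic extra_body dict.
llm_instance = google.LLM(
    model="gemini-2.5-flash",
    temperature=0.7,  # example value
    thinking_config=types.ThinkingConfig(
        thinking_budget=0,       # a budget of 0 disables thinking on 2.5 Flash
        include_thoughts=False,  # do not return thought summaries
    ),
)
```

Note this constructs the plugin's LLM directly, so it routes requests through your own Google credentials rather than through LiveKit Inference.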
Regarding stability: LiveKit Inference runs models through LiveKit-managed infrastructure designed for low-latency voice workloads. Plan-specific quotas and limits are documented here:
Quotas & limits
Thanks for the answers!
Just to clarify one thing: what about disabling or specifying the thinking configuration through the LiveKit Inference API rather than through the Google plugin API? Is that possible, or is there a way to invoke LiveKit Inference using the Google plugin directly?
I'm talking about the LLM instance from here:
livekit.agents.inference.llm
and not from here:
livekit.plugins.google.llm
Unfortunately, thinking_config is not available through LiveKit Inference; it is only supported when using the Google plugin directly.