Hello,
I'm testing your LiveKit Inference feature for our production voice AI agent and have two questions:
- How should I implement the no-thinking config so that the Gemini model actually honors the setting? Currently the config does not seem to be passed through, and I can still see the model thinking under the hood:
llm_instance = inference.LLM(
    model=llm_model,  # e.g. "google/gemini-2.5-flash"
    provider="google",
    extra_kwargs={
        "temperature": temperature,
        "extra_body": {
            "thinking_config": {
                "thinking_budget": 0,
                "include_thoughts": False,
            }
        },
    },
)
- How stable is the LiveKit Inference API for users on the Scale plan? I'm considering switching from the Vertex AI API because it keeps throttling me, and I need very low latency for my agents without 429 errors - does LiveKit Inference guarantee that?
Kind regards,
Michal
I have also tried this approach, but can you verify whether it is correct for the Gemini models?
llm_instance = inference.LLM(
    model=llm_model,  # "google/gemini-2.5-flash"
    provider="google",
    extra_kwargs=ChatCompletionOptions(
        temperature=temperature,
        reasoning_effort="none",
    ),
)
Thanks in advance!
For Gemini 2.5 Flash via the Google plugin, the supported way to control reasoning is through the thinking_config parameter on the Google LLM itself, not via generic extra_body or reasoning_effort. The Python Google plugin exposes thinking_config directly on the LLM constructor, which is the correct integration point for disabling thinking behavior. See the Google plugin reference for the full parameter list:
Google LLM plugin reference
reasoning_effort is not a Gemini-native setting, so it will not reliably disable internal reasoning for Gemini models.
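As a minimal sketch of that integration point, assuming the plugin's LLM constructor accepts a thinking_config parameter and that ThinkingConfig comes from the google-genai SDK (verify both against your installed plugin version):

```python
from google.genai import types
from livekit.plugins import google

# Sketch: disable thinking for Gemini 2.5 Flash via the Google plugin,
# passing thinking_config directly on the LLM constructor rather than
# through a generic extra_body dict.
llm_instance = google.LLM(
    model="gemini-2.5-flash",
    temperature=0.7,  # example value
    thinking_config=types.ThinkingConfig(
        thinking_budget=0,       # a budget of 0 disables thinking on 2.5 Flash
        include_thoughts=False,  # do not return thought summaries
    ),
)
```

Note this constructs the plugin's LLM directly, so it routes requests through your own Google credentials rather than through LiveKit Inference.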
Regarding stability: LiveKit Inference runs models through LiveKit-managed infrastructure designed for low-latency voice workloads. Plan-specific quotas and limits are documented here:
Quotas & limits
Thanks for the answers!
Just to clarify one thing: what about disabling or specifying the thinking configuration through the LiveKit Inference API rather than through the Google plugin API? Is that possible, or is there a way to invoke LiveKit Inference using the Google plugin directly?
I'm talking about the LLM instance from here:
livekit.agents.inference.llm
and not from here:
livekit.plugins.google.llm
Unfortunately, thinking_config is not available through LiveKit Inference; it is only supported when using the Google plugin directly.