We are using OpenAI GPT-5.4 mini and seeing an average TTFT of around 900 ms. Does LiveKit have data or insights on which model is fastest while still within budget?
Our main requirement is intelligence roughly on par with GPT-5.4 mini, but with lower latency.
We tried the Google Gemini Flash and Lite models, but they were very slow for us (>1 s TTFT). We're not sure why, since we expected Flash to be faster. We are using LiveKit Inference and the agent is deployed in the eu-central region.
Any thoughts or guidance on model selection would be much appreciated. Our input prompt is around 3.5k tokens and the output is around 10-80 tokens. The agent is doing roleplay, and on average there are 120 turns (AI + human).
I’m surprised the Flash and Lite models are very slow (>1s TTFT). Quite possibly this is somehow related to the large context, but I couldn’t say.
We do need to work on benchmarks for this kind of recommendation, and I’ll feed back internally, but my two cents would be to try:
- xai/grok-4-1-fast-non-reasoning (I hadn’t realised this was deprecated, thanks for pointing that out)
- openai/gpt-5.4-nano
And see how those perform in terms of latency / intelligence.
I expect others in this forum will have other opinions 
The context is quite small: a system prompt of around 3k tokens plus up to 150 turns, so the maximum input is only around 7k-10k tokens. But strangely, the Google Flash and Lite models are very slow, both with a custom API key and via LiveKit Inference, with TTFT above 1 second.
Yes, xai/grok-4-1-fast-non-reasoning is fast, but it's retired, as mentioned here: May 15, 2026 Model Retirement | xAI Docs
And openai/gpt-5.4-nano is good in terms of latency, but it's not that intelligent.
I wonder if others are also facing the same issue with Google's Flash and Lite models?
@Kaushal_Shah, before switching models, there are two TTFT levers on gpt-5.4-mini itself, both exposed through Inference. Your 3.5k system prompt is re-sent on every one of ~120 turns, so prefill is likely a large part of your TTFT. Set a prompt_cache_key so the provider caches that repeated prefix, and bump service_tier to the priority latency tier [ docs.livekit.io/reference/agents/inference-llm-parameters ]:
from livekit.agents import inference

llm = inference.LLM(
    model="openai/gpt-5.4-mini",
    extra_kwargs={
        "prompt_cache_key": "roleplay-sys-v1",  # caches your repeated 3.5k prefix
        "service_tier": "priority",
    },
)
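To get a feel for how much of each request is that repeated system prompt, here is a rough back-of-the-envelope sketch. The 3.5k prefix comes from this thread. The 45 tokens per turn is my own assumption (a rough midpoint of the 10-80 range), and `prefix_share` is just a hypothetical helper name, so plug in your real numbers:

```python
def prefix_share(prefix_tokens: int, turn: int, avg_turn_tokens: int) -> float:
    """Fraction of the prompt at a given turn that is the fixed system prefix."""
    history_tokens = turn * avg_turn_tokens
    return prefix_tokens / (prefix_tokens + history_tokens)


PREFIX_TOKENS = 3500    # system prompt size quoted in this thread
AVG_TURN_TOKENS = 45    # assumption: rough midpoint of the 10-80 token range

for turn in (0, 30, 60, 120):
    total = PREFIX_TOKENS + turn * AVG_TURN_TOKENS
    share = prefix_share(PREFIX_TOKENS, turn, AVG_TURN_TOKENS)
    print(f"turn {turn:>3}: ~{total} input tokens, {share:.0%} is the system prompt")
```

Under those assumptions the system prompt is the whole request at the start and still a sizeable slice late in the conversation, which is why caching the prefix is worth trying before anything else.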
On Flash/Lite being slow: “flash/lite” is tuned for throughput (tokens/sec), not first-token latency. TTFT is dominated by prefill and routing, so the name doesn’t predict it, and pinning down the eu-central Gemini number is exactly the benchmark that doesn’t exist publicly yet.
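Until a public benchmark exists, the practical route is to compare your own numbers. A minimal sketch, assuming you have already collected per-turn TTFT samples in milliseconds for each model you try (how you collect them is up to you; `summarize_ttft` and the model labels are hypothetical names, and the sample values are placeholders, not measurements):

```python
import math
import statistics


def summarize_ttft(samples_ms: list[float]) -> dict[str, float]:
    """Return mean, median (p50), and nearest-rank p95 of TTFT samples."""
    if not samples_ms:
        raise ValueError("need at least one TTFT sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.fmean(ordered),
        "p50": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Placeholder values purely for illustration; replace with your own measurements.
samples_by_model = {
    "model-a": [100.0, 200.0, 300.0, 400.0],
    "model-b": [150.0, 150.0, 150.0, 900.0],
}

for model, samples in samples_by_model.items():
    stats = summarize_ttft(samples)
    print(model, {name: round(value, 1) for name, value in stats.items()})
```

Looking at p95 alongside the mean matters for voice agents, since a model with a good average but occasional slow turns can still feel laggy.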
If you do want a swap for mini-level intelligence at lower TTFT, GPT OSS 120B is served on Cerebras and Groq through Inference [ docs.livekit.io/agents/models/inference ]; that hardware is built for low first-token latency, which fits your “smart but faster” target better than nano or the retired grok-fast.
Thanks a lot, I will try the prompt cache key and see if it improves TTFT.
I will also try GPT OSS 120B via LiveKit Inference.