Query about fastest TTFT livekit inference model

Kaushal_Shah · June 17, 2026, 11:47am

We are using OpenAI GPT-5.4 mini and getting around 900 ms average TTFT. We are wondering whether LiveKit has more data or insights about which model is fastest while still within budget.

The main requirement is intelligence roughly on par with GPT-5.4 mini, but with lower TTFT.

We tried the Google Gemini Flash and Lite models, but they are very slow for us (>1 s TTFT). We are not sure why, since we expected "Flash" to mean faster. We are using LiveKit Inference and the agent is deployed in the eu-central region.

Any thoughts or guidance on the best model selection would be much appreciated. Our input prompt is around 3.5k tokens and the output is around 10-80 tokens. The agent is doing roleplay, and on average there are 120 turns (AI + human).

darryncampbell · June 17, 2026, 1:07pm

I’m surprised the Flash and Lite models are very slow (>1 s TTFT). Quite possibly this is related to the large context, but I couldn’t say.

We do need to work on benchmarks for this kind of recommendation, and I’ll feed back internally, but my two cents would be to try:

~~xai/grok-4-1-fast-non-reasoning~~ (I hadn’t realised this was deprecated, thanks for pointing that out)
openai/gpt-5.4-nano

And see how those perform in terms of latency / intelligence.

I expect others in this forum will have other opinions.

Kaushal_Shah · June 17, 2026, 2:28pm

Context is very minimal: around a 3k-token system prompt and 150 turns, so the max input is only around 7k-10k tokens, which is very small I think. But strangely, the Google Flash and Lite models are very slow, both with a custom API key and via LiveKit Inference; TTFT is over 1 second.

Yes, xai/grok-4-1-fast-non-reasoning is fast, but it's retired, as mentioned here: May 15, 2026 Model Retirement | xAI Docs

And openai/gpt-5.4-nano is good in terms of latency, but it's not that intelligent.

I wonder if others are also facing the same issue with Google's Flash and Lite models?

Muhammad_Usman_Bashir · June 17, 2026, 6:57pm

@Kaushal_Shah, before switching models, there are two TTFT levers on gpt-5.4-mini itself, both exposed through Inference. Your 3.5k system prompt is re-sent on every one of ~120 turns, so prefill is likely a large share of your TTFT. Set a prompt_cache_key so the provider caches that repeated prefix, and bump service_tier to the priority latency tier [ docs.livekit.io/reference/agents/inference-llm-parameters ]:

from livekit.agents import inference

llm = inference.LLM(
    model="openai/gpt-5.4-mini",
    extra_kwargs={
        "prompt_cache_key": "roleplay-sys-v1",  # caches your repeated 3.5k prefix
        "service_tier": "priority",
    },
)
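To put a rough number on how much repeated prefix is involved, here is a back-of-the-envelope sketch. It assumes one LLM call per AI turn (so about 60 calls for 120 total turns) and no caching at all; both are assumptions for illustration, and the function name is made up. If every one of the 120 turns triggered a call, the total would simply double.

```python
def resent_prefix_tokens(system_prompt_tokens: int, llm_calls: int) -> int:
    """Total system-prompt tokens the provider has to prefill over one
    conversation when the repeated prefix is never served from a cache."""
    return system_prompt_tokens * llm_calls


# Figures from this thread; the call count is an assumption (half of ~120 turns).
SYSTEM_PROMPT_TOKENS = 3500
LLM_CALLS = 60

total = resent_prefix_tokens(SYSTEM_PROMPT_TOKENS, LLM_CALLS)
print(f"{total} system-prompt tokens prefilled per conversation")  # 210000
```

That repeated prefix is the part a prompt cache can help with; the growing chat history after it still has to be processed on each call.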

On Flash/Lite being slow: “flash/lite” is tuned for throughput (tokens/sec), not first-token latency. TTFT is dominated by prefill and routing, so the name doesn’t predict it, and pinning down the eu-central Gemini number is exactly the benchmark that doesn’t exist publicly yet.
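Since there is no public benchmark, one option is to measure TTFT yourself from the agent's region. The sketch below is a generic timing helper, not a LiveKit API: measure_ttft and fake_stream are hypothetical names, and fake_stream only simulates a model so the snippet runs on its own. In practice you would pass a function that opens a real streaming completion and yields text chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Return (seconds until the first non-empty chunk, that chunk).

    The clock starts before the stream is opened, so connection setup,
    routing, and prefill are all included in the measurement.
    """
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:
            return time.perf_counter() - started, chunk
    raise RuntimeError("stream ended without producing any tokens")


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a real model stream: waits, then yields two chunks."""
    time.sleep(delay_s)
    yield "Hello"
    yield " there"


ttft_s, first_chunk = measure_ttft(fake_stream)
print(f"TTFT: {ttft_s * 1000:.0f} ms, first chunk: {first_chunk!r}")
```

Running this a few dozen times per model with your real 3.5k prompt, and comparing medians rather than single samples, should give a fairer picture than the model name does.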

If you do want a swap for mini-level intelligence at lower TTFT, GPT OSS 120B is served on Cerebras and Groq through Inference [ docs.livekit.io/agents/models/inference ]; that hardware is built for low first-token latency, which fits your “smart but faster” target better than nano or the retired grok-fast.

Kaushal_Shah · June 17, 2026, 8:56pm

Thanks a lot, I will try the prompt cache key and see if it improves TTFT.
I will also try GPT OSS 120B via LiveKit Inference.
