Cutting LLM round-trip latency for the voice agent

Running a voice agent on LiveKit Agents (Silero VAD → STT → custom LLM logic → TTS). I added per-turn timing and found the gap the caller actually feels is basically all the LLM call — everything else (STT finalize, my logic, handing off to TTS) adds up to under 10ms, while the LLM round-trip is ~0.85–1s, sometimes spiking past 1.5s.

Stuff I’ve already tried that didn’t really help:

  • Swapping to a smaller/faster model → same latency
  • Token-streaming the reply → no faster first word (my replies are usually one sentence)
  • Tuning endpointing + trimming the prompt → small wins only

So it really seems like the LLM round-trip is the floor and I’m out of easy ideas.

How are you all keeping this low in production? Curious what’s actually working for people — provider/model choices, prefix caching, self-hosting, anything. Trying to get the whole turn under a second. Any real numbers would be awesome :folded_hands:

Your point makes sense; at a single-sentence level, streaming doesn’t really help much because TTS still has to wait until the sentence is complete before it can start. So the real bottleneck ends up being the LLM provider itself.

The biggest real-world improvement usually comes from switching to Groq. Llama 3.3 70B on Groq consistently hits around 100–200ms TTFT in production, compared to roughly 700ms–1.5s on OpenAI. That alone can often bring full pipelines down under ~500ms. Cerebras can go even faster if your workload fits their supported models.

If you’re working with multi-sentence responses, you can squeeze more gains by streaming tokens directly into TTS as they arrive. That overlap between LLM generation and speech synthesis can save another ~100–200ms. For single-sentence flows, early STT triggering tends to help more starting the LLM call at around 80% transcript confidence instead of waiting for final output can cut ~200–400ms in most cases.

For absolute minimum latency in production, self-hosting something like Qwen 2.5 7B on vLLM can push TTFT down to ~80–150ms. It does add operational overhead, but you get full control over latency.

Also worth building a fallback setup. Groq can spike under load, so having an 8B-instant style backup helps keep p99 latency stable.

Reference: Understand and Improve Voice Agent Latency | LiveKit

@Hemil_Parmar, The single biggest lever you haven’t named is the inference provider. If a smaller model didn’t help, the bottleneck is the provider’s TTFT, not model size. LK has two purpose-built fast-inference integrations for exactly this case.

Cerebras is the canonical pick when TTFT is the floor. LK positions it as “the world’s fastest inference” [ Cerebras and LiveKit | LiveKit Documentation ].

Wiring it in:

  from livekit.agents import AgentSession
  from livekit.plugins import cerebras

  session = AgentSession(
      llm=cerebras.LLM(model="llama-3.3-70b"),
      # vad, stt, tts, turn_detection here
  )

  Groq is the other fast-inference option, exposed directly through LK Inference as a string identifier [docs.livekit.io/agents/integrations/groq/]:

  session = AgentSession(
      llm="groq/gpt-oss-120b",
  )

Two more knobs on top of provider choice:

  • preemptive_tts=True on AgentSession options. Default is False; enabling it starts TTS on the first LLM chunk rather than waiting for the full response, even for single-sentence replies [ Turn-taking tuning | LiveKit Documentation ].
  • Prompt caching where the provider supports it. OpenAI exposes prompt_cache_key, Anthropic exposes cache_control blocks. Cuts TTFT measurably for repeated system prompts. Not LK-config, just pass the parameter through.

If you can change the architecture for a structural sub-second floor, the realtime speech-to-speech path (Gemini Live, OpenAI Realtime, xAI Realtime) bypasses the LLM round-trip entirely. Caveat: custom LLM logic ports less cleanly to RealtimeModel. Function calling works, but heavy custom retrieval or multi-step reasoning is harder to keep.

I already tried using groq, it is giving latency of around >1s,
{“chat.item”:{“message”:{“id”:“item_8d9fa0789466”,“role”:“ASSISTANT”,“content”:[{“text”:“मैं आपकी आवाज़ साफ़ सुन पा रही हूँ — बताइए, आपका क्या business है या किस तरह की मदद चाहिए?”}],“metrics”:{“started_speaking_at”:“2026-06-05T10:19:11.748Z”,“stopped_speaking_at”:“2026-06-05T10:19:16.556Z”,“llm_node_ttft”:1.3179019999988668,“tts_node_ttfb”:0.2668750999982876,“e2e_latency”:2.4381468296051025},“created_at”:“2026-06-05T10:19:10.141Z”}},“room_id”:“RM_WzARJxgwr557”,“job_id”:“AJ_qgKtrpS2HQna”,“logger.name”:“chat_history”,“lk.id”:“60aadf96-ecb2-9552-9300-a975dc98914a”}

which is significantly higher that I have seen others mention what mistake could I be making, any idea where I should look at?

1.3s TTFT on Groq is definitely higher than what I’d normally expect, so there’s probably something specific contributing to the delay.

A few things check first:

  1. Prompt size
    TTFT on Groq scales with input length. Try logging the token count for each turn. If you’re sending a large system prompt or extensive conversation history, that alone can add noticeable latency. A 1,000+ token prompt can push TTFT much higher than the published benchmarks.

  2. Model selection
    Which model are you running? llama-3.3-70b typically lands around ~180ms median TTFT under normal conditions. I’d test with llama-3.1-8b-instant first, which is usually in the 50–120ms range. That can quickly tell you whether the bottleneck is model-related or somewhere else in the stack.

  3. Rate limiting
    If you’re on a free or development tier, keep an eye on token limits. Once you start approaching those limits, latency can increase significantly. It’s worth checking the Groq dashboard while reproducing the issue to rule that out.

If all of the above looks healthy, it may be worth testing Cerebras as a comparison. The LiveKit plugin includes gzip + msgpack compression, which can help reduce TTFT when you’re working with larger prompts.