Stuck at 2.3s p50 after weeks of tuning - is the livekit.io homepage demo a classic chain or realtime speech?

Hi,

Running a bilingual (PT/EN) medical voice agent (Livia) on LiveKit Agents (Python) + LiveKit Cloud (Build tier). Tuned over several weeks with Gladia and ElevenLabs support on their respective ends. Sitting at ~2.3s p50 total turn latency. The voice demo on livekit.io’s homepage is obviously in a different class, so I’m trying to figure out if that’s reachable with a classic STT/LLM/TTS chain or if I’m kidding myself.

Use case

  • Bilingual phone + web, Portuguese PT + English with mid-utterance code-switching.
  • Inbound: Telnyx SIP → LiveKit Cloud (EU West B) → Python agent as a worker on UNRAID (i5-1340p, 32GB, ~500Mbps sym).
  • Measured TCP RTT agent to LK edge: ~50ms p50.
  • Tool calls out to a local n8n/Baserow for patient lookup, booking, escalation.

Current pipeline (tuned)

STT: Gladia Solaria-1
     languages=["pt","en"], code_switching=True,
     endpointing=0.03, pre_processing_speech_threshold=0.85,
     interim_results=True, region=eu-west

LLM: Claude Sonnet 4.6 via OpenRouter, provider pinned

TTS: ElevenLabs flash_v2_5 (voice "Maria") via native plugin (WS streaming)
     auto_mode=False, chunk_length_schedule=[200,280,350,400]

VAD: Silero defaults
Turn detection: MultilingualModel()
preemptive_generation=True

For the LLM I run a background benchmark every 60 minutes across Sonnet 4.0 and 4.6, via OpenRouter->Anthropic, OpenRouter->Google Vertex, and Anthropic API direct. It picks the winner by p90 TTFT (p50 hides the spikes that bite you in voice) and writes it to a file the agent reads at the start of each call. The config is always on whichever route is least spiky right now.
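The benchmark loop above can be sketched roughly like this. Everything here is illustrative: the route names, the output path, and `measure_ttft` (which in a real run would open one streaming completion per sample and stop the clock at the first content delta) are assumptions, not the actual implementation.

```python
# Hourly TTFT benchmark sketch: sample each route, score by p90, write
# the winner to a file the agent reads at call start.
import json
import random
import statistics
import time
from pathlib import Path

# Illustrative route labels, not real OpenRouter identifiers.
ROUTES = [
    "openrouter->anthropic",
    "openrouter->vertex",
    "anthropic-direct",
]


def measure_ttft(route: str) -> float:
    """Placeholder: time-to-first-token of one streaming request, in ms.
    A real version would issue a streaming completion on `route` and
    return the latency of the first content chunk. Simulated here so
    the sketch runs standalone."""
    spike = 400 if random.random() < 0.05 else 0  # occasional provider spike
    return abs(random.gauss(700, 120)) + spike


def p90(samples: list[float]) -> float:
    # p90 rather than p50: the rare multi-second spikes are what break a
    # voice turn, and the median hides them entirely.
    return statistics.quantiles(samples, n=10)[-1]


def pick_winner(n_samples: int = 20) -> str:
    scores = {r: p90([measure_ttft(r) for _ in range(n_samples)]) for r in ROUTES}
    return min(scores, key=scores.get)


def run_once(out_path: Path) -> str:
    """One benchmark pass: pick the least-spiky route and persist it."""
    winner = pick_winner()
    out_path.write_text(json.dumps({"route": winner, "ts": time.time()}))
    return winner
```

In production this would sit on a 60-minute timer, and the agent would read the JSON at session start rather than importing anything from the benchmark.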

Measured latencies (p50, real call data)

Component            p50
EOU (STT finalize)   633 ms
LLM TTFT             1,024 ms
TTS TTFB             629 ms
Total                2,286 ms

Best single turn we’ve clocked: 1,678 ms (short English exchange). Typical isn’t close.

Where the time is going now

LLM TTFT is the big one. Provider pinning helped (677ms p50 pinned vs 790ms auto-routed on a microbench), but p90 is still painful and ~1s feels like the floor for frontier models in a classic chain. That’s what’s making me suspect the homepage demo isn’t classic STT/LLM/TTS at all.
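For microbenching TTFT consistently across routes, one approach is to time the first non-empty content delta on any streaming response. This is a hedged sketch, not a LiveKit or OpenRouter API: `get_text` is a hypothetical adapter you supply for whatever chunk shape your client returns.

```python
# Time-to-first-token of a streaming completion, in milliseconds.
# `stream` can be any iterable of chunks, which also makes this easy
# to test offline with a fake generator.
import time


def ttft_ms(stream, get_text=lambda c: c) -> float:
    t0 = time.monotonic()
    for chunk in stream:
        if get_text(chunk):  # stop at the first non-empty content delta
            return (time.monotonic() - t0) * 1000.0
    raise RuntimeError("stream ended without any content")
```

With an OpenAI-compatible SDK pointed at OpenRouter you would pass something like `get_text=lambda c: c.choices[0].delta.content or ""` (exact chunk shape depends on your client version).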

Tried and dropped

  • Deepgram Nova-3 multi mode: kills short PT words (“terça-feira” → “febre”). Medical vocab unusable.
  • Cartesia Beatriz: English with a PT accent. Quality regression vs Maria.
  • Azure STT + Azure Emma TTS: 3.1s p50, EOU alone was 1.3s driven by non-configurable buffers.
  • ElevenLabs ConvAI (prior engine): 3-4s and irreducible; that's why we moved to LK Agents.

Questions

  1. The livekit.io homepage voice demo: classic STT/LLM/TTS chain or end-to-end realtime speech model? Not asking for vendor names, just the category. If it’s realtime speech, 1s LLM TTFT in a classic chain is a structural ceiling and I need to rethink architecture.

  2. Anyone running tuned bilingual PT/EN under 1s total p50? If yes, what LLM? We’re on Sonnet 4.6 for medical context + tool calls. Curious if people downshift to smaller/faster models and how they handle quality.

  3. Anything LK-side to shave LLM TTFT? Streaming config I’m missing, prompt caching on the OpenAI-compatible adapter when talking to OpenRouter, session reuse, anything.

  4. Per-turn trace view in LK Cloud that breaks down EOU / TTFT / TTS TTFB / playout with media-path timing? I have agent-side logs but not the media side.

Happy to post a room SID.

Nuno


The livekit.io homepage voice demo: classic STT/LLM/TTS chain or end-to-end realtime speech model? Not asking for vendor names, just the category. If it’s realtime speech, 1s LLM TTFT in a classic chain is a structural ceiling and I need to rethink architecture.

Most of them are pipeline agents, but at least one is realtime. The details are shown in the 'Agent Configuration' section next to each agent as you switch between them. I wrote a guide to address this question since it used to come up frequently: How to match the latency of the homepage agent | LiveKit

Although the voice agent code on our homepage isn't open source, I can probably share configurations over DM if there's a specific agent you're interested in, e.g. Hayley.

The other blog I usually point people towards for latency is this one: Understand and Improve Agent Latency | LiveKit. It sounds like you have a good handle on everything discussed there, but it's worth a look.

I’m also interested in others’ answers to your questions 2-4 :thinking:

  1. I’ve not seen Claude models used much in voice AI - they’re notoriously slow to first token. Have you tried other models, especially the GPT mini variants? For complex flows, multi-model deployments are one way to improve the latency/quality trade-off.
  2. Ensuring everything streams can be tricky, especially tool calls (and their dependencies)
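On the multi-model point, a crude illustration of what routing between a fast model and a frontier model could look like. The model names, keyword hints, and length threshold below are all hypothetical; in practice the routing signal would come from your tool-call planner or a classifier, not keywords.

```python
# Hypothetical multi-model routing: easy turns go to a low-TTFT model,
# anything touching tools or medical reasoning stays on the frontier model.
FAST_MODEL = "gpt-mini-variant"   # assumption: any low-TTFT model
SLOW_MODEL = "claude-sonnet-4.6"  # the quality model from the post

# Illustrative PT/EN hints that a turn will need booking/escalation tools.
TOOL_HINTS = ("marcar", "consulta", "book", "appointment", "escalate")


def pick_model(user_turn: str, needs_tools: bool) -> str:
    if needs_tools or any(h in user_turn.lower() for h in TOOL_HINTS):
        return SLOW_MODEL
    # Short confirmations / chit-chat: the fast model keeps TTFT low
    # without risking the medical context.
    if len(user_turn.split()) <= 8:
        return FAST_MODEL
    return SLOW_MODEL
```

The trade-off is that every misroute of a complex turn to the fast model costs quality, so most deployments bias heavily toward the slow model and only downshift on high-confidence easy turns.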