Hi,
Running a bilingual (PT/EN) medical voice agent (Livia) on LiveKit Agents (Python) + LiveKit Cloud (Build tier). Tuned over several weeks with Gladia and ElevenLabs support on their respective ends. Sitting at ~2.3s p50 total turn latency. The voice demo on livekit.io’s homepage is obviously in a different class, so I’m trying to figure out if that’s reachable with a classic STT/LLM/TTS chain or if I’m kidding myself.
Use case
- Bilingual phone + web, Portuguese PT + English with mid-utterance code-switching.
- Inbound: Telnyx SIP → LiveKit Cloud (EU West B) → Python agent as a worker on UNRAID (i5-1340p, 32GB, ~500Mbps sym).
- Measured TCP RTT agent to LK edge: ~50ms p50.
- Tool calls out to a local n8n/Baserow for patient lookup, booking, escalation.
Current pipeline (tuned)
STT: Gladia Solaria-1
languages=["pt","en"], code_switching=True,
endpointing=0.03, pre_processing_speech_threshold=0.85,
interim_results=True, region=eu-west
LLM: Claude Sonnet 4.6 via OpenRouter, provider pinned
TTS: ElevenLabs flash_v2_5 (voice "Maria") via native plugin (WS streaming)
auto_mode=False, chunk_length_schedule=[200,280,350,400]
VAD: Silero defaults
Turn detection: MultilingualModel()
preemptive_generation=True
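For concreteness, here's roughly how that config wires into an `AgentSession` — a sketch, not verbatim from production. The Gladia/ElevenLabs parameter names are copied from the tuned config above and the OpenRouter model slug and voice id are illustrative, so verify both against the installed plugin signatures:

```python
from livekit.agents import AgentSession
from livekit.plugins import elevenlabs, gladia, openai, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt=gladia.STT(
        languages=["pt", "en"],
        code_switching=True,
        endpointing=0.03,
        pre_processing_speech_threshold=0.85,
        interim_results=True,
        region="eu-west",
    ),
    # OpenRouter through the OpenAI-compatible adapter, provider pinned
    # by the hourly benchmark (model slug illustrative).
    llm=openai.LLM(
        model="anthropic/claude-sonnet-4.6",
        base_url="https://openrouter.ai/api/v1",
    ),
    tts=elevenlabs.TTS(
        model="eleven_flash_v2_5",
        voice_id="<Maria's voice id>",
        chunk_length_schedule=[200, 280, 350, 400],  # auto_mode off
    ),
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),
    preemptive_generation=True,
)
```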
For the LLM I run a background benchmark every 60 minutes across Sonnet 4.0 and 4.6, via OpenRouter->Anthropic, OpenRouter->Google Vertex, and Anthropic API direct. Picks the winner by p90 TTFT (p50 hides the spikes that bite you in voice) and writes it to a file the agent reads at each call start. Config is always on whatever’s least spiky right now.
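The selection logic itself is trivial — the point is ranking routes by p90, not p50. A minimal sketch (route names and file path hypothetical; p90 via nearest-rank over the sampled TTFTs):

```python
import json
import math
from pathlib import Path

def p90(samples_ms: list[float]) -> float:
    """Nearest-rank 90th percentile of a list of TTFT samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]

def pick_winner(ttft_by_route: dict[str, list[float]]) -> str:
    """Return the route with the lowest p90 TTFT."""
    return min(ttft_by_route, key=lambda route: p90(ttft_by_route[route]))

# Illustrative samples (ms): the first and third routes have better
# medians but spike; p50 would pick the wrong winner here.
routes = {
    "openrouter/anthropic": [610, 640, 700, 2100, 650],
    "openrouter/vertex":    [720, 730, 760, 780, 800],
    "anthropic/direct":     [590, 600, 1900, 620, 640],
}
winner = pick_winner(routes)  # "openrouter/vertex"

# Persist for the agent to read at call start.
Path("llm_route.json").write_text(json.dumps({"route": winner}))
```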
Measured latencies (p50, real call data)
| Component | p50 |
|---|---|
| EOU (STT finalize) | 633 ms |
| LLM TTFT | 1,024 ms |
| TTS TTFB | 629 ms |
| Total | 2,286 ms |
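(Note the component p50s sum exactly to the Total row, so read Total as a sum of per-stage medians rather than an independently measured median of per-turn totals. Quick arithmetic on the budget — hitting a 1 s total means shedding over half the chain:)

```python
# Current p50 latency budget per turn, from the table above (ms).
budget_ms = {"eou": 633, "llm_ttft": 1024, "tts_ttfb": 629}
total_ms = sum(budget_ms.values())
print(total_ms)  # 2286

# Fraction of the chain that has to go to land at a 1 s total.
target_ms = 1000
print(round(1 - target_ms / total_ms, 2))  # 0.56
```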
Best single turn we’ve clocked: 1,678 ms (short English exchange). Typical isn’t close.
Where the time is going now
LLM TTFT is the big one. Provider pinning helped (677ms p50 pinned vs 790ms auto-routed on a microbench), but p90 is still painful and ~1s feels like the floor for frontier models in a classic chain. That’s what’s making me suspect the homepage demo isn’t classic STT/LLM/TTS at all.
Tried and dropped
- Deepgram Nova-3, multi (code-switching) mode: kills short PT words (“terça-feira” → “febre”). Medical vocab unusable.
- Cartesia Beatriz: English with a PT accent. Quality regression vs Maria.
- Azure STT + Azure Emma TTS: 3.1s p50, EOU alone was 1.3s driven by non-configurable buffers.
- ElevenLabs ConvAI (prior engine): 3-4s, irreducible. Why we moved to LK Agents.
Questions
- The livekit.io homepage voice demo: classic STT/LLM/TTS chain or end-to-end realtime speech model? Not asking for vendor names, just the category. If it’s realtime speech, 1s LLM TTFT in a classic chain is a structural ceiling and I need to rethink architecture.
- Anyone running tuned bilingual PT/EN under 1s total p50? If yes, what LLM? We’re on Sonnet 4.6 for medical context + tool calls. Curious if people downshift to smaller/faster models and how they handle quality.
- Anything LK-side to shave LLM TTFT? Streaming config I’m missing, prompt caching on the OpenAI-compatible adapter when talking to OpenRouter, session reuse, anything.
- Is there a per-turn trace view in LK Cloud that breaks down EOU / TTFT / TTS TTFB / playout with media-path timing? I have agent-side logs but not the media side.
Happy to post a room SID.
Nuno