ElevenLabs pretty slow despite using inference

Team,

I am struggling with 11Labs performance within LiveKit. I am getting a 3-4 second turnaround on ElevenLabs. I am trying to migrate from 11Labs and hence trying to keep the same setup for now, but it's at least a second slower.

Has anyone been able to get decent performance with 11Labs? Is it possible to achieve 2 seconds or less?

Thanks

Current stack:
STT: scribe_v2_realtime (temp: 0.0, lang: en)
TTS: eleven_flash_v2_5 via inference ("speed": 0.8, "style": 0.5, "auto_mode": true, "stability": 0.8, "sync_alignment": true, "similarity_boost": 0.9, "use_speaker_boost": true)
LLM: gpt-4.1-mini ("temperature": 0.3)

I am also running VAD:
"min_speech_duration": 0.05, "activation_threshold": 0.35, "min_silence_duration": 0.2

I am using preemptive generation.

Library versions:
```
livekit = "^1.0.19"
livekit-api = "^1.0.7"
livekit-agents = "^1.3.10"
livekit-plugins-openai = "^1.3.6"
livekit-plugins-silero = "^1.3.6"
livekit-plugins-deepgram = "^1.3.10"
livekit-plugins-turn-detector = "^1.3.6"
livekit-plugins-noise-cancellation = "~=0.2"
```

Do you know which components are contributing to the delay?

If you haven’t seen it, there is a blog to understand agent latency:

It also covers using agent observability to understand which part of the pipeline the latency is coming from

@Pete, three levers ranked by typical savings:

  • STT: scribe_v2_realtime is slower than Deepgram streaming. You already have livekit-plugins-deepgram; switch to nova-3 streaming for several hundred ms of TTFT savings.

  • TTS knobs: your params are tuned for fidelity. For latency: style: 0, stability: 0.5, similarity_boost: 0.5, speed: 1.0, drop sync_alignment, keep auto_mode: true. Each fidelity-tuned param adds server-side processing time on Flash v2.5.

  • Turn detection: swap “vad” for MultilingualModel from livekit-plugins-turn-detector. Semantic EOU often fires before your 200ms silence window.
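To make the TTS bullet concrete, here is a minimal sketch that compares the settings from the original post with the latency-oriented values suggested above. It only manipulates plain dictionaries; how these keys are actually passed to the TTS depends on the plugin or inference version in use and is not shown. The `diff_settings` helper is hypothetical, and `use_speaker_boost` is left unchanged because the suggestion does not mention it.

```python
# Settings quoted in the original post (tuned for fidelity).
current = {
    "speed": 0.8,
    "style": 0.5,
    "auto_mode": True,
    "stability": 0.8,
    "sync_alignment": True,
    "similarity_boost": 0.9,
    "use_speaker_boost": True,
}

# Latency-oriented values from the reply above.
# use_speaker_boost is not mentioned there, so it is carried over as-is.
suggested = {
    **current,
    "style": 0,
    "stability": 0.5,
    "similarity_boost": 0.5,
    "speed": 1.0,
}
suggested.pop("sync_alignment")  # "drop sync_alignment"


def diff_settings(old, new):
    """Return {key: (old_value, new_value)} for keys that changed or were removed."""
    changes = {}
    for key in old.keys() | new.keys():
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


changes = diff_settings(current, suggested)
for key in sorted(changes):
    print(f"{key}: {changes[key][0]} -> {changes[key][1]}")
```

Changing one setting at a time and re-measuring would show which of these actually affects TTS latency in a given setup.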

Also, you’re on livekit-agents==1.3.10; latest is 1.5.9. inference_class="priority" for lower-TTFT routing on LK Inference shipped in 1.5.7.

Run @darryncampbell’s observability breakdown first. The Sessions dashboard splits each turn into STT/EOU/LLM/TTS so you’ll know which knob actually moves your number.
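As a rough illustration of how to read such a per-turn breakdown, the sketch below sums the stage latencies and reports the largest contributor. The numbers are hypothetical placeholders, not measurements from this thread, and `slowest_stage` is a made-up helper. Stages can overlap in practice (for example with preemptive generation), so the simple sum is only an approximation of the perceived turnaround.

```python
def slowest_stage(turn_ms):
    """Return (stage, share_of_total) for the stage with the highest latency."""
    total = sum(turn_ms.values())
    stage = max(turn_ms, key=turn_ms.get)
    return stage, turn_ms[stage] / total


# Hypothetical latencies for one turn, in milliseconds.
# Replace these with the values reported for your own sessions.
example_turn = {"stt": 400, "eou": 300, "llm": 1700, "tts": 600}

stage, share = slowest_stage(example_turn)
print(f"total: {sum(example_turn.values())} ms")
print(f"largest contributor: {stage} ({share:.0%})")
```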

Thanks @darryncampbell I was able to get into more detail on the delays, thanks to your article. There were a few.

I was able to improve STT by switching to Deepgram nova-3.

TTS was tuned for latency performance.

Both thanks to @Muhammad_Usman_Bashir

I am left with LLM performance, which is abysmal. I understand geography plays a big part.

Gemini 2.5 flash with 0 temp and 0 thinking budget produces 1.6 sec TTFT on average (mode)
Gemini 3.1 flash lite with 0 temp and minimal reasoning effort produces 0.9 sec TTFT
Both are faster when called directly from Google than through LiveKit Inference

Meanwhile, GPT 5.2 chat (0 temp, low effort) produces a larger variance of TTFT, with mode around 1.75 seconds
In ElevenLabs, GPT 5.2 produces TTFT in the 400 ms to 550 ms range
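For comparing providers with high variance like this, it can help to record TTFT per request and look at the median and tail rather than a single typical value. The following is a minimal sketch using only the standard library. `fake_stream` stands in for a real streaming LLM response, and the sample values are hypothetical, not the measurements quoted above.

```python
import statistics
import time


def time_to_first_token(stream):
    """Seconds elapsed until the iterator yields its first item."""
    start = time.perf_counter()
    next(iter(stream))
    return time.perf_counter() - start


def summarize(samples):
    """Median, approximate 95th percentile, and spread of TTFT samples (seconds)."""
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return {
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "stdev": statistics.stdev(ordered),
    }


def fake_stream(delay_s):
    """Hypothetical stand-in for a streaming response: waits, then yields tokens."""
    time.sleep(delay_s)
    yield "first"
    yield "second"


ttft = time_to_first_token(fake_stream(0.02))
print(f"measured TTFT of the fake stream: {ttft:.3f} s")

# Hypothetical TTFT samples in seconds, one per request.
samples = [1.6, 1.7, 1.75, 1.8, 1.75, 2.9, 1.2, 1.75]
summary = summarize(samples)
print(summary)
```

Running the same measurement against both the vendor endpoint and inference, from the region where the agent is deployed, gives a like-for-like comparison.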

In your experience, do you find inference faster, or is it better to use the models directly from the vendor?

It can be, but in general you should not expect an order of magnitude improvement.

> I understand geography plays a big part.

The one exception is where we self-host models in LiveKit Cloud, colocated with the agent (example here). In that case I would expect a noticeable improvement.