Voxtral TTS API 1,230ms TTFB in real-time voice agent pipeline

Hi, I’m building a real-time voice AI agent using LiveKit and tested Voxtral TTS (voxtral-mini-tts-2603) via the /v1/audio/speech endpoint with SSE streaming.

I based my code on the merged PR #5245 in livekit/agents, "feat(mistral): add voxtral TTS support" by jeanprbt.

I’m consistently seeing ~1,230ms TTFB (time to first audio byte) on warm connections. For comparison, here’s what I’m getting from other TTS providers in the same pipeline:

Provider                     TTFB
Cartesia Sonic               ~40-90ms
Smallest.ai Lightning v3.1   ~250ms
Mistral Voxtral              ~1,230ms

My setup:

  • SSE streaming (stream: true)
  • Response format: mp3
  • Short conversational text (~1-2 sentences)
  • Measured from POST to first speech.audio.delta event
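For anyone who wants to reproduce the measurement, here is a minimal sketch of timing from the POST to the first speech.audio.delta event. The helper name `time_to_first_audio` is hypothetical, and the SSE framing is an assumption: I don't know whether the event name arrives on an `event:` line or in a JSON `type` field, so the sketch accepts either.

```python
import json
import time
from typing import Callable, Iterable, Optional


def time_to_first_audio(
    sse_lines: Iterable[str],
    started_at: float,
    clock: Callable[[], float] = time.perf_counter,
    audio_event: str = "speech.audio.delta",
) -> Optional[float]:
    """Return seconds from started_at to the first audio event, or None."""
    for raw in sse_lines:
        line = raw.strip()
        if line.startswith("event:"):
            # Assumed framing 1: the event name sits on an "event:" line.
            if line[len("event:"):].strip() == audio_event:
                return clock() - started_at
        elif line.startswith("data:"):
            # Assumed framing 2: the event name sits in a JSON "type" field.
            payload = line[len("data:"):].strip()
            try:
                obj = json.loads(payload)
            except json.JSONDecodeError:
                continue  # non-JSON payloads such as a terminator line
            if isinstance(obj, dict) and obj.get("type") == audio_event:
                return clock() - started_at
    return None  # stream ended without any audio event
```

To use it, record `started_at = time.perf_counter()` immediately before sending the POST and pass in the response's line iterator (for example `response.iter_lines()` on a streamed httpx response). Starting the clock before the request rather than after the headers arrive keeps connection and queueing time inside the TTFB figure.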

Is anyone else seeing similar latency? The docs mention ~90ms processing time, so I’m wondering whether I’m missing something in my configuration or whether this is expected during the early rollout period.

Unfortunately, it’s the same in our case: around 1,000ms TTFB. I had high hopes for this model.


According to the Text to Speech page in the Mistral Docs, “End-to-end API time-to-first-audio varies by format (~0.8s for pcm, ~3s for mp3)”, which is what you see. Unfortunately, 90 ms is the model processing time, not the TTFB. Additionally, Voxtral TTS doesn’t currently support input streaming.


That clarifies a lot, thanks @Josselin_Lecocq