Hi, I’m building a real-time voice AI agent using LiveKit and tested Voxtral TTS (voxtral-mini-tts-2603) via the /v1/audio/speech endpoint with SSE streaming.
I based my code on the merged PR "feat(mistral): add voxtral TTS support" by jeanprbt (Pull Request #5245, livekit/agents).
I’m consistently seeing ~1,230ms TTFB (time to first audio byte) on warm connections. For comparison, here’s what I’m getting from other TTS providers in the same pipeline:
| Provider | TTFB |
|---|---|
| Cartesia Sonic | ~40-90ms |
| Smallest.ai Lightning v3.1 | ~250ms |
| Mistral Voxtral | ~1,230ms |
My setup:
- SSE streaming (`stream: true`)
- Response format: `mp3`
- Short conversational text (~1-2 sentences)
- Measured from POST to first `speech.audio.delta` event
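A minimal sketch of this kind of measurement, for anyone who wants to compare numbers. It assumes the SSE response is consumed line by line and that the event name `speech.audio.delta` appears verbatim in one of the stream lines. The helper name `time_to_first_event` is hypothetical; it is not part of the LiveKit plugin or any Mistral SDK.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_event(
    lines: Iterable[str],
    event_name: str = "speech.audio.delta",
    start: Optional[float] = None,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds elapsed from `start` until the first line mentioning `event_name`.

    `lines` is any iterable of decoded SSE lines (assumption: the event name
    shows up verbatim, either in an `event:` field or in the `data:` payload).
    Returns None if the stream ends without a matching line.
    """
    if start is None:
        # Fall back to "now" if the caller did not record a timestamp
        # just before sending the POST.
        start = clock()
    for line in lines:
        if event_name in line:
            return clock() - start
    return None
```

To match the "POST to first event" definition above, `start` should be captured with `time.perf_counter()` immediately before the request is sent, and `lines` should be the streaming line iterator of whatever HTTP client is in use. Reusing an already-open connection keeps TLS and connection setup out of the number.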
Is anyone else seeing similar latency? The docs mention ~90ms processing time, so I'm wondering whether I'm missing something in my configuration or whether this is expected during the early rollout period.