Has anyone benchmarked the various STT options in LiveKit in terms of WER and latency, especially Deepgram Flux, Nova-2 and Nova-3?
@Ayush_Kumar_Singh, no public LiveKit benchmark compares those three side by side; the only numbers available are Deepgram’s own claims. They publish Nova-3 at 6.84% median WER on real-time streams with “comparable latency to Nova-2” [Introducing Nova-3: Setting a New Standard for AI-Driven Speech-to-Text]. Treat those as vendor numbers; production WER depends heavily on your audio domain.
The more useful framing is that Flux and Nova-3 aren’t really competing on the same axis. Nova-3 is general-purpose transcription on /listen/v1. Flux uses /listen/v2 and has a custom phrase-endpointing model using acoustic + semantic cues, designed for turn-based conversational audio [Deepgram STT | LiveKit Documentation]. With Flux you set turn_detection="stt" and let the STT handle turn-taking; with Nova-3 you do endpointing on the agent side.
So the question for a voice-agent pipeline is usually “where does turn detection live?” not raw WER. If you want it inside the STT, Flux. If you want your own endpointing with general transcription, Nova-3. For real WER on your audio, the only number that matters is the one you measure against a representative slice of your own calls.
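If it helps, here is a minimal sketch of how you could score WER yourself once you have human reference transcripts and the STT output for the same calls. The function names (`edit_distance`, `word_error_rate`, `corpus_wer`) and the lowercase-and-split-on-whitespace normalization are my own assumptions for illustration, not anything from LiveKit or Deepgram; real evaluations usually also normalize punctuation, numbers and casing more carefully.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def _tokens(text: str) -> List[str]:
    # Assumed normalization: lowercase and split on whitespace only.
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance: edits divided by reference word count."""
    ref = _tokens(reference)
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, _tokens(hypothesis)) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over many (reference, hypothesis) pairs, weighted by length."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref = _tokens(reference)
        total_edits += edit_distance(ref, _tokens(hypothesis))
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return total_edits / total_words
```

Run the same set of calls through each STT you are comparing and call `corpus_wer` once per model. Summing edits over the whole set, rather than averaging per-utterance WER, keeps short utterances from dominating the result.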
Hey Ayush,
I ran my own benchmark on Dutch over PSTN, and in my testing ElevenLabs Scribe v2 Realtime was the best STT by far, beating Nova-3 by a wide margin. DM me if you want me to send you the JSON object with the benchmark results.