I am using Deepgram Nova-3 for STT in my pipeline. The issue I am facing is that it sometimes does not transcribe answers correctly. While the overall transcription accuracy is quite impressive, there are occasional inaccuracies.
The pipeline performs very well for longer sentences and conversations. However, it struggles more with short, single-word responses, where the transcription quality is noticeably lower.
Do you have any recommendations regarding model selection or parameter tuning that could help improve accuracy for short utterances and single-word answers?
@Umer_Usman bhai, short-utterance dropouts on Nova-3 usually come from one of two things: the model has no context to disambiguate (deep-context models lean on surrounding tokens, which a single word doesn’t give them), or endpointing is cutting the audio before the word finishes.
For domain-bounded vocabulary (yes/no, names, products, status words), the biggest lever is keyterm. Nova-3 supports up to 100 terms with documented confidence lifts on isolated words ("tretinoin" 0.712 → 0.965, "escalation" 0.765 → 0.981 per Deepgram's published numbers) [ developers.deepgram.com/docs/keyterm ]. The LK Deepgram plugin exposes it directly [ docs.livekit.io/agents/models/stt/deepgram/ ].
If your expected answers are open-ended, check endpointing (LK default 25ms, worth bumping to 50-100ms if waveforms show words getting cut mid-syllable) and set language explicitly, since auto-detection has less to work with on one-word audio.
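To see those three knobs side by side, here is a minimal sketch that builds the query string for Deepgram's streaming endpoint directly. It is not the LiveKit plugin API: `build_listen_url` is a hypothetical helper, the endpoint and the `model`, `language`, `endpointing` and `keyterm` parameter names are what I understand Deepgram's streaming API to accept, and 100 ms is just the upper end of the range suggested above. With the LiveKit plugin you would set the equivalent options on the STT constructor, so check the plugin docs for the exact argument names.

```python
from urllib.parse import urlencode, quote


def build_listen_url(keyterms, endpointing_ms=100, language="en"):
    """Hypothetical helper: assemble a Deepgram streaming URL with explicit settings."""
    params = [
        ("model", "nova-3"),
        ("language", language),               # explicit language instead of auto-detection
        ("endpointing", str(endpointing_ms)), # silence (ms) before the utterance is finalized
    ]
    # keyterm is sent once per term (repeated query parameter)
    params += [("keyterm", term) for term in keyterms]
    # quote_via=quote encodes spaces as %20 rather than '+'
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params, quote_via=quote)


url = build_listen_url(["tretinoin", "escalation"])
print(url)
```

Printing the URL makes it easy to confirm that the settings you think are active are the ones actually being sent.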
If the use case is heavily turn-based and you can switch, Flux is the alternative: its phrase-endpointing model uses acoustic and semantic cues, and it supports keyterm too.
@Muhammad_Usman_Bashir I tried using Keyterm, but it did not improve the performance. I went through the documentation here: developers.deepgram.com/docs/keyterm . It mentions that generic words should not be used as keyterms. However, my use case mainly involves generic terms such as “yes,” “no,” or numbers like “one,” “two,” “three,” “four,” and “five.” I added these as keyterms anyway, but I’m still having difficulty with recognition.
I also tried increasing the endpointing time from 25 ms to 50 ms and 100 ms. This sometimes helps capture the words, but not consistently. The tradeoff is that it increases latency. Since users often respond with a single word, they expect the agent to reply almost immediately.
Are there any other settings or tweaks that I could try to improve this?
Since you already experimented with endpointing, the next things I would look at are your VAD config and audio quality. VAD misconfiguration can clip short utterances before they reach the model, which would explain why longer sentences are fine. Have you looked at your VAD settings? What is your current VAD setup? Also, have you tried any audio enhancement tools to rule out noise as a factor?
I am currently using the default settings for Silero VAD and the ai_coustics.EnhancerModel.QUAIL_VF_S model for noise suppression and audio enhancement.
@Umer_Usman, since keyterm and endpointing are exhausted, there are two things to isolate given your chain.
The enhancer first: QUAIL_VF_S runs before STT, and enhancement tuned for continuous speech can smooth the brief, low-energy onsets of isolated words (the s/f/th in "yes"/"five"/"three") while leaving sentences intact, which matches your “sentences fine, single words bad” split. A/B the same utterances with it bypassed to confirm it’s net-positive here (the audio-quality angle raised above).
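A rough way to score that A/B, assuming you have recorded a set of single-word utterances and collected the transcripts once with the enhancer on and once with it bypassed. Everything here is illustrative: `exact_match_rate` is a hypothetical helper, and the transcript lists are placeholders showing the shape of the data, not measurements.

```python
def exact_match_rate(expected, transcripts):
    """Fraction of utterances whose transcript equals the expected word after light normalization."""
    if len(expected) != len(transcripts):
        raise ValueError("expected and transcripts must have the same length")

    def norm(text):
        # ignore case, surrounding whitespace and trailing punctuation
        return text.strip().lower().strip(".,!?")

    hits = sum(norm(e) == norm(t) for e, t in zip(expected, transcripts))
    return hits / len(expected)


# Placeholder data only: replace with transcripts from your own recordings.
expected = ["yes", "five", "three", "no"]
with_enhancer = ["yes", "", "tree", "no"]
bypassed = ["Yes.", "five", "three", "no"]

print("enhancer on :", exact_match_rate(expected, with_enhancer))
print("enhancer off:", exact_match_rate(expected, bypassed))
```

Run it on the same audio files for both conditions so the only variable is the enhancer.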
Then Silero: defaults are activation_threshold=0.5, min_speech_duration=0.05, prefix_padding_duration=0.5 [ livekit/agents silero vad.py ]. Lower the threshold so a quick, soft onset still trips VAD on a one-word turn:
from livekit.plugins import silero
# activation_threshold lowered from the 0.5 default; min_speech_duration left at its default
vad = silero.VAD.load(activation_threshold=0.35, min_speech_duration=0.05)
If both are clean and only isolated words fail, that’s Nova-3 having no context on a one-word turn; mapping the noisy transcript to your fixed yes/no/1-5 set on your side will beat further STT tuning.
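A minimal sketch of that mapping step, assuming single-word answers and the fixed yes/no/1-5 set. The alias table and the `normalize_answer` name are hypothetical; the near-homophones listed are guesses at typical mis-transcriptions, so extend the table with whatever errors you actually see in your logs.

```python
import difflib

# Hypothetical alias table: canonical answers plus likely mis-transcriptions.
ALIASES = {
    "yes": "yes", "yeah": "yes", "yep": "yes", "yup": "yes",
    "no": "no", "nope": "no", "know": "no",
    "one": "1", "won": "1", "1": "1",
    "two": "2", "to": "2", "too": "2", "2": "2",
    "three": "3", "tree": "3", "3": "3",
    "four": "4", "for": "4", "fore": "4", "4": "4",
    "five": "5", "5": "5",
}


def normalize_answer(transcript, cutoff=0.8):
    """Map a noisy one-word transcript to a canonical answer, or None if nothing is close."""
    token = transcript.strip().lower().strip(".,!?")
    if token in ALIASES:
        return ALIASES[token]
    # Fall back to fuzzy matching against the known aliases.
    close = difflib.get_close_matches(token, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


print(normalize_answer("Yes."))    # exact alias after normalization
print(normalize_answer("for"))     # homophone mapped to "4"
print(normalize_answer("banana"))  # no close match, returns None
```

When the result is None you can have the agent re-ask instead of guessing, which keeps the fast path quick for the common case.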