Improving accuracy

Trevor_Skelton · May 20, 2026, 4:22pm

Hi! I'm building on the LiveKit Agents API with the OpenAI Realtime plugin. What levers do I have to improve accuracy? I am noticing inaccurate transcription. Would appreciate any advice, strategies, etc.

Muhammad_Usman_Bashir · May 20, 2026, 11:46pm

@Trevor_Skelton, the Realtime plugin’s RealtimeModel constructor has a few knobs for this (see livekit/agents/…/openai/realtime/realtime_model.py):

input_audio_transcription to swap the user-speech transcription model (try AudioTranscription(model="gpt-4o-transcribe")),
input_audio_noise_reduction to clean input audio, and
turn detection (threshold, prefix_padding_ms, silence_duration_ms on server VAD; eagerness on semantic VAD), documented at docs.livekit.io/agents/integrations/realtime/openai/.

Instructions in the session config can also bake in domain glossary terms.
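
For concreteness, here is a minimal sketch of those knobs as plain Python dicts. The field names follow the OpenAI Realtime session settings; the helper names (build_session_options, glossary_instructions), the numeric values, and the glossary terms are illustrative assumptions, not recommendations. Depending on your plugin version, RealtimeModel may expect typed objects (such as the AudioTranscription mentioned above) rather than dicts, so check realtime_model.py before passing these through.

```python
from typing import Any


def build_session_options(language: str = "en", far_field_mic: bool = False) -> dict[str, Any]:
    """Collect the three accuracy knobs in one place (hypothetical helper)."""
    return {
        # Model used only for the user-speech transcript, plus a language hint.
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": language,
        },
        # near_field suits headsets; far_field suits laptop or room microphones.
        "input_audio_noise_reduction": {
            "type": "far_field" if far_field_mic else "near_field",
        },
        # Server VAD with illustrative starting values; tune on your own audio.
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        },
    }


def glossary_instructions(base: str, terms: list[str]) -> str:
    """Append domain terms to the session instructions (hypothetical helper)."""
    if not terms:
        return base
    return f"{base}\nDomain terms you may hear: {', '.join(terms)}."


options = build_session_options(language="en", far_field_mic=True)
instructions = glossary_instructions(
    "You are a helpful voice assistant.",
    ["metoprolol", "HbA1c"],  # made-up example glossary
)
```

Keeping the options in one helper makes it easier to change one knob at a time and compare transcripts before and after.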

Please do share the language, your current config, and an example of misheard text if you want targeted advice.

abhi · May 21, 2026, 3:10pm

The Realtime transcriber is actually pretty bad. We found it useful to run a parallel STT transcriber (Scribe v2 from ElevenLabs).
After running it in shadow mode for a few days, we switched over, as call reviews and debugging became so much easier.

Muhammad_Usman_Bashir · May 22, 2026, 1:34pm

@abhi the parallel scribe v2 shadow pattern is a solid call for the review/debug side.

For @Trevor_Skelton’s question on the Realtime plugin: input_audio_transcription flips the user_transcription capability flag (literally user_transcription=input_audio_transcription is not None in livekit/agents/…/openai/realtime/realtime_model.py), so it’s a user-transcript surfacing knob.

Both that and running a parallel STT like Abhi’s setup improve the displayed/loggable transcript. Whether they also improve the model’s responses depends on whether the model is responding wrong because it misheard, or responding right while the visible transcript is wrong.
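
One way to tell those two cases apart during a shadow period is to measure how far the two transcripts diverge per turn, then listen to the turns that diverge most. A minimal sketch, assuming you have logged (parallel STT transcript, Realtime transcript) pairs yourself; the function names and the threshold are hypothetical, and treating the parallel STT as the reference is only a convention here, since neither transcript is ground truth.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def flag_divergent(turns: list[tuple[str, str]], threshold: float = 0.3) -> list[int]:
    """Return indices of turns whose transcripts disagree more than threshold."""
    return [
        index
        for index, (shadow_text, realtime_text) in enumerate(turns)
        if word_error_rate(shadow_text, realtime_text) > threshold
    ]


turns = [
    ("book a table for two", "book a cable for two"),
    ("yes please", "yes please"),
]
print(flag_divergent(turns, threshold=0.1))
```

If the flagged turns are also the ones where the agent responded wrongly, the model likely misheard; if the agent responded correctly on those turns, the problem is probably limited to the visible transcript.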

Trevor, the language, current config, and a misheard example would tell us which side of that you’re hitting.

Trevor_Skelton · May 26, 2026, 3:19pm

Thanks for the replies! To give additional context, we are using English with server VAD, and previously were not using input_audio_transcription or input_audio_noise_reduction.

@Muhammad_Usman_Bashir we’ve been experimenting with all 3 of your initial suggestions and are seeing some improvements. In particular, swapping the user-speech transcription model via the input_audio_transcription parameter, with the language hint set to English, helps greatly because we do some post-processing on this transcription. Without it, the transcription was not very accurate and sometimes behaved strangely, for example transcribing the correct word in another language or transcribing background noise as words in another language. Could you point me to a reference for the parallel STT implementation you mentioned?

Also, curious if others have a better feel for when to use semantic VAD vs server VAD. Thanks so much!

Pawel_Lach · May 26, 2026, 5:35pm

Interesting topic here! I have also recently tried out OpenAI Realtime and it’s great, though turn-taking is sometimes troublesome, and so is background speech. Semantic VAD seems to do pretty well, though I am not sure about more challenging environments. With the default VAD you can increase the threshold and decrease the silence duration to make it a bit more responsive. Have you tried any noise-cancelling tools? I have been building some robots on top of it, and noise cancelling helped quite a lot.

Muhammad_Usman_Bashir · May 26, 2026, 7:45pm

@Trevor_Skelton, Parallel STT reference: examples/other/transcription/transcriber.py.

It subscribes to the participant’s audio and runs a standalone STT with no LLM (audio_output=False), which is the exact shadow-transcriber shape. Run it as a separate worker in the same room feeding scribe v2 (or any STT) for your review/debug transcript, independent of the Realtime model and the post-processing you’re doing on it.

Semantic vs server VAD, from the OpenAI Realtime docs (docs.livekit.io/agents/integrations/realtime/openai):

Server VAD chunks on silence (raise threshold for noisy input, lower silence_duration_ms for faster turn-end).
Semantic VAD uses a classifier to decide turn-end from the words, so it interrupts mid-sentence less. Its single knob is eagerness (low lets users take their time, high chunks as soon as possible).

Rule of thumb: semantic for natural conversational pauses, server-with-raised-threshold when background noise is the dominant problem. The noise case matches the foreign-language transcription you were seeing.
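
That rule of thumb can be written down as a small chooser. This is a sketch only: pick_turn_detection and its arguments are hypothetical, the dict keys follow the OpenAI Realtime turn_detection fields, and the raised threshold and shortened silence window are illustrative numbers to tune against your own recordings.

```python
def pick_turn_detection(background_noise: bool, unhurried_users: bool = False) -> dict:
    """Map the rule of thumb onto a turn_detection setting (hypothetical helper)."""
    if background_noise:
        # Noise dominates: stay on server VAD and require louder audio to count as speech.
        return {
            "type": "server_vad",
            "threshold": 0.7,            # raised from the usual 0.5 starting point
            "prefix_padding_ms": 300,
            "silence_duration_ms": 400,  # slightly shorter for a faster turn-end
        }
    # Otherwise let the classifier decide turn-end from the words.
    return {
        "type": "semantic_vad",
        "eagerness": "low" if unhurried_users else "auto",
    }


print(pick_turn_detection(background_noise=True))
print(pick_turn_detection(background_noise=False, unhurried_users=True))
```

Treat the two branches as starting points for an A/B comparison on real calls, not as fixed settings.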

On that hallucination specifically: the multilingual decoder guesses a language on ambiguous or non-speech audio. Your English hint constrains that guess, which is why setting it helped.

@Pawel_Lach’s noise-cancellation suggestion above is the upstream version of the same fix: cleaner input means less ambiguous non-speech for the decoder to mis-label.
