Hi! I'm building on the LiveKit Agents API with the OpenAI Realtime plugin and noticing inaccurate transcription. What levers do I have to improve accuracy? Would appreciate any advice, strategies, etc.
@Trevor_Skelton, the Realtime plugin’s RealtimeModel constructor has a few knobs for this (livekit/agents/…/openai/realtime/realtime_model.py):
- input_audio_transcription to swap the user-speech transcription model (try AudioTranscription(model="gpt-4o-transcribe")),
- input_audio_noise_reduction to clean input audio, and
- turn detection (threshold, prefix_padding_ms, silence_duration_ms on server VAD; eagerness on semantic VAD), see docs.livekit.io/agents/integrations/realtime/openai/.
Instructions in the session config can also bake in domain glossary terms.
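To make those knobs concrete, here is a rough sketch that collects them as plain dicts shaped like the OpenAI Realtime session fields, plus a small helper that appends glossary terms to the instructions. The helper names (build_realtime_kwargs, build_instructions) and all numeric values are placeholders of mine, and whether the plugin accepts plain dicts or wants typed objects such as AudioTranscription is an assumption to check against realtime_model.py.

```python
from typing import Any, Dict, List, Optional


def build_realtime_kwargs(
    *,
    language: Optional[str] = None,
    transcription_model: str = "gpt-4o-transcribe",
    noise_reduction: str = "near_field",
    vad_threshold: float = 0.5,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 500,
) -> Dict[str, Any]:
    """Collect the three knobs as plain dicts (hypothetical helper, not plugin API)."""
    if not 0.0 <= vad_threshold <= 1.0:
        raise ValueError("vad_threshold must be between 0.0 and 1.0")

    transcription: Dict[str, Any] = {"model": transcription_model}
    if language is not None:
        # Optional language hint, e.g. "en".
        transcription["language"] = language

    return {
        "input_audio_transcription": transcription,
        "input_audio_noise_reduction": {"type": noise_reduction},
        "turn_detection": {
            "type": "server_vad",
            "threshold": vad_threshold,
            "prefix_padding_ms": prefix_padding_ms,
            "silence_duration_ms": silence_duration_ms,
        },
    }


def build_instructions(base: str, glossary: List[str]) -> str:
    """Append domain terms to the base instructions (wording is illustrative)."""
    terms = [t.strip() for t in glossary if t.strip()]
    if not terms:
        return base
    return (
        base
        + "\n\nDomain terms that may come up (spell them exactly as written): "
        + ", ".join(terms)
    )
```

The dict would then be adapted to whatever the constructor expects, and the instructions string passed wherever your agent's instructions are set.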
Please do share the language, your current config, and an example of misheard text if you want targeted advice.
The Realtime transcriber is actually pretty bad. We found it useful to run a parallel STT transcriber (Scribe v2 from ElevenLabs).
After running it in shadow for a few days, we switched over because call reviews and debugging became so much easier.
@abhi the parallel scribe v2 shadow pattern is a solid call for the review/debug side.
For @Trevor_Skelton’s question on the Realtime plugin: input_audio_transcription flips the user_transcription capability flag (literally user_transcription=input_audio_transcription is not None in livekit/agents/…/openai/realtime/realtime_model.py), so it’s a user-transcript surfacing knob.
Both that and running a parallel STT like Abhi’s setup improve the displayed/loggable transcript. Whether they also improve the model’s responses depends on whether the model is responding wrong because it misheard, or responding right while the visible transcript is wrong.
Trevor, the language, current config, and a misheard example would tell us which side of that you’re hitting.
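One rough way to tell those two cases apart during call review is to score the Realtime transcript against the shadow STT transcript and cross it with whether the reply made sense. This is only a sketch: the function names, the labels, and the 0.1 threshold are made up for illustration, and it treats the shadow transcript as the reference even though it can be wrong too.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    # Lowercase and drop punctuation so "Yes, please." matches "yes please".
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def triage(
    shadow_transcript: str,
    realtime_transcript: str,
    response_was_appropriate: bool,
    wer_threshold: float = 0.1,
) -> str:
    """Label a turn for review; labels are hypothetical, not a LiveKit concept."""
    transcript_wrong = word_error_rate(shadow_transcript, realtime_transcript) > wer_threshold
    if response_was_appropriate:
        return "transcript_only" if transcript_wrong else "ok"
    return "possible_mishearing" if transcript_wrong else "response_issue"
```

Turns labelled transcript_only are the ones where a better transcription model or a parallel STT is likely enough; possible_mishearing turns point more toward audio quality and turn detection.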
Thanks for the replies! To give additional context: we are using English, previously were not using input_audio_transcription or input_audio_noise_reduction, and were on server VAD.
@Muhammad_Usman_Bashir we’ve been experimenting with all 3 of your initial suggestions and seeing some improvements. In particular, swapping the user-speech transcription model via the input_audio_transcription parameter, with the language hint set to English, helps a lot since we do some post-processing on this transcription. Without it, the transcription was not very accurate and sometimes behaved strangely, e.g. transcribing the correct word in another language, or transcribing background noise as words in another language. Could you point me to a reference for the parallel STT implementation you mentioned?
Also, curious if others have better feel for when to use semantic VAD vs server VAD. Thanks so much!
Interesting topic here! I have also recently tried out OpenAI Realtime and it’s great, though turn taking is sometimes troublesome, and so is background speech. Semantic VAD seems to do pretty well, though I am not sure about more challenging environments. With the default (server) VAD you can increase the threshold and decrease the silence duration to make it a bit more responsive. Have you tried any noise-cancelling tools? I have been building some robots on top of it and noise cancellation helped quite a lot.
@Trevor_Skelton, the parallel STT reference is examples/other/transcription/transcriber.py.
It subscribes to the participant’s audio and runs a standalone STT with no LLM (audio_output=False), which is the exact shadow-transcriber shape. Run it as a separate worker in the same room feeding scribe v2 (or any STT) for your review/debug transcript, independent of the Realtime model and the post-processing you’re doing on it.
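For the review side of that pattern, a small log that keeps the two transcripts side by side per turn is usually all you need. The sketch below is plain Python and does not touch the LiveKit API; ShadowLog, turn_id, and the "realtime"/"shadow" source names are hypothetical, and how you obtain a shared turn id across the two workers is left open.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ShadowLog:
    """Keeps Realtime and shadow-STT transcripts together, keyed by turn."""

    turns: Dict[str, Dict[str, str]] = field(
        default_factory=lambda: defaultdict(dict)
    )

    def record(self, turn_id: str, source: str, text: str) -> None:
        # source is "realtime" or "shadow" (names chosen for this sketch).
        self.turns[turn_id][source] = text.strip()

    def disagreements(self) -> List[Dict[str, str]]:
        """Turns where both transcripts exist and differ, ignoring case."""
        out = []
        for turn_id, by_source in self.turns.items():
            rt, sh = by_source.get("realtime"), by_source.get("shadow")
            if rt is not None and sh is not None and rt.lower() != sh.lower():
                out.append({"turn_id": turn_id, "realtime": rt, "shadow": sh})
        return out

    def to_jsonl(self) -> str:
        """One JSON object per turn, convenient for grepping during call review."""
        return "\n".join(
            json.dumps({"turn_id": turn_id, **by_source})
            for turn_id, by_source in self.turns.items()
        )
```

Each worker would call record() from its own transcript callback, and disagreements() gives a short list of turns worth listening to.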
Semantic vs server VAD, from the OpenAI Realtime docs (docs.livekit.io/agents/integrations/realtime/openai):
- Server VAD chunks on silence (raise threshold for noisy input, lower silence_duration_ms for faster turn-end).
- Semantic VAD uses a classifier to decide turn-end from the words, so it interrupts mid-sentence less. The single knob is eagerness (low lets users take their time, high chunks as soon as possible).
Rule of thumb: semantic for natural conversational pauses, server-with-raised-threshold when background noise is the dominant problem, which matches the foreign-language transcription you were seeing.
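As a sketch, that rule of thumb could be written as a tiny chooser that returns a turn-detection dict in the shape of the OpenAI Realtime session field. The function name and the 0.7 threshold are placeholders to tune against your own audio, not recommended values, and mapping the dict onto the plugin's turn_detection argument is left as an assumption.

```python
from typing import Any, Dict


def pick_turn_detection(
    noisy_background: bool,
    users_pause_mid_thought: bool,
) -> Dict[str, Any]:
    """Pick a turn-detection config from two coarse observations about the calls."""
    if noisy_background:
        # Noise dominates: server VAD with a raised threshold (0.7 is a placeholder).
        return {
            "type": "server_vad",
            "threshold": 0.7,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        }
    if users_pause_mid_thought:
        # Low eagerness lets users take their time before the turn ends.
        return {"type": "semantic_vad", "eagerness": "low"}
    return {"type": "semantic_vad", "eagerness": "auto"}
```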
On that hallucination specifically: the multilingual decoder guesses a language on ambiguous or non-speech audio, which your English hint constrains (why setting it helped).
@Pawel_Lach’s noise-cancellation suggestion above is the upstream version of the same fix: cleaner input means less ambiguous non-speech for the decoder to mis-label.
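If some of those wrong-language transcripts still slip through, the post-processing step could flag them before acting on them. A minimal sketch, assuming the expected language is written in Latin script: it only catches other scripts (e.g. CJK or Cyrillic) and will not notice Spanish or French words. The function name and the 0.3 ratio are arbitrary choices of mine.

```python
import unicodedata


def looks_non_latin(text: str, max_non_latin_ratio: float = 0.3) -> bool:
    """True when too many letters in the transcript come from non-Latin scripts."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    non_latin = sum(
        1 for ch in letters if "LATIN" not in unicodedata.name(ch, "")
    )
    return non_latin / len(letters) > max_non_latin_ratio
```

Flagged turns can be dropped from post-processing or routed to the shadow transcript instead.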