Hi Aman, thanks for the detailed report. QUAIL Voice Focus (VF) models are optimized for close-microphone scenarios like headsets, where the speaker’s voice is the dominant signal. Built-in laptop microphones pick up more room noise and have different gain characteristics, which causes VF models to over-process the signal.
A few suggestions:
Switch from QUAIL_VF_S to a standard QUAIL model, which handles more varied mic environments better.
Try lowering enhancement_level (e.g., to 0.4-0.5) as a test, even before switching models.
For the Chinese transcription issue specifically, explicitly setting the language parameter (e.g., language="en") in your Whisper config should prevent incorrect language detection.
On sample rate and browser settings - what’s your current echo cancellation and auto gain control setting?
Happy to look at logs or audio samples; feel free to reach out to me directly.
Hi @Pawel_Lach, thanks for the suggestions and detailed explanation.
Regarding the 2nd point, we already tried lowering the enhancement_level, but unfortunately it did not improve the output. In fact, reducing it introduced even more background noise.
For the 3rd point, we need to support multiple languages including English, Hindi, and Arabic. Because of this, explicitly setting a fixed language code is not feasible for our use case, so we will need to continue with the multilingual configuration.
Regarding the model change, we will test by switching from QUAIL_VF_S to the standard QUAIL model and will share our observations after validation.
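On the multilingual constraint above: one way to keep auto-detection while still rejecting an obviously wrong detection (such as Chinese) is to check the detected language against the set you actually support. The following is only a minimal sketch. The Transcript class and its field names are hypothetical stand-ins, not the LiveKit or Groq API, and it assumes your STT reports a language code with each result.

```python
from dataclasses import dataclass
from typing import Optional

# Languages the agent is meant to support (from this thread: English, Hindi, Arabic).
ALLOWED_LANGUAGES = {"en", "hi", "ar"}


@dataclass
class Transcript:
    """Hypothetical container; real STT events expose similar data under other names."""
    text: str
    language: Optional[str]  # language code reported by the STT, if any


def accept_transcript(t: Transcript, allowed=ALLOWED_LANGUAGES) -> bool:
    """Return True if the transcript should be passed on to the agent."""
    if not t.text.strip():
        return False
    if t.language is None:
        # No detection info: let it through rather than drop real speech.
        return True
    # Normalize "en-US" / "zh_CN" style codes to the base language.
    base = t.language.lower().replace("_", "-").split("-")[0]
    return base in allowed
```

This does not fix the underlying audio problem; it only stops an out-of-set detection from reaching the agent while the input pipeline is being sorted out.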
I see. Would you mind sharing the audio samples? We are more than happy to look into it. Feel free to reach out to me directly as well. It looks like a somewhat more complex case.
Building on @Pawel_Lach’s QUAIL guidance, the headset-vs-laptop split points at a root cause upstream of QUAIL: browser audio processing mangling the laptop mic before QUAIL sees it.
The web SDK’s AudioCaptureOptions has echoCancellation, autoGainControl, and noiseSuppression, typically on by default [ client-sdk-js/src/room/track/options.ts ].
On a headset (clean close-mic) the browser does little, so QUAIL gets near-raw audio.
On a laptop built-in mic (noisy far-field) the browser applies aggressive AGC and noise suppression first, so QUAIL over-processes an already-processed signal.
Test: set those three to false so QUAIL is the only processing stage. That isolates whether the browser chain is the culprit.
On @Rajan_kumar’s multilingual point (EN/HI/AR): the Chinese output is most likely a symptom of the degraded audio, not a separate language bug. Whisper auto-detect mis-fires on garbled input. Fix the input pipeline first; language detection should stabilize without pinning language="en", keeping multilingual intact.
Thanks for the suggestion and explanation. We already tried this approach, but we need to keep echoCancellation enabled because disabling it causes the agent’s output audio to be picked up again as input, which puts the agent into a loop.
We did see some improvement after disabling autoGainControl and noiseSuppression, but the issue still persists. The groq/whisper-large-v3-turbo model is interpreting even very small background noises as phrases like “Thank you.” For example, if someone nearby coughs, the model sometimes transcribes it as “Thank you,” which then causes our agent to end the call unexpectedly.
This behavior feels quite strange to us, and we’re trying to understand the best way to handle or prevent these false transcriptions.
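One possible way to handle the false “Thank you” transcriptions described above is a guard in front of whatever logic ends the call. Whisper-style models are known to emit short filler phrases on noise or near-silence, so a transcript that is only such a phrase and is backed by very little detected speech can be treated as suspect. This is a sketch under assumptions: the phrase list is illustrative, the 0.6 s threshold is arbitrary, and speech_duration_s is assumed to come from your VAD. None of these names are part of the LiveKit API.

```python
import re

# Short phrases Whisper-style models are known to emit on noise or silence.
# Illustrative only, not exhaustive; tune it from your own logs.
SUSPECT_PHRASES = {"thank you", "thanks", "thanks for watching", "bye", "you"}


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Thank you.' matches 'thank you'."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()


def is_probable_hallucination(text: str, speech_duration_s: float,
                              min_duration_s: float = 0.6) -> bool:
    """Flag a transcript that is a known filler phrase backed by very little speech.

    speech_duration_s: length of the detected speech segment (assumed to come
    from your VAD). min_duration_s: arbitrary starting threshold.
    """
    return normalize(text) in SUSPECT_PHRASES and speech_duration_s < min_duration_s
```

A flagged transcript could simply be ignored rather than passed to the end-of-call logic, so a cough or sneeze cannot hang up on the user by itself.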
As Pawel suggested, can we get a sample of the audio? These are challenging things to debug and nearly impossible without concrete examples of the issue.
If you are using Agent Observability, can you share a session ID from your dashboard so I can take a look at what is going on?
Have you tried other STT to see if you get any different results?
Please take a look — I tested multiple STT providers and models with the following Deepgram and Groq LiveKit session IDs:
Deepgram Nova 3: RM_s3ipNTquuyD6
Deepgram Nova 2: RM_su7ywYhHQseB
(This worked well for English, but toward the end some abbreviations were added incorrectly. I already included some of them in keywords because they are important for the model to understand.)
Groq Whisper Large v3 Turbo: RM_cyhDpVmTTLce
(You can notice unnecessary “thank you” transcriptions in this session.)
Please review these sessions and let me know which STT provider/model would be better suited for supporting Hindi, English, and Arabic reliably.
I was not able to access that Insights data. For each session you will need to go to the “Agent Insights” tab, click share, and check the share with LiveKit staff checkbox.
Also, I can't really give much advice about which STT will work better for which language. But I may be able to help you isolate specific audio issues, which is what I believe you were trying to sort out.
Hi @CWilson, just following up on this thread. I shared the Agent Insights session links with LiveKit staff access enabled.
When you have a chance, could you please review the sessions and let me know if you notice any STT-related issues in the audio or transcripts? My main goal is to understand whether the observed transcription problems are due to audio quality, model behavior, or provider-specific limitations across Hindi, English, and Arabic.
If there’s any additional information, audio samples, or specific session settings that would help with the investigation, I’d be happy to provide them.
For the first two, there are no audio waveforms associated with them. I’m unsure exactly why (I thought perhaps it was this incident, but the dates don’t match up).
Anyway, for the third transcript I was able to review the audio. I see a single mis-transcribed “Thank you.” at 34.28. Two observations about this:
To my ear, the actual audio, although it is a sneeze, does sound very much like “Thank you”. In any case, the sneeze is what we would classify as a ‘backchannel’, and it’s the kind of thing that ‘Adaptive interruption handling’ is designed to filter out and ignore (see “Adaptive interruption handling” in the LiveKit documentation). The first thing to ensure is that you are using a version of agents that supports adaptive interruption handling.
The transcript has a little keyboard icon next to the word ‘User’. I would only expect this for textual user input, but clearly your user is speaking. The only other time I have seen the keyboard icon for spoken text is when the developer was doing something unique with the pipeline, such as implementing a custom LLM adapter.
I grant that’s not the exact answer you were looking for, but I hope the above is helpful.
@darryncampbell I am using the following settings in my agent session:
turn_handling=TurnHandlingOptions(
    turn_detection=eo_turn_detector,  # SWITCHED from "vad" to the model
    endpointing={
        "mode": "dynamic",
        "min_delay": 0.05,  # Reduced: The turn detector is very fast
        "max_delay": 0.4,  # Reduced: Don't wait too long if the model is unsure
    },
    interruption={
        "mode": "adaptive",
        "enabled": True,
    },
    preemptive_generation={
        "enabled": not _rag_enabled,
        "preemptive_tts": not _rag_enabled,
    },
),
The adaptive setting is already in place, but we are still facing this noise pickup issue. The issue clearly seems to be the noise cancellation, because on headsets everything works fine.
As long as eo_turn_detector is one of the values defined at https://docs.livekit.io/agents/logic/turns/ and you’re using a version of the agents SDK >= v1.5.0 (Python) [or >= v1.2.0 NodeJS] then that looks good.
The issue seems clearly of the noise cancellation
I’m not so sure. The audio you hear in Agent Observability is captured AFTER any agent noise cancellation, so when you listen to that recording you are essentially hearing what the agent hears. As I said, I could only listen to one of the recordings, and even to me it sounded a bit like ‘Thank you’, so it wouldn’t surprise me if the classifier thought the same.
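For the SDK version requirement mentioned above (>= v1.5.0 for Python), a quick way to confirm what is actually installed in the agent's environment is to read the package metadata. A minimal sketch: it assumes the distribution is installed under the name "livekit-agents", and the parser deliberately ignores pre-release suffixes such as "rc1".

```python
from importlib.metadata import PackageNotFoundError, version

MIN_PYTHON_AGENTS = (1, 5, 0)  # minimum quoted in this thread for Python


def parse_version(v: str) -> tuple:
    """Turn '1.5.0' or '1.5.0rc1' into a tuple of three ints (suffixes ignored)."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def agents_version_ok(installed: str, minimum=MIN_PYTHON_AGENTS) -> bool:
    """True if the installed version meets the minimum."""
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    try:
        # "livekit-agents" is assumed to be the installed distribution name.
        installed = version("livekit-agents")
        print(installed, "OK" if agents_version_ok(installed) else "too old")
    except PackageNotFoundError:
        print("livekit-agents is not installed in this environment")
```

Running this inside the same virtualenv or container as the agent matters, since a local machine and a deployed worker can easily be on different versions.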
I listened to the audio again, and it doesn’t sound like “thank you” to us. It appears to be a low-volume background noise that is faintly picked up by the agent.
For turn detection, I am using the multilingual model:
eo_turn_detector = MultilingualModel()
This is the configuration mentioned in the LiveKit documentation link that you shared.
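Since the two sides above hear the same clip differently, an objective level measurement might help settle whether the segment is really "low-volume background noise". The sketch below computes the RMS level in dBFS of a 16-bit mono PCM segment. It assumes you can export the segment around 34.28 as raw little-endian PCM16, and the -45 dBFS threshold is an arbitrary starting point, not a recommended value.

```python
import math
import struct


def rms_dbfs(pcm16: bytes) -> float:
    """RMS level of little-endian 16-bit mono PCM, in dB relative to full scale."""
    n = len(pcm16) // 2
    if n == 0:
        return float("-inf")
    samples = struct.unpack(f"<{n}h", pcm16[: n * 2])
    rms = math.sqrt(sum(s * s for s in samples) / n)
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / 32768.0)


def is_faint(pcm16: bytes, threshold_dbfs: float = -45.0) -> bool:
    """True if the segment is quieter than the (arbitrary, tunable) threshold."""
    return rms_dbfs(pcm16) < threshold_dbfs
```

Comparing the level of the mis-transcribed segment with the level of the user's normal speech in the same session would show how far apart they are, and whether a simple energy gate in front of the STT is a realistic option for the laptop microphone case.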