Participant was audible in room/demo recording, but LiveKit Agent heard very low audio and STT/VAD pipeline missed most of the response

Hey LiveKit community,

I’m looking for help debugging an intermittent issue with a production AI voice moderation app I built on LiveKit Cloud.

Context

I’ve built and deployed an AI voice moderation platform for focus groups using:

  • LiveKit Cloud for real-time multi-user voice/video rooms

  • Python + FastAPI backend

  • Docker-based deployment on a VPS

  • LiveKit Agents

  • Deepgram for STT

  • ElevenLabs for TTS

  • OpenAI for conversation logic

  • Anam for the AI avatar layer

The platform creates rooms, generates participant tokens, dispatches the AI moderator agent, and supports Zoom-like multi-participant focus group sessions.

Issue

During a recorded demo, the AI moderator asked a participant a question. The participant answered clearly, and everyone in the room could hear him. The demo recording also captured his voice clearly.

However, the AI voice agent did not reliably process what he was saying.

In LiveKit Agent Insights, the participant’s audio sounded very faint/weak compared with what was heard in the actual room/demo recording. My application logs showed moments where the system detected speech activity but did not receive usable STT transcripts.

Example pattern from my logs:

User started speaking
STT HEALTH CHECK: VAD detected speech 7s ago but no STT transcripts received for christopher! Nudging.

Later, some partial fragments came through, but the agent treated the participant’s answer as incomplete/off-topic because the transcript was missing or fragmented.

What I checked

  • The participant’s microphone appeared to be working in Session Analytics.

  • Other participants could hear him clearly.

  • The demo recording captured his response clearly.

  • The issue seemed isolated to what the agent/STT pipeline was receiving.

  • Agent Insights made his audio sound much weaker than the room/demo recording.

My question

What is the best way to debug this type of mismatch?

Specifically:

  1. Can LiveKit room audio/recording sound clear while the agent receives a much lower-quality or lower-volume subscribed track?

  2. Could this be caused by participant-side connectivity, packet loss, browser audio processing, audio level normalization, or track subscription behavior?

  3. Are there specific LiveKit metrics I should inspect, such as packet loss, jitter, audio level, connection quality, track SID, participant SID, or agent subscription state?

  4. Is there a recommended way to compare what the room received versus what the LiveKit Agent actually received?

  5. Could this be related to the LiveKit Agents audio pipeline, VAD configuration, or the downstream STT provider receiving weak/partial audio?

I’m not trying to assume this is a LiveKit infrastructure issue. I’m trying to determine whether the failure point is participant connectivity, browser/mic behavior, LiveKit track delivery, agent subscription/ingestion, VAD, or STT.

Any recommended debugging steps, metrics to export, or best practices for diagnosing “participant audible to humans but not reliably heard by agent” would be greatly appreciated.

The Session ID: RM_MQSDbz3SfBMD

@Ganesh_Krishnan The mismatch you’re describing is the strong signal. Room recording is the SFU's mix; Agent Insights replays what reached your worker’s audio pipeline specifically. If they sound different on the same participant, the loss is on the agent’s subscription/preprocessing path, not the underlying track.

Most likely culprits, in order:

  • Noise cancellation on the agent’s input. If you have noise_cancellation=audio_enhancement() or BVC on RoomInputOptions, an aggressive model can over-attenuate a quieter speaker as “noise.” Disable it, re-run, compare. If the participant comes through clearly, that’s it.
  • Silero VAD threshold too high. The default activation_threshold=0.5 can gate quiet-but-real speech as silence, so VAD fires only briefly and STT gets fragments. Try silero.VAD.load(activation_threshold=0.3, min_speech_duration=0.1).
  • Wrong subscribed track. Confirm the agent is subscribed to the participant’s mic publish, not a secondary device or screen-share audio. Track SIDs in Session Analytics let you verify.

Concrete A/B: in Agent Insights, compare the audio waveform peak at the input stage vs the room composite at the same timestamp. If both peak at similar dB, loss is downstream (noise cancellation / VAD / STT). If Insights peaks lower, loss is upstream (subscription path or participant-side Chrome AGC dropping gain).
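If you'd rather put numbers on that comparison than eyeball waveforms, here is a minimal sketch. It assumes you can export the same few seconds from both sources as raw 16-bit little-endian mono PCM (that export step is an assumption, not something Insights is confirmed to offer directly). The helper names pcm16_levels and sine_pcm16 are hypothetical, and the sine tones only stand in for real clips.

```python
import math
import struct


def pcm16_levels(pcm: bytes) -> tuple[float, float]:
    """Return (peak_dbfs, rms_dbfs) for 16-bit little-endian mono PCM."""
    n = len(pcm) // 2
    if n == 0:
        return float("-inf"), float("-inf")
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)

    def to_dbfs(value: float) -> float:
        # 0 dBFS corresponds to full scale (32768) for 16-bit audio.
        return 20 * math.log10(value / 32768.0) if value > 0 else float("-inf")

    return to_dbfs(peak), to_dbfs(rms)


def sine_pcm16(amplitude: float, freq_hz: float = 440.0,
               sample_rate: int = 16000, seconds: float = 0.5) -> bytes:
    """Synthetic stand-in for an exported clip (amplitude in 0..1)."""
    count = int(sample_rate * seconds)
    values = [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(count)
    ]
    return struct.pack(f"<{count}h", *values)


# Stand-ins: a healthy room clip vs. a clip attenuated to a tenth of its amplitude.
room_clip = sine_pcm16(0.5)
agent_clip = sine_pcm16(0.05)

room_peak, room_rms = pcm16_levels(room_clip)
agent_peak, agent_rms = pcm16_levels(agent_clip)
gap_db = room_rms - agent_rms

print(f"room  peak={room_peak:.1f} dBFS rms={room_rms:.1f} dBFS")
print(f"agent peak={agent_peak:.1f} dBFS rms={agent_rms:.1f} dBFS")
print(f"agent input is {gap_db:.1f} dB below the room clip")
```

With the synthetic clips the gap comes out at roughly 20 dB. On real exports, a gap of that size on the same utterance would point at attenuation somewhere on the agent's input path rather than at the STT provider.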

The “VAD detected speech 7s ago but no STT transcripts” pattern usually traces to Silero firing on noise while actual speech is attenuated below Deepgram’s energy floor. Disabling noise cancellation first isolates that.

I was also wondering:

  1. What are you using for the VAD?
  2. Do you use any voice enhancement or noise cancellation plugins?

vad_instance = silero.VAD.load(
    min_silence_duration=0.55,
    min_speech_duration=0.15,
    prefix_padding_duration=0.6,
    activation_threshold=0.5,
)

noise_cancellation=noise_cancellation.BVC()

Imported from livekit.plugins.noise_cancellation (src/moderator_agent.py:17). No ElevenLabs/Krisp or other enhancement plugins are wired in — BVC is the sole audio cleanup layer, and the VAD’s prefix_padding_duration=0.6s is intentionally tuned to give BVC headroom to suppress TTS echo before VAD decisions fire.

Thanks for the config, @Ganesh_Krishnan. That confirms it: noise_cancellation.BVC() is the culprit.

Per LiveKit’s docs, BVC is single-speaker tuned: “Optimized for single-speaker scenarios where cross-talk from nearby people could confuse transcriptions.” NC is the multi-participant variant: “better for multiple speakers and diarization.” A focus group is the textbook BVC anti-pattern; quieter or non-dominant participants get attenuated as background voices.

  • Fix: swap to noise_cancellation.NC(). It removes traffic, fans, and music without isolating one speaker.

  • Silero: activation_threshold=0.5 is fine on clean audio. After the swap, leave it; if you still see fragments, drop to 0.3.
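A minimal sketch of that swap, assuming the RoomInputOptions-based setup mentioned earlier in the thread. The VAD values are the ones posted above; how room_input_options reaches the session is an assumption, since the rest of moderator_agent.py isn't shown here.

```python
from livekit.agents import RoomInputOptions
from livekit.plugins import noise_cancellation, silero

# VAD settings kept exactly as posted in the thread.
vad_instance = silero.VAD.load(
    min_silence_duration=0.55,
    min_speech_duration=0.15,
    prefix_padding_duration=0.6,
    activation_threshold=0.5,  # lower toward 0.3 only if fragments persist after the swap
)

# NC instead of BVC: remove environmental noise without isolating a single voice.
room_input_options = RoomInputOptions(
    noise_cancellation=noise_cancellation.NC(),
)

# Assumed wiring: pass room_input_options=room_input_options to the
# AgentSession.start(...) call that currently receives the BVC options.
```

Changing only the noise cancellation model and leaving the VAD untouched keeps the A/B clean: if the quiet participant comes through after this, BVC was the cause.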

Also: prefix_padding_duration=0.6s is STT context captured before VAD fires, not a gate for BVC. BVC/NC run continuously on the input stream, not interlocked with VAD. For TTS echo suppression, the lever is muting the agent’s mic-input subscription while it speaks (or AEC at the client), not VAD padding.

You can also try the Quail L model from ai-coustics (see "Noise & echo cancellation" in the LiveKit documentation) if you want cleaner audio while still preserving multiple speakers.

Good call from @Pawel_Lach on Quail L. Two valid paths for @Ganesh_Krishnan depending on what you need:

noise_cancellation.NC(): bundled in livekit-plugins-noise-cancellation, free, removes environmental noise while preserving all speech. Right baseline swap for “stop BVC from suppressing my participants.”

ai-coustics Quail L: pitched for higher-fidelity output while preserving multiple speakers. Worth A/B-ing against NC if audio quality on the focus-group recordings is itself a product differentiator.

Quick swap to NC first to validate the root cause, then evaluate Quail L on the same sessions to see if the quality lift is worth it.

Thank you very much Muhammad & Pawel! Really appreciate the advice. You are increasing my awareness of the LiveKit ecosystem and for that I am very grateful.