Application-level turn-state bug

Hi everyone and @Muhammad_Usman_Bashir and @Pawel_Lach — I’m debugging a LiveKit Agents voice moderator session and would appreciate guidance on whether this looks like a noise cancellation / VAD / STT timing issue or an application-level turn-state bug.

Context:

  • We’re running a focus group style voice agent.
  • Stack: LiveKit Agents, Deepgram STT, LLM moderator logic, ElevenLabs TTS.
  • In this latest test, we added noise_cancellation.NC().
  • Overall the 16-minute session completed successfully.
  • However, the first question glitched for one participant, Ganesh.

What happened:
The moderator asked Ganesh the first warm-up question:

“Please tell me where you live and what you do for a living, or are you a student or retired?”

Ganesh answered correctly on both his first and second attempts. Agent Insights showed that he was clearly speaking to the moderator, but the agent did not accept the answer until Ganesh asked the moderator to repeat the question and then answered a third time.

Relevant log sequence:

  1. Agent selected and unmuted Ganesh.

  2. Turn moved into awaiting_response.

  3. VAD detected Ganesh speaking:
    PHASE awaiting_response → speaking
    User started speaking

  4. Then VAD moved to paused:
    PHASE speaking → paused

  5. Shortly after that, the STT health check fired:
    “STT HEALTH CHECK: VAD detected speech 7s ago but no STT transcripts received for ganesh! Nudging.”

  6. But after the nudge, the STT interim transcript arrived with the correct answer:
    is_final=False
    frag=“I live in Dallas. I’m an engineer.”
    buf_after=“I live in Dallas. I’m an engineer.”
    acc_after=“”

My current hypothesis:
Ganesh’s audio was probably not missing: VAD detected speech, and STT eventually produced the correct transcript. The failure seems to be that my app logic treated “no finalized STT transcript yet” as “no usable answer,” even though the interim STT buffer held a valid response.

So the turn-state machine may be nudging too early while STT is still delayed/pending.

Questions:

  1. With LiveKit Agents + noise_cancellation.NC(), is it expected that interim/final STT timing can be delayed enough that VAD sees speech before transcripts arrive?
  2. Is there a recommended pattern for gating turn timeouts so the app does not nudge while VAD has detected speech but STT finalization is still pending?
  3. Should I treat interim transcripts as a valid candidate response after VAD pause/silence, even if a final transcript never arrives?
  4. Are there best practices for resetting timeout / nudge timers on VAD events and interim STT events?
  5. Is noise_cancellation.NC() known to change VAD/STT timing behavior, or is this more likely an application state-machine issue?

Second issue:
On the final wrap-up question, another participant, Christopher, gave an answer that included “a range of different verticals,” and the moderator triggered an off-topic response. The question was broad:

“Wrapping up, what is the most important thing you would want others to know about this experience or is there anything that you think is missing from the product that you would like to add?”

My suspicion is that the off-topic classifier may be running on partial/interim fragments instead of waiting for the full candidate response.

Any guidance on how LiveKit users typically structure VAD + STT + turn-end + timeout logic would be very helpful.

My session ID is: RM_NJmiALmWqkBW

@Ganesh_Krishnan, your diagnosis is correct on both. These are app-level turn-state bugs, not framework or NC issues.

Issue 1 (nudge before STT): a 7s VAD-to-nudge gap is far longer than NC’s processing latency. Suspect Deepgram WebSocket delivery or your stt_node, not NC.

Pattern fixes for the turn-state machine:

  • Reset the nudge timer on every interim transcript: subscribe to UserInputTranscribedEvent and reset whenever transcript is non-empty, regardless of is_final.
  • Gate “no transcript” on VAD-pause + STT-idle, not fixed elapsed time. After VAD pauses and a 1-2s grace window with no new STT, accept the latest interim as a candidate.
  • Don’t nudge while VAD is speaking. Only nudge after paused AND the interim buffer is stale.
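
A minimal sketch of those three rules as a small decision helper, in plain Python with no LiveKit imports. Everything here is hypothetical: the NudgeGate name, its methods, and the timeout values are assumptions for illustration, not part of the LiveKit Agents API. You would call on_vad_start / on_vad_pause from your VAD phase changes and on_transcript from your UserInputTranscribedEvent handler, then poll decide() from your timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NudgeGate:
    """Hypothetical per-turn gate: returns 'wait', 'accept', or 'nudge'."""

    grace_s: float = 1.5                  # assumed STT-idle window after the last activity
    no_transcript_timeout_s: float = 7.0  # assumed wait when VAD heard speech but STT is silent
    no_speech_timeout_s: float = 10.0     # assumed wait when nobody has spoken at all
    turn_started_at: float = 0.0
    vad_speaking: bool = False
    vad_paused_at: Optional[float] = None
    last_stt_at: Optional[float] = None
    latest_text: str = ""

    def on_vad_start(self, now: float) -> None:
        self.vad_speaking = True
        self.vad_paused_at = None

    def on_vad_pause(self, now: float) -> None:
        self.vad_speaking = False
        self.vad_paused_at = now

    def on_transcript(self, text: str, now: float) -> None:
        # Any non-empty transcript, interim or final, counts as activity.
        if text.strip():
            self.latest_text = text
            self.last_stt_at = now

    def decide(self, now: float) -> str:
        # Rule 3: never nudge while VAD reports speech.
        if self.vad_speaking:
            return "wait"
        # Rule 2: a transcript exists, so accept it once it has gone stale.
        if self.latest_text:
            last_activity = max(self.vad_paused_at or 0.0, self.last_stt_at or 0.0)
            return "accept" if now - last_activity >= self.grace_s else "wait"
        # VAD heard speech but STT produced nothing yet: give STT longer.
        if self.vad_paused_at is not None:
            waited = now - self.vad_paused_at
            return "nudge" if waited >= self.no_transcript_timeout_s else "wait"
        # Nobody has spoken since the turn started.
        idle = now - self.turn_started_at
        return "nudge" if idle >= self.no_speech_timeout_s else "wait"
```

Passing timestamps in explicitly keeps the gate easy to unit test. In the app you would pass time.monotonic() as now.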

Issue 2 (off-topic on partial answer): same root cause. Gate the classifier on is_final=True from UserInputTranscribedEvent. Buffer interims; classify on the final or your VAD-pause fallback.
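
Here is a rough sketch of that buffering, again in plain Python. The CandidateBuffer class and the classify callback are hypothetical names, and I am assuming each interim replaces the previous one while finals accumulate, which is how streaming STT usually behaves but is worth checking against your own logs. The classifier only runs when you call close_turn(), which you would do once the turn's final has arrived or when your VAD-pause fallback fires.

```python
from typing import Callable, List, Optional


class CandidateBuffer:
    """Hypothetical buffer: collect transcripts, classify once per turn."""

    def __init__(self, classify: Callable[[str], bool]) -> None:
        self._classify = classify     # e.g. your on-topic check (assumed signature)
        self._finals: List[str] = []  # finalized segments, in order
        self._interim = ""            # latest unfinalized fragment

    def on_transcript(self, text: str, is_final: bool) -> None:
        text = text.strip()
        if not text:
            return
        if is_final:
            # A final supersedes the interim it grew out of.
            self._finals.append(text)
            self._interim = ""
        else:
            # Assumption: each interim replaces the previous interim.
            self._interim = text

    def candidate(self) -> str:
        parts = list(self._finals)
        if self._interim:
            parts.append(self._interim)
        return " ".join(parts)

    def close_turn(self) -> Optional[bool]:
        """Classify the whole candidate once; None if nothing was said."""
        text = self.candidate()
        self._finals.clear()
        self._interim = ""
        if not text:
            return None
        return self._classify(text)
```

Because on_transcript never calls the classifier, a fragment like “a range of different verticals” cannot trigger an off-topic response on its own. The classifier only ever sees the full candidate.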

On NC specifically: noise_cancellation.NC() adds inline processing with negligible latency. A 7s gap isn’t explained by NC.

It is worth pulling STTMetrics for session RM_NJmiALmWqkBW. The acquire_time and connection_reused fields should show whether the Deepgram WebSocket reconnected or just stalled on that turn.

Have you tried adjusting the speech duration and silence duration settings in the VAD? This sounds like a cool project, and I am looking into building something similar myself. Would you mind sharing the code? If you are OK with it, I can also look at the audio files and do some analysis in my spare time. I agree with @Muhammad_Usman_Bashir, and as I said, it may be worth tuning the VAD speech and silence durations so they line up with your nudge timing.