STT Audio Never Reaches Agent Despite Successful Track Subscription (Started ~May 31)

Hi LiveKit team,

We’re experiencing a production issue that started around 2026-05-31 20:48 UTC and is currently impacting all of our voice agents.

Symptoms

Every AgentSession successfully connects to the room and subscribes to the user’s audio track, but the agent never receives usable audio for STT processing.

As a result:

  • Agent speaks the opening message.

  • User talks normally.

  • Agent never responds.

  • Agent eventually sends keepalive prompts (“Still there?”) and disconnects.

In all affected sessions:

  • tts_audio_duration accumulates normally (~20s+)

  • stt_audio_duration remains 0.00–0.05s

Diagnostic Findings

We added instrumentation and consistently see:

participant_connected
track_published kind=1 source=2 muted=False
track_subscribed kind=1

Example broken session:

[diag] participant_connected identity=69ed9ad9...
[diag] track_published participant=69ed9ad9... track_sid=TR_AMkPxpzvJ4LTNZ kind=1 source=2 muted=False
[diag] track_subscribed participant=69ed9ad9... track_sid=TR_AMkPxpzvJ4LTNZ kind=1

Usage summary:
tts_audio_duration=20.736
stt_audio_duration=0.05

This suggests:

  • Publisher joins successfully

  • Audio track is published

  • Subscriber receives subscription event

  • Audio never reaches the STT pipeline
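For anyone who wants to flag affected sessions automatically, here is a minimal sketch of the check implied above: TTS duration accumulates while STT duration stays near zero. The function names and both thresholds are illustrative assumptions, not part of livekit-agents; the input is simply the `key=value` usage lines shown in the summary.

```python
def parse_usage(summary: str) -> dict[str, float]:
    """Parse 'key=value' lines (as in the usage summary above) into floats."""
    usage: dict[str, float] = {}
    for line in summary.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines that are not key=value pairs
            usage[key] = float(value)
    return usage


def looks_audio_starved(
    usage: dict[str, float],
    min_tts_s: float = 5.0,   # assumed: the agent has clearly been speaking
    max_stt_s: float = 0.1,   # assumed: STT has seen essentially no audio
) -> bool:
    """True when TTS audio accumulated but STT received almost nothing."""
    tts = usage.get("tts_audio_duration", 0.0)
    stt = usage.get("stt_audio_duration", 0.0)
    return tts >= min_tts_s and stt <= max_stt_s


if __name__ == "__main__":
    summary = "tts_audio_duration=20.736\nstt_audio_duration=0.05"
    print(looks_audio_starved(parse_usage(summary)))
```

The thresholds would need tuning per deployment; a session where the user genuinely never speaks would also match, so this is a triage filter rather than proof of the fault.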

What We’ve Already Ruled Out

Hypothesis                   Result
BVC / Noise Cancellation     No change
auto_subscribe defaults      Explicit AUDIO_ONLY, no change
Silero VAD thresholds        No change
Agent code regression        No change
Docker image drift           No change
STT provider issue           No change

We upgraded:

livekit-agents: 1.2.1 → 1.5.16

and aligned all plugin versions.

We also tested multiple STT providers:

  • Deepgram

  • Groq

Both exhibit the same behavior.

Deepgram account has available credits.

Additional Observation

In the LiveKit dashboard for affected sessions:

  • Both participants appear under Publishers

  • Subscribers table is empty

  • Session Events only show:

    • participant_joined

    • participant_left

No track_published or track_subscribed events appear there, even though our agent logs show them firing locally.

Recent Broken Room IDs

RM_PRzMEaydKHw9  (Groq STT)
RM_HfV7LTckpTv8  (Deepgram STT)

Questions

  1. Has anyone seen a situation where:

    • track_subscribed fires

    • but audio frames never reach the agent/STT layer?

  2. Are there known SFU-side conditions that could produce:

    • successful subscription events

    • near-zero stt_audio_duration

  3. Could a project-level configuration, routing issue, or media forwarding problem cause subscribers to appear missing in the dashboard while subscriptions appear successful inside the SDK?

At this point we’ve ruled out application code, container versions, STT providers, and account credits, so we’re looking for guidance on what to inspect next.

Any suggestions would be greatly appreciated.

Thanks!
CareerKart Team

Same symptom, same time window (May 31+), but on SIP inbound (Tokyo trunk) — adding a data point in case it helps narrow the regression.

End-to-end trace for one ~30 s call:

  1. RTP arriving at the trunk: real voice (PCMU/G.711). Packet counts, voice/silence ratio, and amplitude are all consistent with normal speech at the carrier side (verified at our network edge before LiveKit).

  2. LiveKit dashboard, SIP participant — Total upstream ≈ 8.78 KB / 31 s (~2.3 kbps). This is consistent with Opus-DTX-encoded speech at the observed ~10% voice ratio, i.e. media is reaching the SIP ingress.

  3. Agent participant (livekit-rtc 1.1.8) — track_subscribed fires for the SIP audio track. Frame format is correct (10 ms mono, 16 kHz, 160 samples/channel), but peak_amp = abs(int16_samples).max() = 0 for every 3-second window across the entire 27+ s session. Every PCM sample is literally 0x00.

  4. Downlink (agent → SIP) is fine — the caller hears the agent’s TTS greeting normally. Only the agent-side subscribe is silent.

So the disconnect sits between the SIP ingress (upstream-bytes > 0) and the agent’s subscribed track (every sample 0). Same shape as the WebRTC pattern in the OP, just on the SIP side — suggests it’s not transport-specific.

Trunk config (all defaults / ruled out): media encryption disabled, Krisp disabled, allowed-addresses matches sender, codec negotiation clean (PCMU/8000, RTP/AVP, ptime 20). No noise-cancellation plugin in agent code.
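The frame-format and peak-amplitude checks from step 3 can be sketched roughly as follows. This is a standalone illustration under stated assumptions: it takes raw 16-bit PCM bytes in native byte order, and the helper names `frame_peak` and `frame_duration_ms` are hypothetical. How the bytes are pulled out of the subscribed track is left out.

```python
from array import array


def frame_peak(pcm) -> int:
    """Largest absolute sample value in a buffer of signed 16-bit PCM.

    Accepts any bytes-like object (bytes, bytearray, memoryview).
    """
    samples = array("h")            # 'h' = signed 16-bit integers
    samples.frombytes(bytes(pcm))
    # Python ints do not overflow, so abs(-32768) is safely 32768.
    return max((abs(s) for s in samples), default=0)


def frame_duration_ms(samples_per_channel: int, sample_rate: int) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 * samples_per_channel / sample_rate


if __name__ == "__main__":
    silent_frame = bytes(160 * 2)   # 160 mono int16 samples, all zero
    print(frame_duration_ms(160, 16000), frame_peak(silent_frame))
```

One caveat if the same check is done with NumPy: taking `abs` of an int16 array wraps for the value -32768, so widening to int32 first is the safer choice. It does not affect an all-zero result like the one reported here.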

@info_career That does sound like an STT issue to me. I can see the call was successfully established and the agent joined.

The first thing to check would be any logs that your agent produced. I don’t have access to those, but I don’t see anything out of the ordinary in the server logs.

You could also try the Agent Console (see “Agent Console” in the LiveKit documentation) to isolate testing to just your agent.

Testing other STT providers through LiveKit Inference, even if it’s just a sanity test, would help isolate the root cause.

@dongwan.hong We added an identical int16-peak-amplitude tap on our subscribed audio frames and just confirmed: we see exactly your symptom, just on WebRTC instead of SIP.

Format on first frame (confirmed via diag): 48000 Hz mono, samples_per_channel=480, memoryview, 480-sample frames.

Per-10-frame window peaks across a 3-minute session (with the bot publishing real Polly TTS the whole time):

frames=10 samples=4800 peak=1 zero_frames=0–2
frames=10 samples=4800 peak=1
frames=10 samples=4800 peak=1
… (sustained for 3 min)

So our frames aren’t literally 0x00 like yours: peak=1 means a single sample slot is hitting amplitude 1 (out of int16 max 32,767). Functionally identical to your finding: subscribed audio is silenced to noise floor before reaching the agent’s STT input.

Confirms it’s NOT transport-specific (WebRTC + SIP both affected) AND NOT STT-provider-specific (we tested Deepgram, Groq, AND LK Inference managed STT; all three see stt_audio_duration ≈ 0). Same May 31+ window for both of us.

This is now a documented multi-tenant LK Cloud regression. We’ve got an open ticket and are happy to coordinate so LK can correlate SFU state across both projects.

@darryncampbell

  1. LK Inference STT — same failure.
    stt=lk_inference (model=deepgram/nova-3 routed through LK Cloud’s
    managed path, no direct provider key) → stt_audio_duration=0.0.
    Identical to direct Deepgram + Groq.

  2. Agent Console — set up; events stream confirms participant joins
    and track subscription but the audio waveform pane shows flat-line
    in the subscriber direction.

  3. Per-frame peak amplitude diagnostic — landed and ran. Definitive
    evidence on our side that subscribed frames arrive silenced:

    Track subscribed (kind=1, source=2, muted=False) on a 48kHz / mono /
    480 samples-per-channel memoryview pipe. First frame format matches
    what livekit-agents expects. But:

    window 1: frames=10 samples=4800 peak=1 zero_frames=2
    window 2: frames=10 samples=4800 peak=1 zero_frames=0
    window 3: frames=10 samples=4800 peak=1 zero_frames=1
    … sustained across ~1500 windows in a single 3-minute session

    peak=1 means the loudest int16 sample in any 10-frame window is 1
    (out of 32767). That’s the noise floor — there is no real signal
    in the subscribed audio frames, even though the publisher is
    continuously sending real Polly TTS (verified bytes pre-publish).

  4. Cross-tenant corroboration — @dongwan.hong’s post above reports
    the same shape on SIP Tokyo trunk: track_subscribed fires, frames
    arrive with correct format (10ms / mono / 16kHz / 160 samples),
    but their per-window peak_amp=0 for every sample across the call.
    Same May 31+ regression window. Different transport, same outcome.

So our updated hypothesis: between the SFU’s receive path (where
real audio packets demonstrably arrive — your dashboard shows the
publisher’s upstream bytes for both our projects) and the subscriber’s
audio-frame delivery, audio sample values are being zeroed out. The
SDK-level subscription handshake is fine; the media-frame payload is
empty.
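For reference, the per-window numbers in item 3 (frames, samples, peak, zero_frames) can be produced by an aggregator along these lines. This is a minimal sketch, not our exact instrumentation: the class and field names are hypothetical, the audio is assumed to be mono 16-bit PCM in native byte order, and feeding it from the subscribed track is omitted.

```python
from array import array
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowStats:
    frames: int = 0
    samples: int = 0       # total samples seen (equals per-channel count for mono)
    peak: int = 0          # loudest absolute sample in the window
    zero_frames: int = 0   # frames in which every sample was exactly 0


class PeakWindow:
    """Accumulates frames and emits one WindowStats per N frames."""

    def __init__(self, frames_per_window: int = 10) -> None:
        self.frames_per_window = frames_per_window
        self._stats = WindowStats()

    def push(self, pcm) -> Optional[WindowStats]:
        """Add one frame; return the finished window's stats, else None."""
        samples = array("h")
        samples.frombytes(bytes(pcm))
        peak = max((abs(s) for s in samples), default=0)

        stats = self._stats
        stats.frames += 1
        stats.samples += len(samples)
        stats.peak = max(stats.peak, peak)
        if peak == 0:
            stats.zero_frames += 1

        if stats.frames < self.frames_per_window:
            return None
        self._stats = WindowStats()  # start the next window
        return stats
```

Each completed window can then be logged in the `frames=… samples=… peak=… zero_frames=…` shape shown above. With 480-sample frames at 48 kHz, ten frames cover 100 ms of audio.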

What we’d like:

  • Could you correlate per-track upstream-bytes vs forwarded-bytes
    on the SFU for our broken rooms (RM_uwTim9hMWKPx, RM_jCTVJLvLWJJV,
    RM_PRzMEaydKHw9, RM_HfV7LTckpTv8 — all today, all 48kHz Opus from
    Polly-published WebRTC)?
  • If you see upstream-bytes > 0 but forwarded-bytes ~ 0 (or forwarded
    payload but stripped sample values), that’s the smoking gun.
  • Any project-level audio-processing or BVC default that changed for
    careerkart-b2s80ese around 2026-05-31 20:48 UTC?

Project: careerkart-b2s80ese
Region: India West (per our worker registered_at log)

Standing by to fire a live repro at any window you specify.

I don’t believe this is the same issue that @dongwan.hong is seeing. This is not a SIP call, and their recent test with the Agent Console showed the agent was running fine.

Something feels fundamentally wrong here - STT isn’t working for you, and your agent is not responding at all (it sounds like). If I were you, I would revert back to the agent starter, just to check that something isn’t fundamentally misconfigured.

QUOTE: “If I were you, I would revert back to the agent starter, just to check that something isn’t fundamentally misconfigured.”

In most of my main issues, that has been the case. It was usually something minor that I had just overlooked or left out, though it may have taken me hours to track it down. Then, when I noticed it was something I had just forgotten or done wrong, I would use a few choice words… Great advice, and I agree with ya. Backtrack and check your code.