I have an LLM + STT + TTS pipeline and I need to programmatically toggle the STT/TTS parts on and off.
Right now, even when I try to disable them via audio.set_audio_enabled(), they don’t fully shut down.
The audio stream continues, and the session eventually crashes because it’s still attempting to process recognized audio.
I think I’ve tried every trick I could find, but none of them worked. Please help.
Hi, did you see this? Agent session | LiveKit Documentation — it also links to a couple of examples that show how to toggle room I/O:
Yes, I reviewed those examples. When I use session.input.set_audio_enabled(False), the AgentSession crashes after retrying speech recognition. It still attempts to process the audio event, even though no audio is being sent.
This is the issue I see:
```
WARNING:livekit.agents:failed to recognize speech: Audio Timeout Error: Long duration elapsed without audio. Audio should be sent close to real time. [], retrying in 2.0s
```
I want the user to start with text-based interaction only, and during the session allow them to connect to the audio pipeline for spoken interaction.
Both examples crash, or just one? I’ll try to reproduce. Which version of LiveKit agents are you using?
The toggle_io one. I’m on version 1.4.2, and this kind of workaround is required to prevent the crash. However, I’m not sure whether this is the intended or recommended approach, since it requires accessing internal APIs:
```python
def _deactivate_stt_node(self, session) -> None:
    """Stop the STT stream to prevent Audio Timeout errors.

    Args:
        session: The active AgentSession
    """
    if (
        hasattr(session, '_activity')
        and session._activity is not None
        and session._activity._audio_recognition is not None
    ):
        session._activity._audio_recognition.update_stt(None)
        logger.info("🔇 STT stream stopped")
OK, for agents/examples/voice_agents/toggle_io.py at main · livekit/agents · GitHub, when I connect to a room I see the following exception:

```
Exception: cannot access local participant before connecting
```
I will need to fix that in the sample, but the fix is to add:

```python
await ctx.connect()
```

immediately after `await session.start(…)`.
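In other words, the corrected entrypoint ordering looks roughly like this. This is a sketch with stub classes standing in for livekit's JobContext and AgentSession, so it runs without the library; the only point it demonstrates is the call order:

```python
import asyncio

# Stand-in classes so the ordering sketch is self-contained.
class StubSession:
    def __init__(self):
        self.started = False
    async def start(self):
        self.started = True

class StubContext:
    def __init__(self):
        self.connected = False
    async def connect(self):
        self.connected = True

async def entrypoint(ctx: StubContext) -> StubSession:
    session = StubSession()
    await session.start()   # start the AgentSession first
    await ctx.connect()     # then connect to the room (the missing call)
    return session

session = asyncio.run(entrypoint(StubContext()))
```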
After you apply that fix and connect to a room, it should work. You can test it as follows:
- Start the agent: `uv run examples/voice_agents/toggle_io.py dev`
- Use the Agents Playground as a front end: https://agents-playground.livekit.io/
- Press Connect in the Playground and allow the agent to connect to your room
- Under the RPC options in the playground, specify the method name as toggle_input and the payload as either audio_off or audio_on; you should see the agent stop or start responding to your speech.
- The example is written with the OpenAI Realtime LLM, but it should work with any LLM/STT/TTS; you’ll need to modify it to match your setup if you don’t have an OpenAI key.
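Under the hood, the toggle_input RPC boils down to mapping the payload onto `session.input.set_audio_enabled`. A minimal sketch of that mapping, with a stub session standing in for livekit's AgentSession so it runs on its own (the real handler is registered as an RPC method in the toggle_io example):

```python
# Stub standing in for AgentSession.input so the sketch is self-contained.
class StubInput:
    def __init__(self):
        self.audio_enabled = True
    def set_audio_enabled(self, enabled: bool) -> None:
        self.audio_enabled = enabled

class StubSession:
    def __init__(self):
        self.input = StubInput()

def toggle_input(session: StubSession, payload: str) -> str:
    # Payloads used in the Playground test above: audio_on / audio_off.
    if payload not in ("audio_on", "audio_off"):
        return f"unknown payload: {payload}"
    session.input.set_audio_enabled(payload == "audio_on")
    return f"audio input {'on' if session.input.audio_enabled else 'off'}"

session = StubSession()
print(toggle_input(session, "audio_off"))  # audio input off
print(toggle_input(session, "audio_on"))   # audio input on
```

Disabling audio input this way is what triggered the Audio Timeout warning earlier in the thread, which is why the sample pairs it with properly stopping the STT stream rather than just starving it of audio.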
Hey! I ran into this exact issue when building voice agents that needed to “mute” themselves during certain workflows. Have you tried controlling it at the VAD (Voice Activity Detection) level? That’s been the cleanest approach for me.
The idea is that VAD sits at the entry point of your pipeline before STT even kicks in. When you disable VAD, the agent stops detecting speech entirely, so the whole pipeline stays idle without needing to tear down STT/TTS.
Here’s what worked for me:
```python
# Control listening state via VAD
async def toggle_listening(agent: VoiceAssistant, enabled: bool):
    if enabled:
        agent.vad.start()  # Resume speech detection
    else:
        agent.vad.stop()   # Ignore audio input
```
If you need something heavier (like completely pausing multi-turn conversations), I’d recommend pausing the entire agent session instead:
```python
# Pause the full agent context, not just listening
async def toggle_agent(agent: VoiceAssistant, active: bool):
    if active:
        await agent.resume()
    else:
        await agent.pause()
```
VAD control is way more efficient because the audio tracks keep flowing; you’re just not processing them. Pause/resume is better when you’re switching between different interaction modes entirely.
What’s your use case? Are you trying to implement push-to-talk, or is it more like “agent speaks, user can’t interrupt”?