Hi folks, I have started using realtime-2, but the VAD is horrible. Any sort of backchanneling like “uh-huh” or “ok” triggers the agent to stop talking. My current settings are below:
turn_detection:
  type: server_vad            # semantic_vad | server_vad
  eagerness: low              # low | medium | high (semantic_vad only)
  create_response: true
  interrupt_response: true
  threshold: 0.8              # server_vad only: energy detection threshold
  prefix_padding_ms: 300      # server_vad only: ms of audio before speech
  silence_duration_ms: 700    # server_vad only: ms of silence to end turn
I have tried playing with various settings but it’s still not working well.
Is this a known issue or am I missing something?
@James_Lau, server_vad is the issue: it’s energy-only, so “uh-huh”/“ok” trigger an interruption regardless of threshold. Switch to semantic_vad, which classifies on the actual words. Note that eagerness: low in your config has no effect; it only applies when type is semantic_vad.
# livekit-agents==1.5.x
from livekit.plugins.openai import realtime
from openai.types.beta.realtime.session import TurnDetection

llm = realtime.RealtimeModel(
    turn_detection=TurnDetection(
        type="semantic_vad",
        eagerness="low",
        create_response=True,
        interrupt_response=True,
    ),
)
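To make the point about ignored keys concrete, here is a minimal sketch that splits a turn_detection config into the settings that apply to the chosen type and the ones that are silently ignored. The helper name split_turn_detection and both key tables are hypothetical; they are not part of livekit-agents or the OpenAI SDK. The key-to-mode mapping is taken from the comments in the config posted above, and treating create_response and interrupt_response as common to both modes is an assumption.

```python
# Hypothetical helper, not part of livekit-agents or the OpenAI SDK.
# Which key belongs to which mode follows the comments in the posted config.
COMMON_KEYS = {"type", "create_response", "interrupt_response"}  # assumed shared
MODE_KEYS = {
    "server_vad": {"threshold", "prefix_padding_ms", "silence_duration_ms"},
    "semantic_vad": {"eagerness"},
}


def split_turn_detection(config: dict) -> tuple[dict, list[str]]:
    """Return (settings that apply to the mode, sorted names of ignored keys)."""
    mode = config.get("type")
    if mode not in MODE_KEYS:
        raise ValueError(f"unknown turn detection type: {mode!r}")
    allowed = COMMON_KEYS | MODE_KEYS[mode]
    kept = {key: value for key, value in config.items() if key in allowed}
    ignored = sorted(key for key in config if key not in allowed)
    return kept, ignored


# The config from the original post, as a plain dict.
posted = {
    "type": "server_vad",
    "eagerness": "low",
    "create_response": True,
    "interrupt_response": True,
    "threshold": 0.8,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 700,
}

kept, ignored = split_turn_detection(posted)
print("ignored under server_vad:", ignored)

# Same settings after switching the type, as suggested above.
kept_semantic, ignored_semantic = split_turn_detection({**posted, "type": "semantic_vad"})
print("ignored under semantic_vad:", ignored_semantic)
```

With the posted config, eagerness lands in the ignored list; after switching to semantic_vad, the three server_vad tuning values are the ones that drop out, so only eagerness is left to tune.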
Worth knowing: with realtime models, LiveKit-side InterruptionOptions are mostly ignored; only enabled and discard_audio_if_uninterruptible apply. All tuning has to happen on the model’s own TurnDetection. If semantic_vad with eagerness="low" still over-interrupts, the escape hatch is to set turn_detection=None on the model and run LiveKit’s turn detector instead, but that needs an STT plugin, doubling transcription cost. Filler-word filtering as a separate knob is requested in livekit/agents issue #4450 (“Ignore Filler Words During Interruption Detection”); it is not yet in main.
https://docs.livekit.io/agents/integrations/realtime/openai/
Hi everyone, thanks for reporting this! We’re working on full support of gpt-realtime-2 in this PR, feel free to follow along for progress 