Instability with LiveKit plugins for Azure OpenAI Realtime

Hi LiveKit Team,

I’m facing an issue while using GPT Realtime 2 as a single model for STT, LLM, and TTS.

Configuration:

realtime_kwargs = {
    "azure_deployment": "gpt-realtime-2",
    "azure_endpoint": "https://my-openai-resource.openai.azure.com",
    "api_key": "AZURE_OPENAI_API_KEY",
    "api_version": "2024-10-01-preview",
    "temperature": 0.7,
    "modalities": ["audio", "text"],
    "voice": "alloy",
}

llm_service = openai.realtime.RealtimeModel.with_azure(**realtime_kwargs)

Error observed:

{
    "message": "expected to receive only one message generation from the realtime API",
    "level": "WARNING",
    "name": "livekit.agents"
}

After this warning, the agent suddenly stops speaking/responding until the conversation is triggered again.

Could you please help identify whether this is related to multi-generation handling or compatibility with GPT-Realtime-2?

@darryncampbell could you help me out with this?

@sahil.dutta, the warning lines up with an assumption in the Realtime plugin: “Our code assumes a response will generate only one item with type ‘message’” [livekit/agents/livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/realtime/realtime_model.py]. When the Realtime API emits multiple message items in one response, downstream handling breaks, which produces the “stops until next trigger” behavior.

Two things to try:

  1. Update your api_version. You’re on 2024-10-01-preview (October 2024). The plugin’s _AZURE_EVENT_MAPPING [same file] normalizes Azure’s old beta event names to OpenAI GA event names, so a newer api_version may avoid the multi-message case. Check Microsoft’s current Azure OpenAI Realtime api_version and bump.
  2. Drop text from modalities. Try modalities: ['audio'] only. The single-message assumption is more likely to hold when only audio is emitted; text + audio can land as separate items depending on API version.
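The two suggestions above can be sketched as a small change to the original settings. This is a minimal sketch, not a confirmed fix: the helper name apply_suggestions and the NEWER_API_VERSION placeholder are hypothetical, and the API key is read from an environment variable as an assumption. Look up the real api_version value in Microsoft's documentation before using it.

```python
import os

# Settings from the original post. The API key is read from the environment
# here instead of being written inline (the variable name is an assumption).
original_kwargs = {
    "azure_deployment": "gpt-realtime-2",
    "azure_endpoint": "https://my-openai-resource.openai.azure.com",
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY", ""),
    "api_version": "2024-10-01-preview",
    "temperature": 0.7,
    "modalities": ["audio", "text"],
    "voice": "alloy",
}


def apply_suggestions(kwargs: dict, api_version: str) -> dict:
    """Return a copy of kwargs with a new api_version and audio-only modalities."""
    adjusted = dict(kwargs)            # shallow copy; the input dict is left untouched
    adjusted["api_version"] = api_version
    adjusted["modalities"] = ["audio"]  # suggestion 2: drop "text"
    return adjusted


# Hypothetical placeholder: replace with the current Azure OpenAI Realtime
# api_version from Microsoft's documentation.
NEWER_API_VERSION = "<current-azure-realtime-api-version>"

adjusted_kwargs = apply_suggestions(original_kwargs, NEWER_API_VERSION)

# The adjusted settings would then be passed exactly as in the original post:
# llm_service = openai.realtime.RealtimeModel.with_azure(**adjusted_kwargs)
```

Keeping the original dict unchanged makes it easy to switch back and compare behavior if the audio-only setting does not help.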

Did you try with the latest Agents release, 1.5.12? It contains:

Thanks mate, the error is resolved after upgrading to the latest version. However, I see a lot of noise being captured and very unstable behavior when using the realtime model. I found https://playground.livekit.io/, which demos LiveKit with a realtime model, and I am trying to replicate that setup in my agent. Can you tell me if I am correct here: are two models being used, whisper-1 for transcription and a realtime model for LLM and TTS? If yes, how do we configure things so that the realtime model only does the work of LLM and TTS, not STT? Can you help me out with this?

The code for that, realtime-playground/agent/main.py (main branch of livekit-examples/realtime-playground on GitHub), is from November 2024, which may as well be a lifetime ago in this industry :)

This is a better resource for the OpenAI Realtime model: https://docs.livekit.io/agents/models/realtime/plugins/openai/#usage. It covers STT, LLM and TTS, and I suggest just using the defaults to get started.

If you run through our Voice AI quickstart, https://docs.livekit.io/agents/start/voice-ai/, you’ll end up using our agent starter. The quickstart assumes a pipeline architecture, but there’s a commented-out line in agent.py (agent-starter-python/src/agent.py on the main branch of livekit-examples/agent-starter-python on GitHub) that shows how to use a realtime model (architecture) instead.
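For reference, a rough sketch of what the realtime variant can look like with the Azure settings from the first post. This is not the starter's actual code: build_realtime_session is a hypothetical helper name, and it assumes the livekit-agents and livekit-plugins-openai packages are installed and that the function is called from inside the agent's entrypoint.

```python
def build_realtime_session(realtime_kwargs: dict):
    """Create an AgentSession driven by a single Azure OpenAI realtime model.

    `realtime_kwargs` is assumed to be the same dict passed to `with_azure`
    in the original post.
    """
    # Imported lazily so this module can be loaded even where the LiveKit
    # packages are not installed.
    from livekit.agents import AgentSession
    from livekit.plugins import openai

    model = openai.realtime.RealtimeModel.with_azure(**realtime_kwargs)

    # Only llm= is set: no separate stt= or tts= plugin is passed, on the
    # assumption that the realtime model handles speech in and out itself.
    return AgentSession(llm=model)
```

The pipeline architecture would instead pass separate STT, LLM and TTS components to the session, which is the difference the commented-out line in the starter is pointing at.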