Hi LiveKit Team,
I’m facing an issue while using GPT Realtime 2 as a single model for STT, LLM, and TTS.
Configuration:
from livekit.plugins import openai

realtime_kwargs = {
    "azure_deployment": "gpt-realtime-2",
    "azure_endpoint": "https://my-openai-resource.openai.azure.com",
    "api_key": "AZURE_OPENAI_API_KEY",
    "api_version": "2024-10-01-preview",
    "temperature": 0.7,
    "modalities": ["audio", "text"],
    "voice": "alloy",
}
llm_service = openai.realtime.RealtimeModel.with_azure(**realtime_kwargs)
Error observed:
{
    "message": "expected to receive only one message generation from the realtime API",
    "level": "WARNING",
    "name": "livekit.agents"
}
After this warning, the agent suddenly stops speaking/responding until the conversation is triggered again.
Could you please help identify whether this is related to multi-generation handling or compatibility with GPT-Realtime-2?
@darryncampbell could you help me out with this?
@sahil.dutta, the warning lines up with an assumption in the Realtime plugin: “Our code assumes a response will generate only one item with type ‘message’” [livekit/agents/livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/realtime/realtime_model.py]. When the Realtime API emits multiple message items in one response, downstream handling breaks, which produces the “stops until next trigger” behavior.
Two things to try:
1. Update your api_version. You’re on 2024-10-01-preview (October 2024). The plugin’s _AZURE_EVENT_MAPPING [same file] normalizes Azure’s old beta event names to OpenAI GA event names, so a newer api_version may avoid the multi-message case. Check Microsoft’s documentation for the current Azure OpenAI Realtime api_version and bump to it.
2. Drop text from modalities. Try modalities: ['audio'] only. The single-message assumption is more likely to hold when only audio is emitted; text + audio can land as separate items depending on API version.
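Both suggestions are small changes to the kwargs from the original post. Here is a minimal sketch, assuming the config lives in a plain dict as shown there. `with_suggested_changes` is a hypothetical helper name, not part of the plugin, and the new `api_version` is left as a placeholder because the correct value has to come from Microsoft's current documentation:

```python
from copy import deepcopy


def with_suggested_changes(realtime_kwargs: dict, api_version: str) -> dict:
    """Return a copy of the kwargs with a new api_version and audio-only output.

    Hypothetical helper for illustration; it is not part of the LiveKit plugin.
    """
    updated = deepcopy(realtime_kwargs)
    updated["api_version"] = api_version  # look this value up, do not guess it
    updated["modalities"] = ["audio"]  # drop "text" so only audio is emitted
    return updated


# Subset of the config from the original post.
original = {
    "azure_deployment": "gpt-realtime-2",
    "api_version": "2024-10-01-preview",
    "modalities": ["audio", "text"],
    "voice": "alloy",
}
updated = with_suggested_changes(original, api_version="<current-api-version>")

# The result is then passed exactly as before:
# llm_service = openai.realtime.RealtimeModel.with_azure(**updated)
```

Copying the dict first keeps the original config around, so it is easy to switch back if the change makes no difference.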
Did you try with the latest Agents release, 1.5.12? It contains:
main ← longc/multi-message-realtime-v2
opened 06:46AM - 18 May 26 UTC
## Summary
- Process each `MessageGeneration` from `generation_ev.message_str… eam` serially via `perform_audio_forwarding` + `perform_text_forwarding` + `wait_for_playout`. Only one flush is in flight at a time.
- Per-msg state is derived directly from the `playback_finished` event:
  - `full` → emit `ChatMessage(interrupted=False)` with the msg's `message_id`
  - `partial` → emit `ChatMessage(interrupted=True)` and call `_rt_session.truncate(...)` with this msg's local `playback_position` (not a cumulative offset)
  - `skipped` → drop locally and call `update_chat_ctx(...)` so the realtime server removes never-played items from its history
- `_on_first_frame` now early-returns once `started_speaking_at` is set, so per-msg first-frame callbacks don't re-fire `_update_agent_state("speaking")` for each message.
## Alternative considered
#5690 makes multi-message work by flushing per message — that needs the synchronizer to keep pending/finalizing impls alive and serialize concurrent flushes in `room_io/_output.py`. Our AudioOutput assumes there is only one speech at a time, serializing per-message at the `wait_for_playout` boundary (this PR) avoids both changes.
close https://github.com/livekit/agents/pull/5690, https://github.com/livekit/agents/issues/5684
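The per-message handling described in the PR summary can be pictured with a toy model. This is only a sketch of the three-way dispatch and is not the PR's code: `PlayedMessage`, `ToySession`, and `handle_playback` are hypothetical names, and the lists stand in for the real `ChatMessage` emission, `truncate(...)`, and `update_chat_ctx(...)` calls.

```python
from dataclasses import dataclass, field


@dataclass
class PlayedMessage:
    """Hypothetical stand-in for one message plus its playback result."""
    message_id: str
    playback_state: str  # "full", "partial", or "skipped"
    playback_position: float = 0.0  # seconds played for this message only


@dataclass
class ToySession:
    """Records what a real session would do; all names are illustrative."""
    chat_messages: list = field(default_factory=list)  # (message_id, interrupted)
    truncations: list = field(default_factory=list)    # (message_id, position)
    removed_ids: list = field(default_factory=list)    # never-played messages


def handle_playback(session: ToySession, msg: PlayedMessage) -> None:
    """Dispatch on the playback state of a single message."""
    if msg.playback_state == "full":
        session.chat_messages.append((msg.message_id, False))
    elif msg.playback_state == "partial":
        session.chat_messages.append((msg.message_id, True))
        # The position is local to this message, not a cumulative offset.
        session.truncations.append((msg.message_id, msg.playback_position))
    elif msg.playback_state == "skipped":
        session.removed_ids.append(msg.message_id)
    else:
        raise ValueError(f"unknown playback state: {msg.playback_state!r}")


session = ToySession()
messages = [
    PlayedMessage("msg_1", "full"),
    PlayedMessage("msg_2", "partial", playback_position=1.5),
    PlayedMessage("msg_3", "skipped"),
]
# Messages are handled one at a time, mirroring the serial processing in the PR.
for msg in messages:
    handle_playback(session, msg)
```

In this example the user interrupts during the second message, so the first is recorded as complete, the second as interrupted with its own 1.5 second position, and the third is removed because it was never played.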
Thanks mate, the error is resolved after upgrading to the latest version. However, I see a lot of noise being captured and very unstable behavior while using the realtime model. I found a demo of LiveKit with a realtime model at https://playground.livekit.io/ and I am trying to replicate that setup in my agent. Can you tell me if I have this right: are two models in use there, whisper-1 for transcription and a realtime model for LLM and TTS? If yes, how do we configure the realtime model so that it only does the LLM and TTS work, not the STT? Can you help me out with this?
The code for that, realtime-playground/agent/main.py in the livekit-examples/realtime-playground repository on GitHub, is from November 2024, which may as well be a lifetime ago in this industry.
This is a better resource for OpenAI Realtime model: https://docs.livekit.io/agents/models/realtime/plugins/openai/#usage . It includes STT, LLM and TTS and I suggest just using the defaults to get started.
If you run through our Voice AI quickstart, https://docs.livekit.io/agents/start/voice-ai/ , you’ll end up using our agent starter. The quickstart assumes you’re using a pipeline architecture, but there’s a commented-out line in agent-starter-python/src/agent.py (in the livekit-examples/agent-starter-python repository on GitHub) which gives instructions on how to use a realtime model (architecture) instead.
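For orientation, switching a session to a realtime model roughly means passing the realtime model as the `llm` and leaving out the separate STT and TTS components. This is a minimal sketch, assuming `livekit-agents` and `livekit-plugins-openai` are installed and OpenAI credentials are set in the environment. `build_realtime_session` is a hypothetical wrapper name, and the exact commented-out line in the starter's agent.py may differ, so treat that file and the plugin docs linked above as the source of truth:

```python
def build_realtime_session():
    """Build an AgentSession backed by a single realtime model.

    Imports are done lazily so this module can be imported without the
    LiveKit packages installed. Calling the function requires
    livekit-agents, livekit-plugins-openai, and OpenAI credentials.
    """
    from livekit.agents import AgentSession
    from livekit.plugins import openai

    # The realtime model replaces the separate STT, LLM, and TTS components
    # that a pipeline session would configure.
    return AgentSession(
        llm=openai.realtime.RealtimeModel(voice="alloy"),
    )
```

The voice `"alloy"` is carried over from the config earlier in this thread; everything else is left at the defaults, as suggested above.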