gpt-realtime-1.5 leaks audio control tokens (<|audio_text|>, <|caption_quality_N|>) into the text stream when run with modalities=["text"]

Affected: gpt-realtime-1.5 (OpenAI direct API and Azure OpenAI deployments). gpt-realtime is not affected.

Reproduction (minimal idea):

  1. Open a Realtime API session with modalities: ["text"] (no audio output requested).

  2. Send a normal user message via input_audio_buffer (audio in) or conversation.item.create (text in).

  3. Observe the assistant’s response.text.delta / response.output_text.delta events.
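The three steps above map onto raw Realtime API events roughly as follows. This is a sketch, not a full client: the event shapes follow the published Realtime API schema, and the user text is a placeholder.

```python
import json

# Step 1: request text-only output for the session.
session_update = {
    "type": "session.update",
    "session": {"modalities": ["text"]},
}

# Step 2: a plain text user message (the audio-in path via
# input_audio_buffer reproduces the same behavior).
user_message = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "Hello"}],
    },
}

# Ask the model to respond.
request_response = {"type": "response.create"}

# Step 3: after sending these over the websocket, watch the incoming
# response.text.delta / response.output_text.delta events and inspect
# their "delta" field for the leaked control tokens.
for evt in (session_update, user_message, request_response):
    print(json.dumps(evt))
```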

Expected: Text stream contains only the spoken transcript.

Actual: The text stream is interleaved with audio-side control tokens, e.g.:

<|audio_text|><|caption_quality_9|>Hello, how can I help you today?

These tokens never appear with gpt-realtime. They appear consistently with gpt-realtime-1.5 on the very first response of every session, regardless of system prompt.

Why this matters in production: When the Realtime LLM is paired with an external TTS (e.g. ElevenLabs, Cartesia, etc.) — which is the standard “realtime LLM + 3rd-party voice” architecture — the raw text stream is fed to the TTS engine. The engine speaks the tokens literally, so users hear “audio text caption quality nine …” prefixed to every assistant reply. With OpenAI’s native voice (modalities=["text","audio"]), the tokens stay inside OpenAI’s TTS path and are never spoken, which is why the bug is invisible if you only test with OpenAI voice.

Sample log line from a real call (LiveKit agents transcript):

[AGENT]: <|audio_text|><|caption_quality_9|>

These tokens come from the model itself, not from LiveKit Agents.

The workaround is to strip the tokens from the text stream before it reaches the TTS engine:
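A minimal Python sketch of such a filter. The regex and class name are illustrative (not part of any library API); the one subtlety it handles is that a control token can be split across two streamed deltas, so a trailing partial token is buffered until the next chunk arrives.

```python
import re

# Matches complete control tokens such as <|audio_text|> or <|caption_quality_9|>.
TOKEN_RE = re.compile(r"<\|[^|>]*\|>")
# Matches a possibly-incomplete token at the very end of a chunk,
# e.g. "<" or "<|caption_qua", which must be held back.
PARTIAL_RE = re.compile(r"<(\|[^>]*)?$")


class ControlTokenFilter:
    """Strips <|...|> control tokens from a stream of text deltas,
    including tokens split across chunk boundaries."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, delta: str) -> str:
        """Feed one text delta; return the cleaned text safe to forward to TTS."""
        self._buf += delta
        cleaned = TOKEN_RE.sub("", self._buf)
        partial = PARTIAL_RE.search(cleaned)
        if partial:
            # Hold back the trailing fragment until the token completes.
            self._buf = partial.group(0)
            return cleaned[: partial.start()]
        self._buf = ""
        return cleaned

    def flush(self) -> str:
        """Release any buffered text at end of response."""
        out = TOKEN_RE.sub("", self._buf)
        self._buf = ""
        return out
```

Wiring it in is one call per delta: pass each response.text.delta payload through `feed()` before handing it to the TTS engine, and call `flush()` when the response completes.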