gpt-realtime-1.5 leaks audio control tokens (<|audio_text|>, <|caption_quality_N|>) into the text stream when run with modalities=["text"]

Affected: gpt-realtime-1.5 (OpenAI direct API and Azure OpenAI deployments). gpt-realtime is not affected.

Reproduction (minimal idea):

  1. Open a Realtime API session with modalities: ["text"] (no audio output requested).

  2. Send a normal user message via input_audio_buffer (audio in) or conversation.item.create (text in).

  3. Observe the assistant’s response.text.delta / response.output_text.delta events.
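The three steps above map onto raw Realtime API events roughly as follows. This is a sketch, not a full client: the event shapes follow the published Realtime API schema, and the user text is a placeholder.

```python
import json

# Step 1: request text-only output for the session.
session_update = {
    "type": "session.update",
    "session": {"modalities": ["text"]},
}

# Step 2: a plain text user message (the audio-in path via
# input_audio_buffer reproduces the same behavior).
user_message = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "Hello"}],
    },
}

# Ask the model to respond.
request_response = {"type": "response.create"}

# Step 3: after sending these over the websocket, watch the incoming
# response.text.delta / response.output_text.delta events and inspect
# their "delta" field for the leaked control tokens.
for evt in (session_update, user_message, request_response):
    print(json.dumps(evt))
```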

Expected: Text stream contains only the spoken transcript.

Actual: The text stream is interleaved with audio-side control tokens, e.g.:

<|audio_text|><|caption_quality_9|>Hello, how can I help you today?

These tokens never appear with gpt-realtime. They appear consistently with gpt-realtime-1.5 on the very first response of every session, regardless of system prompt.

Why this matters in production: When the Realtime LLM is paired with an external TTS (e.g. ElevenLabs, Cartesia, etc.) — which is the standard “realtime LLM + 3rd-party voice” architecture — the raw text stream is fed to the TTS engine. The engine speaks the tokens literally, so users hear “audio text caption quality nine …” prefixed to every assistant reply. With OpenAI’s native voice (modalities=["text","audio"]), the tokens stay inside OpenAI’s TTS path and are never spoken, which is why the bug is invisible if you only test with OpenAI voice.

Sample log line from a real call (LiveKit agents transcript):

[AGENT]: <|audio_text|><|caption_quality_9|>

These tokens come from the model itself, not from LiveKit Agents.

The workaround is to strip the tokens from the text stream before it reaches the TTS engine:
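A minimal Python sketch of such a filter. The regex and class name are illustrative (not part of any library API); the one subtlety it handles is that a control token can be split across two streamed deltas, so a trailing partial token is buffered until the next chunk arrives.

```python
import re

# Matches complete control tokens such as <|audio_text|> or <|caption_quality_9|>.
TOKEN_RE = re.compile(r"<\|[^|>]*\|>")
# Matches a possibly-incomplete token at the very end of a chunk,
# e.g. "<" or "<|caption_qua", which must be held back.
PARTIAL_RE = re.compile(r"<(\|[^>]*)?$")


class ControlTokenFilter:
    """Strips <|...|> control tokens from a stream of text deltas,
    including tokens split across chunk boundaries."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, delta: str) -> str:
        """Feed one text delta; return the cleaned text safe to forward to TTS."""
        self._buf += delta
        cleaned = TOKEN_RE.sub("", self._buf)
        partial = PARTIAL_RE.search(cleaned)
        if partial:
            # Hold back the trailing fragment until the token completes.
            self._buf = partial.group(0)
            return cleaned[: partial.start()]
        self._buf = ""
        return cleaned

    def flush(self) -> str:
        """Release any buffered text at end of response."""
        out = TOKEN_RE.sub("", self._buf)
        self._buf = ""
        return out
```

Wiring it in is one call per delta: pass each response.text.delta payload through `feed()` before handing it to the TTS engine, and call `flush()` when the response completes.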