Gemini 2.5 Flash Native Audio skipping letters during "Spelling Out" tasks

Ashwin_P · February 27, 2026, 9:17am

Issue: I am experiencing an inconsistency between the model’s text output (transcripts) and its audio output when using gemini-live-2.5-flash-native-audio. My agent is instructed to spell out names letter-by-letter (e.g., “T-H-O-M-P-S-O-N”).

While the LiveKit transcripts show the name spelled perfectly, the audio output frequently skips letters or “swallows” parts of the spelling during playback. It seems like the TTS/Audio generation layer is dropping tokens that the LLM layer successfully generated.

Prompt Used:

“- For names: When the user provides a name in any form (first name only, last name only, full name, mentioned in conversation, or spelled letter by letter, or spelled using phonetic or word associations such as ‘R for Robert’ or ‘B as in Boy’), the assistant must immediately spell the name out letter by letter (e.g., T-H-O-M-P-S-O-N), treating each letter as an individual spoken unit with a brief pause between letters, and ask the user to confirm it. The assistant MUST NOT ask the user to manually spell or repeat the name. If the user later corrects the name in any form by providing a different name or a different spelling or different letters, the assistant must immediately discard the previous value and must only spell and confirm the corrected name.”

Observations:

Transcript: If the user says “Thompson,” the transcript shows “T-H-O-M-P-S-O-N.”
Audio behavior: The voice might say “T-H-M-P-S-N,” skipping both “O”s entirely or rushing through them so fast that they aren’t audible.
Setup: Using livekit-plugins-google with gemini-live-2.5-flash-native-audio.

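from livekit.plugins import google

# Realtime model configuration used by the agent (Vertex AI, us-central1)
google.realtime.RealtimeModel(
    model="gemini-live-2.5-flash-native-audio",
    voice="Puck",
    temperature=0.3,
    language="en-US",
    location="us-central1",
    vertexai=True,
)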
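
To make the mismatch in the observations above measurable, here is a minimal sketch that diffs the transcript spelling against the letters actually heard in the audio. It assumes the "heard" string comes from listening to a recording or from running a separate speech-to-text pass over it. The helper names `letters` and `dropped_letters` are hypothetical and are not part of LiveKit or the Gemini API.

```python
from difflib import SequenceMatcher


def letters(spelling: str) -> list[str]:
    """Split a spelled-out name such as 'T-H-O-M-P-S-O-N' into uppercase letters."""
    return [ch.upper() for ch in spelling if ch.isalpha()]


def dropped_letters(transcript: str, heard: str) -> list[tuple[int, str]]:
    """Return (position, letter) pairs that are in the transcript but not in the audio.

    `heard` is an assumption: a manual or ASR-based transcription of the recording.
    Letters that were replaced by a different letter are reported as well.
    """
    expected, actual = letters(transcript), letters(heard)
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend((i, expected[i]) for i in range(i1, i2))
    return missing


print(dropped_letters("T-H-O-M-P-S-O-N", "T-H-M-P-S-N"))
# → [(2, 'O'), (6, 'O')]
```

Logging this per call would show whether the drops cluster on particular letters (vowels, repeated letters) or on particular positions in the name.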

darryncampbell · February 27, 2026, 10:19am

Others may well have a better idea, but it sounds like you have exhausted what prompt changes can do to make the audio correct. For complete control, you might consider using a separate, more capable TTS (see the “Gemini Live API plugin” page in the LiveKit documentation), though that would be a large change.
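
If the speech is produced by a separate TTS, the spelling no longer has to come from the model at all: it can be built deterministically in code before the text is synthesized. The sketch below is only an illustration of that idea. The function name `spell_for_tts` is hypothetical, and the assumption that a comma between letters yields a short pause depends on the TTS engine and should be checked by ear.

```python
def spell_for_tts(name: str, separator: str = ", ") -> str:
    """Turn a name into a letter-by-letter string for a TTS engine.

    Each word is spelled separately; words are joined with '; ' so the
    boundary between first and last name stays audible (an assumption
    about how the TTS treats punctuation).
    """
    spelled_words = []
    for word in name.split():
        word_letters = [ch.upper() for ch in word if ch.isalpha()]
        if word_letters:
            spelled_words.append(separator.join(word_letters))
    return "; ".join(spelled_words)


print(spell_for_tts("Thompson"))
# → T, H, O, M, P, S, O, N
```

The agent would then say something like "I have that as " followed by the returned string, and ask the user to confirm, which keeps the behavior required by the prompt while taking letter generation out of the audio model's hands.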
