Gemini 2.5 Flash Native Audio skipping letters during "Spelling Out" tasks

Issue: I am experiencing an inconsistency between the model’s text output (transcripts) and its audio output when using gemini-live-2.5-flash-native-audio. My agent is instructed to spell out names letter-by-letter (e.g., “T-H-O-M-P-S-O-N”).

While the LiveKit transcripts show the name spelled perfectly, the audio output frequently skips letters or “swallows” parts of the spelling during playback. It seems like the TTS/Audio generation layer is dropping tokens that the LLM layer successfully generated.

Prompt Used:

“- For names: When the user provides a name in any form (first name only, last name only, full name, mentioned in conversation, or spelled letter by letter, or spelled using phonetic or word associations such as ‘R for Robert’ or ‘B as in Boy’), the assistant must immediately spell the name out letter by letter (e.g., T-H-O-M-P-S-O-N), treating each letter as an individual spoken unit with a brief pause between letters, and ask the user to confirm it. The assistant MUST NOT ask the user to manually spell or repeat the name. If the user later corrects the name in any form by providing a different name or a different spelling or different letters, the assistant must immediately discard the previous value and must only spell and confirm the corrected name.”

Observations:

  • Transcript: If the user says “Thompson,” the transcript shows “T-H-O-M-P-S-O-N.”

  • Audio behavior: The voice might say “T-H-M-P-S-N,” skipping both “O”s entirely, or rushing through them so fast that they aren’t audible.

  • Setup: Using livekit-plugins-google with gemini-live-2.5-flash-native-audio.

from livekit.plugins import google

# Realtime model configuration used for the agent
google.realtime.RealtimeModel(
    model="gemini-live-2.5-flash-native-audio",
    voice="Puck",
    temperature=0.3,
    language="en-US",
    location="us-central1",
    vertexai=True,
)
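To put numbers on the mismatch rather than judging by ear, something like the sketch below might help. It assumes you can get a second string for what was actually spoken, for example by running the recorded agent audio through a speech-to-text step of your choice; that step is not shown, and the function name `dropped_letters` is just illustrative. It uses only Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def dropped_letters(expected: str, heard: str) -> list[tuple[int, str]]:
    """Return (index, letter) pairs from `expected` that are missing or
    replaced in `heard`. Hyphens, spaces and case are ignored."""
    exp = [c for c in expected.upper() if c.isalpha()]
    got = [c for c in heard.upper() if c.isalpha()]
    matcher = SequenceMatcher(None, exp, got, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete": letters absent from the audio;
        # "replace": letters that came out as something else.
        if tag in ("delete", "replace"):
            missing.extend((i, exp[i]) for i in range(i1, i2))
    return missing


# `heard` is a hypothetical re-transcription of the agent's audio.
print(dropped_letters("T-H-O-M-P-S-O-N", "T-H-M-P-S-N"))
```

Logging this per turn would show whether the drops cluster on particular letters (vowels, repeated letters) or positions, which is useful detail for a bug report.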

Others may well have a better idea. It sounds like you have exhausted what prompt changes can do to fix the audio. For complete control, you might consider using a separate, more capable TTS (see “Gemini Live API plugin” in the LiveKit documentation), but that would be a large change.
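If you do end up with a separate TTS, one thing that may be worth trying is rewriting the spelled-out name before it reaches the synthesizer, so each letter becomes its own comma-separated unit. This is only a sketch: the helper name `space_out_spelling` is made up, how you hook it into the text going to your TTS depends on your pipeline, and whether commas actually produce a usable pause depends on the TTS engine.

```python
import re

# Matches runs of single letters joined by hyphens, e.g. "T-H-O-M-P-S-O-N".
# Ordinary hyphenated words such as "well-known" do not match.
SPELLED = re.compile(r"\b[A-Za-z](?:-[A-Za-z])+\b")


def space_out_spelling(text: str, separator: str = ", ") -> str:
    """Rewrite hyphen-spelled names so each letter is a separate unit."""

    def expand(match: re.Match) -> str:
        letters = match.group(0).upper().split("-")
        return separator.join(letters)

    return SPELLED.sub(expand, text)


print(space_out_spelling("That is T-H-O-M-P-S-O-N, correct?"))
```

The separator is a parameter so you can experiment with what your TTS paces best, such as `". "` or an SSML break tag if the engine supports SSML.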