Issue: I am experiencing an inconsistency between the model’s text output (transcripts) and its audio output when using gemini-live-2.5-flash-native-audio. My agent is instructed to spell out names letter-by-letter (e.g., “T-H-O-M-P-S-O-N”).
While the LiveKit transcripts show the name spelled perfectly, the audio output frequently skips letters or “swallows” parts of the spelling during playback. It seems like the TTS/Audio generation layer is dropping tokens that the LLM layer successfully generated.
Prompt Used:
“- For names: When the user provides a name in any form (first name only, last name only, full name, mentioned in conversation, or spelled letter by letter, or spelled using phonetic or word associations such as ‘R for Robert’ or ‘B as in Boy’), the assistant must immediately spell the name out letter by letter (e.g., T-H-O-M-P-S-O-N), treating each letter as an individual spoken unit with a brief pause between letters, and ask the user to confirm it. The assistant MUST NOT ask the user to manually spell or repeat the name. If the user later corrects the name in any form by providing a different name or a different spelling or different letters, the assistant must immediately discard the previous value and must only spell and confirm the corrected name.”
Observations:
- Transcript: If the user says “Thompson,” the transcript shows “T-H-O-M-P-S-O-N.”
- Audio behavior: The voice might say “T-H-M-P-S-N,” skipping both “O”s entirely, or rushing through them so fast that they aren’t audible.
- Setup: Using livekit-plugins-google with gemini-live-2.5-flash-native-audio:
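from livekit.plugins import google

# Realtime model configuration used by the agent (Vertex AI backend)
google.realtime.RealtimeModel(
    model="gemini-live-2.5-flash-native-audio",
    voice="Puck",
    temperature=0.3,
    language="en-US",
    location="us-central1",
    vertexai=True,
)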
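To make the mismatch concrete, here is a minimal sketch of how the dropped letters could be listed by comparing the transcript spelling with what is actually heard. This is not part of the agent: the function name `dropped_letters` is hypothetical, and the "heard" string is assumed to come from listening to a recording by hand (or from a separate speech-to-text pass over the recorded audio). It only handles omissions, not substituted letters.

```python
def dropped_letters(transcript: str, heard: str) -> list[tuple[int, str]]:
    """Return (position, letter) for each transcript letter missing from the audio.

    Both inputs are spellings such as "T-H-O-M-P-S-O-N"; separators are ignored.
    Assumes the audio only omits letters and never substitutes or reorders them.
    """
    expected = [c for c in transcript.upper() if c.isalpha()]
    spoken = [c for c in heard.upper() if c.isalpha()]

    missing = []
    j = 0  # index of the next unmatched letter in the heard spelling
    for i, letter in enumerate(expected):
        if j < len(spoken) and spoken[j] == letter:
            j += 1  # letter was spoken, advance in the heard spelling
        else:
            missing.append((i, letter))  # letter is in the transcript but not heard
    return missing


# Example from the observations above
print(dropped_letters("T-H-O-M-P-S-O-N", "T-H-M-P-S-N"))  # [(2, 'O'), (6, 'O')]
```

Running this over several recorded turns would show whether the same positions (for example vowels between consonants) are dropped consistently or at random.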