Inconsistent agent state events for gemini realtime model

Hi @CWilson, @Muhammad_Usman_Bashir, @LiveKit-Community, @livekitteams
I am building a realtime agent using AgentSession(llm=RealtimeModel(...)) with the gemini-live-2.5-flash-native-audio model. I am unable to calculate the agent’s initial response latency accurately (after the user finishes an utterance, how long does the agent take to respond?).

The Problem: In native-audio mode, the server-side turn detection triggers the assistant to reply before the user-side final transcript, committed user item, or local listening state events are surfaced or processed locally. Because of this architectural race condition, standard turn-tracking formulas return negative or incorrect numbers.
We are seeing:

  1. user_state speaking,

  2. agent_state speaking

  3. later user_state listening

  4. later user conversation_item_added

  5. Empty User Metrics: For role='user', conversation_item_added always arrives with metrics={} (no stopped_speaking_at).

  6. Inverted Timestamps: The assistant’s started_speaking_at is regularly earlier than local user turn-end markers like:

    • user_state_changed(old="speaking", new="listening").created_at

    • user_input_transcribed (final)

    • conversation_item_added for the user.

Log Evidence Example: Looking at the epoch timestamps, the assistant starts speaking roughly 405ms before the user item is officially created/surfaced locally:

  • Assistant starts speaking: 1779758929.2258492 (GR_e6878fa98277)

  • User item created locally: 1779758929.630676 (GI_0afd446fa831)

# The assistant response fires first
conversation_item_added (assistant) -> metrics={'started_speaking_at': 1779758929.2258492, ...}

# The user item arrives ~400ms LATER with empty metrics
conversation_item_added (user) -> id='GI_0afd446fa831' content=['Yes.'] metrics={} created_at=1779758929.630676
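
To make the failure concrete, here is a minimal sketch of the naive turn-latency formula applied to the two timestamps from the log above. The function name naive_response_latency is hypothetical and not part of LiveKit; it only illustrates the arithmetic.

```python
def naive_response_latency(agent_started_speaking_at: float,
                           user_item_created_at: float) -> float:
    """Agent speech start minus the local user turn-end marker, in seconds."""
    return agent_started_speaking_at - user_item_created_at


# Timestamps copied from the log above.
agent_started = 1779758929.2258492      # assistant started_speaking_at
user_item_created = 1779758929.630676   # user conversation_item_added created_at

latency = naive_response_latency(agent_started, user_item_created)
# The result is negative (about -0.405 s) because the local marker lags the agent.
print(f"naive latency: {latency:.3f} s")
```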

Questions:

  1. Is it expected behavior for conversation_item_added.metrics to be completely empty for the user role when using Gemini native audio?

  2. Given that the server-side VAD/turn-detection kicks off the assistant before local state updates finalize, what is the recommended way in LiveKit to extract the true user_turn_end_at timestamp?

  3. How are others reliably calculating first response latency and turn latency in this specific streaming configuration?

@Harshita_Sukumar_Patil, The empty user metrics are architectural, not a bug. The Gemini Realtime plugin doesn’t populate user-side started_speaking_at / stopped_speaking_at because Gemini Live runs server-side VAD on Google’s side and the LK plugin has no local VAD to drive those events. Audio forwards directly to Gemini via LiveClientRealtimeInput, turn detection happens on the server, and input transcription surfaces text only with no timing metadata [ livekit/agents/livekit-plugins/livekit-plugins-google/livekit/plugins/google/realtime/realtime_api.py ]. So conversation_item_added.metrics={} for the user role is consistent with the design.

The inverted timestamps follow from the same thing: Google’s server detects turn-end and starts the assistant response before LK finishes processing the audio buffer locally and emitting user_state_changed listening / user_input_transcribed. The ~400ms gap is exactly that local processing delay.

You can’t extract a true user_turn_end_at from the LK side in server-VAD mode (the decision lives on Google’s server). Three workable paths:

  1. Run a parallel Silero VAD purely for measurement (not for turn control) and use its end-of-speech timestamp as user_turn_end_at.
  2. Switch to manual activity detection via start_user_activity() / end_user_activity() with manual_activity_detection=True [same file]. You control turn boundaries locally and can measure from your own activity_end.
  3. Measure end-to-end perceived latency directly: (first agent audio frame) - (last user audio above silence threshold) from waveform analysis. Bypasses the server-vs-local race and is what users actually perceive.
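
For options (1) and (2), the bookkeeping can be sketched as a small tracker that pairs each locally measured end-of-speech with the next agent speech start. This is an illustration under stated assumptions: TurnLatencyTracker and its method names are hypothetical, and wiring them to a Silero VAD end-of-speech event or to your own end-of-activity call is left to your application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnLatencyTracker:
    """Pairs a locally measured user end-of-speech with the next agent speech start."""

    _user_end_at: Optional[float] = None
    latencies: List[float] = field(default_factory=list)

    def on_user_speech_end(self, ts: float) -> None:
        # Assumption: called with the end-of-speech time from a measurement-only
        # VAD (option 1) or when you end user activity yourself (option 2).
        self._user_end_at = ts

    def on_agent_speech_start(self, ts: float) -> Optional[float]:
        # Assumption: called with the assistant's started_speaking_at.
        if self._user_end_at is None:
            return None  # no pending user turn, e.g. an agent greeting
        latency = ts - self._user_end_at
        self._user_end_at = None  # consume the turn so it is not counted twice
        self.latencies.append(latency)
        return latency


tracker = TurnLatencyTracker()
tracker.on_user_speech_end(100.00)
first_latency = tracker.on_agent_speech_start(100.35)
```

With a measurement-only VAD, feed the tracker the speech-end time the VAD reports rather than the time the event arrives, since the event may be delivered after the agent has already started.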

For production reporting on Gemini Live native-audio, (3) is the most defensible.
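
A minimal sketch of option (3) for a single turn, assuming you have recorded the user and agent tracks as 16-bit PCM sample sequences on a shared timeline. The function names, the 20ms frame size, and the RMS threshold of 500 are illustrative assumptions to tune for your audio, not LiveKit APIs.

```python
import math
from typing import List, Optional, Sequence


def frame_rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def active_frame_starts(pcm: Sequence[int], frame_len: int, threshold: float) -> List[int]:
    """Sample indices where a full frame's RMS exceeds the silence threshold."""
    return [
        start
        for start in range(0, len(pcm) - frame_len + 1, frame_len)
        if frame_rms(pcm[start:start + frame_len]) > threshold
    ]


def perceived_latency(user_pcm: Sequence[int], agent_pcm: Sequence[int],
                      sample_rate: int, frame_ms: int = 20,
                      threshold: float = 500.0) -> Optional[float]:
    """(first agent frame above threshold) - (end of last user frame above threshold)."""
    frame_len = sample_rate * frame_ms // 1000
    user_active = active_frame_starts(user_pcm, frame_len, threshold)
    agent_active = active_frame_starts(agent_pcm, frame_len, threshold)
    if not user_active or not agent_active:
        return None  # one of the tracks never rises above the threshold
    user_end = (user_active[-1] + frame_len) / sample_rate
    agent_start = agent_active[0] / sample_rate
    return agent_start - user_end


# Synthetic check: user speaks for 0.5 s, agent starts at 1.0 s.
rate = 16000
user_track = [1000] * 8000 + [0] * 24000
agent_track = [0] * 16000 + [1000] * 16000
demo_latency = perceived_latency(user_track, agent_track, rate)
```

For multi-turn recordings you would first cut the tracks into per-turn clips; the sketch handles only one user utterance followed by one agent reply.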