Hi @CWilson, @Muhammad_Usman_Bashir, @LiveKit-Community, @livekitteams,
I am building a realtime agent using AgentSession(llm=RealtimeModel(...)) with the gemini-live-2.5-flash-native-audio model. I am unable to calculate an accurate initial response latency for the agent (after the user completes their first utterance, how long does the agent take to respond?).
The Problem: In native-audio mode, the server-side turn detection triggers the assistant to reply before the user-side final transcript, committed user item, or local listening state events are surfaced or processed locally. Because of this architectural race condition, standard turn-tracking formulas return negative or incorrect numbers.
We are seeing the events in this order:
- user_state speaking
- agent_state speaking
- later, user_state listening
- later, user conversation_item_added
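To illustrate why this ordering breaks the usual formula, here is a minimal sketch of the bookkeeping involved. `TurnLatencyTracker` and its method names are hypothetical and not part of the LiveKit API; in a real agent the two callbacks would be driven by whatever local events carry the timestamps. It treats an inverted ordering as "unmeasurable" instead of reporting a negative latency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnLatencyTracker:
    """Hypothetical helper (not a LiveKit class) for one user/agent turn pair."""

    user_turn_end_at: Optional[float] = None
    agent_started_at: Optional[float] = None

    def on_user_stopped(self, ts: float) -> None:
        # Local marker for the end of the user's turn (epoch seconds).
        self.user_turn_end_at = ts

    def on_agent_started(self, ts: float) -> None:
        # Moment the agent starts speaking (epoch seconds).
        self.agent_started_at = ts

    def latency(self) -> Optional[float]:
        # Return seconds between user turn end and agent speech start,
        # or None if a timestamp is missing or the ordering is inverted.
        if self.user_turn_end_at is None or self.agent_started_at is None:
            return None
        delta = self.agent_started_at - self.user_turn_end_at
        return delta if delta >= 0 else None
```

Fed in the order observed here (agent start first, local user turn end afterwards), `latency()` returns `None`, because the only user-side marker available locally is later than the agent's start.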
- Empty User Metrics: For `role='user'`, `conversation_item_added` always arrives with `metrics={}` (no `stopped_speaking_at`).
- Inverted Timestamps: The assistant’s `started_speaking_at` is regularly timestamped earlier than local user turn-end markers such as:
  - `user_state_changed(old="speaking", new="listening").created_at`
  - `user_input_transcribed` (final)
  - `conversation_item_added` for the user.
- Log Evidence Example: Looking at the epoch timestamps, the assistant starts speaking about 405 ms before the user item is created and surfaced locally:
  - Assistant starts speaking: `1779758929.2258492` (`GR_e6878fa98277`)
  - User item created locally: `1779758929.630676` (`GI_0afd446fa831`)
```
# The assistant response fires first
conversation_item_added (assistant) -> metrics={'started_speaking_at': 1779758929.2258492, ...}
# The user item arrives ~400ms LATER with empty metrics
conversation_item_added (user) -> id='GI_0afd446fa831' content=['Yes.'] metrics={} created_at=1779758929.630676
```
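To make the inversion concrete, this short sketch applies the naive formula (agent start minus user turn end) to the two timestamps from the log above. The function name `naive_latency` is only illustrative, and using the user item's `created_at` as the turn-end marker is an assumption forced by the empty user metrics.

```python
def naive_latency(agent_started_speaking_at: float, user_turn_end_at: float) -> float:
    """Naive response latency in seconds: agent start minus user turn end."""
    return agent_started_speaking_at - user_turn_end_at


# Timestamps copied from the log excerpt (epoch seconds).
assistant_started = 1779758929.2258492   # assistant started_speaking_at
user_item_created = 1779758929.630676    # user item created_at (stand-in for turn end)

latency = naive_latency(assistant_started, user_item_created)
print(f"{latency * 1000:.1f} ms")  # negative: the agent "responds" before the user item exists
```

The result is roughly -405 ms, which is the negative latency described in this post.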
Questions:
- Is it expected behavior for `conversation_item_added.metrics` to be completely empty for the `user` role when using Gemini native audio?
- Given that the server-side VAD/turn detection kicks off the assistant before local state updates finalize, what is the recommended way in LiveKit to extract the true `user_turn_end_at` timestamp?
- How are others reliably calculating first response latency and turn latency in this specific streaming configuration?