Agent SDK states

In the Python Agents SDK, is the agent state "listening" just the same as the user state "talking", in a 1-on-1 context?
UserState = Literal["speaking", "listening", "away"]
AgentState = Literal["initializing", "idle", "listening", "thinking", "speaking"]

Not exactly.

agent.listening means the agent is actively listening for user input, while user.speaking means the user is currently talking. In a 1:1 conversation those often happen at the same time, but they represent different perspectives in the state model, so they’re not the same state.

user.listening is closer to “the user is present and not currently speaking.”

If you’re trying to observe how these actually transition at runtime, I built a small LiveKit debugging playground that helps visualize connection state, transcript updates, and agent/user state changes.


In the Python Agents SDK, UserState and AgentState are two independent state machines. Even in a 1:1 conversation, Agent listening is not the same thing as User speaking (aka “talking”): they refer to different entities.

Where the states are defined

They’re defined as Literal[...] types in voice/events.py:

UserState = Literal["speaking", "listening", "away"]
AgentState = Literal["initializing", "idle", "listening", "thinking", "speaking"]

And when these states change, the SDK emits:

  • user_state_changed (with UserStateChangedEvent)
  • agent_state_changed (with AgentStateChangedEvent)

(in the same file)
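On a real AgentSession you would subscribe with something like session.on("user_state_changed", handler). To make the emit/subscribe pattern concrete without pulling in the SDK, here is a minimal self-contained sketch that mirrors it; MiniSession is a stand-in of my own, not the real AgentSession, but the event names and event payloads match the ones above:

```python
from dataclasses import dataclass
from typing import Callable, Literal

UserState = Literal["speaking", "listening", "away"]
AgentState = Literal["initializing", "idle", "listening", "thinking", "speaking"]

@dataclass
class UserStateChangedEvent:
    old_state: UserState
    new_state: UserState

@dataclass
class AgentStateChangedEvent:
    old_state: AgentState
    new_state: AgentState

class MiniSession:
    """Stand-in for AgentSession's event surface (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable]] = {}
        self._user_state: UserState = "listening"
        self._agent_state: AgentState = "initializing"

    def on(self, event: str, handler: Callable) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event: str, ev) -> None:
        for handler in self._handlers.get(event, []):
            handler(ev)

    def _update_user_state(self, state: UserState) -> None:
        old, self._user_state = self._user_state, state
        self.emit("user_state_changed", UserStateChangedEvent(old, state))

    def _update_agent_state(self, state: AgentState) -> None:
        old, self._agent_state = self._agent_state, state
        self.emit("agent_state_changed", AgentStateChangedEvent(old, state))
```

Note that the two update paths never touch each other’s state, which is the key point: two independent machines, two independent event streams.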

What “User speaking” means (and where it’s set)

User speaking means “we detected the user’s speech” (e.g., via VAD start-of-speech), and User listening means “user is not speaking right now.”

Those transitions happen here:

def on_start_of_speech(self, ev: vad.VADEvent | None) -> None:
    ...
    self._session._update_user_state("speaking", last_speaking_time=speech_start_time)

def on_end_of_speech(self, ev: vad.VADEvent | None) -> None:
    ...
    self._session._update_user_state("listening", last_speaking_time=speech_end_time)

So: User speaking is tightly tied to audio/VAD signals.
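Stripped of SDK plumbing, that VAD-driven transition is just a two-state toggle. A self-contained sketch (FakeVADEvent and UserStateTracker are my own stand-ins, not SDK classes):

```python
from dataclasses import dataclass

@dataclass
class FakeVADEvent:
    """Stand-in for vad.VADEvent; only the timestamp matters here."""
    timestamp: float

class UserStateTracker:
    """Toggles user state on VAD start/end-of-speech, like the SDK does."""

    def __init__(self) -> None:
        self.state = "listening"          # user starts out silent
        self.last_speaking_time = 0.0

    def on_start_of_speech(self, ev: FakeVADEvent) -> None:
        self.state = "speaking"
        self.last_speaking_time = ev.timestamp

    def on_end_of_speech(self, ev: FakeVADEvent) -> None:
        self.state = "listening"
        self.last_speaking_time = ev.timestamp
```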

What “Agent listening” means (and where it’s set)

Agent listening means “the agent is in a non-speaking mode where it’s waiting/ready to receive user input.” It can also be used when the agent pauses output due to interruption handling.

One explicit example: when user activity interrupts agent output and the output supports pausing, the agent pauses and sets the agent state to listening:

if use_pause and self._session.output.audio and self._session.output.audio.can_pause:
    self._session.output.audio.pause()
    self._session._update_agent_state("listening")

And AgentSession is what actually records the state + emits the change event:

old_state = self._agent_state
self._agent_state = state
self.emit(
    "agent_state_changed",
    AgentStateChangedEvent(old_state=old_state, new_state=state),
)

Same pattern for user state:

old_state = self._user_state
self._user_state = state
self.emit("user_state_changed", UserStateChangedEvent(old_state=old_state, new_state=state))

Why they’re not “the same” in 1:1

In 1:1, you often observe a complementary pattern:

  • When User is speaking, the Agent is often listening
  • When Agent is speaking, the User is often listening

…but that’s an interaction pattern, not a shared state. They are separate because:

  • UserState is about what the user is doing (speaking vs silent/away)
  • AgentState is about what the agent pipeline is doing (thinking, speaking, idle, etc.)

You can absolutely have combinations like:

  • User speaking + Agent thinking (agent already started reasoning while user is still talking, depending on settings)
  • User listening + Agent listening (both silent; used for “silence detection” logic in some features)
  • User away + Agent idle
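The independence of the two machines is easiest to see if you track the pair. A small illustrative classifier (the function and its labels are mine, not from the SDK):

```python
from typing import Literal

UserState = Literal["speaking", "listening", "away"]
AgentState = Literal["initializing", "idle", "listening", "thinking", "speaking"]

def describe_turn(user: UserState, agent: AgentState) -> str:
    """Classify a (user, agent) state pair; purely illustrative."""
    if user == "speaking" and agent == "listening":
        return "user turn"
    if user == "listening" and agent == "speaking":
        return "agent turn"
    if user == "listening" and agent == "listening":
        return "both silent"   # the case silence-detection logic cares about
    if user == "away" and agent == "idle":
        return "inactive"
    return "other"             # e.g. user speaking + agent thinking
```

Because the pairs don’t collapse into one value, you can’t recover “the user is talking” from the agent state alone, or vice versa.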

Diagram (high-level 1:1 turn flow)

stateDiagram-v2
  direction LR

  state "UserState" as U {
    [*] --> listening
    listening --> speaking: VAD start_of_speech<br>_update_user_state("speaking")
    speaking --> listening: VAD end_of_speech<br>_update_user_state("listening")
    listening --> away: away timer (no activity)
    away --> listening: final transcript while away<br>(or activity)
  }

  state "AgentState" as A {
    [*] --> initializing
    initializing --> idle: session ready
    idle --> listening: waiting for input
    listening --> thinking: user turn committed<br>(generate reply)
    thinking --> speaking: TTS/audio output starts
    speaking --> listening: output finished
    speaking --> listening: user interrupts & output pauses<br>_update_agent_state("listening")
  }

  U.speaking --> A.listening: typical pattern<br>(user talks, agent waits)
  A.speaking --> U.listening: typical pattern<br>(agent talks, user silent)

One more practical note: “lk.agent.state” is only the agent state

If you’re looking at client-visible state via participant attributes, lk.agent.state is agent state only (it is set when agent_state_changed fires):

await self._room.local_participant.set_attributes(
    {ATTRIBUTE_AGENT_STATE: ev.new_state}
)

So you shouldn’t interpret lk.agent.state="listening" as “the user is talking”; it only means “the agent is in listening mode”.
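Since the attribute only ever carries agent state, a client that wants the user side needs its own signal (e.g. local VAD or transcripts). A tiny helper for reading the attribute defensively; the "lk.agent.state" key is from the SDK, the function itself is mine:

```python
from typing import Optional

AGENT_STATES = {"initializing", "idle", "listening", "thinking", "speaking"}

def read_agent_state(attrs: dict) -> Optional[str]:
    """Return the agent state from participant attributes, or None.

    Deliberately returns None for unknown/absent values instead of
    guessing, and never infers anything about the *user* from it.
    """
    state = attrs.get("lk.agent.state")
    return state if state in AGENT_STATES else None
```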