We have two agents in the same LiveKit room using livekit-agents~=1.5. Agent A speaks first, then sends a text payload to Agent B via send_text(topic="finding"). Agent B reacts via session.say() + wait_for_playout(), then sends send_text("done", topic="turn-ack"). Agent A awaits the ack before proceeding to the next item.
This works well in testing (zero overlap between agents). We chose this over VAD-based gating because TTS has natural inter-sentence pauses that triggered false “end of speech” signals.
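For reference, the handshake described above can be sketched roughly as follows. This is a pure-asyncio simulation, not LiveKit code: TopicBus is a hypothetical in-memory stand-in for the room's text channel, speak() stands in for session.say() followed by wait_for_playout(), and the 5-second ack timeout is an assumption added so Agent A does not wait forever if Agent B never answers.

```python
import asyncio


class TopicBus:
    """Hypothetical in-memory stand-in for the room's text channel."""

    def __init__(self):
        self._handlers = {}
        self._tasks = set()  # keep references so delivery tasks are not garbage-collected

    def on(self, topic, handler):
        self._handlers[topic] = handler

    async def send_text(self, text, *, topic):
        # Deliver in a separate task, roughly like a network hop would.
        task = asyncio.create_task(self._handlers[topic](text))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)


async def run_demo(findings, ack_timeout=5.0):
    bus = TopicBus()
    log = []  # (speaker, "start" | "end", text) events, in order
    ack = asyncio.Event()

    async def speak(who, text):
        # Stand-in for session.say(...) followed by wait_for_playout().
        log.append((who, "start", text))
        await asyncio.sleep(0.01)
        log.append((who, "end", text))

    async def on_finding(text):
        # Agent B: react, finish playout, then acknowledge the turn.
        await speak("B", f"Noted: {text}")
        await bus.send_text("done", topic="turn-ack")

    async def on_ack(_text):
        ack.set()

    bus.on("finding", on_finding)
    bus.on("turn-ack", on_ack)

    # Agent A: speak, hand the turn to B, and block until B acknowledges.
    for item in findings:
        await speak("A", item)
        ack.clear()  # clear before sending so a fast ack is never lost
        await bus.send_text(item, topic="finding")
        await asyncio.wait_for(ack.wait(), timeout=ack_timeout)
    return log


if __name__ == "__main__":
    for event in asyncio.run(run_demo(["first finding", "second finding"])):
        print(event)
```

Because Agent A only continues after the ack, every "start" event in the log is immediately followed by the matching "end" from the same speaker, which is the zero-overlap property we observed in testing.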
Questions:
Is there a native SDK mechanism for explicit agent-to-agent turn coordination we should be using instead? (e.g., something in the handoff/task system, or a built-in signaling pattern?)
Any concerns with using send_text() as a low-frequency signaling channel? (Our rate is ~1 message per 15-20s.)
For scaling to 3+ agents in the same room, would you recommend a different pattern?
There isn’t a lower-level “turn lock” primitive in the SDK specifically for agent-to-agent coordination. The native pattern for structured multi-agent control is agent sessions with handoffs, tasks, and task groups, where one controlling agent transfers execution explicitly rather than relying on media timing. See Workflows and the linked Agents & handoffs section.
Using send_text() as a low-frequency signaling channel (1 message per 15–20s) is fully reasonable. It rides on the same reliable data mechanisms as other text streams and is appropriate for explicit coordination.
For 3+ agents, consider a single controlling agent (or task group) orchestrating handoffs instead of peer-to-peer acks, which keeps flow centralized and testable.
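To illustrate the centralized idea in plain Python, here is a minimal sketch, assuming each agent can be modeled as an async callable that returns once its speech has finished playing. The Orchestrator class, make_agent helper, and turn_timeout value are hypothetical names for this example and are not part of the LiveKit SDK; in a real deployment the SDK's handoffs and task groups would take the place of this loop.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Orchestrator:
    """Hypothetical single controller: runs one agent turn at a time."""

    agents: dict  # name -> async callable(text) that returns when playout is done
    turn_timeout: float = 5.0
    transcript: list = field(default_factory=list)

    async def run(self, plan):
        for name, text in plan:
            try:
                # Only one turn is in flight; the next starts after this one ends.
                await asyncio.wait_for(self.agents[name](text), self.turn_timeout)
                self.transcript.append((name, text, "ok"))
            except asyncio.TimeoutError:
                # A stuck agent is skipped instead of blocking the whole room.
                self.transcript.append((name, text, "timeout"))


def make_agent(speech_seconds):
    async def speak(text):
        # Stand-in for speaking the text and waiting for playout to finish.
        await asyncio.sleep(speech_seconds)

    return speak


def demo():
    orch = Orchestrator(
        agents={
            "a": make_agent(0.01),
            "b": make_agent(0.01),
            "c": make_agent(1.0),  # simulates an agent that hangs
        },
        turn_timeout=0.1,
    )
    plan = [("a", "intro"), ("b", "detail"), ("c", "summary"), ("a", "wrap-up")]
    asyncio.run(orch.run(plan))
    return orch.transcript


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

With one controller owning the turn order, adding a fourth agent means adding one entry to the plan rather than another pair of ack topics, and the transcript gives a single place to assert ordering in tests.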
Are these agents meant to act as distinct “personas” in one conversation, or as cooperative background workers?
Thanks for the tips and doc references; this is really helpful. I think we should now switch to this approach.
Currently we have one persona with multiple agents under the hood, same voice. STT+VAD is needed since a real human can join.
Two follow-up questions: for a future use case we want 2–4 independent personas in one room. I think we can drive interruptions programmatically via @function_tool rather than STT and VAD. Do you think that’s a good idea or is there a better approach?
I am not sure. It really comes down to your use case and what you are trying to achieve: how those “personas” interact, whether only one is active at a time, and so on.