We need to talk about pre-tool-call speech

Whether or not the agent speaks prior to making a tool call is something that I’d like to be configurable on the LiveKit end. It matters a lot for my application.

Right now I’ve experienced a few bugs due to this not being configurable:

  • the llm_node output both speech and a tool call, but the tool call ran before the speech. For example, the agent performed a web-search tool call, the results were sent to my frontend, and only then did the agent say “let me look that up for you”.

  • the agent started to say “let me look that up for you”, but the tool call execution interrupted the agent’s speech, so we only heard the first word or so.

  • the agent doesn’t speak any pre-tool-call speech at all. The tool call adds latency, so there’s an awkward pause while the user waits for the tool call to finish.

As of now I don’t prompt the agent to speak before calling the tool. I used to, but it was very unreliable, and it often looked like there was a race condition between tool execution and speech streaming.

Simply instructing the agent not to output pre-tool-call speech and then slapping a self.session.say() in front of the tool is NOT a solution. Latency matters, and I can’t really afford a TTS round trip. It’s also a nasty workaround given that, ya know, we were just at the llm_node, since that’s what called the tool. I also don’t want the agent saying the same thing every single time. And the agent might have already output pre-tool-call speech, in which case it would be double-speaking.

Similarly, slapping generate_reply() at the top of every function_tool is not a solution, as that is a full LLM and TTS round trip. I can’t afford that latency, and again we run into the possibility of the agent having already spoken pre-tool-call speech.

This really irks me because right now the tool call often somehow happens before the speech output. Not only does the agent already have the results when it says “let me look that up for you”, but my transcribed chat history is also out of order: it shows the tool call first and the “let me look that up for you” after it.

I’ve tried the workarounds. This is my 2nd or 3rd post on this subject. I can’t be shrugged off again and told “just do session.say()”. This is causing major problems for my application.

@Isaac_Huntsman, the gap is real. Verified in livekit-agents/livekit/agents/voice/agent_activity.py on main: no user-facing flag gates tool execution on TTS playout. Internal coordination uses _background_speeches: set[SpeechHandle] (comment: “speeches that audio playout finished but not done because of tool calls”) + await speech_handle.wait_if_not_interrupted(all_tasks). That mechanism can race when the LLM stream interleaves tool calls and text tokens, which matches all three of your failure modes.
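To make the ordering problem concrete, here is a toy asyncio model of a turn where speech playout and tool execution start together with nothing ordering them. It is only an illustration of the symptom, not LiveKit's actual code: `speak`, `run_tool`, and `ungated_turn` are hypothetical names, and the sleeps stand in for TTS playout and tool latency.

```python
import asyncio


async def speak(text: str, log: list[str], seconds_per_word: float = 0.01) -> None:
    # Stand-in for TTS playout: each word takes a little time to "play".
    for _ in text.split():
        await asyncio.sleep(seconds_per_word)
    # The speech is logged once playout has finished.
    log.append(f"speech: {text}")


async def run_tool(name: str, log: list[str], latency: float = 0.005) -> None:
    # Stand-in for a fast tool, e.g. a web search that returns quickly.
    await asyncio.sleep(latency)
    log.append(f"tool: {name}")


async def ungated_turn() -> list[str]:
    log: list[str] = []
    # Speech and tool run concurrently; whichever finishes first logs first.
    await asyncio.gather(
        speak("let me look that up for you", log),
        run_tool("web_search", log),
    )
    return log


if __name__ == "__main__":
    # The tool entry lands before the speech entry.
    print(asyncio.run(ungated_turn()))
```

Because the tool is faster than the seven-word utterance, the log ends up with the tool entry ahead of the speech entry, which is the same inversion Isaac sees in his transcript.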

Two paths:

  • File a focused issue on livekit/agents with a minimal repro (LLM stream sequence + observed tool-vs-speech ordering). I searched today’s open issues; nothing tracks this specific symptom. Without a repro the team can’t prioritize something like a requires_speech_first flag (my name for it, no such flag exists today).

  • Interim direction: inside the function_tool body, await the current speech handle’s playout before doing any work. That gates the tool on the pre-tool TTS actually playing, with no separate session.say() round trip. How to access the current speech handle from RunContext is where the LK team needs to confirm the public API, since _background_speeches is private.
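A minimal sketch of the interim direction, written against plain asyncio so it runs on its own. Everything here is an assumption for illustration: `FakeSpeechHandle` and its `wait_until_played` method are hypothetical stand-ins, not the LiveKit classes, and the real way to reach the handle from RunContext still needs confirming.

```python
import asyncio


class FakeSpeechHandle:
    """Hypothetical stand-in for a speech handle; not the LiveKit class."""

    def __init__(self) -> None:
        self._played = asyncio.Event()

    def mark_played(self) -> None:
        self._played.set()

    async def wait_until_played(self) -> None:
        await self._played.wait()


async def play(handle: FakeSpeechHandle, text: str, log: list[str]) -> None:
    # Stand-in for TTS playout of the pre-tool speech.
    for _ in text.split():
        await asyncio.sleep(0.01)
    log.append(f"speech: {text}")
    handle.mark_played()


async def web_search_tool(handle: FakeSpeechHandle, query: str, log: list[str]) -> None:
    # Gate: do no work until the pre-tool speech has finished playing.
    await handle.wait_until_played()
    await asyncio.sleep(0.005)  # stand-in for the actual search
    log.append(f"tool: web_search({query!r})")


async def gated_turn() -> list[str]:
    log: list[str] = []
    handle = FakeSpeechHandle()
    # Both tasks start together, but the tool waits on the handle.
    await asyncio.gather(
        play(handle, "let me look that up for you", log),
        web_search_tool(handle, "weather", log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(gated_turn()))
```

Two caveats with this shape. The tool now pays for the full playout time before it starts, so if latency is the priority it may be better to start the work immediately and only hold back the results until playout completes. And if the LLM emitted no pre-tool speech at all, the handle has to count as already played, otherwise the tool would wait forever.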

@CWilson @davidzhao worth weighing in. The gap is real on main and Isaac has posted about this multiple times.

I’m going to collect some logs and open an issue shortly, thank you

It’s a real problem, and one the team are actively addressing with async tool calling (https://docs.livekit.io/agents/logic/tools/async/). The feature is documented, and although I know the team are still making improvements to it, please try it out.

Edit: The above answer was updated as I originally thought the feature was unreleased.