Is it possible to prewarm the MultilingualModel() that gets assigned to the turn_detection parameter:
turn_detection=MultilingualModel()
With the Silero VAD, I’m able to prewarm per the documentation, and that works well. Without prewarming the turn_detection model, though, the first response I give to an agent takes an extra few seconds to be recognized. After that, the turn detection model is much quicker, so I’m assuming it’s a cold start issue. If I comment out this line, I don’t get the first STT lag. I looked through the documentation and code examples but didn’t see this done anywhere, so I wasn’t sure if it was even possible. Appreciate any help or advice you have!
Thanks!
-Dan
It’s not possible to prewarm the turn detection model. The model weights can be downloaded ahead of time, which happens automatically during the build process, but the model is loaded and initialized internally by AgentSession the first time it’s used.
“an extra few seconds” does seem like a long time. Are your agents running on LiveKit Cloud?
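For anyone who wants to picture why only the first turn is slow, here is a minimal sketch of load-on-first-use behavior. It is not LiveKit's code: LazyTurnDetector, its load_seconds delay, and the punctuation rule are hypothetical stand-ins, and the sleep only simulates reading weights and initializing a model.

```python
import time


class LazyTurnDetector:
    """Hypothetical stand-in for a model that loads on first use (not LiveKit's class)."""

    def __init__(self, load_seconds: float = 0.2) -> None:
        self._load_seconds = load_seconds
        self._loaded = False
        self.load_count = 0  # how many times the simulated load actually ran

    def _ensure_loaded(self) -> None:
        # Only the first caller pays the load cost; later calls skip it.
        if not self._loaded:
            time.sleep(self._load_seconds)  # simulates loading and initializing weights
            self._loaded = True
            self.load_count += 1

    def predict_end_of_turn(self, transcript: str) -> bool:
        self._ensure_loaded()
        # Toy rule for illustration only: terminal punctuation means end of turn.
        return transcript.rstrip().endswith((".", "?", "!"))


def timed_call(detector: LazyTurnDetector, transcript: str) -> tuple[bool, float]:
    """Return the prediction and how long the call took in seconds."""
    start = time.perf_counter()
    result = detector.predict_end_of_turn(transcript)
    return result, time.perf_counter() - start


detector = LazyTurnDetector()
_, first = timed_call(detector, "Hello there.")
_, second = timed_call(detector, "How are you?")
print(f"first call: {first:.3f}s, second call: {second:.3f}s")
```

The first call includes the simulated load and every later call skips it, which matches the pattern of one slow first turn followed by fast ones.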
Thanks, Darryn! Appreciate the insight.
I do have this running on LiveKit Cloud, which is where this is occurring. I thought this was maybe part of the free plan “cold start” but the TTS was immediate and it was only the STT that was slow. I decided to try self-hosting just the agent part (with WebRTC still in LiveKit Cloud) and noticed it had the same STT cold-start hang. I tried commenting out just the turn_detection part, and the agent has been surprisingly fast since (relying only on the prewarmed VAD for now).
I started using Deepgram for their STT and TTS and noticed one of their STT models has turn-detection built in (and LiveKit has the ability to point the turn detection at the STT for direction). Is this the better option now than trying to roll my own turn detection model? Maybe this is a moot point to begin with.
Thanks again for the help!
I thought this was maybe part of the free plan “cold start” but the TTS was immediate and it was only the STT that was slow.
Yes, cold start applies to the agent initializing (see Deployment management | LiveKit Documentation), so if you are seeing immediate TTS, this isn’t an agent cold start. Also, if this happens on two consecutive calls, it is not an agent cold start.
I started using Deepgram for their STT and TTS and noticed one of their STT models has turn-detection built in (and LiveKit has the ability to point the turn detection at the STT for direction). Is this the better option
Technically, our docs recommend the MultilingualModel() turn detector (see Turns overview | LiveKit Documentation), but in practice, I would say to try both and use whichever one works best for your use case.
Although I know this isn’t what you are asking (this is more for anyone else who finds this thread), for realtime models I would specifically recommend using that model’s built-in turn detection (see Turns overview | LiveKit Documentation).