Long-running Voice Sessions incur Significant Costs for WebSocket-based STT Models

For streaming STT models like Deepgram, Assembly Universal Pro, and ElevenLabs Scribe v2 Realtime, the current setup works well for dense, back-and-forth voice interactions. The challenge is with long-running voice sessions that include extended silence, where keeping STT active can become quite expensive, especially with ElevenLabs.

Does the framework support automatically disabling STT after a configurable period of no audio activity, then reconnecting once new VAD events are detected? An optimization flag like this could help reduce costs significantly, with the trade-off of some reconnection latency.
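A rough sketch of what such an idle-timeout policy could look like. Everything here is hypothetical: the class, method names, and the flag itself are illustrative only, and no framework is confirmed to expose this behavior.

```python
import time


class IdleSttManager:
    """Hypothetical policy object: decide when to tear down an idle STT
    connection and when to reconnect on new VAD activity."""

    def __init__(self, idle_timeout_s: float = 30.0):
        self.idle_timeout_s = idle_timeout_s  # configurable period of no audio activity
        self.connected = True                 # assume STT starts connected
        self._last_activity = time.monotonic()

    def on_vad_event(self) -> bool:
        """Call on each VAD speech event. Returns True if STT should
        reconnect now (accepting some reconnection latency)."""
        self._last_activity = time.monotonic()
        if not self.connected:
            self.connected = True
            return True
        return False

    def tick(self) -> bool:
        """Call periodically. Returns True if STT should be torn down
        because the idle timeout has elapsed."""
        idle_for = time.monotonic() - self._last_activity
        if self.connected and idle_for >= self.idle_timeout_s:
            self.connected = False
            return True
        return False
```

The caller would close the provider WebSocket when `tick()` returns True and re-establish it when `on_vad_event()` returns True; the trade-off is that the first utterance after silence pays the reconnect latency.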

I think usage is only incurred when you actually send audio, so you could block or allow the stream based on VAD detection.
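To illustrate the gating idea in this suggestion: a minimal, framework-agnostic sketch that forwards audio frames to the STT stream only while VAD reports speech (plus a short hangover window). The `VadGate` class and its methods are invented for illustration, not a real framework API.

```python
class VadGate:
    """Forward audio frames only during speech, plus a short hangover
    window after speech ends, so word endings are not clipped."""

    def __init__(self, hangover_frames: int = 10):
        self.hangover_frames = hangover_frames
        self._remaining = 0      # hangover frames left to forward
        self.forwarded = []      # frames that would go to the STT websocket

    def push(self, frame: bytes, is_speech: bool) -> bool:
        """Returns True if the frame was forwarded, False if dropped."""
        if is_speech:
            self._remaining = self.hangover_frames  # refresh hangover
            self.forwarded.append(frame)
            return True
        if self._remaining > 0:
            self._remaining -= 1                    # spend hangover budget
            self.forwarded.append(frame)
            return True
        return False  # silence outside the hangover window: drop
```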

No. Once the WebSocket connections to the STT provider are established, which happens during session.start(), there is no easy way to tear them down and resume the session later. Muting the stream or not sending audio won't affect your STT usage.

STT will only run when there is a user in the room and the session has been started, so some customers delay starting the session until the user's front-end signals that they are 'ready'. That only helps before the session starts, though, not after.
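The delayed-start pattern above can be sketched with a plain asyncio event; the session object and the way the 'ready' signal arrives are stand-ins for whatever your framework and front-end actually provide.

```python
import asyncio


async def start_when_ready(session, ready_event: asyncio.Event) -> None:
    """Hold off on session.start() until the front-end signals readiness.

    Because the STT websockets are only opened inside session.start(),
    no STT usage accrues while the user sits on a join/lobby screen.
    """
    await ready_event.wait()
    await session.start()
```

The `ready_event` would typically be set by a data-channel or RPC message from the client; once `session.start()` has run, this pattern no longer helps.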
