SIP participants transcripts

I wasn't able to find a reasonable answer to this, so I'm posting here. What is the best way to produce transcripts for SIP participants in a LiveKit room? I'm aware of the agent approach, but it will be CPU intensive, since I intend to use VAD for quality and there may be any number of SIP participants. For web participants I'm already trying out running VAD on the client side and then piping the transcript through an API or WebSocket. Is there a similar way for SIP participants?

Feel free to ask questions if I didn't articulate the problem clearly.

To get transcripts for SIP participants in a LiveKit room without burning too much CPU, try running VAD on the server side. Use a media server like Janus Gateway or FreeSWITCH to handle the audio streams, detect speech with a VAD library, and generate transcripts using a speech-to-text service. Send the transcripts to your app via an API or WebSocket. This way, you can handle multiple SIP participants efficiently.
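As a rough illustration of the "detect speech, then transcribe only the speech" step, here is a minimal sketch of a VAD gate. It uses a plain RMS energy threshold, which is a deliberately simple stand-in and not WebRTC VAD or any LiveKit API. The 8 kHz sample rate, the threshold, the hangover length, and the names `frame_rms` and `speech_segments` are all assumptions made for the example.

```python
import math
from array import array

SAMPLE_RATE = 8000   # assumed narrowband telephony audio
FRAME_MS = 20        # assumed frame size
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000


def frame_rms(frame: bytes) -> float:
    """Root-mean-square level of one frame of 16-bit mono PCM (native byte order)."""
    samples = array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_segments(frames, threshold=500.0, hangover=3):
    """Yield (first_frame, last_frame) index pairs for runs of speech.

    A run ends only after more than `hangover` consecutive quiet frames,
    so short pauses inside a sentence do not split the segment.
    """
    start = None        # index of the first frame of the current run
    last_speech = None  # index of the most recent loud frame
    silent = 0          # consecutive quiet frames seen inside a run
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            last_speech = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent > hangover:
                yield (start, last_speech)
                start = None
                silent = 0
    if start is not None:
        yield (start, last_speech)
```

Only the frames inside each yielded segment would need to be forwarded to the speech-to-text service, which is where the savings come from. A production setup would more likely use a trained VAD model than an energy threshold.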

What do you mean by “handle it on the server side”? AFAIK, using an agent to subscribe to audio tracks and then running VAD is also basically a server-side solution. Are you suggesting the same thing or something else?

Yes, I think that is what I am trying to convey.

Also, try on-device solutions that run VAD directly on the client device, using lightweight libraries such as WebRTC VAD, or Vosk for on-device speech recognition.
Alternatively, you may try sending the generated transcripts to your application via an API or WebSocket.
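Whichever component produces the transcripts, the application needs an agreed message shape to receive them over an API or WebSocket. The sketch below shows one possible JSON envelope. The `TranscriptEvent` fields and the `encode_event` / `decode_event` helpers are hypothetical and not part of any LiveKit schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TranscriptEvent:
    """Hypothetical transcript message; field names are illustrative only."""
    room: str
    participant_identity: str
    text: str
    is_final: bool      # False for interim results, True once the segment is settled
    timestamp: float    # seconds since the epoch, set by the producer


def encode_event(event: TranscriptEvent) -> str:
    """Serialize an event to a JSON string suitable for an HTTP body or a WebSocket text frame."""
    return json.dumps(asdict(event))


def decode_event(payload: str) -> TranscriptEvent:
    """Parse a JSON string produced by encode_event back into an event."""
    return TranscriptEvent(**json.loads(payload))
```

Carrying the participant identity and an interim/final flag in every message lets the receiver attribute text to the right SIP caller and replace interim text once the final version arrives.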

I am not aware of a better approach than using an agent, as you have already explored. To save resources, you only need to enable the STT part of the pipeline; see “Text and transcriptions” in the LiveKit documentation.

We do have an example of a multi-user transcriber: agents/examples/other/transcription/multi-user-transcriber.py on the main branch of the livekit/agents repository on GitHub.
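The general pattern of running one transcription task per participant can be sketched with plain asyncio. This does not reproduce the linked example and uses no LiveKit APIs: `fake_stt` is a hypothetical stand-in for a real streaming STT call, and `TranscriberRegistry` and its method names are invented for illustration.

```python
import asyncio


async def fake_stt(participant_id, audio_chunks):
    """Hypothetical stand-in for a streaming speech-to-text call."""
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    return f"{participant_id}: {len(audio_chunks)} chunks"


class TranscriberRegistry:
    """Keeps at most one transcription task per participant."""

    def __init__(self, stt):
        self._stt = stt
        self._tasks = {}

    def on_participant_joined(self, participant_id, audio_chunks):
        # Ignore duplicate join events so each participant is transcribed once.
        if participant_id in self._tasks:
            return
        self._tasks[participant_id] = asyncio.create_task(
            self._stt(participant_id, audio_chunks)
        )

    def on_participant_left(self, participant_id):
        # Stop spending resources on a participant who has hung up.
        task = self._tasks.pop(participant_id, None)
        if task is not None and not task.done():
            task.cancel()

    async def results(self):
        """Wait for all remaining tasks and map participant id to transcript."""
        ids = list(self._tasks)
        done = await asyncio.gather(*(self._tasks[i] for i in ids))
        return dict(zip(ids, done))


async def demo():
    registry = TranscriberRegistry(fake_stt)
    registry.on_participant_joined("sip_1", [b"a", b"b"])
    registry.on_participant_joined("sip_2", [b"c"])
    registry.on_participant_joined("sip_1", [b"duplicate"])  # ignored
    registry.on_participant_joined("sip_3", [b"d"])
    registry.on_participant_left("sip_3")                     # cancelled
    return await registry.results()
```

Calling `asyncio.run(demo())` returns transcripts for `sip_1` and `sip_2` only. In a real agent, the join and leave hooks would be driven by room events, and each task would read from that participant's audio track.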