We are building an agent using the Realtime model, and we’ve noticed that its built‑in ASR transcription quality isn’t accurate enough for our use case. We want to use transcripts generated by an external STT/LLM (Azure Whisper).
Our question is:
Is it possible to feed external STT results (e.g., Whisper transcripts) into the same Realtime Model session, so that the agent uses those transcripts for reasoning instead of its own?
Or is the only practical solution to run a separate STT agent in the same room, send Whisper transcripts from there, and have our main Realtime agent consume them as external messages?
Any guidance on recommended architecture would be greatly appreciated.
Is it possible to feed external STT results (e.g., Whisper transcripts) into the same Realtime Model session, so that the agent uses those transcripts for reasoning instead of its own?
Essentially replacing the realtime model’s STT? No, that’s not possible. It feels like it would be better to use the plugin model at that point, since any benefits you get from using a realtime model would be lost with your workaround.
That code is what you use if you want the LiveKit turn detector with a Realtime model, which requires a separate STT (see LiveKit turn detector plugin | LiveKit Documentation), but that’s not quite your use case.
The realtime model natively supports server-side OpenAI Whisper, so I’m a little unclear on why you would want to use an external Azure Whisper? Is its performance significantly better than OpenAI Whisper’s?
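For reference, enabling that built-in server-side transcription looks roughly like this in the LiveKit OpenAI plugin; a minimal sketch assuming the v1 `RealtimeModel` API, using the same `InputAudioTranscription` type referenced later in this thread:

```python
from livekit.plugins import openai
from openai.types.beta.realtime.session import InputAudioTranscription

# Server-side transcription runs alongside the realtime model itself;
# "whisper-1" is OpenAI's hosted Whisper transcription model.
model = openai.realtime.RealtimeModel(
    input_audio_transcription=InputAudioTranscription(model="whisper-1"),
)
```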
We are also interested in an external STT with the Realtime model for transcripts, given that more recent STT engines exist with higher transcription performance than Whisper, but we would ideally have the STT as an independent layer which does not affect the model itself. I think that is possible with LiveKit if we disable the realtime model’s server-side transcription, although one concern we have with that idea is that this blog, Developer notes on the Realtime API, suggests they rely on the server-side transcription for cost efficiency:
The GA service will automatically drop some audio tokens when a transcript is available to save tokens.
So I think we would need both server-side realtime Whisper transcription (for the model input/cost efficiency) and an external LiveKit-run STT… but I think LiveKit is not really built with the idea of two simultaneous STT engines like that, so I am currently thinking it just might not be possible in LiveKit as-is, as @darryncampbell wrote.
We find that the realtime model feels very natural to talk to since it is faster and feels more like talking to a human. We would only rely on the transcripts for certain parts, like retrieving phone numbers in Norwegian. Since retrieving the phone number is only a small part of the conversation, we are looking at alternative ways to retrieve this information, and have found better accuracy in the Azure Whisper model.
Yes, it does indeed support server-side Whisper, but we find the transcripts to be quite off, and there is often a difference between what the transcripts say and what the AI appears to hear. We are specifically struggling since we are using the realtime model in Norwegian.
It works for general sentiment, but when retrieving information like phone numbers it struggles a lot. We were hoping that we could run the STT in parallel and retrieve the phone number directly from the transcripts, as this would yield greater accuracy.
I agree that having an additional STT could be a solution, although if it is not supported by LiveKit it seems it would require some workarounds. I would assume it would be over-engineering to have two agents within the same room: one realtime model which handles all of the functionality and conversational flow, while the STT only remains in the background writing the transcript.
Do you have any recommendations on how we can increase our accuracy in retrieving specific information like telephone numbers in a different language (this must be done by voice, not from SIP information)?
I am curious why Azure Whisper would be much better at Norwegian than OpenAI Whisper though? I thought it would be the same underlying model weights.
Maybe simpler than having two agents would be to record the call and post-process the audio via a separate STT model that you think is well optimised for the language. This also has the benefit that the model can be very high latency (it need not be optimised for real-time usage).
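As a rough illustration of that post-processing idea, batch transcription of a recording against an Azure OpenAI Whisper deployment could look something like this; the endpoint, API version, and deployment name are placeholders for whatever your Azure resource uses:

```python
from openai import AzureOpenAI

# Placeholder credentials/endpoint; the Whisper deployment name goes in `model`.
client = AzureOpenAI(
    api_key="<AZURE_OPENAI_KEY>",
    api_version="2024-06-01",
    azure_endpoint="https://example.openai.azure.com",
)

with open("call_recording.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper",   # your deployment name
        file=audio_file,
        language="no",     # hint Norwegian to the model
    )
print(transcript.text)
```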
If it is just a small part of the audio you are interested in, another idea would be to ask the gpt-realtime model itself to transcribe the phone numbers via a tool call. Note that this “transcription” would come directly from gpt-realtime, not via Whisper, as tool calls are done by gpt-realtime.
I agree; from my understanding Azure has fine-tuned their models with more Norwegian data. This is an assumption though, as I have not researched it fully, but I have performed multiple tests to check which models yield the best results, where Azure’s Whisper model performs significantly better than the other models (specifically for the phone number retrieval process).
That is a great idea, although we rely on having the user confirm the phone number as we are doing an API lookup on this data. But it is a good thought.
Yes, that was our original approach: to have a tool where the realtime model passes in the value we want transcribed, kind of like:
```python
from livekit.agents import function_tool

instructions = "Transcribe the phone number exactly as the user says it. Add it as the parameter in transcribe_phone_number()."

...

@function_tool()
async def transcribe_phone_number(self, phone_number: str) -> str:
    # the realtime model fills in phone_number with the digits it heard
    return phone_number
```
This was our original idea, which led to poor results and made us look at alternative ways of retrieving phone_number.
Sidenote here: We also noticed a strong bias for the realtime model to repeatedly assume its previous suggestion to be correct, for instance:
User: “my phone number is 12345“
AI: “did you say 12346“
User: “No I said 12345“
AI: “Sorry did you say 12346“
And this behaviour is something we experienced in more extreme cases as well.
Hi! I’m the lead on the project Marius is discussing here. I thought I could provide a little further context on our STT decisions.
As Marius mentions, the realtime model has a strong bias towards its previous number interpretations, resulting in users getting caught in the loop Marius described above. Getting the realtime model to use a tool where it inputs the number does help, but the issue is still persistent.
So we try to use the transcriptions directly, as they do not have the same bias and can often be more accurate: when the agent calls the tool, we grab the transcribed user input from the chat context. However, these transcriptions are not that accurate either.
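For context, grabbing the latest user transcript from the chat context inside a tool looks roughly like this; `RunContext`, `session.history`, and `text_content` are from the LiveKit Agents v1 API as I understand it, so treat the exact attribute names as assumptions:

```python
from livekit.agents import RunContext, function_tool

@function_tool()
async def lookup_phone_number(self, context: RunContext) -> str:
    # Walk the chat history backwards to find the most recent user turn.
    # Attribute names (history.items, role, text_content) may differ by version.
    for item in reversed(context.session.history.items):
        if getattr(item, "role", None) == "user":
            return item.text_content or ""
    return ""
```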
We did an analysis comparing different STTs to see if we could increase the success rate for number interpretation. We transcribed call recordings using different models and found that with Azure Speech Service we get a much higher success rate (nearly double compared to the default transcription when using the realtime model).
I believe Microsoft trains these models themselves, which includes a dataset with Norwegian transcriptions. This could be the reason it’s better at transcribing Norwegian.
@darryncampbell I think there’s been some misunderstanding on the thread you replied to at the top.
The goal is not to use the STT for the reasoning of the realtime model directly. Rather, we would like the conversation to be handled by the realtime model, and the transcriptions to be handled by the transcription model from Azure Speech Service: two separate parallel tasks. We then make tools that the realtime model can use to fetch those transcriptions when it needs more accuracy.
So to me it sounds like the method mentioned here: Realtime models overview | LiveKit Documentation could be highly relevant for us. However, when I set up a separate STT plugin, I can see that the STT node is never triggered. I then checked the conversation items themselves, and from the metadata I could tell that the STT was not used (I don’t quite remember exactly what I looked at since it was a while ago, but I can try again and explain more accurately what we see).
Right now my best solution is to create a whole new agent specifically for STT with Azure Speech Service and dispatch it into the room. So we have one realtime agent and one STT agent in the room (two agents total). This works and solved our problem, though it’s a very hacky solution, and I feel there must be a better way to approach this.
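For the record, that background transcriber agent is essentially a job that subscribes to audio tracks and pumps frames into an STT stream. A minimal sketch, assuming the LiveKit Agents v1 STT stream interface and that `azure.STT()` picks up credentials from environment variables:

```python
import asyncio

from livekit import agents, rtc
from livekit.agents import stt
from livekit.plugins import azure

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    # Assumes AZURE_SPEECH_KEY / AZURE_SPEECH_REGION are set in the environment.
    azure_stt = azure.STT()

    async def transcribe(track: rtc.Track) -> None:
        stream = azure_stt.stream()

        async def push_frames() -> None:
            # Feed raw audio frames from the room into the STT stream.
            async for frame_event in rtc.AudioStream(track):
                stream.push_frame(frame_event.frame)

        asyncio.create_task(push_frames())
        async for event in stream:
            if event.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                print("transcript:", event.alternatives[0].text)

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(track: rtc.Track, *_args) -> None:
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(transcribe(track))
```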
So I’m wondering if it’s at all possible to handle these two tasks in one session:
- Conversation handled by the realtime model
- Transcription handled by Azure Speech Service through the plugin
So to me it sounds like the method mentioned here: Realtime models overview | LiveKit Documentation could be highly relevant for us. However when I set up a separate STT plugin, I can see that the STT node is never triggered.
I tested quickly and think this will meet your needs… I don’t have an Azure STT key handy to test with, but since it also supports streaming transcription I imagine this would also work.
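A minimal sketch of what pairing a realtime model with a separate STT plugin in one `AgentSession` might look like; this assumes the LiveKit Agents v1 API, and the `azure.STT()` usage and exact kwargs are my assumptions rather than the code from the original test:

```python
from livekit.agents import AgentSession
from livekit.plugins import azure, openai

# The intent: the STT plugin takes over user transcription while the
# realtime model's own transcription is disabled.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(
        input_audio_transcription=None,  # this kwarg is what trips Azure, see below
    ),
    stt=azure.STT(),  # placeholder; credentials via environment variables
)
```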
Tried this, and it looks like the STT plugin is still not being used, and that the transcriptions are still being done by the realtime model.
I log both the conversation_item_added and user_input_transcribed events, and they show the same output, with the same hallucinations when they just hear noise. So the transcriptions are still being handled by the Realtime Model, not by the STT.
I checked the metrics_collected events as well, and we only see the types `vad_metrics` and `realtime_model_metrics`.
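For reference, the logging I describe is roughly this; the event and field names are from the v1 events API as I recall them, so treat them as approximate:

```python
@session.on("user_input_transcribed")
def on_user_input_transcribed(ev):
    # Expected to come from the STT plugin; in practice it mirrors the
    # realtime model's own transcription.
    print("user_input_transcribed:", ev.transcript, "final:", ev.is_final)

@session.on("conversation_item_added")
def on_conversation_item_added(ev):
    print("conversation_item_added:", ev.item.text_content)
```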
However, a key thing I notice is that you set input_audio_transcription to None. I get the following error message:
ValueError: input_audio_transcription must be an instance of InputAudioTranscription for api-version 2025-04-01-preview
As a result of the following check in realtime_model.py:

```python
if is_given(input_audio_transcription) and not isinstance(
    input_audio_transcription, InputAudioTranscription
):
    raise ValueError(
        f"input_audio_transcription must be an instance of InputAudioTranscription for api-version {api_version}"
    )
```
Were you able to run it with input_audio_transcription=None?
You’re passing input_audio_transcription=None, so the value is “given” but not an InputAudioTranscription instance → it raises:
ValueError: input_audio_transcription must be an instance of InputAudioTranscription for api-version 2025-04-01-preview.
So the message is coming from LiveKit’s livekit.plugins.openai.realtime.realtime_model when it’s configured for the Azure Realtime API with 2025-04-01-preview: that path doesn’t allow None for input_audio_transcription; it expects either “not passed” or an actual InputAudioTranscription instance (from openai.types.beta.realtime.session, the same types the plugin uses).
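In other words, the workaround on that path is to not pass the kwarg at all, so it stays NOT_GIVEN and the isinstance check is skipped. A sketch, with illustrative with_azure arguments:

```python
from livekit.plugins import openai

# Argument values are placeholders for your Azure resource.
model = openai.realtime.RealtimeModel.with_azure(
    azure_deployment="gpt-realtime",
    azure_endpoint="https://example.openai.azure.com",
    api_version="2025-04-01-preview",
    # input_audio_transcription deliberately omitted: passing None raises
    # the ValueError above, since the value then counts as "given".
)
```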
Looks like that is coded for Azure realtime LLM, but I can’t say if it’s a bug, an omission, or a deliberate choice for a specific reason.
@darryncampbell
I agree, it seems like the key difference is between using the Azure realtime vs Openai realtime models.
Digging a little deeper, it appears that the with_azure function for realtime still assumes that realtime is in preview in Azure. Perhaps a remnant from 4o-realtime-preview?
You can see here that Azure indicates that GA is available for gpt-realtime:
Whereas in realtime_model.py in the plugin we see the following in the docstring for the with_azure method:
Create a RealtimeModelBeta configured for Azure OpenAI. Azure does not currently support the GA API, so we return RealtimeModelBeta instead of RealtimeModel.
Note that with_azure only returns a RealtimeModelBeta object. Could it be that this is now outdated after the release of gpt-realtime (and now gpt-realtime 1.5)? It appears to be a remnant from the 4o-realtime days.
Oh man, Azure Whisper can be tricky with LiveKit because of how it handles streaming vs batch transcription. Are you hitting the REST API or using their WebSocket endpoint?
The core issue is that LiveKit’s STT plugin interface expects streaming results with interim/final distinctions. Azure’s REST API is batch-only: you send audio, wait, and get back the full transcript. That won’t work for real-time voice agents.
Here’s what I’d suggest: switch to Azure’s WebSocket streaming API and wrap it in a custom STT plugin. You’ll need to implement LiveKit’s STT interface and emit events as results stream in:
```python
from typing import AsyncIterator

from livekit.agents import stt
from livekit.agents.utils import AudioBuffer


class AzureWhisperSTT(stt.STT):
    async def recognize(
        self, buffer: AudioBuffer, language: str = "en-US"
    ) -> AsyncIterator[stt.SpeechEvent]:
        # Connect to Azure's streaming endpoint
        async with self.azure_client.stream_recognize() as stream:
            await stream.send(buffer.data)
            async for result in stream.receive():
                # Emit interim results as they arrive
                yield stt.SpeechEvent(
                    text=result.text,
                    is_final=result.recognition_status == "Final",
                    confidence=result.confidence,
                )
```
The key is emitting is_final=False for partial results so your LLM doesn’t start responding mid-sentence.
Honestly though? If you’re not locked into Azure for compliance reasons, I’d just use Deepgram or AssemblyAI. They have native LiveKit plugins and handle streaming way better out of the box. Deepgram’s nova-2 model is ridiculously fast and accurate, and saved me weeks of custom integration work.
Are you required to use Azure, or is it just what you started with?
The GA service will automatically drop some audio tokens when a transcript is available to save tokens.
i.e. it seems that if the realtime model does not have a transcription available in its conversation.item events, then it will not be able to convert audio tokens to text tokens using the transcript, and it will presumably have a much smaller effective context window as a result. Now we can of course use an external STT (e.g. Azure Whisper, Deepgram, etc.) to generate a transcript, but the only way to feed that into the gpt-realtime model would be to manually emit conversation.item.create events. That introduces another problem: either you disable audio input to gpt-realtime and rely on text-only messages as input for the model, which has the problem that the model loses the tone-of-voice context carried by the audio.
Or we keep audio input and add the external conversation.item.create text, but then the model is essentially getting duplicate inputs for every user utterance (one in the form of raw audio, and one in the form of text tokens), which would likely reduce model performance.
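For concreteness, manually feeding an external transcript means sending something like this over the realtime websocket yourself; the payload shape follows the OpenAI Realtime API’s conversation.item.create event, and the text is illustrative:

```python
# Payload per the OpenAI Realtime API; you would json.dumps() this and
# send it over the realtime websocket connection yourself.
event = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "My number is 12 34 56 78"}],
    },
}
```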
My team is interested in achieving roughly the same thing as OP, but as far as I can see the most robust way to do it would be to simply do the transcription externally to the agent (either via a separate agent, or post-processing, etc.).
Now I suppose it could theoretically be possible to enable both server-side transcription AND a discrete external STT to avoid the problems above, but then you get into a pickle, as LiveKit is not really built with that use case in mind currently, I think. For example, you would get two sets of transcriptions emitted to event handlers on every user input, etc.