We have been testing with Gemini Live 3.1 and see some really weird behaviour. For example, when we speak Dutch, it sometimes transcribes in a completely different language, even though the model does seem to hear what we say correctly. Is this a known limitation of the model?
Also, when we connect an STT model to the AgentSession (to get better transcription, for example), we end up with two streams of user transcription.
I am thinking about running, for example, Soniox STT or Gladia STT alongside Gemini Live 3.1, to get correct transcription while avoiding two transcript streams being recorded in the chat transcript. What would be a good approach for this?
Another point: is it recommended to use Silero VAD with realtime models like Gemini?
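For the dual-transcript question, one approach worth trying is to pass a dedicated STT plugin (and Silero VAD) to the `AgentSession` and disable the realtime model's own input transcription so only one user transcript stream is produced. A minimal sketch, assuming the LiveKit Agents Python SDK with the `google`, `soniox`, and `silero` plugins installed; the `input_audio_transcription=None` parameter is my assumption about how to turn off the model-side transcript, so check the plugin docs before relying on it:

```python
# Hedged sketch: AgentSession with a realtime LLM plus a separate STT for
# user transcripts. Plugin/parameter names are assumptions, not verified.
from livekit.agents import AgentSession
from livekit.plugins import google, soniox, silero

session = AgentSession(
    llm=google.beta.realtime.RealtimeModel(
        # Assumption: passing None here disables Gemini's own input
        # transcription, so only the dedicated STT stream remains.
        input_audio_transcription=None,
    ),
    stt=soniox.STT(),      # dedicated STT for accurate user transcripts
    vad=silero.VAD.load(), # turn detection alongside the realtime model
)
```

If the parameter name differs in your plugin version, the general idea still holds: keep exactly one source of user transcription enabled.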
It is a speech-to-speech model, bro. It doesn't operate on text. If you want the transcription for post-call analysis, I can't say for sure, but there should be a method within the plugin to do so.
Also, if you're planning to use this agent for telephony, don't. I have wasted days trying to fix an unsolvable problem. The latency is going to be somewhere around 1.5 to 2.5 whole seconds for Gemini to take in the input audio and generate output speech. I don't know exactly why this happens over telephony; as you probably know, telephony providers use 8 kHz audio, which may be part of it. All in all, it's not a good idea, because you will hit a latency bottleneck you cannot do anything about: Gemini 3.1 Flash Live takes the bulk of the processing time, and there is literally no way to speed it up.
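On the 8 kHz point: telephony audio usually has to be upsampled before it reaches the model (Gemini's Live audio input expects higher-rate PCM, commonly 16 kHz). The resampling itself is cheap and is unlikely to explain seconds of latency. A minimal sketch of naive linear-interpolation upsampling, assuming NumPy; a production pipeline would use a proper polyphase/band-limited resampler instead:

```python
import numpy as np

def resample_linear(samples: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Naive linear-interpolation resampler (illustration only;
    real pipelines should use a band-limited filter)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    src_t = np.arange(len(samples)) / src_rate   # timestamps of input samples
    dst_t = np.arange(n_out) / dst_rate          # timestamps of output samples
    return np.interp(dst_t, src_t, samples)

# A 20 ms telephony frame at 8 kHz is 160 samples; at 16 kHz it becomes 320.
frame_8k = np.zeros(160)
frame_16k = resample_linear(frame_8k, 8000, 16000)
```

The takeaway matches the post above: sample-rate conversion adds microseconds, so the 1.5 to 2.5 s figure is dominated by the model's own processing, not the audio plumbing.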
Given that you are looking to test across models and keep monitoring this across your test cases, we built Cekura.ai for exactly that: automated 1000+ test cases across 100+ metrics and scenario judges, with A/B testing across models.