Delay in agent response after initial greeting

After a call is picked up by the user, the agent speaks the initial greeting through “session.say”. After this point, most of the time the user has to speak loudly 2 to 3 times before the STT transcribes anything; once that happens, the flow proceeds smoothly. This occurs especially when Hindi is spoken; for English it works fine.
I am using the following providers -

  1. STT = Deepgram Nova-3-multilingual.
  2. LLM = gemini-2.5-flash
  3. TTS = Cartesia Sonic-3

So, is this an issue with Deepgram, the SIP provider, or LiveKit itself?

I am not sure what the root cause would be for Hindi specifically. My best suggestions are:

  1. If you have Agent Insights enabled, listen to the audio and check whether you would have understood it yourself. This can indicate whether it is an audio quality issue or an STT issue.
  2. Try different models on Deepgram or other providers to see if they perform any better for your use case.

Many people here in the community work on Hindi use cases and can probably speak more authoritatively on this subject.

@Yash_Zinzuwadiya, the “needs to speak 2-3 times loudly” part is an audio-level signal. PSTN audio from Indian carriers often lands lower than US/EU baseline, and combined with codec compression (especially if your trunk transcodes between Opus and G.711), the first speech burst can fall below Deepgram’s energy floor. Same root-cause class as the Telnyx + French thread last week. Check your trunk codec preferences (force G.711 only, whichever variant your carrier supports) and disable any AGC or audio modifications on the trunk side.
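To check whether the first utterance really arrives quieter than the later ones, one option is to measure the level of a call recording window by window. The sketch below is only an illustration: it assumes you have exported the caller-side audio as a 16-bit mono PCM WAV file, and the helper names `rms_dbfs` and `level_by_window` are hypothetical, not part of LiveKit or Deepgram. It does not know Deepgram's actual energy floor; it only shows relative levels over time.

```python
import array
import math
import sys
import wave


def rms_dbfs(samples):
    """Return the RMS level of 16-bit PCM samples in dB relative to full scale."""
    if len(samples) == 0:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10 * math.log10(mean_square / (32768.0 ** 2))


def level_by_window(path, window_s=0.5):
    """Return a list of RMS levels (dBFS), one per window of the WAV file."""
    levels = []
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM audio")
        frames_per_window = max(1, int(wf.getframerate() * window_s))
        while True:
            raw = wf.readframes(frames_per_window)
            if not raw:
                break
            samples = array.array("h")
            samples.frombytes(raw)
            if sys.byteorder == "big":
                samples.byteswap()  # WAV data is little-endian
            levels.append(rms_dbfs(samples))
    return levels
```

If the windows covering the first user utterance sit clearly lower than the later ones, that would support the audio-level explanation over a pure STT model issue.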

The “Hindi-specifically” part points at STT model fit. nova-3-multilingual is a general multilingual model; Hindi performance varies a lot compared to language-specialized options. Worth A/B-ing against Sarvam (Indian-language specialized), Soniox, or ElevenLabs Scribe v2 on the same recorded audio.

@Pawel_Lach from ai-coustics raised this same tradeoff on a recent multilingual STT thread.
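For the A/B comparison on the same recorded audio, a simple way to put numbers on it is word error rate (WER) against a transcript you wrote by hand. This is a minimal sketch using plain whitespace tokenization; the function name `word_error_rate` and the sample sentences are made up for illustration, and it does not call any provider API. You would paste in each provider's output yourself.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the ref words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: a hand-written reference and two hypothetical STT outputs.
reference = "मुझे अपना बैलेंस जानना है"
outputs = {
    "provider_a": "मुझे अपना बैलेंस जानना है",
    "provider_b": "मुझे अपना balance जाना है",
}
for name, text in outputs.items():
    print(name, word_error_rate(reference, text))
```

Note that a strict comparison like this counts "balance" versus "बैलेंस" as an error, so for mixed Hindi and English audio you may want to normalize scripts before scoring.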

Also worth checking your VAD config. Silero’s default activation_threshold (0.5) can be too high for Hindi prosody. Dropping to 0.3 with min_silence_duration around 0.3 usually catches Hindi speech onset better.
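A rough sketch of that VAD change, assuming the LiveKit Agents Python SDK with the `livekit-plugins-silero` package installed. The two values come straight from the suggestion above; `VAD_KWARGS` and `build_vad` are names invented here, and the parameter names should be checked against the plugin version you are running.

```python
# Values suggested above; Silero's default activation_threshold is 0.5.
VAD_KWARGS = {
    "activation_threshold": 0.3,   # lower = quieter speech onsets count as speech
    "min_silence_duration": 0.3,   # seconds of silence before end of speech
}


def build_vad():
    """Load the Silero VAD with the tuned settings (hypothetical helper)."""
    # Imported lazily so this file can be read without the plugin installed.
    from livekit.plugins import silero
    return silero.VAD.load(**VAD_KWARGS)
```

The result would then be passed as the `vad` of your agent session. A lower threshold can also let more background noise through as speech, so it is worth testing on real call recordings before rolling it out.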

Deepgram language and model overview for reference: Hindi Speech to Text API | Fast & Accurate Transcription | Deepgram

Yes, just like @Muhammad_Usman_Bashir mentioned. I am also thoroughly testing cases with mixed English and Hindi, and even when the audio is clearly enhanced, the STTs fail to pick it up correctly. What’s more, I have heard about this particular case from friends who are building Voice AI apps. Have you tried this STT model: Supertone/supertonic-3 · Hugging Face? I have heard it’s really good for Hindi.