Delay in agent response after initial greeting

@Yash_Zinzuwadiya, the “needs to speak 2-3 times loudly” part is an audio-level signal. PSTN audio from Indian carriers often arrives at a lower level than the US/EU baseline. Combined with codec compression (especially if your trunk transcodes between Opus and G.711), the first speech burst can fall below Deepgram’s energy floor. Same root-cause class as the Telnyx + French thread last week. Check your trunk codec preferences: force G.711 only, in whichever variant your carrier supports (A-law/PCMA or μ-law/PCMU), and disable any AGC or other audio modifications on the trunk side.
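To confirm the level theory before changing anything, you can measure a recorded call leg directly. This is a minimal sketch, assuming you can export the inbound audio as a 16-bit mono PCM WAV; the function names are mine and the file path is a placeholder. It does not know Deepgram's actual floor, it only lets you compare the Hindi calls against a call that works.

```python
import math
import sys
import wave
from array import array

FULL_SCALE = 32768.0  # magnitude of full scale for 16-bit signed PCM


def rms_dbfs(samples):
    """RMS level in dBFS; a full-scale sine reads about -3 dBFS."""
    if len(samples) == 0:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    if mean_sq == 0:
        return float("-inf")
    return 10 * math.log10(mean_sq / FULL_SCALE ** 2)


def read_wav_mono16(path):
    """Load a 16-bit mono PCM WAV, return (samples, sample_rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM WAV")
        samples = array("h")
        samples.frombytes(wf.readframes(wf.getnframes()))
        rate = wf.getframerate()
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def window_levels(samples, rate, window_s=0.1):
    """RMS dBFS per fixed window, so the first speech burst can be inspected."""
    step = max(1, int(rate * window_s))
    return [rms_dbfs(samples[i:i + step]) for i in range(0, len(samples), step)]


if __name__ == "__main__" and len(sys.argv) > 1:
    pcm, sr = read_wav_mono16(sys.argv[1])  # e.g. a recorded inbound leg
    print(f"overall: {rms_dbfs(pcm):.1f} dBFS")
    for i, lvl in enumerate(window_levels(pcm, sr)[:30]):
        print(f"{i * 0.1:4.1f}s  {lvl:6.1f} dBFS")
```

If the first utterance on the Hindi calls sits well below what you see on a working call, that supports the trunk/codec explanation over a model problem.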

The “Hindi-specifically” part points at STT model fit. nova-3-multilingual is a general multilingual model, and its Hindi accuracy can vary a lot relative to language-specialized options. Worth A/B-ing against Sarvam (Indian-language specialized), Soniox, or ElevenLabs Scribe v2 on the same recorded audio.
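For the A/B itself, word error rate against a hand-checked reference transcript is the usual yardstick. A minimal sketch in plain Python, assuming whitespace tokenization is good enough for your Hindi transcripts; the `wer` helper and the sample strings are illustrative only, not output from any of these providers.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one reference, one transcript per candidate model.
    reference = "मुझे अपना बैलेंस जानना है"
    candidates = {
        "model_a": "मुझे अपना बैलेंस जानना है",
        "model_b": "मुझे अपना बैलेंस जाना है",
    }
    for name, text in candidates.items():
        print(name, round(wer(reference, text), 3))
```

Run every model on the same recordings, ideally including a few of the quiet first utterances, so the comparison reflects the failure you are actually seeing.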

@Pawel_Lach from ai-coustics raised this same tradeoff on a recent multilingual STT thread.

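Also worth checking your VAD config. Silero’s default activation_threshold (0.5) can be too high for Hindi prosody, especially on quiet PSTN audio. Dropping it to 0.3, with min_silence_duration around 0.3, usually catches Hindi speech onset better. The tradeoff is more false triggers on line noise, so re-test on a noisy call after the change.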
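Roughly what that change looks like, assuming you are on the LiveKit Agents Silero plugin (`livekit.plugins.silero`), which is where these parameter names come from; if your stack wraps Silero differently, the names may not match. The values are starting points to tune, not verified optima, and in that plugin the durations are in seconds.

```python
# Starting-point overrides from the advice above; tune against real calls.
VAD_OVERRIDES = {
    "activation_threshold": 0.3,  # plugin default is 0.5
    "min_silence_duration": 0.3,  # seconds of silence before end of speech
}


def load_vad():
    """Build the VAD with the overrides (assumes livekit-plugins-silero is installed)."""
    from livekit.plugins import silero  # imported lazily; assumption about your stack

    return silero.VAD.load(**VAD_OVERRIDES)
```

Pass the returned object wherever your agent session currently takes its `vad` argument.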

Deepgram language and model overview for reference: Hindi Speech to Text API | Fast & Accurate Transcription | Deepgram