LLM Comparison

What is the equivalent LLM to GPT-4o-mini? I am using 4o-mini, but the latency is slow compared to Groq models. Have you tried any other alternatives with the same level of intelligence?
I am thinking of Llama 70B. Thoughts?

My company recently went through this. We went from GPT-4o-mini to GPT-5.4-mini and measured a large improvement in time to first token, with better output. I'm sure there are other suggestions from the community, but we wanted to stay with OpenAI models and this worked well for us. Also, on latency: the other huge optimization was testing and tuning the STT. We reduced our latency by 700-1000 ms with those two changes. Good luck

Hi, but 5.4-mini is almost 7x more expensive than 4o-mini, and I want to offer my services at the lowest rate possible.
When you migrated to 5.4-mini, what factors led you to accept the price jump?
Also, can you tell me more about the STT tuning? We are using Deepgram as our STT, which gives us around ~600 ms p90.
Which STT are you using?

You’re right, 5.4 mini is much higher cost. For our use case, we’re creating a voice-enabled avatar that’s not consuming a lot of tokens for a conversation back and forth. Speed is our most important metric. In fact, TTS is our highest cost right now.

Regarding STT. Below is a snippet from Python. It’s important to note, we don’t keep the mic open all the time, the user pushes a “hold to talk” button. So, we are aggressive with our end of turn thresholds.

stt = deepgram.STTv2(
    model="flux-general-en",
    eot_threshold=0.5,         # default 0.7; commit sooner (min 0.5)
    eot_timeout_ms=800,        # default 3000; cap silence wait
    eager_eot_threshold=0.5,   # emit eager end-of-turn for preemptive LLM
    keyterm=[
        "Optional: Place any key-terms here",
    ],
)

Hope that helps. :grinning_face:

If speed is your most important metric, then to my knowledge 4o-mini is currently the fastest model, at ~600 ms latency.
I ran the benchmarks and 5.4-mini came in at around ~1 s latency.
So in my view, if speed is what you want, 4o-mini is your go-to model.

For the STT: I am curious about the Flux model you are using. I use Deepgram nova-3. Did you start on nova-3 and then switch to Flux, or have you been on Flux from the beginning?
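For anyone wanting to reproduce TTFT numbers like the ones above, here is a minimal sketch of timing time-to-first-token around a streaming response. The `time_to_first_token` helper and the simulated stream are illustrative assumptions of mine, not the benchmark actually run in this thread; in practice you would wrap the provider SDK's streaming iterator (e.g. the chunk deltas from a streaming chat completion).

```python
import time
from typing import Iterable, List, Optional, Tuple

def time_to_first_token(stream: Iterable[str]) -> Tuple[Optional[float], List[str]]:
    """Seconds from starting to consume a streaming response until the
    first chunk arrives, plus all collected chunks."""
    start = time.monotonic()
    ttft: Optional[float] = None
    chunks: List[str] = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first-token latency
        chunks.append(chunk)
    return ttft, chunks

# Demo with a simulated stream; a real run would pass the model's
# streaming iterator instead (hypothetical stand-in, not a real model call).
def simulated_stream():
    time.sleep(0.05)  # stand-in for network round-trip + model TTFT
    yield "Hello"
    yield " world"

ttft, chunks = time_to_first_token(simulated_stream())
print(f"TTFT: {ttft:.3f}s, text: {''.join(chunks)!r}")
```

Averaging this over many requests per model gives a fair apples-to-apples TTFT comparison.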

Agreed that 4o-mini wins on raw latency, but speed alone isn’t the only thing we optimize for. Output quality and tool call reliability matter just as much for our use case, and 5.4-mini gave us the best tradeoff across all three. The LLM is also a small slice of our overall cost stack, so the price jump was easier to absorb. With proper prompting and tuning, we’re consistently hitting <1000ms per conversational turn end-to-end, which is where we needed to land.

On Deepgram: we went straight to flux. Came over from other STT providers, so nova-3 was never in our production path. Can’t give you a direct comparison there, sorry.