LLM Comparison

What is the equivalent LLM to GPT-4o-mini? I am using 4o-mini, but the latency is slow compared to Groq models. Have you tried any other alternatives with the same level of intelligence?
I am thinking of Llama 70B. Thoughts?

My company recently went through this. We went from GPT-4o-mini to GPT-5.4-mini and measured a large improvement in time to first token, with better output. I'm sure there are other suggestions from the community, but we wanted to stay with OpenAI models and this worked well for us. Also, on latency: the other huge optimization was testing and tuning the STT. We reduced our latency by 700-1000 ms with those two changes. Good luck

Hi, but 5.4-mini is almost 7x more expensive than 4o-mini, and I want to offer my services at the lowest rate possible.
When you migrated to 5.4-mini, what factors led you to accept the price jump?
Also, can you tell me more about the STT tuning? We are using Deepgram as our STT, which gives us around ~600 ms p90.
Which STT are you using?

You’re right, 5.4 mini is much higher cost. For our use case, we’re creating a voice-enabled avatar that’s not consuming a lot of tokens for a conversation back and forth. Speed is our most important metric. In fact, TTS is our highest cost right now.

Regarding STT. Below is a snippet from Python. It’s important to note, we don’t keep the mic open all the time, the user pushes a “hold to talk” button. So, we are aggressive with our end of turn thresholds.

stt = deepgram.STTv2(
    model="flux-general-en",
    eot_threshold=0.5,         # default 0.7; commit sooner (min 0.5)
    eot_timeout_ms=800,        # default 3000; cap silence wait
    eager_eot_threshold=0.5,   # emit eager end-of-turn for preemptive LLM
    keyterm=[
        "Optional: Place any key-terms here",
    ],
)

Hope that helps. :grinning_face:

If speed is your most important metric, then to my knowledge 4o-mini is currently the fastest model, at ~600 ms latency.
I ran the benchmarks and 5.4-mini came in at around ~1 s latency.
So in my view, if speed is what you want, 4o-mini is your go-to model.

For the STT: I am curious about the Flux model you are using. I use Deepgram nova-3. Did you start on nova-3 and then switch to Flux, or have you been on Flux from the beginning?
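For anyone wanting to reproduce TTFT numbers like the ones above, here is a minimal sketch of timing time-to-first-token around a streaming response. The `time_to_first_token` helper and the simulated stream are illustrative assumptions of mine, not the benchmark actually run in this thread; in practice you would wrap the provider SDK's streaming iterator (e.g. the chunk deltas from a streaming chat completion).

```python
import time
from typing import Iterable, List, Optional, Tuple

def time_to_first_token(stream: Iterable[str]) -> Tuple[Optional[float], List[str]]:
    """Seconds from starting to consume a streaming response until the
    first chunk arrives, plus all collected chunks."""
    start = time.monotonic()
    ttft: Optional[float] = None
    chunks: List[str] = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first-token latency
        chunks.append(chunk)
    return ttft, chunks

# Demo with a simulated stream; a real run would pass the model's
# streaming iterator instead (hypothetical stand-in, not a real model call).
def simulated_stream():
    time.sleep(0.05)  # stand-in for network round-trip + model TTFT
    yield "Hello"
    yield " world"

ttft, chunks = time_to_first_token(simulated_stream())
print(f"TTFT: {ttft:.3f}s, text: {''.join(chunks)!r}")
```

Averaging this over many requests per model gives a fair apples-to-apples TTFT comparison.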

Agreed that 4o-mini wins on raw latency, but speed alone isn’t the only thing we optimize for. Output quality and tool call reliability matter just as much for our use case, and 5.4-mini gave us the best tradeoff across all three. The LLM is also a small slice of our overall cost stack, so the price jump was easier to absorb. With proper prompting and tuning, we’re consistently hitting <1000ms per conversational turn end-to-end, which is where we needed to land.

On Deepgram: we went straight to flux. Came over from other STT providers, so nova-3 was never in our production path. Can’t give you a direct comparison there, sorry.