Observability for voice AI agents: what's actually working for you?

I’ve been running LiveKit voice agents in production across a few different projects, and each one ended up with a different observability stack. One uses Langfuse for LLM tracing, another Sentry for error tracking and performance, another Datadog module for LLM observability and APM, and the most recent one Arize AI for tracing, evaluation and experiments.

None of them feel purpose-built for voice AI. Langfuse is great for tracing LLM calls and prompt versioning, but it has no concept of audio latency or turn-taking. Sentry catches errors well and the LoggingIntegration works with the Python agents SDK, but it doesn't give you voice-specific metrics out of the box. Datadog has the broadest coverage, but the LLM-specific features feel bolted on (and it can get expensive). Arize is strong on evaluation, but the real-time monitoring side is still maturing.

What I really want is something that understands the full voice pipeline: STT latency, LLM TTFB, TTS generation time, end-to-end turn latency, interruption rates, and tool call success rates, all correlated per session. Right now I'm stitching that together manually with AgentMetrics events and custom dashboards.
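For reference, the manual stitching currently looks roughly like this. A simplified sketch: `TurnMetrics` and `SessionStats` are my own illustrative shapes, not LiveKit types, and in a real agent you'd populate them from the SDK's metrics events (e.g. a handler on the session's metrics-collected event) rather than hardcoding values:

```python
from dataclasses import dataclass, field
from typing import List

# NOTE: TurnMetrics and SessionStats are illustrative shapes of my own,
# not LiveKit types. In a real agent you'd fill them from AgentMetrics
# events emitted by the session.
@dataclass
class TurnMetrics:
    stt_latency: float = 0.0   # end of user speech -> final transcript (s)
    llm_ttfb: float = 0.0      # request sent -> first LLM token (s)
    tts_latency: float = 0.0   # first token -> first audio frame (s)

    @property
    def e2e_latency(self) -> float:
        return self.stt_latency + self.llm_ttfb + self.tts_latency

@dataclass
class SessionStats:
    turns: List[TurnMetrics] = field(default_factory=list)

    def add_turn(self, turn: TurnMetrics) -> None:
        self.turns.append(turn)

    def median_e2e(self) -> float:
        latencies = sorted(t.e2e_latency for t in self.turns)
        return latencies[len(latencies) // 2]

stats = SessionStats()
stats.add_turn(TurnMetrics(stt_latency=0.3, llm_ttfb=0.5, tts_latency=0.2))
stats.add_turn(TurnMetrics(stt_latency=0.2, llm_ttfb=0.4, tts_latency=0.2))
print(stats.median_e2e())  # 1.0
```

It's crude, but once every turn lands in one record per session, the dashboard side becomes ordinary aggregation.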

Would love to hear opinionated takes. What are you using for observability? Are you happy with it? Has anyone built a good custom dashboard on top of AgentMetrics?

Have you tried LiveKit Agent Insights?

https://docs.livekit.io/deploy/observability/insights/


Thanks @CWilson, yes! Of course I should have mentioned it in the first message :sweat_smile:

Agent Insights has been great during development and early testing. The session timeline with transcripts, traces, and audio playback is really useful for debugging individual sessions.

Where I start running into friction is when thinking about production at scale. I’d love to understand if I’m missing something or if there are workarounds, so a few specific areas:

Data retention and trend analysis. The 30-day retention window means I lose the ability to do month-over-month comparisons, capacity planning, or revisit incidents that surface later. Is there a way to export or archive Insights data before it expires, or is this something the Enterprise plan addresses?

PII and data residency. I couldn't find a way to keep sensitive data from flowing to the US, so this ends up being a limitation for EU companies.

Full-stack error tracking. It is important to have all tracking integrated in a single tool so you can correlate events. As I understand it, Insights doesn't give visibility into errors across the rest of the stack.

To be clear, I think Insights is solving a real problem and solving it well for the voice-specific layer. I'm just curious whether others have found good patterns for complementing it with external tools, or whether there's a roadmap for things like longer retention, data export APIs, or OTel export for Node.js. And if any other tool does what Insights does, it will become very attractive.

Welcome to the community @cdutr and thanks for answering in those other threads.

I’ll let others respond to your general questions, but to answer the specifics:

Data retention: we are working on a solution to address this concern - I don’t see anything public so I don’t want to say too much, but the request to export data is a common one we see.

PII and data residency: We hear this request too, and again are making efforts to close this gap which data export will partially address. We would like to make things easier for customers with data residency requirements in general - right now there are a lot of pieces to consider.


Hi @cdutr, I'm not sure if I'm in the best position to comment here, because I've just started my first LiveKit agent and got connected up to Langfuse.

After each call, I get the call data back from Langfuse, including all the turns, tool calls, transcript, latency, token counts, etc.: all the important STT, TTS, and LLM metrics.

It goes into an SQLite database and from there I can create reports and store it as long as I need.
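For anyone wanting to replicate this, the storage side can be as simple as the sketch below. The table schema is my own design, mirroring the per-turn fields I'd pull out of traces; it is not any official Langfuse export format:

```python
import sqlite3

# Illustrative schema of my own design: one row per conversational turn,
# with the fields extracted from a trace. Not a Langfuse export format.
conn = sqlite3.connect(":memory:")  # in practice, a file on disk
conn.execute("""
    CREATE TABLE IF NOT EXISTS turns (
        session_id       TEXT,
        turn_idx         INTEGER,
        transcript       TEXT,
        llm_ttfb_ms      REAL,
        total_latency_ms REAL,
        tokens_in        INTEGER,
        tokens_out       INTEGER
    )
""")

conn.execute(
    "INSERT INTO turns VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("sess-1", 0, "Hi, how can I help?", 420.0, 1180.0, 35, 12),
)

# Example report query: worst turn latency per session.
rows = conn.execute(
    "SELECT session_id, MAX(total_latency_ms) FROM turns GROUP BY session_id"
).fetchall()
print(rows)  # [('sess-1', 1180.0)]
```

Since it's plain SQLite, retention is whatever you want it to be, and reports are just SQL.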

Putting this all together was a real eye-opener for me and has already helped me cut latency massively. And there are some great insights you can gather for both you and the client through SQL queries.

I have added a few images for you to see.

Again, excuse me if I am way off base here; as mentioned, I'm just getting started with LiveKit.

Thanks for sharing @Dan_M! This is really nice analysis. I am doing exactly this: adding tracing to optimize latency, and using Langfuse, storing on their DB.

The main challenges I am having now are tracking end-of-turn, interruptions, and silence measurement. It's been pretty manual so far. Are you tracking any of those voice-specific metrics, or are you mostly focused on the LLM/tool-call side?

I'm not tracking those as defined metrics yet, but the data is already there: the speaking events (agent_speaking, user_speaking) have timestamps, so silence gaps and interruptions should be calculable from the trace data. (I must stress, these are my assumptions; I'm still finding my feet with LiveKit/Langfuse.)

But my feeling is that as long as I can get ALL the data into the database, queries will handle the rest.
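To make the timestamp idea concrete, the calculation could look something like this. The `(start, end)` interval format is an assumption on my part; the real agent_speaking/user_speaking trace events would first need to be mapped into it:

```python
# Sketch: derive interruptions and silence gaps from speaking intervals.
# The (start, end) tuple format is assumed, not what the trace emits.
def turn_taking_stats(agent_intervals, user_intervals, gap_threshold=1.0):
    # Interruption: the user starts speaking while the agent still is.
    interruptions = [
        u_start
        for u_start, _ in user_intervals
        for a_start, a_end in agent_intervals
        if a_start < u_start < a_end
    ]
    # Silence gap: a pause longer than gap_threshold between consecutive
    # speech segments (simplified: assumes segments rarely overlap).
    segments = sorted(agent_intervals + user_intervals)
    gaps = [
        (end1, start2)
        for (_, end1), (start2, _) in zip(segments, segments[1:])
        if start2 - end1 > gap_threshold
    ]
    return interruptions, gaps

agent = [(0.0, 3.0), (6.5, 9.0)]
user = [(2.5, 4.0)]  # barge-in at 2.5 s, while the agent talks until 3.0 s
print(turn_taking_stats(agent, user))  # ([2.5], [(4.0, 6.5)])
```

That would flag one interruption at 2.5 s and one 2.5-second silence between the user finishing and the agent resuming, which are exactly the two metrics discussed above.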


@Dan_M @cdutr cekura.ai can be very helpful in tracking end-to-end latency for each turn, detecting interruptions, silences, and other metrics. Feel free to try it out and lmk in case of any doubts.

PS: I am the co-founder of cekura :slight_smile:


@Shashij_Gupta, congrats on the really nice progress with Cekura. I think the product you guys are developing is in the same direction as what I had in mind in this thread.