Hey everyone, I’ve been working on a telephony voice agent (SIP calls via LiveKit, using the OpenAI Realtime API and ElevenLabs TTS) and I’ve been going back and forth on something I’d love some input on.
Basically, I want a way to monitor and steer the agent during a live call, to catch things like abusive language, the agent making promises it shouldn’t, customers asking for sensitive info, escalation signals, etc. The main agent is busy having the conversation, so it can’t really evaluate itself objectively, and I don’t want to bloat its context window.
I came across the concept of an “observer layer” from a GitHub repo that discusses this pattern for agents, and I’ve been trying to adapt it for voice/telephony. The general idea is running a lightweight parallel process alongside the main agent session that watches the conversation and can intervene when needed.
Here’s roughly what I’ve been experimenting with:
- Listening to `conversation_item_added` events on the session to get the transcript as it flows (I also tried subscribing to the raw room audio with a separate STT instance to get my own transcription, but not sure if that’s necessary or overkill)
- Running evaluations against the conversation: I have a simple keyword matcher for obvious stuff (threat words, “lawyer”, “password”, etc.) and then an LLM-based check (gpt-4o-mini) that evaluates the conversation against a guidelines doc and returns structured JSON with severity + suggested action
- For critical issues: calling `session.interrupt()` and then `session.generate_reply(instructions=...)` to force the agent to address the problem
- For warnings: injecting a system message into the agent’s `chat_ctx` via `update_chat_ctx()` so it’s aware on its next turn
- For info-level stuff: just logging
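To make the evaluation side concrete, here’s a stripped-down sketch of the keyword pass (the LLM pass plugs into the same `Finding` shape). The term lists and field names here are made up for the example; the real ones come from the guidelines doc:

```python
from dataclasses import dataclass

# Toy term lists for illustration -- the real lists live in the guidelines doc.
CRITICAL_TERMS = ("password", "ssn", "threat")
WARNING_TERMS = ("lawyer", "refund", "cancel")

@dataclass
class Finding:
    severity: str          # "info" | "warning" | "critical"
    reason: str
    suggested_action: str  # "interrupt" | "inject_context" | "log"

def keyword_scan(utterance: str) -> Finding:
    """Instant rule-based pass; runs on every transcript item before the
    slower gpt-4o-mini evaluation."""
    text = utterance.lower()
    for term in CRITICAL_TERMS:
        if term in text:
            return Finding("critical", f"matched {term!r}", "interrupt")
    for term in WARNING_TERMS:
        if term in text:
            return Finding("warning", f"matched {term!r}", "inject_context")
    return Finding("info", "no match", "log")

# Wiring (inside the agent entrypoint, roughly -- event payload shape per
# the livekit-agents docs):
#
# @session.on("conversation_item_added")
# def _on_item(ev):
#     finding = keyword_scan(ev.item.text_content or "")
#     ...hand off to the observer's dispatch...
```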
Architecture

```mermaid
flowchart TD
    Caller["📞 Caller (SIP)"] <-->|"voice"| Agent["🤖 Agent (Realtime API)"]
    Agent -->|"conversation_item_added\nevents / audio"| Observer["👁️ Observer Layer"]
    Observer --> KW["Keyword Scan\n(instant, rule-based)"]
    Observer --> LLM["LLM Eval\n(gpt-4o-mini, structured JSON)"]
    KW --> Severity{Severity?}
    LLM --> Severity
    Severity -->|"⚠️ WARNING"| Warning["Inject system message\ninto agent's chat_ctx\nvia update_chat_ctx()"]
    Severity -->|"🚨 CRITICAL"| Critical["session.interrupt()\n+ session.generate_reply()\nforce redirect agent"]
    Severity -->|"ℹ️ INFO"| Info["Log only"]
    Warning -->|"agent picks it up\non next turn"| Agent
    Critical -->|"immediate\nintervention"| Agent
```
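And here’s roughly what the severity dispatch at the bottom of the diagram looks like. This is simplified and duck-typed: in the real code `session` and `agent` are the livekit-agents `AgentSession` and `Agent`, and the instruction/message wording is just a placeholder:

```python
import logging

async def dispatch(session, agent, severity: str, detail: str) -> None:
    """Route an observer finding to the matching intervention."""
    if severity == "critical":
        # Cut the current reply off and force the agent to address it now.
        session.interrupt()
        session.generate_reply(
            instructions=f"Stop the current topic and address this now: {detail}"
        )
    elif severity == "warning":
        # Surface it on the agent's next turn without interrupting speech.
        ctx = agent.chat_ctx.copy()
        ctx.add_message(role="system", content=f"[observer] {detail}")
        await agent.update_chat_ctx(ctx)
    else:
        logging.info("observer info: %s", detail)
```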
The part I like about this is that it’s extensible: right now it’s just guardrails, but in the future, if I want to add things like quality monitoring, compliance checks, sentiment tracking, or even dynamic instruction updates based on how the conversation is going, the observer can handle all of that without touching the main agent’s logic.
But I keep wondering, am I overcomplicating this? Is there a simpler way to get real-time guardrails working with LiveKit agents that I’m missing? Or is this roughly the right direction?
Also curious whether `interrupt()` + `generate_reply()` and `update_chat_ctx()` are the right mechanisms for steering the agent mid-conversation, or if there’s something better I should be using.
Would appreciate any thoughts, even if it’s “yeah that’s way too much, just do X instead.” Thanks!