Building a Real-Time Vision Agent (Continuous Feedback)

I’m building an agent using the Gemini Vision assistant (open to other recommendations). The goal is real-time guidance, such as onboarding a user on Google Sheets or an internal platform by watching their shared screen.

Instead of the typical turn-based flow (where the agent waits for the user to speak), I need the agent to continuously process the video stream in real time and proactively speak up with feedback as the user performs actions.

Has anyone successfully implemented this kind of continuous, proactive vision processing? Would love some guidance on:

  • Best practices for streaming video frames with minimal latency.

  • How to trigger proactive agent speech based on video events rather than user audio.

Thanks in advance!

You can do this, but processing a lot of video frames can get expensive.

I would recommend putting a pre-processor in front of the video so you only process frames that have changed, or checking for changes in a target area before having the model process a full frame.

I’ve found it helpful to put something like OpenCV in front for doing vision work before sending to a VLM.
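
To make the "only process frames that have changed" idea concrete, here is a minimal, dependency-free sketch. It assumes frames arrive as 8-bit grayscale bytes in row-major order; ChangeGate, mean_abs_diff, the region tuple and the threshold value are all hypothetical names and numbers made up for illustration, not part of any library. In a real pipeline you would probably do the differencing on arrays with OpenCV (for example cv2.absdiff) rather than a Python loop.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

Region = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def mean_abs_diff(prev: bytes, cur: bytes, frame_width: int, region: Region) -> float:
    """Mean absolute per-pixel difference of two grayscale frames inside region."""
    left, top, width, height = region
    total = 0
    for row in range(top, top + height):
        start = row * frame_width + left
        a = prev[start:start + width]
        b = cur[start:start + width]
        total += sum(abs(x - y) for x, y in zip(a, b))
    return total / (width * height)


@dataclass
class ChangeGate:
    """Forwards a frame only if it differs enough from the last forwarded one."""
    frame_width: int
    frame_height: int
    threshold: float = 4.0            # hypothetical default, tune per use case
    region: Optional[Region] = None   # None means compare the whole frame
    _last: Optional[bytes] = field(default=None, init=False, repr=False)

    def should_forward(self, frame: bytes) -> bool:
        if len(frame) != self.frame_width * self.frame_height:
            raise ValueError("frame size does not match width * height")
        region = self.region or (0, 0, self.frame_width, self.frame_height)
        if self._last is None:
            self._last = frame
            return True  # always forward the first frame
        diff = mean_abs_diff(self._last, frame, self.frame_width, region)
        changed = diff >= self.threshold
        if changed:
            # Compare future frames against what the model last saw.
            self._last = frame
        return changed
```

The gate compares against the last forwarded frame rather than the previous frame, so slow drift (a long scroll, for instance) still triggers eventually. Setting region to something like the spreadsheet's formula bar gives you the "target area" check before spending tokens on a full frame.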

One other technique I have found helpful is to have the agent share its own browser screen instead of the user sharing theirs. This lets you easily capture and act on browser events, and the agent can access the DOM directly, including more than what currently fits on the screen. This is useful for cases where you don’t want to access everything as pixel data.
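
As a rough illustration of working from browser events instead of pixels, the sketch below turns a captured event into a one-line text observation that could be handed to the agent. The DomEvent shape, the event kinds and the templates are all assumptions for the example; how you actually capture events depends on the browser automation tool you use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DomEvent:
    """Hypothetical record of an event captured in the agent's own browser."""
    kind: str                    # e.g. "click", "input", "navigate"
    selector: str                # CSS selector of the target element
    value: Optional[str] = None  # typed text or destination URL, if any


# Event kinds worth telling the agent about; everything else is ignored.
TEMPLATES = {
    "click": "Click on {selector}.",
    "input": "Typed {value!r} into {selector}.",
    "navigate": "Page navigated to {value}.",
}


def describe(event: DomEvent) -> Optional[str]:
    """Return a one-line observation for the agent, or None to stay silent."""
    template = TEMPLATES.get(event.kind)
    if template is None:
        return None
    return template.format(selector=event.selector, value=event.value)
```

A short text line like this is far cheaper than a frame, and events such as mouse moves simply return None so the agent never hears about them.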

It all comes down to your use case but these are a few techniques I’ve used and found helpful.

Thank you very much, this is super. For my POC I just want to get a running version first. Once I have that done, I’ll work on the suggestions you gave.

Is there any documentation, example repo, or anything else that comes to mind that might help me implement this? For example, whether the Gemini Vision assistant is the way to go, or any specific prompts you feel work better?

I think this is the best example for your starting point. Not sure if that is the one you are already looking at:

Yes, this is the one, but on its own it doesn’t give real-time feedback: you need to ask the agent to look at the video, and only then does it respond accordingly.

@Ahmed_Aziz, for a POC, the cleanest start is the official Gemini Live Vision recipe (Gemini Realtime Agent with Live Vision | LiveKit Documentation). It wires RoomOptions(video_input=True) and configures the LLM with:

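  from livekit.plugins import google

  # Keyword argument from the recipe's session setup
  llm=google.beta.realtime.RealtimeModel(
      model="gemini-2.5-flash-native-audio-preview-12-2025",
      proactivity=True,  # lets the agent speak without waiting for user audio
      enable_affective_dialog=True,
  )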

The proactivity=True flag is what gives you agent-initiated speech on visual changes without needing user audio. Start the session with await session.generate_reply().

For the underlying frame primitives (custom samplers, manual rtc.VideoStream(track) loops once you outgrow the default), the canonical doc is Video | LiveKit Documentation.
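
If you do end up writing a custom sampler, a minimal sketch could look like the one below. It is a generic async throttle, not LiveKit API: sample_frames, min_interval and clock are hypothetical names, and it assumes your frame source is any async iterator (for example a loop over rtc.VideoStream(track)).

```python
import time
from typing import AsyncIterator, Callable, Optional, TypeVar

T = TypeVar("T")


async def sample_frames(
    frames: AsyncIterator[T],
    min_interval: float,
    clock: Callable[[], float] = time.monotonic,
) -> AsyncIterator[T]:
    """Yield at most one item per min_interval seconds and drop the rest."""
    last_emit: Optional[float] = None
    async for frame in frames:
        now = clock()
        if last_emit is None or now - last_emit >= min_interval:
            last_emit = now
            yield frame
```

Usage would be along the lines of "async for item in sample_frames(stream, 1.0): ...", with whatever the model call is inside the loop. The clock parameter is only there so the throttle can be tested without real sleeps.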

CWilson’s pre-processor advice is the right next step once your POC is running and token cost shows up. For the system prompt, instruct Gemini to only speak on concrete actions (“describe the most recent user action in one short sentence, or stay silent”). Default prompts tend to over-talk for guidance UX.

@Muhammad_Usman_Bashir Thank you! This is helpful