Does Gemini Vision work with H264?

I am trying to connect a WHIP ingress to Gemini, following this guide: Gemini Realtime Agent with Live Vision | LiveKit Documentation

My video source is an RTSP camera that streams H264. I couldn’t get the H264 stream through to Gemini; I had to transcode the video locally to VP8 before sending it to the ingress.

Is there a limitation? Am I missing something?

Hi, I believe this is addressed by this line in the docs: Transcoding configuration | LiveKit Documentation

The Ingress service can transcode the media being received. This is the only supported behavior for RTMP and URL inputs. WHIP ingresses are not transcoded by default, but transcoding can be enabled by setting the enable_transcoding parameter. When transcoding is enabled, the default settings enable video simulcast to ensure media can be consumed by all viewers, and should be suitable for most use cases.

But that is about transcoding, which I actually tried with the ingress.json below. Why do I even need to transcode?

    {
      "input_type": 1,
      "name": "gst-whip-test",
      "room_name": "test-room",
      "participant_identity": "camera-1",
      "participant_name": "Camera 1",
      "enable_transcoding": true,
      "video": {
        "options": {
          "video_codec":
        }
      }
    }

Gemini Live vision itself is not limited to a specific WebRTC codec like H264 or VP8. It receives video frames from the LiveKit room when video_input is enabled, as described in the Vision guide.

The issue you’re hitting is with WHIP ingress behavior. By default, WHIP forwards media unmodified, so your source must already be compatible with subscribers. If that’s not the case, you must enable transcoding on the ingress, as explained in Transcoding configuration.

So you don’t need VP8 specifically — you need the ingress to produce a WebRTC-compatible stream for the room.

Are you publishing H264 without simulcast from the RTSP → WHIP source?

Correct.

The codec MIME type is video/h264. I can verify this with a participant that processes the stream.

My GStreamer pipeline is this:

gst-launch-1.0 \
  rtspsrc location="rtsp://root:pass@192.168.103.139/axis-media/media.amp?audio=0&resolution=1280x720" protocols=tcp latency=200 name=src \
  src. ! \
  application/x-rtp,media=video ! \
  rtph264depay ! \
  h264parse config-interval=-1 ! \
  queue ! \
  whipclientsink name=whip \
    signaller::whip-endpoint="https://.....whip.livekit.cloud/w/...." \
    video-caps="video/x-h264"

This is the gst-discoverer output for the source:

Properties:
  Duration: 99:99:99.999999999
  Seekable: no
  Live: yes
  unknown #0: application/x-rtp
    video #1: H.264 (High Profile)
      Stream ID: d2583e23c4d68967633741b12be81d9ef38bdcf11c484ce9d038ed2075932fb6/video:0:0:RTP:AVP:96
      Width: 1280
      Height: 720
      Depth: 24
      Frame rate: 25/1
      Pixel aspect ratio: 1/1
      Interlaced: false
      Bitrate: 0
      Max bitrate: 0

I am not sure what the exact issue is here. The way I usually debug issues like this one is to remove the agent from the equation and join the room manually, so I can see exactly what the agent would see or hear. Things usually get pretty clear from that point.

Once you have the client-to-human-participant path working well, put the agent back in and hopefully it should “just work”.
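For the Python side specifically, here is a minimal sketch of that idea: a plain room participant with no agent that reports how many decoded video frames arrive per track in a fixed window. It assumes the `livekit` Python SDK is installed and that you supply your own server URL and access token. The names `count_frames` and `watch_room` are made up for illustration, not part of any LiveKit API.

```python
import asyncio


async def count_frames(stream, duration_s: float) -> int:
    """Count items yielded by an async iterator within a fixed time window."""
    count = 0

    async def consume() -> None:
        nonlocal count
        async for _ in stream:
            count += 1

    try:
        await asyncio.wait_for(consume(), timeout=duration_s)
    except asyncio.TimeoutError:
        pass  # window elapsed; report whatever arrived
    return count


async def watch_room(url: str, token: str, duration_s: float = 10.0) -> None:
    """Join a room without any agent and report decoded frames per video track."""
    # Imported lazily so count_frames itself has no SDK dependency.
    from livekit import rtc

    room = rtc.Room()
    tasks = set()

    @room.on("track_subscribed")
    def on_track_subscribed(track, publication, participant):
        if track.kind != rtc.TrackKind.KIND_VIDEO:
            return

        async def report() -> None:
            frames = await count_frames(rtc.VideoStream(track), duration_s)
            print(
                f"{participant.identity} {publication.mime_type}: "
                f"{frames} frames in {duration_s}s"
            )

        task = asyncio.create_task(report())
        tasks.add(task)  # keep a reference so the task is not garbage collected
        task.add_done_callback(tasks.discard)

    await room.connect(url, token)
    # Leave time for the subscription plus the counting window.
    await asyncio.sleep(duration_s + 2.0)
    await room.disconnect()
```

Running `asyncio.run(watch_room(url, token))` once against the H264 source and once against the VP8 source should show whether the frame count differs outside the agent framework.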

Thanks for the tip.
I can join a room (from a JS-based application) and see myself as well as the ingress. I can also have an egress client join the room and capture everyone’s media. The room logs look like this:

Subscribing to participant camera-1
trackPublished from camera-1 { "trackSid": "TR_VCkKL8uVnGwwGR", "kind": "video" }
trackSubscribed from camera-1 { "trackSid": "TR_VCkKL8uVnGwwGR", "kind": "video" }

When the agent is connected, it also shows up in the room. The agent hears my voice, but it receives no video (unless I convert the source stream to VP8).

This is my code. Do you see any smoking gun or do you have any other tips for debugging?

await ctx.connect()

# The opening of this call was lost in the paste; `await session.start(` is assumed.
# `session` is the AgentSession created earlier (not shown).
await session.start(
    room=ctx.room,
    agent=Assistant(),
    room_options=room_io.RoomOptions(
        audio_input=True,
        video_input=True,
        participant_identity=VOICE_PARTICIPANT_IDENTITY,
    ),
)

# Wait for both participants to be present in the room.
await ctx.wait_for_participant(identity=VOICE_PARTICIPANT_IDENTITY)
ingress = await ctx.wait_for_participant(
    identity=INGRESS_PARTICIPANT_IDENTITY,
    kind=rtc.ParticipantKind.PARTICIPANT_KIND_INGRESS,
)

# Wait for RoomIO to finish initializing (this relies on private attributes).
if session._room_io and session._room_io._init_atask:
    await session._room_io._init_atask

# Override only the video input to the ingress. Audio remains on the voice participant.
if session._room_io and session._room_io.video_input:
    session._room_io.video_input.set_participant(ingress)
    logger.info(
        "video input linked to ingress, audio input linked to voice participant",
        extra={"ingress": ingress.identity, "voice": VOICE_PARTICIPANT_IDENTITY},
    )

await session.generate_reply()

Can you check my logs please?

INFO:gemini-live-vision:connected to room: test-room

INFO:gemini-live-vision:existing_participant publication: participant=camera-1 sid=TR_VCxkRs2q7qC4hc name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA mime=video/H264 size=1280x720 subscribed=False muted=False simulcasted=False has_track=False

INFO:gemini-live-vision:existing_participant connected: identity=Work kind=PARTICIPANT_KIND_STANDARD sid=PA_qko48BKd6qPQ name=Work attributes={}

INFO:gemini-live-vision:existing_participant publication: participant=Work sid=TR_AMMjJWobZ38h3T name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0 subscribed=False muted=False simulcasted=False has_track=False

INFO:gemini-live-vision:connected to room: test-room

DEBUG:livekit.agents:input stream attached

DEBUG:livekit.agents:input stream attached

    14:35:05.026 INFO … gemini-live-vision connected to room: test-room {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:existing_participant connected: identity=camera-1 kind=PARTICIPANT_KIND_INGRESS sid=PA_BTDzdhtidshS name=Camera 1 attributes={'ingress.ingressID': 'IN_VzHQfQ5iN3VY'}

                 INFO … gemini-live-vision existing_participant connected: identity=camera-1 kind=PARTICIPANT_KIND_INGRESS sid=PA_BTDzdhtidshS name=Camera 1 attributes={'ingress.ingressID':    

                                           'IN_VzHQfQ5iN3VY'}                                                                                                                                    

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:existing_participant publication: participant=camera-1 sid=TR_VCxkRs2q7qC4hc name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA mime=video/H264 size=1280x720 subscribed=False muted=False simulcasted=False has_track=False

    14:35:05.027 INFO … gemini-live-vision existing_participant publication: participant=camera-1 sid=TR_VCxkRs2q7qC4hc name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA             

                                           mime=video/H264 size=1280x720 subscribed=False muted=False simulcasted=False has_track=False                                                          

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:existing_participant connected: identity=Work kind=PARTICIPANT_KIND_STANDARD sid=PA_qko48BKd6qPQ name=Work attributes={}

                 INFO … gemini-live-vision existing_participant connected: identity=Work kind=PARTICIPANT_KIND_STANDARD sid=PA_qko48BKd6qPQ name=Work attributes={}

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:existing_participant publication: participant=Work sid=TR_AMMjJWobZ38h3T name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0 subscribed=False muted=False simulcasted=False has_track=False

                 INFO … gemini-live-vision existing_participant publication: participant=Work sid=TR_AMMjJWobZ38h3T name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0       

                                           subscribed=False muted=False simulcasted=False has_track=False                                                                                        

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:input stream attached

    14:35:05.030 DEBUG… livekit.agents     input stream attached

                                         {"participant": null, "source": "SOURCE_UNKNOWN", "accepted_sources": ["SOURCE_MICROPHONE"], "room": "test-room", "pid": 61897, "job_id":

"job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:input stream attached

                 DEBUG… livekit.agents     input stream attached

                                         {"participant": null, "source": "SOURCE_UNKNOWN", "accepted_sources": ["SOURCE_CAMERA", "SOURCE_SCREENSHARE"], "room": "test-room", "pid": 61897,

"job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:track_subscribed: participant=Work sid=TR_AMMjJWobZ38h3T track=RemoteAudioTrack name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0

DEBUG:livekit.plugins.google:connecting to Gemini Realtime API...

INFO:gemini-live-vision:track_subscribed: participant=Work sid=TR_AMMjJWobZ38h3T track=RemoteAudioTrack name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0

    14:35:05.073 INFO … gemini-live-vision track_subscribed: participant=Work sid=TR_AMMjJWobZ38h3T track=RemoteAudioTrack name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.plugins.google:connecting to Gemini Realtime API...

DEBUG:livekit.agents:start reading stream

INFO:gemini-live-vision:track_subscribed: participant=camera-1 sid=TR_VCxkRs2q7qC4hc track=RemoteVideoTrack name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA mime=video/H264 size=1280x720

    14:35:05.074 DEBUG… livekit.…ns.google connecting to Gemini Realtime API... {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:start reading stream

    14:35:05.076 DEBUG… livekit.agents     start reading stream

                                         {"participant": "Work", "source": "SOURCE_MICROPHONE", "room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:track_subscribed: participant=camera-1 sid=TR_VCxkRs2q7qC4hc track=RemoteVideoTrack name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA mime=video/H264 size=1280x720

DEBUG:livekit.agents:using audio io: `RoomIO` -> `AgentSession` -> `TranscriptSynchronizer` -> `RoomIO`

DEBUG:livekit.agents:using transcript io: `AgentSession` -> `TranscriptSynchronizer` -> `RoomIO`

DEBUG:livekit.agents:using video io: `RoomIO` > `AgentSession` > (none)

INFO:gemini-live-vision:voice_participant connected: identity=Work kind=PARTICIPANT_KIND_STANDARD sid=PA_qko48BKd6qPQ name=Work attributes={}

                 INFO … gemini-live-vision track_subscribed: participant=camera-1 sid=TR_VCxkRs2q7qC4hc track=RemoteVideoTrack name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA      

                                           mime=video/H264 size=1280x720                                                                                                                         

INFO:gemini-live-vision:voice_participant publication: participant=Work sid=TR_AMMjJWobZ38h3T name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0 subscribed=True muted=False simulcasted=False has_track=True

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:using audio io: `RoomIO` -> `AgentSession` -> `TranscriptSynchronizer` -> `RoomIO`

INFO:gemini-live-vision:ingress_participant connected: identity=camera-1 kind=PARTICIPANT_KIND_INGRESS sid=PA_BTDzdhtidshS name=Camera 1 attributes={'ingress.ingressID': 'IN_VzHQfQ5iN3VY'}

INFO:gemini-live-vision:ingress_participant publication: participant=camera-1 sid=TR_VCxkRs2q7qC4hc name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA mime=video/H264 size=1280x720 subscribed=True muted=False simulcasted=False has_track=True

                 DEBUG… livekit.agents     using audio io: `RoomIO` -> `AgentSession` -> `TranscriptSynchronizer` -> `RoomIO`

                                         {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:using transcript io: `AgentSession` -> `TranscriptSynchronizer` -> `RoomIO`

    14:35:05.077 DEBUG… livekit.agents     using transcript io: `AgentSession` -> `TranscriptSynchronizer` -> `RoomIO`

                                         {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:using video io: `RoomIO` > `AgentSession` > (none)

                 DEBUG… livekit.agents     using video io: `RoomIO` > `AgentSession` > (none) {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:voice_participant connected: identity=Work kind=PARTICIPANT_KIND_STANDARD sid=PA_qko48BKd6qPQ name=Work attributes={}

                 INFO … gemini-live-vision voice_participant connected: identity=Work kind=PARTICIPANT_KIND_STANDARD sid=PA_qko48BKd6qPQ name=Work attributes={}

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:voice_participant publication: participant=Work sid=TR_AMMjJWobZ38h3T name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0 subscribed=True muted=False simulcasted=False has_track=True

                 INFO … gemini-live-vision voice_participant publication: participant=Work sid=TR_AMMjJWobZ38h3T name= kind=KIND_AUDIO source=SOURCE_MICROPHONE mime=audio/red size=0x0          

                                           subscribed=True muted=False simulcasted=False has_track=True                                                                                          

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:ingress_participant connected: identity=camera-1 kind=PARTICIPANT_KIND_INGRESS sid=PA_BTDzdhtidshS name=Camera 1 attributes={'ingress.ingressID': 'IN_VzHQfQ5iN3VY'}

                 INFO … gemini-live-vision ingress_participant connected: identity=camera-1 kind=PARTICIPANT_KIND_INGRESS sid=PA_BTDzdhtidshS name=Camera 1 attributes={'ingress.ingressID':     

                                           'IN_VzHQfQ5iN3VY'}                                                                                                                                    

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:ingress_participant publication: participant=camera-1 sid=TR_VCxkRs2q7qC4hc name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA mime=video/H264 size=1280x720 subscribed=True muted=False simulcasted=False has_track=True

                 INFO … gemini-live-vision ingress_participant publication: participant=camera-1 sid=TR_VCxkRs2q7qC4hc name=synthesized-camera kind=KIND_VIDEO source=SOURCE_CAMERA              

                                           mime=video/H264 size=1280x720 subscribed=True muted=False simulcasted=False has_track=True                                                            

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

INFO:gemini-live-vision:video input linked to ingress, audio input linked to voice participant: ingress=camera-1 voice=Work

DEBUG:livekit.agents:start reading stream

INFO:gemini-live-vision:video input linked to ingress, audio input linked to voice participant: ingress=camera-1 voice=Work

    14:35:05.438 INFO … gemini-live-vision video input linked to ingress, audio input linked to voice participant: ingress=camera-1 voice=Work

                                           {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:start reading stream

    14:35:05.440 DEBUG… livekit.agents     start reading stream

                                         {"participant": "camera-1", "source": "SOURCE_CAMERA", "room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:aec warmup active, disabling interruptions for 3.00s

DEBUG:livekit.agents:aec warmup active, disabling interruptions for 3.00s

    14:35:07.924 DEBUG… livekit.agents     aec warmup active, disabling interruptions for 3.00s {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

DEBUG:livekit.agents:aec warmup expired, re-enabling interruptions

DEBUG:livekit.agents:aec warmup expired, re-enabling interruptions

    14:35:10.925 DEBUG… livekit.agents     aec warmup expired, re-enabling interruptions {"room": "test-room", "pid": 61897, "job_id": "job-2a6bba876f24", "room_id": "RM_LcASivGs8m3G"}

^CWARNING:livekit.agents:exiting forcefully

    14:35:21.426 WARNI… livekit.agents     exiting forcefully

I created the frame monitor below. If my GStreamer pipeline transcodes H264 to VP8, I see “starting video frame monitor”, “first video frame received” and “video frames received”. If it streams H264, I only see “starting video frame monitor” and nothing else.

This looks like a bug somewhere between my application and Gemini.

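# Called from the entrypoint once the video input is linked to the ingress:
frame_monitor_task = asyncio.create_task(
    monitor_video_frames(session._room_io.video_input, ingress.identity)
)

# Helper; needs `import asyncio` and `import time` at module level.
async def monitor_video_frames(video_input, participant_identity: str) -> None: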
    frame_count = 0
    started_at = time.monotonic()
    last_report_at = started_at

    logger.info(
        "starting video frame monitor",
        extra={"participant": participant_identity},
    )

    async for _frame in video_input:
        frame_count += 1
        now = time.monotonic()

        if frame_count == 1:
            logger.info(
                "first video frame received",
                extra={"participant": participant_identity},
            )

        if now - last_report_at >= 5.0:
            elapsed = now - started_at
            fps = frame_count / elapsed if elapsed > 0 else 0.0
            logger.info(
                "video frames received",
                extra={
                    "participant": participant_identity,
                    "frames": frame_count,
                    "elapsed_s": round(elapsed, 2),
                    "avg_fps": round(fps, 2),
                },
            )
            last_report_at = now
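
One limitation of the monitor above is that it is silent when no frames arrive, so a stalled stream looks the same as a monitor that never started. As a rough sketch (the name `with_stall_warnings` and the logger name are hypothetical), a wrapper around any async frame iterator can log a warning every time `stall_timeout` seconds pass without a frame:

```python
import asyncio
import logging
from typing import AsyncIterator, TypeVar

T = TypeVar("T")
logger = logging.getLogger("frame-watchdog")  # hypothetical logger name


async def with_stall_warnings(
    frames: AsyncIterator[T], stall_timeout: float = 5.0
) -> AsyncIterator[T]:
    """Yield items from `frames`, warning whenever none arrives for `stall_timeout` seconds."""
    it = frames.__aiter__()
    while True:
        # Request the next frame once, then poll it with a timeout so that a
        # timeout never cancels the underlying read.
        pending = asyncio.ensure_future(it.__anext__())
        try:
            while True:
                done, _ = await asyncio.wait({pending}, timeout=stall_timeout)
                if done:
                    break
                logger.warning("no video frame for %.1fs", stall_timeout)
        except asyncio.CancelledError:
            pending.cancel()
            raise
        try:
            frame = pending.result()
        except StopAsyncIteration:
            return  # source stream ended
        yield frame
```

With this, `async for _frame in with_stall_warnings(video_input):` would replace the plain loop, and the H264 case should produce repeated warnings instead of no output at all.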

I opened this: “H264-specific issue when subscribing to a WHIP ingress video track from Python” (Issue #592, livekit/python-sdks, GitHub). I hope it is the right repository.

Hi again, can you please confirm whether the issue has been reported to the right repository?

Actually, I think this would be the best place:

Apologies, I missed that previously.