Latency advice needed: OpenAI Realtime-quality conversation + Runway avatar via LiveKit Agent is too slow

Hey everyone,

I’m building a mobile app. The product lets a user talk to a pet as an emotional AI companion. The core experience needs to feel very responsive and emotionally present.

We have tested a few architectures:

  1. OpenAI Realtime only

    • Very good emotional conversation quality

    • Fast response time

    • Memory/persona works well

    • But the visual layer is not good enough yet

  2. Runway Characters direct mode

    • Very good animation quality

    • Conversation speed feels good

    • But Runway owns too much of the brain/voice/persona

    • Our Talking.Pet memory and personality control are not strong enough

  3. Current LiveKit Agent architecture

    • Flutter client publishes user mic into our LiveKit room

    • LiveKit Agent receives user audio

    • Agent uses OpenAI/LLM + TTS

    • Runway avatar plugin animates the agent audio

    • Flutter receives remote audio/video from the avatar

The current architecture is conceptually what we want:

User mic → Talking.Pet brain/memory/persona → Talking.Pet TTS → Runway avatar animation → Flutter

But the experience feels too slow and the emotional companion feeling disappears.

Some recent latency logs after optimization:

ENDPOINTING_MODE=dynamic
ENDPOINTING_MIN_DELAY=0.2
ENDPOINTING_MAX_DELAY=1.2
ENDPOINTING_ALPHA=0.8
TTS_PIPELINE=phrase_flush_not_sentence_buffer

Example turn: “How are you?”

SPEECH_END_TO_STT_FINAL_MS=1690
STT_LATENCY_MS=1690
LLM_FIRST_TOKEN_MS=569
TTS_LATENCY_MS=880
RUNWAY_PLAYBACK_BUFFER_MS=1
TOTAL_TIME_TO_FIRST_AUDIO_MS=3204
TOTAL_TURN_TIME_MS=5549

Example turn: “What is your name?”

SPEECH_END_TO_STT_FINAL_MS=1104
STT_LATENCY_MS=1104
LLM_FIRST_TOKEN_MS=811
TTS_LATENCY_MS=764
RUNWAY_PLAYBACK_BUFFER_MS=2
TOTAL_TIME_TO_FIRST_AUDIO_MS=2768
TOTAL_TURN_TIME_MS=5697

Example turn: “What’s the name of your mother?”

SPEECH_END_TO_STT_FINAL_MS=1150
STT_LATENCY_MS=1150
LLM_FIRST_TOKEN_MS=556
TTS_LATENCY_MS=1042
RUNWAY_PLAYBACK_BUFFER_MS=2
TOTAL_TIME_TO_FIRST_AUDIO_MS=2850
TOTAL_TURN_TIME_MS=5816

The good news is that the Runway playback buffer is now very low, often 1–2 ms. But the overall response still feels too slow because STT/endpointing + LLM first token + TTS first audio stack up.
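To make the stacking concrete, here is a rough tally of the three turns above. It is only a sketch: it assumes the logged stages run strictly one after another and treats whatever the four stages do not cover as unexplained overhead. The `turns` dict and the `budget` helper are illustrative names, not part of any LiveKit API.

```python
# Per-turn stage timings in ms, copied from the logs above.
turns = {
    "How are you?": {
        "stt": 1690, "llm_first_token": 569, "tts": 880,
        "runway_buffer": 1, "total_first_audio": 3204,
    },
    "What is your name?": {
        "stt": 1104, "llm_first_token": 811, "tts": 764,
        "runway_buffer": 2, "total_first_audio": 2768,
    },
    "What's the name of your mother?": {
        "stt": 1150, "llm_first_token": 556, "tts": 1042,
        "runway_buffer": 2, "total_first_audio": 2850,
    },
}


def budget(turn):
    """Sum the logged stages and compare them with the measured total."""
    stages = {k: v for k, v in turn.items() if k != "total_first_audio"}
    accounted = sum(stages.values())
    total = turn["total_first_audio"]
    return {
        "accounted_ms": accounted,
        "unaccounted_ms": total - accounted,
        "stt_share": round(stages["stt"] / total, 2),
    }


for prompt, turn in turns.items():
    print(prompt, budget(turn))
```

On these numbers the STT/endpointing stage alone is roughly 40–53% of the time to first audio, and the four logged stages explain all but 64–100 ms of each total.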

We also saw occasional timing/turn-taking warnings like:

playback_finished called before text/audio input is done
push_audio called after close
skipping user input, speech scheduling is paused

Questions:

  1. Is this architecture expected to be slower because we are chaining multiple realtime systems together?

  2. Are there recommended LiveKit Agent settings for a more natural emotional companion / low-latency voice loop?

  3. Can endpointing be made more aggressive than dynamic min_delay=0.2, max_delay=1.2, alpha=0.8 without hurting reliability?

  4. Is there a better way to start TTS/avatar output earlier from partial LLM output?

  5. Are there known limitations when using the Runway avatar plugin for low-latency conversational use?

  6. Would a different LiveKit avatar plugin/provider be better suited for sub-2-second emotional conversation?

  7. Any suggestions for avoiding the playback synchronizer warnings above?

Our target is:

Time to first audible pet response: ideally 1–2 seconds
Very short emotional responses: 3–8 words
Animation must feel alive, but latency matters more than perfect lip-sync

We are currently considering moving live conversation back to OpenAI Realtime and using a local audio-driven pet cutout animation engine, while keeping Runway for offline/premium animation assets.

Before we make that architecture decision, I’d love to know if there are LiveKit-specific optimizations or better patterns we should try.

Thanks for any guidance.

@Niklas_Stalberg, one of the root causes: your STT_LATENCY (1100-1700ms) is the dominant cost in the turn budget. Endpointing min=0.2 / max=1.2 is already aggressive vs the defaults of 0.5 / 3.0 [ Turn-taking tuning | LiveKit Documentation ], and the model is using close to the full max_delay window, so the endpointing wait IS most of that 1.1-1.7s. Pushing max_delay lower trades latency for false-positive turn cuts; the documented move when cuts start happening is the turn-detector model rather than raw VAD timers.

The single biggest lever still on the cascade is preemptive_tts: true. Preemptive LLM generation is on by default (LLM starts when the final transcript arrives), but TTS waits for buffered phrases unless you opt in. preemptive_tts starts TTS on partial LLM output [same page]. That should shave a few hundred ms.
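For intuition on what "TTS on partial LLM output" means, here is a generic sketch of phrase-level flushing. It is not how LiveKit implements preemptive_tts; `phrase_chunks`, `max_words`, and the sample token stream are all made up for illustration. The idea is simply to hand text to TTS at the first clause boundary, or after a few words, instead of waiting for a full sentence.

```python
import re
from typing import Iterable, Iterator

# Clause-ending punctuation at the end of the buffer triggers a flush.
_BOUNDARY = re.compile(r"[,.!?;:]\s*$")


def phrase_chunks(tokens: Iterable[str], max_words: int = 4) -> Iterator[str]:
    """Yield short phrases from a streamed token sequence."""
    buf = ""
    for tok in tokens:
        # A token starting with whitespace begins a new word: if the buffer
        # is already long enough, flush it before starting the next word.
        if tok[:1].isspace() and len(buf.split()) >= max_words:
            yield buf.strip()
            buf = ""
        buf += tok
        if _BOUNDARY.search(buf):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()


# Hypothetical LLM token stream for a short companion-style reply.
llm_tokens = ["I'm", " so", " happy", " you're", " here", "!",
              " Did", " you", " miss", " me", "?"]
chunks = list(phrase_chunks(llm_tokens))
print(chunks)
```

With replies of only 3–8 words, the first chunk is often most of the reply, so the gain from early flushing is bounded by how quickly the TTS returns its first audio for a short input.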

Runway is not your bottleneck. Your own metric shows the buffer at 1-2ms; switching avatars won’t move the needle.

For sub-2-second first audio you have to drop STT. The realtime path eliminates it. The half-cascade architecture (a realtime model handling ASR+LLM, with a separate TTS feeding the avatar) is the documented pattern for keeping your own TTS, persona, and memory in the loop while removing the STT cost [ Realtime models overview | LiveKit Documentation ].

The three timing warnings are separate from the latency problem and don’t need to be solved to hit your latency target. They’re agent-internal lifecycle issues (segment synchronization, session-close ordering, turn-scheduling state).

In addition to the above, have you tried switching out your STT provider? Your STT latency does seem quite high.

Is this architecture expected to be slower because we are chaining multiple realtime systems together?

Yes, this cascade model can introduce latency, so you need to be conscious of which parts of your pipeline are introducing delay. I usually point people here:

It honestly sounds like you are asking all the right questions and identifying which part of your pipeline is introducing delays. There aren’t any known limitations with Runway avatars, but they are really new; we only launched them the other week.

Thanks for your help. The latency is now low and acceptable once I am in the conversation with the agent. However, startup is inconsistent and very slow. Do you have any insight into how I can improve it, based on the details below?

The issue is that total startup latency is often very high. I am trying to isolate whether this is caused by my implementation, Runway, LiveKit room creation, region selection, or agent dispatch behavior.

Some examples from recent logs:

Example 1 — first session:

BACKEND_SESSION_CREATE_MS=11454
LIVEKIT_ROOM_CREATE_MS=7925
LIVEKIT_DISPATCH_MS=1813
LIVEKIT_CONNECT_FROM_FLUTTER_MS=2050
AVATAR_JOIN_TO_VIDEO_PUBLISHED_MS=5465
TOTAL_APP_START_TO_FIRST_FRAME_MS=23188
BOTTLENECK=backend_session

Example 2 — pet switch / second session:

RUNWAY_BACKEND_REQUEST_MS=24646
BACKEND_SESSION_CREATE_MS=24583
LIVEKIT_ROOM_CREATE_MS=21973
LIVEKIT_DISPATCH_MS=1368

Example 3 — another pet switch:

RUNWAY_BACKEND_REQUEST_MS=23205
BACKEND_SESSION_CREATE_MS=23152
LIVEKIT_ROOM_CREATE_MS=20415
LIVEKIT_DISPATCH_MS=1408

The Python agent worker itself is already running and registered before the test:

registered worker
agent_name=pet-avatar
region=India South
livekit-agents version=1.5.15
rtc-version=1.1.8

The backend dispatches to:

agentName=pet-avatar

The worker does receive the job and joins the room, but the backend’s short join-poll often says:

LIVEKIT_AGENT_JOIN_CONFIRMED=false
RUNWAY_AGENT_JOIN_HANDSHAKE_WARNING=worker_may_join_after_flutter_connects

So I understand that the join-confirmation warning may be due to my polling logic being too short, and not necessarily a LiveKit failure.

However, the bigger concern is the room creation time:

LIVEKIT_ROOM_CREATE_MS=7925
LIVEKIT_ROOM_CREATE_MS=21973
LIVEKIT_ROOM_CREATE_MS=20415

Questions:

  1. Is it normal for LiveKit Cloud room creation to take 8–22 seconds?
  2. Could this be related to region selection or cold-start behavior?
  3. The worker is registered in India South. My app/backend are running locally from Thailand. Could region choice explain some of this delay?
  4. Is there a recommended way to pre-create or reuse rooms for low-latency agent experiences without keeping an expensive avatar session alive?
  5. Should I avoid explicit room creation and instead let the first participant connection create the room?
  6. Is there a better pattern for Agent Dispatch where the backend can return faster and the client connects while the worker joins?
  7. Is there a recommended health/readiness API for checking that a specific agent worker is registered before dispatching?
  8. Are there any LiveKit Cloud settings, region pinning options, or room creation patterns that would reduce startup time?

To be clear, I am not assuming LiveKit is the only cause. Runway avatar video publishing also adds latency. But the LIVEKIT_ROOM_CREATE_MS spikes look large enough that I want to confirm whether this is expected or whether my room/dispatch flow is suboptimal.
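For scale, here is the share of backend session time taken by room creation in the three examples. This is a rough sketch that assumes LIVEKIT_ROOM_CREATE_MS is measured inside BACKEND_SESSION_CREATE_MS (nested timers); the `samples` labels and the `room_create_share` helper are illustrative only.

```python
# (BACKEND_SESSION_CREATE_MS, LIVEKIT_ROOM_CREATE_MS) from the examples above.
samples = {
    "first session": (11454, 7925),
    "pet switch 1": (24583, 21973),
    "pet switch 2": (23152, 20415),
}


def room_create_share(backend_ms: int, room_ms: int) -> int:
    """Room creation time as a whole-number percentage of backend session time."""
    return round(100 * room_ms / backend_ms)


shares = {name: room_create_share(*ms) for name, ms in samples.items()}
for name, pct in shares.items():
    print(f"{name}: room creation is about {pct}% of backend session time")
```

If the nesting assumption holds, room creation is about 69% of the first session's backend time and close to 90% on both pet switches.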

Any guidance on the recommended lowest-latency architecture for Flutter client → backend → LiveKit Agent → Runway avatar would be appreciated.

Can you share the session IDs for these examples? Room session IDs start with RM_, and you can find them on your LiveKit dashboard under Sessions.

Do you have Observability enabled? If so can you share the sessions with us?

@Niklas_Stalberg, the startup-latency split is illuminating. LIVEKIT_ROOM_CREATE_MS at 7925–21973 ms is your dominant startup cost, and that’s because you’re calling CreateRoom explicitly when you don’t need to.

Per the dispatch docs: “The room is automatically created during dispatch if it doesn’t already exist” [ Agent dispatch | LiveKit Documentation ]. Rooms also auto-create when the first participant connects [ Room management | LiveKit Documentation ]. The explicit CreateRoom call before dispatch is redundant. Drop it:

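  # Before: 3 API roundtrips
  # 1. await lkapi.room.create_room(...)
  # 2. await lkapi.agent_dispatch.create_dispatch(...)
  # 3. Flutter connects

  # After: 2 roundtrips
  from livekit import api

  async def dispatch_pet_agent(room_name: str) -> None:
      # LiveKitAPI() reads LIVEKIT_URL, LIVEKIT_API_KEY and
      # LIVEKIT_API_SECRET from the environment.
      lkapi = api.LiveKitAPI()
      try:
          await lkapi.agent_dispatch.create_dispatch(
              api.CreateAgentDispatchRequest(
                  agent_name="pet-avatar",
                  room=room_name,  # auto-created by dispatch if missing
              )
          )
      finally:
          await lkapi.aclose()  # release the underlying HTTP session
  # Flutter connects in parallel; whoever wins, the room exists.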

That eliminates the 8-22s step. BACKEND_SESSION_CREATE_MS should drop in line.

On the Thailand→India South region mismatch: cross-region API roundtrips compound. Region pinning is for compliance traffic isolation, not latency optimization [ Region pinning | LiveKit Documentation ]. The cleanest fix is colocating your backend closer to the worker region.

The three timing warnings are separate from startup latency and don’t need to block your target.