Hey everyone,
I’m building a mobile app. The product lets a user talk to a pet as an emotional AI companion. The core experience needs to feel very responsive and emotionally present.
We have tested a few architectures:
- OpenAI Realtime only
  - Very good emotional conversation quality
  - Fast response time
  - Memory/persona works well
  - But the visual layer is not good enough yet
- Runway Characters direct mode
  - Very good animation quality
  - Conversation speed feels good
  - But Runway owns too much of the brain/voice/persona
  - Our Talking.Pet memory and personality control are not strong enough
- Current LiveKit Agent architecture
  - Flutter client publishes user mic into our LiveKit room
  - LiveKit Agent receives user audio
  - Agent uses OpenAI/LLM + TTS
  - Runway avatar plugin animates the agent audio
  - Flutter receives remote audio/video from the avatar
The current architecture is conceptually what we want:
User mic → Talking.Pet brain/memory/persona → Talking.Pet TTS → Runway avatar animation → Flutter
But the experience feels too slow and the emotional companion feeling disappears.
Some recent latency logs after optimization:
ENDPOINTING_MODE=dynamic
ENDPOINTING_MIN_DELAY=0.2
ENDPOINTING_MAX_DELAY=1.2
ENDPOINTING_ALPHA=0.8
TTS_PIPELINE=phrase_flush_not_sentence_buffer
Example turn: “How are you?”
SPEECH_END_TO_STT_FINAL_MS=1690
STT_LATENCY_MS=1690
LLM_FIRST_TOKEN_MS=569
TTS_LATENCY_MS=880
RUNWAY_PLAYBACK_BUFFER_MS=1
TOTAL_TIME_TO_FIRST_AUDIO_MS=3204
TOTAL_TURN_TIME_MS=5549
Example turn: “What is your name?”
SPEECH_END_TO_STT_FINAL_MS=1104
STT_LATENCY_MS=1104
LLM_FIRST_TOKEN_MS=811
TTS_LATENCY_MS=764
RUNWAY_PLAYBACK_BUFFER_MS=2
TOTAL_TIME_TO_FIRST_AUDIO_MS=2768
TOTAL_TURN_TIME_MS=5697
Example turn: “What’s the name of your mother?”
SPEECH_END_TO_STT_FINAL_MS=1150
STT_LATENCY_MS=1150
LLM_FIRST_TOKEN_MS=556
TTS_LATENCY_MS=1042
RUNWAY_PLAYBACK_BUFFER_MS=2
TOTAL_TIME_TO_FIRST_AUDIO_MS=2850
TOTAL_TURN_TIME_MS=5816
The good news is that the Runway playback buffer is now very low, often 1–2 ms. But the overall response still feels too slow because STT/endpointing + LLM first token + TTS first audio stack up.
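To make the stacking concrete, here is a rough sketch of the arithmetic using only the numbers from the three turns logged above. The `Turn` class and its field names are illustrative and not part of LiveKit or Runway. It assumes the stages run one after another, which the logs appear to support since the stages add up to within about 100 ms of the reported total.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One logged turn; field names are illustrative, values come from the logs."""
    text: str
    stt_ms: int
    llm_first_token_ms: int
    tts_ms: int
    playback_buffer_ms: int
    total_first_audio_ms: int

    def stage_sum(self) -> int:
        # Assumes the stages are strictly sequential (no overlap).
        return (self.stt_ms + self.llm_first_token_ms
                + self.tts_ms + self.playback_buffer_ms)

    def unaccounted_ms(self) -> int:
        # Time in the reported total that no logged stage explains.
        return self.total_first_audio_ms - self.stage_sum()

    def stt_share(self) -> float:
        # Fraction of time-to-first-audio spent waiting for the final transcript.
        return self.stt_ms / self.total_first_audio_ms


TURNS = [
    Turn("How are you?", 1690, 569, 880, 1, 3204),
    Turn("What is your name?", 1104, 811, 764, 2, 2768),
    Turn("What's the name of your mother?", 1150, 556, 1042, 2, 2850),
]

for turn in TURNS:
    print(f"{turn.text!r}: stages={turn.stage_sum()} ms, "
          f"unaccounted={turn.unaccounted_ms()} ms, "
          f"STT share={turn.stt_share():.0%}")
```

On these three turns, STT/endpointing alone accounts for roughly 40–53% of the time to first audio, so it looks like the largest single stage to attack first.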
We also saw occasional timing/turn-taking warnings like:
playback_finished called before text/audio input is done
push_audio called after close
skipping user input, speech scheduling is paused
Questions:
- Is this architecture expected to be slower because we are chaining multiple realtime systems together?
- Are there recommended LiveKit Agent settings for a more natural emotional companion / low-latency voice loop?
- Can endpointing be made more aggressive than dynamic min_delay=0.2, max_delay=1.2, alpha=0.8 without hurting reliability?
- Is there a better way to start TTS/avatar output earlier from partial LLM output?
- Are there known limitations when using the Runway avatar plugin for low-latency conversational use?
- Would a different LiveKit avatar plugin/provider be better suited for sub-2-second emotional conversation?
- Any suggestions for avoiding the playback synchronizer warnings above?
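To clarify the question about starting TTS earlier from partial LLM output, this is roughly the idea behind flushing by phrase instead of buffering whole sentences. It is a minimal, library-free sketch and not the actual agent code: `flush_phrases`, `PHRASE_BREAKS`, and `max_words` are hypothetical names, and it assumes the LLM streams tokens where a leading space marks the start of a new word.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe place to hand text to TTS (an assumption).
PHRASE_BREAKS = (",", ".", "!", "?", ";", ":")


def flush_phrases(tokens: Iterable[str], max_words: int = 4) -> Iterator[str]:
    """Yield short phrases from a token stream as early as possible."""
    buffer = ""
    for token in tokens:
        # A token that starts with whitespace begins a new word, so the
        # buffer holds only whole words and can be flushed on word count.
        if token[:1].isspace() and len(buffer.split()) >= max_words:
            yield buffer.strip()
            buffer = ""
        buffer += token
        # Flush immediately at phrase punctuation, not only at sentence end.
        if buffer.rstrip().endswith(PHRASE_BREAKS):
            yield buffer.strip()
            buffer = ""
    # Emit whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["I", "'m", " so", " happy", ",", " let", "'s", " play",
              " fetch", " right", " now", "!"]
    for phrase in flush_phrases(stream):
        print(phrase)  # each phrase would be sent to TTS as soon as it is ready
```

With 3–8 word replies, a small `max_words` means the first TTS request can go out after the first few tokens. The trade-off is choppier prosody, and a naive punctuation check like this one would also split numbers such as "3.5".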
Our target is:
Time to first audible pet response: ideally 1–2 seconds
Very short emotional responses: 3–8 words
Animation must feel alive, but latency matters more than perfect lip-sync
We are currently considering moving live conversation back to OpenAI Realtime and using a local audio-driven pet cutout animation engine, while keeping Runway for offline/premium animation assets.
Before we make that architecture decision, I’d love to know if there are LiveKit-specific optimizations or better patterns we should try.
Thanks for any guidance.