I wanted a deeper understanding of how the internal flow works when a session has an avatar in it.
My understanding, with questions for clarification:
The avatar is a third participant in the conversation: the user, the agent, and the avatar.
User speaks → agent receives and generates a response → the TTS audio is sent to the avatar → the avatar syncs with the audio and publishes audio and video to the room. [Q1. The avatar publishes both the audio and the video, right? It's not that the agent publishes the audio and the avatar publishes only the video?]
How does the text stream flow work here? My assumption is that whatever the LLM node generates gets sent to the room's text stream without any intermediaries, so the frontend can listen and show it on screen. [Q2. Does this behavior change when an avatar is present? Does the avatar send out text streams when it's there? I wanted to know everything that introducing an avatar changes.]
Please correct me or add more context. I'm doing this deeper dive because I have been seeing the following issue: text streams get chopped off in the frontend while the audio is still audible. It is most prominent when using the Tavus avatar. I was also exploring initializing avatars in parallel, so I need a deeper understanding of the audio plumbing.
Q1: Yes, the avatar publishes both the audio and video together since they are synchronized.
Q2: As of right now, avatars are separate from the text streams. We recently added support for synchronized transcripts by passing along the transcript timestamp data (if the TTS supports it), but I don’t think any avatar providers have opted in yet (that is, none of them republish the text stream). The text stream still comes from the LLM node.
@Tina_Nguyen’s answer pins down the architecture, and it actually narrows down your chopping symptom. Since text comes from the LLM node and audio comes from Tavus on a separate timing path, the two streams aren’t synchronized by default. Tavus buffers TTS audio for several seconds to align with its lip-sync video, so the audio you hear finishes later than the text stream does.
If your frontend ties text rendering completion to audio playback events (track end, the lk.transcription_final flag aligned with audio end, etc.), you’d see exactly what you describe: text gets visually cut while audio is still playing or when the frontend prematurely marks the turn finished.
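To illustrate the decoupling, here is a minimal TypeScript sketch in which only text events can change the transcript and audio events only affect whether the turn counts as finished. The event names (`text-chunk`, `text-stream-closed`, `audio-ended`) and the `TurnState` shape are hypothetical, made up for this example and not LiveKit APIs. You would map your own handlers onto them, for instance treating the end of the transcription text stream as `text-stream-closed`.

```typescript
// Hypothetical event model for one agent turn; none of these names come from LiveKit.
type TurnEvent =
  | { kind: "text-chunk"; text: string }
  | { kind: "text-stream-closed" }
  | { kind: "audio-ended" };

interface TurnState {
  text: string;           // transcript accumulated so far
  textComplete: boolean;  // set only by the text stream closing
  audioComplete: boolean; // set only by audio playback ending
}

function initialTurn(): TurnState {
  return { text: "", textComplete: false, audioComplete: false };
}

// Pure reducer: audio events never truncate or freeze the text,
// and text events never depend on audio state.
function reduceTurn(state: TurnState, event: TurnEvent): TurnState {
  switch (event.kind) {
    case "text-chunk":
      // Ignore chunks only after the text stream itself has closed.
      return state.textComplete
        ? state
        : { ...state, text: state.text + event.text };
    case "text-stream-closed":
      return { ...state, textComplete: true };
    case "audio-ended":
      return { ...state, audioComplete: true };
  }
}

// The turn is finished only when both paths have finished, in either order.
function isTurnFinished(state: TurnState): boolean {
  return state.textComplete && state.audioComplete;
}
```

With an avatar in the session the usual order would be `text-stream-closed` first and `audio-ended` several seconds later, but the reducer gives the same transcript whichever arrives first.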
To localize it, log every chunk received on the lk.transcription topic in the frontend, with timestamps. If the full text arrives in the chunks but the UI shows it truncated, the chop is in your render logic, probably the sync-to-audio path. If the chunks themselves are missing, that’s a different bug, likely interruption logic firing mid-stream and clipping the publish.
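A rough TypeScript sketch of that check follows. `ChunkLogger` and `diagnose` are hypothetical helpers written for this post, not part of any SDK. The `expected` string is assumed to be the full reply as logged on the agent side, and `displayed` is whatever your UI ended up showing.

```typescript
// One record per chunk received on the text stream topic.
interface ChunkLog {
  streamId: string;     // id of the text stream the chunk belongs to
  receivedAtMs: number; // wall-clock time the chunk reached the frontend
  text: string;
}

type Diagnosis = "ok" | "render-logic" | "chunks-missing";

class ChunkLogger {
  private readonly logs: ChunkLog[] = [];

  // Store the chunk and print it with a timestamp.
  record(streamId: string, text: string, receivedAtMs: number = Date.now()): void {
    this.logs.push({ streamId, receivedAtMs, text });
    const stamp = new Date(receivedAtMs).toISOString();
    console.log(`[${stamp}] ${streamId} +${text.length} chars: ${JSON.stringify(text)}`);
  }

  // Everything received for one stream, concatenated in arrival order.
  received(streamId: string): string {
    return this.logs
      .filter((log) => log.streamId === streamId)
      .map((log) => log.text)
      .join("");
  }
}

// Collapse whitespace differences so they do not count as a mismatch.
function normalize(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

function diagnose(expected: string, received: string, displayed: string): Diagnosis {
  if (normalize(received) !== normalize(expected)) return "chunks-missing";
  if (normalize(displayed) !== normalize(received)) return "render-logic";
  return "ok";
}

// Possible wiring with livekit-client (an assumption, not executed here;
// check the names against your SDK version):
//   room.registerTextStreamHandler("lk.transcription", async (reader) => {
//     for await (const chunk of reader) logger.record(reader.info.id, chunk);
//   });
```

A `render-logic` result points at the frontend's sync-to-audio path, while `chunks-missing` means the text never reached the client in full and the agent side is the place to look.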