I wanted to get a deeper understanding of the internal flow when a session has an avatar in it.
My current understanding, for correction/clarification:
- The avatar is a third participant in the conversation: user, agent, and avatar.
- User speaks → agent receives it and generates a response → the TTS audio is sent to the avatar → the avatar syncs its video with the audio and publishes both audio and video to the room. [Q1: The avatar publishes both the audio and the video, right? It's not that the agent publishes the audio and the avatar only the video?]
- How does the text stream flow work here? My assumption is that whatever the LLM node generates is sent to the room's text stream without any intermediaries, so the frontend can listen and render it on screen. [Q2: Does this behavior change when an avatar is present? Does the avatar send out text streams when it's there? I want to know all the vectors that introducing an avatar changes.]
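To make my Q1 mental model concrete, here's a toy simulation of the audio routing I'm assuming. These are hypothetical classes, not real LiveKit APIs: the assumption shown is that with an avatar present, the agent's TTS audio is rerouted to the avatar worker, and the avatar participant is the one publishing both tracks to the room.

```python
# Toy model of my assumed audio routing (hypothetical names, NOT LiveKit APIs).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Room:
    published: list = field(default_factory=list)  # (participant, track_kind)

    def publish(self, participant: str, kind: str):
        self.published.append((participant, kind))

class Avatar:
    """Stand-in for an avatar worker (e.g. Tavus): takes TTS audio,
    lip-syncs video to it, and publishes both tracks itself."""
    def __init__(self, room: Room):
        self.room = room

    def on_tts_audio(self, frames: bytes):
        # assumption: the avatar, not the agent, publishes BOTH tracks
        self.room.publish("avatar", "audio")
        self.room.publish("avatar", "video")

class Agent:
    def __init__(self, room: Room, avatar: Optional[Avatar] = None):
        self.room = room
        self.avatar = avatar

    def speak(self, frames: bytes):
        if self.avatar:
            # with an avatar, audio is rerouted to it instead of the room
            self.avatar.on_tts_audio(frames)
        else:
            self.room.publish("agent", "audio")

room = Room()
Agent(room, Avatar(room)).speak(b"tts-frames")
print(room.published)  # expect the avatar publishing both tracks
```

If this model is wrong (e.g. the agent still publishes the audio track itself), that's exactly the correction I'm after.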
Please correct me / add more context. I'm doing this deeper dive because I've been seeing the following issue: text streams get chopped off in the frontend while the audio is still audible. The issue is most prominent when using the Tavus avatar. I was also exploring initializing avatars in parallel, so the audio plumbing needs deeper understanding on my end.
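To make Q2 and the chopped-text symptom concrete, here's a toy sketch of the two behaviors I can imagine (hypothetical functions, not LiveKit APIs): path A forwards LLM tokens to the text stream directly; path B paces text against audio playback, in which case anything that depends on the avatar's playback timing could truncate the tail of the text.

```python
# Toy sketch of two candidate text-stream behaviors (my assumptions, not real APIs).

def direct_stream(tokens):
    # Path A: every token is forwarded as soon as the LLM emits it,
    # independent of audio playback
    return list(tokens)

def playback_paced_stream(tokens, audio_played_ratio):
    # Path B: only the portion "covered" by played-back audio is emitted;
    # if playback timing is cut short, the tail of the text never appears --
    # which would match the chopped-off text I see with the Tavus avatar
    n = int(len(tokens) * audio_played_ratio)
    return list(tokens[:n])

tokens = ["Hel", "lo ", "wor", "ld!"]
print(direct_stream(tokens))               # full text reaches the frontend
print(playback_paced_stream(tokens, 0.5))  # text chopped if playback stops early
```

Knowing which path (or what hybrid) is actually in play with an avatar would tell me where to look for the truncation.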