I’m building a real-time translation agent on LiveKit that joins each room as a hidden participant, runs one STT → translate → TTS pipeline per speaker, publishes the translated audio as named tracks, and routes those tracks via selective subscription based on each participant’s language metadata.
The core design: same-language participants hear raw audio, cross-language participants hear translated tracks. Pipelines spin up only when more than one language group is present, so single-language rooms incur zero translation cost.
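To make the routing rule concrete, here’s a minimal sketch of the subscription decision, independent of any SDK. The `lang` metadata field and the `mic-<identity>` / `translated-<identity>-<lang>` track-naming scheme are my assumptions for illustration, not LiveKit APIs:

```python
from dataclasses import dataclass

@dataclass
class Participant:
    identity: str
    lang: str  # assumed to come from participant metadata, e.g. {"lang": "es"}

def tracks_to_subscribe(listener: Participant, speakers: list[Participant]) -> list[str]:
    """Track names a listener should subscribe to: the raw mic track for
    same-language speakers, the agent-published translated track otherwise."""
    tracks = []
    for s in speakers:
        if s.identity == listener.identity:
            continue  # never subscribe to your own audio
        if s.lang == listener.lang:
            tracks.append(f"mic-{s.identity}")  # raw audio, no translation cost
        else:
            tracks.append(f"translated-{s.identity}-{listener.lang}")
    return tracks
```

In a real client this decision would drive per-track subscription calls; the point is that routing is purely a function of (listener language, speaker language), so the agent never needs to mix audio server-side.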
I wrote up the full architecture, latency budget, and cost analysis here: Building a Real-Time Translator Agent on LiveKit: Architecture, Latency, and Cost
A few open questions I’d especially appreciate input on:
- Has anyone hit a practical ceiling on concurrent pipelines per agent process? At 10 participants I’d have 10 simultaneous STT + translation + TTS streams through one process.
- For asymmetric rooms (e.g. 8 English speakers, 1 Spanish speaker), that one person receives 8 translated tracks. Anyone dealt with mixing that many agent-published tracks on the client side?
- Thoughts on auto-dispatching the agent when `translator_langs` is set in room metadata vs. explicit dispatch?
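To quantify the first two questions, here’s the back-of-envelope math under my reading of the design (one pipeline per speaker, pipelines only when at least two language groups exist; both helper names are mine):

```python
def pipeline_count(langs: list[str]) -> int:
    """Pipelines the agent runs for a room: one per speaker, but only
    when multiple language groups are present (per the design, a
    single-language room costs nothing)."""
    return len(langs) if len(set(langs)) >= 2 else 0

def translated_fan_in(langs: list[str], listener_lang: str) -> int:
    """Translated tracks a listener receives: one per speaker outside
    their own language group."""
    return sum(1 for lang in langs if lang != listener_lang)
```

So the 8-English/1-Spanish room runs 9 pipelines, and the Spanish participant’s client has to mix 8 agent-published tracks; fan-in grows linearly with the size of the *other* language groups, which is why the asymmetric case is the one I’m most worried about.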
Happy to share more details on any part of the design.