Real-time translator agent: architecture feedback wanted

I’m building a real-time translation agent on LiveKit that runs as a hidden room participant, with one STT → translate → TTS pipeline per speaker. Translated audio is published as named tracks and routed via selective subscription based on each participant’s language metadata.

The core design: same-language participants hear raw audio, cross-language participants hear translated tracks. Pipelines only exist when both language groups are present, so same-language rooms incur zero translation cost.
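To make the routing rule concrete, here’s a minimal pure-logic sketch of the design above (function and track names are illustrative, not LiveKit API): same-language listeners get the raw track, cross-language listeners get a translated track, and a pipeline only exists when both language groups are present.

```python
def tracks_needed(participant_langs: dict[str, str]) -> set[tuple[str, str]]:
    """Return the (source_lang, target_lang) pipelines a room requires.

    A same-language room yields an empty set, so it incurs zero
    translation cost; a mixed room yields one pipeline per direction.
    """
    langs = set(participant_langs.values())
    return {(src, dst) for src in langs for dst in langs if src != dst}


def track_for_listener(speaker_lang: str, listener_lang: str) -> str:
    """Decide which track a listener subscribes to for a given speaker."""
    if speaker_lang == listener_lang:
        return "raw"  # same language: hear the speaker directly
    return f"translated-{speaker_lang}-to-{listener_lang}"
```

So an all-English room produces no pipelines at all, while adding one Spanish speaker spins up the en→es and es→en pair.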

I wrote up the full architecture, latency budget, and cost analysis here: Building a Real-Time Translator Agent on LiveKit: Architecture, Latency, and Cost

A few open questions I’d especially appreciate input on:

- Has anyone hit a practical ceiling on concurrent pipelines per agent process? At 10 participants I’d have 10 simultaneous STT + translation + TTS streams through one process.

- For asymmetric rooms (e.g. 8 English speakers, 1 Spanish speaker), that one person receives 8 translated tracks. Has anyone dealt with mixing that many agent-published tracks on the client side?

- Thoughts on auto-dispatching the agent when `translator_langs` is set in room metadata vs. explicit dispatch?
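For the auto-dispatch option, the gate I have in mind is roughly the sketch below: parse the room metadata and only dispatch when `translator_langs` names at least two languages (the metadata shape and threshold here are my assumptions, not a LiveKit convention).

```python
import json


def should_dispatch(room_metadata: str) -> bool:
    """Decide whether to dispatch the translator agent for a room.

    Assumes room metadata is a JSON object with an optional
    "translator_langs" list; translation only makes sense when at
    least two languages are requested.
    """
    try:
        meta = json.loads(room_metadata or "{}")
    except json.JSONDecodeError:
        return False  # malformed metadata: fail closed, no dispatch
    langs = meta.get("translator_langs", [])
    return isinstance(langs, list) and len(langs) >= 2
```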

Happy to share more details on any part of the design.

When I worked on a similar project in the past, the big win was keying each translated track by language, so every participant who wants that language subscribes to the same track.

In the project I worked on it was a single speaker with many listeners, which made things easier. Not sure if you have a similar use case or not.