Hey everyone,
I’m currently building a Voice AI agent using LiveKit and experimenting with the Cartesia provider (Sonic 3 model).
I have a question about pronunciation control during inference. I ran into what looks like a limitation: there doesn't appear to be a way to load a custom pronunciation dictionary (word → phoneme mapping) at inference time.
From an engineering perspective, I’m trying to understand:
1. Why is there a restriction on loading pronunciation dictionaries during inference?
2. Is this due to latency constraints, model architecture, or provider-level limitations?
3. If we need fine-grained pronunciation control (especially for domain-specific terms, names, etc.), what is the recommended approach?
For example, in my use case, I need consistent and accurate pronunciation for dynamically generated content, and a static preloaded dictionary doesn’t fully solve the problem.
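For context, the workaround I'm experimenting with right now is a pre-synthesis text pass that rewrites known terms into phonetic respellings before the text ever reaches the TTS provider. This is provider-agnostic and purely illustrative — the term list and respellings below are made up for the example — but it shows why a static table falls short for dynamic content:

```python
import re

# Hypothetical respelling table; real domain terms would go here.
RESPELLINGS = {
    "Cartesia": "car-TEE-zha",
    "LiveKit": "live kit",
}

# One alternation, longest keys first so overlapping terms match correctly,
# bounded by \b so we only rewrite whole words.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, RESPELLINGS), key=len, reverse=True)) + r")\b"
)

def respell(text: str) -> str:
    """Substitute known terms with phonetic respellings before synthesis."""
    return _PATTERN.sub(lambda m: RESPELLINGS[m.group(1)], text)

print(respell("Welcome to LiveKit, powered by Cartesia."))
# → Welcome to live kit, powered by car-TEE-zha.
```

The obvious weakness is that this only covers terms I knew about ahead of time, and respellings are a crude approximation compared to actual phoneme-level control — which is exactly why I'm asking about dictionary support at inference time.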
Would really appreciate insights from the team or anyone who has tackled this in production.