Pronunciation Dictionary Limitations in Inference (Cartesia Sonic 3)

Hey everyone,

I’m currently building a Voice AI agent using LiveKit and experimenting with the Cartesia provider (Sonic 3 model).

I had a question regarding pronunciation control during inference. I noticed there seems to be a limitation when loading a custom pronunciation dictionary (word → phoneme mapping).

From an engineering perspective, I’m trying to understand:

1. Why is there a restriction on loading pronunciation dictionaries during inference?

2. Is this due to latency constraints, model architecture, or provider-level limitations?

3. If we need fine-grained pronunciation control (especially for domain-specific terms, names, etc.), what is the recommended approach?

For example, in my use case, I need consistent and accurate pronunciation for dynamically generated content, and a static preloaded dictionary doesn’t fully solve the problem.
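In case it helps frame the question: the workaround I've been experimenting with is a client-side text pre-processing pass that rewrites domain-specific terms into phonetic respellings before the text ever reaches the TTS provider. This is a minimal sketch; the mapping entries are purely illustrative, and it obviously can't match the precision of a real phoneme dictionary:

```python
import re

# Illustrative respellings only -- not real phoneme notation.
PRONUNCIATIONS = {
    "LiveKit": "Live Kit",
    "Cartesia": "car TEE zee ah",
    "kubectl": "kube control",
}

def apply_pronunciations(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word matches with their respelling (case-sensitive)."""
    if not mapping:
        return text
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

print(apply_pronunciations("Deploy with kubectl on LiveKit", PRONUNCIATIONS))
# → Deploy with kube control on Live Kit
```

This works for dynamically generated content because the substitution runs on every utterance at request time, but it's brittle for anything where spelling alone can't force the right pronunciation, which is why I'm asking about dictionary support during inference.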

Would really appreciate insights from the team or anyone who has tackled this in production.

For Cartesia, the pronunciation_dict_id option is available in Agents for both Python and JS; however, it doesn't seem to be documented (I'll follow up on that).


The pronunciation_dict_id option takes a unique identifier corresponding to a dictionary stored in your Cartesia account, so it would be more difficult (though surely not impossible) for us to expose through LiveKit Inference. That is most likely the real reason it hasn't been implemented (yet).
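For anyone landing here, this is roughly what passing the option through the Agents Cartesia plugin looks like. A hedged sketch only: the option is undocumented at the time of writing, so verify the exact parameter name against the plugin source, and the model string and dictionary ID below are placeholders:

```python
# Sketch of configuring the LiveKit Agents Cartesia TTS plugin with a
# pronunciation dictionary. "sonic-3" and "YOUR_DICT_ID" are placeholders;
# the pronunciation_dict_id parameter is undocumented, so confirm it in
# the plugin source for your installed version.
from livekit.plugins import cartesia

tts = cartesia.TTS(
    model="sonic-3",
    # ID of a pronunciation dictionary already created in your
    # Cartesia account.
    pronunciation_dict_id="YOUR_DICT_ID",
)
```

Since the ID references a dictionary stored in your own Cartesia account, this path requires your own Cartesia API key rather than LiveKit Inference.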

You have two options in the short term: