Hi,
I see some TTS volume variation coming from ElevenLabs + LiveKit. Is this caused by something on LiveKit’s side or on ElevenLabs’ side? Is there any mechanism to keep the volume consistent from one TTS response to the next within the same conversation?
Thank you.
Can you clarify what you mean?
I am not sure if you mean “volume” as in audio volume or “volume” as in how much usage you have.
If you have some concrete examples (for whichever case you mean), that would be helpful too.
I mean audio volume. In the context of an audio/call bot, the agent says the greeting, the user responds, and so on. Some of the agent responses are louder while others are softer.
The parameters are kept the same throughout the conversation, and I’ve also implemented the ElevenLabs seed concept, which I don’t think you normally support, but I’ve extended the code so that the seed parameter is passed to ElevenLabs.
@Cristi_Constantin, ElevenLabs’ loudness variation between turns isn’t randomness, it’s prosodic adaptation: the model places emphasis (and therefore volume) based on the text content. A greeting and a one-word “yes” naturally generate at different RMS even with stability=1 and a fixed seed. Your seed extension fixes content reproducibility, not per-turn loudness.
Two LK-side knobs that reduce the spread:
from livekit.plugins import elevenlabs

tts = elevenlabs.TTS(
    voice_id="...",
    model="eleven_turbo_v2_5",
    voice_settings=elevenlabs.VoiceSettings(
        stability=0.75,  # higher = less expressive, less swing
        similarity_boost=0.75,
        use_speaker_boost=True,  # passed through to ElevenLabs' boost flag
    ),
)
stability ranges 0.0-1.0; pushing it up flattens prosodic variation. The plugin exposes use_speaker_boost as a passthrough to ElevenLabs’ API [ livekit/agents elevenlabs/tts.py VoiceSettings ].
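For hard consistency across turns, the real fix is post-TTS audio normalization. ElevenLabs has no API parameter that guarantees identical loudness across different texts. The LK Agents pipeline doesn’t bundle a normalizer, but you can wrap the TTS output with a custom AudioFrameProcessor that applies gain toward a fixed target RMS before publish. In practice the gain should be computed over a whole utterance or a smoothed window, since normalizing each short frame independently would also boost pauses and breaths. That keeps the measured level consistent regardless of text content, although RMS only approximates perceived loudness.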
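A minimal sketch of the normalization idea, under stated assumptions: the audio is buffered per utterance as 16-bit signed mono PCM in native byte order, and the names rms_dbfs and normalize_utterance, the -20 dBFS target, and the 12 dB gain cap are all hypothetical choices for illustration. It does not show how to hook into the LiveKit pipeline, which depends on your Agents version, and a streaming setup would need a running level estimate instead of a full buffer.

```python
import array
import math

INT16_MAX = 32767


def rms_dbfs(pcm: bytes) -> float:
    """RMS level of 16-bit signed mono PCM in dBFS (-inf for silence)."""
    samples = array.array("h")
    samples.frombytes(pcm)  # assumes native byte order (little-endian on most hosts)
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    if mean_sq == 0:
        return float("-inf")
    return 20.0 * math.log10(math.sqrt(mean_sq) / INT16_MAX)


def normalize_utterance(
    pcm: bytes,
    target_dbfs: float = -20.0,  # hypothetical target level
    max_gain_db: float = 12.0,   # cap so near-silent turns are not blown up
) -> bytes:
    """Apply one gain to the whole utterance so its RMS approaches the target."""
    level = rms_dbfs(pcm)
    if level == float("-inf"):
        return pcm  # nothing to scale
    gain_db = max(-max_gain_db, min(max_gain_db, target_dbfs - level))
    gain = 10.0 ** (gain_db / 20.0)
    samples = array.array("h")
    samples.frombytes(pcm)
    # Scale and hard-clip to the int16 range to avoid wrap-around.
    scaled = array.array(
        "h",
        (max(-32768, min(INT16_MAX, round(s * gain))) for s in samples),
    )
    return scaled.tobytes()
```

Using a single gain per utterance keeps the dynamics inside a response intact and only evens out the level between responses. A production version would more likely measure loudness in LUFS and use a limiter instead of hard clipping.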
I checked with ElevenLabs as well; there isn’t any solution that fixes this at the root, as their system is non-deterministic. One thing they mentioned is the previous_request_ids parameter, but that is only supported for REST, not for WebSocket (which is what LiveKit uses to connect to ElevenLabs). They also mentioned this related ticket: “Feature Request: Add Voice Similarity Parameters Support to ElevenLabs Plugin” (Issue #3076, livekit/agents on GitHub). In my opinion, davidzhao’s comment there is incorrect, as consistency is not ensured by simply using WebSocket. A better long-term solution would be for LiveKit and ElevenLabs to collaborate on adding support for previous_request_ids in the WebSocket communication mode.