Hi, I’m running a LiveKit Agents JS voice worker and trying to use ElevenLabs TTS with eleven_v3.
Environment:
@livekit/agents: 1.3.2
@livekit/agents-plugin-elevenlabs: 1.3.2
@livekit/rtc-node: 0.13.27
Runtime: Node 20+
LiveKit Cloud server: 1.10.1
My TTS config is roughly:
tts: new elevenlabs.TTS({
  apiKey: process.env.ELEVENLABS_API_KEY,
  model: "eleven_v3",
  voiceId: process.env.ELEVENLABS_DUTCH_VOICE_ID,
  language: "nl",
})
When a SIP call starts, the worker joins the room correctly. STT, room connection, and egress all appear fine, but the first agent speech fails with:
WebSocket connection error: Unexpected server response: 403
failed to synthesize speech, retrying...
LiveKit agent session emitted error
source: TTS
errorName: APIConnectionError
errorMessage: could not connect to ElevenLabs
After looking through the plugin source, it seems the default AgentSession TTS path uses the ElevenLabs plugin’s streaming path, which opens the multi-context WebSocket endpoint:
wss://api.elevenlabs.io/v1/text-to-speech/{voiceId}/multi-stream-input?model_id=eleven_v3...
I also reproduced the WebSocket handshake directly with the same API key and voice ID:
eleven_v3 -> 403
eleven_flash_v2_5 -> opens
eleven_turbo_v2_5 -> opens
The ElevenLabs docs state that multi-context WebSockets are not available for eleven_v3. They also say v3 is available through the Create Speech / Stream Speech HTTP endpoints by specifying model_id: "eleven_v3".
So this looks like a model/transport mismatch: the LiveKit ElevenLabs plugin accepts eleven_v3 as a model string, but the default AgentSession streaming path uses an ElevenLabs WebSocket endpoint that does not support that model.
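To make the mismatch concrete, here is a minimal sketch of the kind of guard that could catch it before any socket is opened. Everything in it is hypothetical: `MODELS_WITHOUT_MULTI_CONTEXT_WS`, `pickTransport`, and `assertWebSocketSupported` are names I made up, not part of @livekit/agents-plugin-elevenlabs, and the model set only contains what I observed above.

```typescript
// Hypothetical guard, NOT part of the LiveKit ElevenLabs plugin.
// The set reflects only what is reported in this post: eleven_v3 is
// rejected (403) by the multi-context WebSocket endpoint.
const MODELS_WITHOUT_MULTI_CONTEXT_WS: ReadonlySet<string> = new Set([
  "eleven_v3",
]);

type Transport = "websocket" | "http";

// Choose a transport for a model: HTTP for models known to lack
// multi-context WebSocket support, WebSocket otherwise.
function pickTransport(model: string): Transport {
  return MODELS_WITHOUT_MULTI_CONTEXT_WS.has(model) ? "http" : "websocket";
}

// Fail fast with a descriptive error instead of a 403 during the handshake.
function assertWebSocketSupported(model: string): void {
  if (pickTransport(model) !== "websocket") {
    throw new Error(
      `ElevenLabs model "${model}" is not available over the multi-context ` +
        "WebSocket endpoint; use the HTTP Create Speech / Stream Speech " +
        "endpoints instead.",
    );
  }
}
```

With a check like this, constructing the TTS with eleven_v3 would fail at startup with a clear message rather than on the first utterance of a live call.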
Questions:
- Is eleven_v3 intended to be unsupported with @livekit/agents-plugin-elevenlabs inside AgentSession?
- Should the plugin validate this earlier and reject eleven_v3 for WebSocket streaming?
- If I want to keep eleven_v3, is the recommended path:
  - LiveKit Inference with elevenlabs/eleven_v3, assuming supported/default voices,
  - A custom ttsNode / custom TTS adapter that calls ElevenLabs HTTP Stream Speech sentence-by-sentence, or
  - Something else?
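For the custom-adapter option, this is roughly what I have in mind: a rough sketch, not a working LiveKit TTS implementation. The helper names (`splitSentences`, `buildSpeechRequest`, `streamSpeech`) are my own. The request shape (POST to /v1/text-to-speech/{voiceId}/stream with an xi-api-key header and a JSON body containing text and model_id) and the output_format=pcm_24000 query value are my reading of the ElevenLabs HTTP API and should be checked against the current docs. Turning the returned bytes into audio frames for the agent is left out.

```typescript
// Sketch only: endpoint path, header, and body fields are assumptions
// based on my reading of the ElevenLabs Stream Speech HTTP API.
const ELEVENLABS_BASE = "https://api.elevenlabs.io/v1";

interface SpeechRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Naive splitter: breaks after ., ! or ? followed by whitespace.
function splitSentences(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Build one HTTP request per sentence (no network access here).
function buildSpeechRequest(
  voiceId: string,
  apiKey: string,
  text: string,
  model = "eleven_v3",
): SpeechRequest {
  return {
    // output_format value is an assumption; adjust to what the account supports.
    url: `${ELEVENLABS_BASE}/text-to-speech/${encodeURIComponent(voiceId)}/stream?output_format=pcm_24000`,
    init: {
      method: "POST",
      headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ text, model_id: model }),
    },
  };
}

// Synthesize sentence by sentence, yielding raw audio chunks in order.
async function* streamSpeech(
  text: string,
  voiceId: string,
  apiKey: string,
): AsyncGenerator<Uint8Array> {
  for (const sentence of splitSentences(text)) {
    const { url, init } = buildSpeechRequest(voiceId, apiKey, sentence);
    const res = await fetch(url, init);
    if (!res.ok || !res.body) {
      throw new Error(`ElevenLabs HTTP request failed with status ${res.status}`);
    }
    const reader = res.body.getReader();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      if (value) yield value;
    }
  }
}
```

The obvious trade-off is that each sentence is a separate HTTP round trip, so time to first audio and prosody across sentence boundaries will likely be worse than with the WebSocket path. That is part of why I am asking whether there is a recommended approach.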