Hi everyone
We’re running a self-hosted multi-node LiveKit deployment across two regions (Germany on-prem + Qatar on GCP). Both nodes run on microk8s with hostNetwork: true. Everything works great within the same node, but we’re hitting a consistent DTLS timeout when participants connect cross-region via TURN/TCP.
Would really appreciate any insights from the community or the LiveKit team — especially from anyone running a similar on-prem multi-region setup with Kubernetes.
Setup
- LiveKit v1.9.12, self-hosted, 2 nodes
- Node 1: Germany (on-prem), Node 2: Qatar (GCP me-central1)
- Shared Redis via WireGuard tunnel (~123ms latency)
- TURN enabled with TLS passthrough via Contour/Envoy on port 443
- Both nodes registered in Redis, signaling works correctly
Problem
When a participant connects from the Gulf region to a room hosted on the Germany node:
- WSS signaling connects fine (proxied via Qatar → Germany)
- ICE resolves to Germany TURN server via turns: on TCP 443
- Media works for ~10 seconds (audio + video both directions)
- Then: dtls timeout: read/write timeout: context deadline exceeded
- Video freezes, signaling stays connected
Same-node connections work perfectly. Issue only occurs in cross-node scenarios.
What we tried
- packet_buffer_size_video: 5000, packet_buffer_size_audio: 2000
- OS UDP buffers increased to 5MB (rmem_max, wmem_max)
- Confirmed TCP 7881 connectivity between nodes
- RTT ~130ms between client and TURN server
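A minimal Python sketch of how the connectivity and buffer checks above could be reproduced. It is an illustration, not the exact commands used: the function names `tcp_connect_ms` and `read_core_sysctl` are hypothetical, and the peer address in the usage comment is a placeholder. TCP connect time is only a rough stand-in for one round trip.

```python
import socket
import time
from pathlib import Path


def tcp_connect_ms(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the time in milliseconds taken to complete a TCP handshake.

    Raises OSError (including timeouts) if the port cannot be reached.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        # The handshake is complete once create_connection returns.
        return (time.perf_counter() - start) * 1000.0


def read_core_sysctl(name: str, base: str = "/proc/sys/net/core") -> int:
    """Read an integer sysctl such as rmem_max or wmem_max (Linux only)."""
    return int(Path(base, name).read_text().strip())


# Example usage (placeholder address, replace with the other node's IP):
#   print(tcp_connect_ms("10.0.0.2", 7881))
#   print(read_core_sysctl("rmem_max"), read_core_sysctl("wmem_max"))
```

Running the first call from each node toward the other shows whether TCP 7881 is reachable in both directions and gives a rough latency figure; the second confirms the buffer limits the kernel is actually using.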
TURN Config
turn:
  enabled: true
  tls_port: 443
  udp_port: 3478
Questions
- Is there a configurable DTLS timeout or keepalive interval for high-latency TURN/TCP scenarios?
- We’re using Contour/Envoy as a reverse proxy for TLS termination (WSS) and TLS passthrough (TURN). Could this be causing the DTLS timeout? What reverse proxy setup do you recommend for self-hosted on-prem deployments?
- For those running multi-node LiveKit on-prem: what does your production setup look like in terms of reverse proxy, TURN, and TLS? Any gotchas with high-latency cross-region TURN/TCP?
- Any recommended configuration for self-hosted multi-node deployments with 100ms+ RTT between client and TURN?