DTLS timeout after ~10s with TURN/TCP in multi-node setup (v1.9.12)

Hi everyone,

We’re running a self-hosted multi-node LiveKit deployment across two regions (Germany on-prem + Qatar on GCP). Both nodes run on microk8s with hostNetwork: true. Everything works great within the same node, but we’re hitting a consistent DTLS timeout when participants connect cross-region via TURN/TCP.

Would really appreciate any insights from the community or the LiveKit team — especially from anyone running a similar on-prem multi-region setup with Kubernetes.

Setup

  • LiveKit v1.9.12, self-hosted, 2 nodes
  • Node 1: Germany (on-prem), Node 2: Qatar (GCP me-central1)
  • Shared Redis via WireGuard tunnel (~123ms latency)
  • TURN enabled with TLS passthrough via Contour/Envoy on port 443
  • Both nodes registered in Redis, signaling works correctly

Problem

When a participant connects from the Gulf region to a room hosted on the Germany node:

  1. WSS signaling connects fine (proxied via Qatar → Germany)
  2. ICE resolves to Germany TURN server via turns: on TCP 443
  3. Media works for ~10 seconds (audio + video both directions)
  4. Then: dtls timeout: read/write timeout: context deadline exceeded
  5. Video freezes, signaling stays connected

Same-node connections work perfectly. The issue occurs only in cross-node scenarios.

What we tried

  • packet_buffer_size_video: 5000, packet_buffer_size_audio: 2000
  • OS UDP buffers increased to 5MB (rmem_max, wmem_max)
  • Confirmed TCP 7881 connectivity between nodes
  • Measured RTT of ~130ms between client and TURN server
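
For anyone who wants to reproduce the inter-node connectivity check, here is a minimal sketch of roughly what such a test looks like. It only times a plain TCP handshake to a host and port. The function name tcp_connect_time and the command-line usage are illustrative and not part of LiveKit; the address you pass in is whatever the other node is reachable on in your setup.

```python
import socket
import sys
import time


def tcp_connect_time(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the seconds taken to complete a TCP handshake with host:port.

    Raises OSError (including socket.timeout) if the connection fails.
    """
    start = time.monotonic()
    # create_connection resolves the host and performs the TCP handshake.
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


# Only runs when a host and port are given on the command line, e.g.
#   python check_tcp.py <peer-address> 7881
if __name__ == "__main__" and len(sys.argv) == 3:
    host, port = sys.argv[1], int(sys.argv[2])
    try:
        elapsed = tcp_connect_time(host, port)
        print(f"connected to {host}:{port} in {elapsed * 1000:.1f} ms")
    except OSError as exc:
        print(f"connect to {host}:{port} failed: {exc}")
```

Note that this only shows the port is reachable and how long the handshake takes. It does not exercise a long-lived TURN/TCP stream, so it would not catch an idle or stream timeout on a proxy in the path.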

TURN Config

turn:
  enabled: true
  tls_port: 3478
  udp_port: 443

Questions

  1. Is there a configurable DTLS timeout or keepalive interval for high-latency TURN/TCP scenarios?

  2. We’re using Contour/Envoy as a reverse proxy for TLS termination (WSS) and TLS passthrough (TURN). Could this be causing the DTLS timeout? What reverse proxy setup do you recommend for self-hosted on-prem deployments?

  3. For those running multi-node LiveKit on-prem — what does your production setup look like in terms of reverse proxy, TURN, and TLS? Any gotchas with high-latency cross-region TURN/TCP?

  4. Any recommended configuration for self-hosted multi-node deployments with 100ms+ RTT between client and TURN?