
Prerequisites

What your host needs before running Voice SDK locally or in production.

Hardware

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| NVIDIA GPU | 1× (8 GB VRAM) | 4×+ (24 GB+ each) |
| CUDA | 12.1 | 12.6 |
| RAM | 16 GB | 64 GB+ |
| Storage | 50 GB | 200 GB+ (model cache) |

The recommended footprint matches the production layout: GPUs are allocated per service so that STT/TTS, denoising, biometrics, and the LLM run in parallel without contention. See Deployment for the default GPU mapping.
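The minimums in the hardware table can be checked before installation. The following is a minimal sketch, not part of Voice SDK: the function names are hypothetical, the 10% tolerance is an assumption (nominal "8 GB" or "16 GB" hardware often reports slightly less), and treating the storage minimum as free space rather than total capacity is also an assumption. It relies only on `nvidia-smi` and the Python standard library, and the RAM helper is Linux-only.

```python
import os
import shutil
import subprocess

# Minimums taken from the hardware table above.
MIN_GPUS = 1
MIN_VRAM_GB = 8
MIN_RAM_GB = 16
MIN_DISK_GB = 50
TOLERANCE = 0.9  # assumption: accept hardware reporting up to 10% under nominal


def parse_gpu_memory(csv_text):
    """Return per-GPU total memory in MiB from nvidia-smi CSV output."""
    stripped = (line.strip() for line in csv_text.splitlines())
    return [int(value) for value in stripped if value.isdigit()]


def query_gpu_memory():
    """Ask nvidia-smi for each GPU's total memory; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_memory(result.stdout) if result.returncode == 0 else []


def total_ram_gb():
    """Total physical RAM in GiB (Linux only, via sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def free_disk_gb(path="/"):
    """Free space in GiB on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def hardware_problems(gpu_mib, ram_gb, disk_gb):
    """Compare measured values against the minimums; return a list of problems."""
    problems = []
    usable = [m for m in gpu_mib if m >= MIN_VRAM_GB * 1024 * TOLERANCE]
    if len(usable) < MIN_GPUS:
        problems.append(f"need {MIN_GPUS} GPU(s) with about {MIN_VRAM_GB} GB VRAM")
    if ram_gb < MIN_RAM_GB * TOLERANCE:
        problems.append(f"RAM {ram_gb:.1f} GB is below {MIN_RAM_GB} GB")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"free disk {disk_gb:.1f} GB is below {MIN_DISK_GB} GB")
    return problems
```

On a target host, `hardware_problems(query_gpu_memory(), total_ram_gb(), free_disk_gb())` returns an empty list when all three minimums are met. The CUDA version is not covered here and should be checked separately.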

Operating system

| Component | Minimum | Recommended |
| --- | --- | --- |
| OS | Ubuntu 22.04 | Ubuntu 24.04 |
| Docker | 24.x | Latest |
| ffmpeg | Required | Latest |
| Python | 3.10 | 3.13 |
| Node.js | 18.x | 22.x |

ffmpeg is required for MP3, M4A, AAC, and other compressed-format support across the noise suppression and STT/TTS services.
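The software minimums above can be verified in a similar way. This sketch is illustrative only: the helper names are hypothetical, it assumes the `docker` and `node` binaries are on `PATH` and print a dotted version number in response to `--version`, and it checks only that `ffmpeg` is present, since the table sets no minimum version for it.

```python
import re
import shutil
import subprocess

# Minimum major versions from the table above.
MIN_MAJOR = {"docker": 24, "node": 18}
MIN_PYTHON = (3, 10)


def major_version(text):
    """Extract the major number from the first dotted version in `text`."""
    match = re.search(r"(\d+)\.\d+", text)
    return int(match.group(1)) if match else None


def tool_version_output(tool):
    """Return the output of `<tool> --version`, or None if the tool is missing."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run(
        [tool, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout or result.stderr


def software_problems(version_outputs, ffmpeg_present, python_version):
    """Compare collected versions against the minimums; return a list of problems."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python is older than 3.10")
    if not ffmpeg_present:
        problems.append("ffmpeg not found on PATH")
    for tool, minimum in MIN_MAJOR.items():
        major = major_version(version_outputs.get(tool) or "")
        if major is None:
            problems.append(f"{tool}: not found or version unreadable")
        elif major < minimum:
            problems.append(f"{tool}: major version {major} is below {minimum}")
    return problems
```

A typical call gathers the inputs first: `software_problems({t: tool_version_output(t) for t in MIN_MAJOR}, shutil.which("ffmpeg") is not None, sys.version_info[:2])`, after `import sys`. An empty list means the checked components meet the minimums.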

Tokens & secrets

| Variable | Required for | How to obtain |
| --- | --- | --- |
| HF_TOKEN | Voice Biometrics (pyannote diarization) | HuggingFace account → access token |
| LIVEKIT_API_KEY / LIVEKIT_API_SECRET | Voice Agent | Generated by the LiveKit server (devkey / secret in dev) |

Voice Biometrics falls back to the SpeechBrain sliding-window diarizer when HF_TOKEN is absent: you lose pyannote's accuracy but keep functionality.
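The effect of these variables can be summarized in a small startup check. The sketch below only mirrors the behavior described above and is not the SDK's actual selection code: the function names and the returned labels are hypothetical, and treating an empty or whitespace-only `HF_TOKEN` as absent is an assumption.

```python
import os


def choose_diarizer(env=None):
    """Return which diarizer would be used, given the environment variables."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    # Labels are illustrative, not identifiers from the SDK.
    return "pyannote" if token else "speechbrain-sliding-window"


def missing_livekit_vars(env=None):
    """List the LiveKit variables that Voice Agent needs but that are unset."""
    env = os.environ if env is None else env
    required = ("LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")
    return [name for name in required if not (env.get(name) or "").strip()]
```

Logging the result of `choose_diarizer()` at startup makes the fallback visible, so reduced diarization accuracy is not mistaken for a model problem.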

Network

  • HTTPS reverse proxy (Nginx is the reference) for path-based dispatch across modules
  • WebSocket support (/noise/ws/) and WebRTC support (/livekit) on the proxy
  • Outbound access to huggingface.co and pypi.org for first-time model and dependency downloads
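The outbound requirement in the list above can be probed before the first model download. This is a rough sketch with hypothetical function names. It only opens a direct TCP connection to port 443, so it does not validate TLS and will report a host as unreachable when outbound traffic must go through an HTTP proxy.

```python
import socket

# Hosts named in the network requirements above.
REQUIRED_HOSTS = ("huggingface.co", "pypi.org")


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False


def unreachable_hosts(hosts=REQUIRED_HOSTS, probe=can_connect):
    """Return the subset of `hosts` for which `probe` reports failure."""
    return [host for host in hosts if not probe(host)]
```

Calling `unreachable_hosts()` with no arguments checks both hosts. The `probe` parameter exists so the logic can be exercised without network access.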