Prerequisites
What your host needs before running Voice SDK locally or in production.
Hardware
| Requirement | Minimum | Recommended |
|---|---|---|
| NVIDIA GPU | 1× (8 GB VRAM) | 4×+ (24 GB+ each) |
| CUDA | 12.1 | 12.6 |
| RAM | 16 GB | 64 GB+ |
| Storage | 50 GB | 200 GB+ (model cache) |
The recommended footprint matches the production layout — GPUs are allocated per service so STT/TTS, denoising, biometrics, and the LLM run in parallel without contention. See Deployment for the default GPU mapping.
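As a rough way to compare a host against the hardware table, the sketch below counts GPUs whose reported memory meets the minimum. It assumes `nvidia-smi` is on the PATH. The helper names and the 7,600 MiB threshold are illustrative assumptions (drivers can report slightly less than a card's nominal capacity), not part of Voice SDK.

```python
import subprocess

# Assumption: "8 GB" is approximated as 7,600 MiB because the driver may
# report a little less than the nominal capacity. Tune for your hardware.
MIN_VRAM_MIB = 7600


def parse_vram_mib(csv_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def usable_gpu_count(vram_mib: list[int], min_mib: int = MIN_VRAM_MIB) -> int:
    """Count GPUs whose total memory is at least min_mib."""
    return sum(1 for mib in vram_mib if mib >= min_mib)


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for total memory per GPU (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)
```

Calling `usable_gpu_count(query_vram_mib())` on the target host gives a number to compare against the 1× minimum and 4×+ recommendation.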
Operating system
| Component | Minimum | Recommended |
|---|---|---|
| OS | Ubuntu 22.04 | Ubuntu 24.04 |
| Docker | 24.x | Latest |
| ffmpeg | Required | Latest |
| Python | 3.10 | 3.13 |
| Node.js | 18.x | 22.x |
ffmpeg is required for MP3, M4A, AAC, and other compressed-format support across the noise suppression and STT/TTS services.
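A preflight check along these lines can catch a missing ffmpeg or an old interpreter before any service starts. This is a minimal sketch: the executable names in `REQUIRED_TOOLS` and both function names are assumptions for illustration, and it only checks that each tool exists on the PATH, not its version.

```python
import shutil
import sys

# Assumption: these are the executable names on your PATH.
REQUIRED_TOOLS = ("ffmpeg", "docker", "node")
MIN_PYTHON = (3, 10)  # minimum from the table above


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """True when the interpreter is at least the minimum major.minor."""
    return tuple(version[:2]) >= minimum
```

For example, `missing_tools()` returns an empty list on a host where all three executables resolve, and `python_ok()` reports whether the running interpreter meets the 3.10 minimum.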
Tokens & secrets
| Variable | Required for | How to obtain |
|---|---|---|
| HF_TOKEN | Voice Biometrics (pyannote diarization) | HuggingFace account → access token |
| LIVEKIT_API_KEY / LIVEKIT_API_SECRET | Voice Agent | Generated by the LiveKit server (devkey / secret in dev) |
Voice Biometrics will fall back to the SpeechBrain sliding-window diarizer when HF_TOKEN is absent — you lose pyannote's accuracy but keep functionality.
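The fallback described above can be pictured as a simple environment check. The sketch below is illustrative only: the function names and the returned backend labels are hypothetical, and the actual selection logic inside Voice Biometrics may differ.

```python
import os
from typing import Mapping


def pick_diarizer(env: Mapping[str, str] = os.environ) -> str:
    """Return a hypothetical backend label based on whether HF_TOKEN is set."""
    token = env.get("HF_TOKEN", "").strip()
    return "pyannote" if token else "speechbrain-sliding-window"


def missing_livekit_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """List the LiveKit variables that are unset or empty."""
    names = ("LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")
    return [name for name in names if not env.get(name, "").strip()]
```

Running both checks at startup makes it clear up front whether diarization will use the fallback and whether the Voice Agent has the credentials it needs.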
Network
- HTTPS reverse proxy (Nginx is the reference) for path-based dispatch across modules
- WebSocket support (/noise/ws/) and WebRTC support (/livekit) on the proxy
- Outbound access to huggingface.co and pypi.org for first-time model and dependency downloads
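To confirm the outbound requirement before the first model download, a TCP connection test to each host on port 443 is usually enough. The sketch below is a minimal check under that assumption. The function name is hypothetical, and it does not validate TLS or any proxy configuration.

```python
import socket

# Hosts named in the list above; port 443 (HTTPS) is an assumption.
HOSTS = ("huggingface.co", "pypi.org")


def unreachable_hosts(hosts=HOSTS, port=443, timeout=5.0,
                      connect=socket.create_connection) -> list[str]:
    """Return the hosts for which a TCP connection could not be opened."""
    failed = []
    for host in hosts:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            failed.append(host)
        else:
            conn.close()
    return failed
```

An empty result from `unreachable_hosts()` suggests the host can reach both services. A non-empty result names the hosts to look at in firewall or proxy rules.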