Tech stack

Frameworks, models, and versions that ship inside Voice SDK.

Backend

Layer	Technology	Version
Language	Python	3.10 – 3.15
Web framework	FastAPI	≥ 0.111.0
ASGI server	Uvicorn	≥ 0.29.0
WebSocket	`websockets`	≥ 12.0
ML framework	PyTorch + torchaudio	CUDA 12.x
Audio I/O	`soundfile`, `librosa`, `pydub` (+ `ffmpeg`)	latest
Voice agent	LiveKit Agents SDK	1.4.6
Database	PostgreSQL	16 (Alpine image)
ORM	SQLAlchemy 2.0 (async)	≥ 2.0.0
Migrations	Alembic	≥ 1.13.0

Frontend

Layer	Technology	Version
Framework	Next.js	15.x / 16.x
UI library	React	19.x
Styling	TailwindCSS	4.x
Language	TypeScript	5.x
Audio	Web Audio API, AudioWorklet	native browser
WebRTC	LiveKit Client SDK	2.17.x

AI models

Function	Model	Details
Noise suppression	DeepFilterNet3	48 kHz native, ~50 MB
STT (multilingual)	Whisper Large V3 (`Systran/faster-whisper-large-v3`)	90+ languages
STT (Arabic)	Wav2Vec2 XLSR-53 Arabic	CTC-based
TTS (English)	Kokoro-82M (ONNX)	24 kHz, multiple voices
TTS (Arabic + cloning)	XTTS v2	voice cloning from reference
LLM	Qwen2.5-7B-Instruct	served via vLLM
VAD	Silero	ONNX, via LiveKit plugin
Speaker embedding	ECAPA-TDNN (SpeechBrain)	192-dim, VoxCeleb2
Diarization	`pyannote/speaker-diarization-3.1`	end-to-end

All models are self-hosted. No data leaves your infrastructure.
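
Speaker embeddings such as the 192-dimensional ECAPA-TDNN vectors listed above are commonly compared with cosine similarity. The following is a minimal sketch of that comparison in plain Python, not Voice SDK's own code: the `same_speaker` helper and its `0.25` threshold are hypothetical placeholders, and any real threshold would need tuning on your own data.

```python
import math

EMBEDDING_DIM = 192  # ECAPA-TDNN embedding size from the table above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    if len(a) != EMBEDDING_DIM or len(b) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM}-dim embeddings")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def same_speaker(a: list[float], b: list[float], threshold: float = 0.25) -> bool:
    """Hypothetical helper: the threshold is an illustrative assumption."""
    return cosine_similarity(a, b) >= threshold
```

Embeddings that point in the same direction score close to 1 regardless of their magnitude, which is why the vectors do not need to be normalised beforehand.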

Known constraints

Constraint	Reason
`numpy < 2.0`	pyannote references `np.NaN`, which NumPy 2.0 removed
`huggingface_hub < 0.24`	pyannote passes the deprecated `use_auth_token` argument
`torch >= 2.4`	SpeechBrain ≥ 1.0 uses `torch.amp.custom_fwd`, which was added in PyTorch 2.4
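
The pins above can be checked against a running environment. The following is a rough sketch using only the standard library; `parse_release`, `satisfies`, and `check_installed` are hypothetical helper names, and the version parsing is deliberately simplified (it reads only the leading numeric release and ignores pre-release tags), so treat it as a sanity check and not a substitute for your dependency resolver.

```python
from importlib import metadata

# Bounds copied from the "Known constraints" table.
# Each entry: distribution name -> (operator, bound as a tuple of ints).
CONSTRAINTS = {
    "numpy": ("<", (2, 0)),
    "huggingface_hub": ("<", (0, 24)),
    "torch": (">=", (2, 4)),
}


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release, e.g. '2.4.1+cu121' -> (2, 4, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment such as 'dev1'
        parts.append(int(digits))
    return tuple(parts)


def satisfies(version: str, op: str, bound: tuple[int, ...]) -> bool:
    """Compare a version string against a bound, zero-padding to equal length."""
    release = parse_release(version)
    width = max(len(release), len(bound))
    release += (0,) * (width - len(release))
    bound += (0,) * (width - len(bound))
    if op == "<":
        return release < bound
    if op == ">=":
        return release >= bound
    raise ValueError(f"unsupported operator: {op}")


def check_installed() -> dict[str, str]:
    """Map each constrained package to a short status string."""
    report = {}
    for name, (op, bound) in CONSTRAINTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        wanted = f"{op} {'.'.join(map(str, bound))}"
        ok = satisfies(installed, op, bound)
        report[name] = f"{installed} ok" if ok else f"{installed} violates {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_installed().items():
        print(f"{package}: {status}")
```

Because pre-release tags are ignored, a build such as `2.0.0rc1` is treated as `2.0.0` and reported as violating `numpy < 2.0`, which errs on the cautious side.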
