Frameworks, models, and versions that ship inside Voice SDK.
Backend
| Layer | Technology | Version |
|---|---|---|
| Language | Python | 3.10 – 3.15 |
| Web framework | FastAPI | ≥ 0.111.0 |
| ASGI server | Uvicorn | ≥ 0.29.0 |
| WebSocket | websockets | ≥ 12.0 |
| ML framework | PyTorch + torchaudio | CUDA 12.x |
| Audio I/O | soundfile, librosa, pydub (+ ffmpeg) | latest |
| Voice agent | LiveKit Agents SDK | 1.4.6 |
| Database | PostgreSQL | 16 Alpine |
| ORM | SQLAlchemy 2.0 (async) | ≥ 2.0.0 |
| Migrations | Alembic | ≥ 1.13.0 |
Frontend
| Layer | Technology | Version |
|---|---|---|
| Framework | Next.js | 15.x / 16.x |
| UI library | React | 19.x |
| Styling | TailwindCSS | 4.x |
| Language | TypeScript | 5.x |
| Audio | Web Audio API, AudioWorklet | native browser |
| WebRTC | LiveKit Client SDK | 2.17.x |
AI models
| Function | Model | Details |
|---|---|---|
| Noise suppression | DeepFilterNet3 | 48 kHz native, ~50 MB |
| STT (multilingual) | Whisper Large V3 (Systran/faster-whisper-large-v3) | 90+ languages |
| STT (Arabic) | Wav2Vec2 XLSR-53 Arabic | CTC-based |
| TTS (English) | Kokoro-82M (ONNX) | 24 kHz, multiple voices |
| TTS (Arabic + cloning) | XTTS v2 | voice cloning from reference |
| LLM | Qwen2.5-7B-Instruct | served via vLLM |
| VAD | Silero | ONNX, via LiveKit plugin |
| Speaker embedding | ECAPA-TDNN (SpeechBrain) | 192-dim, VoxCeleb2 |
| Diarization | pyannote/speaker-diarization-3.1 | end-to-end |
All models are self-hosted. No data leaves your infrastructure.
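
One way to back the self-hosted claim at runtime is to forbid Hub access from the process that loads the models. The sketch below assumes the models are loaded through `huggingface_hub` and are already in the local cache, which this page does not confirm. `HF_HUB_OFFLINE` and `HF_HUB_DISABLE_TELEMETRY` are real `huggingface_hub` environment variables. `enforce_offline` is a hypothetical helper name.

```python
import os

# Real huggingface_hub settings. Set them before the library is imported,
# because it reads its configuration from the environment at import time.
OFFLINE_ENV = {
    "HF_HUB_OFFLINE": "1",            # use the local cache only, no Hub requests
    "HF_HUB_DISABLE_TELEMETRY": "1",  # do not send usage telemetry
}


def enforce_offline(env=None):
    """Apply the offline settings to `env` (defaults to os.environ) and return it.

    Hypothetical helper for illustration; not part of Voice SDK.
    """
    target = os.environ if env is None else env
    target.update(OFFLINE_ENV)
    return target
```

Calling `enforce_offline()` at the top of the service entry point, before any model library is imported, makes a missing local model fail loudly rather than trigger a download.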
Known constraints
| Constraint | Reason |
|---|---|
| numpy < 2.0 | pyannote uses np.NaN, which NumPy 2.0 removed |
| huggingface_hub < 0.24 | pyannote passes the deprecated use_auth_token argument |
| torch >= 2.4 | SpeechBrain ≥ 1.0 uses torch.amp.custom_fwd |
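
The pins above can be checked at startup. This is a minimal sketch rather than part of Voice SDK: `CONSTRAINTS`, `parse_release`, `check`, and `installed_versions` are hypothetical names, and the version parsing is deliberately simple (numeric release segments only, with pre-release tags ignored). It uses only the standard library.

```python
from importlib import metadata

# Constraints copied from the table above: (package, operator, bound, reason).
CONSTRAINTS = [
    ("numpy", "<", (2, 0), "pyannote uses removed np.NaN"),
    ("huggingface_hub", "<", (0, 24), "pyannote uses deprecated use_auth_token"),
    ("torch", ">=", (2, 4), "SpeechBrain >= 1.0 uses torch.amp.custom_fwd"),
]


def parse_release(version):
    """Return the leading numeric segments, e.g. '2.4.1+cu121' -> (2, 4, 1)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check(installed):
    """Return one message per violated constraint for a {name: version} mapping."""
    problems = []
    for name, op, bound, reason in CONSTRAINTS:
        version = installed.get(name)
        if version is None:
            continue  # package not installed, nothing to check
        release = parse_release(version)
        # Compare only as many segments as the bound has, padding with zeros.
        padded = (release + (0,) * len(bound))[: len(bound)]
        ok = padded < bound if op == "<" else padded >= bound
        if not ok:
            bound_text = ".".join(map(str, bound))
            problems.append(
                f"{name} {version} violates {name} {op} {bound_text}: {reason}"
            )
    return problems


def installed_versions():
    """Look up the installed version of each constrained package, if present."""
    found = {}
    for name, *_ in CONSTRAINTS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    for line in check(installed_versions()):
        print(line)
```

Keeping `check` separate from the metadata lookup means the rules can be tested against a plain dictionary without installing any of the packages.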