Tech stack

Frameworks, models, and versions that ship inside Voice SDK.

Backend

| Layer | Technology | Version |
|---|---|---|
| Language | Python | 3.10 – 3.15 |
| Web framework | FastAPI | ≥ 0.111.0 |
| ASGI server | Uvicorn | ≥ 0.29.0 |
| WebSocket | websockets | ≥ 12.0 |
| ML framework | PyTorch + torchaudio | CUDA 12.x |
| Audio I/O | soundfile, librosa, pydub (+ ffmpeg) | latest |
| Voice agent | LiveKit Agents SDK | 1.4.6 |
| Database | PostgreSQL | 16 Alpine |
| ORM | SQLAlchemy 2.0 (async) | ≥ 2.0.0 |
| Migrations | Alembic | ≥ 1.13.0 |
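
The supported Python range in the Language row can be checked at startup. The following is a minimal sketch, not part of Voice SDK: the function name `python_supported` and both constants are hypothetical, and it assumes the range 3.10 – 3.15 is inclusive at both ends.

```python
import sys

# Bounds taken from the Language row of the Backend table.
# Assumption: both ends of the range are inclusive.
SUPPORTED_MIN = (3, 10)
SUPPORTED_MAX = (3, 15)


def python_supported(version_info=None):
    """Return True if the (major, minor) version falls inside the supported range.

    `version_info` defaults to the running interpreter; pass a tuple such as
    (3, 12, 0) to check another version.
    """
    info = sys.version_info if version_info is None else version_info
    major_minor = tuple(info[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


if __name__ == "__main__":
    running = ".".join(str(part) for part in sys.version_info[:3])
    if python_supported():
        print(f"Python {running} is inside the supported range")
    else:
        print(f"Python {running} is outside the supported range")
```

Comparing only the `(major, minor)` pair means any patch release of a supported minor version passes.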

Frontend

| Layer | Technology | Version |
|---|---|---|
| Framework | Next.js | 15.x / 16.x |
| UI library | React | 19.x |
| Styling | TailwindCSS | 4.x |
| Language | TypeScript | 5.x |
| Audio | Web Audio API, AudioWorklet | native browser |
| WebRTC | LiveKit Client SDK | 2.17.x |

AI models

| Function | Model | Dimensions / details |
|---|---|---|
| Noise suppression | DeepFilterNet3 | 48 kHz native, ~50 MB |
| STT (multilingual) | Whisper Large V3 (Systran/faster-whisper-large-v3) | 90+ languages |
| STT (Arabic) | Wav2Vec2 XLSR-53 Arabic | CTC-based |
| TTS (English) | Kokoro-82M (ONNX) | 24 kHz, multiple voices |
| TTS (Arabic + cloning) | XTTS v2 | voice cloning from reference |
| LLM | Qwen2.5-7B-Instruct | served via vLLM |
| VAD | Silero | ONNX, via LiveKit plugin |
| Speaker embedding | ECAPA-TDNN (SpeechBrain) | 192-dim, VoxCeleb2 |
| Diarization | pyannote/speaker-diarization-3.1 | end-to-end |

All models are self-hosted. No data leaves your infrastructure.

Known constraints

| Constraint | Reason |
|---|---|
| numpy < 2.0 | pyannote uses removed np.NaN |
| huggingface_hub < 0.24 | pyannote uses deprecated use_auth_token |
| torch >= 2.4 | SpeechBrain ≥ 1.0 uses torch.amp.custom_fwd |
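
The constraints above can be verified against an installed environment before the models are loaded. The following is a rough sketch using only the standard library, not something Voice SDK provides: `release_tuple`, `satisfies`, and `check_environment` are hypothetical helper names, and the comparison looks only at the leading numeric release, so pre-release and local suffixes such as `rc1` or `+cu121` are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Constraints copied from the "Known constraints" table: (package, operator, bound).
CONSTRAINTS = [
    ("numpy", "<", "2.0"),
    ("huggingface_hub", "<", "0.24"),
    ("torch", ">=", "2.4"),
]


def release_tuple(raw):
    """Extract the leading numeric release, e.g. '2.4.1+cu121' -> (2, 4, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", raw)
    if match is None:
        raise ValueError(f"unparseable version: {raw!r}")
    return tuple(int(part) for part in match.group().split("."))


def satisfies(installed, op, bound):
    """Compare two version strings with '<' or '>='."""
    have, want = release_tuple(installed), release_tuple(bound)
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    if op == "<":
        return have < want
    if op == ">=":
        return have >= want
    raise ValueError(f"unsupported operator: {op!r}")


def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for package, op, bound in CONSTRAINTS:
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if not satisfies(installed, op, bound):
            problems.append(f"{package} {installed} violates {op} {bound}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["all constraints satisfied"]:
        print(line)
```

A project that already depends on the `packaging` library could swap the hand-rolled comparison for its specifier handling, which also orders pre-releases correctly.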