Architecture
How the four Voice SDK modules connect, route, and share infrastructure.
Service topology
                  ┌───────────────────┐
                  │    Client Apps    │
                  │   (Web, Mobile,   │
                  │    Telephony)     │
                  └─────────┬─────────┘
                            │
                  ┌─────────▼─────────┐
                  │   Reverse Proxy   │ ← Nginx: routing, TLS, path dispatch
                  └─────────┬─────────┘
                            │
     ┌──────────┬───────────┼───────────┬──────────┐
     │          │           │           │          │
┌────▼────┐┌────▼────┐ ┌────▼─────┐┌────▼────┐┌────▼─────┐
│  Voice  ││Speaches │ │  Noise   ││  Voice  ││ LiveKit  │
│  Agent  ││(STT/TTS)│ │ Suppres. ││ Biomet. ││  Server  │
│ :front  ││  :8051  │ │  :8060   ││  :8066  ││  :7880   │
│ :back   ││         │ │  :8061   ││         ││          │
└─────────┘└─────────┘ └──────────┘└────┬────┘└──────────┘
                                        │
                                   ┌────▼─────┐
                                   │PostgreSQL│
                                   │  :8065   │
                                   └──────────┘
A single Nginx reverse proxy fronts the platform at voiceai.trouve.works and dispatches to the right backend service based on URL path.
URL routing
| Path | Backend service | Port |
|---|---|---|
| / | Voice Agent (Next.js frontend) | – |
| /livekit | LiveKit Server (WebRTC) | 7880 |
| /services/ | Speaches (STT/TTS backend) | 8051 |
| /utilities/ | VoiceUtilities (static frontend) | 8112 |
| /noise/ | Noise Suppression frontend | 8061 |
| /noise/api/ | Noise Suppression backend | 8060 |
| /noise/ws/ | Noise Suppression WebSocket | 8060 |
| /biometric/ | Voice Biometrics frontend | – |
| /biometric/api/ | Voice Biometrics backend | 8066 |
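The dispatch in the table can be modelled as longest-prefix matching, which is how overlapping prefixes such as /noise/ and /noise/api/ resolve to different backends. The following is a minimal Python sketch of that idea, not the actual Nginx configuration: the `ROUTES` dictionary and the `resolve_backend` helper are hypothetical names, and real Nginx location matching has additional rules (exact and regex locations) that are not modelled here.

```python
from typing import Optional

# Path prefix -> (backend service, port), copied from the routing table.
# None marks rows where the table lists no port.
ROUTES: dict[str, tuple[str, Optional[int]]] = {
    "/": ("Voice Agent (Next.js frontend)", None),
    "/livekit": ("LiveKit Server (WebRTC)", 7880),
    "/services/": ("Speaches (STT/TTS backend)", 8051),
    "/utilities/": ("VoiceUtilities (static frontend)", 8112),
    "/noise/": ("Noise Suppression frontend", 8061),
    "/noise/api/": ("Noise Suppression backend", 8060),
    "/noise/ws/": ("Noise Suppression WebSocket", 8060),
    "/biometric/": ("Voice Biometrics frontend", None),
    "/biometric/api/": ("Voice Biometrics backend", 8066),
}


def resolve_backend(path: str) -> tuple[str, Optional[int]]:
    """Return the backend for a request path using longest-prefix matching."""
    if not path.startswith("/"):
        raise ValueError(f"expected an absolute request path, got {path!r}")
    # "/" matches every absolute path, so this list is never empty.
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    # The most specific (longest) matching prefix wins.
    return ROUTES[max(matches, key=len)]
```

Under this model a request below /noise/api/ reaches the backend on port 8060, while any other path under /noise/ falls through to the frontend on port 8061.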
Data flow
| Concern | Implementation |
|---|---|
| API style | REST (FastAPI) + WebSocket |
| Real-time audio | WebSocket with binary Float32 LE PCM |
| Voice rooms | LiveKit (WebRTC) |
| API compatibility | OpenAI API format for STT/TTS endpoints |
| Database | PostgreSQL 16 (Voice Biometrics only) |
| Containerization | Docker + Docker Compose |
| GPU compute | NVIDIA CUDA 12.x |
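To make the "binary Float32 LE PCM" row concrete, here is a minimal Python sketch of packing and unpacking such a payload with the standard `struct` module. It assumes each binary WebSocket message carries only raw samples with no header; the sample rate, channel count, and chunk size are not specified here, and the helper names `encode_pcm_f32le` and `decode_pcm_f32le` are hypothetical.

```python
import struct
from typing import Sequence

BYTES_PER_SAMPLE = 4  # one 32-bit float per sample


def encode_pcm_f32le(samples: Sequence[float]) -> bytes:
    """Pack samples as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(samples)}f", *samples)


def decode_pcm_f32le(payload: bytes) -> list[float]:
    """Unpack a binary payload of little-endian 32-bit floats."""
    if len(payload) % BYTES_PER_SAMPLE != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(payload) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{count}f", payload))
```

A client would send the bytes returned by `encode_pcm_f32le` as one binary WebSocket frame, and decode the returned frame the same way. The explicit `<` in the format string fixes the byte order regardless of the host CPU.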
Storage layout
/storage/
  enrollments/{speaker_id}/   # Voice Biometrics: enrolled speaker audio
  jobs/{job_id}/              # Voice Biometrics: uploads + extracted segments
  models/                     # HuggingFace + Torch model cache (shared)
The /storage/models/ directory is mounted into every container that loads HuggingFace or Torch weights, so weights downloaded by one service on first use are reused by the others instead of being downloaded again.
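One way a service could be pointed at the shared cache is through the standard `HF_HOME` (Hugging Face) and `TORCH_HOME` (Torch Hub) environment variables. The Python sketch below builds those values and the per-speaker and per-job paths from the layout above. It is illustrative only: whether the platform sets these variables is an assumption, the `huggingface` and `torch` subdirectories are hypothetical, and the helper names are invented for this example.

```python
from pathlib import PurePosixPath

STORAGE_ROOT = PurePosixPath("/storage")


def model_cache_env(root: PurePosixPath = STORAGE_ROOT) -> dict[str, str]:
    """Environment variables that point both libraries at the shared cache."""
    models = root / "models"
    return {
        # Subdirectory names are hypothetical; they keep the two caches apart.
        "HF_HOME": str(models / "huggingface"),
        "TORCH_HOME": str(models / "torch"),
    }


def _check_id(value: str) -> str:
    """Reject identifiers that could escape their parent directory."""
    if not value or "/" in value or value in {".", ".."}:
        raise ValueError(f"invalid identifier: {value!r}")
    return value


def enrollment_dir(speaker_id: str, root: PurePosixPath = STORAGE_ROOT) -> PurePosixPath:
    """Directory holding a speaker's enrolled audio."""
    return root / "enrollments" / _check_id(speaker_id)


def job_dir(job_id: str, root: PurePosixPath = STORAGE_ROOT) -> PurePosixPath:
    """Directory holding a job's uploads and extracted segments."""
    return root / "jobs" / _check_id(job_id)
```

Passing the result of `model_cache_env()` into each container's environment, together with the shared mount, would let every service resolve the same cache location.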
Stateful vs stateless modules
| Module | State |
|---|---|
| Voice Agent | Stateless. Per-room ephemeral state inside LiveKit |
| Voice Utilities (Speaches) | Stateless. Models cached in-process; WHISPER__TTL=-1 keeps them resident |
| Noise Suppression | Stateless. DeepFilterNet3 processes each chunk independently |
| Voice Biometrics | Stateful. PostgreSQL stores speakers, embeddings, jobs, and segments |
Only Voice Biometrics requires durable storage. The other modules can be torn down and rebuilt without data loss.
Where to next
- Tech stack — frameworks, models, and versions across the platform
- Deployment — Docker Compose, GPU allocation, and reverse proxy
- Modules — internal architecture per module