Architecture

How the four Voice SDK modules connect, route, and share infrastructure.

Service topology

                  ┌───────────────────┐
                  │    Client Apps    │
                  │   (Web, Mobile,   │
                  │    Telephony)     │
                  └─────────┬─────────┘
                            │
                  ┌─────────▼─────────┐
                  │   Reverse Proxy   │  ← Nginx: routing, TLS, path dispatch
                  └─────────┬─────────┘
                            │
     ┌──────────┬───────────┼───────────┬──────────┐
     │          │           │           │          │
┌────▼────┐┌────▼────┐ ┌────▼─────┐┌────▼────┐┌────▼─────┐
│  Voice  ││Speaches │ │  Noise   ││  Voice  ││ LiveKit  │
│  Agent  ││(STT/TTS)│ │ Suppres. ││ Biomet. ││  Server  │
│ :front  ││  :8051  │ │  :8060   ││  :8066  ││  :7880   │
│ :back   ││         │ │  :8061   ││         ││          │
└─────────┘└─────────┘ └──────────┘└────┬────┘└──────────┘
                                        │
                                   ┌────▼─────┐
                                   │PostgreSQL│
                                   │  :8065   │
                                   └──────────┘

A single Nginx reverse proxy fronts the platform at voiceai.trouve.works and dispatches to the right backend service based on URL path.

URL routing

Path              Backend service                     Port
/                 Voice Agent (Next.js frontend)
/livekit          LiveKit Server (WebRTC)             7880
/services/        Speaches (STT/TTS backend)          8051
/utilities/       VoiceUtilities (static frontend)    8112
/noise/           Noise Suppression frontend          8061
/noise/api/       Noise Suppression backend           8060
/noise/ws/        Noise Suppression WebSocket         8060
/biometric/       Voice Biometrics frontend
/biometric/api/   Voice Biometrics backend            8066
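
The dispatch rule implied by the routing table can be sketched as a longest-prefix lookup, which is how Nginx chooses among plain prefix location blocks. This is only an illustration: the platform's actual Nginx configuration is not shown on this page, and the names Route, ROUTES, and resolve are hypothetical. Ports the table does not list are left as None.

```python
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    prefix: str
    service: str
    port: Optional[int]  # None where the routing table lists no port


# Transcribed from the routing table; list order does not affect matching.
ROUTES: List[Route] = [
    Route("/", "Voice Agent (Next.js frontend)", None),
    Route("/livekit", "LiveKit Server (WebRTC)", 7880),
    Route("/services/", "Speaches (STT/TTS backend)", 8051),
    Route("/utilities/", "VoiceUtilities (static frontend)", 8112),
    Route("/noise/", "Noise Suppression frontend", 8061),
    Route("/noise/api/", "Noise Suppression backend", 8060),
    Route("/noise/ws/", "Noise Suppression WebSocket", 8060),
    Route("/biometric/", "Voice Biometrics frontend", None),
    Route("/biometric/api/", "Voice Biometrics backend", 8066),
]


def resolve(path: str) -> Route:
    """Return the route whose prefix is the longest match for `path`."""
    matches = [route for route in ROUTES if path.startswith(route.prefix)]
    if not matches:
        # Only reachable for paths that do not start with "/".
        raise ValueError(f"no route matches {path!r}")
    return max(matches, key=lambda route: len(route.prefix))
```

For example, resolve("/noise/api/denoise") selects the Noise Suppression backend rather than the frontend, because /noise/api/ is a longer prefix than /noise/. The path /noise/api/denoise itself is made up for the example.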

Data flow

Concern             Implementation
API style           REST (FastAPI) + WebSocket
Real-time audio     WebSocket with binary Float32 LE PCM
Voice rooms         LiveKit (WebRTC)
API compatibility   OpenAI API format for STT/TTS endpoints
Database            PostgreSQL 16 (Voice Biometrics only)
Containerization    Docker + Docker Compose
GPU compute         NVIDIA CUDA 12.x
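
To make the real-time audio row concrete, here is a minimal sketch of the byte layout of a binary Float32 LE PCM frame: each sample is a 4-byte IEEE 754 float in little-endian order, packed back to back. The helper names are hypothetical, and the sketch assumes nothing about sample rate, channel count, or frame size, none of which are stated on this page.

```python
import struct
from typing import List, Sequence


def encode_f32le(samples: Sequence[float]) -> bytes:
    """Pack samples as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(samples)}f", *samples)


def decode_f32le(frame: bytes) -> List[float]:
    """Unpack a binary frame back into a list of float samples."""
    if len(frame) % 4 != 0:
        raise ValueError("frame length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(frame) // 4}f", frame))
```

A frame of N samples is therefore exactly 4 × N bytes, which gives a receiver a cheap sanity check on incoming WebSocket messages before decoding.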

Storage layout

/storage/
  enrollments/{speaker_id}/   # Voice Biometrics: enrolled speaker audio
  jobs/{job_id}/              # Voice Biometrics: uploads + extracted segments
  models/                     # HuggingFace + Torch model cache (shared)

The /storage/models/ directory is mounted into every container that loads Hugging Face or Torch weights, so a model pulled once by any service is reused by the others instead of being downloaded again.
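
The layout above could be addressed from code with a few path helpers, sketched here. The function names are hypothetical. The use of HF_HOME and TORCH_HOME (real environment variables read by the Hugging Face and Torch tooling) and the huggingface and torch subdirectory names are assumptions: this page does not say how the services are pointed at the shared cache.

```python
from pathlib import PurePosixPath
from typing import Dict

STORAGE_ROOT = PurePosixPath("/storage")
MODELS_DIR = STORAGE_ROOT / "models"


def _safe(identifier: str) -> str:
    """Reject identifiers that could escape their parent directory."""
    if not identifier or "/" in identifier or identifier in (".", ".."):
        raise ValueError(f"unsafe path component: {identifier!r}")
    return identifier


def enrollment_dir(speaker_id: str) -> PurePosixPath:
    """Directory holding one speaker's enrolled audio."""
    return STORAGE_ROOT / "enrollments" / _safe(speaker_id)


def job_dir(job_id: str) -> PurePosixPath:
    """Directory holding one job's uploads and extracted segments."""
    return STORAGE_ROOT / "jobs" / _safe(job_id)


def model_cache_env() -> Dict[str, str]:
    """Environment a container might set to use the shared model cache.

    Assumption: the subdirectory names below are illustrative only.
    """
    return {
        "HF_HOME": str(MODELS_DIR / "huggingface"),
        "TORCH_HOME": str(MODELS_DIR / "torch"),
    }
```

Validating identifiers before joining them keeps a malformed speaker_id or job_id from resolving to a location outside its intended directory.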

Stateful vs stateless modules

Module                       State
Voice Agent                  Stateless. Per-room ephemeral state inside LiveKit
Voice Utilities (Speaches)   Stateless. Models cached in-process; WHISPER__TTL=-1 keeps them resident
Noise Suppression            Stateless. DeepFilterNet3 processes each chunk independently
Voice Biometrics             Stateful. PostgreSQL stores speakers, embeddings, jobs, and segments

Only Voice Biometrics requires durable storage. The other modules can be torn down and rebuilt without data loss.
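
As a small illustration of how this split might drive operational tooling, the sketch below records the classification as data and derives which modules need a backup before teardown. The Module type and both helper functions are hypothetical and not part of the platform.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Module:
    name: str
    stateful: bool


# Classification taken from the module table in this section.
MODULES: Tuple[Module, ...] = (
    Module("Voice Agent", stateful=False),
    Module("Voice Utilities (Speaches)", stateful=False),
    Module("Noise Suppression", stateful=False),
    Module("Voice Biometrics", stateful=True),
)


def needs_backup() -> List[str]:
    """Modules whose data must be preserved before a rebuild."""
    return [module.name for module in MODULES if module.stateful]


def safe_to_rebuild() -> List[str]:
    """Modules that can be torn down and recreated without data loss."""
    return [module.name for module in MODULES if not module.stateful]
```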

Where to next

  • Tech stack — frameworks, models, and versions across the platform
  • Deployment — Docker Compose, GPU allocation, and reverse proxy
  • Modules — internal architecture per module