Voice Biometrics
Identify who is speaking in an audio recording — the voice equivalent of face recognition for images. Provides both speaker identification (matching voices against an enrolled gallery) and speaker diarization ("who spoke when").
Live at voiceai.trouve.works/biometric/.
Two core capabilities
Speaker identification
- Enroll speakers by uploading voice samples (minimum 10 seconds per sample)
- The system creates a 192-dimensional voice "fingerprint" using SpeechBrain's ECAPA-TDNN model
- New audio is diarized into segments, each segment matched against the enrolled gallery
- Returns speaker identity with a confidence score, or `UNKNOWN` if below the threshold
Speaker diarization
- Segments audio into "who spoke when" without prior enrollment
- Primary engine: `pyannote/speaker-diarization-3.1` (state of the art)
- Fallback engine: SpeechBrain sliding-window clustering (no HuggingFace token required)
- Returns time-stamped segments with speaker labels (`SPEAKER_00`, `SPEAKER_01`, …)
Identification flow
1. ENROLL speakers (upload voice samples)
│
▼
2. Submit audio for IDENTIFICATION
│
▼
3. DIARIZE audio into segments (who spoke when)
│
▼
4. EMBED each segment (create voice fingerprint)
│
▼
5. MATCH against enrolled speakers (cosine similarity)
│
▼
6. Return: speaker name, confidence, time range
Key capabilities
| Feature | Details |
|---|---|
| Embedding model | SpeechBrain ECAPA-TDNN (192-dim, trained on VoxCeleb2) |
| Diarization | pyannote 3.1 (primary) + SpeechBrain fallback |
| Matching | Cosine similarity with configurable threshold (default: 0.75) |
| Audio formats | WAV, MP3, M4A, and any ffmpeg-compatible format |
| Segment export | Optionally extract per-speaker WAV clips |
| Async processing | Long audio files processed in background with job status polling |
| Persistent storage | PostgreSQL for speakers, embeddings, and job history |
| GPU acceleration | NVIDIA CUDA 12.1 support |
Project structure
VoiceBiometrics/
└── app/
├── main.py # FastAPI app setup, lifespan, CORS
├── config.py # Pydantic settings (env-based config)
├── database.py # SQLAlchemy async engine + session
├── models/
│ ├── speaker.py # Speaker, SpeakerEmbedding ORM models
│ └── job.py # Job, Segment, JobType/JobStatus enums
├── routers/
│ ├── speakers.py # Enrollment, list, delete
│ ├── identify.py # Diarize + match to gallery
│ ├── diarize.py # Diarization only
│ ├── segments.py # Download segment audio clips
│ └── jobs.py # Job status polling
├── schemas/
│ ├── speaker.py # SpeakerOut, EnrollResponse
│ ├── job.py # JobOut, SegmentOut
│ └── common.py # ApiResponse<T> generic wrapper
├── services/
│ ├── embedding_service.py # ECAPA-TDNN (192-dim embeddings)
│ ├── diarization_service.py # pyannote 3.1 + SpeechBrain fallback
│ ├── identification_service.py # Gallery-based speaker matching
│ ├── audio_service.py # Audio validation, loading, conversion
│ └── job_service.py # Job CRUD + status management
├── utils/
│ ├── device.py # GPU/CPU device selection
│ ├── similarity.py # Cosine similarity, L2 normalization
│ └── audio_utils.py # ffmpeg wrappers
└── workers/
└── background.py # Async job execution (diarize, identify)
alembic/ # Database migrations
tests/ # pytest test suite
Dockerfile # CUDA 12.1 multi-stage build
docker-compose.yml # PostgreSQL + API
Services architecture
Embedding service
Model: SpeechBrain ECAPA-TDNN (speechbrain/spkrec-ecapa-voxceleb)
- 192-dimensional L2-normalized speaker embeddings
- Trained on VoxCeleb2 dataset
- Lazy-loaded on first call (cached globally)
| Function | Input | Output |
|---|---|---|
| `get_embedding_model()` | – | Cached model instance |
| `embed_waveform(tensor)` | `torch.Tensor` | 192-dim numpy array |
| `embed_file(path)` | audio file path | `(embedding, duration_sec)` |
| `embed_clip(path, start, end)` | source + time range | embedding or `None` |
| `serialize_embedding(emb)` | numpy array | bytes (for DB storage) |
| `deserialize_embedding(data)` | bytes | numpy array |
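For illustration, a minimal sketch of how the serialize/deserialize pair can round-trip a 192-dim float32 embedding into bytes for the `BYTEA` column (the exact on-disk format here is an assumption, not the repository's code):

```python
import numpy as np

def serialize_embedding(emb: np.ndarray) -> bytes:
    # Raw float32 bytes (native byte order): 192 floats -> 768 bytes
    return np.asarray(emb, dtype=np.float32).tobytes()

def deserialize_embedding(data: bytes) -> np.ndarray:
    # np.frombuffer returns a read-only view; copy() makes it writable
    return np.frombuffer(data, dtype=np.float32).copy()

emb = np.random.rand(192).astype(np.float32)
assert np.array_equal(emb, deserialize_embedding(serialize_embedding(emb)))
```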
Diarization service — dual backend
| Backend | Priority | Model | Requirement |
|---|---|---|---|
| pyannote | Primary | pyannote/speaker-diarization-3.1 | HuggingFace token |
| SpeechBrain | Fallback | Sliding-window ECAPA clustering | No token needed |
SpeechBrain fallback pipeline:
- Slide 2-second window with 1-second step
- Compute ECAPA embedding per window
- Estimate speaker count (2–6 range heuristic)
- Agglomerative clustering (cosine distance)
- Merge consecutive same-speaker windows
- Filter segments < 0.5 seconds
Output: DiarizedSegment(start: float, end: float, label: str)
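As a rough sketch (not the production code), the fallback steps above map to something like the following. `embed_fn` stands in for the ECAPA embedding call, and the speaker count is passed in fixed rather than estimated by the 2–6 heuristic:

```python
from dataclasses import dataclass
import numpy as np
from sklearn.cluster import AgglomerativeClustering

@dataclass
class DiarizedSegment:
    start: float
    end: float
    label: str

def fallback_diarize(waveform: np.ndarray, sr: int, embed_fn,
                     n_speakers: int = 2, window: float = 2.0,
                     step: float = 1.0, min_len: float = 0.5):
    # Slide a 2 s window with a 1 s step and embed each window
    starts, embs = [], []
    t, total = 0.0, len(waveform) / sr
    while t + window <= total:
        embs.append(embed_fn(waveform[int(t * sr):int((t + window) * sr)]))
        starts.append(t)
        t += step
    if len(embs) < n_speakers:
        return []

    # Agglomerative clustering on cosine distance between window embeddings
    labels = AgglomerativeClustering(
        n_clusters=n_speakers, metric="cosine", linkage="average"
    ).fit_predict(np.stack(embs))

    # Merge consecutive windows that share a cluster label
    segments: list[DiarizedSegment] = []
    for s, lab in zip(starts, labels):
        name = f"SPEAKER_{lab:02d}"
        if segments and segments[-1].label == name:
            segments[-1].end = s + window
        else:
            segments.append(DiarizedSegment(s, s + window, name))

    # Drop segments shorter than 0.5 s
    return [seg for seg in segments if seg.end - seg.start >= min_len]
```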
Identification service
Gallery-based matching with in-memory cache:
_gallery_matrix: np.ndarray # Shape (N, 192) — all enrolled speakers
_speaker_ids: list[UUID] # Parallel array of speaker IDs
_speaker_names: list[str] # Parallel array of speaker names
_gallery_dirty: bool # Cache invalidation flag
Flow:
- `invalidate_gallery()` — called after enrollment / deletion
- `_rebuild_gallery()` — loads all speakers, averages embeddings per speaker
- `identify(embedding)` — cosine similarity against the gallery matrix
- Returns `(speaker_id, speaker_name, confidence)`, or `("UNKNOWN", score)` if below the threshold
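Because the gallery rows and the query embedding are L2-normalized, cosine similarity reduces to a single matrix-vector product. A hedged sketch of the matching step (names mirror the cache fields above):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75  # default from the configuration section

def identify(embedding: np.ndarray, gallery_matrix: np.ndarray,
             speaker_ids: list, speaker_names: list[str]):
    # Normalize the query; gallery rows are assumed already L2-normalized
    query = embedding / np.linalg.norm(embedding)
    scores = gallery_matrix @ query          # (N,) cosine similarities
    best = int(np.argmax(scores))
    if scores[best] < SIMILARITY_THRESHOLD:
        return None, "UNKNOWN", float(scores[best])
    return speaker_ids[best], speaker_names[best], float(scores[best])
```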
Job service
Manages async job lifecycle:
| Function | Purpose |
|---|---|
| `create_job(type, path)` | Create PENDING job |
| `set_running(job)` | Mark as RUNNING |
| `set_done(job, ...)` | Mark as DONE with results |
| `set_failed(job, error)` | Mark as FAILED |
| `add_segment(job_id, ...)` | Add diarized / identified segment |
| `cleanup_orphaned_jobs()` | Mark stale RUNNING jobs as FAILED (crash recovery) |
Background workers
`run_diarize_job(job_id, export_segments)`:
- Mark job `RUNNING`
- Diarize audio (pyannote or SpeechBrain)
- For each segment: optionally extract a WAV clip, create a DB record
- Mark job `DONE`

`run_identify_job(job_id, export_segments)`:
- Mark job `RUNNING`
- Diarize audio
- For each segment: embed the clip, identify the speaker against the gallery
- Create segment records with matched speaker + confidence
- Mark job `DONE`
API reference
Base URL: https://voiceai.trouve.works/biometric/api
All responses are wrapped in `ApiResponse<T>`:
{
"success": true,
"data": { /* ... */ },
"error": null
}
Speaker enrollment
POST /speakers/enroll
Content-Type: multipart/form-data
Parameters:
name: "Alice" (required)
files: <audio file(s)> (required, min 10s each)
{
"success": true,
"data": {
"speaker": {
"id": "uuid",
"name": "Alice",
"sample_count": 2,
"created_at": "2026-04-14T..."
},
"embeddings_added": 2,
"total_duration_sec": 25.5
}
}
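For example, enrolling a speaker from Python with `requests` (the sample file name is hypothetical):

```python
import requests

BASE = "https://voiceai.trouve.works/biometric/api"

with open("alice_sample.wav", "rb") as f:   # hypothetical voice sample
    resp = requests.post(
        f"{BASE}/speakers/enroll",
        data={"name": "Alice"},
        files=[("files", ("alice_sample.wav", f, "audio/wav"))],
    )
resp.raise_for_status()
print(resp.json()["data"]["speaker"]["id"])
```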
Speaker management
GET /speakers # List all (offset, limit pagination)
GET /speakers/{speaker_id} # Get details + embedding count
DELETE /speakers/{speaker_id} # Remove speaker + files
Identification
POST /identify
Content-Type: multipart/form-data
Parameters:
file: <audio file> (required)
export_segments: false (bool — extract per-speaker WAV clips)
sync: false (bool — wait for results vs. async)
Response (async, status 202):
{
"success": true,
"data": {
"id": "job-uuid",
"type": "IDENTIFY",
"status": "PENDING"
}
}
Response (sync or after polling):
{
"success": true,
"data": {
"id": "job-uuid",
"type": "IDENTIFY",
"status": "DONE",
"duration_sec": 120.5,
"num_speakers": 3,
"diarizer_backend": "pyannote",
"segments": [
{
"id": "segment-uuid",
"diarization_label": "SPEAKER_00",
"matched_speaker_id": "speaker-uuid",
"matched_name": "Alice",
"start_sec": 0.5,
"end_sec": 10.2,
"confidence": 0.89,
"has_audio": true,
"audio_url": "/segments/{job_id}/audio/{segment_id}"
},
{
"id": "segment-uuid",
"diarization_label": "SPEAKER_01",
"matched_speaker_id": null,
"matched_name": "UNKNOWN",
"start_sec": 10.5,
"end_sec": 45.0,
"confidence": 0.45,
"has_audio": true,
"audio_url": "/segments/{job_id}/audio/{segment_id}"
}
]
}
}
Diarization only
POST /diarize
Content-Type: multipart/form-data
Parameters:
file: <audio file> (required)
export_segments: false (bool)
sync: false (bool)
Same response shape as `/identify`, without `matched_speaker_id` / `matched_name`.
Segment audio download
GET /segments/{job_id}/audio/{segment_id}
Response: WAV audio file (Content-Type: audio/wav).
Job status
GET /jobs/{job_id} # Get job details + segments
GET /jobs # List all jobs (offset, limit)
Async polling pattern:
1. `POST /identify` → `202` with `job_id`
2. `GET /jobs/{job_id}` → poll until `status == "DONE"` or `"FAILED"`
3. `GET /segments/{job_id}/audio/{segment_id}` → download clips
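A minimal polling client in the same vein (the recording name is hypothetical, and the 2-second poll interval is arbitrary):

```python
import time
import requests

BASE = "https://voiceai.trouve.works/biometric/api"

with open("meeting.wav", "rb") as f:        # hypothetical recording
    job = requests.post(f"{BASE}/identify", files={"file": f}).json()["data"]

while job["status"] not in ("DONE", "FAILED"):
    time.sleep(2)
    job = requests.get(f"{BASE}/jobs/{job['id']}").json()["data"]

for seg in job.get("segments", []):
    print(f"{seg['start_sec']:7.1f}s-{seg['end_sec']:7.1f}s "
          f"{seg['matched_name']} ({seg['confidence']:.2f})")
```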
Database schema
speakers
id: UUID (PK)
name: VARCHAR (indexed)
metadata_: JSONB
sample_count: INTEGER
created_at: TIMESTAMP
updated_at: TIMESTAMP
speaker_embeddings
id: UUID (PK)
speaker_id: UUID (FK → speakers.id, CASCADE)
embedding: BYTEA (serialized float32[192])
dim: INTEGER (default: 192)
duration_sec: FLOAT
source_path: VARCHAR
created_at: TIMESTAMP
jobs
id: UUID (PK)
type: ENUM (IDENTIFY, DIARIZE)
status: ENUM (PENDING, RUNNING, DONE, FAILED)
input_path: VARCHAR
duration_sec: FLOAT
num_speakers: INTEGER
diarizer_backend: VARCHAR
error: TEXT
created_at: TIMESTAMP
completed_at: TIMESTAMP
segments
id: UUID (PK)
job_id: UUID (FK → jobs.id, CASCADE)
diarization_label: VARCHAR
matched_speaker_id: UUID (FK → speakers.id, SET NULL)
matched_name: VARCHAR
start_sec: FLOAT
end_sec: FLOAT
confidence: FLOAT
audio_path: VARCHAR
created_at: TIMESTAMP
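For orientation, the `speakers` table above might map to a SQLAlchemy 2.0 model along these lines (a sketch, not the repository's exact code):

```python
import uuid
from datetime import datetime
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Speaker(Base):
    __tablename__ = "speakers"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    name: Mapped[str] = mapped_column(index=True)
    metadata_: Mapped[dict] = mapped_column(JSONB, default=dict)
    sample_count: Mapped[int] = mapped_column(default=0)
    created_at: Mapped[datetime] = mapped_column(default=datetime.utcnow)
    updated_at: Mapped[datetime] = mapped_column(
        default=datetime.utcnow, onupdate=datetime.utcnow)
```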
Configuration
Environment variables
| Variable | Default | Purpose |
|---|---|---|
| `POSTGRES_USER` | voicebio | Database user |
| `POSTGRES_PASSWORD` | voicebio | Database password |
| `POSTGRES_DB` | voicebio | Database name |
| `POSTGRES_HOST` | postgres | Database host |
| `POSTGRES_PORT` | 8065 | Database port |
| `HF_TOKEN` | – | HuggingFace token (required for pyannote) |
| `DEVICE` | autodetect | `cuda:0`, `cpu`, etc. |
| `SIMILARITY_THRESHOLD` | 0.75 | Speaker match threshold |
| `MIN_ENROLLMENT_SECONDS` | 10 | Minimum enrollment audio duration |
| `MAX_UPLOAD_MB` | 500 | Max file upload size |
| `STORAGE_DIR` | /storage | Root storage path |
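Given that `config.py` is described as pydantic settings with env-based config, it plausibly looks something like the following sketch (field names and defaults mirror the table above; the exact class is an assumption):

```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    postgres_user: str = "voicebio"
    postgres_password: str = "voicebio"
    postgres_db: str = "voicebio"
    postgres_host: str = "postgres"
    postgres_port: int = 8065
    hf_token: str | None = None         # pyannote is unavailable without it
    device: str | None = None           # None -> autodetect GPU/CPU
    similarity_threshold: float = 0.75
    min_enrollment_seconds: float = 10.0
    max_upload_mb: int = 500
    storage_dir: str = "/storage"

settings = Settings()  # values are read from the environment (case-insensitive)
```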
Storage layout
/storage/
enrollments/{speaker_id}/ # Enrolled speaker audio files
jobs/{job_id}/ # Uploaded audio + extracted segments
models/ # HuggingFace + Torch model cache
Key dependencies
| Package | Version | Purpose |
|---|---|---|
| `torch` | ≥ 2.4.0 | Deep learning framework |
| `torchaudio` | ≥ 2.4.0 | Audio I/O |
| `speechbrain` | ≥ 1.0.0 | ECAPA-TDNN embeddings |
| `pyannote.audio` | ≥ 3.1.0, < 4.0 | Speaker diarization |
| `numpy` | ≥ 1.26.0, < 2.0 | Numeric computing |
| `sqlalchemy[asyncio]` | ≥ 2.0.0 | ORM (async) |
| `asyncpg` | ≥ 0.29.0 | PostgreSQL driver |
| `alembic` | ≥ 1.13.0 | Database migrations |
| `scikit-learn` | ≥ 1.4.0 | Agglomerative clustering (fallback) |
| `fastapi` | ≥ 0.111.0 | Web framework |
:::caution Known constraints
- `numpy < 2.0` required (pyannote uses the removed `np.NaN`)
- `huggingface_hub < 0.24` required (pyannote uses the deprecated `use_auth_token`)
- `torch >= 2.4` required (SpeechBrain ≥ 1.0 uses `torch.amp.custom_fwd`)
:::
Use cases
- Voice-based identity verification / authentication
- Meeting transcription with speaker labels
- Call center analytics — identify repeat callers
- Media content indexing — who said what in recordings
- Security and access control
Roadmap
- Health signal detection (vocal biomarkers: stress, fatigue)
- Age / gender estimation from voice
- Real-time identification during live conversations