Voice Biometrics

Identify who is speaking in an audio recording — speaker recognition for audio, conceptually similar to face recognition for images. Provides both speaker identification (matching voices against an enrolled gallery) and speaker diarization ("who spoke when").

Live at voiceai.trouve.works/biometric/.

Two core capabilities

Speaker identification

  • Enroll speakers by uploading voice samples (minimum 10 seconds per sample)
  • The system creates a 192-dimensional voice "fingerprint" using SpeechBrain's ECAPA-TDNN model
  • New audio is diarized into segments, each segment matched against the enrolled gallery
  • Returns speaker identity with confidence score, or UNKNOWN if below threshold

Speaker diarization

  • Segments audio into "who spoke when" without prior enrollment
  • Primary engine: pyannote/speaker-diarization-3.1 (state-of-the-art)
  • Fallback engine: SpeechBrain sliding-window clustering (no HuggingFace token required)
  • Returns time-stamped segments with speaker labels (SPEAKER_00, SPEAKER_01, …)

Identification flow

1. ENROLL speakers (upload voice samples)
2. Submit audio for IDENTIFICATION
3. DIARIZE audio into segments (who spoke when)
4. EMBED each segment (create voice fingerprint)
5. MATCH against enrolled speakers (cosine similarity)
6. Return: speaker name, confidence, time range

Key capabilities

| Feature | Details |
| --- | --- |
| Embedding model | SpeechBrain ECAPA-TDNN (192-dim, trained on VoxCeleb2) |
| Diarization | pyannote 3.1 (primary) + SpeechBrain fallback |
| Matching | Cosine similarity with configurable threshold (default: 0.75) |
| Audio formats | WAV, MP3, M4A, and any ffmpeg-compatible format |
| Segment export | Optionally extract per-speaker WAV clips |
| Async processing | Long audio files processed in background with job status polling |
| Persistent storage | PostgreSQL for speakers, embeddings, and job history |
| GPU acceleration | NVIDIA CUDA 12.1 support |

Project structure

```
VoiceBiometrics/
├── app/
│   ├── main.py                       # FastAPI app setup, lifespan, CORS
│   ├── config.py                     # Pydantic settings (env-based config)
│   ├── database.py                   # SQLAlchemy async engine + session
│   ├── models/
│   │   ├── speaker.py                # Speaker, SpeakerEmbedding ORM models
│   │   └── job.py                    # Job, Segment, JobType/JobStatus enums
│   ├── routers/
│   │   ├── speakers.py               # Enrollment, list, delete
│   │   ├── identify.py               # Diarize + match to gallery
│   │   ├── diarize.py                # Diarization only
│   │   ├── segments.py               # Download segment audio clips
│   │   └── jobs.py                   # Job status polling
│   ├── schemas/
│   │   ├── speaker.py                # SpeakerOut, EnrollResponse
│   │   ├── job.py                    # JobOut, SegmentOut
│   │   └── common.py                 # ApiResponse<T> generic wrapper
│   ├── services/
│   │   ├── embedding_service.py      # ECAPA-TDNN (192-dim embeddings)
│   │   ├── diarization_service.py    # pyannote 3.1 + SpeechBrain fallback
│   │   ├── identification_service.py # Gallery-based speaker matching
│   │   ├── audio_service.py          # Audio validation, loading, conversion
│   │   └── job_service.py            # Job CRUD + status management
│   ├── utils/
│   │   ├── device.py                 # GPU/CPU device selection
│   │   ├── similarity.py             # Cosine similarity, L2 normalization
│   │   └── audio_utils.py            # ffmpeg wrappers
│   └── workers/
│       └── background.py             # Async job execution (diarize, identify)
├── alembic/                          # Database migrations
├── tests/                            # pytest test suite
├── Dockerfile                        # CUDA 12.1 multi-stage build
└── docker-compose.yml                # PostgreSQL + API
```

Services architecture

Embedding service

Model: SpeechBrain ECAPA-TDNN (speechbrain/spkrec-ecapa-voxceleb)

  • 192-dimensional L2-normalized speaker embeddings
  • Trained on VoxCeleb2 dataset
  • Lazy-loaded on first call (cached globally)

| Function | Input | Output |
| --- | --- | --- |
| get_embedding_model() | (none) | Cached model instance |
| embed_waveform(tensor) | torch.Tensor | 192-dim numpy array |
| embed_file(path) | audio file path | (embedding, duration_sec) |
| embed_clip(path, start, end) | source + time range | embedding or None |
| serialize_embedding(emb) | numpy array | bytes (for DB storage) |
| deserialize_embedding(data) | bytes | numpy array |
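
Since the embeddings are fixed-size float32 vectors, serialization for BYTEA storage can be a raw byte round-trip. A minimal sketch — the function names mirror the table above, but the actual implementation is assumed:

```python
import numpy as np

EMBEDDING_DIM = 192  # ECAPA-TDNN output dimension

def serialize_embedding(emb: np.ndarray) -> bytes:
    """Pack an embedding as raw float32 bytes (native byte order) for DB storage."""
    return np.asarray(emb, dtype=np.float32).tobytes()

def deserialize_embedding(data: bytes) -> np.ndarray:
    """Restore a float32 embedding from its raw-byte form."""
    emb = np.frombuffer(data, dtype=np.float32)
    if emb.shape != (EMBEDDING_DIM,):
        raise ValueError(f"expected {EMBEDDING_DIM} floats, got {emb.shape}")
    return emb
```

The round-trip is lossless because no precision conversion happens: 192 floats map to exactly 768 bytes.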

Diarization service — dual backend

| Backend | Priority | Model | Requirement |
| --- | --- | --- | --- |
| pyannote | Primary | pyannote/speaker-diarization-3.1 | HuggingFace token |
| SpeechBrain | Fallback | Sliding-window ECAPA clustering | No token needed |

SpeechBrain fallback pipeline:

  1. Slide 2-second window with 1-second step
  2. Compute ECAPA embedding per window
  3. Estimate speaker count (2–6 range heuristic)
  4. Agglomerative clustering (cosine distance)
  5. Merge consecutive same-speaker windows
  6. Filter segments < 0.5 seconds

Output: DiarizedSegment(start: float, end: float, label: str)
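
The windowing and merging steps of the fallback can be sketched as follows; the embedding and clustering stages are omitted (in the real pipeline each window gets an ECAPA embedding and a cluster label first), so this only illustrates steps 1, 5, and 6:

```python
from dataclasses import dataclass

@dataclass
class DiarizedSegment:
    start: float
    end: float
    label: str

def sliding_windows(duration: float, win: float = 2.0, step: float = 1.0):
    """Yield (start, end) windows: 2-second window, 1-second step."""
    t = 0.0
    while t + win <= duration:
        yield (t, t + win)
        t += step

def merge_windows(windows, labels, min_len: float = 0.5):
    """Merge consecutive same-speaker windows; drop segments shorter than min_len."""
    segments: list = []
    for (start, end), label in zip(windows, labels):
        if segments and segments[-1].label == label and start <= segments[-1].end:
            segments[-1].end = end  # extend the running same-speaker segment
        else:
            segments.append(DiarizedSegment(start, end, label))
    return [s for s in segments if s.end - s.start >= min_len]
```

Overlapping windows with the same cluster label collapse into one segment, which is why the output boundaries can be finer than the 2-second window size.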

Identification service

Gallery-based matching with in-memory cache:

```python
_gallery_matrix: np.ndarray   # Shape (N, 192) — all enrolled speakers
_speaker_ids: list[UUID]      # Parallel array of speaker IDs
_speaker_names: list[str]     # Parallel array of speaker names
_gallery_dirty: bool          # Cache invalidation flag
```

Flow:

  1. invalidate_gallery() — called after enrollment / deletion
  2. _rebuild_gallery() — loads all speakers, averages embeddings per speaker
  3. identify(embedding) — cosine similarity against gallery matrix
  4. Returns (speaker_id, speaker_name, confidence) or ("UNKNOWN", score) if below threshold
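
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, so matching against the entire gallery is one matrix-vector multiply. A minimal sketch (names mirror the cache fields above; the real service's code may differ):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75  # default match threshold

def identify(embedding: np.ndarray, gallery: np.ndarray, names: list):
    """Return (name, score) for the best gallery match, or ("UNKNOWN", score)."""
    # Normalize both sides so cosine similarity is a plain dot product.
    emb = embedding / np.linalg.norm(embedding)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ emb                      # shape (N,): one score per speaker
    best = int(np.argmax(scores))
    score = float(scores[best])
    if score < SIMILARITY_THRESHOLD:
        return "UNKNOWN", score
    return names[best], score
```

Averaging each speaker's enrolled embeddings before stacking them into the gallery matrix (as in _rebuild_gallery) keeps this lookup O(N) per segment regardless of how many samples each speaker enrolled.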

Job service

Manages async job lifecycle:

| Function | Purpose |
| --- | --- |
| create_job(type, path) | Create PENDING job |
| set_running(job) | Mark as RUNNING |
| set_done(job, ...) | Mark as DONE with results |
| set_failed(job, error) | Mark as FAILED |
| add_segment(job_id, ...) | Add diarized / identified segment |
| cleanup_orphaned_jobs() | Mark stale RUNNING jobs as FAILED (crash recovery) |
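
The lifecycle above is a small state machine. A sketch of the allowed transitions — the enum values come from the jobs schema, but the transition map and `advance` helper are illustrative, not the actual implementation:

```python
from enum import Enum

class JobStatus(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"

# DONE and FAILED are terminal; FAILED is also reachable from RUNNING
# via cleanup_orphaned_jobs() after a crash.
TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.DONE, JobStatus.FAILED},
    JobStatus.DONE: set(),
    JobStatus.FAILED: set(),
}

def advance(current: JobStatus, target: JobStatus) -> JobStatus:
    """Move a job to `target`, rejecting illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```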

Background workers

run_diarize_job(job_id, export_segments):

  1. Mark job RUNNING
  2. Diarize audio (pyannote or SpeechBrain)
  3. For each segment: optionally extract WAV clip, create DB record
  4. Mark job DONE

run_identify_job(job_id, export_segments):

  1. Mark job RUNNING
  2. Diarize audio
  3. For each segment: embed clip, identify speaker against gallery
  4. Create segment records with matched speaker + confidence
  5. Mark job DONE

API reference

Base URL: https://voiceai.trouve.works/biometric/api

All responses are wrapped in ApiResponse<T>:

```json
{
  "success": true,
  "data": { /* ... */ },
  "error": null
}
```
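
The envelope can be modeled as a small generic type. A sketch using dataclasses — the service itself likely uses a Pydantic generic model (per schemas/common.py), so treat this as illustrative:

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class ApiResponse(Generic[T]):
    """Uniform envelope: exactly one of `data` / `error` is populated."""
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def ok(cls, data: T) -> "ApiResponse[T]":
        return cls(success=True, data=data)

    @classmethod
    def fail(cls, error: str) -> "ApiResponse[T]":
        return cls(success=False, error=error)
```

A uniform wrapper like this lets clients branch on `success` once, instead of sniffing each endpoint's payload shape.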

Speaker enrollment

```
POST /speakers/enroll
Content-Type: multipart/form-data
```

Parameters:

```
name:  "Alice"           (required)
files: <audio file(s)>   (required, min 10s each)
```

Response:

```json
{
  "success": true,
  "data": {
    "speaker": {
      "id": "uuid",
      "name": "Alice",
      "sample_count": 2,
      "created_at": "2026-04-14T..."
    },
    "embeddings_added": 2,
    "total_duration_sec": 25.5
  }
}
```

Speaker management

```
GET    /speakers               # List all (offset, limit pagination)
GET    /speakers/{speaker_id}  # Get details + embedding count
DELETE /speakers/{speaker_id}  # Remove speaker + files
```

Identification

```
POST /identify
Content-Type: multipart/form-data
```

Parameters:

```
file: <audio file>       (required)
export_segments: false   (bool — extract per-speaker WAV clips)
sync: false              (bool — wait for results vs. async)
```

Response (async, status 202):

```json
{
  "success": true,
  "data": {
    "id": "job-uuid",
    "type": "IDENTIFY",
    "status": "PENDING"
  }
}
```

Response (sync or after polling):

```json
{
  "success": true,
  "data": {
    "id": "job-uuid",
    "type": "IDENTIFY",
    "status": "DONE",
    "duration_sec": 120.5,
    "num_speakers": 3,
    "diarizer_backend": "pyannote",
    "segments": [
      {
        "id": "segment-uuid",
        "diarization_label": "SPEAKER_00",
        "matched_speaker_id": "speaker-uuid",
        "matched_name": "Alice",
        "start_sec": 0.5,
        "end_sec": 10.2,
        "confidence": 0.89,
        "has_audio": true,
        "audio_url": "/segments/{job_id}/audio/{segment_id}"
      },
      {
        "id": "segment-uuid",
        "diarization_label": "SPEAKER_01",
        "matched_speaker_id": null,
        "matched_name": "UNKNOWN",
        "start_sec": 10.5,
        "end_sec": 45.0,
        "confidence": 0.45,
        "has_audio": true,
        "audio_url": "/segments/{job_id}/audio/{segment_id}"
      }
    ]
  }
}
```

Diarization only

```
POST /diarize
Content-Type: multipart/form-data
```

Parameters:

```
file: <audio file>       (required)
export_segments: false   (bool)
sync: false              (bool)
```

Same response shape as /identify, without matched_speaker_id / matched_name.

Segment audio download

```
GET /segments/{job_id}/audio/{segment_id}
```

Response: WAV audio file (Content-Type: audio/wav).

Job status

```
GET /jobs/{job_id}   # Get job details + segments
GET /jobs            # List all jobs (offset, limit)
```

Async polling pattern:

  1. POST /identify → 202 with job_id
  2. GET /jobs/{job_id} → poll until status == "DONE" or "FAILED"
  3. GET /segments/{job_id}/audio/{segment_id} → download clips
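
The polling loop can be sketched around an injected `fetch_job` callable (a hypothetical stand-in for the GET /jobs/{job_id} HTTP call; the function name and signature are assumptions):

```python
import time
from typing import Callable

TERMINAL = {"DONE", "FAILED"}

def wait_for_job(fetch_job: Callable[[], dict],
                 interval: float = 2.0,
                 timeout: float = 600.0) -> dict:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()  # e.g. GET /jobs/{job_id} -> the "data" dict
        if job["status"] in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        time.sleep(interval)
```

Injecting the fetcher keeps the loop testable without a live server; in practice it would wrap an HTTP GET and unwrap the ApiResponse envelope.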

Database schema

speakers

```
id: UUID (PK)
name: VARCHAR (indexed)
metadata_: JSONB
sample_count: INTEGER
created_at: TIMESTAMP
updated_at: TIMESTAMP
```

speaker_embeddings

```
id: UUID (PK)
speaker_id: UUID (FK → speakers.id, CASCADE)
embedding: BYTEA (serialized float32[192])
dim: INTEGER (default: 192)
duration_sec: FLOAT
source_path: VARCHAR
created_at: TIMESTAMP
```

jobs

```
id: UUID (PK)
type: ENUM (IDENTIFY, DIARIZE)
status: ENUM (PENDING, RUNNING, DONE, FAILED)
input_path: VARCHAR
duration_sec: FLOAT
num_speakers: INTEGER
diarizer_backend: VARCHAR
error: TEXT
created_at: TIMESTAMP
completed_at: TIMESTAMP
```

segments

```
id: UUID (PK)
job_id: UUID (FK → jobs.id, CASCADE)
diarization_label: VARCHAR
matched_speaker_id: UUID (FK → speakers.id, SET NULL)
matched_name: VARCHAR
start_sec: FLOAT
end_sec: FLOAT
confidence: FLOAT
audio_path: VARCHAR
created_at: TIMESTAMP
```

Configuration

Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| POSTGRES_USER | voicebio | Database user |
| POSTGRES_PASSWORD | voicebio | Database password |
| POSTGRES_DB | voicebio | Database name |
| POSTGRES_HOST | postgres | Database host |
| POSTGRES_PORT | 8065 | Database port |
| HF_TOKEN | (none) | HuggingFace token (required for pyannote) |
| DEVICE | autodetect | cuda:0, cpu, etc. |
| SIMILARITY_THRESHOLD | 0.75 | Speaker match threshold |
| MIN_ENROLLMENT_SECONDS | 10 | Minimum enrollment audio duration |
| MAX_UPLOAD_MB | 500 | Max file upload size |
| STORAGE_DIR | /storage | Root storage path |
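
These settings are read once at startup; a minimal environment-based sketch (the real service uses Pydantic settings per config.py, so this only illustrates the fallback defaults above):

```python
import os

def get_settings() -> dict:
    """Read configuration from the environment, using the documented defaults."""
    return {
        "similarity_threshold": float(os.environ.get("SIMILARITY_THRESHOLD", "0.75")),
        "min_enrollment_seconds": float(os.environ.get("MIN_ENROLLMENT_SECONDS", "10")),
        "max_upload_mb": int(os.environ.get("MAX_UPLOAD_MB", "500")),
        "storage_dir": os.environ.get("STORAGE_DIR", "/storage"),
    }
```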

Storage layout

```
/storage/
  enrollments/{speaker_id}/   # Enrolled speaker audio files
  jobs/{job_id}/              # Uploaded audio + extracted segments
  models/                     # HuggingFace + Torch model cache
```

Key dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| torch | ≥ 2.4.0 | Deep learning framework |
| torchaudio | ≥ 2.4.0 | Audio I/O |
| speechbrain | ≥ 1.0.0 | ECAPA-TDNN embeddings |
| pyannote.audio | ≥ 3.1.0, < 4.0 | Speaker diarization |
| numpy | ≥ 1.26.0, < 2.0 | Numeric computing |
| sqlalchemy[asyncio] | ≥ 2.0.0 | ORM (async) |
| asyncpg | ≥ 0.29.0 | PostgreSQL driver |
| alembic | ≥ 1.13.0 | Database migrations |
| scikit-learn | ≥ 1.4.0 | Agglomerative clustering (fallback) |
| fastapi | ≥ 0.111.0 | Web framework |

:::caution Known constraints

  • numpy < 2.0 required (pyannote uses the removed np.NaN alias)
  • huggingface_hub < 0.24 required (pyannote uses the deprecated use_auth_token parameter)
  • torch >= 2.4 required (SpeechBrain ≥ 1.0 uses torch.amp.custom_fwd)

:::

Use cases

  • Voice-based identity verification / authentication
  • Meeting transcription with speaker labels
  • Call center analytics — identify repeat callers
  • Media content indexing — who said what in recordings
  • Security and access control

Roadmap

  • Health signal detection (vocal biomarkers: stress, fatigue)
  • Age / gender estimation from voice
  • Real-time identification during live conversations