Voice Biometrics

Identify who is speaking in an audio recording — speaker recognition for audio, conceptually similar to face recognition for images. Provides both speaker identification (matching voices against an enrolled gallery) and speaker diarization ("who spoke when").

Live at voiceai.trouve.works/biometric/.

Two core capabilities

Speaker identification

  • Enroll speakers by uploading voice samples (minimum 10 seconds per sample)
  • The system creates a 192-dimensional voice "fingerprint" using SpeechBrain's ECAPA-TDNN model
  • New audio is diarized into segments, each segment matched against the enrolled gallery
  • Returns speaker identity with confidence score, or UNKNOWN if below threshold

Speaker diarization

  • Segments audio into "who spoke when" without prior enrollment
  • Primary engine: pyannote/speaker-diarization-3.1 (state-of-the-art)
  • Fallback engine: SpeechBrain sliding-window clustering (no HuggingFace token required)
  • Returns time-stamped segments with speaker labels (SPEAKER_00, SPEAKER_01, …)

Identification flow

1. ENROLL speakers (upload voice samples)
2. Submit audio for IDENTIFICATION
3. DIARIZE audio into segments (who spoke when)
4. EMBED each segment (create voice fingerprint)
5. MATCH against enrolled speakers (cosine similarity)
6. Return: speaker name, confidence, time range

Key capabilities

| Feature | Details |
| --- | --- |
| Embedding model | SpeechBrain ECAPA-TDNN (192-dim, trained on VoxCeleb2) |
| Diarization | pyannote 3.1 (primary) + SpeechBrain fallback |
| Matching | Cosine similarity with configurable threshold (default: 0.75) |
| Audio formats | WAV, MP3, M4A, and any ffmpeg-compatible format |
| Segment export | Optionally extract per-speaker WAV clips |
| Async processing | Long audio files processed in background with job status polling |
| Persistent storage | PostgreSQL for speakers, embeddings, and job history |
| GPU acceleration | NVIDIA CUDA 12.1 support |

Project structure

```
VoiceBiometrics/
├── app/
│   ├── main.py                       # FastAPI app setup, lifespan, CORS
│   ├── config.py                     # Pydantic settings (env-based config)
│   ├── database.py                   # SQLAlchemy async engine + session
│   ├── models/
│   │   ├── speaker.py                # Speaker, SpeakerEmbedding ORM models
│   │   └── job.py                    # Job, Segment, JobType/JobStatus enums
│   ├── routers/
│   │   ├── speakers.py               # Enrollment, list, delete
│   │   ├── identify.py               # Diarize + match to gallery
│   │   ├── diarize.py                # Diarization only
│   │   ├── segments.py               # Download segment audio clips
│   │   └── jobs.py                   # Job status polling
│   ├── schemas/
│   │   ├── speaker.py                # SpeakerOut, EnrollResponse
│   │   ├── job.py                    # JobOut, SegmentOut
│   │   └── common.py                 # ApiResponse<T> generic wrapper
│   ├── services/
│   │   ├── embedding_service.py      # ECAPA-TDNN (192-dim embeddings)
│   │   ├── diarization_service.py    # pyannote 3.1 + SpeechBrain fallback
│   │   ├── identification_service.py # Gallery-based speaker matching
│   │   ├── audio_service.py          # Audio validation, loading, conversion
│   │   └── job_service.py            # Job CRUD + status management
│   ├── utils/
│   │   ├── device.py                 # GPU/CPU device selection
│   │   ├── similarity.py             # Cosine similarity, L2 normalization
│   │   └── audio_utils.py            # ffmpeg wrappers
│   └── workers/
│       └── background.py             # Async job execution (diarize, identify)
├── alembic/                          # Database migrations
├── tests/                            # pytest test suite
├── Dockerfile                        # CUDA 12.1 multi-stage build
└── docker-compose.yml                # PostgreSQL + API
```

Services architecture

Embedding service

Model: SpeechBrain ECAPA-TDNN (speechbrain/spkrec-ecapa-voxceleb)

  • 192-dimensional L2-normalized speaker embeddings
  • Trained on VoxCeleb2 dataset
  • Lazy-loaded on first call (cached globally)

| Function | Input | Output |
| --- | --- | --- |
| get_embedding_model() | (none) | Cached model instance |
| embed_waveform(tensor) | torch.Tensor | 192-dim numpy array |
| embed_file(path) | audio file path | (embedding, duration_sec) |
| embed_clip(path, start, end) | source + time range | embedding or None |
| serialize_embedding(emb) | numpy array | bytes (for DB storage) |
| deserialize_embedding(data) | bytes | numpy array |
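
Since the embeddings are fixed-size float32 vectors, serialization for BYTEA storage can be a raw byte round-trip. A minimal sketch — the function names mirror the table above, but the actual implementation is assumed:

```python
import numpy as np

EMBEDDING_DIM = 192  # ECAPA-TDNN output dimension

def serialize_embedding(emb: np.ndarray) -> bytes:
    """Pack an embedding as raw float32 bytes (native byte order) for DB storage."""
    return np.asarray(emb, dtype=np.float32).tobytes()

def deserialize_embedding(data: bytes) -> np.ndarray:
    """Restore a float32 embedding from its raw-byte form."""
    emb = np.frombuffer(data, dtype=np.float32)
    if emb.shape != (EMBEDDING_DIM,):
        raise ValueError(f"expected {EMBEDDING_DIM} floats, got {emb.shape}")
    return emb
```

The round-trip is lossless because no precision conversion happens: 192 floats map to exactly 768 bytes.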

Diarization service — dual backend

| Backend | Priority | Model | Requirement |
| --- | --- | --- | --- |
| pyannote | Primary | pyannote/speaker-diarization-3.1 | HuggingFace token |
| SpeechBrain | Fallback | Sliding-window ECAPA clustering | No token needed |

SpeechBrain fallback pipeline:

  1. Slide 2-second window with 1-second step
  2. Compute ECAPA embedding per window
  3. Estimate speaker count (2–6 range heuristic)
  4. Agglomerative clustering (cosine distance)
  5. Merge consecutive same-speaker windows
  6. Filter segments < 0.5 seconds

Output: DiarizedSegment(start: float, end: float, label: str)
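
The windowing and merging steps of the fallback can be sketched as follows; the embedding and clustering stages are omitted (in the real pipeline each window gets an ECAPA embedding and a cluster label first), so this only illustrates steps 1, 5, and 6:

```python
from dataclasses import dataclass

@dataclass
class DiarizedSegment:
    start: float
    end: float
    label: str

def sliding_windows(duration: float, win: float = 2.0, step: float = 1.0):
    """Yield (start, end) windows: 2-second window, 1-second step."""
    t = 0.0
    while t + win <= duration:
        yield (t, t + win)
        t += step

def merge_windows(windows, labels, min_len: float = 0.5):
    """Merge consecutive same-speaker windows; drop segments shorter than min_len."""
    segments: list = []
    for (start, end), label in zip(windows, labels):
        if segments and segments[-1].label == label and start <= segments[-1].end:
            segments[-1].end = end  # extend the running same-speaker segment
        else:
            segments.append(DiarizedSegment(start, end, label))
    return [s for s in segments if s.end - s.start >= min_len]
```

Overlapping windows with the same cluster label collapse into one segment, which is why the output boundaries can be finer than the 2-second window size.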

Identification service

Gallery-based matching with in-memory cache:

```python
_gallery_matrix: np.ndarray   # Shape (N, 192) — all enrolled speakers
_speaker_ids: list[UUID]      # Parallel array of speaker IDs
_speaker_names: list[str]     # Parallel array of speaker names
_gallery_dirty: bool          # Cache invalidation flag
```

Flow:

  1. invalidate_gallery() — called after enrollment / deletion
  2. _rebuild_gallery() — loads all speakers, averages embeddings per speaker
  3. identify(embedding) — cosine similarity against gallery matrix
  4. Returns (speaker_id, speaker_name, confidence) or ("UNKNOWN", score) if below threshold
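
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, so matching against the entire gallery is one matrix-vector multiply. A minimal sketch (names mirror the cache fields above; the real service's code may differ):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.75  # default match threshold

def identify(embedding: np.ndarray, gallery: np.ndarray, names: list):
    """Return (name, score) for the best gallery match, or ("UNKNOWN", score)."""
    # Normalize both sides so cosine similarity is a plain dot product.
    emb = embedding / np.linalg.norm(embedding)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ emb                      # shape (N,): one score per speaker
    best = int(np.argmax(scores))
    score = float(scores[best])
    if score < SIMILARITY_THRESHOLD:
        return "UNKNOWN", score
    return names[best], score
```

Averaging each speaker's enrolled embeddings before stacking them into the gallery matrix (as in _rebuild_gallery) keeps this lookup O(N) per segment regardless of how many samples each speaker enrolled.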

Job service

Manages async job lifecycle:

| Function | Purpose |
| --- | --- |
| create_job(type, path) | Create PENDING job |
| set_running(job) | Mark as RUNNING |
| set_done(job, ...) | Mark as DONE with results |
| set_failed(job, error) | Mark as FAILED |
| add_segment(job_id, ...) | Add diarized / identified segment |
| cleanup_orphaned_jobs() | Mark stale RUNNING jobs as FAILED (crash recovery) |
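
The lifecycle above is a small state machine. A sketch of the allowed transitions — the enum values come from the jobs schema, but the transition map and `advance` helper are illustrative, not the actual implementation:

```python
from enum import Enum

class JobStatus(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"

# DONE and FAILED are terminal; FAILED is also reachable from RUNNING
# via cleanup_orphaned_jobs() after a crash.
TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.DONE, JobStatus.FAILED},
    JobStatus.DONE: set(),
    JobStatus.FAILED: set(),
}

def advance(current: JobStatus, target: JobStatus) -> JobStatus:
    """Move a job to `target`, rejecting illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```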

Background workers

run_diarize_job(job_id, export_segments):

  1. Mark job RUNNING
  2. Diarize audio (pyannote or SpeechBrain)
  3. For each segment: optionally extract WAV clip, create DB record
  4. Mark job DONE

run_identify_job(job_id, export_segments):

  1. Mark job RUNNING
  2. Diarize audio
  3. For each segment: embed clip, identify speaker against gallery
  4. Create segment records with matched speaker + confidence
  5. Mark job DONE

API reference

Base URL: https://voiceai.trouve.works/biometric/api

All responses are wrapped in ApiResponse<T>:

```json
{
  "success": true,
  "data": { /* ... */ },
  "error": null
}
```
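
The envelope can be modeled as a small generic type. A sketch using dataclasses — the service itself likely uses a Pydantic generic model (per schemas/common.py), so treat this as illustrative:

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class ApiResponse(Generic[T]):
    """Uniform envelope: exactly one of `data` / `error` is populated."""
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def ok(cls, data: T) -> "ApiResponse[T]":
        return cls(success=True, data=data)

    @classmethod
    def fail(cls, error: str) -> "ApiResponse[T]":
        return cls(success=False, error=error)
```

A uniform wrapper like this lets clients branch on `success` once, instead of sniffing each endpoint's payload shape.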

Speaker enrollment

```
POST /speakers/enroll
Content-Type: multipart/form-data
```

Parameters:

```
name:  "Alice"           (required)
files: <audio file(s)>   (required, min 10s each)
```

Response:

```json
{
  "success": true,
  "data": {
    "speaker": {
      "id": "uuid",
      "name": "Alice",
      "sample_count": 2,
      "created_at": "2026-04-14T..."
    },
    "embeddings_added": 2,
    "total_duration_sec": 25.5
  }
}
```

Speaker management

```
GET    /speakers               # List all (offset, limit pagination)
GET    /speakers/{speaker_id}  # Get details + embedding count
DELETE /speakers/{speaker_id}  # Remove speaker + files
```

Identification

```
POST /identify
Content-Type: multipart/form-data
```

Parameters:

```
file: <audio file>       (required)
export_segments: false   (bool — extract per-speaker WAV clips)
sync: false              (bool — wait for results vs. async)
```

Response (async, status 202):

```json
{
  "success": true,
  "data": {
    "id": "job-uuid",
    "type": "IDENTIFY",
    "status": "PENDING"
  }
}
```

Response (sync or after polling):

```json
{
  "success": true,
  "data": {
    "id": "job-uuid",
    "type": "IDENTIFY",
    "status": "DONE",
    "duration_sec": 120.5,
    "num_speakers": 3,
    "diarizer_backend": "pyannote",
    "segments": [
      {
        "id": "segment-uuid",
        "diarization_label": "SPEAKER_00",
        "matched_speaker_id": "speaker-uuid",
        "matched_name": "Alice",
        "start_sec": 0.5,
        "end_sec": 10.2,
        "confidence": 0.89,
        "has_audio": true,
        "audio_url": "/segments/{job_id}/audio/{segment_id}"
      },
      {
        "id": "segment-uuid",
        "diarization_label": "SPEAKER_01",
        "matched_speaker_id": null,
        "matched_name": "UNKNOWN",
        "start_sec": 10.5,
        "end_sec": 45.0,
        "confidence": 0.45,
        "has_audio": true,
        "audio_url": "/segments/{job_id}/audio/{segment_id}"
      }
    ]
  }
}
```

Diarization only

```
POST /diarize
Content-Type: multipart/form-data
```

Parameters:

```
file: <audio file>       (required)
export_segments: false   (bool)
sync: false              (bool)
```

Same response shape as /identify, without matched_speaker_id / matched_name.

Segment audio download

```
GET /segments/{job_id}/audio/{segment_id}
```

Response: WAV audio file (Content-Type: audio/wav).

Job status

```
GET /jobs/{job_id}   # Get job details + segments
GET /jobs            # List all jobs (offset, limit)
```

Async polling pattern:

  1. POST /identify → 202 with job_id
  2. GET /jobs/{job_id} → poll until status == "DONE" or "FAILED"
  3. GET /segments/{job_id}/audio/{segment_id} → download clips
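
The polling loop can be sketched around an injected `fetch_job` callable (a hypothetical stand-in for the GET /jobs/{job_id} HTTP call; the function name and signature are assumptions):

```python
import time
from typing import Callable

TERMINAL = {"DONE", "FAILED"}

def wait_for_job(fetch_job: Callable[[], dict],
                 interval: float = 2.0,
                 timeout: float = 600.0) -> dict:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()  # e.g. GET /jobs/{job_id} -> the "data" dict
        if job["status"] in TERMINAL:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        time.sleep(interval)
```

Injecting the fetcher keeps the loop testable without a live server; in practice it would wrap an HTTP GET and unwrap the ApiResponse envelope.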

Database schema

speakers

```
id: UUID (PK)
name: VARCHAR (indexed)
metadata_: JSONB
sample_count: INTEGER
created_at: TIMESTAMP
updated_at: TIMESTAMP
```

speaker_embeddings

```
id: UUID (PK)
speaker_id: UUID (FK → speakers.id, CASCADE)
embedding: BYTEA (serialized float32[192])
dim: INTEGER (default: 192)
duration_sec: FLOAT
source_path: VARCHAR
created_at: TIMESTAMP
```

jobs

```
id: UUID (PK)
type: ENUM (IDENTIFY, DIARIZE)
status: ENUM (PENDING, RUNNING, DONE, FAILED)
input_path: VARCHAR
duration_sec: FLOAT
num_speakers: INTEGER
diarizer_backend: VARCHAR
error: TEXT
created_at: TIMESTAMP
completed_at: TIMESTAMP
```

segments

```
id: UUID (PK)
job_id: UUID (FK → jobs.id, CASCADE)
diarization_label: VARCHAR
matched_speaker_id: UUID (FK → speakers.id, SET NULL)
matched_name: VARCHAR
start_sec: FLOAT
end_sec: FLOAT
confidence: FLOAT
audio_path: VARCHAR
created_at: TIMESTAMP
```

Configuration

Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| POSTGRES_USER | voicebio | Database user |
| POSTGRES_PASSWORD | voicebio | Database password |
| POSTGRES_DB | voicebio | Database name |
| POSTGRES_HOST | postgres | Database host |
| POSTGRES_PORT | 8065 | Database port |
| HF_TOKEN | (none) | HuggingFace token (required for pyannote) |
| DEVICE | autodetect | cuda:0, cpu, etc. |
| SIMILARITY_THRESHOLD | 0.75 | Speaker match threshold |
| MIN_ENROLLMENT_SECONDS | 10 | Minimum enrollment audio duration |
| MAX_UPLOAD_MB | 500 | Max file upload size |
| STORAGE_DIR | /storage | Root storage path |
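
These settings are read once at startup; a minimal environment-based sketch (the real service uses Pydantic settings per config.py, so this only illustrates the fallback defaults above):

```python
import os

def get_settings() -> dict:
    """Read configuration from the environment, using the documented defaults."""
    return {
        "similarity_threshold": float(os.environ.get("SIMILARITY_THRESHOLD", "0.75")),
        "min_enrollment_seconds": float(os.environ.get("MIN_ENROLLMENT_SECONDS", "10")),
        "max_upload_mb": int(os.environ.get("MAX_UPLOAD_MB", "500")),
        "storage_dir": os.environ.get("STORAGE_DIR", "/storage"),
    }
```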

Storage layout

```
/storage/
  enrollments/{speaker_id}/   # Enrolled speaker audio files
  jobs/{job_id}/              # Uploaded audio + extracted segments
  models/                     # HuggingFace + Torch model cache
```

Key dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| torch | ≥ 2.4.0 | Deep learning framework |
| torchaudio | ≥ 2.4.0 | Audio I/O |
| speechbrain | ≥ 1.0.0 | ECAPA-TDNN embeddings |
| pyannote.audio | ≥ 3.1.0, < 4.0 | Speaker diarization |
| numpy | ≥ 1.26.0, < 2.0 | Numeric computing |
| sqlalchemy[asyncio] | ≥ 2.0.0 | ORM (async) |
| asyncpg | ≥ 0.29.0 | PostgreSQL driver |
| alembic | ≥ 1.13.0 | Database migrations |
| scikit-learn | ≥ 1.4.0 | Agglomerative clustering (fallback) |
| fastapi | ≥ 0.111.0 | Web framework |

:::caution Known constraints

  • numpy < 2.0 required (pyannote uses the removed np.NaN alias)
  • huggingface_hub < 0.24 required (pyannote uses the deprecated use_auth_token parameter)
  • torch >= 2.4 required (SpeechBrain ≥ 1.0 uses torch.amp.custom_fwd)

:::

Use cases

  • Voice-based identity verification / authentication
  • Meeting transcription with speaker labels
  • Call center analytics — identify repeat callers
  • Media content indexing — who said what in recordings
  • Security and access control

Roadmap

  • Health signal detection (vocal biomarkers: stress, fatigue)
  • Age / gender estimation from voice
  • Real-time identification during live conversations