A containerized, end-to-end voice platform for building intelligent audio systems anywhere — transcription, synthesis, denoising, biometrics, and conversational AI in one self-hosted stack.
Every voice workload follows the same shape. Voice SDK provides production-grade primitives at each stage — and lets you wire only the ones you need.
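That shape (for example: denoise, then transcribe, then reason, then synthesize) can be sketched as plain function composition. The stage functions below are hypothetical stand-ins for illustration, not VoiceSDK APIs:

```python
from typing import Callable

Stage = Callable[[bytes], bytes]  # each stage transforms the payload in turn

def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right; wire only the ones you need."""
    def run(payload: bytes) -> bytes:
        for stage in stages:
            payload = stage(payload)
        return payload
    return run

# Hypothetical stand-ins for two modules:
def denoise(audio: bytes) -> bytes:
    return audio.replace(b"noise+", b"")

def transcribe(audio: bytes) -> bytes:
    return b"transcript of " + audio

workflow = pipeline(denoise, transcribe)
print(workflow(b"noise+speech"))  # b'transcript of speech'
```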
Each module runs as a standalone service with its own REST or WebSocket API, or composes into the unified VoiceSDK platform with a single web interface.
Module 01 · Conversational Voice AI: End-to-end conversational voice AI over WebRTC. STT → LLM → TTS, with VAD, preemptive generation, and tool calling.
Module 02 · Voice Utilities: Self-hosted transcription and speech synthesis. Whisper Large V3, Kokoro, and XTTS v2 with voice cloning. 90+ languages.
Module 03 · Noise Suppression: DeepFilterNet3 audio cleanup. File mode for podcasts and recordings; real-time WebSocket mode at ~150 ms latency.
Module 04 · Voice Biometrics: Speaker identification and diarization. ECAPA-TDNN embeddings, pyannote 3.1 segmentation, and an async job pipeline.
Every module ships as a Docker container. Deploy locally, in the cloud, or at the edge with zero environment friction.
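A single-module deployment might look like the following sketch; the image name and port here are hypothetical, so check each module's docs for the published tags:

```shell
# Hypothetical image name and port; see the module's docs for real values
docker run --gpus all -p 8080:8080 voicesdk/noise-suppression:latest
```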
No data leaves your infrastructure. All models — Whisper, Kokoro, XTTS, Qwen, ECAPA, pyannote — run on your hardware.
NVIDIA CUDA 12.x throughout. Sub-second voice agent loop, 14.2 ms file denoising, 150 ms real-time streaming.
Drop-in replacement for OpenAI audio transcription and speech endpoints. Migrate without rewriting client code.
Point your existing OpenAI client at voiceai.trouve.works and transcribe, synthesize, denoise, or identify speakers — without rewriting a line of integration code.
from openai import OpenAI

client = OpenAI(
    base_url="https://voiceai.trouve.works/services/v1",
    api_key="not-needed",  # placeholder — no key required for self-hosted deployments
)

# Speech-to-text — OpenAI-compatible, self-hosted
with open("call.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="stt-1",
        file=audio,
    )

print(transcript.text)
$ pip install openai
$ python quickstart.py
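Synthesis flows through the same gateway. Using only the standard library, a request against the OpenAI-style /audio/speech path can be built as below; note that "tts-1" and "alloy" mirror OpenAI's names and are assumptions here, since the self-hosted deployment may expose different model and voice identifiers:

```python
import json
import urllib.request

BASE_URL = "https://voiceai.trouve.works/services/v1"

def speech_request(text: str, model: str = "tts-1", voice: str = "alloy"):
    """Build a POST against the OpenAI-style /audio/speech endpoint.

    "tts-1" and "alloy" mirror OpenAI's names; your deployment may
    expose different model and voice identifiers.
    """
    body = json.dumps({"model": model, "voice": voice, "input": text}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with urllib.request.urlopen(...) and write the binary body to an
# audio file; the OpenAI client's client.audio.speech.create works the same way.
req = speech_request("Hello from a self-hosted stack.")
print(req.get_method(), req.full_url)
```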