Quick start
Five paths into Voice SDK — pick the one that matches what you're trying to build.
1. Talk to the Voice Agent
Open: https://voiceai.trouve.works/
Click: Start
Speak: "Hello"
The agent runs Whisper Large V3 → Qwen2.5-7B → Kokoro over LiveKit's WebRTC layer. Sub-second latency, voice activity detection, preemptive generation, and tool calling are on by default.
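The VAD, preemptive generation, and the rest of the agent stack run server-side, so there is nothing to wire up for the demo. For intuition only, here is a minimal energy-gate voice activity detector, the simplest form of the technique; the threshold and frame size are illustrative, not the agent's actual parameters:

```python
import math

def rms(frame):
    """Root-mean-square energy of one PCM frame (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_speech(frames, threshold=0.02):
    """Mark each frame as speech (True) or silence (False)."""
    return [rms(f) > threshold for f in frames]

# One 20 ms frame of silence, one of a 440 Hz tone, at 16 kHz.
silence = [0.0] * 320
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
print(detect_speech([silence, tone, silence]))  # [False, True, False]
```

Production VADs use trained models rather than a fixed energy threshold, but the contract is the same: a boolean per frame that gates when audio is forwarded to the STT stage.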
2. Transcribe an audio file
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://voiceai.trouve.works/services/v1",
    api_key="not-needed",
)

transcript = client.audio.transcriptions.create(
    model="stt-1",
    file=open("audio.wav", "rb"),
    response_format="json",
)
print(transcript.text)
```
The endpoint is OpenAI-compatible. When running the STT test adapter instead of Speaches, replace `stt-1` with any Hugging Face model name.
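If you don't have a recording handy, Python's standard library can generate an `audio.wav` to send. It's a pure tone, so the transcript will be empty or nonsense; the point is just to exercise the endpoint:

```python
import math
import struct
import wave

# Write a one-second 440 Hz test tone: 16-bit mono PCM at 16 kHz.
with wave.open("audio.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 2 bytes = 16-bit samples
    w.setframerate(16000)
    samples = (int(20000 * math.sin(2 * math.pi * 440 * n / 16000))
               for n in range(16000))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))
```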
3. Synthesize speech
```python
# Reuses the client from step 2.
audio = client.audio.speech.create(
    model="tts-1",
    voice="af_heart",
    input="Voice SDK runs at sub-second latency on Tier 1 hardware.",
    response_format="mp3",
)
audio.stream_to_file("out.mp3")
```
`tts-1` resolves to Kokoro (English). For Arabic, with optional voice cloning, use the direct `/tts/arabic_xtts` endpoint.
4. Denoise a recording
```bash
# The upload field name is assumed to be "file" here.
curl -X POST \
  -F "file=@noisy.wav" \
  -F "output_format=wav" \
  -F "restore_sample_rate=true" \
  https://voiceai.trouve.works/noise/api/denoise/file \
  -o cleaned.wav
```
For real-time denoising — microphone in, clean audio out — connect a WebSocket to wss://voiceai.trouve.works/noise/ws/denoise and stream Float32 LE PCM. End-to-end latency is ~150 ms.
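Browsers' Web Audio API already hands you Float32 samples, but if your capture path yields 16-bit PCM you need to convert before streaming. A minimal converter, pure stdlib, assuming nothing beyond the Float32 LE format stated above:

```python
import struct

def int16_to_float32_le(pcm16: bytes) -> bytes:
    """Convert 16-bit little-endian PCM to Float32 LE in [-1.0, 1.0)."""
    n = len(pcm16) // 2
    ints = struct.unpack(f"<{n}h", pcm16)
    return struct.pack(f"<{n}f", *(s / 32768.0 for s in ints))

# A two-sample frame: full-scale negative, then zero.
frame = struct.pack("<2h", -32768, 0)
print(struct.unpack("<2f", int16_to_float32_le(frame)))  # (-1.0, 0.0)
```

Send the returned bytes over the WebSocket in fixed-size chunks; the denoiser streams cleaned Float32 frames back on the same connection.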
5. Identify a speaker
```bash
# Step 1 — enroll
curl -X POST \
  -F "name=Alice" \
  -F "files=@alice_sample1.wav" \
  -F "files=@alice_sample2.wav" \
  https://voiceai.trouve.works/biometric/api/speakers/enroll

# Step 2 — identify (upload field name assumed to match enroll)
curl -X POST \
  -F "sync=true" \
  -F "files=@meeting.wav" \
  https://voiceai.trouve.works/biometric/api/identify
```
Every result comes back wrapped in an `ApiResponse<T>` envelope containing diarized segments, matched speaker names, confidence scores, and per-segment audio download URLs.
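The exact key names aren't shown here, so the payload below is hypothetical, built only from the fields listed above; adapt it to the real response. Filtering segments by confidence might look like:

```python
# Hypothetical ApiResponse<T> shape; real key names may differ.
response = {
    "success": True,
    "data": {
        "segments": [
            {"start": 0.0, "end": 4.2, "name": "Alice", "confidence": 0.91},
            {"start": 4.2, "end": 7.8, "name": "Bob", "confidence": 0.34},
        ]
    },
}

def confident_speakers(resp, threshold=0.5):
    """Keep names only for segments whose match clears the threshold."""
    return [s["name"] for s in resp["data"]["segments"]
            if s["confidence"] >= threshold]

print(confident_speakers(response))  # ['Alice']
```

Low-confidence matches are worth keeping around rather than discarding outright: the per-segment audio download URLs let you pull the clip and re-enroll the speaker with more samples.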
Where to next
- Architecture — how the modules connect
- Modules — per-module deep dives with full API references
- Use cases — common module compositions (call center AI, meeting transcription, voice auth)