Quick start
Five paths into Voice SDK — pick the one that matches what you're trying to build.
1. Talk to the Voice Agent
Open: https://voiceai.trouve.works/
Click: Start
Speak: "Hello"
The agent runs Whisper Large V3 → Qwen2.5-7B → Kokoro over LiveKit's WebRTC layer. Sub-second latency, voice activity detection, preemptive generation, and tool calling are on by default.
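The VAD, preemptive generation, and the rest of the agent stack run server-side, so there is nothing to wire up for the demo. For intuition only, here is a minimal energy-gate voice activity detector, the simplest form of the technique; the threshold and frame size are illustrative, not the agent's actual parameters:

```python
import math

def rms(frame):
    """Root-mean-square energy of one PCM frame (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_speech(frames, threshold=0.02):
    """Mark each frame as speech (True) or silence (False)."""
    return [rms(f) > threshold for f in frames]

# One 20 ms frame of silence, one of a 440 Hz tone, at 16 kHz.
silence = [0.0] * 320
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
print(detect_speech([silence, tone, silence]))  # [False, True, False]
```

Production VADs use trained models rather than a fixed energy threshold, but the contract is the same: a boolean per frame that gates when audio is forwarded to the STT stage.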
2. Transcribe an audio file
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://voiceai.trouve.works/services/v1",
    api_key="not-needed",
)

transcript = client.audio.transcriptions.create(
    model="stt-1",
    file=open("audio.wav", "rb"),
    response_format="json",
)
print(transcript.text)
```
The endpoint is OpenAI-compatible. When running the STT test adapter instead of Speaches, replace `stt-1` with any Hugging Face model name.
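If you don't have a recording handy, Python's standard library can generate an `audio.wav` to send. It's a pure tone, so the transcript will be empty or nonsense; the point is just to exercise the endpoint:

```python
import math
import struct
import wave

# Write a one-second 440 Hz test tone: 16-bit mono PCM at 16 kHz.
with wave.open("audio.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)   # 2 bytes = 16-bit samples
    w.setframerate(16000)
    samples = (int(20000 * math.sin(2 * math.pi * 440 * n / 16000))
               for n in range(16000))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))
```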
3. Synthesize speech
```python
# Reuses the client from step 2.
audio = client.audio.speech.create(
    model="tts-1",
    voice="af_heart",
    input="Voice SDK runs at sub-second latency on Tier 1 hardware.",
    response_format="mp3",
)
audio.stream_to_file("out.mp3")
```
`tts-1` resolves to Kokoro (English). For Arabic, with optional voice cloning, use the direct `/tts/arabic_xtts` endpoint.
4. Denoise a recording
```bash
# The upload field name is assumed to be "file" here.
curl -X POST \
  -F "file=@noisy.wav" \
  -F "output_format=wav" \
  -F "restore_sample_rate=true" \
  https://voiceai.trouve.works/noise/api/denoise/file \
  -o cleaned.wav
```
For real-time denoising — microphone in, clean audio out — connect a WebSocket to wss://voiceai.trouve.works/noise/ws/denoise and stream Float32 LE PCM. End-to-end latency is ~150 ms.
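Browsers' Web Audio API already hands you Float32 samples, but if your capture path yields 16-bit PCM you need to convert before streaming. A minimal converter, pure stdlib, assuming nothing beyond the Float32 LE format stated above:

```python
import struct

def int16_to_float32_le(pcm16: bytes) -> bytes:
    """Convert 16-bit little-endian PCM to Float32 LE in [-1.0, 1.0)."""
    n = len(pcm16) // 2
    ints = struct.unpack(f"<{n}h", pcm16)
    return struct.pack(f"<{n}f", *(s / 32768.0 for s in ints))

# A two-sample frame: full-scale negative, then zero.
frame = struct.pack("<2h", -32768, 0)
print(struct.unpack("<2f", int16_to_float32_le(frame)))  # (-1.0, 0.0)
```

Send the returned bytes over the WebSocket in fixed-size chunks; the denoiser streams cleaned Float32 frames back on the same connection.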
5. Identify a speaker
```bash
# Step 1 — enroll
curl -X POST \
  -F "name=Alice" \
  -F "files=@alice_sample1.wav" \
  -F "files=@alice_sample2.wav" \
  https://voiceai.trouve.works/biometric/api/speakers/enroll

# Step 2 — identify (upload field name assumed to match enroll)
curl -X POST \
  -F "sync=true" \
  -F "files=@meeting.wav" \
  https://voiceai.trouve.works/biometric/api/identify
```
Every result comes back wrapped in an `ApiResponse<T>` envelope containing diarized segments, matched speaker names, confidence scores, and per-segment audio download URLs.
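The exact key names aren't shown here, so the payload below is hypothetical, built only from the fields listed above; adapt it to the real response. Filtering segments by confidence might look like:

```python
# Hypothetical ApiResponse<T> shape; real key names may differ.
response = {
    "success": True,
    "data": {
        "segments": [
            {"start": 0.0, "end": 4.2, "name": "Alice", "confidence": 0.91},
            {"start": 4.2, "end": 7.8, "name": "Bob", "confidence": 0.34},
        ]
    },
}

def confident_speakers(resp, threshold=0.5):
    """Keep names only for segments whose match clears the threshold."""
    return [s["name"] for s in resp["data"]["segments"]
            if s["confidence"] >= threshold]

print(confident_speakers(response))  # ['Alice']
```

Low-confidence matches are worth keeping around rather than discarding outright: the per-segment audio download URLs let you pull the clip and re-enroll the speaker with more samples.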
Where to next
- Architecture — how the modules connect
- Modules — per-module deep dives with full API references
- Use cases — common module compositions (call center AI, meeting transcription, voice auth)