# Voice Utilities
Speech-to-text (STT) and text-to-speech (TTS) services, available both as APIs and through the VoiceLab web interface ("AI Speech Studio"). The project has three components: the VoiceLab frontend, the Speaches containerized backend, and an STT test adapter that can load arbitrary HuggingFace models.

Live at [voiceai.trouve.works/utilities/](https://voiceai.trouve.works/utilities/).
## Speech-to-text
| Capability | Details |
|---|---|
| Multilingual transcription | Whisper Large V3, 90+ languages |
| Arabic-specialized | Wav2Vec2 XLSR-53 fine-tuned for Arabic |
| Flexible model support | HuggingFace adapter loads any ASR model dynamically |
| Response formats | text, json, verbose_json (with timestamps and segments) |
| OpenAI-compatible API | Drop-in replacement for OpenAI's audio transcription endpoint |
## Text-to-speech
| Capability | Details |
|---|---|
| English TTS | Kokoro pipeline — fast, natural-sounding synthesis |
| Arabic TTS with voice cloning | XTTS v2 — clone a voice from a short reference clip |
| Multiple voices | Selectable presets with speed control (0.5× to 2.0×) |
| Output formats | MP3, WAV, Opus |
| OpenAI-compatible API | Drop-in replacement for OpenAI's audio speech endpoint |
## VoiceLab web interface
A feature-rich web UI ("AI Speech Studio") for interacting with STT and TTS services without writing code:
- STT panel — record from microphone or upload audio files; real-time waveform visualization and VU meters; transcription with copy-to-clipboard
- TTS panel — enter text (up to 5,000 characters), select voice and speed, choose output format, preview and download
- Configurable — point at different backend URLs and models
## Project structure
```
VoiceUtilities/                      # Frontend + monolithic backend
├── speech_server.py                 # FastAPI backend (port 8112) — all models
├── serve.py                         # Static file server
├── static/index.html                # Single-page VoiceLab UI
├── english_stt.py / arabic_stt.py   # Standalone STT test scripts
├── english_tts.py / arabic_tts.py   # Standalone TTS test scripts
└── voice.wav / voice_ali.wav        # Reference voices for cloning

speaches/                            # Containerized speech backend
├── compose.yaml                     # Docker Compose (CPU)
├── compose.cuda.yaml                # Docker Compose (GPU)
├── .env                             # root_path, TTL
└── model_aliases.json               # OpenAI model name → HuggingFace mapping

stt_test/                            # HuggingFace model adapter
├── server.py                        # Multi-strategy ASR loader (port 8000)
├── example_usage.py                 # Client examples
└── requirements.txt
```
## Speaches backend
The production speech service, deployed as a Docker container with an OpenAI-compatible API for both STT and TTS.
### Model alias mapping (`model_aliases.json`)
```json
{
  "tts-1": "speaches-ai/Kokoro-82M-v1.0-ONNX",
  "tts-1-hd": "speaches-ai/Kokoro-82M-v1.0-ONNX",
  "stt-1": "Systran/faster-whisper-large-v3"
}
```
Clients use standard OpenAI model names (`tts-1`, `stt-1`); the backend resolves them to the self-hosted HuggingFace models above.
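A minimal sketch of the resolution step, assuming a plain dictionary lookup that passes unknown names through unchanged (the actual Speaches implementation may differ):

```python
import json

# Assumed alias-resolution logic; illustrative, not the real Speaches code.
with open("model_aliases.json") as f:
    ALIASES = json.load(f)

def resolve_model(name: str) -> str:
    # Unknown names pass through unchanged, so full HuggingFace IDs still work.
    return ALIASES.get(name, name)

print(resolve_model("tts-1"))  # speaches-ai/Kokoro-82M-v1.0-ONNX
```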
### Environment configuration
```
WHISPER__TTL=-1              # Model cache TTL — keep models loaded forever
UVICORN_ROOT_PATH=/services  # API path prefix (for the nginx proxy)
```
### API reference (OpenAI-compatible)

Base URL: `https://voiceai.trouve.works/services/v1`
#### Transcription
```
POST /audio/transcriptions
Content-Type: multipart/form-data

Parameters:
  file:            <audio file>   (required)
  model:           "stt-1"        (optional, default from config)
  language:        "en"           (optional, auto-detect)
  response_format: "json"         (json | text | verbose_json)
  temperature:     0.0            (optional)
```
Response (`json`):

```json
{ "text": "transcribed text here" }
```

Response (`verbose_json`):

```json
{
  "task": "transcribe",
  "language": "en",
  "duration": 12.543,
  "text": "...",
  "segments": [
    {"id": 0, "start": 0.0, "end": 5.2, "text": "..."},
    {"id": 1, "start": 5.2, "end": 12.5, "text": "..."}
  ]
}
```
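As a sketch, the endpoint can be driven with the official `openai` Python package pointed at the base URL above, with a placeholder API key (mirroring the client example later in this document); `verbose_json` exposes the segment timestamps:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://voiceai.trouve.works/services/v1",
    api_key="not-needed",  # placeholder; the self-hosted backend is keyless
)

# Request segment-level timestamps with verbose_json.
with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="stt-1",
        file=f,
        response_format="verbose_json",
    )

for seg in result.segments:
    print(f"[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}")
```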
#### Speech synthesis
```
POST /audio/speech
Content-Type: application/json

{
  "model": "tts-1",
  "input": "Text to synthesize into speech",
  "voice": "af_heart",
  "speed": 1.0,
  "response_format": "mp3"
}
```
The response is binary audio with `Content-Type: audio/mpeg`, `audio/wav`, or `audio/ogg`, matching the requested format.
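A matching synthesis sketch with the same client, under the same assumptions about base URL and key:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://voiceai.trouve.works/services/v1",
    api_key="not-needed",  # placeholder key, as in the transcription sketch
)

# "tts-1" resolves to the self-hosted Kokoro model via model_aliases.json.
audio = client.audio.speech.create(
    model="tts-1",
    voice="af_heart",
    input="Text to synthesize into speech",
    speed=1.0,
    response_format="mp3",
)

with open("speech.mp3", "wb") as f:
    f.write(audio.read())
```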
### Direct backend endpoints (`speech_server.py`)
Exposed by the monolithic backend for direct model access:
| Endpoint | Method | Input | Output |
|---|---|---|---|
| `/tts/arabic_xtts` | POST | Form: `text`, `language` (ar/en), `voice_file` (optional) | WAV audio |
| `/tts/english_kokoro` | POST | Form: `text`, `voice`, `speed` | WAV audio |
| `/stt/arabic_wav2vec2` | POST | File: `audio` | `{"transcription": "..."}` |
| `/stt/whisper` | POST | File: `audio` | `{"transcription": "..."}` |
| `/health` | GET | – | `{"status": "healthy"}` |
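For illustration, a direct call to the Kokoro endpoint with `requests`, assuming `speech_server.py` is running locally on its default port 8112:

```python
import requests

# Form-encoded request to the monolithic backend's Kokoro TTS endpoint.
resp = requests.post(
    "http://localhost:8112/tts/english_kokoro",
    data={"text": "Hello from Kokoro", "voice": "af_heart", "speed": "1.0"},
)
resp.raise_for_status()

with open("kokoro.wav", "wb") as f:
    f.write(resp.content)
```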
## STT test adapter
A flexible OpenAI-compatible STT wrapper that loads any HuggingFace ASR model using a multi-strategy backend chain:
| Priority | Strategy | Models supported |
|---|---|---|
| 1 | PipelineBackend | Standard models (Whisper, Wav2Vec2, HuBERT) |
| 2 | TrustRemotePipelineBackend | Models with custom code |
| 3 | AutoModelForSpeechSeq2Seq | Seq2Seq models (Qwen3-ASR, newer Whisper forks) |
| 4 | AutoModelForCTC | CTC-based models (Wav2Vec2, HuBERT fallback) |
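The chain can be pictured as a try-in-order loop. The sketch below is illustrative only; the function and strategy wiring are ours, not the actual `server.py` internals:

```python
from transformers import AutoModelForCTC, AutoModelForSpeechSeq2Seq, pipeline

def load_asr(model_id: str, device: str = "cpu"):
    """Try each loading strategy in priority order; return the first that works."""
    strategies = [
        # 1. Standard pipeline (Whisper, Wav2Vec2, HuBERT)
        lambda: pipeline("automatic-speech-recognition", model=model_id, device=device),
        # 2. Pipeline with trust_remote_code for models shipping custom code
        lambda: pipeline("automatic-speech-recognition", model=model_id,
                         device=device, trust_remote_code=True),
        # 3. Generic seq2seq loading (Qwen3-ASR, newer Whisper forks)
        lambda: AutoModelForSpeechSeq2Seq.from_pretrained(model_id),
        # 4. CTC fallback (Wav2Vec2, HuBERT variants)
        lambda: AutoModelForCTC.from_pretrained(model_id),
    ]
    last_err = None
    for strategy in strategies:
        try:
            return strategy()
        except Exception as exc:  # fall through to the next strategy
            last_err = exc
    raise RuntimeError(f"All strategies failed for {model_id}") from last_err
```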
### Usage
```bash
# Standard pipeline model
python server.py --model openai/whisper-large-v3 --port 8000

# Seq2seq model
python server.py --model Qwen/Qwen3-ASR-1.7B --port 8000

# CTC model
python server.py --model facebook/wav2vec2-large-960h --port 8000

# GPU inference
python server.py --model nvidia/canary-1b --device cuda:0
```
### API endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check |
| `/v1/models` | GET | List loaded models |
| `/v1/audio/transcriptions` | POST | Transcribe audio (OpenAI-compatible) |
### Client example
```python
from openai import OpenAI

# Point the standard OpenAI client at the local adapter; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="json",
    )
print(transcript.text)
```
## VoiceLab frontend
A single-file web application with embedded CSS and JavaScript: vanilla JS with no build process, using HTML5 Canvas and the Web Audio API.
| Setting | Default |
|---|---|
| Base URL | `https://voiceai.trouve.works/services/v1` |
| STT model | `stt-1` |
| TTS model | `tts-1` |
| TTS voice | `af_heart` |
| TTS speed | 1.0 |
| Character limit | 5,000 |
## Roadmap
- Timestamp-based transcriptions (word-level timing)
- Additional language-specific models (especially Arabic enhancements)
- Voice cloning improvements
- Broader language coverage