Voice Utilities

Speech-to-text and text-to-speech services, available as APIs and through the VoiceLab web interface ("AI Speech Studio"). Three components: the VoiceLab frontend, the Speaches containerized backend, and the STT test adapter for arbitrary HuggingFace models.

Live at voiceai.trouve.works/utilities/.

Speech-to-text

Capability                  Details
Multilingual transcription  Whisper Large V3, 90+ languages
Arabic-specialized          Wav2Vec2 XLSR-53 fine-tuned for Arabic
Flexible model support      HuggingFace adapter loads any ASR model dynamically
Response formats            text, json, verbose_json (with timestamps and segments)
OpenAI-compatible API       Drop-in replacement for OpenAI's audio transcription endpoint

Text-to-speech

Capability                     Details
English TTS                    Kokoro pipeline — fast, natural-sounding synthesis
Arabic TTS with voice cloning  XTTS v2 — clone a voice from a short reference clip
Multiple voices                Selectable presets with speed control (0.5× to 2.0×)
Output formats                 MP3, WAV, Opus
OpenAI-compatible API          Drop-in replacement for OpenAI's audio speech endpoint

VoiceLab web interface

A feature-rich web UI ("AI Speech Studio") for interacting with STT and TTS services without writing code:

  • STT panel — record from microphone or upload audio files; real-time waveform visualization and VU meters; transcription with copy-to-clipboard
  • TTS panel — enter text (up to 5,000 characters), select voice and speed, choose output format, preview and download
  • Configurable — point at different backend URLs and models

Project structure

VoiceUtilities/ # Frontend + monolithic backend
├── speech_server.py # FastAPI backend (port 8112) — all models
├── serve.py # Static file server
├── static/index.html # Single-page VoiceLab UI
├── english_stt.py / arabic_stt.py # Standalone STT test scripts
├── english_tts.py / arabic_tts.py # Standalone TTS test scripts
└── voice.wav / voice_ali.wav # Reference voices for cloning

speaches/ # Containerized speech backend
├── compose.yaml # Docker Compose (CPU)
├── compose.cuda.yaml # Docker Compose (GPU)
├── .env # root_path, TTL
└── model_aliases.json # OpenAI model name → HuggingFace mapping

stt_test/ # HuggingFace model adapter
├── server.py # Multi-strategy ASR loader (port 8000)
├── example_usage.py # Client examples
└── requirements.txt

Speaches backend

The production speech service, deployed as a Docker container with an OpenAI-compatible API for both STT and TTS.

Model alias mapping (model_aliases.json)

{
  "tts-1": "speaches-ai/Kokoro-82M-v1.0-ONNX",
  "tts-1-hd": "speaches-ai/Kokoro-82M-v1.0-ONNX",
  "stt-1": "Systran/faster-whisper-large-v3"
}

Clients use standard OpenAI model names (tts-1, stt-1); the backend resolves them to self-hosted HuggingFace models.
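
A minimal sketch of what that resolution amounts to (illustrative only; resolve_model is a hypothetical helper, not part of speaches):

import json

def resolve_model(name: str, aliases_path: str = "model_aliases.json") -> str:
    """Map an OpenAI-style model name to its self-hosted HuggingFace equivalent."""
    with open(aliases_path) as f:
        aliases = json.load(f)
    # Names with no alias pass through unchanged, so raw HF model IDs still work.
    return aliases.get(name, name)

print(resolve_model("tts-1"))  # speaches-ai/Kokoro-82M-v1.0-ONNX
print(resolve_model("stt-1"))  # Systran/faster-whisper-large-v3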

Environment configuration

WHISPER__TTL=-1 # Model cache TTL — keep loaded forever
UVICORN_ROOT_PATH=/services # API path prefix (for nginx proxy)

API reference (OpenAI-compatible)

Base URL: https://voiceai.trouve.works/services/v1

Transcription

POST /audio/transcriptions
Content-Type: multipart/form-data

Parameters:
  file: <audio file> (required)
  model: "stt-1" (optional, default from config)
  language: "en" (optional; auto-detected if omitted)
  response_format: "json" (json | text | verbose_json)
  temperature: 0.0 (optional)

Response (json):

{ "text": "transcribed text here" }

Response (verbose_json):

{
  "task": "transcribe",
  "language": "en",
  "duration": 12.543,
  "text": "...",
  "segments": [
    {"id": 0, "start": 0.0, "end": 5.2, "text": "..."},
    {"id": 1, "start": 5.2, "end": 12.5, "text": "..."}
  ]
}
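
For example, calling the endpoint from Python with the requests library (a sketch: audio.wav is a placeholder filename, and no auth header is shown because the docs above don't specify one):

import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "https://voiceai.trouve.works/services/v1/audio/transcriptions",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={"model": "stt-1", "response_format": "verbose_json"},
    )
resp.raise_for_status()
result = resp.json()
print(result["text"])
for seg in result["segments"]:
    # Each segment carries start/end timestamps in seconds.
    print(f'[{seg["start"]:.1f}-{seg["end"]:.1f}] {seg["text"]}')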

Speech synthesis

POST /audio/speech
Content-Type: application/json

{
  "model": "tts-1",
  "input": "Text to synthesize into speech",
  "voice": "af_heart",
  "speed": 1.0,
  "response_format": "mp3"
}

Response is binary audio; the Content-Type is audio/mpeg, audio/wav, or audio/ogg, matching the requested response_format.
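
A matching synthesis call from Python (a sketch using the requests library; the output filename is a placeholder):

import requests

resp = requests.post(
    "https://voiceai.trouve.works/services/v1/audio/speech",
    json={
        "model": "tts-1",
        "input": "Text to synthesize into speech",
        "voice": "af_heart",
        "speed": 1.0,
        "response_format": "mp3",
    },
)
resp.raise_for_status()
# The body is raw audio bytes; write them straight to disk.
with open("speech.mp3", "wb") as out:
    out.write(resp.content)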

Direct backend endpoints (speech_server.py)

Exposed by the monolithic backend for direct model access:

Endpoint              Method  Input                                                 Output
/tts/arabic_xtts      POST    Form: text, language (ar/en), voice_file (optional)  WAV audio
/tts/english_kokoro   POST    Form: text, voice, speed                             WAV audio
/stt/arabic_wav2vec2  POST    File: audio                                          {"transcription": "..."}
/stt/whisper          POST    File: audio                                          {"transcription": "..."}
/health               GET     (none)                                               {"status": "healthy"}
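
For example, Arabic synthesis with a cloned voice (a sketch: it assumes the backend is reachable at localhost:8112, its port per the project structure above, uses the bundled voice.wav reference clip, and the Arabic sample text is a placeholder):

import requests

with open("voice.wav", "rb") as ref:
    resp = requests.post(
        "http://localhost:8112/tts/arabic_xtts",
        data={"text": "مرحبا بكم", "language": "ar"},
        # voice_file is optional; omit it to use the default reference voice.
        files={"voice_file": ("voice.wav", ref, "audio/wav")},
    )
resp.raise_for_status()
with open("cloned.wav", "wb") as out:
    out.write(resp.content)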

STT test adapter

A flexible OpenAI-compatible STT wrapper that loads any HuggingFace ASR model using a multi-strategy backend chain:

Priority  Strategy                    Models supported
1         PipelineBackend             Standard models (Whisper, Wav2Vec2, HuBERT)
2         TrustRemotePipelineBackend  Models with custom code
3         AutoModelForSpeechSeq2Seq   Seq2Seq models (Qwen3-ASR, newer Whisper forks)
4         AutoModelForCTC             CTC-based models (Wav2Vec2, HuBERT fallback)
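
The chain can be approximated like this (a simplified sketch of the fallback idea, not the actual server.py code; error handling is reduced to bare try/except):

from transformers import (
    AutoModelForCTC,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

def load_asr(model_id: str, device: str = "cpu"):
    """Try each loading strategy in priority order; return the first that succeeds."""
    # 1. Standard ASR pipeline (Whisper, Wav2Vec2, HuBERT).
    try:
        return pipeline("automatic-speech-recognition", model=model_id, device=device)
    except Exception:
        pass
    # 2. Same pipeline, but allow models that ship custom code.
    try:
        return pipeline("automatic-speech-recognition", model=model_id,
                        device=device, trust_remote_code=True)
    except Exception:
        pass
    # 3. Raw seq2seq head for encoder-decoder models.
    try:
        return (AutoProcessor.from_pretrained(model_id),
                AutoModelForSpeechSeq2Seq.from_pretrained(model_id))
    except Exception:
        pass
    # 4. CTC head as the last resort.
    return (AutoProcessor.from_pretrained(model_id),
            AutoModelForCTC.from_pretrained(model_id))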

Usage

python server.py --model openai/whisper-large-v3 --port 8000
python server.py --model Qwen/Qwen3-ASR-1.7B --port 8000
python server.py --model facebook/wav2vec2-large-960h --port 8000
python server.py --model nvidia/canary-1b --device cuda:0

API endpoints

Endpoint                  Method  Purpose
/health                   GET     Health check
/v1/models                GET     List loaded models
/v1/audio/transcriptions  POST    Transcribe audio (OpenAI-compatible)

Client example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

transcript = client.audio.transcriptions.create(
    model="openai/whisper-large-v3",
    file=open("audio.wav", "rb"),
    response_format="json",
)
print(transcript.text)

VoiceLab frontend

A single-file web application with embedded CSS and JavaScript: vanilla JS with no build step, using HTML5 Canvas and the Web Audio API for waveform visualization and metering.

Setting          Default
Base URL         https://voiceai.trouve.works/services/v1
STT model        stt-1
TTS model        tts-1
TTS voice        af_heart
TTS speed        1.0
Character limit  5,000

Roadmap

  • Timestamp-based transcriptions (word-level timing)
  • Additional language-specific models (especially Arabic enhancements)
  • Voice cloning improvements
  • Broader language coverage