Voice Utilities

Speech-to-text and text-to-speech services, available as APIs and through the VoiceLab web interface ("AI Speech Studio"). Three components: the VoiceLab frontend, the Speaches containerized backend, and the STT test adapter for arbitrary HuggingFace models.

Live at voiceai.trouve.works/utilities/.

Speech-to-text

Capability                  Details
Multilingual transcription  Whisper Large V3, 90+ languages
Arabic-specialized          Wav2Vec2 XLSR-53 fine-tuned for Arabic
Flexible model support      HuggingFace adapter loads any ASR model dynamically
Response formats            text, json, verbose_json (with timestamps and segments)
OpenAI-compatible API       Drop-in replacement for OpenAI's audio transcription endpoint

Text-to-speech

Capability                     Details
English TTS                    Kokoro pipeline — fast, natural-sounding synthesis
Arabic TTS with voice cloning  XTTS v2 — clone a voice from a short reference clip
Multiple voices                Selectable presets with speed control (0.5× to 2.0×)
Output formats                 MP3, WAV, Opus
OpenAI-compatible API          Drop-in replacement for OpenAI's audio speech endpoint

VoiceLab web interface

A feature-rich web UI ("AI Speech Studio") for interacting with STT and TTS services without writing code:

  • STT panel — record from microphone or upload audio files; real-time waveform visualization and VU meters; transcription with copy-to-clipboard
  • TTS panel — enter text (up to 5,000 characters), select voice and speed, choose output format, preview and download
  • Configurable — point at different backend URLs and models

Project structure

VoiceUtilities/ # Frontend + monolithic backend
├── speech_server.py # FastAPI backend (port 8112) — all models
├── serve.py # Static file server
├── static/index.html # Single-page VoiceLab UI
├── english_stt.py / arabic_stt.py # Standalone STT test scripts
├── english_tts.py / arabic_tts.py # Standalone TTS test scripts
└── voice.wav / voice_ali.wav # Reference voices for cloning

speaches/ # Containerized speech backend
├── compose.yaml # Docker Compose (CPU)
├── compose.cuda.yaml # Docker Compose (GPU)
├── .env # root_path, TTL
└── model_aliases.json # OpenAI model name → HuggingFace mapping

stt_test/ # HuggingFace model adapter
├── server.py # Multi-strategy ASR loader (port 8000)
├── example_usage.py # Client examples
└── requirements.txt

Speaches backend

The production speech service, deployed as a Docker container with an OpenAI-compatible API for both STT and TTS.

Model alias mapping (model_aliases.json)

{
  "tts-1": "speaches-ai/Kokoro-82M-v1.0-ONNX",
  "tts-1-hd": "speaches-ai/Kokoro-82M-v1.0-ONNX",
  "stt-1": "Systran/faster-whisper-large-v3"
}

Clients use standard OpenAI model names (tts-1, stt-1); the backend resolves them to self-hosted HuggingFace models.
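
A minimal sketch of what that resolution amounts to (illustrative only; resolve_model is a hypothetical helper, not part of speaches):

import json

def resolve_model(name: str, aliases_path: str = "model_aliases.json") -> str:
    """Map an OpenAI-style model name to its self-hosted HuggingFace equivalent."""
    with open(aliases_path) as f:
        aliases = json.load(f)
    # Names with no alias pass through unchanged, so raw HF model IDs still work.
    return aliases.get(name, name)

print(resolve_model("tts-1"))  # speaches-ai/Kokoro-82M-v1.0-ONNX
print(resolve_model("stt-1"))  # Systran/faster-whisper-large-v3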

Environment configuration

WHISPER__TTL=-1 # Model cache TTL — keep loaded forever
UVICORN_ROOT_PATH=/services # API path prefix (for nginx proxy)

API reference (OpenAI-compatible)

Base URL: https://voiceai.trouve.works/services/v1

Transcription

POST /audio/transcriptions
Content-Type: multipart/form-data

Parameters:
  file: <audio file> (required)
  model: "stt-1" (optional, default from config)
  language: "en" (optional; auto-detected if omitted)
  response_format: "json" (json | text | verbose_json)
  temperature: 0.0 (optional)

Response (json):

{ "text": "transcribed text here" }

Response (verbose_json):

{
  "task": "transcribe",
  "language": "en",
  "duration": 12.543,
  "text": "...",
  "segments": [
    {"id": 0, "start": 0.0, "end": 5.2, "text": "..."},
    {"id": 1, "start": 5.2, "end": 12.5, "text": "..."}
  ]
}
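
For example, calling the endpoint from Python with the requests library (a sketch: audio.wav is a placeholder filename, and no auth header is shown because the docs above don't specify one):

import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "https://voiceai.trouve.works/services/v1/audio/transcriptions",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={"model": "stt-1", "response_format": "verbose_json"},
    )
resp.raise_for_status()
result = resp.json()
print(result["text"])
for seg in result["segments"]:
    # Each segment carries start/end timestamps in seconds.
    print(f'[{seg["start"]:.1f}-{seg["end"]:.1f}] {seg["text"]}')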

Speech synthesis

POST /audio/speech
Content-Type: application/json

{
  "model": "tts-1",
  "input": "Text to synthesize into speech",
  "voice": "af_heart",
  "speed": 1.0,
  "response_format": "mp3"
}

Response is binary audio; the Content-Type is audio/mpeg, audio/wav, or audio/ogg, matching the requested response_format.
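
A matching synthesis call from Python (a sketch using the requests library; the output filename is a placeholder):

import requests

resp = requests.post(
    "https://voiceai.trouve.works/services/v1/audio/speech",
    json={
        "model": "tts-1",
        "input": "Text to synthesize into speech",
        "voice": "af_heart",
        "speed": 1.0,
        "response_format": "mp3",
    },
)
resp.raise_for_status()
# The body is raw audio bytes; write them straight to disk.
with open("speech.mp3", "wb") as out:
    out.write(resp.content)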

Direct backend endpoints (speech_server.py)

Exposed by the monolithic backend for direct model access:

Endpoint              Method  Input                                                 Output
/tts/arabic_xtts      POST    Form: text, language (ar/en), voice_file (optional)  WAV audio
/tts/english_kokoro   POST    Form: text, voice, speed                             WAV audio
/stt/arabic_wav2vec2  POST    File: audio                                          {"transcription": "..."}
/stt/whisper          POST    File: audio                                          {"transcription": "..."}
/health               GET     (none)                                               {"status": "healthy"}
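
For example, Arabic synthesis with a cloned voice (a sketch: it assumes the backend is reachable at localhost:8112, its port per the project structure above, uses the bundled voice.wav reference clip, and the Arabic sample text is a placeholder):

import requests

with open("voice.wav", "rb") as ref:
    resp = requests.post(
        "http://localhost:8112/tts/arabic_xtts",
        data={"text": "مرحبا بكم", "language": "ar"},
        # voice_file is optional; omit it to use the default reference voice.
        files={"voice_file": ("voice.wav", ref, "audio/wav")},
    )
resp.raise_for_status()
with open("cloned.wav", "wb") as out:
    out.write(resp.content)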

STT test adapter

A flexible OpenAI-compatible STT wrapper that loads any HuggingFace ASR model using a multi-strategy backend chain:

Priority  Strategy                    Models supported
1         PipelineBackend             Standard models (Whisper, Wav2Vec2, HuBERT)
2         TrustRemotePipelineBackend  Models with custom code
3         AutoModelForSpeechSeq2Seq   Seq2Seq models (Qwen3-ASR, newer Whisper forks)
4         AutoModelForCTC             CTC-based models (Wav2Vec2, HuBERT fallback)
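
The chain can be approximated like this (a simplified sketch of the fallback idea, not the actual server.py code; error handling is reduced to bare try/except):

from transformers import (
    AutoModelForCTC,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

def load_asr(model_id: str, device: str = "cpu"):
    """Try each loading strategy in priority order; return the first that succeeds."""
    # 1. Standard ASR pipeline (Whisper, Wav2Vec2, HuBERT).
    try:
        return pipeline("automatic-speech-recognition", model=model_id, device=device)
    except Exception:
        pass
    # 2. Same pipeline, but allow models that ship custom code.
    try:
        return pipeline("automatic-speech-recognition", model=model_id,
                        device=device, trust_remote_code=True)
    except Exception:
        pass
    # 3. Raw seq2seq head for encoder-decoder models.
    try:
        return (AutoProcessor.from_pretrained(model_id),
                AutoModelForSpeechSeq2Seq.from_pretrained(model_id))
    except Exception:
        pass
    # 4. CTC head as the last resort.
    return (AutoProcessor.from_pretrained(model_id),
            AutoModelForCTC.from_pretrained(model_id))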

Usage

python server.py --model openai/whisper-large-v3 --port 8000
python server.py --model Qwen/Qwen3-ASR-1.7B --port 8000
python server.py --model facebook/wav2vec2-large-960h --port 8000
python server.py --model nvidia/canary-1b --device cuda:0

API endpoints

Endpoint                  Method  Purpose
/health                   GET     Health check
/v1/models                GET     List loaded models
/v1/audio/transcriptions  POST    Transcribe audio (OpenAI-compatible)

Client example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

transcript = client.audio.transcriptions.create(
    model="openai/whisper-large-v3",
    file=open("audio.wav", "rb"),
    response_format="json",
)
print(transcript.text)

VoiceLab frontend

A single-file web application with embedded CSS and JavaScript: vanilla JS with no build step, using HTML5 Canvas and the Web Audio API for waveform visualization and metering.

Setting          Default
Base URL         https://voiceai.trouve.works/services/v1
STT model        stt-1
TTS model        tts-1
TTS voice        af_heart
TTS speed        1.0
Character limit  5,000

Roadmap

  • Timestamp-based transcriptions (word-level timing)
  • Additional language-specific models (especially Arabic enhancements)
  • Voice cloning improvements
  • Broader language coverage