A containerized Whisper API powered by whisperX on GPU, designed for deployment on Google Cloud Run (GPU) or any NVIDIA-enabled Docker environment.
- GPU-accelerated transcription/alignment using whisperX
- Async job queue - handles long-running transcriptions without timeouts
- Callback support - integrates with n8n, webhooks, and automation tools
- WhisperX alignment for better timestamps; optional speaker diarization (requires HF token)
- Multi-language support with optional translation to English
- Accepts both file uploads and URLs
- Queue statistics and monitoring
# Build the GPU Docker image (for Cloud Run GPU / NVIDIA hosts)
docker build --platform linux/amd64 -t whisperx-cloudrun .
# Run locally (requires NVIDIA runtime)
docker run --rm --gpus all -p 8080:8080 whisperx-cloudrun
# Optional CPU-only build on ARM Macs (slow, for smoke tests)
docker build --no-cache --platform linux/arm64 \
--build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cpu \
-t whisperx-cloudrun .
docker run --rm -p 8080:8080 \
-e WHISPERX_DEVICE=cpu -e WHISPERX_COMPUTE_TYPE=float32 \
-e WHISPERX_BATCH_SIZE=2 \
whisperx-cloudrun
Alternatively, install directly on Ubuntu with the provided script:
chmod +x install-local-ubuntu.sh
HF_TOKEN=your_hf_token ./install-local-ubuntu.sh
source .venv/bin/activate
# GPU (default)
uvicorn server:app --host 0.0.0.0 --port 8080
# CPU fallback
export WHISPERX_DEVICE=cpu WHISPERX_COMPUTE_TYPE=float32 WHISPERX_BATCH_SIZE=2
uvicorn server:app --host 0.0.0.0 --port 8080
Notes:
- Requires Ubuntu 22.04+ with Python 3.8+ and, optionally, NVIDIA drivers. The script installs system deps via apt unless SKIP_APT=1.
- CUDA wheels default to cu124; set TORCH_INDEX_URL to match your installed CUDA (e.g., https://download.pytorch.org/whl/cu118 for older drivers) or force CPU with FORCE_CPU=1.
- Requirements are pinned in requirements.txt. The installer sets PIP_EXTRA_INDEX_URL for PyTorch based on TORCH_INDEX_URL; when running pip install -r requirements.txt manually, set PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu124 (or /cpu) to pick the right wheels.
- HF_TOKEN is required to download the diarization model (pyannote/speaker-diarization-3.1) if diarize=true.
- Known-good local setup: Ubuntu 22.04 + RTX 3090 (compute 8.6) + NVIDIA driver 535+ using the default cu124 wheels.
docker run --rm --gpus all -p 8080:8080 \
-e WHISPERX_MODEL=large-v3 \
whisperx-cloudrun
Common model names: large-v3 (default, multilingual), large-v2, medium.en (English-only, faster).
curl http://localhost:8080/healthz
Response:
{"ok": true}
The async API prevents timeout issues on long audio files and supports webhooks.
1. Start a transcription job:
curl -X POST http://localhost:8080/transcribe/start \
-F "url=https://example.com/audio.mp3"Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "queued",
"message": "Job queued for processing"
}
2. Poll for results:
curl http://localhost:8080/transcribe/status/550e8400-e29b-41d4-a716-446655440000
Response (queued/running):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "running",
"result": null,
"created_at": 1699564800.0,
"updated_at": 1699564820.5
}
Response (completed):
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "done",
"result": {
"ok": true,
"text": "Full transcription here...",
"segments": [...],
"language": "en",
"warnings": null
},
"created_at": 1699564800.0,
"updated_at": 1699564850.2
}
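The start-and-poll flow wraps naturally into a small client. A Python sketch using the requests library; transcribe_url is an illustrative helper, and the multipart form field mirrors curl's -F:
import time
import requests

BASE_URL = "http://localhost:8080"  # adjust for your deployment

def transcribe_url(audio_url, poll_interval=5.0):
    """Start an async job for a remote audio file and poll until it finishes."""
    # files={...} sends a multipart form field, matching curl -F "url=..."
    resp = requests.post(f"{BASE_URL}/transcribe/start",
                         files={"url": (None, audio_url)}, timeout=30)
    resp.raise_for_status()
    job_id = resp.json()["job_id"]
    while True:
        job = requests.get(f"{BASE_URL}/transcribe/status/{job_id}", timeout=30).json()
        if job["status"] in ("done", "error"):
            return job
        time.sleep(poll_interval)

job = transcribe_url("https://example.com/audio.mp3")
if job["status"] == "done":
    print(job["result"]["text"])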
3. Use callback URL (webhook):
Instead of polling, provide a callback URL to receive results automatically:
curl -X POST http://localhost:8080/transcribe/start \
-F "url=https://example.com/audio.mp3" \
-F "callback_url=https://your-webhook.com/transcription-complete"When the job completes, the service will POST to your callback URL:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "done",
"ok": true,
"text": "Transcription...",
"segments": [...],
"language": "en"
}
Note: the synchronous /transcribe endpoint blocks until processing finishes and can time out on long audio; prefer /transcribe/start instead.
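On the receiving end, the callback is a plain JSON POST, so any webhook consumer works. A hedged FastAPI sketch of a receiver (route name and handling are illustrative; the payload shape follows the example above):
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/transcription-complete")
async def transcription_complete(request: Request):
    payload = await request.json()  # job_id, status, ok, text, segments, language
    if payload.get("status") == "done" and payload.get("ok"):
        # persist payload["text"] / payload["segments"] somewhere durable here
        print(f"job {payload['job_id']}: {len(payload.get('segments') or [])} segments")
    else:
        print(f"job {payload.get('job_id')} did not complete cleanly: {payload}")
    return {"received": True}
# run with: uvicorn receiver:app --port 9000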
Basic transcription with file upload:
curl -X POST http://localhost:8080/transcribe \
-F "file=@/path/to/audio.mp3"Transcribe from URL:
curl -X POST http://localhost:8080/transcribe \
-F "url=https://example.com/audio.mp3"With speaker diarization disabled:
curl -X POST http://localhost:8080/transcribe \
-F "file=@audio.mp3" \
-F "diarize=false"Specify language:
curl -X POST http://localhost:8080/transcribe \
-F "file=@audio.mp3" \
-F "language=es"Translate to English:
curl -X POST http://localhost:8080/transcribe \
-F "file=@audio.mp3" \
-F "language=es" \
-F "translate=true"Full example with all options:
curl -X POST http://localhost:8080/transcribe \
-F "file=@meeting.wav" \
-F "diarize=true" \
-F "language=en" \
-F "translate=false"| Parameter | Type | Default | Description |
|---|---|---|---|
file |
File | - | Audio file to transcribe (provide this OR url) |
url |
String | - | URL to audio file (provide this OR file) |
diarize |
Boolean | true |
Enable speaker diarization |
language |
String | auto-detect | Language code (e.g., en, es, fr, de) |
translate |
Boolean | false |
Translate output to English |
model_path |
String | WHISPERX_MODEL |
Ignored; model set via WHISPERX_MODEL env var |
callback_url |
String | - | Webhook URL to POST results when complete |
Parameters for /transcribe:
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | File | - | Audio file to transcribe (provide this OR url) |
| url | String | - | URL to audio file (provide this OR file) |
| diarize | Boolean | true | Enable speaker diarization |
| language | String | auto-detect | Language code (e.g., en, es, fr, de) |
| translate | Boolean | false | Translate output to English |
| model_path | String | WHISPERX_MODEL | Ignored; model set via WHISPERX_MODEL env var |
Response:
{
"ok": true,
"text": "Full transcription text here...",
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "First segment of speech"
},
{
"start": 2.5,
"end": 5.0,
"text": "Second segment of speech"
}
],
"language": "en",
"warnings": []
}
Response Fields:
- ok (boolean): true if transcription succeeded, false otherwise
- text (string): Full transcription text
- segments (array): Timestamped segments with start/end times in seconds
- language (string): Detected/source language code
- warnings (array|null): Alignment/diarization warnings (e.g., missing HF token)
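Since segments carry start/end times in seconds, converting a response into subtitles takes only a few lines. A Python sketch assuming the response shape above (function names are illustrative):
from typing import Dict, List

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: List[Dict]) -> str:
    """Render the API's segments array as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                      f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

# usage: open("out.srt", "w").write(segments_to_srt(result["segments"]))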
Monitor the job queue:
curl http://localhost:8080/queue/stats
Response:
{
"queued": 5,
"running": 2,
"done": 150,
"error": 3
}
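The stats endpoint is easy to scrape for basic monitoring or alerting. A minimal Python sketch using the requests library (the interval and any thresholds are up to you):
import time
import requests

BASE_URL = "http://localhost:8080"  # adjust for your deployment

while True:
    stats = requests.get(f"{BASE_URL}/queue/stats", timeout=10).json()
    print(f"queued={stats['queued']} running={stats['running']} "
          f"done={stats['done']} error={stats['error']}")
    time.sleep(30)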
The API supports any format that FFmpeg can decode, including:
- MP3
- WAV
- M4A
- FLAC
- OGG
- WEBM
- MP4 (audio track)
- Google Cloud CLI installed
- A Google Cloud project with billing enabled
- Cloud Run API enabled
# Set your project ID
export PROJECT_ID=your-project-id
gcloud config set project $PROJECT_ID
# Build and push to Google Container Registry
docker build --platform linux/amd64 -t gcr.io/$PROJECT_ID/whisperx-cloudrun .
docker push gcr.io/$PROJECT_ID/whisperx-cloudrun
# Deploy to Cloud Run
gcloud run deploy whisperx-api-gpu \
--image gcr.io/$PROJECT_ID/whisperx-cloudrun \
--platform managed \
--region us-central1 \
--memory 16Gi \
--cpu 4 \
--timeout 600 \
--gpu 1 --gpu-type nvidia-l4 --no-cpu-throttling --no-cpu-boost \
--max-instances 1 \
--allow-unauthenticated
You can set these during deployment:
gcloud run deploy whisperx-api-gpu \
--image gcr.io/$PROJECT_ID/whisperx-cloudrun \
--set-env-vars WHISPERX_MODEL=large-v3 \
--set-env-vars RESULT_TTL_SECONDS=86400 \
--set-env-vars WORKER_POLL_SEC=1.0 \
--memory 16Gi \
--cpu 4
| Variable | Default | Description |
|---|---|---|
| WHISPERX_MODEL | large-v3 | WhisperX model name |
| WHISPERX_DEVICE | cuda | Device to run on (GPU required) |
| WHISPERX_COMPUTE_TYPE | float16 | Compute type for whisperX |
| WHISPERX_BATCH_SIZE | 16 | Batch size passed to model.transcribe |
| WHISPERX_CACHE | /app/.cache/whisperx | Cache dir for whisperX/align models |
| HF_TOKEN / HUGGINGFACE_HUB_TOKEN | - | Required for diarization model downloads |
| PORT | 8080 | Server port |
| QUEUE_DB | /tmp/whisper_jobs.sqlite3 | SQLite database path |
| RESULT_TTL_SECONDS | 86400 | How long to keep completed jobs (1 day) |
| WORKER_POLL_SEC | 1.0 | Worker polling interval |
| CLEANUP_INTERVAL_SEC | 3600 | Job cleanup interval (1 hour) |
- GPU: Cloud Run GPU (e.g., L4) or equivalent NVIDIA hardware
- Memory: 16Gi minimum for large-v3; scale up for heavy concurrency
- CPU: 4 vCPU recommended to keep the pipeline fed
- Timeout: 300s+ for long audio files; async API preferred
- Max instances: Tune based on throughput and GPU quota
.
├── Dockerfile # GPU whisperX build
├── server.py # FastAPI application
├── entrypoint.sh # Container entrypoint (starts uvicorn)
└── README.md # This file
On Ubuntu, prefer the scripted install (install-local-ubuntu.sh) to set up system and Python deps in .venv. For non-Ubuntu systems, mirror the versions from Dockerfile:
PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu124 pip install -r requirements.txt
# GPU default; set for CPU fallback:
# export WHISPERX_DEVICE=cpu WHISPERX_COMPUTE_TYPE=float32 WHISPERX_BATCH_SIZE=2
uvicorn server:app --reload --port 8080
- Set HF_TOKEN/HUGGINGFACE_HUB_TOKEN when using diarization (pyannote) or private models.
- Ensure the token has access to pyannote/speaker-diarization-3.1 if diarization is enabled; a quick access check follows below.
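Before deploying, it can save a debugging round-trip to confirm the token can actually reach the gated model. A sketch using huggingface_hub (assumed to be installed alongside the other Python deps):
import os
from huggingface_hub import model_info
from huggingface_hub.utils import HfHubHTTPError

try:
    info = model_info("pyannote/speaker-diarization-3.1",
                      token=os.environ["HF_TOKEN"])
    print("token OK, model reachable:", info.id)
except HfHubHTTPError as exc:
    # typically 401/403: token missing, or the gated-model terms were not accepted
    print("token cannot access the diarization model:", exc)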
- Use a smaller model (e.g., medium.en) via WHISPERX_MODEL.
- Lower WHISPERX_BATCH_SIZE (default 16).
- Increase container memory/GPU RAM allocation.
- Use a smaller model.
- Verify the container has access to the GPU (--gpus all locally; Cloud Run GPU in production); see the check below.
- Reduce diarization usage if not needed.
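A quick way to confirm GPU visibility from inside the container, using the PyTorch install the image already ships:
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))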
This project relies on whisperX (MIT licensed). See the whisperX repository for details.