Whisper Cloud Run API (GPU, whisperX)

A containerized Whisper API powered by whisperX on GPU, designed for deployment on Google Cloud Run (GPU) or any NVIDIA-enabled Docker environment.

Features

  • GPU-accelerated transcription/alignment using whisperX
  • Async job queue - avoids request timeouts on long-running transcriptions
  • Callback support - integrates with n8n, webhooks, and automation tools
  • WhisperX alignment for better timestamps; optional speaker diarization (requires HF token)
  • Multi-language support and English translation task
  • Accepts both file uploads and URLs
  • Queue statistics and monitoring

Quick Start

Build and Run Locally

# Build the GPU Docker image (for Cloud Run GPU / NVIDIA hosts)
docker build --platform linux/amd64 -t whisperx-cloudrun .

# Run locally (requires NVIDIA runtime)
docker run --rm --gpus all -p 8080:8080 whisperx-cloudrun

# Optional CPU-only build on ARM Macs (slow, for smoke tests)
docker build --no-cache --platform linux/arm64 \
  --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cpu \
  -t whisperx-cloudrun .
docker run --rm -p 8080:8080 \
  -e WHISPERX_DEVICE=cpu -e WHISPERX_COMPUTE_TYPE=float32 \
  -e WHISPERX_BATCH_SIZE=2 \
  whisperx-cloudrun

Run Locally Without Docker (Ubuntu)

chmod +x install-local-ubuntu.sh
HF_TOKEN=your_hf_token ./install-local-ubuntu.sh
source .venv/bin/activate

# GPU (default)
uvicorn server:app --host 0.0.0.0 --port 8080

# CPU fallback
export WHISPERX_DEVICE=cpu WHISPERX_COMPUTE_TYPE=float32 WHISPERX_BATCH_SIZE=2
uvicorn server:app --host 0.0.0.0 --port 8080

Notes:

  • Requires Ubuntu 22.04+ with Python 3.8+; NVIDIA drivers are needed for GPU use. The script installs system deps via apt unless SKIP_APT=1.
  • CUDA wheels default to cu124; set TORCH_INDEX_URL to match your installed CUDA (e.g., https://download.pytorch.org/whl/cu118 for older drivers) or force CPU with FORCE_CPU=1.
  • Requirements are pinned in requirements.txt. The installer sets PIP_EXTRA_INDEX_URL for PyTorch based on TORCH_INDEX_URL. If you run pip install -r requirements.txt manually, set PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu124 (or /cpu) so pip resolves the right wheels.
  • HF_TOKEN is required to download the diarization model (pyannote/speaker-diarization-3.1) if diarize=true.
  • Known-good local setup: Ubuntu 22.04 + RTX 3090 (compute 8.6) + NVIDIA driver 535+ using the default cu124 wheels.
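
After installing, a quick sanity check that PyTorch actually sees the GPU (run inside the activated .venv; prints False on CPU-only setups):

python -c "import torch; print(torch.cuda.is_available())"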

Run with Custom Model

docker run --rm --gpus all -p 8080:8080 \
  -e WHISPERX_MODEL=large-v3 \
  whisperx-cloudrun

Common model names: large-v3 (default, multilingual), large-v2, medium.en (English-only, faster).

API Usage

Health Check

curl http://localhost:8080/healthz

Response:

{"ok": true}

Async Transcription (Recommended for Cloud Run)

The async API prevents timeout issues on long audio files and supports webhooks.

1. Start a transcription job:

curl -X POST http://localhost:8080/transcribe/start \
  -F "url=https://example.com/audio.mp3"

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "message": "Job queued for processing"
}
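
For scripting, you can capture the job ID directly from the start response (a small convenience; assumes jq is installed):

JOB_ID=$(curl -s -X POST http://localhost:8080/transcribe/start \
  -F "url=https://example.com/audio.mp3" | jq -r .job_id)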

2. Poll for results:

curl http://localhost:8080/transcribe/status/550e8400-e29b-41d4-a716-446655440000

Response (queued/running):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running",
  "result": null,
  "created_at": 1699564800.0,
  "updated_at": 1699564820.5
}

Response (completed):

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "result": {
    "ok": true,
    "text": "Full transcription here...",
    "segments": [...],
    "language": "en",
    "warnings": null
  },
  "created_at": 1699564800.0,
  "updated_at": 1699564850.2
}
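
For shell scripts, a minimal poll-until-done loop looks like this (assumes jq is installed and JOB_ID was captured as above; tune the sleep interval to your audio length):

# Poll until the job leaves the queued/running states
while true; do
  STATUS=$(curl -s http://localhost:8080/transcribe/status/$JOB_ID | jq -r .status)
  [ "$STATUS" != "queued" ] && [ "$STATUS" != "running" ] && break
  sleep 5
done
# Final payload: status will be "done" (with .result) or "error"
curl -s http://localhost:8080/transcribe/status/$JOB_ID | jq .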

3. Use callback URL (webhook):

Instead of polling, provide a callback URL to receive results automatically:

curl -X POST http://localhost:8080/transcribe/start \
  -F "url=https://example.com/audio.mp3" \
  -F "callback_url=https://your-webhook.com/transcription-complete"

When the job completes, the service will POST to your callback URL:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "done",
  "ok": true,
  "text": "Transcription...",
  "segments": [...],
  "language": "en"
}
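
To exercise a webhook handler before wiring up a real job, you can replay a sample payload by hand (the target URL is a placeholder for your own endpoint):

curl -X POST https://your-webhook.com/transcription-complete \
  -H "Content-Type: application/json" \
  -d '{"job_id": "550e8400-e29b-41d4-a716-446655440000", "status": "done", "ok": true, "text": "Transcription...", "segments": [], "language": "en"}'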

Synchronous Transcription (Legacy)

⚠️ Warning: This endpoint may timeout on Cloud Run for long audio files. Use /transcribe/start instead.

Transcribe Audio File

Basic transcription with file upload:

curl -X POST http://localhost:8080/transcribe \
  -F "file=@/path/to/audio.mp3"

Transcribe from URL:

curl -X POST http://localhost:8080/transcribe \
  -F "url=https://example.com/audio.mp3"

With speaker diarization disabled:

curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.mp3" \
  -F "diarize=false"

Specify language:

curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.mp3" \
  -F "language=es"

Translate to English:

curl -X POST http://localhost:8080/transcribe \
  -F "file=@audio.mp3" \
  -F "language=es" \
  -F "translate=true"

Full example with all options:

curl -X POST http://localhost:8080/transcribe \
  -F "file=@meeting.wav" \
  -F "diarize=true" \
  -F "language=en" \
  -F "translate=false"

Request Parameters

Async API (/transcribe/start)

Parameter     Type     Default          Description
file          File     -                Audio file to transcribe (provide this OR url)
url           String   -                URL to audio file (provide this OR file)
diarize       Boolean  true             Enable speaker diarization
language      String   auto-detect      Language code (e.g., en, es, fr, de)
translate     Boolean  false            Translate output to English
model_path    String   WHISPERX_MODEL   Ignored; model set via WHISPERX_MODEL env var
callback_url  String   -                Webhook URL to POST results when complete

Sync API (/transcribe)

Parameter     Type     Default          Description
file          File     -                Audio file to transcribe (provide this OR url)
url           String   -                URL to audio file (provide this OR file)
diarize       Boolean  true             Enable speaker diarization
language      String   auto-detect      Language code (e.g., en, es, fr, de)
translate     Boolean  false            Translate output to English
model_path    String   WHISPERX_MODEL   Ignored; model set via WHISPERX_MODEL env var

Response Format

{
  "ok": true,
  "text": "Full transcription text here...",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "First segment of speech"
    },
    {
      "start": 2.5,
      "end": 5.0,
      "text": "Second segment of speech"
    }
  ],
  "language": "en",
  "warnings": []
}

Response Fields:

  • ok (boolean): true if transcription succeeded, false otherwise
  • text (string): Full transcription text
  • segments (array): Timestamped segments with start/end times in seconds
  • language (string): Detected/source language code
  • warnings (array|null): Alignment/diarization warnings (e.g., missing HF token)
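
To render a response as a readable transcript with timestamps, a jq one-liner works (jq assumed installed; pipe any /transcribe or job-result JSON through it):

curl -s -X POST http://localhost:8080/transcribe -F "file=@audio.mp3" \
  | jq -r '.segments[] | "[\(.start)s-\(.end)s] \(.text)"'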

Queue Statistics

Monitor the job queue:

curl http://localhost:8080/queue/stats

Response:

{
  "queued": 5,
  "running": 2,
  "done": 150,
  "error": 3
}
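
For a live view while load testing, watch the endpoint (watch and jq assumed installed):

watch -n 5 'curl -s http://localhost:8080/queue/stats | jq .'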

Supported Audio Formats

The API supports any format that FFmpeg can decode, including:

  • MP3
  • WAV
  • M4A
  • FLAC
  • OGG
  • WEBM
  • MP4 (audio track)
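
If a file fails to decode, transcoding to 16 kHz mono WAV first is a safe fallback (ffmpeg assumed installed on the client; 16 kHz mono matches what Whisper models consume internally):

ffmpeg -i input.webm -ar 16000 -ac 1 audio.wav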

Deploy to Google Cloud Run

Prerequisites

  • Google Cloud CLI installed
  • A Google Cloud project with billing enabled
  • Cloud Run API enabled

Deployment Steps

# Set your project ID
export PROJECT_ID=your-project-id
gcloud config set project $PROJECT_ID

# Build and push to Google Container Registry
docker build --platform linux/amd64 -t gcr.io/$PROJECT_ID/whisperx-cloudrun .
docker push gcr.io/$PROJECT_ID/whisperx-cloudrun

# Deploy to Cloud Run
gcloud run deploy whisperx-api-gpu \
  --image gcr.io/$PROJECT_ID/whisperx-cloudrun \
  --platform managed \
  --region us-central1 \
  --memory 16Gi \
  --cpu 4 \
  --timeout 600 \
  --gpu 1 --gpu-type nvidia-l4 --no-cpu-throttling --no-cpu-boost \
  --max-instances 1 \
  --allow-unauthenticated
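
After deployment, grab the service URL and smoke-test the health endpoint:

SERVICE_URL=$(gcloud run services describe whisperx-api-gpu \
  --region us-central1 --format 'value(status.url)')
curl "$SERVICE_URL/healthz"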

Environment Variables

You can set these during deployment. Note that gcloud treats repeated --set-env-vars flags as overrides, so pass all variables comma-separated in a single flag:

gcloud run deploy whisperx-api-gpu \
  --image gcr.io/$PROJECT_ID/whisperx-cloudrun \
  --set-env-vars WHISPERX_MODEL=large-v3,RESULT_TTL_SECONDS=86400,WORKER_POLL_SEC=1.0 \
  --memory 16Gi \
  --cpu 4

Available Environment Variables

Variable                          Default                    Description
WHISPERX_MODEL                    large-v3                   WhisperX model name
WHISPERX_DEVICE                   cuda                       Device to run on (cuda or cpu)
WHISPERX_COMPUTE_TYPE             float16                    Compute type for whisperX
WHISPERX_BATCH_SIZE               16                         Batch size passed to model.transcribe
WHISPERX_CACHE                    /app/.cache/whisperx       Cache dir for whisperX/align models
HF_TOKEN / HUGGINGFACE_HUB_TOKEN  -                          Required for diarization model downloads
PORT                              8080                       Server port
QUEUE_DB                          /tmp/whisper_jobs.sqlite3  SQLite database path
RESULT_TTL_SECONDS                86400                      How long to keep completed jobs (1 day)
WORKER_POLL_SEC                   1.0                        Worker polling interval
CLEANUP_INTERVAL_SEC              3600                       Job cleanup interval (1 hour)
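
For local Docker runs, mounting a named volume at WHISPERX_CACHE avoids re-downloading models on every start (the volume name is arbitrary; HF_TOKEN is only needed when diarize=true):

docker run --rm --gpus all -p 8080:8080 \
  -e HF_TOKEN=your_hf_token \
  -v whisperx-cache:/app/.cache/whisperx \
  whisperx-cloudrun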

Production Recommendations

  • GPU: Cloud Run GPU (e.g., L4) or equivalent NVIDIA hardware
  • Memory: 16Gi minimum for large-v3; scale up for heavy concurrency
  • CPU: 4 vCPU recommended to keep the pipeline fed
  • Timeout: 300s+ for long audio files; async API preferred
  • Max instances: Tune based on throughput and GPU quota

Development

Project Structure

.
├── Dockerfile                 # GPU whisperX build
├── server.py                  # FastAPI application
├── entrypoint.sh              # Container entrypoint (starts uvicorn)
├── install-local-ubuntu.sh    # Local Ubuntu install script (see above)
├── requirements.txt           # Pinned Python dependencies
└── README.md                  # This file

Local Development

On Ubuntu, prefer the scripted install (install-local-ubuntu.sh), which sets up system and Python dependencies in .venv. On other systems, mirror the versions from the Dockerfile:

PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu124 pip install -r requirements.txt

# GPU default; set for CPU fallback:
# export WHISPERX_DEVICE=cpu WHISPERX_COMPUTE_TYPE=float32 WHISPERX_BATCH_SIZE=2
uvicorn server:app --reload --port 8080

Troubleshooting

Hugging Face auth errors

  • Set HF_TOKEN/HUGGINGFACE_HUB_TOKEN when using diarization (pyannote) or private models.
  • Ensure the token has access to pyannote/speaker-diarization-3.1 if diarization is enabled.

Out of memory errors

  • Use a smaller model (e.g., medium.en) via WHISPERX_MODEL.
  • Lower WHISPERX_BATCH_SIZE (default 16).
  • Increase container memory/GPU RAM allocation.

Slow transcription

  • Use a smaller model.
  • Verify the container has access to the GPU (--gpus all locally; Cloud Run GPU in production); a quick check is shown below.
  • Reduce diarization usage if not needed.
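
A quick way to confirm the container can see the GPU (assumes the NVIDIA Container Toolkit is installed; the runtime injects nvidia-smi into the container):

docker run --rm --gpus all --entrypoint nvidia-smi whisperx-cloudrun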

License

This project relies on whisperX (MIT licensed). See the whisperX repository for details.
