sekmo/transcriptor

Audio Transcription with Speaker Diarization

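A Bash script that transcribes audio files using OpenAI's audio transcription API (model gpt-4o-transcribe-diarize) with speaker diarization (identifying who said what). Speakers can optionally be named by supplying short voice reference files.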

Features

  • Speaker Diarization: Identify and label individual speakers in multi-person audio
  • Flexible Configuration: Supports 0-4 named speakers, each backed by a voice reference file
  • Multiple Output Formats: Plain text and structured JSON segments
  • Error Handling: Comprehensive validation and clear error messages
  • Italian Language Support: The transcription language is set to Italian (it)

Requirements

  • Bash (v4.0+)
  • curl - for API requests
  • jq - for JSON processing
  • base64 - for encoding voice references
  • OpenAI API Key - set as OPENAI_API_KEY environment variable
  • ffmpeg (optional) - for preparing voice reference files
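
The requirements above can be checked before the first run. The following is a minimal sketch, not taken from transcriptize.sh; the helper name check_deps is hypothetical.

```shell
# Report which of the given commands are missing from PATH.
# Prints one missing command per line and returns 1 if any are missing.
# check_deps is a hypothetical helper, not part of transcriptize.sh.
check_deps() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_deps curl jq base64 || echo "install the tools listed above"
```

Using command -v rather than which keeps the check portable across macOS and Linux shells.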

Installation

  1. Clone or download the script, then make it executable:
chmod +x transcriptize.sh
  2. Set your OpenAI API key:
export OPENAI_API_KEY="your-api-key-here"
  3. Install dependencies (if not already installed):
# macOS
brew install jq ffmpeg

# Ubuntu/Debian
sudo apt-get install jq ffmpeg

Usage

Basic Syntax

./transcriptize.sh <audio_file.mp3> [--speaker Name:voicefile.wav] [--speaker Name2:voicefile2.wav] ...

Examples

1. With Speaker Diarization (2 speakers)

./transcriptize.sh interview.mp3 --speaker Luana:voce_luana_mini.wav --speaker Chiara:voce_chiara_mini.wav

Output:

  • interview_raw.json - Complete JSON response with text and speaker-labeled segments
  • interview_diarized.txt - Human-readable formatted transcript with merged speaker segments
  • interview_text.txt - Plain text transcription only

2. Without Speaker Diarization (Basic Transcription)

./transcriptize.sh interview.mp3

Without voice references, the output uses generic speaker labels (A, B, C...).

3. With 1 Speaker

./transcriptize.sh lecture.mp3 --speaker Professor:prof_voice.wav

4. With 4 Speakers

./transcriptize.sh meeting.mp3 \
  --speaker Alice:alice.wav \
  --speaker Bob:bob.wav \
  --speaker Carol:carol.wav \
  --speaker Dave:dave.wav

Preparing Voice Reference Files

Voice reference files must be:

  • Format: WAV (uncompressed PCM)
  • Sample Rate: 16 kHz
  • Channels: Mono (1 channel)
  • Duration: 1.2 - 10 seconds (3 seconds recommended)

Creating Voice References with ffmpeg

Extract a 3-second sample from an MP3 file:

ffmpeg -i speaker_audio.mp3 -t 3 -ar 16000 -ac 1 speaker_reference.wav

Parameters explained:

  • -i speaker_audio.mp3 - Input file
  • -t 3 - Duration (3 seconds)
  • -ar 16000 - Sample rate (16 kHz)
  • -ac 1 - Audio channels (mono)
  • speaker_reference.wav - Output file
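
The command above takes the first 3 seconds of the file. When the speaker only talks later in the recording, ffmpeg's -ss option can seek to a start offset first. The sketch below wraps that in a hypothetical helper, make_reference; the FFMPEG override and the explicit pcm_s16le codec are assumptions added for illustration.

```shell
# Cut a voice reference starting at a given offset in the source audio.
# make_reference is a hypothetical helper, not part of transcriptize.sh.
# Usage: make_reference <input> <start_seconds> <length_seconds> <output.wav>
make_reference() {
  # FFMPEG can be overridden (e.g. FFMPEG=echo) to preview the arguments
  # without running ffmpeg.
  "${FFMPEG:-ffmpeg}" -ss "$2" -i "$1" -t "$3" \
    -ar 16000 -ac 1 -c:a pcm_s16le "$4"
}

# Example: 3 seconds starting 42 seconds into the recording
#   make_reference interview.mp3 42 3 voce_luana_mini.wav
```

Placing -ss before -i makes ffmpeg seek in the input, so the clip starts at the offset instead of at the beginning.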

Output Files

The script generates three files for each transcription:

{filename}_raw.json

Complete API response including:

  • Full text transcription
  • Detailed segments with speaker identification
  • Timestamps (start/end)
  • Segment IDs

Example:

{
  "text": "per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale. Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "text": " per questo ti volevo fare un po' di domande...",
      "speaker": "Chiara",
      "start": 0.0,
      "end": 5.35,
      "id": "seg_0"
    },
    {
      "type": "transcript.text.segment",
      "text": " Allora, la storia nostra è cominciata...",
      "speaker": "Luana",
      "start": 5.8,
      "end": 12.3,
      "id": "seg_1"
    }
  ]
}

To extract just the text:

jq -r '.text' {filename}_raw.json

To extract just the segments:

jq '.segments' {filename}_raw.json
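
The segment timestamps also allow simple summaries. As an illustration, the sketch below totals speaking time per speaker; speaker_seconds is a hypothetical helper, and it assumes only the speaker, start, and end fields shown in the example response.

```shell
# Print total speaking time per speaker as "speaker<TAB>seconds".
# speaker_seconds is a hypothetical helper; it relies on the segment
# fields (speaker, start, end) from the example response.
speaker_seconds() {
  jq -r '
    .segments
    | group_by(.speaker)
    # .["end"] is written in bracket form because end is also a jq keyword.
    | map({speaker: .[0].speaker,
           seconds: (map(.["end"] - .start) | add | . * 100 | round / 100)})
    | .[]
    | "\(.speaker)\t\(.seconds)"
  ' "$1"
}

# Example: speaker_seconds interview_raw.json
```

The totals are rounded to two decimals so floating-point noise from the subtraction does not leak into the output.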

{filename}_diarized.txt

Human-readable formatted transcript with:

  • Speaker names in uppercase
  • Timestamps in [HH:MM:SS] format
  • Consecutive segments from the same speaker merged together

Example:

CHIARA [00:00:00]
 per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale.

LUANA [00:00:05]
 Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.
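
The merging and formatting described above can be sketched with jq. This is one possible approach under the field names from the example response, not necessarily how transcriptize.sh does it; format_diarized is a hypothetical name.

```shell
# Render a raw response as a diarized transcript: uppercase speaker,
# [HH:MM:SS] start time, and consecutive same-speaker segments merged.
# format_diarized is a hypothetical helper, not part of transcriptize.sh.
format_diarized() {
  jq -r '
    # Zero-pad a number to two digits.
    def pad: tostring | if length < 2 then "0" + . else . end;
    # Format a number of seconds as HH:MM:SS.
    def hms: floor as $t
      | [($t / 3600 | floor), (($t % 3600) / 60 | floor), ($t % 60)]
      | map(pad) | join(":");

    # Fold segments into turns: append text while the speaker is unchanged.
    reduce .segments[] as $s ([];
      if length > 0 and .[-1].speaker == $s.speaker
      then .[:-1] + [.[-1] | .text += $s.text]
      else . + [{speaker: $s.speaker, start: $s.start, text: $s.text}]
      end)
    | .[]
    | "\(.speaker | ascii_upcase) [\(.start | hms)]\n\(.text)\n"
  ' "$1"
}

# Example: format_diarized interview_raw.json > interview_diarized.txt
```

Each merged turn keeps the start time of its first segment, which matches the timestamps in the example transcript.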

{filename}_text.txt

Plain text transcription extracted from the .text field of the API response.

Example:

per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale. Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.

Limitations

  • Maximum 4 speakers - Script enforces this limit
  • WAV files only - Voice references must be pre-converted to WAV format
  • Duration limits - Voice references must be 1.2-10 seconds
  • API costs - OpenAI charges for API usage based on audio duration

Error Handling

The script validates:

  • ✅ Input file exists
  • ✅ --speaker format is correct (Name:file.wav)
  • ✅ Voice reference files exist
  • ✅ Voice reference files are WAV format
  • ✅ Maximum speaker count (4)
  • ✅ API response errors

Common errors and solutions:

| Error | Solution |
| --- | --- |
| Voice file 'x.wav' not found | Check file path is correct |
| Voice file must be a WAV file | Convert to WAV using ffmpeg |
| Invalid --speaker format | Use format Name:file.wav |
| Maximum 4 speakers allowed | Reduce number of speakers |
| Known speaker references has duration... | Voice file must be 1.2-10 seconds |
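
To make the local checks concrete, here is a minimal sketch of how a single --speaker value could be validated. The function name parse_speaker and the exact wording of its messages are assumptions modelled on the errors listed above; the script's real implementation may differ.

```shell
# Validate one --speaker value of the form Name:file.wav.
# Prints "name<TAB>path" on success; on failure prints an error to stderr
# and returns 1. parse_speaker is a hypothetical helper.
parse_speaker() {
  local spec=$1 name path
  case $spec in
    ?*:?*) ;;  # non-empty name, a colon, non-empty path
    *) echo "Invalid --speaker format: '$spec' (use Name:file.wav)" >&2
       return 1 ;;
  esac
  name=${spec%%:*}   # text before the first colon
  path=${spec#*:}    # text after the first colon
  case $path in
    *.wav|*.WAV) ;;
    *) echo "Voice file must be a WAV file: '$path'" >&2
       return 1 ;;
  esac
  if [ ! -f "$path" ]; then
    echo "Voice file '$path' not found" >&2
    return 1
  fi
  printf '%s\t%s\n' "$name" "$path"
}

# Example: parse_speaker "Luana:voce_luana_mini.wav"
```

The checks run from cheapest to most specific, so a malformed argument is reported before the file system is touched.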

Environment Variables

  • OPENAI_API_KEY - (Required) Your OpenAI API key

API Details

This script uses:

  • Model: gpt-4o-transcribe-diarize
  • Language: Italian (it)
  • Response Format: diarized_json
  • Chunking Strategy: auto
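
For orientation, a request with these settings might look roughly like the sketch below. It is not copied from transcriptize.sh. The form field names known_speaker_names[] and known_speaker_references[] and the data-URL encoding of the reference follow my understanding of OpenAI's documentation and should be verified against the current API reference. The helper names to_data_url and transcribe_with_reference are hypothetical.

```shell
# Encode a WAV file as a data URL (this is where base64 comes in).
to_data_url() {
  printf 'data:audio/wav;base64,%s' "$(base64 < "$1" | tr -d '\n')"
}

# Send one audio file with one named voice reference.
# Usage: transcribe_with_reference <audio> <speaker_name> <reference.wav>
transcribe_with_reference() {
  local audio=$1 name=$2 ref=$3 url_file status
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
  url_file=$(mktemp) || return 1
  to_data_url "$ref" > "$url_file"
  # "<file" makes curl read the field value from a file. That keeps the long
  # data URL off the command line and avoids curl treating ";" in a -F value
  # as the start of an option such as ";type=".
  curl -sS https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F "file=@$audio" \
    -F "model=gpt-4o-transcribe-diarize" \
    -F "language=it" \
    -F "response_format=diarized_json" \
    -F "chunking_strategy=auto" \
    -F "known_speaker_names[]=$name" \
    -F "known_speaker_references[]=<$url_file"
  status=$?
  rm -f "$url_file"
  return "$status"
}

# Example:
#   transcribe_with_reference interview.mp3 Luana voce_luana_mini.wav > interview_raw.json
```

Additional speakers would repeat the last two form fields, one pair per speaker, in matching order.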

Troubleshooting

Script shows "command not found: jq"

Install jq: brew install jq (macOS) or sudo apt-get install jq (Ubuntu/Debian)

Voice reference rejected by API

Verify your WAV file meets requirements:

ffprobe -i your_voice.wav -show_streams

Look for:

  • codec_name: pcm_s16le (or similar PCM codec)
  • sample_rate: 16000
  • channels: 1
  • duration: 1.2-10.0 seconds
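
The same check can be automated. The sketch below asks ffprobe for only the four relevant fields and pipes them through a small awk filter; check_reference is a hypothetical helper, and the thresholds are the ones listed above.

```shell
# Read key=value lines (as printed by ffprobe) on stdin and report every
# property that does not meet the voice reference requirements.
# Returns 0 if all checks pass, 1 otherwise. Hypothetical helper.
check_reference() {
  awk -F= '
    $1 == "codec_name"  && $2 !~ /^pcm_/ { print "codec is " $2 ", expected PCM"; bad = 1 }
    $1 == "sample_rate" && $2 != 16000   { print "sample rate is " $2 ", expected 16000"; bad = 1 }
    $1 == "channels"    && $2 != 1       { print "channels is " $2 ", expected 1"; bad = 1 }
    $1 == "duration"    && ($2 + 0 < 1.2 || $2 + 0 > 10) {
      print "duration is " $2 " s, expected 1.2-10"; bad = 1
    }
    END { exit bad + 0 }
  '
}

# Example:
#   ffprobe -v error -select_streams a:0 \
#     -show_entries stream=codec_name,sample_rate,channels,duration \
#     -of default=noprint_wrappers=1 your_voice.wav | check_reference
```

An empty output with exit status 0 means the file matches all four requirements.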
