# Faster Whisper — Local Speech-to-Text

Scripts: `~/clawd/skills/faster-whisper/scripts/transcribe`
Setup: `~/clawd/skills/faster-whisper/setup.sh`
Requires: python3, ffmpeg; optional NVIDIA GPU + CUDA for ~20x speed
## When to Use
- Transcribing audio/video files locally (no API needed)
- Need word-level timestamps
- Privacy-sensitive transcription (everything stays local)
- Batch transcription of multiple files
- Need language detection
## Local STT Stack

- `faster-whisper` 1.2.1 installed in Hermes venv ✓
- `openai-whisper` 20250625 installed via pipx ✓
- `insanely-fast-whisper` 0.0.15 installed via pipx ✓ (HuggingFace Transformers-based, GPU-accelerated)
- All are STT (speech-to-text), NOT TTS — for TTS use the `text_to_speech` tool or ElevenLabs
## Basic Transcription

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3
```
## Max Accuracy

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10
```
## Word-Level Timestamps (JSON output)

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --word-timestamps --json -o transcript.json
```
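The wrapper script's JSON schema isn't documented here, but the same word-level data is available in-process. A minimal sketch using the `faster-whisper` Python API directly (model and device choices are assumptions, not necessarily what the script uses):

```python
from faster_whisper import WhisperModel

# Assumed model/device; use device="cpu", compute_type="int8" without a GPU.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, _info = model.transcribe("audio.mp3", word_timestamps=True)
for seg in segments:
    for word in seg.words:  # each word carries start/end offsets in seconds
        print(f"{word.start:6.2f}-{word.end:6.2f}  {word.word}")
```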
## With VAD (Voice Activity Detection) + Language

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --vad --language en
```
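Omitting `--language` auto-detects it. A minimal in-process sketch showing VAD filtering plus the detected-language metadata the API returns (file and model names are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")  # device defaults to "auto" (GPU if available)

# vad_filter skips silent stretches; omit language= to auto-detect.
segments, info = model.transcribe("audio.mp3", vad_filter=True)
print(f"Detected {info.language} (p={info.language_probability:.2f})")
print("".join(seg.text for seg in segments).strip())
```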
## Models (smallest → largest)

| Model | Size | Speed | Quality |
|---|---|---|---|
| `tiny` | 39M | Fastest | Basic |
| `base` | 74M | Fast | OK |
| `small` | 244M | Good | Good |
| `medium` | 769M | Slower | Great |
| `large-v3` | 1.6G | Slow | Best |
| `large-v3-turbo` | 809M | Fast | Near-best |
## Options

- `--model <name>`: Model to use (default: auto-selects)
- `--beam-size <n>`: Beam size for search (higher = more accurate, slower)
- `--word-timestamps`: Enable word-level timing
- `--vad`: Voice Activity Detection (skip silence)
- `--language <code>`: Force language (auto-detects if omitted)
- `--json`: JSON output format
- `-o <path>`: Output file path
## Setup (first time)

```bash
cd ~/clawd/skills/faster-whisper && ./setup.sh
```
## insanely-fast-whisper (Alternative)

HuggingFace Transformers-based STT — different backend, similar speed goals.
### Install

```bash
pipx install insanely-fast-whisper --force
```
### Critical Fix: Remove broken torchcodec

The default install pulls in torchcodec, which needs CUDA NPP libraries (`libnppicc.so`) that aren't present in a driver-only NVIDIA install. Remove it — transformers falls back to soundfile/ffmpeg:

```bash
$(pipx environment -v PIPX_LOCAL_VENVS)/insanely-fast-whisper/bin/python -m pip uninstall torchcodec -y
```
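To confirm the fix took, a small hedged check you can run with that same venv python; it only verifies that torchcodec is gone and that the soundfile fallback imports:

```python
import importlib.util

# torchcodec must be absent so transformers falls back to soundfile/ffmpeg
assert importlib.util.find_spec("torchcodec") is None, "torchcodec still installed"

import soundfile  # the decode fallback transformers will use
print("soundfile OK:", soundfile.__version__)
```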
### Usage

```bash
# GPU (pass just the device number; the tool prepends cuda: automatically)
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-tiny

# Larger model for accuracy
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-large-v3

# With output file
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-tiny --transcript-path output.json
```
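To consume `output.json` downstream, a sketch under the assumption that the file carries the `text` and `chunks` keys described in Pitfalls below, with Transformers-style `timestamp` pairs per chunk:

```python
import json

with open("output.json") as f:
    result = json.load(f)

print(result["text"])  # full transcript

# Assumed Transformers-pipeline chunk shape: {"timestamp": [start, end], "text": ...}
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start}-{end}: {chunk['text']}")
```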
### Pitfalls

- `--device-id` does NOT accept `cuda:0` or `cpu` — pass just `0` (it hardcodes `cuda:{device_id}`)
- No CPU fallback in the CLI — GPU required (check source: `cli.py` line 134)
- `torchcodec` will crash on systems without the full CUDA toolkit — uninstall it
- First run downloads the model from HuggingFace (set `HF_TOKEN` for faster downloads)
- The pipx venv can silently break (shows as installed but empty) — use `--force` to reinstall
- English-only models reject `--language`. `distil-whisper/distil-small.en` and any other `.en`-suffix model raise `ValueError: Cannot specify task or language for an English-only model.` Either drop `--language` for `.en` models or switch to the multilingual variant. In code: `if not model.rstrip("/").endswith(".en"): cmd.extend(["--language", lang])`
- ~25–30s cold start per CLI invocation — the subprocess loads the model into the GPU on each call. Fine for one-off transcriptions, terrible for voice loops or batched short clips. For sub-second turnaround, use the in-process Transformers pipeline (load the model once, reuse it; see the sketch after this list) or switch to the `faster-whisper` Python API, which keeps the CTranslate2 model resident
- The CLI prints a `torchcodec is not installed correctly` warning to stderr even when transcription succeeds — the stderr line that actually marks success is `Voila!`
- An empty result with `chunks: []` and `text: ""` usually means the audio was too quiet, all silence, or chunked badly — not a tool failure
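A minimal sketch of that in-process Transformers pipeline (model name and device index are assumptions; adjust to your setup):

```python
from transformers import pipeline

# Load once per process; reuse across calls to avoid the per-invocation cold start.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",  # assumed model; swap for large-v3 etc.
    device=0,                     # GPU index; use device=-1 for CPU
)

result = asr("audio.mp3", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```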
## Pitfalls (faster-whisper)

- First run downloads the model (~1-2GB for large models)
- CPU transcription is usable but much slower than GPU
- VAD helps with long recordings that have lots of silence
- `large-v3-turbo` is the best speed/quality tradeoff (see the batch sketch below)
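For batch work, a hedged sketch that loads `large-v3-turbo` once and reuses it across files (directory, glob pattern, and output naming are placeholders):

```python
from pathlib import Path
from faster_whisper import WhisperModel

# Load once; the CTranslate2 model stays resident for every file below.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

for audio in sorted(Path("recordings").glob("*.mp3")):  # placeholder directory
    segments, info = model.transcribe(str(audio), vad_filter=True)
    text = "".join(seg.text for seg in segments).strip()
    audio.with_suffix(".txt").write_text(text + "\n")
    print(f"{audio.name}: {info.language}, {len(text)} chars")
```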
## Verification

```bash
# Generate short test clips (1s sine tone as a fallback, spoken text if espeak-ng is present):
ffmpeg -f lavfi -i "sine=frequency=1000:duration=1" /tmp/test_clip.mp3 -y 2>/dev/null
echo "Hello world test" | espeak-ng --stdout > /tmp/test_speech.wav 2>/dev/null || true

# Transcribe the speech clip:
~/clawd/skills/faster-whisper/scripts/transcribe /tmp/test_speech.wav --model tiny

# Output should contain the transcribed text with no errors
```