# Faster Whisper — Local Speech-to-Text

Scripts: `~/clawd/skills/faster-whisper/scripts/transcribe`
Setup: `~/clawd/skills/faster-whisper/setup.sh`
Requires: python3, ffmpeg; optional NVIDIA GPU + CUDA for ~20x speed
## When to Use
- Transcribing audio/video files locally (no API needed)
- Need word-level timestamps
- Privacy-sensitive transcription (everything stays local)
- Batch transcription of multiple files
- Need language detection
## Local STT Stack

- `faster-whisper` 1.2.1 installed in Hermes venv ✓
- `openai-whisper` 20250625 installed via pipx ✓
- `insanely-fast-whisper` 0.0.15 installed via pipx ✓ (HuggingFace Transformers-based, GPU-accelerated)
- All are STT (speech-to-text), NOT TTS — for TTS use the `text_to_speech` tool or ElevenLabs
## Basic Transcription

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3
```
## Max Accuracy

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10
```
## Word-Level Timestamps (JSON output)

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --word-timestamps --json -o transcript.json
```
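The wrapper script's JSON schema isn't documented here, but the same word-level data is available in-process. A minimal sketch using the `faster-whisper` Python API directly (model and device choices are assumptions, not necessarily what the script uses):

```python
from faster_whisper import WhisperModel

# Assumed model/device; use device="cpu", compute_type="int8" without a GPU.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, _info = model.transcribe("audio.mp3", word_timestamps=True)
for seg in segments:
    for word in seg.words:  # each word carries start/end offsets in seconds
        print(f"{word.start:6.2f}-{word.end:6.2f}  {word.word}")
```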
## With VAD (Voice Activity Detection) + Language

```bash
~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --vad --language en
```
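Omitting `--language` auto-detects it. A minimal in-process sketch showing VAD filtering plus the detected-language metadata the API returns (file and model names are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")  # device defaults to "auto" (GPU if available)

# vad_filter skips silent stretches; omit language= to auto-detect.
segments, info = model.transcribe("audio.mp3", vad_filter=True)
print(f"Detected {info.language} (p={info.language_probability:.2f})")
print("".join(seg.text for seg in segments).strip())
```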
## Models (smallest → largest)

| Model | Size | Speed | Quality |
|---|---|---|---|
| `tiny` | 39M | Fastest | Basic |
| `base` | 74M | Fast | OK |
| `small` | 244M | Good | Good |
| `medium` | 769M | Slower | Great |
| `large-v3` | 1.6G | Slow | Best |
| `large-v3-turbo` | 809M | Fast | Near-best |
## Options

- `--model <name>`: Model to use (default: auto-selects)
- `--beam-size <n>`: Beam size for search (higher = more accurate, slower)
- `--word-timestamps`: Enable word-level timing
- `--vad`: Voice Activity Detection (skip silence)
- `--language <code>`: Force language (auto-detects if omitted)
- `--json`: JSON output format
- `-o <path>`: Output file path
## Setup (first time)

```bash
cd ~/clawd/skills/faster-whisper && ./setup.sh
```
## insanely-fast-whisper (Alternative)

HuggingFace Transformers-based STT — different backend, similar speed goals.
### Install

```bash
pipx install insanely-fast-whisper --force
```
### Critical Fix: Remove broken torchcodec

The default install pulls in torchcodec, which needs CUDA NPP libraries (`libnppicc.so`) that aren't present in a driver-only NVIDIA install. Remove it — transformers falls back to soundfile/ffmpeg:

```bash
$(pipx environment -v PIPX_LOCAL_VENVS)/insanely-fast-whisper/bin/python -m pip uninstall torchcodec -y
```
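To confirm the fix took, a small hedged check you can run with that same venv python; it only verifies that torchcodec is gone and that the soundfile fallback imports:

```python
import importlib.util

# torchcodec must be absent so transformers falls back to soundfile/ffmpeg
assert importlib.util.find_spec("torchcodec") is None, "torchcodec still installed"

import soundfile  # the decode fallback transformers will use
print("soundfile OK:", soundfile.__version__)
```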
### Usage

```bash
# GPU (pass just the device number; the tool prepends cuda: automatically)
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-tiny

# Larger model for accuracy
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-large-v3

# With output file
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-tiny --transcript-path output.json
```
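To consume `output.json` downstream, a sketch under the assumption that the file carries the `text` and `chunks` keys described in Pitfalls below, with Transformers-style `timestamp` pairs per chunk:

```python
import json

with open("output.json") as f:
    result = json.load(f)

print(result["text"])  # full transcript

# Assumed Transformers-pipeline chunk shape: {"timestamp": [start, end], "text": ...}
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start}-{end}: {chunk['text']}")
```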
### Pitfalls

- `--device-id` does NOT accept `cuda:0` or `cpu` — pass just `0` (it hardcodes `cuda:{device_id}`)
- No CPU fallback in the CLI — GPU required (check source: `cli.py` line 134)
- `torchcodec` will crash on systems without the full CUDA toolkit — uninstall it
- First run downloads the model from HuggingFace (set `HF_TOKEN` for faster downloads)
- The pipx venv can silently break (shows as installed but empty) — use `--force` to reinstall
- English-only models reject `--language`. `distil-whisper/distil-small.en` and any other `.en`-suffix model raise `ValueError: Cannot specify task or language for an English-only model.` Either drop `--language` for `.en` models or switch to the multilingual variant. In code: `if not model.rstrip("/").endswith(".en"): cmd.extend(["--language", lang])`
- ~25–30s cold start per CLI invocation — the subprocess loads the model into the GPU on each call. Fine for one-off transcriptions, terrible for voice loops or batched short clips. For sub-second turnaround, use the in-process Transformers pipeline (load the model once, reuse it; see the sketch after this list) or switch to the `faster-whisper` Python API, which keeps the CTranslate2 model resident
- The CLI prints a `torchcodec is not installed correctly` warning to stderr even when transcription succeeds — the stderr line that actually marks success is `Voila!`
- An empty result with `chunks: []` and `text: ""` usually means the audio was too quiet, all silence, or chunked badly — not a tool failure
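A minimal sketch of that in-process Transformers pipeline (model name and device index are assumptions; adjust to your setup):

```python
from transformers import pipeline

# Load once per process; reuse across calls to avoid the per-invocation cold start.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",  # assumed model; swap for large-v3 etc.
    device=0,                     # GPU index; use device=-1 for CPU
)

result = asr("audio.mp3", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```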
## Pitfalls (faster-whisper)

- First run downloads the model (~1-2GB for large models)
- CPU transcription is usable but much slower than GPU
- VAD helps with long recordings that have lots of silence
- `large-v3-turbo` is the best speed/quality tradeoff (see the batch sketch below)
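For batch work, a hedged sketch that loads `large-v3-turbo` once and reuses it across files (directory, glob pattern, and output naming are placeholders):

```python
from pathlib import Path
from faster_whisper import WhisperModel

# Load once; the CTranslate2 model stays resident for every file below.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

for audio in sorted(Path("recordings").glob("*.mp3")):  # placeholder directory
    segments, info = model.transcribe(str(audio), vad_filter=True)
    text = "".join(seg.text for seg in segments).strip()
    audio.with_suffix(".txt").write_text(text + "\n")
    print(f"{audio.name}: {info.language}, {len(text)} chars")
```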
## Verification

```bash
# Generate short test clips (1s sine tone as a fallback, spoken text if espeak-ng is present):
ffmpeg -f lavfi -i "sine=frequency=1000:duration=1" /tmp/test_clip.mp3 -y 2>/dev/null
echo "Hello world test" | espeak-ng --stdout > /tmp/test_speech.wav 2>/dev/null || true

# Transcribe the speech clip:
~/clawd/skills/faster-whisper/scripts/transcribe /tmp/test_speech.wav --model tiny

# Output should contain the transcribed text with no errors
```