Faster Whisper

Local, offline speech-to-text using the CTranslate2 reimplementation of OpenAI Whisper. Roughly 4-6x faster than openai-whisper, 99+ languages, word-level timestamps, VAD support.

Install in your agent

Tell your agent: "install the recipes skill, then add faster-whisper"
Or via curl: curl -sL https://recipes.wisechef.ai/skill -o ~/.claude/skills/recipes/SKILL.md


Faster Whisper — Local Speech-to-Text

Scripts: ~/clawd/skills/faster-whisper/scripts/transcribe
Setup: ~/clawd/skills/faster-whisper/setup.sh
Requires: python3, ffmpeg; optional NVIDIA GPU + CUDA for ~20x speed

When to Use

  • Transcribing audio/video files locally (no API needed)
  • Need word-level timestamps
  • Privacy-sensitive transcription (everything stays local)
  • Batch transcription of multiple files
  • Need language detection

Local STT Stack

  • faster-whisper 1.2.1 installed in Hermes venv ✓
  • openai-whisper 20250625 installed via pipx ✓
  • insanely-fast-whisper 0.0.15 installed via pipx ✓ (HuggingFace Transformers-based, GPU-accelerated)
  • All are STT (speech-to-text), NOT TTS — for TTS use text_to_speech tool or ElevenLabs
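
To sanity-check the stack, confirm the package imports and reports the expected version (a minimal check; run it with the Hermes venv's python, since that is where faster-whisper lives):

# Confirm faster-whisper is importable and print its version.
import faster_whisper
print(faster_whisper.__version__)  # expect 1.2.1 per the list above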

Basic Transcription

~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3

Max Accuracy

~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10

Word-Level Timestamps (JSON output)

~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --word-timestamps --json -o transcript.json
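
If you need the timestamps programmatically rather than via the wrapper script, the faster-whisper Python API exposes the same word timing (a minimal sketch; model choice and file name are illustrative):

# Word-level timestamps via the faster-whisper Python API.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo")  # downloads on first use
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"[{word.start:6.2f}s -> {word.end:6.2f}s] {word.word}")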

With VAD (Voice Activity Detection) + Language

~/clawd/skills/faster-whisper/scripts/transcribe audio.mp3 \
  --vad --language en
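
The same through the Python API, with VAD filtering and a pinned language (the VAD parameter value is illustrative, not required):

# vad_filter skips long silences via Silero VAD; language pins the decoder.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo")
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},  # tune for your audio
    language="en",
)
print(info.language, info.language_probability)
for segment in segments:
    print(segment.text)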

Models (smallest → largest)

Model            Params  Speed    Quality
tiny             39M     Fastest  Basic
base             74M     Fast     OK
small            244M    Good     Good
medium           769M    Slower   Great
large-v3         1.6B    Slow     Best
large-v3-turbo   809M    Fast     Near-best

Options

  • --model <name>: Model to use (default: auto-selects)
  • --beam-size <n>: Beam size for search (higher = more accurate, slower)
  • --word-timestamps: Enable word-level timing
  • --vad: Voice Activity Detection (skip silence)
  • --language <code>: Force language (auto-detects if omitted)
  • --json: JSON output format
  • -o <path>: Output file path

Setup (first time)

cd ~/clawd/skills/faster-whisper && ./setup.sh

insanely-fast-whisper (Alternative)

HuggingFace Transformers-based STT — different backend, similar speed goals.

Install

pipx install insanely-fast-whisper --force

Critical Fix: Remove broken torchcodec

The default install pulls in torchcodec, which needs CUDA NPP libraries (libnppicc.so) that aren't present in a driver-only NVIDIA install. Remove it; transformers falls back to soundfile/ffmpeg:

"$(pipx environment --value PIPX_LOCAL_VENVS)"/insanely-fast-whisper/bin/python -m pip uninstall torchcodec -y

Usage

# GPU (pass just the device number, tool prepends cuda: automatically)
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-tiny

# Larger model for accuracy
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-large-v3

# With output file
insanely-fast-whisper --file-name audio.mp3 --device-id 0 --model-name openai/whisper-tiny --transcript-path output.json

Pitfalls

  • --device-id does NOT accept cuda:0 or cpu — pass just 0 (it hardcodes cuda:{device_id})
  • No CPU fallback in the CLI — GPU required (check source: cli.py line 134)
  • torchcodec will crash on systems without full CUDA toolkit — uninstall it
  • First run downloads the model from HuggingFace (set HF_TOKEN for faster downloads)
  • The pipx venv can silently break (shows as installed but empty) — use --force to reinstall
  • English-only models reject --language. distil-whisper/distil-small.en and any other .en-suffixed model raise ValueError: Cannot specify task or language for an English-only model. Either drop --language for .en models or switch to the multilingual variant; a guard for scripted invocations is sketched after this list
  • ~25-30s cold start per CLI invocation: the subprocess loads the model onto the GPU on every call. Fine for one-off transcriptions, terrible for voice loops or batches of short clips. For sub-second turnaround, use the in-process Transformers pipeline (load the model once and reuse it; see the sketch after this list) or the faster-whisper Python API, which keeps the CTranslate2 model resident
  • The CLI prints a torchcodec is not installed correctly warning to stderr even when transcription succeeds; the line on stderr that actually marks success is Voila!
  • An empty result (chunks: [] and text: "") usually means the audio was too quiet, all silence, or chunked badly, not that the tool failed
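
A minimal sketch of that .en guard, assuming the CLI invocation is assembled as an argument list in Python (build_cmd and its parameters are illustrative, not part of the tool):

# Hypothetical wrapper: only pass --language when the model is multilingual,
# since .en (English-only) checkpoints raise ValueError on --language.
def build_cmd(model: str, audio: str, lang: str | None = None) -> list[str]:
    cmd = ["insanely-fast-whisper", "--file-name", audio,
           "--device-id", "0", "--model-name", model]
    if lang and not model.rstrip("/").endswith(".en"):
        cmd.extend(["--language", lang])
    return cmd

# build_cmd("distil-whisper/distil-small.en", "audio.mp3", "en")
# drops --language instead of crashing; a multilingual model keeps it.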
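
And for the voice-loop case, a sketch of the in-process Transformers route; loading the model once amortizes the cold start across clips (model name and device are placeholder choices):

# Load once, reuse across clips; avoids the per-invocation cold start.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-tiny",  # swap for a larger checkpoint
               device=0)                     # GPU 0; device=-1 for CPU

for clip in ["first.mp3", "second.mp3"]:
    print(asr(clip)["text"])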

Pitfalls (faster-whisper)

  • First run downloads the model (~1-2GB for large models)
  • CPU transcription is usable but much slower than GPU
  • VAD helps with long recordings that have lots of silence
  • large-v3-turbo is the best speed/quality tradeoff

Verification

# Generate a short spoken test clip with espeak-ng and transcribe it:
echo "Hello world test" | espeak-ng --stdout > /tmp/test_speech.wav 2>/dev/null || true
~/clawd/skills/faster-whisper/scripts/transcribe /tmp/test_speech.wav --model tiny
# Output should contain the transcribed text ("hello world test" or close) with no errors
