faster-whisper

Included with Lifetime

$97 forever

Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. SRT/VTT/TTML/CSV subtitles, speaker diarization, URL/YouTube input, batch processing with ETA, transcript search, chapter detection, per-file language map.

AI Agentsaudiotranscriptionwhisperspeech-to-textmlcudagpusubtitlesscripts

What this skill does


# Faster Whisper

Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs **4-6x faster** with identical accuracy. With GPU acceleration, expect **~20x realtime** transcription (a 10-minute audio file in ~30 seconds).

## When to Use

Use this skill when you need to:

- **Transcribe audio/video files** — meetings, interviews, podcasts, lectures, YouTube videos
- **Generate subtitles** — SRT, VTT, ASS, LRC, or TTML broadcast-standard subtitles
- **Identify speakers** — diarization labels who said what (`--diarize`)
- **Transcribe from URLs** — YouTube links and direct audio URLs (auto-downloads via yt-dlp)
- **Transcribe podcast feeds** — `--rss <feed-url>` fetches and transcribes episodes
- **Batch process files** — glob patterns, directories, skip-existing support; ETA shown automatically
- **Convert speech to text locally** — no API costs, works offline (after model download)
- **Translate to English** — translate any language to English with `--translate`
- **Do multilingual transcription** — supports 99+ languages with auto-detection
- **Transcribe a batch of files in different languages** — `--language-map` assigns a different language per file
- **Transcribe multilingual audio** — `--multilingual` for mixed-language audio
- **Transcribe audio with specific terms** — use `--initial-prompt` for jargon-heavy content or any other terms to look out for
- **Preprocess noisy audio (before transcription)** — `--normalize` and `--denoise` before transcription
- **Stream output** — `--stream` shows segments as they're transcribed
- **Clip time ranges** — `--clip-timestamps` to transcribe specific sections
- **Search the transcript** — `--search "term"` finds all timestamps where a word/phrase appears
- **Detect chapters** — `--detect-chapters` finds section breaks from silence gaps
- **Export speaker audio** — `--export-speakers DIR` saves each speaker's turns as separate WAV files
- **Spreadsheet output** — `--format csv` produces a properly-quoted CSV with timestamps

**Trigger phrases:**
"transcribe this audio", "convert speech to text", "what did they say", "make a transcript",
"audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English",
"find where X is mentioned", "search transcript for", "when did they say", "at what timestamp",
"add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording",
"TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format",
"ASS subtitles", "aegisub format", "advanced substation alpha", "mpv subtitles",
"LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics",
"HTML transcript", "confidence-colored transcript", "color-coded transcript",
"separate audio per speaker", "export speaker audio", "split by speaker",
"transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed",
"different languages in batch", "per-file language",
"transcribe in multiple formats", "srt and txt at the same time", "output both srt and text",
"remove filler words", "clean up ums and uhs", "strip hesitation sounds", "remove you know and I mean",
"transcribe left channel", "transcribe right channel", "stereo channel", "left track only",
"wrap subtitle lines", "character limit per line", "max chars per subtitle",
"detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"

**⚠️ Agent guidance — keep invocations minimal:**

_CORE RULE: default command (`./scripts/transcribe audio.mp3`) is the fastest path — add flags only when the user explicitly asks for that capability._

**Transcription:**

- Only add `--diarize` if the user asks "who said what" / "identify speakers" / "label speakers"
- Only add `--format srt/vtt/ass/lrc/ttml` if the user asks for subtitles/captions in that format
- Only add `--format csv` if the user asks for CSV or spreadsheet output
- Only add `--word-timestamps` if the user needs word-level timing
- Only add `--initial-prompt` if there's domain-specific jargon to prime
- Only add `--translate` if the user wants non-English audio translated to English
- Only add `--normalize`/`--denoise` if the user mentions bad audio quality or noise
- Only add `--stream` if the user wants live/progressive output for long files
- Only add `--clip-timestamps` if the user wants a specific time range
- Only add `--temperature 0.0` if the model is hallucinating on music/silence
- Only add `--vad-threshold` if VAD is aggressively cutting speech or including noise
- Only add `--min-speakers`/`--max-speakers` when you know the speaker count
- Only add `--hf-token` if the token is not cached at `~/.cache/huggingface/token`
- Only add `--max-words-per-line` for subtitle readability on long segments
- Only add `--filter-hallucinations` if the transcript contains obvious artifacts (music markers, duplicates)
- Only add `--merge-sentences` if the user asks for sentence-level subtitle cues
- Only add `--clean-filler` if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)
- Only add `--channel left|right` if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel
- Only add `--max-chars-per-line N` when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); takes priority over `--max-words-per-line`
- Only add `--detect-paragraphs` if the user asks for paragraph breaks or structured text output; `--paragraph-gap` (default 3.0s) only if they want a custom gap
- Only add `--speaker-names "Alice,Bob"` when the user provides real names to replace SPEAKER_1/2 — always requires `--diarize`
- Only add `--hotwords WORDS` when the user names specific rare terms not well served by `--initial-prompt`; prefer `--initial-prompt` for general domain jargon
- Only add `--prefix TEXT` when the user knows the exact words the audio starts with
- Only add `--detect-language-only` when the user only wants to identify the language, not transcribe
- Only add `--stats-file PATH` if the user asks for performance stats, RTF, or benchmark info
- Only add `--parallel N` for large CPU batch jobs; GPU handles one file efficiently on its own — don't add for single files or small batches
- Only add `--retries N` for unreliable inputs (URLs, network files) where transient failures are expected
- Only add `--burn-in OUTPUT` only when user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input
- Only add `--keep-temp` when the user may re-process the same URL to avoid re-downloading
- Only add `--output-template` when user specifies a custom naming pattern in batch mode
- **Multi-format output** (`--format srt,text`): only when user explicitly wants multiple formats in one pass; always pair with `-o <dir>`
- Any word-level feature auto-runs wav2vec2 alignment (~5-10s overhead)
- `--diarize` adds ~20-30s on top of that

**Search:**

- Only add `--search "term"` when the user asks to find/locate/search for a specific word or phrase in audio
- `--search` **replaces** the normal transcript output — it prints only matching segments with timestamps
- Add `--search-fuzzy` only when the user mentions approximate/partial matching or typos
- To save search results to a file, use `-o results.txt`

**Chapter detection:**

- Only add `--detect-chapters` when the user asks for chapters, sections, a table of contents, or "where does the topic change"
- Default `--chapter-gap 8` (8-second silence = new chapter) works for most podcasts/lectures; tune down for dense content
- `--chapter-format youtube` (default) outputs YouTube-ready timestamps; use `json` for programmatic use
- **Always use `--chapters-file PATH`** when combining chapters with a transcript output — avoids mixing chapter markers into the transcript text
- If the user only wants chapters (not the transcript), pipe stdout to a file with `-o /dev/null` and use `--chapters

Files: 14

Size: 240.8 KB

Complexity: 71/100

Category: AI Agents

Source: https://github.com/theplasmak/faster-whisper

Related in AI Agents

skill-development

Included

Comprehensive meta-skill for creating, managing, validating, auditing, and distributing Claude Code skills and slash commands (unified in v2.1.3+). Provides skill templates, creation workflows, validation patterns, audit checklists, naming conventions, YAML frontmatter guidance, progressive disclosure examples, and best practices lookup. Use when creating new skills, validating existing skills, auditing skill quality, understanding skill architecture, needing skill templates, learning about YAML frontmatter requirements, progressive disclosure patterns, tool restrictions (allowed-tools), skill composition, skill naming conventions, troubleshooting skill activation issues, creating custom slash commands, configuring command frontmatter, using command arguments ($ARGUMENTS, $1, $2), bash execution in commands, file references in commands, command namespacing, plugin commands, MCP slash commands, Skill tool configuration, or deciding between skills vs slash commands. Delegates to docs-management skill for official documentation.

AI Agentsscripts

reprompter

Included

Transform messy prompts into well-structured, effective prompts — single or multi-agent. Use when: "reprompt", "reprompt this", "clean up this prompt", "structure my prompt", rough text needing XML tags and best practices, "reprompter teams", "repromptception", "run with quality", "smart run", "smart agents", multi-agent tasks, audits, parallel work, anything going to agent teams. Don't use when: simple Q&A, pure chat, immediate execution-only tasks. See "Don't Use When" section for details. Outputs: Structured XML/Markdown prompt, quality score (before/after), optional team brief + per-agent sub-prompts, agent team output files. Success criteria: Single mode quality score ≥ 7/10; Repromptception per-agent prompt quality score 8+/10; all required sections present, actionable and specific.

AI Agentsscripts

adaptive-compaction

Included

Adaptive add-on policy and recovery layer that decides WHEN to compact, prune, snapshot, or fork -- replacing fixed-percent auto-compaction across Claude Code, Codex, and MCP-capable hosts. Trigger on auto-compact timing or damage: "when should I compact", "is it safe to compact now or start a fresh session", "auto-compact fires too early/mid-task", "switching to an unrelated task but the window still has space", "context rot", "answers get worse the longer the session runs", "the agent forgot the plan or my decisions after it summarized", "add a layer on top that manages context without changing the agent", raising autoCompactWindow to give the policy room, or installing/tuning a cross-tool compaction policy or PreCompact hook -- even when "compaction" is never said but the problem is context-window pressure or post-summarization memory loss. Do NOT use to summarize a conversation, build RAG, write a summarization prompt (decides WHEN not HOW), or answer max-context-length trivia.

AI Agentsscripts

agent-skill-creator

Included

Create cross-platform agent skills from workflow descriptions. Activates when users ask to create an agent, automate a repetitive workflow, create a custom skill, or need advanced agent creation. Triggers on phrases like create agent for, automate workflow, create skill for, every day I have to, daily I need to, turn process into agent, need to automate, create a cross-platform skill, validate this skill, export this skill, migrate this skill. Supports single skills, multi-agent suites, transcript processing, template-based creation, interactive configuration, cross-platform export, and spec validation.

AI Agentsscripts

llm-wiki

Included

Use when building or maintaining a persistent personal knowledge base (second brain) in Obsidian where an LLM incrementally ingests sources, updates entity/concept pages, maintains cross-references, and keeps a synthesis current. Triggers include "second brain", "Obsidian wiki", "personal knowledge management", "ingest this paper/article/book", "build a research wiki", "compound knowledge", "Memex", or whenever the user wants knowledge to accumulate across sessions instead of being re-derived by RAG on every query.

AI Agentsscripts

skill-master

Included

Agent Skills authoring, evaluation, and optimization. Create, edit, validate, benchmark, and improve skills following the agentskills.io specification. Use when designing SKILL.md files, structuring skill folders (references, scripts, assets), ingesting external documentation into skills, running trigger evals, benchmarking skill quality, optimizing descriptions, or performing blind A/B comparisons. Keywords: agentskills.io, SKILL.md, skill authoring, eval, benchmark, trigger optimization.

AI Agentsscripts