omnicaptions-transcribe

Included with Lifetime

$97 forever

Use when transcribing audio/video to text with timestamps, speaker labels, and chapters. Supports YouTube URLs and local files. Produces structured markdown output.

Image & Video

What this skill does


# Gemini Transcription

Transcribe audio/video using Google Gemini API with structured markdown output.

## YouTube Video Workflow

**Important**: Check for existing captions before transcribing:

```
1. Check captions: yt-dlp --list-subs "URL"
2. Has caption → Use /omnicaptions:download to get existing captions (better quality)
3. No caption → Transcribe directly with URL (don't download first!)
```

**Confirm with user**: Before transcribing, ask if they want to check for existing captions first.

## URL & Local File Support

**Gemini natively supports YouTube URLs** - no need to download, just pass the URL directly:

```bash
# YouTube URL (recommended, no download needed)
omnicaptions transcribe "https://www.youtube.com/watch?v=VIDEO_ID"

# Local files
omnicaptions transcribe video.mp4
```

**Note**: Output defaults to current directory unless user specifies `-o`.

## When to Use

- **Video URLs** - YouTube, direct video links (Gemini native support)
- Transcribing podcasts, interviews, lectures
- Need verbatim transcript with timestamps and speaker labels
- Want auto-generated chapters from content
- Mixed-language audio (code-switching preserved)

## When NOT to Use

- **Video has existing captions** - Use `/omnicaptions:download` to get existing captions first
- Need real-time streaming transcription (use Whisper)
- Audio >2 hours (Gemini upload limit)
- Want translation instead of transcription

## Quick Reference

| Method | Description |
|--------|-------------|
| `transcribe(path)` | Transcribe file or URL (sync) |
| `translate(in, out, lang)` | Translate captions |
| `write(text, path)` | Save text to file |

## Setup

```bash
pip install omni-captions-skills --extra-index-url https://lattifai.github.io/pypi/simple/
```

## API Key

Priority: `GEMINI_API_KEY` env → `.env` file → `~/.config/omnicaptions/config.json`

If not set, ask user: `Please enter your Gemini API key (get from https://aistudio.google.com/apikey):`

Then run with `-k <key>`. Key will be saved to config file automatically.

## CLI Usage

**IMPORTANT**: CLI requires subcommand (`transcribe`, `translate`, `convert`)

```bash
# Transcribe (auto-output to same directory)
omnicaptions transcribe video.mp4              # → ./video_GeminiUnd.md
omnicaptions transcribe "https://youtu.be/abc" # → ./abc_GeminiUnd.md

# Specify output file or directory
omnicaptions transcribe video.mp4 -o output/   # → output/video_GeminiUnd.md
omnicaptions transcribe video.mp4 -o my.md     # → my.md

# Options
omnicaptions transcribe -m gemini-3-pro-preview video.mp4
omnicaptions transcribe -l zh video.mp4  # Force Chinese
```

| Option | Description |
|--------|-------------|
| `-k, --api-key` | Gemini API key (auto-prompted if missing) |
| `-o, --output` | Output file or directory (default: auto) |
| `-m, --model` | Model (default: gemini-3-flash-preview) |
| `-l, --language` | Force language (zh, en, ja) |
| `-t, --translate LANG` | Translate to language (one-step) |
| `--bilingual` | Bilingual output (with -t) |
| `-v, --verbose` | Verbose output |

## Bilingual Captions (Optional)

If user requests bilingual output, add `-t <lang> --bilingual`:

```bash
omnicaptions transcribe video.mp4 -t zh --bilingual
```

For precise timing, use separate workflow: transcribe → LaiCut → translate (see Related Skills).

## Output Format

```markdown
## Table of Contents
* [00:00:00] Introduction
* [00:02:15] Main Topic

## [00:00:00] Introduction

**Host:** Welcome to the show. [00:00:01]

**Guest:** Thanks for having me. [00:00:05]

[Applause] [00:00:08]
```

Key features:
- `## [HH:MM:SS] Title` chapter headers
- `**Speaker:**` labels (auto-detected)
- `[HH:MM:SS]` timestamp at paragraph end
- `[Event]` for non-speech (laughter, music)

## Common Mistakes

| Mistake | Fix |
|---------|-----|
| No API key error | Use `-k YOUR_KEY` or follow the prompt |
| Empty response | Check file format (mp3/mp4/wav/m4a supported) |
| Upload timeout | File too large (>2GB); split first |
| Wrong language | Use `-l en` to force language |

## Related Skills

| Skill | Use When |
|-------|----------|
| `/omnicaptions:convert` | Convert output to SRT/VTT/ASS |
| `/omnicaptions:translate` | Translate (Gemini API or Claude native) |
| `/omnicaptions:download` | Download video/audio first |

### Workflow Examples

```bash
# Basic transcription
omnicaptions transcribe video.mp4
# → video_GeminiUnd.md

# Precise timing needed: transcribe → LaiCut align → convert
omnicaptions transcribe video.mp4
omnicaptions LaiCut video.mp4 video_GeminiUnd.md
# → video_GeminiUnd_LaiCut.json
omnicaptions convert video_GeminiUnd_LaiCut.json -o video_GeminiUnd_LaiCut.srt
```

> **Note**: For translation, use `/omnicaptions:translate` (default: Claude, optional: Gemini API)

Files: 1

Size: 5.0 KB

Complexity: 13/100

Category: Image & Video

Source: https://github.com/lattifai/omni-captions-skills/tree/main/skills/omnicaptions-transcribe

Related in Image & Video

watch

Included

Watch a video (URL or local path). Downloads with yt-dlp, extracts auto-scaled frames with ffmpeg, pulls the transcript from captions (or Whisper API fallback), and hands the result to Claude so it can answer questions about what's in the video.

Image & Videoscriptsfeatured

physical-ai-defect-image-generation

Included

Use when the user wants to orchestrate defect image generation, run associated setup, or handle outputs on OSMO. The Day 0 path handles cold-start with USD-to-ROI, image-edit augmentation, and AnomalyGen to create initial PCBA datasets. The Day 1 path performs inference and labeling on real images. This skill helps with first-time asset setup, creation of finetuning checkpoints, and configuring deployment. Trigger keywords: defect image generation, dig workflow, dig pipeline, defect image detection workflow, aoi pipeline, aoi anomalygen, usd2roi anomalygen, day 0 pcba, day 1 pcba, day 1 real-photo alignment, day 1 manual roi, metal surface anomaly, glass defect, anomalygen finetune, setup_pcb, setup_metal, setup_glass, setup_pretrained, dig setup, dig datasets, dig pretrained checkpoint, dig image-edit endpoint.

Image & Videoscripts

accelint-react-best-practices

Included

React performance optimization and best practices. ALWAYS use this skill when working with any React code - writing components, hooks, JSX; refactoring; optimizing re-renders, memoization, state management; reviewing for performance; fixing hydration mismatches; debugging infinite re-renders, stale closures, input focus loss, animations restarting; preventing remounting; implementing transitions, lazy initialization, effect dependencies. Even simple React tasks benefit from these patterns. Covers React 19+ (useEffectEvent, Activity, ref props). Triggers - useEffect, useState, useMemo, useCallback, memo, inline components, nested components, components inside components, re-render, performance, hydration, SSR, Next.js, useDeferredValue, combined hooks.

Image & Videoscripts

elevenlabs-agents

Included

Build conversational AI voice agents with ElevenLabs Platform using React, JavaScript, React Native, or Swift SDKs. Configure agents, tools (client/server/MCP), RAG knowledge bases, multi-voice, and Scribe real-time STT. Use when: building voice chat interfaces, implementing AI phone agents with Twilio, configuring agent workflows or tools, adding RAG knowledge bases, testing with CLI "agents as code", or troubleshooting deprecated @11labs packages, Android audio cutoff, CSP violations, dynamic variables, or WebRTC config. Keywords: ElevenLabs Agents, ElevenLabs voice agents, AI voice agents, conversational AI, @elevenlabs/react, @elevenlabs/client, @elevenlabs/react-native, @elevenlabs/elevenlabs-js, @elevenlabs/agents-cli, elevenlabs SDK, voice AI, TTS, text-to-speech, ASR, speech recognition, turn-taking model, WebRTC voice, WebSocket voice, ElevenLabs conversation, agent system prompt, agent tools, agent knowledge base, RAG voice agents, multi-voice agents, pronunciation dictionary, voice speed control, elevenlabs scribe, @11labs deprecated, Android audio cutoff, CSP violation elevenlabs, dynamic variables elevenlabs, case-sensitive tool names, webhook authentication

Image & Videoscripts

humanizer

Included

Humanize AI-generated text by detecting and removing patterns typical of LLM output. Rewrites text to sound natural, specific, and human. Uses 28 pattern detectors, 560+ AI vocabulary terms across 3 tiers, and statistical analysis (burstiness, type-token ratio, readability) for comprehensive detection. Use when asked to humanize text, de-AI writing, make content sound more natural/human, review writing for AI patterns, score text for AI detection, or improve AI-generated drafts. Covers content, language, style, communication, and filler categories.

Image & Videoscripts

generating-mermaid-diagrams

Included

Salesforce architecture diagrams using Mermaid with ASCII fallback. Use this skill when generating text-based diagrams for Salesforce architecture, OAuth flows, ERDs, integration sequences, or Agentforce structure. TRIGGER when: user says "diagram", "visualize", "ERD", or asks for sequence diagrams, flowcharts, class diagrams, or architecture visualizations in Mermaid. DO NOT TRIGGER when: user wants PNG/SVG image output (use generating-visual-diagrams), or asks about non-Salesforce systems.

Image & Videoscripts