omnicaptions-transcribe
Use when transcribing audio/video to text with timestamps, speaker labels, and chapters. Supports YouTube URLs and local files. Produces structured markdown output.
What this skill does
# Gemini Transcription

Transcribe audio/video using Google Gemini API with structured markdown output.

## YouTube Video Workflow

**Important**: Check for existing captions before transcribing:

```
1. Check captions: yt-dlp --list-subs "URL"
2. Has caption → Use /omnicaptions:download to get existing captions (better quality)
3. No caption → Transcribe directly with URL (don't download first!)
```

**Confirm with user**: Before transcribing, ask if they want to check for existing captions first.

## URL & Local File Support

**Gemini natively supports YouTube URLs** - no need to download, just pass the URL directly:

```bash
# YouTube URL (recommended, no download needed)
omnicaptions transcribe "https://www.youtube.com/watch?v=VIDEO_ID"

# Local files
omnicaptions transcribe video.mp4
```

**Note**: Output defaults to current directory unless user specifies `-o`.

## When to Use

- **Video URLs** - YouTube, direct video links (Gemini native support)
- Transcribing podcasts, interviews, lectures
- Need verbatim transcript with timestamps and speaker labels
- Want auto-generated chapters from content
- Mixed-language audio (code-switching preserved)

## When NOT to Use

- **Video has existing captions** - Use `/omnicaptions:download` to get existing captions first
- Need real-time streaming transcription (use Whisper)
- Audio >2 hours (Gemini upload limit)
- Want translation instead of transcription

## Quick Reference

| Method | Description |
|--------|-------------|
| `transcribe(path)` | Transcribe file or URL (sync) |
| `translate(in, out, lang)` | Translate captions |
| `write(text, path)` | Save text to file |

## Setup

```bash
pip install omni-captions-skills --extra-index-url https://lattifai.github.io/pypi/simple/
```

## API Key

Priority: `GEMINI_API_KEY` env → `.env` file → `~/.config/omnicaptions/config.json`

If not set, ask user: `Please enter your Gemini API key (get from https://aistudio.google.com/apikey):`

Then run with `-k <key>`.
Key will be saved to config file automatically.

## CLI Usage

**IMPORTANT**: CLI requires subcommand (`transcribe`, `translate`, `convert`)

```bash
# Transcribe (auto-output to same directory)
omnicaptions transcribe video.mp4              # → ./video_GeminiUnd.md
omnicaptions transcribe "https://youtu.be/abc" # → ./abc_GeminiUnd.md

# Specify output file or directory
omnicaptions transcribe video.mp4 -o output/   # → output/video_GeminiUnd.md
omnicaptions transcribe video.mp4 -o my.md     # → my.md

# Options
omnicaptions transcribe -m gemini-3-pro-preview video.mp4
omnicaptions transcribe -l zh video.mp4        # Force Chinese
```

| Option | Description |
|--------|-------------|
| `-k, --api-key` | Gemini API key (auto-prompted if missing) |
| `-o, --output` | Output file or directory (default: auto) |
| `-m, --model` | Model (default: gemini-3-flash-preview) |
| `-l, --language` | Force language (zh, en, ja) |
| `-t, --translate LANG` | Translate to language (one-step) |
| `--bilingual` | Bilingual output (with -t) |
| `-v, --verbose` | Verbose output |

## Bilingual Captions (Optional)

If user requests bilingual output, add `-t <lang> --bilingual`:

```bash
omnicaptions transcribe video.mp4 -t zh --bilingual
```

For precise timing, use separate workflow: transcribe → LaiCut → translate (see Related Skills).

## Output Format

```markdown
## Table of Contents
* [00:00:00] Introduction
* [00:02:15] Main Topic

## [00:00:00] Introduction

**Host:** Welcome to the show. [00:00:01]

**Guest:** Thanks for having me. [00:00:05]

[Applause] [00:00:08]
```

Key features:

- `## [HH:MM:SS] Title` chapter headers
- `**Speaker:**` labels (auto-detected)
- `[HH:MM:SS]` timestamp at paragraph end
- `[Event]` for non-speech (laughter, music)

## Common Mistakes

| Mistake | Fix |
|---------|-----|
| No API key error | Use `-k YOUR_KEY` or follow the prompt |
| Empty response | Check file format (mp3/mp4/wav/m4a supported) |
| Upload timeout | File too large (>2GB); split first |
| Wrong language | Use `-l en` to force language |

## Related Skills

| Skill | Use When |
|-------|----------|
| `/omnicaptions:convert` | Convert output to SRT/VTT/ASS |
| `/omnicaptions:translate` | Translate (Gemini API or Claude native) |
| `/omnicaptions:download` | Download video/audio first |

### Workflow Examples

```bash
# Basic transcription
omnicaptions transcribe video.mp4
# → video_GeminiUnd.md

# Precise timing needed: transcribe → LaiCut align → convert
omnicaptions transcribe video.mp4
omnicaptions LaiCut video.mp4 video_GeminiUnd.md
# → video_GeminiUnd_LaiCut.json
omnicaptions convert video_GeminiUnd_LaiCut.json -o video_GeminiUnd_LaiCut.srt
```

> **Note**: For translation, use `/omnicaptions:translate` (default: Claude, optional: Gemini API)
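### Chapter Extraction Sketch

Because chapter headers in the transcript follow the `## [HH:MM:SS] Title` pattern described under Output Format, a short script can list them for downstream use. The helper below is a minimal sketch, not part of the `omnicaptions` CLI: the function name `list_chapters` is hypothetical, and it assumes every chapter header sits on its own line in exactly that pattern.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with omnicaptions): print each chapter's
# start time in seconds and its title, tab-separated, from a transcript that
# uses "## [HH:MM:SS] Title" chapter headers.
list_chapters() {
  # $1: path to a transcript markdown file
  awk '
    /^## \[[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\] / {
      ts = substr($2, 2, 8)                 # "[00:02:15]" -> "00:02:15"
      split(ts, p, ":")                     # hours, minutes, seconds
      secs = p[1] * 3600 + p[2] * 60 + p[3]
      title = $0
      sub(/^## \[[^]]*\] /, "", title)      # drop the "## [HH:MM:SS] " prefix
      printf "%d\t%s\n", secs, title
    }
  ' "$1"
}
```

Calling `list_chapters video_GeminiUnd.md` prints one line per chapter. The `## Table of Contents` heading and its `* [HH:MM:SS]` bullets are skipped because they do not match the header pattern.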