youtube-screenshotter


Download a YouTube video and extract frames at specified timestamps, each with a perceptual hash. Also discovers candidate timestamps via ffmpeg scene-detect plus per-second pHash run-grouping, and analyses optical flow over a sub-range to compose a sprite grid showing a motion arc. Use when capturing specific moments from a video as PNG images, or when a downstream skill (e.g. youtube-synthesizer) needs a high-recall list of where things change in the video. Mechanical only: no LLM calls.



# YouTube Screenshotter

Mechanical primitives for video analysis. Downloads a YouTube video at 720p via `yt-dlp`, extracts PNG frames at requested timestamps via `ffmpeg`, computes perceptual hashes (pHash) per frame, and discovers candidate timestamps from the video itself using ffmpeg primitives. Emits JSON manifests.

This skill is deliberately narrow: it does not classify content kinds and makes no LLM calls. The caller decides what to do with the discovered candidates and extracted frames. See `youtube-synthesizer` for the smart kind-classification loop that drives this skill.

## Usage

```bash
# Discover candidate timestamps (ffmpeg scene-detect + per-second pHash runs)
scripts/discover.py "<URL_OR_VIDEO_ID>" -o ./out

# Extract frames at specific timestamps
scripts/extract.py "<URL_OR_VIDEO_ID>" -t 1 -t 30 -t 500 -o ./out
```

`extract.py` prints a JSON manifest to stdout. The output dir holds the cached video file (`<video_id>.mp4`) and a `frames/` subdir of extracted frames named `t<ms>.png`, where `<ms>` is the timestamp in milliseconds (for example, `t00001000.png` for 1.0 s). Repeated calls with the same URL skip the download; repeated timestamps hit the per-frame cache.

`discover.py` returns a manifest with two source signals (`scene_detect` and `phash_runs`) plus a unioned `candidates` list. Typical pipeline: `discover.py` → pick candidates from the manifest → `extract.py -t <ts1> -t <ts2> ...` → classify the extracted frames.
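The middle step of that pipeline (picking candidates and turning them into `-t` flags) might look like the sketch below. `build_extract_argv` and its `min_run` filter are hypothetical helpers, not part of the skill, and the sketch assumes `extract.py` accepts fractional-second values for `-t`.

```python
import json


def build_extract_argv(manifest, url, out_dir="./out", min_run=None):
    """Turn a discover manifest into an extract.py argument list.

    Hypothetical helper: the optional min_run filter drops pHash-run
    candidates shorter than min_run seconds; scene-detect candidates
    (run_duration is null) are always kept.
    """
    timestamps = set()
    for cand in manifest["candidates"]:
        run = cand.get("run_duration")
        if min_run is not None and run is not None and run < min_run:
            continue
        timestamps.add(cand["timestamp"])

    argv = ["scripts/extract.py", url]
    for ts in sorted(timestamps):
        argv += ["-t", str(ts)]
    argv += ["-o", out_dir]
    return argv


# A trimmed-down manifest using the candidate fields documented below.
manifest = json.loads("""
{"candidates": [
  {"timestamp": 4.8,  "source": "scene_detect", "run_duration": null},
  {"timestamp": 88.0, "source": "phash_run",    "run_duration": 3.0}
]}
""")
argv = build_extract_argv(manifest, "VIDEO_ID")
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True)` and the stdout parsed as the extract manifest.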

## Manifest shape

```json
{
  "video_path": "/abs/path/to/<video_id>.mp4",
  "metadata": {
    "video_id": "...", "source_url": "...", "source_title": "...",
    "source_author": "...", "source_published_date": "YYYY-MM-DD",
    "channel": "...", "channel_id": "...",
    "duration_seconds": 1119,
    "chapter_markers": [{"title": "...", "start_time": 0.0, "end_time": 153.0}, ...],
    "playlist_id": null, "playlist_title": null
  },
  "entries": [
    {"timestamp": 1.0, "frame_path": "/abs/.../frames/t00001000.png", "phash": "b230d3532d4b8de9"},
    ...
  ]
}
```
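A caller will usually want to look entries up by timestamp, or predict a frame's filename. The sketch below assumes the filename is the timestamp in milliseconds, zero-padded to 8 digits; the padding width is inferred from the single `t00001000.png` example, and both function names are hypothetical.

```python
def frame_filename(timestamp_s: float) -> str:
    """Guess the frame filename for a timestamp in seconds.

    Assumption: milliseconds, zero-padded to 8 digits, as in the
    t00001000.png example for 1.0 s. Rounding behaviour is a guess.
    """
    ms = round(timestamp_s * 1000)
    return f"t{ms:08d}.png"


def index_by_timestamp(manifest: dict) -> dict:
    """Map each entry's timestamp to the entry itself."""
    return {entry["timestamp"]: entry for entry in manifest["entries"]}


sample = {
    "entries": [
        {"timestamp": 1.0,
         "frame_path": "/abs/frames/t00001000.png",
         "phash": "b230d3532d4b8de9"},
    ]
}
entries = index_by_timestamp(sample)
print(frame_filename(1.0), entries[1.0]["phash"])
```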

## Sub-scripts

`scripts/extract.py` composes three primitives that can also be invoked independently:

- `scripts/video.py <url> [-o DIR] [--no-download]` — yt-dlp wrapper for download + metadata. `--no-download` for cheap metadata-only probes.
- `scripts/frames.py <video> -t <ts> [-t <ts> ...] [-o DIR]` — ffmpeg frame extraction.
- `scripts/phash.py compute <image>` / `scripts/phash.py compare <a> <b>` — perceptual-hash compute + Hamming-distance pair classifier (`same` / `ambiguous` / `different` bands).

`scripts/discover.py` is a separate entrypoint that produces a candidate-timestamp manifest:

- Runs `ffmpeg select='gt(scene,N)'` over the video for sharp-cut transitions (configurable via `--threshold-scene`, default 0.2).
- Runs `ffmpeg fps=1,scale=320:180` for a single-pass per-second thumbnail set, then computes pHash per thumbnail and groups consecutive thumbnails into runs by Hamming distance ≤ `--threshold-phash` (default 12 — merges talking-head microexpressions into single runs).
- Filters runs to ≥ `--min-run` seconds (default 3).
- Unions both signals, dedups within 1s.
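The run-grouping and `--min-run` steps can be sketched roughly as follows. This is not the code in `discover.py`: it assumes each thumbnail is compared against the first thumbnail of the current run (comparing against the previous thumbnail is an equally plausible reading), and that a run's duration is its thumbnail count, which matches the sample manifest where a run from 88.0 to 90.0 reports a duration of 3.0.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two hex-encoded hashes of equal width."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def group_runs(hashes, threshold=12, min_run=3):
    """Group per-second pHashes into runs; hashes[i] is second i.

    A run ends when a thumbnail differs from the run's first thumbnail
    by more than `threshold` bits (assumed anchoring rule). Runs shorter
    than `min_run` seconds are dropped.
    """
    runs = []
    start = 0
    for i in range(1, len(hashes) + 1):
        at_end = i == len(hashes)
        if at_end or hamming(hashes[start], hashes[i]) > threshold:
            duration = i - start  # one thumbnail per second
            if duration >= min_run:
                runs.append({
                    "start_t": float(start),
                    "end_t": float(i - 1),
                    "duration": float(duration),
                    "phash": hashes[start],
                })
            start = i
    return runs


# 4 s of one scene, a 2 s blip (filtered out), then 3 s of another scene.
hashes = ["0000000000000000"] * 4 + ["ffffffffffffffff"] * 2 + ["00000000ffffffff"] * 3
runs = group_runs(hashes)
print([(r["start_t"], r["end_t"]) for r in runs])
```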

The two signals complement each other: scene-detect catches sharp cuts (including 1–2s content stretches that the run-duration filter would drop) but misses slow fade-ins; pHash run-grouping catches every sustained content stretch ≥ `--min-run` seconds (including fade-in diagrams).
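Merging the two signals into one `candidates` list could look like the sketch below. The tie-breaking rule is an assumption: here the earlier timestamp wins and anything less than `dedup_window` seconds after the last kept candidate is dropped. The actual rule in `discover.py` may differ.

```python
def union_candidates(scene_times, phash_runs, dedup_window=1.0):
    """Union scene-detect timestamps with pHash run starts.

    Assumed dedup rule: sort by time, keep the first candidate, and skip
    any later one that falls within `dedup_window` seconds of the last
    candidate kept.
    """
    tagged = [(t, "scene_detect", None) for t in scene_times]
    tagged += [(r["start_t"], "phash_run", r["duration"]) for r in phash_runs]
    tagged.sort(key=lambda c: c[0])

    merged = []
    for ts, source, run_duration in tagged:
        if merged and ts - merged[-1]["timestamp"] < dedup_window:
            continue
        merged.append({"timestamp": ts, "source": source, "run_duration": run_duration})
    return merged


candidates = union_candidates(
    scene_times=[4.8, 88.4],
    phash_runs=[{"start_t": 88.0, "end_t": 90.0, "duration": 3.0, "phash": "..."}],
)
print(candidates)
```

In this example the scene cut at 88.4 s is absorbed by the pHash run that starts at 88.0 s, so only two candidates survive.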

`scripts/motion.py` is a third entrypoint for the case `discover.py` can't catch on its own: a small object moving across an otherwise-static background, where the global pHash stays within threshold and run-grouping merges the whole animation into one start frame. Given a sub-range, it densely re-samples (default 4 fps), computes Farneback optical flow for each consecutive frame pair, picks the largest moving connected component, and scores each frame by `area * centeredness * edge_penalty`. With `--sprite-out PATH`, it also re-extracts N evenly-spaced timestamps across the motion-active span at full source resolution and composites them into a Brady-Bunch-style grid (9 frames in a 3×3 layout by default), so a reader can take in the whole motion arc at a glance.

```bash
scripts/motion.py <video> --start S --end E [--fps 4] [-o DIR]
                          [--sprite-out PATH] [--sprite-frames 9] [--sprite-cols 3]
```

The primitive is content-agnostic; the decision *whether* to sprite a span (vs. pick a single frame, vs. emit a text-only `> [animation: ...]` annotation) lives in the caller — typically `youtube-synthesizer` Phase A.5.
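The sprite sampling and layout arithmetic described for `--sprite-out` can be illustrated with a short sketch. Both helpers are hypothetical, and whether `motion.py` includes the span's endpoints in its evenly-spaced samples is an assumption made here.

```python
import math


def sprite_timestamps(span_start, span_end, n=9):
    """Pick n evenly-spaced timestamps across [span_start, span_end].

    Assumption: both endpoints are included. With n == 1 the midpoint
    is used.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [(span_start + span_end) / 2]
    step = (span_end - span_start) / (n - 1)
    return [round(span_start + i * step, 3) for i in range(n)]


def grid_shape(n=9, cols=3):
    """Rows and columns needed to tile n frames, cols frames per row."""
    return math.ceil(n / cols), cols


print(sprite_timestamps(10.0, 18.0))
print(grid_shape(9, 3))
```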

### Discover manifest shape

```json
{
  "video_path": "...", "duration_seconds": 1119,
  "thresholds": {"scene": 0.2, "phash": 12, "min_run": 3.0},
  "sources": {
    "scene_detect": [4.8, 6.9, 18.6, ...],
    "phash_runs": [
      {"start_t": 88.0, "end_t": 90.0, "duration": 3.0, "phash": "..."},
      ...
    ]
  },
  "candidates": [
    {"timestamp": 4.8,  "source": "scene_detect", "run_duration": null},
    {"timestamp": 88.0, "source": "phash_run",    "run_duration": 3.0},
    ...
  ]
}
```

## pHash interpretation

For 64-bit pHash with default thresholds (`SAME_MAX=5`, `DIFFERENT_MIN=20`):

- Hamming distance ≤ 5 → `same` (no meaningful change)
- Hamming distance ≥ 20 → `different` (clear scene/content change)
- 6–19 → `ambiguous` — pHash can't disambiguate; needs visual judgment

Disambiguating the ambiguous band into `same` / `different` / `additive` is the caller's job — typically the `youtube-synthesizer` skill, which does it with the parent agent's native vision capability.
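The three bands reduce to a few lines of Python. This is a sketch of the classification rule only, assuming hashes arrive as 16-character hex strings like the `phash` values in the manifest; it is not the code in `scripts/phash.py`.

```python
SAME_MAX = 5
DIFFERENT_MIN = 20


def hamming(a: str, b: str) -> int:
    """Number of differing bits between two hex-encoded 64-bit hashes."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def classify(a: str, b: str) -> str:
    """Place a hash pair into the same / ambiguous / different bands."""
    distance = hamming(a, b)
    if distance <= SAME_MAX:
        return "same"
    if distance >= DIFFERENT_MIN:
        return "different"
    return "ambiguous"


print(classify("b230d3532d4b8de9", "b230d3532d4b8de9"))
print(classify("0000000000000000", "00000000000000ff"))
print(classify("0000000000000000", "0000000000ffffff"))
```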

## Supported URL formats

Same as `youtube-transcript`:
- `https://www.youtube.com/watch?v=VIDEO_ID`
- `https://youtu.be/VIDEO_ID`
- `https://youtube.com/embed/VIDEO_ID`
- `https://youtube.com/shorts/VIDEO_ID`
- Raw 11-character video ID
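For callers that need to normalise these inputs themselves, a minimal parser might look like the following. `extract_video_id` is a hypothetical helper, not the skill's own parser, and the assumption that IDs consist of letters, digits, `_` and `-` reflects common practice rather than anything stated here.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumed ID alphabet: 11 characters from letters, digits, "_" and "-".
_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(value: str):
    """Return the video ID for any of the listed input forms, else None."""
    value = value.strip()
    if _ID_RE.fullmatch(value):
        return value

    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com":
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
                candidate = parts[1]

    if candidate and _ID_RE.fullmatch(candidate):
        return candidate
    return None


for sample in (
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://youtu.be/dQw4w9WgXcQ",
    "https://youtube.com/shorts/dQw4w9WgXcQ",
    "dQw4w9WgXcQ",
):
    print(extract_video_id(sample))
```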

## Prerequisites

- `uv` — installs Python deps in an ephemeral venv per invocation. `curl -LsSf https://astral.sh/uv/install.sh | sh` or `pip install uv`.
- `ffmpeg` — required for frame extraction and yt-dlp's video+audio merge. `sudo apt install ffmpeg` (Linux), `brew install ffmpeg` (macOS).
- Network access to `youtube.com` and `googlevideo.com` for downloads.
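A quick preflight check for the two command-line prerequisites can be written with the standard library. `missing_tools` is a hypothetical helper; it only confirms that the executables are on `PATH`, not that their versions are suitable.

```python
import shutil


def missing_tools(tools=("uv", "ffmpeg")):
    """Return the names of required executables not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found.")
```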

Files: 7

Size: 50.7 KB

Complexity: 59/100

Category: Image & Video

Source: https://github.com/devonjones/devon-claude-skills/tree/main/plugins/youtube-screenshotter/skills/youtube-screenshotter
