qwen-vision

Use when the user asks to "analyze video", "watch this video", "what happens in this video", "describe this clip", "review this footage", "classify these videos", "compare videos", "analyze this image", "what's in this screenshot", or when the user provides a video/image file path and expects visual understanding. Also trigger on: "qwen", "video bridge", "multimodal analysis", "motion analysis", "video reference", "video breakdown", "batch classify", or any task requiring understanding of video content that Claude cannot do natively.

What this skill does


# Qwen Vision Bridge

Claude cannot natively understand video. This skill bridges that gap by calling Qwen Omni — a natively multimodal model that processes video with temporal attention (it sees motion, not just individual frames).

The bridge also handles images, which is useful when you want Qwen's analysis of screenshots, diagrams, or photos.

## How it works

A Python script at `${CLAUDE_PLUGIN_ROOT}/skills/qwen-vision/scripts/qwen_bridge.py` sends media files to the Qwen API and returns the analysis as text. Call it via Bash.

## Prerequisites

The user must have:
1. `DASHSCOPE_API_KEY` environment variable set (get one at https://dashscope.console.aliyun.com/ or https://modelstudio.console.alibabacloud.com/)
2. Python 3.9+ with the `dashscope` package installed (`pip install dashscope`)

If the user hasn't completed setup yet, suggest running `/qwen-setup` first.
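A quick preflight check can confirm both prerequisites before the bridge is called. This is a minimal sketch, not part of the skill: the function name `missing_prerequisites` is hypothetical, and it only checks that the key is present, not that it is valid.

```python
import importlib.util
import os
import sys


def missing_prerequisites(env=None):
    """Return human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9+ is required")
    if not env.get("DASHSCOPE_API_KEY"):
        problems.append("DASHSCOPE_API_KEY is not set")
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("dashscope") is None:
        problems.append("dashscope is not installed (pip install dashscope)")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("Missing:", problem)
```

If the list is non-empty, pointing the user at `/qwen-setup` is probably the simplest next step.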

## Basic usage

```bash
python3 "${CLAUDE_PLUGIN_ROOT}/skills/qwen-vision/scripts/qwen_bridge.py" "/path/to/video.mp4" "Describe what happens in this video"
```

## Parameters

| Flag | Default | Description |
|------|---------|-------------|
| (positional 1) | required | Path to video or image file |
| (positional 2) | generic prompt | Analysis prompt |
| `--fps` | 2.0 | Frames per second to sample from video. Lower = cheaper, higher = more detail |
| `--model` | qwen-omni-plus-latest | Qwen model to use |
| `--json` | off | Output as JSON (for parsing) |
| `--context` | none | Path to a JSON file containing the previous conversation (multi-turn) |
| `--save-context` | none | Path to write the conversation context to, for follow-up questions |
| `--system-prompt` | none | Custom system prompt for Qwen |
| `--prompt-file` | none | Read prompt from a file instead of argument |
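When the bridge is driven from Python rather than typed into a shell, the table above maps naturally onto a small command builder. The sketch below is illustrative only: `build_command` and its keyword names are hypothetical, the `"."` fallback for `CLAUDE_PLUGIN_ROOT` is a placeholder, and it assumes the script accepts `--prompt-file` without a positional prompt.

```python
import os
import sys

# Path layout taken from the "How it works" section; "." is a placeholder fallback.
BRIDGE = os.path.join(
    os.environ.get("CLAUDE_PLUGIN_ROOT", "."),
    "skills", "qwen-vision", "scripts", "qwen_bridge.py",
)


def build_command(media_path, prompt=None, *, fps=None, model=None,
                  as_json=False, context=None, save_context=None,
                  system_prompt=None, prompt_file=None):
    """Translate keyword arguments into the flags listed in the table."""
    cmd = [sys.executable, BRIDGE, str(media_path)]
    if prompt is not None:
        cmd.append(prompt)
    if fps is not None:
        cmd += ["--fps", str(fps)]
    if model is not None:
        cmd += ["--model", model]
    if as_json:
        cmd.append("--json")
    if context is not None:
        cmd += ["--context", str(context)]
    if save_context is not None:
        cmd += ["--save-context", str(save_context)]
    if system_prompt is not None:
        cmd += ["--system-prompt", system_prompt]
    if prompt_file is not None:
        # Assumption: the positional prompt may be omitted when this is set.
        cmd += ["--prompt-file", str(prompt_file)]
    return cmd
```

Returning a list rather than a single string means the result can be passed to `subprocess.run` without shell quoting, which matters for prompts that contain quotes or spaces.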

## Supported formats

- **Video:** `.mp4`, `.mov`, `.avi`, `.mkv`, `.webm`, `.flv`, `.wmv`
- **Image:** `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`
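A client-side pre-check against these extensions can catch unsupported files before an API call is made. The helper below is a sketch: `media_kind` is a hypothetical name, and treating extensions case-insensitively is an assumption about how the bridge behaves, not something the skill states.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".webm", ".flv", ".wmv"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff"}


def media_kind(path):
    """Return "video", "image", or None for an unsupported extension."""
    suffix = Path(path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None
```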

## Patterns

### Single video analysis

```bash
python3 "${CLAUDE_PLUGIN_ROOT}/skills/qwen-vision/scripts/qwen_bridge.py" "/path/to/video.mp4" "Describe the character's body movement, poses, and transitions" --fps 2
```

Parse the text response and use it in your answer to the user.

### Batch analysis

When the user has multiple videos to analyze, write a Python script that loops through files and calls the bridge for each one. Use `--json` flag for machine-readable output. See `references/batch-pattern.md` for a template.
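As a rough illustration of the loop described above (this is not the template in `references/batch-pattern.md`), a batch runner might look like the following. The function name `analyze_folder`, the restriction to `.mp4` files, and the `runner` parameter are hypothetical. The sketch also assumes that the bridge exits non-zero on failure and that `--json` prints a single JSON document to stdout; because the JSON schema is not described here, the parsed value is stored as-is.

```python
import json
import subprocess
import sys
from pathlib import Path


def analyze_folder(folder, prompt, bridge, fps=1.0, runner=subprocess.run):
    """Run the bridge once per .mp4 in `folder`; return {filename: result}."""
    results = {}
    for video in sorted(Path(folder).glob("*.mp4")):
        cmd = [sys.executable, str(bridge), str(video), prompt,
               "--fps", str(fps), "--json"]
        proc = runner(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            # Assumption: a non-zero exit code signals failure.
            results[video.name] = {"error": proc.stderr.strip()}
            continue
        try:
            results[video.name] = json.loads(proc.stdout)
        except json.JSONDecodeError:
            # Keep the raw text if the output is not valid JSON.
            results[video.name] = {"raw": proc.stdout.strip()}
    return results
```

One failed file does not stop the batch: its entry records the error text and the loop moves on. The `runner` parameter exists only so the loop can be exercised with a stub instead of real API calls.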

### Multi-turn (follow-up questions)

```bash
# First question
python3 "${CLAUDE_PLUGIN_ROOT}/skills/qwen-vision/scripts/qwen_bridge.py" video.mp4 "General analysis" --save-context /tmp/ctx.json

# Follow-up
python3 "${CLAUDE_PLUGIN_ROOT}/skills/qwen-vision/scripts/qwen_bridge.py" video.mp4 "Tell me more about the lighting" --context /tmp/ctx.json
```

### Image analysis

Use the same script and pass an image path instead of a video path:

```bash
python3 "${CLAUDE_PLUGIN_ROOT}/skills/qwen-vision/scripts/qwen_bridge.py" "/path/to/screenshot.png" "What UI elements are visible in this screenshot?"
```

### Cost-saving tips

- Use `--fps 1` for long videos or when fine detail isn't needed
- Use `--fps 0.5` for very long videos (several minutes or more)
- For batch jobs, start with `--fps 1` and increase only if results are too vague
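To see why lowering `--fps` helps, it can be useful to estimate how many frames are sampled at each setting. The arithmetic below is only a proxy: it assumes frames are sampled uniformly at the requested rate, and actual cost depends on the API's own accounting, which is not covered here. `sampled_frames` is a hypothetical helper.

```python
import math


def sampled_frames(duration_seconds, fps):
    """Approximate number of frames sampled at a given --fps."""
    return math.ceil(duration_seconds * fps)


# A 3-minute (180 s) clip at the three rates mentioned above.
for fps in (2.0, 1.0, 0.5):
    print(f"--fps {fps}: ~{sampled_frames(180, fps)} frames")
# --fps 2.0: ~360 frames
# --fps 1.0: ~180 frames
# --fps 0.5: ~90 frames
```

Halving the sampling rate halves the number of frames sent, so dropping from the default of 2 to 0.5 cuts the frame count to a quarter.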

## Error handling

- If `DASHSCOPE_API_KEY` is not set, the script exits with a clear error message. Guide the user to set it up.
- If `dashscope` is not installed, suggest `pip install dashscope`.
- If the API returns an error, the script prints the error code and message. Common issues: invalid key, quota exceeded, unsupported file format.
- If a video file is too large for the API, suggest lowering `--fps` or trimming the video first.
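The advice in this list could be attached to a failed run with a small lookup like the one below. Treat it as a sketch under loud assumptions: `hint_for_error` is a hypothetical helper, and apart from Python's standard `No module named 'dashscope'` message, the substrings it matches are guesses at what the script prints, not its documented output.

```python
def hint_for_error(stderr_text):
    """Map bridge error output to the advice from the list above."""
    text = stderr_text.lower()
    # Assumption: the missing-key message mentions the variable name.
    if "dashscope_api_key" in text:
        return "Set DASHSCOPE_API_KEY, or run /qwen-setup."
    # Standard ModuleNotFoundError wording when the package is absent.
    if "no module named" in text and "dashscope" in text:
        return "Install the SDK: pip install dashscope"
    # Assumption: size errors contain the phrase "too large".
    if "too large" in text:
        return "Lower --fps or trim the video first."
    return "Show the error code and message to the user."
```

Anything the lookup does not recognize falls through to the generic case, so the raw API error code and message still reach the user.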

## What Qwen sees vs what Claude sees

This is important context for the user: Qwen processes video frames with temporal attention — it understands motion, direction, rhythm, and transitions between frames. Claude analyzing individual screenshots cannot do this. When the user needs to understand *what happens* in a video (not just what a single frame looks like), this bridge is the right tool.

## Additional resources

- **`references/batch-pattern.md`** — template for batch video classification
- **`references/prompt-tips.md`** — effective prompts for different analysis types

Files: 4

Size: 16.6 KB

Complexity: 56/100

Category: Image & Video

Source: https://github.com/davepoon/buildwithclaude/tree/main/plugins/give-claude-eyes/skills/qwen-vision
