qwencloud-vision

[QwenCloud] Understand images and videos with Qwen vision models. TRIGGER when: user wants to analyze, describe, or extract information from images or videos, OCR text extraction, chart/table reading, visual reasoning, multi-image comparison, screenshot understanding, video comprehension, or explicitly invokes this skill by name (e.g. use qwencloud-vision). DO NOT TRIGGER when: user wants to generate/create images (use qwencloud-image-generation), generate videos (use qwencloud-video-generation), text-only tasks without visual input, or non-Qwen vision tasks.

> **Agent setup**: If your agent doesn't auto-load skills (e.g. Claude Code),
> see [agent-compatibility.md](references/agent-compatibility.md) once per session.

# Qwen Vision (Image & Video Understanding)

Analyze images and videos using Qwen VL and QVQ models.
This skill is part of **qwencloud/qwencloud-ai**.

## Skill directory

Use the files in this skill's directory to run tasks and look up details. Load reference files on demand, when the default path fails or you need more detail.

| Location | Purpose |
|----------|---------|
| `scripts/analyze.py` | Image/video understanding, multi-image, thinking mode |
| `scripts/reason.py` | Visual reasoning (QVQ, chain-of-thought, streaming) |
| `scripts/ocr.py` | OCR text extraction |
| `scripts/vision_lib.py` | Shared helpers (base64, upload, streaming) |
| `references/execution-guide.md` | Fallback: curl, code generation |
| `references/curl-examples.md` | Curl for base64, multi-image, video, OCR |
| `references/visual-reasoning.md` | QVQ and thinking mode details |
| `references/prompt-guide.md` | Query prompt templates by task, thinking mode decision |
| `references/ocr.md` | OCR parameters and examples |
| `references/sources.md` | Official documentation URLs |
| `references/agent-compatibility.md` | Agent self-check: register skills in project config for agents that don't auto-load |

## Security

**NEVER output any API key or credential in plaintext.** Always use variable references (`$DASHSCOPE_API_KEY` in shell, `os.environ["DASHSCOPE_API_KEY"]` in Python). Any check or detection of credentials must be **non-plaintext**: report only status (e.g. "set" / "not set", "valid" / "invalid"), never the value. Never display contents of `.env` or config files that may contain secrets.

**When the API key is not configured, NEVER ask the user to provide it directly.** Instead, help create a `.env` file with a placeholder (`DASHSCOPE_API_KEY=sk-your-key-here`) and instruct the user to replace it with their actual key from the [QwenCloud Console](https://home.qwencloud.com/api-keys). Only write the actual key value if the user explicitly requests it.
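The non-plaintext rule above can be sketched in Python as follows. This is a minimal illustration, not code from this skill's scripts: the function name `api_key_status` is hypothetical, and checking `QWEN_API_KEY` as an alternative follows the Prerequisites section.

```python
import os


def api_key_status(env=None):
    """Return "set" or "not set"; the key value itself is never returned or printed."""
    env = os.environ if env is None else env
    # DASHSCOPE_API_KEY is the primary variable; QWEN_API_KEY is the documented alternative.
    for name in ("DASHSCOPE_API_KEY", "QWEN_API_KEY"):
        if env.get(name, "").strip():
            return "set"
    return "not set"


# Only the status string reaches the output.
print(f"API key: {api_key_status()}")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching real credentials.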

## Key Compatibility

Scripts require a **standard QwenCloud API key** (`sk-...`). Coding Plan keys (`sk-sp-...`) cannot be used for direct API calls and do not support dedicated vision models (qwen3-vl-plus, qvq-max, etc.). The scripts detect `sk-sp-` keys at startup and print a warning. If qwencloud-ops-auth is installed, see its `references/codingplan.md` for full details.
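A prefix check like the one the scripts perform at startup might look like the sketch below. The function names and the warning text are assumptions made for illustration; the actual implementation in `scripts/vision_lib.py` may differ.

```python
def key_kind(key):
    """Classify a key by prefix only; the key itself is never echoed."""
    # Test the longer prefix first: every "sk-sp-" key also starts with "sk-".
    if key.startswith("sk-sp-"):
        return "coding-plan"
    if key.startswith("sk-"):
        return "standard"
    return "unknown"


def startup_warning(key):
    """Return a warning for Coding Plan keys, or None for any other key."""
    if key_kind(key) == "coding-plan":
        return ("Warning: Coding Plan key detected. It cannot be used for "
                "direct API calls or dedicated vision models.")
    return None
```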

## Model Selection

| Model | Use Case |
|-------|----------|
| **qwen3.6-plus** | **Preferred** — latest flagship, unified multimodal (text+image+video). Thinking on by default. Best balance of quality, speed, cost. |
| **qwen3.5-plus** | Unified multimodal (text+image+video). Thinking on by default. |
| **qwen3.5-flash** | Fast multimodal — cheaper, faster. Thinking on by default. |
| **qwen3-vl-plus** | High-precision — object localization (2D/3D), document/webpage parsing. |
| **qwen3-vl-flash** | Fast vision — lower latency, 33 languages. |
| **qvq-max** | Visual reasoning — chain-of-thought for math, charts. **Streaming only.** |
| **qwen-vl-ocr** | OCR — text extraction, table parsing, document scanning. |
| **qwen-vl-max** | Qwen2.5-VL — best-performing in 2.5 series. |
| **qwen-vl-plus** | Qwen2.5-VL — faster, good balance of performance and cost, 11 languages. |

1. **User specified a model** → use directly.
2. **Consult the qwencloud-model-selector skill** when model choice depends on requirement, scenario, or pricing.
3. **No signal, clear task** → `qwen3.6-plus`. Use `qwen3-vl-plus` for precise localization or 3D detection.
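
The selection order above can be sketched as a small function. This is an illustration only: `pick_model` and its `needs_localization` flag are hypothetical names, and step 2 (consulting the qwencloud-model-selector skill) is left out because it depends on another skill being installed.

```python
DEFAULT_MODEL = "qwen3.6-plus"
LOCALIZATION_MODEL = "qwen3-vl-plus"


def pick_model(user_model=None, needs_localization=False):
    """Apply rule 1, then rule 3, from the list above."""
    # Rule 1: a model named by the user is used as given.
    if user_model:
        return user_model
    # Rule 3: precise localization or 3D detection goes to qwen3-vl-plus.
    if needs_localization:
        return LOCALIZATION_MODEL
    # Rule 3: otherwise fall back to the preferred flagship.
    return DEFAULT_MODEL
```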

> **⚠️ Important**: The model list above is a **point-in-time snapshot** and may be outdated. Model availability
> changes frequently. **Always check the [official model list](https://www.qwencloud.com/models)
> for the authoritative, up-to-date catalog before making model decisions.**

> **Model details**: For more information about a specific model, direct the user to its detail page: `https://www.qwencloud.com/models/<model-name>` (replace `<model-name>` with the exact model ID, e.g. `qwen3.6-plus` → https://www.qwencloud.com/models/qwen3.6-plus). NEVER modify or guess the model name in the URL.
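
As a small sketch, the detail-page URL can be built by inserting the exact model ID into the pattern above. The helper name `model_detail_url` is hypothetical; the function deliberately rejects empty or padded IDs instead of trying to repair them.

```python
MODEL_PAGE_PREFIX = "https://www.qwencloud.com/models/"


def model_detail_url(model_id):
    """Build the detail-page URL from an exact model ID, without altering the ID."""
    if not model_id or model_id != model_id.strip():
        raise ValueError("model_id must be a non-empty, exact model ID")
    return MODEL_PAGE_PREFIX + model_id
```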

> **Dynamic model queries**: If the **qwencloud-model-selector** skill or **QwenCloud CLI** (`qwencloud models info <model>`) is available, use it for real-time model data. CLI requires authentication — see the **qwencloud-usage** skill for login flow.

## Execution

### Prerequisites

- **API Key**: Check that `DASHSCOPE_API_KEY` (or `QWEN_API_KEY`) is set using a **non-plaintext** check only (e.g. in shell:
  `[ -n "$DASHSCOPE_API_KEY" ]`; report only "set" or "not set", never the key value). If not set: run the
  **qwencloud-ops-auth** skill if available; otherwise guide the user to obtain a key from the [QwenCloud Console](https://home.qwencloud.com/api-keys) and set it via a `.env` file
  (`echo 'DASHSCOPE_API_KEY=sk-your-key-here' >> .env` in the project root or current directory) or an environment variable. The
  script searches for `.env` in the current working directory and the project root. Skills may be installed
  independently, so do not assume qwencloud-ops-auth is present.
- Python 3.9+ (stdlib only, **no pip install needed**)
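
The `.env` lookup described above could work roughly as in this sketch. It is an assumption-laden illustration, not the scripts' actual code: the function names are hypothetical, the search order (current directory first) is a guess, and the project root is passed in because the source does not say how the scripts determine it.

```python
from pathlib import Path


def find_env_file(cwd, project_root):
    """Return the first existing .env in the two documented locations, else None."""
    for directory in (Path(cwd), Path(project_root)):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def env_defines(path, name="DASHSCOPE_API_KEY"):
    """Report whether `name` has a non-empty value, without returning the value."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip comments and lines that are not KEY=VALUE pairs.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name and value.strip():
            return True
    return False
```

Both helpers return only a path or a boolean, which keeps them consistent with the rule that key values are never displayed.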

### Environment Check

Before first execution, verify Python is available:

```bash
python3 --version  # must be 3.9+
```

If `python3` is not found, try `python --version` or `py -3 --version`. If Python is unavailable or below 3.9, skip to **Path 2 (curl)** in [execution-guide.md](references/execution-guide.md).
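
The decision above can be sketched as a check on the text printed by `python3 --version`. The names `parse_python_version` and `choose_path` are hypothetical, and the return values `"script"` and `"curl"` are just labels for the default path and Path 2.

```python
import re

MIN_VERSION = (3, 9)


def parse_python_version(output):
    """Extract (major, minor) from text such as "Python 3.11.4", or return None."""
    match = re.search(r"Python (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def choose_path(output):
    """Return "script" when Python 3.9+ is available, otherwise "curl"."""
    version = parse_python_version(output)
    # Compare as integer tuples so that 3.10 correctly ranks above 3.9.
    if version is not None and version >= MIN_VERSION:
        return "script"
    return "curl"
```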

### Default: Run Script

**Script path**: Scripts are in the `scripts/` subdirectory **of this skill's directory** (the directory containing this SKILL.md). **You MUST first locate this skill's installation directory, then ALWAYS use the full absolute path to execute scripts.** Do NOT assume scripts are in the current working directory. Do NOT use `cd` to switch directories before execution. Shared infrastructure lives in `scripts/vision_lib.py`.
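
One way to follow the absolute-path rule is sketched below. The helper `script_command` is hypothetical; it assumes only the `scripts/` layout described above and takes the skill's installation directory as an argument, since locating that directory depends on the agent.

```python
from pathlib import Path


def script_command(skill_dir, script_name, python="python3"):
    """Return [interpreter, absolute script path] for a script in <skill_dir>/scripts/."""
    # resolve() yields an absolute path, so no `cd` is needed before execution.
    script = Path(skill_dir).expanduser().resolve() / "scripts" / script_name
    if not script.is_file():
        raise FileNotFoundError(f"script not found: {script}")
    return [python, str(script)]
```

The returned list can be extended with arguments such as `--help` and passed to `subprocess.run`, which waits for the script to finish and so keeps execution in the foreground.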

**Execution note:** Run all scripts in the **foreground** — wait for stdout; do not background.

**Discovery:** Run `python3 <this-skill-dir>/scripts/analyze.py --help` (or `reason.py`, `ocr.py`) first to see all available arguments.

| Script | Purpose | Default Model |
|--------|---------|---------------|
| `scripts/analyze.py` | Image understanding, multi-image, video, thinking mode, high-res | `qwen3.6-plus` |
| `scripts/reason.py` | Visual reasoning with chain-of-thought, video reasoning (always streaming) | `qvq-max` |
| `scripts/ocr.py` | OCR text extraction from documents, receipts, tables | `qwen-vl-ocr` |

**Input type fields** (use exactly one in `--request` JSON):

| Field | Use for | Example |
|-------|---------|--------|
| `"image"` | Single image (URL or local path) | `"image": "photo.jpg"` |
| `"images"` | Multi-image comparison (array) | `"images": ["a.jpg", "b.jpg"]` |
| `"video"` | Video file (URL or local path) | `"video": "clip.mp4"` |
| `"video_frames"` | Video as frame array | `"video_frames": ["f1.jpg", "f2.jpg"]` |

> **⚠️ Common mistake**: Do NOT use `"image"` for video files — use `"video"` instead.
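
The "exactly one input field" rule can be enforced before calling a script, as in this sketch. `build_request` is a hypothetical helper, not part of the skill; it covers only the four fields in the table above plus an optional `prompt`.

```python
import json

INPUT_FIELDS = ("image", "images", "video", "video_frames")


def build_request(prompt=None, **inputs):
    """Return the JSON string for --request, requiring exactly one input field."""
    unknown = [name for name in inputs if name not in INPUT_FIELDS]
    if unknown:
        raise ValueError(f"unknown input field(s): {unknown}")
    if len(inputs) != 1:
        raise ValueError("use exactly one of: " + ", ".join(INPUT_FIELDS))
    request = {}
    if prompt is not None:
        request["prompt"] = prompt
    request.update(inputs)
    return json.dumps(request)


# A video file goes under "video", never under "image".
print(build_request("Describe what happens in this video", video="clip.mp4"))
```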

```bash
# Image analysis
python3 <this-skill-dir>/scripts/analyze.py \
  --request '{"prompt":"What is in this image?","image":"https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"}' \
  --output output/qwencloud-vision/result.json --print-response

# Video analysis (local file — add --upload-files for files >= 7 MB)
python3 <this-skill-dir>/scripts/analyze.py \
  --request '{"prompt":"Describe what happens in this video","video":"clip.mp4"}' \
  --upload-files --print-response

# Visual reasoning (default model qvq-max, always streaming)
python3 <this-skill-dir>/scripts/reason.py \
  --request '{"prompt":"Solve this math problem step by step","image":"problem.png"}' \
  --print-response

# OCR text extraction (default model qwen-vl-ocr)
python3 <this-skill-dir>/scripts/ocr.py \
  --request '{"image":"invoice.jpg"}' \
  --print-response

```