
computer-automation


Vision-driven desktop automation using Midscene. Control your local desktop (macOS, Windows, Linux) or a remote Windows desktop over RDP with natural language commands. Operates entirely from screenshots; no DOM or accessibility labels are required. Can interact with all visible elements on screen regardless of technology stack.

⚠️ In local mode this takes over the user's real mouse and keyboard. For web apps, prefer "Browser Automation" instead. Only use this for desktop-native apps (Electron, Qt, native macOS/Windows/Linux) that cannot run in a browser, or for driving a remote Windows host via RDP.

Triggers: open app, press key, desktop, computer, click on screen, type text, screenshot desktop, launch application, switch window, desktop automation, control computer, mouse click, keyboard shortcut, screen capture, find on screen, read screen, verify window, close app, test Electron app, rdp, remote desktop, windows server, connect via rdp.

Powered by Midscene.js (https://midscenejs.com)


What this skill does


# Desktop Computer Automation

> **CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:**
>
> 1. **Never run midscene commands in the background.** Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. **Run only one midscene command at a time.** Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. **Allow enough time for each command to complete.** Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex `act` commands may need even longer.
> 4. **Always report task results before finishing.** After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
> 5. **Only minimize windows, never close them unless explicitly asked.** When you need to dismiss or get a window out of the way, minimize it instead of closing it. Do not close any app or window unless the user explicitly asks you to do so.

Control your desktop (macOS, Windows, Linux) using `npx -y @midscene/computer@1`. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.

## What `act` Can Do

Inside a single `act` call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.

## Prerequisites

Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a `.env` file in the current working directory (Midscene loads `.env` automatically):

```bash
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
```

Example: Gemini (Gemini-3-Flash)

```bash
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
```

Example: Qwen 3.5

```bash
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCENE_MODEL_NAME="qwen/qwen3.5-plus"
# MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
```

Example: Doubao Seed 2.0 Lite

```bash
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-2-0-lite"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-seed"
```

Commonly used models: Doubao Seed 2.0 Lite, Qwen 3.5, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.

If the model is not configured, ask the user to set it up. See [Model Configuration](https://midscenejs.com/model-common-config) for supported providers.
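
A quick preflight check can confirm the four required variables are present before any command runs. The following is a minimal sketch, not part of Midscene: the function name `check_midscene_env` is hypothetical, and it assumes `.env` entries use the plain `NAME=value` form shown in the examples above (an `export NAME=...` line would not be detected).

```bash
# Hypothetical helper (not shipped with Midscene): report which required
# variables are missing from both the shell environment and ./.env.
check_midscene_env() {
  missing=""
  for name in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
              MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    # Accept the variable if it is set in the shell or defined in ./.env.
    if [ -z "$value" ] && ! grep -qs "^${name}=" .env; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

Calling `check_midscene_env` from the working directory prints `ok` or a list of the missing names, which is enough to decide whether to ask the user to configure the model first.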

## Commands

### Connect to Desktop

```bash
npx -y @midscene/computer@1 connect
npx -y @midscene/computer@1 connect --displayId <id>
```

### Connect via RDP

Use RDP mode to drive a **remote Windows desktop** instead of the local machine. Providing `--host` switches `connect` to RDP and routes every subsequent command (`act`, `tap`, `take_screenshot`, `assert`, `disconnect`) through the RDP helper binary bundled with `@midscene/computer`. The local mouse/keyboard is **not** touched.

Minimum example:

```bash
npx -y @midscene/computer@1 connect \
  --host rdp.example.com \
  --username Administrator \
  --password "$RDP_PASSWORD"
```

All RDP options for `connect` (RDP mode is activated when `--host` is set; the other flags are optional):

- `--host <fqdn-or-ip>` — RDP host. Required to enter RDP mode.
- `--port <number>` — RDP port (default `3389`).
- `--username <user>` — RDP user account.
- `--password <secret>` — RDP password. Prefer reading from an environment variable, secrets manager, or interactive prompt; never paste it into a shared transcript.
- `--domain <domain>` — Active Directory / NTLM domain.
- `--security-protocol <auto|tls|nla|rdp>` — Security protocol negotiation. Defaults to `auto`.
- `--ignore-certificate` — Skip TLS certificate validation. Use only for trusted dev hosts with self-signed certs.
- `--admin-session` — Attach to the admin/console session (equivalent to `mstsc /admin`).
- `--desktop-width <px>` and `--desktop-height <px>` — **Request** a specific remote desktop resolution. The actual size is whatever the RDP server negotiates back; e.g. requesting `1024x768` against a host that pins `1280x720` will land on `1280x720`. Confirm the negotiated size with `listdisplays --host ... --username ... --password ...` after connect.
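
Because the requested resolution is only a hint, it can help to extract the negotiated size from the `listdisplays` result and compare it with what was requested. The sketch below is an assumption-laden illustration: the helper name `negotiated_size` is hypothetical, and it assumes the display name ends with a `(<width>x<height>)` suffix, as in the sample `listdisplays` output quoted in the RDP notes. The real output format may differ.

```bash
# Hypothetical helper: read a listdisplays result on stdin and print the
# first "WIDTHxHEIGHT" found inside parentheses, e.g. "(1280x720)".
# Assumes the name format from the sample output; adjust if it differs.
negotiated_size() {
  sed -n 's/.*(\([0-9][0-9]*x[0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Example with the sample payload from this document:
#   echo '[{ "id": "...", "name": "RDP 10.70.86.26:3389 (1280x720)", "primary": true }]' \
#     | negotiated_size
```

If the printed size differs from the `--desktop-width`/`--desktop-height` values passed to `connect`, the server pinned its own resolution and later coordinates should be reasoned about at the negotiated size.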

Notes specific to RDP mode:

- `--displayId` and `--headless` are **ignored** in RDP mode. A connected RDP session always exposes a single virtual display whose size is whatever the server negotiated.
- Two display-listing commands exist and behave differently — pick the right one:
  - `list_displays` (with underscore, platform tool) — enumerates **local** physical displays only. Does not accept RDP flags. Useless after an RDP connect.
  - `listdisplays` (no underscore, action tool) — accepts the same RDP flags as `connect`/`take_screenshot`/etc. In RDP mode it returns the negotiated virtual display, e.g. `[{ "id": "...", "name": "RDP 10.70.86.26:3389 (1280x720)", "primary": true }]`. Use this to verify the actual resolution.
- The RDP transport uses a native helper binary shipped inside `@midscene/computer`. If you see `RDP helper binary not found` errors, the optional `bin/<platform>/rdp-helper` was stripped from your install — reinstall the package or unpack a fresh tarball.
- Treat RDP credentials as secrets: do not commit `.env` files or scripts that contain the RDP password; prefer `export RDP_PASSWORD=...` in the current shell and reference it as `--password "$RDP_PASSWORD"`.
- **Latency expectations**: every CLI invocation starts a fresh Node.js process, so each command re-establishes the RDP session. Budget roughly:
  - `connect` / `take_screenshot` / `keyboardpress` / `scroll`: ~5 s (Node.js startup + RDP TLS+NLA handshake + first frame).
  - `act` / `assert` / `tap --locate`: ~5 s + AI inference + any planned sub-actions; expect 8–20 s end-to-end for typical interactions.
  - The RDP handshake itself takes ~700 ms; the rest is cold-start cost inherent to running one process per command.
- **Connect failure diagnostics**: when `connect` fails, the first line of stderr is the actionable error (e.g. `connect_failed: Failed to connect to RDP server: ERRCONNECT_LOGON_FAILURE: Logon failed.`). The subsequent stack trace is diagnostic noise — read the first line, then check credentials/network. Common `ERRCONNECT_*` causes:
  - `LOGON_FAILURE` — bad username/password/domain.
  - `CONNECT_TRANSPORT_FAILED` — host unreachable or RDP port blocked. Verify with `nc -zv <host> 3389`.
  - `TLS_CONNECT_FAILED` — TLS handshake rejected. Try `--ignore-certificate` for self-signed dev hosts, or pin `--security-protocol nla`.
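
The diagnostics above amount to "read the first stderr line, then match the error code". A minimal sketch of that triage follows; the function name `rdp_connect_hint` is hypothetical, and the hints simply restate the causes listed above rather than anything Midscene prints itself.

```bash
# Hypothetical helper: take the captured stderr of a failed connect as $1,
# keep only its first line, and print a hint for the known ERRCONNECT_* codes.
rdp_connect_hint() {
  first_line=$(printf '%s\n' "$1" | head -n 1)
  case "$first_line" in
    *LOGON_FAILURE*)
      echo "check username, password, and domain" ;;
    *CONNECT_TRANSPORT_FAILED*)
      echo "host unreachable or RDP port blocked; try: nc -zv <host> 3389" ;;
    *TLS_CONNECT_FAILED*)
      echo "TLS handshake rejected; try --ignore-certificate or --security-protocol nla" ;;
    *)
      # Unknown code: surface the first line unchanged, skip the stack trace.
      echo "unrecognized error: $first_line" ;;
  esac
}
```

One way to use it is to capture stderr from the failing `connect` call into a variable and pass that variable as the single argument; everything after the first line is ignored.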

After `connect --host ...` succeeds, the rest of the workflow (`act`, `tap --locate`, `assert`, `take_screenshot`, `listdisplays`, `report-tool`, `disconnect`) is identical to local mode — just remember to pass the same `--host`/`--username`/`--password`/`--ignore-certificate` flags to every subsequent command, since each CLI invocation is stateless and reconnects.
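
Since every invocation must repeat the connection flags, a small shell wrapper can reduce the chance of forgetting one. This is a sketch under stated assumptions: the function name `midscene_rdp` and the variables `RDP_HOST`, `RDP_USER`, and `MIDSCENE_CMD` are hypothetical conveniences, not part of the CLI; only the subcommands and flags themselves come from this document.

```bash
# Hypothetical wrapper: run one Midscene subcommand with the same RDP flags.
# Assumes RDP_HOST, RDP_USER, and RDP_PASSWORD are already set in the shell.
# MIDSCENE_CMD is an illustrative override; setting it to "echo" gives a dry
# run that prints the arguments (including the password, so use with care).
midscene_rdp() {
  # Left unquoted on purpose so the default splits into separate words.
  ${MIDSCENE_CMD:-npx -y @midscene/computer@1} "$@" \
    --host "$RDP_HOST" \
    --username "$RDP_USER" \
    --password "$RDP_PASSWORD"
}

# Usage, one command at a time, reading the output before the next step:
#   midscene_rdp connect
#   midscene_rdp take_screenshot
#   midscene_rdp disconnect
```

Add `--ignore-certificate` inside the function if the host uses a self-signed certificate, so that it is also passed consistently on every call.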

### List Displays

