computer-automation
Vision-driven desktop automation using Midscene. Control your local desktop (macOS, Windows, Linux) or a remote Windows desktop over RDP with natural language commands. Operates entirely from screenshots; no DOM or accessibility labels are required. Can interact with all visible elements on screen regardless of technology stack.

⚠️ In local mode this takes over the user's real mouse and keyboard. For web apps, prefer "Browser Automation" instead. Only use this for desktop-native apps (Electron, Qt, native macOS/Windows/Linux) that cannot run in a browser, or for driving a remote Windows host via RDP.

Triggers: open app, press key, desktop, computer, click on screen, type text, screenshot desktop, launch application, switch window, desktop automation, control computer, mouse click, keyboard shortcut, screen capture, find on screen, read screen, verify window, close app, test Electron app, rdp, remote desktop, windows server, connect via rdp.

Powered by Midscene.js (https://midscenejs.com)
# Desktop Computer Automation
> **CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:**
>
> 1. **Never run midscene commands in the background.** Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
> 2. **Run only one midscene command at a time.** Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
> 3. **Allow enough time for each command to complete.** Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex `act` commands may need even longer.
> 4. **Always report task results before finishing.** After completing the automation task, you MUST proactively summarize the results to the user — including key data found, actions completed, screenshots taken, and any relevant findings. Never silently end after the last automation step; the user expects a complete response in a single interaction.
> 5. **Only minimize windows, never close them unless explicitly asked.** When you need to dismiss or get a window out of the way, minimize it instead of closing it. Do not close any app or window unless the user explicitly asks you to do so.
Control your desktop (macOS, Windows, Linux) using `npx -y @midscene/computer@1`. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
## What `act` Can Do
Inside a single `act` call on desktop, Midscene can move the mouse, click, double-click, right-click, drag items, type or clear text, scroll, press single keys or keyboard shortcuts, and work through multi-step interactions on whatever is visible on the selected display.
## Prerequisites
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a `.env` file in the current working directory (Midscene loads `.env` automatically):
```bash
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
```
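Before running any command, it can help to confirm that these four variables are visible to the shell. The following is a minimal sketch, not part of Midscene: `missing_midscene_vars` is a hypothetical helper name, and it only sees variables set in the current shell, so values that live solely in a `.env` file will be reported as missing even though Midscene would load them.

```bash
# Print the name of every required Midscene variable that is unset or empty
# in the current shell. Prints nothing when all four are set.
# NOTE: missing_midscene_vars is an illustrative helper, not a Midscene command.
missing_midscene_vars() {
  for name in MIDSCENE_MODEL_API_KEY MIDSCENE_MODEL_NAME \
              MIDSCENE_MODEL_BASE_URL MIDSCENE_MODEL_FAMILY; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

missing="$(missing_midscene_vars)"
if [ -n "$missing" ]; then
  printf 'Missing Midscene model settings:\n%s\n' "$missing" >&2
fi
```

If the list is non-empty and no `.env` file supplies the values, that is the point at which to ask the user to configure the model.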
Example: Gemini (Gemini-3-Flash)
```bash
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
```
Example: Qwen 3.5
```bash
MIDSCENE_MODEL_API_KEY="your-aliyun-api-key"
MIDSCENE_MODEL_NAME="qwen3.5-plus"
MIDSCENE_MODEL_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
MIDSCENE_MODEL_FAMILY="qwen3.5"
MIDSCENE_MODEL_REASONING_ENABLED="false"
# If using OpenRouter, set:
# MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
# MIDSCENE_MODEL_NAME="qwen/qwen3.5-plus"
# MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
```
Example: Doubao Seed 2.0 Lite
```bash
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-2-0-lite"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-seed"
```
Commonly used models: Doubao Seed 2.0 Lite, Qwen 3.5, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See [Model Configuration](https://midscenejs.com/model-common-config) for supported providers.
## Commands
### Connect to Desktop
```bash
npx -y @midscene/computer@1 connect
npx -y @midscene/computer@1 connect --displayId <id>
```
### Connect via RDP
Use RDP mode to drive a **remote Windows desktop** instead of the local machine. Providing `--host` switches `connect` to RDP and routes every subsequent command (`act`, `tap`, `take_screenshot`, `assert`, `disconnect`) through the RDP helper binary bundled with `@midscene/computer`. The local mouse/keyboard is **not** touched.
Minimum example:
```bash
npx -y @midscene/computer@1 connect \
--host rdp.example.com \
--username Administrator \
--password "$RDP_PASSWORD"
```
All RDP options for `connect` (RDP mode is activated when `--host` is set; the other flags are optional):
- `--host <fqdn-or-ip>` — RDP host. Required to enter RDP mode.
- `--port <number>` — RDP port (default `3389`).
- `--username <user>` — RDP user account.
- `--password <secret>` — RDP password. Prefer reading from an environment variable, secrets manager, or interactive prompt; never paste it into a shared transcript.
- `--domain <domain>` — Active Directory / NTLM domain.
- `--security-protocol <auto|tls|nla|rdp>` — Security protocol negotiation. Defaults to `auto`.
- `--ignore-certificate` — Skip TLS certificate validation. Use only for trusted dev hosts with self-signed certs.
- `--admin-session` — Attach to the admin/console session (equivalent to `mstsc /admin`).
- `--desktop-width <px>` and `--desktop-height <px>` — **Request** a specific remote desktop resolution. The actual size is whatever the RDP server negotiates back; e.g. requesting `1024x768` against a host that pins `1280x720` will land on `1280x720`. Confirm the negotiated size with `listdisplays --host ... --username ... --password ...` after connect.
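Because the requested resolution is only a request, a quick comparison against the negotiated size can catch mismatches early. The sketch below assumes the display name always carries the size in parentheses, as in the sample `listdisplays` output in this section (`RDP 10.70.86.26:3389 (1280x720)`); that format is an assumption drawn from one example, and `negotiated_size` is a hypothetical helper, not a Midscene command.

```bash
# Extract "WIDTHxHEIGHT" from a display name such as
# "RDP 10.70.86.26:3389 (1280x720)". Prints nothing if no size is found.
# ASSUMPTION: the name format matches the single sample in this document.
negotiated_size() {
  printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*x[0-9][0-9]*\)).*/\1/p'
}

requested="1024x768"
actual="$(negotiated_size 'RDP 10.70.86.26:3389 (1280x720)')"
if [ "$actual" != "$requested" ]; then
  echo "Requested $requested but the server negotiated $actual"
fi
# prints: Requested 1024x768 but the server negotiated 1280x720
```

In practice the name string would come from the `name` field of the `listdisplays` output rather than a literal.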
Notes specific to RDP mode:
- `--displayId` and `--headless` are **ignored** in RDP mode. A connected RDP session always exposes a single virtual display whose size is whatever the server negotiated.
- Two display-listing commands exist and behave differently — pick the right one:
  - `list_displays` (with underscore, platform tool) — enumerates **local** physical displays only. Does not accept RDP flags. Useless after an RDP connect.
  - `listdisplays` (no underscore, action tool) — accepts the same RDP flags as `connect`/`take_screenshot`/etc. In RDP mode it returns the negotiated virtual display, e.g. `[{ "id": "...", "name": "RDP 10.70.86.26:3389 (1280x720)", "primary": true }]`. Use this to verify the actual resolution.
- The RDP transport uses a native helper binary shipped inside `@midscene/computer`. If you see `RDP helper binary not found` errors, the optional `bin/<platform>/rdp-helper` was stripped from your install — reinstall the package or unpack a fresh tarball.
- Treat RDP credentials as secrets: do not commit `.env` files or scripts that contain the RDP password to the repo; prefer `export RDP_PASSWORD=...` in the current shell and reference it as `--password "$RDP_PASSWORD"`.
- **Latency expectations**: every CLI invocation starts a fresh Node.js process, so each command re-establishes the RDP session. Budget roughly:
  - `connect` / `take_screenshot` / `keyboardpress` / `scroll`: ~5 s (Node.js startup + RDP TLS+NLA handshake + first frame).
  - `act` / `assert` / `tap --locate`: ~5 s + AI inference + any planned sub-actions; expect 8–20 s end-to-end for typical interactions.
  - The RDP handshake itself is ~700 ms; the rest is cold-start cost that the one-process-per-command CLI design cannot avoid.
- **Connect failure diagnostics**: when `connect` fails, the first line of stderr is the actionable error (e.g. `connect_failed: Failed to connect to RDP server: ERRCONNECT_LOGON_FAILURE: Logon failed.`). The subsequent stack trace is diagnostic noise — read the first line, then check credentials/network. Common `ERRCONNECT_*` causes:
  - `LOGON_FAILURE` — bad username/password/domain.
  - `CONNECT_TRANSPORT_FAILED` — host unreachable or RDP port blocked. Verify with `nc -zv <host> 3389`.
  - `TLS_CONNECT_FAILED` — TLS handshake rejected. Try `--ignore-certificate` for self-signed dev hosts, or pin `--security-protocol nla`.
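The "read the first line, ignore the stack trace" advice can be sketched as a small triage helper. This is an illustration only: `explain_connect_failure` is a hypothetical function, the hints simply restate the causes listed above, and the second line of the sample input is a placeholder rather than real Midscene output.

```bash
# Print the first line of captured stderr, then a hint for the
# ERRCONNECT_* causes listed in this document.
# NOTE: explain_connect_failure is an illustrative helper, not a Midscene command.
explain_connect_failure() {
  first_line="$(printf '%s\n' "$1" | head -n 1)"
  printf '%s\n' "$first_line"
  case "$first_line" in
    *LOGON_FAILURE*)            echo "hint: bad username/password/domain" ;;
    *CONNECT_TRANSPORT_FAILED*) echo "hint: host unreachable or RDP port blocked" ;;
    *TLS_CONNECT_FAILED*)       echo "hint: TLS handshake rejected" ;;
    *)                          echo "hint: unrecognized error, read the line above" ;;
  esac
}

# Sample input: the first line is the example error quoted above;
# the second line stands in for the stack trace.
sample='connect_failed: Failed to connect to RDP server: ERRCONNECT_LOGON_FAILURE: Logon failed.
    (stack trace lines omitted)'
explain_connect_failure "$sample"
```

To use it with a real run, capture stderr from `connect` into a variable or file and pass the text as the first argument.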
After `connect --host ...` succeeds, the rest of the workflow (`act`, `tap --locate`, `assert`, `take_screenshot`, `listdisplays`, `report-tool`, `disconnect`) is identical to local mode — just remember to pass the same `--host`/`--username`/`--password`/`--ignore-certificate` flags to every subsequent command, since each CLI invocation is stateless and reconnects.
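Since every command needs the same connection flags, one way to avoid retyping them is a thin shell wrapper. The sketch below is an assumption-laden convenience, not part of `@midscene/computer`: `midscene_rdp`, `RDP_HOST`, `RDP_USERNAME`, and `MIDSCENE_DRY_RUN` are hypothetical names introduced here, while `RDP_PASSWORD` follows the convention used earlier in this section. The dry-run branch masks the password so the command shape can be inspected safely.

```bash
# Run one @midscene/computer subcommand with the shared RDP flags appended.
# NOTE: midscene_rdp, RDP_HOST, RDP_USERNAME and MIDSCENE_DRY_RUN are
# illustrative names for this sketch only.
midscene_rdp() {
  subcommand="$1"
  shift
  if [ "${MIDSCENE_DRY_RUN:-0}" = "1" ]; then
    # Show the command shape without revealing the password.
    extra=""
    if [ "$#" -gt 0 ]; then
      extra=" $*"
    fi
    printf '%s\n' "npx -y @midscene/computer@1 $subcommand --host $RDP_HOST --username $RDP_USERNAME --password ***$extra"
    return 0
  fi
  # Real invocation: runs in the foreground, one command at a time.
  npx -y @midscene/computer@1 "$subcommand" \
    --host "$RDP_HOST" --username "$RDP_USERNAME" --password "$RDP_PASSWORD" "$@"
}

RDP_HOST="rdp.example.com"
RDP_USERNAME="Administrator"
MIDSCENE_DRY_RUN=1
midscene_rdp take_screenshot --ignore-certificate
```

With `MIDSCENE_DRY_RUN` unset, the same call would execute the command synchronously, which keeps it compatible with the one-command-at-a-time rule at the top of this document.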
### List Displays