computer-use-cli
Automate and test Linux desktop apps (Electron, Tauri, React Native Desktop) in an X11 session using CLI commands for screenshot + mouse + keyboard. Includes coordinate scaling so you can click/type based on a smaller “API-sized” screenshot.
What this skill does
# Computer Use (CLI) Use this skill to **drive and validate a Linux GUI** from the command line: - **Take screenshots** to observe the current UI state - **Click / type / scroll / drag** to exercise flows - **Smoke-test desktop apps** (Electron / Tauri / React Native Desktop) inside a sandboxed X11 session It’s most useful when you need GUI automation but only have shell access (no native UI automation framework available). ## Quick start 1. Ensure dependencies are installed (see “Dependencies”). 2. Point the commands at the active display (usually already set): - `export DISPLAY=:1` (or whatever your container uses) 3. Take a screenshot to establish the agent’s visual state: - `cu-screenshot --base64` 4. Perform actions in **API coordinate space** (the resized screenshot), then screenshot again: - `cu-click --x 512 --y 384 --base64` - `cu-type --text "hello" --base64` Tip: start the app under the same display, for example: ```bash (DISPLAY=:1 ./my-electron-app &) && sleep 2 cu-screenshot --base64 ``` ## Coordinate spaces (critical) These commands support two coordinate spaces: - **API space (default)**: coordinates refer to the **resized** screenshot (e.g. XGA 1024×768). This keeps vision tokens lower and matches “computer use” style interactions. - **Screen space**: coordinates refer to the **real** display resolution. By default, if scaling is enabled, inputs are treated as **API space** and are scaled up before calling `xdotool`. Use: - `--space api` (default) or `--space screen` - `cu-info` to see screen size, chosen API size, and scale factors. ## Standard interaction loop (recommended) Use this loop to keep actions grounded: 1. `cu-screenshot --base64` (observe) 2. Decide next action(s) 3. Run **one** action command (or a short chain) and **always** capture a new screenshot: - click/move/type/scroll/drag → screenshot 4. Repeat until done ## Commands All commands print a single JSON object to stdout. Most accept: - `--display :N` (defaults to `$DISPLAY` if set) - `--base64` to include `screenshot_base64_png` (can be large) - `--out PATH` for screenshot output (default: `$TWILL_SCREENSHOTS_DIR/...png`) - `--no-screenshot` to skip the post-action screenshot (use sparingly) ### `scripts/cu-info` Prints JSON describing the detected screen size, chosen API size, and scale factors. ### `scripts/cu-screenshot` Takes a screenshot (optionally scaled down to a target like XGA). Use `--base64` when your agent needs the image content inline. ### `scripts/cu-move --x INT --y INT` Moves the mouse pointer. ### `scripts/cu-click --x INT --y INT` Left-click at a coordinate. Use `--button right|middle|left` or `scripts/cu-right-click`. ### `scripts/cu-double-click`, `scripts/cu-right-click`, `scripts/cu-middle-click` Click variants. ### `scripts/cu-drag --x0 INT --y0 INT --x1 INT --y1 INT` Drag from start to end coordinate. ### `scripts/cu-type --text STRING` Types text using `xdotool type` with a small delay. ### `scripts/cu-key --keys STRING` Sends a key chord (e.g. `ctrl+l`, `Return`, `Alt+F4`) via `xdotool key --`. ### `scripts/cu-scroll --dir up|down|left|right --amount INT [--x INT --y INT]` Scrolls by repeated scroll wheel clicks (4/5/6/7). Optionally moves to a coordinate first. ### `scripts/cu-wait --seconds FLOAT` Sleeps then returns a screenshot. ### `scripts/cu-cursor` Returns current cursor position (in **API space** if scaling is enabled). ### `scripts/cu-zoom --x0 INT --y0 INT --x1 INT --y1 INT` Returns a cropped screenshot of the region (like a “zoom” tool). ## Dependencies Install these in your Ubuntu image: ```bash apt-get update && apt-get install -y --no-install-recommends \ xdotool \ scrot \ imagemagick \ x11-utils \ && rm -rf /var/lib/apt/lists/* ``` If you run a headless desktop in-container, you typically also need: ```bash apt-get update && apt-get install -y --no-install-recommends \ xvfb \ mutter \ x11vnc \ && rm -rf /var/lib/apt/lists/* ``` ## Notes and pitfalls - Prefer a maximum screenshot size like **XGA (1024×768)** to reduce image tokens; scaling is automatic when the screen is larger and has a similar aspect ratio. - If your window manager is missing, GUI apps may not render or focus correctly even though X11 is running. - If actions seem “offset”, confirm the agent is using **API space** coordinates from the most recent screenshot and that scaling is enabled.
Related in Web Dev
generating-lwc-components
IncludedLightning Web Components with PICKLES methodology and 165-point scoring. Use this skill when the user creates or edits LWC components, builds wire service patterns, or writes Jest tests for LWC. TRIGGER when: user creates/edits LWC components, touches lwc/**/*.js, .html, .css, .js-meta.xml files, or asks about wire service, SLDS, or Jest LWC tests. DO NOT TRIGGER when: Apex classes (use generating-apex), Aura components, or Visualforce.
tanstack-query
IncludedManage server state in React with TanStack Query v5. Set up queries with useQuery, mutations with useMutation, configure QueryClient caching strategies, implement optimistic updates, and handle infinite scroll with useInfiniteQuery. Use when: setting up data fetching in React projects, migrating from v4 to v5, or fixing object syntax required errors, query callbacks removed issues, cacheTime renamed to gcTime, isPending vs isLoading confusion, keepPreviousData removed problems.
document-processor-api
IncludedProcess documents with Nutrient DWS. Use when the user wants to generate PDFs from HTML or URLs, convert Office/images/PDFs, assemble or split packets, OCR scans, extract text/tables/key-value pairs, redact PII, watermark, sign, fill forms, optimize PDFs, or produce compliance outputs like PDF/A or PDF/UA. Triggers include convert to PDF, merge these PDFs, OCR this scan, extract tables, redact PII, sign this PDF, make this PDF/A, or linearize for web delivery.
nutrient-document-processing
IncludedProcess documents with Nutrient DWS. Use when the user wants to generate PDFs from HTML or URLs, convert Office/images/PDFs, assemble or split packets, OCR scans, extract text/tables/key-value pairs, redact PII, watermark, sign, fill forms, optimize PDFs, or produce compliance outputs like PDF/A or PDF/UA. Triggers include convert to PDF, merge these PDFs, OCR this scan, extract tables, redact PII, sign this PDF, make this PDF/A, or linearize for web delivery.
tanstack-query
IncludedManage server state in React with TanStack Query v5. Covers useMutationState, simplified optimistic updates, throwOnError, network mode (offline/PWA), and infiniteQueryOptions. Use when setting up data fetching, fixing v4→v5 migration errors (object syntax, gcTime, isPending, keepPreviousData), or debugging SSR/hydration issues with streaming server components.
accelint-nextjs-best-practices
IncludedNext.js performance optimization and best practices. Use when writing Next.js code (App Router or Pages Router); implementing Server Components, Server Actions, or API routes; optimizing RSC serialization, data fetching, or server-side rendering; reviewing Next.js code for performance issues; fixing authentication in Server Actions; or implementing Suspense boundaries, parallel data fetching, or request deduplication.