agent-browser

Included with Lifetime

$97 forever

Browser automation for AI agents via inference.sh. Navigate web pages, interact with elements using @e refs, take screenshots, record video. Capabilities: web scraping, form filling, clicking, typing, drag-drop, file upload, JavaScript execution. Use for: web automation, data extraction, testing, agent browsing, research. Triggers: browser, web automation, scrape, navigate, click, fill form, screenshot, browse web, playwright, headless browser, web agent, surf internet, record video

Image & Video

What this skill does


> **Install the belt CLI skill:** `npx skills add belt-sh/cli`

# Agentic Browser

Browser automation for AI agents via [inference.sh](https://inference.sh). Uses Playwright under the hood with a simple `@e` ref system for element interaction.

![Agentic Browser](https://cloud.inference.sh/app/files/u/4mg21r6ta37mpaz6ktzwtt8krr/01kgjw8atdxgkrsr8a2t5peq7b.jpeg)

## Quick Start

> Requires inference.sh CLI (`belt`). [Install instructions](https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md)

```bash
belt login

# Open a page and get interactive elements
belt app run agent-browser --function open --input '{"url": "https://example.com"}' --session new
```


## Core Workflow

Every browser automation follows this pattern:

1. **Open** - Navigate to URL, get `@e` refs for elements
2. **Interact** - Use refs to click, fill, drag, etc.
3. **Re-snapshot** - After navigation/changes, get fresh refs
4. **Close** - End session (returns video if recording)

```bash
# 1. Start session
RESULT=$(belt app run agent-browser --function open --session new --input '{
  "url": "https://example.com/login"
}')
SESSION_ID=$(echo $RESULT | jq -r '.session_id')
# Elements: @e1 [input] "Email", @e2 [input] "Password", @e3 [button] "Sign In"

# 2. Fill and submit
belt app run agent-browser --function interact --session $SESSION_ID --input '{
  "action": "fill", "ref": "@e1", "text": "[email protected]"
}'
belt app run agent-browser --function interact --session $SESSION_ID --input '{
  "action": "fill", "ref": "@e2", "text": "password123"
}'
belt app run agent-browser --function interact --session $SESSION_ID --input '{
  "action": "click", "ref": "@e3"
}'

# 3. Re-snapshot after navigation
belt app run agent-browser --function snapshot --session $SESSION_ID --input '{}'

# 4. Close when done
belt app run agent-browser --function close --session $SESSION_ID --input '{}'
```

## Functions

| Function | Description |
|----------|-------------|
| `open` | Navigate to URL, configure browser (viewport, proxy, video recording) |
| `snapshot` | Re-fetch page state with `@e` refs after DOM changes |
| `interact` | Perform actions using `@e` refs (click, fill, drag, upload, etc.) |
| `screenshot` | Take page screenshot (viewport or full page) |
| `execute` | Run JavaScript code on the page |
| `close` | Close session, returns video if recording was enabled |

## Interact Actions

| Action | Description | Required Fields |
|--------|-------------|-----------------|
| `click` | Click element | `ref` |
| `dblclick` | Double-click element | `ref` |
| `fill` | Clear and type text | `ref`, `text` |
| `type` | Type text (no clear) | `text` |
| `press` | Press key (Enter, Tab, etc.) | `text` |
| `select` | Select dropdown option | `ref`, `text` |
| `hover` | Hover over element | `ref` |
| `check` | Check checkbox | `ref` |
| `uncheck` | Uncheck checkbox | `ref` |
| `drag` | Drag and drop | `ref`, `target_ref` |
| `upload` | Upload file(s) | `ref`, `file_paths` |
| `scroll` | Scroll page | `direction` (up/down/left/right), `scroll_amount` |
| `back` | Go back in history | - |
| `wait` | Wait milliseconds | `wait_ms` |
| `goto` | Navigate to URL | `url` |

## Element Refs

Elements are returned with `@e` refs:

```
@e1 [a] "Home" href="/"
@e2 [input type="text"] placeholder="Search"
@e3 [button] "Submit"
@e4 [select] "Choose option"
@e5 [input type="checkbox"] name="agree"
```

**Important:** Refs are invalidated after navigation. Always re-snapshot after:
- Clicking links/buttons that navigate
- Form submissions
- Dynamic content loading

## Features

### Video Recording

Record browser sessions for debugging or documentation:

```bash
# Start with recording enabled (optionally show cursor indicator)
SESSION=$(belt app run agent-browser --function open --session new --input '{
  "url": "https://example.com",
  "record_video": true,
  "show_cursor": true
}' | jq -r '.session_id')

# ... perform actions ...

# Close to get the video file
belt app run agent-browser --function close --session $SESSION --input '{}'
# Returns: {"success": true, "video": <File>}
```

### Cursor Indicator

Show a visible cursor in screenshots and video (useful for demos):

```bash
belt app run agent-browser --function open --session new --input '{
  "url": "https://example.com",
  "show_cursor": true,
  "record_video": true
}'
```

The cursor appears as a red dot that follows mouse movements and shows click feedback.

### Proxy Support

Route traffic through a proxy server:

```bash
belt app run agent-browser --function open --session new --input '{
  "url": "https://example.com",
  "proxy_url": "http://proxy.example.com:8080",
  "proxy_username": "user",
  "proxy_password": "pass"
}'
```

### File Upload

Upload files to file inputs:

```bash
belt app run agent-browser --function interact --session $SESSION --input '{
  "action": "upload",
  "ref": "@e5",
  "file_paths": ["/path/to/file.pdf"]
}'
```

### Drag and Drop

Drag elements to targets:

```bash
belt app run agent-browser --function interact --session $SESSION --input '{
  "action": "drag",
  "ref": "@e1",
  "target_ref": "@e2"
}'
```

### JavaScript Execution

Run custom JavaScript:

```bash
belt app run agent-browser --function execute --session $SESSION --input '{
  "code": "document.querySelectorAll(\"h2\").length"
}'
# Returns: {"result": "5", "screenshot": <File>}
```

## Deep-Dive Documentation

| Reference | Description |
|-----------|-------------|
| [references/commands.md](references/commands.md) | Full function reference with all options |
| [references/snapshot-refs.md](references/snapshot-refs.md) | Ref lifecycle, invalidation rules, troubleshooting |
| [references/session-management.md](references/session-management.md) | Session persistence, parallel sessions |
| [references/authentication.md](references/authentication.md) | Login flows, OAuth, 2FA handling |
| [references/video-recording.md](references/video-recording.md) | Recording workflows for debugging |
| [references/proxy-support.md](references/proxy-support.md) | Proxy configuration, geo-testing |

## Ready-to-Use Templates

| Template | Description |
|----------|-------------|
| [templates/form-automation.sh](templates/form-automation.sh) | Form filling with validation |
| [templates/authenticated-session.sh](templates/authenticated-session.sh) | Login once, reuse session |
| [templates/capture-workflow.sh](templates/capture-workflow.sh) | Content extraction with screenshots |

## Examples

### Form Submission

```bash
SESSION=$(belt app run agent-browser --function open --session new --input '{
  "url": "https://example.com/contact"
}' | jq -r '.session_id')

# Get elements: @e1 [input] "Name", @e2 [input] "Email", @e3 [textarea], @e4 [button] "Send"

belt app run agent-browser --function interact --session $SESSION --input '{"action": "fill", "ref": "@e1", "text": "John Doe"}'
belt app run agent-browser --function interact --session $SESSION --input '{"action": "fill", "ref": "@e2", "text": "[email protected]"}'
belt app run agent-browser --function interact --session $SESSION --input '{"action": "fill", "ref": "@e3", "text": "Hello!"}'
belt app run agent-browser --function interact --session $SESSION --input '{"action": "click", "ref": "@e4"}'

belt app run agent-browser --function snapshot --session $SESSION --input '{}'
belt app run agent-browser --function close --session $SESSION --input '{}'
```

### Search and Extract

```bash
SESSION=$(belt app run agent-browser --function open --session new --input '{
  "url": "https://google.com"
}' | jq -r '.session_id')

belt app run agent-browser --function interact --session $SESSION --input '{"action": "fill", "ref": "@e1", "text": "weather today"}'
belt app run agent-browser --function interact --session $SESSION --input '{"action": "press", "text": "Enter"}'
belt app run agent-browser --function interact --session $SESSION --input '{"action": "wait", "wait_ms": 2000}'

belt app run agent-browser --fun

Files: 10

Size: 62.3 KB

Complexity: 61/100

Category: Image & Video

Source: https://github.com/inferen-sh/skills/tree/main/tools/utilities/agent-browser

Related in Image & Video

watch

Included

Watch a video (URL or local path). Downloads with yt-dlp, extracts auto-scaled frames with ffmpeg, pulls the transcript from captions (or Whisper API fallback), and hands the result to Claude so it can answer questions about what's in the video.

Image & Videoscriptsfeatured

physical-ai-defect-image-generation

Included

Use when the user wants to orchestrate defect image generation, run associated setup, or handle outputs on OSMO. The Day 0 path handles cold-start with USD-to-ROI, image-edit augmentation, and AnomalyGen to create initial PCBA datasets. The Day 1 path performs inference and labeling on real images. This skill helps with first-time asset setup, creation of finetuning checkpoints, and configuring deployment. Trigger keywords: defect image generation, dig workflow, dig pipeline, defect image detection workflow, aoi pipeline, aoi anomalygen, usd2roi anomalygen, day 0 pcba, day 1 pcba, day 1 real-photo alignment, day 1 manual roi, metal surface anomaly, glass defect, anomalygen finetune, setup_pcb, setup_metal, setup_glass, setup_pretrained, dig setup, dig datasets, dig pretrained checkpoint, dig image-edit endpoint.

Image & Videoscripts

accelint-react-best-practices

Included

React performance optimization and best practices. ALWAYS use this skill when working with any React code - writing components, hooks, JSX; refactoring; optimizing re-renders, memoization, state management; reviewing for performance; fixing hydration mismatches; debugging infinite re-renders, stale closures, input focus loss, animations restarting; preventing remounting; implementing transitions, lazy initialization, effect dependencies. Even simple React tasks benefit from these patterns. Covers React 19+ (useEffectEvent, Activity, ref props). Triggers - useEffect, useState, useMemo, useCallback, memo, inline components, nested components, components inside components, re-render, performance, hydration, SSR, Next.js, useDeferredValue, combined hooks.

Image & Videoscripts

elevenlabs-agents

Included

Build conversational AI voice agents with ElevenLabs Platform using React, JavaScript, React Native, or Swift SDKs. Configure agents, tools (client/server/MCP), RAG knowledge bases, multi-voice, and Scribe real-time STT. Use when: building voice chat interfaces, implementing AI phone agents with Twilio, configuring agent workflows or tools, adding RAG knowledge bases, testing with CLI "agents as code", or troubleshooting deprecated @11labs packages, Android audio cutoff, CSP violations, dynamic variables, or WebRTC config. Keywords: ElevenLabs Agents, ElevenLabs voice agents, AI voice agents, conversational AI, @elevenlabs/react, @elevenlabs/client, @elevenlabs/react-native, @elevenlabs/elevenlabs-js, @elevenlabs/agents-cli, elevenlabs SDK, voice AI, TTS, text-to-speech, ASR, speech recognition, turn-taking model, WebRTC voice, WebSocket voice, ElevenLabs conversation, agent system prompt, agent tools, agent knowledge base, RAG voice agents, multi-voice agents, pronunciation dictionary, voice speed control, elevenlabs scribe, @11labs deprecated, Android audio cutoff, CSP violation elevenlabs, dynamic variables elevenlabs, case-sensitive tool names, webhook authentication

Image & Videoscripts

humanizer

Included

Humanize AI-generated text by detecting and removing patterns typical of LLM output. Rewrites text to sound natural, specific, and human. Uses 28 pattern detectors, 560+ AI vocabulary terms across 3 tiers, and statistical analysis (burstiness, type-token ratio, readability) for comprehensive detection. Use when asked to humanize text, de-AI writing, make content sound more natural/human, review writing for AI patterns, score text for AI detection, or improve AI-generated drafts. Covers content, language, style, communication, and filler categories.

Image & Videoscripts

generating-mermaid-diagrams

Included

Salesforce architecture diagrams using Mermaid with ASCII fallback. Use this skill when generating text-based diagrams for Salesforce architecture, OAuth flows, ERDs, integration sequences, or Agentforce structure. TRIGGER when: user says "diagram", "visualize", "ERD", or asks for sequence diagrams, flowcharts, class diagrams, or architecture visualizations in Mermaid. DO NOT TRIGGER when: user wants PNG/SVG image output (use generating-visual-diagrams), or asks about non-Salesforce systems.

Image & Videoscripts