web-scraper

Scrape web pages and save as HTML or Markdown (with text and images). Minimal dependencies - only requests and beautifulsoup4. Use when the user provides a URL and wants to download/archive the content locally.

# Web Scraper

Fetch web page content (text + images) and save as HTML or Markdown locally.

**Minimal dependencies**: Only requires `requests` and `beautifulsoup4` - no browser automation.

**Default behavior**: Downloads images to local `images/` directory automatically.

## Quick start

### Single page

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
```

### Recursive (follow links)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
```

## Setup

Requires Python 3.8+ and minimal dependencies:

```bash
cd {baseDir}
pip install -r requirements.txt
```

Or install manually:

```bash
pip install requests beautifulsoup4
```

**Note**: No browser or driver needed - uses pure HTTP requests.

## Inputs to collect

### Single page mode

- **URL**: The web page to scrape (required)
- **Format**: `html` or `md` (default: `html`)
- **Output path**: Where to save the file (default: current directory with auto-generated name)
- **Images**: Downloads images by default (use `--no-download-images` to disable)

### Recursive mode (--recursive)

- **URL**: Starting point for recursive scraping
- **Format**: `html` or `md`
- **Output directory**: Where to save all scraped pages
- **Max depth**: How many levels deep to follow links (default: 2)
- **Max pages**: Maximum total pages to scrape (default: 50)
- **Domain filter**: Whether to stay within same domain (default: yes)
- **Images**: Downloads images by default

## Conversation Flow

1. Ask user for the URL to scrape
2. Ask preferred output format (HTML or Markdown)
   - Note: Both formats include text and images by default
   - HTML: Preserves original structure with downloaded images
   - Markdown: Clean text format with downloaded images in `images/` folder
3. For recursive mode: Ask max depth and max pages (optional, has sensible defaults)
4. Ask where to save (or suggest a default path like `/tmp/` or `~/Downloads/`)
5. Run the script and confirm success
6. Show the saved file/directory path
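
The flow above ends with a single command line. As a minimal sketch, the collected answers could be turned into an argument list like this. The function name `build_scrape_command` and the `base_dir` parameter are hypothetical; only the flags themselves come from this document.

```python
from typing import List, Optional


def build_scrape_command(
    base_dir: str,
    url: str,
    fmt: str = "html",
    output: Optional[str] = None,
    recursive: bool = False,
    max_depth: Optional[int] = None,
    max_pages: Optional[int] = None,
    download_images: bool = True,
) -> List[str]:
    """Assemble an argv list for scrape.py from the collected answers (hypothetical helper)."""
    if fmt not in ("html", "md"):
        raise ValueError(f"format must be 'html' or 'md', got {fmt!r}")

    cmd = [f"{base_dir}/scripts/scrape.py", "--url", url, "--format", fmt]
    if recursive:
        cmd.append("--recursive")
        # Pass limits only when the user chose them; otherwise the
        # script's documented defaults (depth 2, 50 pages) apply.
        if max_depth is not None:
            cmd += ["--max-depth", str(max_depth)]
        if max_pages is not None:
            cmd += ["--max-pages", str(max_pages)]
    if not download_images:
        cmd.append("--no-download-images")
    if output is not None:
        cmd += ["--output", output]
    return cmd
```

Returning a list rather than a shell string means the result can be handed to `subprocess.run` without quoting the URL or path by hand.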

## Examples

### Single Page Scraping

#### Save as HTML

```bash
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
```

#### Save as Markdown (with images, default)

```bash
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
```

**Result**: Creates `web-scraping.md` plus an `images/` folder containing the downloaded images.

#### Without downloading images (optional)

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
```

**Result**: Text only; image references keep their original remote URLs and nothing is downloaded.

#### Auto-generate filename

```bash
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
```
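
One way such a name could be derived is sketched below. This is not taken from `scrape.py`: the function name `auto_filename` is hypothetical, and the timestamp format (Unix seconds) is an assumption, since the document only shows `{timestamp}`.

```python
import re
import time
from typing import Optional
from urllib.parse import urlparse


def auto_filename(url: str, fmt: str, timestamp: Optional[int] = None) -> str:
    """Build a filesystem-safe name from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path  # query string and fragment are ignored
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    if not slug:
        slug = "page"
    if timestamp is None:
        timestamp = int(time.time())  # assumption: Unix seconds
    return f"{slug}-{timestamp}.{fmt}"


print(auto_filename("https://example.com", "html", timestamp=1700000000))
# prints: example-com-1700000000.html
```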

### Recursive Scraping

#### Basic recursive crawl (depth 2, same domain, with images)

```bash
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
```

**Output structure** (text + images for all pages):
```
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
```
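
The tree above mirrors the URL paths of the crawled pages. A minimal sketch of one possible mapping follows; the function name `url_to_relative_path` and the rule that a trailing slash maps to `index` are assumptions, not confirmed behavior of `scrape.py`.

```python
from urllib.parse import urlparse


def url_to_relative_path(url: str, fmt: str) -> str:
    """Map a page URL to a relative output path (hypothetical helper)."""
    path = urlparse(url).path
    # Drop empty segments and anything that could escape the output directory.
    parts = [p for p in path.split("/") if p and p not in (".", "..")]
    if not parts or path.endswith("/"):
        parts.append("index")  # assumption: directory-style URLs become index files
    name = parts[-1]
    # Strip a trailing .html/.htm so the output extension is not doubled.
    for ext in (".html", ".htm"):
        if name.lower().endswith(ext):
            name = name[: -len(ext)]
            break
    parts[-1] = f"{name}.{fmt}"
    return "/".join(parts)


print(url_to_relative_path("https://docs.example.com/api/authentication", "md"))
# prints: api/authentication.md
```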

#### Deep crawl with custom limits

```bash
{baseDir}/scripts/scrape.py \
  --url "https://blog.example.com" \
  --format html \
  --recursive \
  --max-depth 3 \
  --max-pages 100 \
  --output ~/Archives/blog-backup
```

#### Ignore robots.txt (use with caution)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://example.com" \
  --format md \
  --recursive \
  --no-respect-robots \
  --rate-limit 1.0
```

#### Faster scraping (shorter delay between requests)

```bash
{baseDir}/scripts/scrape.py \
  --url "https://yoursite.com" \
  --format md \
  --recursive \
  --rate-limit 0.2
```
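
The `--rate-limit` value is a delay in seconds between requests, so a smaller number means a faster crawl. The idea can be sketched as follows; the class name `RateLimiter` is hypothetical and this is not the script's actual implementation.

```python
import time


class RateLimiter:
    """Enforce a minimum gap between consecutive requests (illustrative only)."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay = delay_seconds
        self._last = None  # monotonic time of the previous request, if any

    def wait(self) -> float:
        """Sleep until `delay` seconds have passed since the last call.

        Returns the number of seconds spent sleeping.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


limiter = RateLimiter(0.2)
for page in ("page-1", "page-2", "page-3"):
    limiter.wait()  # the first call returns immediately
    print("fetching", page)
```

Measuring from the previous request, rather than always sleeping the full delay, means time spent downloading a slow page counts toward the gap.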

## Features

### Single Page Mode

- **HTML output**: Preserves original page structure
  - ✅ Clean, readable HTML document
  - ✅ All images downloaded to `images/` folder
  - ✅ Suitable for offline viewing
- **Markdown output**: Extracts clean text content
  - ✅ **Auto-downloads images** to local `images/` directory (default)
  - ✅ Converts image URLs to relative paths
  - ✅ Clean, readable format for archiving
  - ✅ Fallback to original URLs if download fails
  - Use `--no-download-images` flag to keep original URLs only
- **Simple and fast**: Pure HTTP requests, no browser needed
- **Auto filename**: Generates safe filename from URL if not specified

### Recursive Mode (`--recursive`)

- **✅ Link discovery**: Follows links found on crawled pages, subject to the depth, page, and domain limits below
- **✅ Depth control**: `--max-depth` limits how many levels deep to crawl (default: 2)
- **✅ Page limit**: `--max-pages` caps total pages to prevent runaway crawls (default: 50)
- **✅ Domain filtering**: `--same-domain` keeps crawl within starting domain (default: on)
- **✅ robots.txt compliance**: Respects site's crawling rules by default
- **✅ Rate limiting**: `--rate-limit` adds delay between requests (default: 0.5s)
- **✅ Smart URL filtering**: Skips images, scripts, CSS, and duplicate URLs
- **✅ Progress tracking**: Real-time console output with success/fail/skip counts
- **✅ Organized output**: Preserves URL structure in directory hierarchy
- **✅ Efficient crawling**: Sequential with rate limiting to respect servers
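
How the depth, page, domain, and URL filters interact can be sketched as a breadth-first crawl. This is an illustration under assumptions, not the script's code: `crawl`, `should_visit`, and the injected `get_links` callable are hypothetical, the skipped-extension list is an example, and depth is counted from 0 at the start URL.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urlparse

# Example list only; the script's actual skip list is not documented here.
SKIPPED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")


def should_visit(url: str, start_host: str, same_domain: bool) -> bool:
    """Apply the URL filters: scheme, asset extensions, and domain."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.path.lower().endswith(SKIPPED_EXTENSIONS):
        return False
    return not same_domain or parsed.netloc == start_host


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 50,
    same_domain: bool = True,
) -> List[str]:
    """Breadth-first crawl; `get_links` returns absolute URLs found on a page."""
    start_host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is saved, but its links are not followed
        for link in get_links(url):
            link = urldefrag(link).url  # "#section" variants count as duplicates
            if link not in seen and should_visit(link, start_host, same_domain):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny in-memory "site" so the sketch runs without network access.
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://docs.example.com/logo.png",
        "https://other.example.org/x",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/b", "https://docs.example.com/#top"],
    "https://docs.example.com/b": ["https://docs.example.com/c"],
}
print(crawl("https://docs.example.com/", lambda u: SITE.get(u, [])))
# prints: ['https://docs.example.com/', 'https://docs.example.com/a', 'https://docs.example.com/b']
```

Breadth-first order means that when `max_pages` cuts the crawl short, the pages closest to the start URL are the ones that get saved.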

## Guardrails

### Single Page Mode

- Respect robots.txt and site terms of service
- Some sites may block automated access; this tool uses standard HTTP requests
- Large pages with many images may take time to download

### Recursive Mode

- **Start small**: Test with `--max-depth 1 --max-pages 10` first
- **Respect robots.txt**: Default is on; only use `--no-respect-robots` for your own sites
- **Rate limiting**: Default 0.5s is polite; don't go below 0.2s for public sites
- **Same domain**: Strongly recommended to keep `--same-domain` enabled
- **Monitor progress**: Watch for high fail rates (may indicate blocking)
- **Storage**: Recursive crawls can generate many files; ensure sufficient disk space
- **Legal**: Ensure you have permission to crawl and archive the target site
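
For readers unsure what respecting robots.txt involves, the check can be sketched with Python's standard `urllib.robotparser`. The helper `make_robots_checker` is hypothetical, and the rules are passed in as a string so the example runs offline; a real crawler would first fetch `/robots.txt` from the target site.

```python
from typing import Callable
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt: str, user_agent: str = "*") -> Callable[[str], bool]:
    """Return a function that reports whether a URL may be fetched (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robots_checker(SAMPLE_ROBOTS)
print(allowed("https://example.com/docs"))          # prints: True
print(allowed("https://example.com/private/data"))  # prints: False
```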

## Troubleshooting

- **Connection errors**: Check your internet connection and URL validity
- **403/blocked**: The tool sends a realistic User-Agent header, but some sites still block automated clients
- **Timeout**: Increase `--timeout` flag for slow-loading pages (value in seconds)
- **Image download fails**: Images will fall back to original URLs
- **Missing images**: Some sites use JavaScript to load images dynamically (not supported)

Files: 3

Size: 30.3 KB

Complexity: 56/100

Category: Web Dev

Source: https://github.com/agentbay-ai/agentbay-skills/tree/main/web-scraper
