Claude
Skills
Sign in
Back

intelligent-web-scraper

Included with Lifetime
$97 forever

Self-learning intelligent web scraper agent - automatically analyzes page structure, handles pagination, anti-blocking, and discovers article series. No user configuration needed - AI decides everything.

AI Agentsscripts

What this skill does


# Intelligent Web Scraper Agent

A self-aware, self-learning web scraping agent that accumulates experience over time.

## Prerequisites

> **IMPORTANT**: Before using this skill, ensure these dependencies are installed.

### One-Click Setup (Recommended)

```bash
# Navigate to skill directory and run setup
cd ~/.claude/skills/intelligent-web-scraper
chmod +x setup.sh && ./setup.sh

# Or for VPS/headless environment:
./setup.sh --headless
```

The setup script will:
- Detect your environment (macOS/Linux, GUI/headless)
- Install all Python dependencies
- Install Playwright browsers
- Set up Crawl4AI
- Create necessary directories

### Check Dependencies

```bash
# Check if everything is installed correctly
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py

# Auto-install missing dependencies
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py --install

# Output as JSON (for programmatic use)
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py --json
```

### Manual Setup (Alternative)

If you prefer manual installation:

```bash
# 1. Install Python dependencies
pip install -r ~/.claude/skills/intelligent-web-scraper/requirements.txt

# 2. Install Playwright browser
python -m playwright install chromium

# 3. Set up Crawl4AI
crawl4ai-setup

# 4. For VPS/headless Linux, also run:
python -m playwright install-deps chromium
```

### Required Tools

| Tool | Purpose | Installation |
|------|---------|--------------|
| **Playwright MCP** | Browser automation | Must be configured in Claude Code MCP settings |
| **Python 3.9+** | Script execution | `brew install python` or system package manager |
| **Crawl4AI** | Advanced crawling | Included in setup.sh |

### Verify Installation

```bash
# Quick verification
python -c "from crawl4ai import AsyncWebCrawler; print('Crawl4AI OK')"
python -c "from playwright.sync_api import sync_playwright; print('Playwright OK')"

# Full check
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py
```

### VPS/Headless Deployment

For servers without GUI (VPS, Docker, CI):

```bash
# 1. Run setup in headless mode
./setup.sh --headless

# 2. The scraper will automatically use headless browser
# No GUI required - all scraping works via headless Chromium

# 3. Check configuration
python3 ~/.claude/skills/intelligent-web-scraper/scripts/config.py
```

**Note**: `local_browser_scraper.py` (CDP mode) requires a GUI and is not available in headless environments. Use `crawl4ai_wrapper.py` instead.

---

## Local Browser Scraping (CDP)

**Use this when the user wants to scrape using their local browser (Comet, Chrome, etc.)**

This allows scraping while preserving user's login sessions and cookies!

### Quick Start

```bash
# 1. Install websockets (one time)
pip install websockets

# 2. Launch Comet/Chrome with debugging (or use the script)
python ~/.claude/skills/intelligent-web-scraper/scripts/local_browser_scraper.py \
    --launch comet \
    --url "https://example.com"

# 3. Scrape data from current page
python ~/.claude/skills/intelligent-web-scraper/scripts/local_browser_scraper.py \
    --extract articles \
    --output data.json
```

### Using local_browser_scraper.py

**Launch browser with debugging:**
```bash
python scripts/local_browser_scraper.py --launch comet --url "https://douban.com"
python scripts/local_browser_scraper.py --launch chrome --url "https://example.com"
```

**Scrape from current tab:**
```bash
# Get page text
python scripts/local_browser_scraper.py --extract text

# Get all links
python scripts/local_browser_scraper.py --extract links

# Get articles
python scripts/local_browser_scraper.py --extract articles

# Custom JavaScript
python scripts/local_browser_scraper.py --extract "document.querySelectorAll('h1').length"
```

**Scrape specific tab (by URL pattern):**
```bash
python scripts/local_browser_scraper.py --url-pattern "douban.com" --extract articles
```

**Built-in extractors:** `text`, `html`, `title`, `links`, `images`, `tables`, `articles`, `metadata`

### Manual CDP (Alternative)

**CRITICAL: Always use user's EXISTING profile to preserve login sessions!**

```bash
# Comet - use existing profile
/Applications/Comet.app/Contents/MacOS/Comet \
    --remote-debugging-port=9222 \
    --user-data-dir="$HOME/Library/Application Support/Comet" \
    "https://example.com" &

# Chrome - use existing profile
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
    --remote-debugging-port=9222 \
    --user-data-dir="$HOME/Library/Application Support/Google/Chrome" \
    "https://example.com" &
```

### Browser Profile Locations (macOS)

| Browser | User Data Directory |
|---------|---------------------|
| Comet | `~/Library/Application Support/Comet` |
| Chrome | `~/Library/Application Support/Google/Chrome` |
| Edge | `~/Library/Application Support/Microsoft Edge` |
| Brave | `~/Library/Application Support/BraveSoftware/Brave-Browser` |

**NEVER use temp directory** like `/tmp/browser-debug` - this creates empty profile without logins!

**Why use existing profile:**
- Preserves all login sessions (no re-authentication needed)
- Browser extensions work normally
- User's bookmarks, history available

### When to Use Local Browser vs Playwright

| Use Local Browser | Use Playwright |
|-------------------|----------------|
| User requests specific browser | Automated bulk scraping |
| Need user's login session | Don't need authentication |
| User wants to see the page | Headless scraping OK |
| Site blocks headless browsers | Standard sites |

---

## Core Capabilities

### 1. Intelligent Page Analysis
- Auto-detect page type (product list, article, series, etc.)
- Identify data containers and selectors
- Recognize pagination mechanisms

### 2. Smart Pagination & Scroll Loading
- Page numbers, Next button, Infinite scroll, Load more, API pagination
- Auto-detect and handle appropriately

**CRITICAL: Scroll Loading Requirements**
- **MUST** automatically scroll down the page until all content is loaded
- Many modern websites use lazy loading/infinite scroll, initially showing only partial content
- Scroll strategy:
  1. Record current page height
  2. Scroll to bottom
  3. Wait 1-2 seconds for new content to load
  4. Check if page height increased
  5. Repeat until height stops changing (3 consecutive times)
- Use `browser_evaluate` to execute scrolling:
  ```javascript
  window.scrollTo(0, document.body.scrollHeight)
  ```

### 3. Detail Link Following

**CRITICAL: Detail Page Scraping Requirements**
- List pages may only show title/summary, full content is on detail pages
- **MUST** detect and follow detail links to get complete data
- Patterns to identify detail links:
  - "View more", "Read more", "Details", "Full article"
  - Title itself is a link
  - Arrow links like "→"
- Scraping workflow:
  1. First scrape all entries from list page
  2. Identify detail link for each entry
  3. Visit detail pages one by one to extract full content
  4. Merge data and save
- Note: Detail page scraping needs appropriate delays (2-5s) to avoid blocking

### 4. Anti-Blocking
- Adaptive delays (2-60s based on signals)
- User-Agent rotation
- Block detection and recovery

### 5. Series Discovery
- Find prev/next links
- Detect table of contents
- URL pattern analysis
- Build complete series from single article

### 6. **Self-Learning** (NEW)
- Records successful patterns for each domain
- Accumulates lessons learned
- Reuses knowledge on similar sites

---

## Self-Learning System

### How It Works

```
┌─────────────────────────────────────────────────────────────┐
│                    SCRAPE REQUEST                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Check Experience Database                            │
│ - Read experiences/site_patterns.json          

Related in AI Agents