intelligent-web-scraper

Included with Lifetime

$97 forever

Self-learning intelligent web scraper agent - automatically analyzes page structure, handles pagination, anti-blocking, and discovers article series. No user configuration needed - AI decides everything.

AI Agentsscripts

What this skill does


# Intelligent Web Scraper Agent

A self-aware, self-learning web scraping agent that accumulates experience over time.

## Prerequisites

> **IMPORTANT**: Before using this skill, ensure these dependencies are installed.

### One-Click Setup (Recommended)

```bash
# Navigate to skill directory and run setup
cd ~/.claude/skills/intelligent-web-scraper
chmod +x setup.sh && ./setup.sh

# Or for VPS/headless environment:
./setup.sh --headless
```

The setup script will:
- Detect your environment (macOS/Linux, GUI/headless)
- Install all Python dependencies
- Install Playwright browsers
- Set up Crawl4AI
- Create necessary directories

### Check Dependencies

```bash
# Check if everything is installed correctly
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py

# Auto-install missing dependencies
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py --install

# Output as JSON (for programmatic use)
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py --json
```

### Manual Setup (Alternative)

If you prefer manual installation:

```bash
# 1. Install Python dependencies
pip install -r ~/.claude/skills/intelligent-web-scraper/requirements.txt

# 2. Install Playwright browser
python -m playwright install chromium

# 3. Set up Crawl4AI
crawl4ai-setup

# 4. For VPS/headless Linux, also run:
python -m playwright install-deps chromium
```

### Required Tools

| Tool | Purpose | Installation |
|------|---------|--------------|
| **Playwright MCP** | Browser automation | Must be configured in Claude Code MCP settings |
| **Python 3.9+** | Script execution | `brew install python` or system package manager |
| **Crawl4AI** | Advanced crawling | Included in setup.sh |

### Verify Installation

```bash
# Quick verification
python -c "from crawl4ai import AsyncWebCrawler; print('Crawl4AI OK')"
python -c "from playwright.sync_api import sync_playwright; print('Playwright OK')"

# Full check
python3 ~/.claude/skills/intelligent-web-scraper/scripts/check_deps.py
```

### VPS/Headless Deployment

For servers without GUI (VPS, Docker, CI):

```bash
# 1. Run setup in headless mode
./setup.sh --headless

# 2. The scraper will automatically use headless browser
# No GUI required - all scraping works via headless Chromium

# 3. Check configuration
python3 ~/.claude/skills/intelligent-web-scraper/scripts/config.py
```

**Note**: `local_browser_scraper.py` (CDP mode) requires a GUI and is not available in headless environments. Use `crawl4ai_wrapper.py` instead.

---

## Local Browser Scraping (CDP)

**Use this when the user wants to scrape using their local browser (Comet, Chrome, etc.)**

This allows scraping while preserving user's login sessions and cookies!

### Quick Start

```bash
# 1. Install websockets (one time)
pip install websockets

# 2. Launch Comet/Chrome with debugging (or use the script)
python ~/.claude/skills/intelligent-web-scraper/scripts/local_browser_scraper.py \
    --launch comet \
    --url "https://example.com"

# 3. Scrape data from current page
python ~/.claude/skills/intelligent-web-scraper/scripts/local_browser_scraper.py \
    --extract articles \
    --output data.json
```

### Using local_browser_scraper.py

**Launch browser with debugging:**
```bash
python scripts/local_browser_scraper.py --launch comet --url "https://douban.com"
python scripts/local_browser_scraper.py --launch chrome --url "https://example.com"
```

**Scrape from current tab:**
```bash
# Get page text
python scripts/local_browser_scraper.py --extract text

# Get all links
python scripts/local_browser_scraper.py --extract links

# Get articles
python scripts/local_browser_scraper.py --extract articles

# Custom JavaScript
python scripts/local_browser_scraper.py --extract "document.querySelectorAll('h1').length"
```

**Scrape specific tab (by URL pattern):**
```bash
python scripts/local_browser_scraper.py --url-pattern "douban.com" --extract articles
```

**Built-in extractors:** `text`, `html`, `title`, `links`, `images`, `tables`, `articles`, `metadata`

### Manual CDP (Alternative)

**CRITICAL: Always use user's EXISTING profile to preserve login sessions!**

```bash
# Comet - use existing profile
/Applications/Comet.app/Contents/MacOS/Comet \
    --remote-debugging-port=9222 \
    --user-data-dir="$HOME/Library/Application Support/Comet" \
    "https://example.com" &

# Chrome - use existing profile
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
    --remote-debugging-port=9222 \
    --user-data-dir="$HOME/Library/Application Support/Google/Chrome" \
    "https://example.com" &
```

### Browser Profile Locations (macOS)

| Browser | User Data Directory |
|---------|---------------------|
| Comet | `~/Library/Application Support/Comet` |
| Chrome | `~/Library/Application Support/Google/Chrome` |
| Edge | `~/Library/Application Support/Microsoft Edge` |
| Brave | `~/Library/Application Support/BraveSoftware/Brave-Browser` |

**NEVER use temp directory** like `/tmp/browser-debug` - this creates empty profile without logins!

**Why use existing profile:**
- Preserves all login sessions (no re-authentication needed)
- Browser extensions work normally
- User's bookmarks, history available

### When to Use Local Browser vs Playwright

| Use Local Browser | Use Playwright |
|-------------------|----------------|
| User requests specific browser | Automated bulk scraping |
| Need user's login session | Don't need authentication |
| User wants to see the page | Headless scraping OK |
| Site blocks headless browsers | Standard sites |

---

## Core Capabilities

### 1. Intelligent Page Analysis
- Auto-detect page type (product list, article, series, etc.)
- Identify data containers and selectors
- Recognize pagination mechanisms

### 2. Smart Pagination & Scroll Loading
- Page numbers, Next button, Infinite scroll, Load more, API pagination
- Auto-detect and handle appropriately

**CRITICAL: Scroll Loading Requirements**
- **MUST** automatically scroll down the page until all content is loaded
- Many modern websites use lazy loading/infinite scroll, initially showing only partial content
- Scroll strategy:
  1. Record current page height
  2. Scroll to bottom
  3. Wait 1-2 seconds for new content to load
  4. Check if page height increased
  5. Repeat until height stops changing (3 consecutive times)
- Use `browser_evaluate` to execute scrolling:
  ```javascript
  window.scrollTo(0, document.body.scrollHeight)
  ```

### 3. Detail Link Following

**CRITICAL: Detail Page Scraping Requirements**
- List pages may only show title/summary, full content is on detail pages
- **MUST** detect and follow detail links to get complete data
- Patterns to identify detail links:
  - "View more", "Read more", "Details", "Full article"
  - Title itself is a link
  - Arrow links like "→"
- Scraping workflow:
  1. First scrape all entries from list page
  2. Identify detail link for each entry
  3. Visit detail pages one by one to extract full content
  4. Merge data and save
- Note: Detail page scraping needs appropriate delays (2-5s) to avoid blocking

### 4. Anti-Blocking
- Adaptive delays (2-60s based on signals)
- User-Agent rotation
- Block detection and recovery

### 5. Series Discovery
- Find prev/next links
- Detect table of contents
- URL pattern analysis
- Build complete series from single article

### 6. **Self-Learning** (NEW)
- Records successful patterns for each domain
- Accumulates lessons learned
- Reuses knowledge on similar sites

---

## Self-Learning System

### How It Works

```
┌─────────────────────────────────────────────────────────────┐
│                    SCRAPE REQUEST                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Check Experience Database                            │
│ - Read experiences/site_patterns.json

Files: 17

Size: 209.6 KB

Complexity: 81/100

Category: AI Agents

Source: https://github.com/yrzhe/claude-skills/tree/main/plugins/intelligent-web-scraper/skills/intelligent-web-scraper

Related in AI Agents

skill-development

Included

Comprehensive meta-skill for creating, managing, validating, auditing, and distributing Claude Code skills and slash commands (unified in v2.1.3+). Provides skill templates, creation workflows, validation patterns, audit checklists, naming conventions, YAML frontmatter guidance, progressive disclosure examples, and best practices lookup. Use when creating new skills, validating existing skills, auditing skill quality, understanding skill architecture, needing skill templates, learning about YAML frontmatter requirements, progressive disclosure patterns, tool restrictions (allowed-tools), skill composition, skill naming conventions, troubleshooting skill activation issues, creating custom slash commands, configuring command frontmatter, using command arguments ($ARGUMENTS, $1, $2), bash execution in commands, file references in commands, command namespacing, plugin commands, MCP slash commands, Skill tool configuration, or deciding between skills vs slash commands. Delegates to docs-management skill for official documentation.

AI Agentsscripts

reprompter

Included

Transform messy prompts into well-structured, effective prompts — single or multi-agent. Use when: "reprompt", "reprompt this", "clean up this prompt", "structure my prompt", rough text needing XML tags and best practices, "reprompter teams", "repromptception", "run with quality", "smart run", "smart agents", multi-agent tasks, audits, parallel work, anything going to agent teams. Don't use when: simple Q&A, pure chat, immediate execution-only tasks. See "Don't Use When" section for details. Outputs: Structured XML/Markdown prompt, quality score (before/after), optional team brief + per-agent sub-prompts, agent team output files. Success criteria: Single mode quality score ≥ 7/10; Repromptception per-agent prompt quality score 8+/10; all required sections present, actionable and specific.

AI Agentsscripts

adaptive-compaction

Included

Adaptive add-on policy and recovery layer that decides WHEN to compact, prune, snapshot, or fork -- replacing fixed-percent auto-compaction across Claude Code, Codex, and MCP-capable hosts. Trigger on auto-compact timing or damage: "when should I compact", "is it safe to compact now or start a fresh session", "auto-compact fires too early/mid-task", "switching to an unrelated task but the window still has space", "context rot", "answers get worse the longer the session runs", "the agent forgot the plan or my decisions after it summarized", "add a layer on top that manages context without changing the agent", raising autoCompactWindow to give the policy room, or installing/tuning a cross-tool compaction policy or PreCompact hook -- even when "compaction" is never said but the problem is context-window pressure or post-summarization memory loss. Do NOT use to summarize a conversation, build RAG, write a summarization prompt (decides WHEN not HOW), or answer max-context-length trivia.

AI Agentsscripts

agent-skill-creator

Included

Create cross-platform agent skills from workflow descriptions. Activates when users ask to create an agent, automate a repetitive workflow, create a custom skill, or need advanced agent creation. Triggers on phrases like create agent for, automate workflow, create skill for, every day I have to, daily I need to, turn process into agent, need to automate, create a cross-platform skill, validate this skill, export this skill, migrate this skill. Supports single skills, multi-agent suites, transcript processing, template-based creation, interactive configuration, cross-platform export, and spec validation.

AI Agentsscripts

llm-wiki

Included

Use when building or maintaining a persistent personal knowledge base (second brain) in Obsidian where an LLM incrementally ingests sources, updates entity/concept pages, maintains cross-references, and keeps a synthesis current. Triggers include "second brain", "Obsidian wiki", "personal knowledge management", "ingest this paper/article/book", "build a research wiki", "compound knowledge", "Memex", or whenever the user wants knowledge to accumulate across sessions instead of being re-derived by RAG on every query.

AI Agentsscripts

skill-master

Included

Agent Skills authoring, evaluation, and optimization. Create, edit, validate, benchmark, and improve skills following the agentskills.io specification. Use when designing SKILL.md files, structuring skill folders (references, scripts, assets), ingesting external documentation into skills, running trigger evals, benchmarking skill quality, optimizing descriptions, or performing blind A/B comparisons. Keywords: agentskills.io, SKILL.md, skill authoring, eval, benchmark, trigger optimization.

AI Agentsscripts