web-scraper
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.
# Web Scraper
## Overview
Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.
## When to Use This Skill
- When the user mentions "scraper" or related topics
- When the user mentions "scraping" or related topics
- When the user mentions "extrair dados web" (Portuguese for "extract web data") or related topics
- When the user mentions "web scraping" or related topics
- When the user mentions "raspar dados" (Portuguese for "scrape data") or related topics
- When the user mentions "coletar dados site" (Portuguese for "collect site data") or related topics
## Do Not Use This Skill When
- The task is unrelated to web scraping
- A simpler, more specific tool can handle the request
- The user needs general-purpose assistance without domain expertise
## How It Works
Execute phases in strict order. Each phase feeds the next.
```
1. CLARIFY -> 2. RECON -> 3. STRATEGY -> 4. EXTRACT -> 5. TRANSFORM -> 6. VALIDATE -> 7. FORMAT
```
Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.
**Fast path**: If the user provides a URL and a clear data target, and the request is simple
(single page, one data type), compress Phases 1-3 into a single action:
fetch, classify, and extract in one WebFetch call. Still validate and format.
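The strict phase ordering can be sketched as a small runner. This is a minimal illustration, not the skill's actual implementation: the `run_pipeline` function, the `handlers` mapping, and the shape of the shared state dictionary are all hypothetical names introduced here.

```python
from typing import Any, Callable

# Phase names in the order listed above.
PHASES = ["clarify", "recon", "strategy", "extract", "transform", "validate", "format"]

# Hypothetical handler type: takes the accumulated state, returns new keys to merge in.
Phase = Callable[[dict[str, Any]], dict[str, Any]]


def run_pipeline(handlers: dict[str, Phase], request: dict[str, Any]) -> dict[str, Any]:
    """Run every phase in order, feeding each phase's output into the next."""
    missing = [name for name in PHASES if name not in handlers]
    if missing:
        # Refuse to run with a skipped phase, mirroring the "never skip" rule.
        raise ValueError(f"missing phase handlers: {missing}")

    state = dict(request)
    completed: list[str] = []
    for name in PHASES:
        # Merge the phase's result into the state seen by later phases.
        state = {**state, **handlers[name](state)}
        completed.append(name)
    state["completed"] = completed
    return state
```

Under this sketch, the fast path would correspond to the first three handlers sharing one underlying fetch, while validation and formatting still run as separate steps.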
---
## Capabilities
- **Multi-strategy**: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
- **Extraction modes**: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
- **Output formats**: Markdown tables (default), JSON, CSV
- **Pagination**: auto-detect and follow (page numbers, infinite scroll, load-more)
- **Multi-URL**: extract same structure across sources with comparison and diff
- **Validation**: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
- **Auto-escalation**: if WebFetch returns empty or incomplete content without raising an error, fall back to Browser automation automatically
- **Data transforms**: cleaning, normalization, deduplication, enrichment
- **Differential mode**: detect changes between scraping runs
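As an illustration of differential mode, the sketch below compares two runs by matching records on a key field. The source does not say how changes are detected, so the `diff_runs` function and the assumption that each record carries a stable unique key (such as a product URL or SKU) are hypothetical.

```python
from typing import Any

Record = dict[str, Any]


def diff_runs(previous: list[Record], current: list[Record], key: str) -> dict[str, list[Record]]:
    """Compare two scraping runs, matching records on a key field (assumed unique)."""
    prev_by_key = {record[key]: record for record in previous}
    curr_by_key = {record[key]: record for record in current}

    # Records whose key appears only in the current run.
    added = [r for k, r in curr_by_key.items() if k not in prev_by_key]
    # Records whose key appears only in the previous run.
    removed = [r for k, r in prev_by_key.items() if k not in curr_by_key]
    # Records present in both runs whose fields differ (current version is reported).
    changed = [r for k, r in curr_by_key.items() if k in prev_by_key and prev_by_key[k] != r]

    return {"added": added, "removed": removed, "changed": changed}
```

For example, calling `diff_runs(yesterday, today, key="sku")` on two price lists would report new products, delisted products, and products whose price or other fields changed.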
## Web Scraper
Multi-strategy web data extraction with intelligent approach selection,
automatic fallback escalation, data transformation, and structured output.
## Phase 1: Clarify
Establish extraction parameters before touching any URL.
## Required Parameters
| Parameter | Resolve | Default |
|:--------------|:-------------------------------------|:---------------|
| Target URL(s) | Which page(s) to scrape? | *(required)* |
| Data Target | What specific data to extract? | *(required)* |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL?| Single page |
## Optional Parameters
| Parameter | Resolve | Default |
|:--------------|:---------------------------------------|:-------------|
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |
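The required and optional parameters above, with their defaults, could be captured in a single structure. This is a sketch only: the `ExtractionRequest` class, its field names, and the string values used for formats and scopes are hypothetical and chosen here for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

OUTPUT_FORMATS = {"markdown", "json", "csv", "text"}
SCOPES = {"single_page", "paginated", "multi_url"}


@dataclass
class ExtractionRequest:
    # Required parameters (no defaults).
    urls: list[str]
    data_target: str
    # Parameters with the defaults from the tables above.
    output_format: str = "markdown"      # Markdown table
    scope: str = "single_page"           # Single page
    follow_pagination: bool = False      # No, 1 page
    max_pages: int = 1
    max_items: Optional[int] = None      # None means unlimited
    filters: list[str] = field(default_factory=list)
    sort_order: str = "source"           # Source order
    save_path: Optional[str] = None      # None means display only
    language: Optional[str] = None       # None means the user's language
    diff_mode: bool = False

    def __post_init__(self) -> None:
        # Reject requests that are missing a required parameter.
        if not self.urls:
            raise ValueError("at least one target URL is required")
        if not self.data_target.strip():
            raise ValueError("a data target is required")
        if self.output_format not in OUTPUT_FORMATS:
            raise ValueError(f"unknown output format: {self.output_format}")
        if self.scope not in SCOPES:
            raise ValueError(f"unknown scope: {self.scope}")
```

A request such as `ExtractionRequest(urls=["https://example.com/pricing"], data_target="plan prices")` then needs no further questions, which matches the rule of proceeding directly when a URL and data target are given.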
## Clarification Rules
- If user provides a URL and clear data target, proceed directly to Phase 2.
Do NOT ask unnecessary questions.
- If request is ambiguous (e.g. "scrape this site"), ask ONLY:
"What specific data do you want me to extract from this page?"
- Default to Markdown table output. Mention alternatives only if relevant.
- Accept requests in any language. Always respond in the user's language.
- If the user says "everything" or "all data", run reconnaissance (Phase 2) first, then present
the available fields and let the user choose.
## Discovery Mode
When user has a topic but no specific URL:
1. Use WebSearch to find the most relevant pages
2. Present top 3-5 URLs with descriptions
3. Let user choose which to scrape, or scrape all
4. Proceed to Phase 2 with selected URL(s)
Example: "find and extract pricing data for CRM tools"
-> WebSearch("CRM tools pricing comparison 2026")
-> Present top results -> User selects -> Extract
---
## Phase 2: Reconnaissance
Analyze the target page before extraction.
## Step 2.1: Initial Fetch
Use WebFetch to retrieve and analyze the page structure:
```
WebFetch(
url = TARGET_URL,
prompt = "Analyze this page structure and report:
1. Page type: article, product listing, search results, data table,
directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
accordion/collapsible sections, tabs
3. Approximate number of distinct data items visible
4. JavaScript rendering indicators: empty containers, loading spinners,
SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
5. Pagination: next/prev links, page numbers, load-more buttons,
infinite scroll indicators, total results count
6. Data density: how much structured, extractable data exists
7. List the main data fields/columns available for extraction
8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
9. Available download links: CSV, Excel, PDF, API endpoints"
)
```
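Item 8 of the prompt above asks about embedded structured data. When raw HTML is available (for example from a curl fetch), JSON-LD blocks can also be pulled out directly. The sketch below uses only Python's standard library; the `JsonLdParser` class and `extract_json_ld` function are hypothetical helpers, not part of WebFetch.

```python
import json
from html.parser import HTMLParser
from typing import Any


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list[Any] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                # Skip malformed blocks rather than failing the whole page.
                pass


def extract_json_ld(html: str) -> list[Any]:
    """Return every valid JSON-LD block found in the HTML, in document order."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

When a page exposes product or event data this way, the structured block is often easier to extract reliably than the visible markup.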
## Step 2.2: Evaluate Fetch Quality
| Signal | Interpretation | Action |
|:--------------------------------------------|:----------------------------------|:--------------------------|
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |
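The decision table above can be expressed as a small function. This is a sketch under stated assumptions: the `FetchSignals` fields are hypothetical names for the signals in the table, and the order in which signals are checked (blocked first, then API or download, then Browser, then WebFetch) is an assumption, since the table does not define precedence when several signals apply.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WEBFETCH = "Strategy A (WebFetch)"
    BROWSER = "Strategy B (Browser)"
    CURL = "Strategy C (Bash/curl)"
    DOWNLOAD = "Strategy C (download)"
    REPORT = "Report to user"


@dataclass
class FetchSignals:
    status_code: int = 200
    login_wall_or_captcha: bool = False
    is_json_or_xml: bool = False        # response body is JSON or XML
    has_download_links: bool = False    # CSV/Excel links found on the page
    looks_js_rendered: bool = False     # empty containers, "loading...", minimal text
    well_structured: bool = True        # content is present and cleanly structured


def choose_action(signals: FetchSignals) -> Action:
    """Map recon signals to an action, using an assumed precedence order."""
    if signals.login_wall_or_captcha or signals.status_code in (401, 403):
        return Action.REPORT
    if signals.is_json_or_xml:
        return Action.CURL
    if signals.has_download_links:
        return Action.DOWNLOAD
    if signals.looks_js_rendered or not signals.well_structured:
        return Action.BROWSER
    return Action.WEBFETCH
```

Checking the blocked case first avoids retrying a page through Browser automation when the real problem is a login wall or CAPTCHA.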
## Step 2.3: Content Classification
Classify into an extraction mode:
| Mode | Indicators | Examples |
|:-----------|:-------------------------------------------|:----------------------------------|
| `table` | HTML `<table>`, grid layout with headers | Price comparison, statistics, specs|
| `list` | Repeated similar elements, card grids | Search results, product listings |
| `article` | Long-form text with headings/paragraphs | Blog post, news article, docs |
| `product` | Product name, price, specs, images, rating | E-commerce product page |
| `contact` | Names, emails, phones, addresses, roles | Team page, staff directory |
| `faq` | Question-answer pairs, accordions | FAQ page, help center |
| `pricing` | Plan names, prices, features, tiers | SaaS pricing page |
| `events` | Dates, locations, titles, descriptions | Event listings, conferences |
| `jobs` | Titles, companies, locations, salaries | Job boards, career pages |
| `custom`   | User-specified CSS selectors or fields     | Anything not matching above       |
Record: **page type**, **extraction mode**, **JS rendering needed (yes/no)**,
**available fields**, **structured data present (JSON-LD etc.)**.
If user asked for "everything", present the available fields and let them choose.
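The items recorded at the end of reconnaissance could be kept in one structure that later phases read from. The `ReconReport` class and its field names below are hypothetical; only the list of extraction modes comes from the table above.

```python
from dataclasses import dataclass, field

# The ten extraction modes from the classification table.
EXTRACTION_MODES = {
    "table", "list", "article", "product", "contact",
    "faq", "pricing", "events", "jobs", "custom",
}


@dataclass
class ReconReport:
    page_type: str
    extraction_mode: str
    js_rendering_needed: bool
    available_fields: list[str] = field(default_factory=list)
    structured_data: list[str] = field(default_factory=list)  # e.g. ["JSON-LD"]

    def __post_init__(self) -> None:
        # Guard against a mode that is not in the classification table.
        if self.extraction_mode not in EXTRACTION_MODES:
            raise ValueError(f"unknown extraction mode: {self.extraction_mode}")

    def field_choices(self) -> str:
        """Format the available fields as a numbered list to show the user."""
        return "\n".join(f"{i}. {name}" for i, name in enumerate(self.available_fields, start=1))
```

The `field_choices` helper covers the "everything" case: the numbered list is what would be presented so the user can pick which fields to extract.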
---
## Phase 3: Strategy Selection
Choose the extraction approach based on recon results.