firecrawl-policy-guardrails
Implement Firecrawl scraping policy enforcement: domain blocklists, credit budgets, content filtering, and robots.txt compliance guardrails. Use when setting up scraping policies, enforcing crawl limits, or preventing accidental scraping of prohibited domains. Trigger with phrases like "firecrawl policy", "firecrawl guardrails", "firecrawl domain blocklist", "firecrawl scraping rules", "firecrawl compliance".
# Firecrawl Policy Guardrails
## Overview
Automated guardrails for Firecrawl scraping pipelines. Web scraping carries legal (robots.txt, ToS), ethical (rate limiting, attribution), and cost (credit burn) risks. This skill implements domain blocklists, credit budgets, content quality gates, and per-domain rate limits as enforceable policies.
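The robots.txt risk named above can be turned into a pre-request check. The following is a minimal sketch, not part of the Firecrawl SDK: `parseRobots` and `isPathAllowed` are hypothetical helpers, the caller is assumed to have already fetched the site's `/robots.txt` text, and only plain path prefixes are handled (no `*` or `$` wildcards, no `Crawl-delay`).

```typescript
// Hypothetical helpers for illustration; not part of Firecrawl.
type RobotsRules = { allow: string[]; disallow: string[] };

// Collect Allow/Disallow prefixes for the group matching `userAgent`,
// falling back to the "*" group when no specific group matches.
function parseRobots(text: string, userAgent: string): RobotsRules {
  const ua = userAgent.toLowerCase();
  const specific: RobotsRules = { allow: [], disallow: [] };
  const wildcard: RobotsRules = { allow: [], disallow: [] };
  let sawSpecific = false;
  let targets: RobotsRules[] = [];
  let inAgentLines = false; // consecutive User-agent lines share one group

  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === "user-agent") {
      if (!inAgentLines) targets = []; // a new group starts here
      inAgentLines = true;
      const agent = value.toLowerCase();
      if (agent === "*") {
        targets.push(wildcard);
      } else if (agent !== "" && ua.includes(agent)) {
        targets.push(specific);
        sawSpecific = true;
      }
    } else if (field === "allow" || field === "disallow") {
      inAgentLines = false;
      if (value === "") continue; // an empty Disallow restricts nothing
      for (const t of targets) t[field].push(value);
    }
  }
  return sawSpecific ? specific : wildcard;
}

// Longest matching prefix wins; Allow wins a tie.
function isPathAllowed(rules: RobotsRules, path: string): boolean {
  const longest = (prefixes: string[]): number =>
    prefixes
      .filter((p) => path.startsWith(p))
      .reduce((max, p) => Math.max(max, p.length), -1);
  const allow = longest(rules.allow);
  const disallow = longest(rules.disallow);
  return disallow < 0 || allow >= disallow;
}
```

A caller would run `isPathAllowed(parseRobots(robotsText, "my-crawler"), new URL(url).pathname)` before spending a credit, where `"my-crawler"` stands in for whatever user-agent token the pipeline identifies itself with.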
## Instructions
### Step 1: Domain Policy Enforcement
```typescript
import FirecrawlApp from "@mendable/firecrawl-js";

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY!,
});

// Thrown when a request violates a scraping policy.
class PolicyViolation extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PolicyViolation";
  }
}

class ScrapePolicy {
  // Domains that explicitly prohibit scraping in their ToS
  static BLOCKED_DOMAINS = [
    "facebook.com", "instagram.com", // Meta ToS
    "linkedin.com",                  // LinkedIn ToS
    "twitter.com", "x.com",          // X/Twitter ToS
  ];

  // Domains with sensitive/regulated content
  static SENSITIVE_DOMAINS = [
    "*.gov", "*.mil", // Government
    "*.edu",          // Educational (FERPA)
  ];

  // Throws PolicyViolation for blocked domains; only warns for sensitive ones.
  // new URL() itself throws a TypeError if `url` is not a valid absolute URL.
  static validateUrl(url: string): void {
    const hostname = new URL(url).hostname;

    for (const blocked of ScrapePolicy.BLOCKED_DOMAINS) {
      // Match the domain itself and any subdomain, but not look-alikes
      // such as "notfacebook.com".
      if (hostname === blocked || hostname.endsWith(`.${blocked}`)) {
        throw new PolicyViolation(`Domain "${hostname}" is blocked: ToS prohibits scraping`);
      }
    }

    for (const pattern of ScrapePolicy.SENSITIVE_DOMAINS) {
      // "*.gov" -> ".gov": a plain suffix test avoids building a regex
      // from unescaped pattern text.
      if (hostname.endsWith(pattern.slice(1))) {
        console.warn(`CAUTION: "${hostname}" matches sensitive domain pattern "${pattern}"`);
      }
    }
  }
}
```
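The suffix rule used by the blocklist can be checked in isolation. The sketch below uses a hypothetical standalone helper, `isBlockedHost`, that mirrors that rule and adds one hardening step that is an assumption on top of the original: it strips a trailing dot, since `new URL("https://facebook.com./").hostname` keeps the dot and would otherwise slip past an exact or suffix comparison.

```typescript
// Hypothetical helper mirroring the blocklist rule; not part of Firecrawl.
function isBlockedHost(hostname: string, blocked: readonly string[]): boolean {
  // Normalise case and drop a trailing root dot ("facebook.com." -> "facebook.com").
  const host = hostname.toLowerCase().replace(/\.$/, "");
  // Block the domain itself and any subdomain, but not look-alike names.
  return blocked.some((d) => host === d || host.endsWith(`.${d}`));
}

const SAMPLE_BLOCKLIST = ["facebook.com", "x.com"];

for (const host of ["m.facebook.com", "notfacebook.com", "facebook.com.", "netflix.com"]) {
  console.log(host, isBlockedHost(host, SAMPLE_BLOCKLIST));
}
```

The leading dot in the suffix test is what keeps `netflix.com` from matching `x.com` and `notfacebook.com` from matching `facebook.com`.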
### Step 2: Credit Budget Enforcement
```typescript
class CrawlBudget {
  // Pages scraped per UTC day, keyed by "YYYY-MM-DD".
  private usage = new Map<string, number>();
  private dailyLimit: number;

  constructor(dailyLimit = 5000) {
    this.dailyLimit = dailyLimit;
  }

  // Checks the budget without reserving anything: usage only changes
  // when record() is called after the scrape.
  authorize(estimatedPages: number): void {
    const today = new Date().toISOString().split("T")[0]; // UTC date
    const used = this.usage.get(today) ?? 0;
    if (used + estimatedPages > this.dailyLimit) {
      throw new PolicyViolation(
        `Daily credit limit would be exceeded: ${used} used + ${estimatedPages} requested > ${this.dailyLimit} limit`
      );
    }
  }

  record(pagesScraped: number): void {
    const today = new Date().toISOString().split("T")[0];
    this.usage.set(today, (this.usage.get(today) ?? 0) + pagesScraped);
  }
}

const budget = new CrawlBudget(5000);
```
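Because `authorize` only checks and `record` runs later, two crawls authorized back to back can jointly exceed the daily limit. One possible variation, sketched here under assumptions, reserves pages up front and settles the difference afterwards. `ReservingBudget`, `reserve`, `settle`, and the injectable `now` clock are hypothetical names introduced for illustration, and the sketch throws a plain `Error` so that it stays self-contained.

```typescript
// Hypothetical variant of the daily budget; not part of the original skill.
class ReservingBudget {
  private used = 0;
  private day = "";
  private dailyLimit: number;
  private now: () => Date;

  constructor(dailyLimit: number, now: () => Date = () => new Date()) {
    this.dailyLimit = dailyLimit;
    this.now = now;
  }

  // Reset the counter when the UTC date changes.
  private rollover(): void {
    const today = this.now().toISOString().slice(0, 10);
    if (today !== this.day) {
      this.day = today;
      this.used = 0;
    }
  }

  // Check and claim the pages in one step, so a second caller sees them.
  reserve(pages: number): void {
    this.rollover();
    if (this.used + pages > this.dailyLimit) {
      throw new Error(`Daily limit exceeded: ${this.used} used + ${pages} requested > ${this.dailyLimit}`);
    }
    this.used += pages;
  }

  // Replace a reservation with the number of pages actually scraped.
  settle(reserved: number, actual: number): void {
    this.rollover();
    this.used = Math.max(0, this.used - reserved + actual);
  }

  remaining(): number {
    this.rollover();
    return this.dailyLimit - this.used;
  }
}
```

A crawl would call `reserve(limit)` before starting and `settle(limit, pagesScraped)` once the result is known. A reservation settled after midnight UTC is counted against the new day, which is a simplification of this sketch.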
### Step 3: Content Quality Gate
```typescript
function validateScrapedContent(result: any): {
  accepted: boolean;
  reason?: string;
} {
  const md = result.markdown || "";

  // Reject thin content
  if (md.length < 50) {
    return { accepted: false, reason: "Content too short (<50 chars)" };
  }

  // Reject error pages
  if (/403 forbidden|access denied|captcha/i.test(md)) {
    return { accepted: false, reason: "Error page detected" };
  }

  // Reject login walls
  if (/sign in to continue|create an account|login required/i.test(md)) {
    return { accepted: false, reason: "Login wall detected" };
  }

  // Reject cookie consent pages (only content is cookie notice)
  if (md.length < 500 && /cookie|consent|gdpr/i.test(md)) {
    return { accepted: false, reason: "Cookie consent page only" };
  }

  return { accepted: true };
}
```
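The error-page and login-wall checks above run against the whole page, so a long article that merely mentions "captcha" or "create an account" would be rejected. One possible refinement, assuming block pages are short, applies those phrase checks only below a length threshold. `gateByLength` and the 1000-character default are hypothetical choices for illustration, not values from the original skill.

```typescript
// Hypothetical refinement of the quality gate; thresholds are assumptions.
type GateResult = { accepted: boolean; reason?: string };

const BLOCK_PHRASES = /403 forbidden|access denied|captcha/i;
const LOGIN_PHRASES = /sign in to continue|create an account|login required/i;

function gateByLength(markdown: string, shortPageChars = 1000): GateResult {
  const md = markdown.trim();

  // Thin content is rejected regardless of wording.
  if (md.length < 50) {
    return { accepted: false, reason: "Content too short (<50 chars)" };
  }

  // Only short pages are treated as candidate block or login pages.
  if (md.length < shortPageChars) {
    if (BLOCK_PHRASES.test(md)) {
      return { accepted: false, reason: "Error page detected" };
    }
    if (LOGIN_PHRASES.test(md)) {
      return { accepted: false, reason: "Login wall detected" };
    }
  }

  return { accepted: true };
}
```

The trade-off is that a long page with a block notice embedded in otherwise real content is accepted, so the threshold should be tuned against pages actually seen from the target sites.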
### Step 4: Crawl Limit Enforcement
```typescript
const MAX_CRAWL_LIMIT = 500;
const MAX_DEPTH = 5;

async function policedCrawl(url: string, requestedLimit: number) {
  // Validate URL
  ScrapePolicy.validateUrl(url);

  // Enforce hard limits
  const limit = Math.min(requestedLimit, MAX_CRAWL_LIMIT);
  if (requestedLimit > MAX_CRAWL_LIMIT) {
    console.warn(`Crawl limit capped: ${requestedLimit} -> ${MAX_CRAWL_LIMIT}`);
  }

  // Check budget against the capped limit (worst case for this crawl)
  budget.authorize(limit);

  // Execute with enforced limits
  const result = await firecrawl.crawlUrl(url, {
    limit,
    maxDepth: MAX_DEPTH,
    scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
  });

  // Assumes the SDK's { success: false, error } failure shape; a failed
  // crawl is surfaced as an error instead of being counted as zero pages.
  if (!result.success) {
    throw new Error(`Crawl failed: ${result.error}`);
  }

  // Record actual usage
  const pagesScraped = result.data.length;
  budget.record(pagesScraped);

  // Filter by content quality
  const validPages = result.data.filter(page => {
    const { accepted, reason } = validateScrapedContent(page);
    if (!accepted) console.log(`Rejected: ${page.metadata?.sourceURL}: ${reason}`);
    return accepted;
  });

  console.log(`Crawl: ${pagesScraped} scraped, ${validPages.length} accepted, ${pagesScraped - validPages.length} rejected`);
  return validPages;
}
```
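`policedCrawl` rejects the whole crawl when the capped limit does not fit in the remaining budget. An alternative policy, sketched here as an assumption and not as the skill's behavior, shrinks the crawl to whatever budget is left. `effectiveLimit` is a hypothetical helper.

```typescript
// Hypothetical helper: the smallest of the request, the hard cap, and the
// remaining budget, never below zero.
function effectiveLimit(requested: number, hardCap: number, remaining: number): number {
  if (!Number.isInteger(requested) || requested < 1) {
    throw new RangeError(`requested must be a positive integer, got ${requested}`);
  }
  return Math.max(0, Math.min(requested, hardCap, remaining));
}

console.log(effectiveLimit(1000, 500, 5000)); // capped by the hard limit
console.log(effectiveLimit(1000, 500, 120));  // capped by the remaining budget
```

A result of `0` means the budget is exhausted and the caller should skip the crawl.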
### Step 5: Per-Domain Rate Limiting
```typescript
const DOMAIN_RATE_LIMITS: Record<string, number> = {
  "docs.example.com": 2, // 2 requests/second
  "blog.example.com": 1, // 1 request/second
  default: 5,            // 5 requests/second
};

// Timestamp (ms) of the most recent request per domain.
const lastRequest = new Map<string, number>();

// Note: correct for sequential callers. Concurrent callers for the same
// domain can all read the same timestamp before any of them updates it.
async function rateLimitedScrape(url: string) {
  const domain = new URL(url).hostname;
  const rate = DOMAIN_RATE_LIMITS[domain] ?? DOMAIN_RATE_LIMITS.default;
  const minInterval = 1000 / rate; // ms between requests

  const last = lastRequest.get(domain) ?? 0;
  const elapsed = Date.now() - last;
  if (elapsed < minInterval) {
    await new Promise(r => setTimeout(r, minInterval - elapsed));
  }

  lastRequest.set(domain, Date.now());
  return firecrawl.scrapeUrl(url, { formats: ["markdown"] });
}
```
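If several scrapes for the same domain run concurrently, a limiter that reads the last timestamp, waits, and only then updates it lets all of them through at once. A minimal sketch of one way around that, assuming a slot-reservation approach: each caller claims the next free slot synchronously and then waits for it. `DomainThrottle` and its method names are hypothetical, and the current time is passed in so the behavior can be checked without real timers.

```typescript
// Hypothetical slot-reservation throttle; not part of Firecrawl.
class DomainThrottle {
  private nextFree = new Map<string, number>(); // earliest free slot (ms) per domain
  private ratesPerSecond: Map<string, number>;
  private defaultRate: number;

  constructor(ratesPerSecond: Map<string, number>, defaultRate = 5) {
    this.ratesPerSecond = ratesPerSecond;
    this.defaultRate = defaultRate;
  }

  // Claims the next slot for `domain` and returns how long to wait (ms).
  // The map is updated before the caller awaits anything, so concurrent
  // callers are queued one interval apart.
  reserve(domain: string, now: number): number {
    const rate = this.ratesPerSecond.get(domain) ?? this.defaultRate;
    const interval = 1000 / rate;
    const slot = Math.max(now, this.nextFree.get(domain) ?? 0);
    this.nextFree.set(domain, slot + interval);
    return slot - now;
  }
}

const throttle = new DomainThrottle(new Map([["docs.example.com", 2]]));
const t0 = 10_000;
console.log(throttle.reserve("docs.example.com", t0)); // first caller
console.log(throttle.reserve("docs.example.com", t0)); // second caller, same instant
```

In a pipeline, the caller would sleep for `throttle.reserve(hostname, Date.now())` milliseconds and then issue the scrape.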
## Policy Summary
| Policy | Enforcement | Consequence |
|--------|-------------|-------------|
| Domain blocklist | Pre-request check | Request rejected with PolicyViolation |
| Credit budget | Pre-request check | Request rejected if over daily limit |
| Crawl limit | Hard cap at 500 pages, depth 5 | Request capped to the limit, with a logged warning |
| Content quality | Post-scrape filter | Invalid pages excluded from results |
| Per-domain rate | Pre-request delay | Automatic throttling |
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
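| `PolicyViolation` for a URL | Domain is on the blocklist | Remove the URL from the scrape targets |
| `PolicyViolation` for the budget | Requested pages plus today's usage exceed the daily limit | Raise the daily limit or wait for the next UTC day |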
| Many rejected pages | Error/login pages | Check target site, adjust URL patterns |
| Slow scraping | Per-domain rate limit | Expected behavior, protects target site |
## Examples
### Policy-Checked Pipeline
```typescript
async function scrapePipeline(urls: string[]) {
  const results = [];

  for (const url of urls) {
    try {
      ScrapePolicy.validateUrl(url);
      budget.authorize(1);

      const result = await rateLimitedScrape(url);
      const { accepted } = validateScrapedContent(result);
      if (accepted) results.push(result);

      // Count the page even when it is rejected: the scrape was still made.
      budget.record(1);
    } catch (e) {
      if (e instanceof PolicyViolation) {
        // Policy failures skip this URL and continue with the rest.
        console.warn(`Policy: ${e.message}`);
      } else {
        console.error(`Error: ${(e as Error).message}`);
      }
    }
  }

  return results;
}
```
## Resources
- [Firecrawl Docs](https://docs.firecrawl.dev)
- [robots.txt Spec](https://www.robotstxt.org/robotstxt.html)
- [Web Scraping Legal Guide](https://www.eff.org/issues/web-scraping)
## Next Steps
For architecture patterns, see `firecrawl-architecture-variants`.