extract-from-pdfs

Included with Lifetime

$97 forever

This skill should be used when extracting structured data from scientific PDFs for systematic reviews, meta-analyses, or database creation. Use when working with collections of research papers that need to be converted into analyzable datasets with validation metrics.

Generalscriptsassets

What this skill does


# Extract Structured Data from Scientific PDFs

## Purpose

Extract standardized, structured data from scientific PDF literature using Claude's vision capabilities. Transform PDF collections into validated databases ready for statistical analysis in Python, R, or other frameworks.

**Core capabilities:**
- Organize metadata from BibTeX, RIS, directories, or DOI lists
- Filter papers by abstract using Claude (Haiku/Sonnet) or local models (Ollama)
- Extract structured data from PDFs with customizable schemas
- Repair and validate JSON outputs automatically
- Enrich with external databases (GBIF, WFO, GeoNames, PubChem, NCBI)
- Calculate precision/recall metrics for quality assurance
- Export to Python, R, CSV, Excel, or SQLite

## When to Use This Skill

Use when:
- Conducting systematic literature reviews requiring data extraction
- Building databases from scientific publications
- Converting PDF collections to structured datasets
- Validating extraction quality with ground truth metrics
- Comparing extraction approaches (different models, prompts)

Do not use for:
- Single PDF summarization (use basic PDF reading instead)
- Full-text PDF search (use document search tools)
- PDF editing or manipulation

## Getting Started

### 1. Initial Setup

Read the setup guide for installation and configuration:

```bash
cat references/setup_guide.md
```

Key setup steps:
- Install dependencies: `conda env create -f environment.yml`
- Set API keys: `export ANTHROPIC_API_KEY='your-key'`
- Optional: Install Ollama for free local filtering

### 2. Define Extraction Requirements

**Ask the user:**
- Research domain and extraction goals
- How PDFs are organized (reference manager, directory, DOI list)
- Approximate collection size
- Preferred analysis environment (Python, R, etc.)

**Provide 2-3 example PDFs** to analyze structure and design schema.

### 3. Design Extraction Schema

Create custom schema from template:

```bash
cp assets/schema_template.json my_schema.json
```

Customize for the specific domain:
- Set `objective` describing what to extract
- Define `output_schema` with field types and descriptions
- Add domain-specific `instructions` for Claude
- Provide `output_example` showing desired format

See `assets/example_flower_visitors_schema.json` for real-world ecology example.

## Workflow Execution

### Complete Pipeline

Run the 6-step pipeline (plus optional validation):

```bash
# Step 1: Organize metadata
python scripts/01_organize_metadata.py \
  --source-type bibtex \
  --source library.bib \
  --pdf-dir pdfs/ \
  --output metadata.json

# Step 2: Filter papers (optional - recommended)
# Choose backend: anthropic-haiku (cheap), anthropic-sonnet (accurate), ollama (free)
python scripts/02_filter_abstracts.py \
  --metadata metadata.json \
  --backend anthropic-haiku \
  --use-batches \
  --output filtered_papers.json

# Step 3: Extract from PDFs
python scripts/03_extract_from_pdfs.py \
  --metadata filtered_papers.json \
  --schema my_schema.json \
  --method batches \
  --output extracted_data.json

# Step 4: Repair JSON
python scripts/04_repair_json.py \
  --input extracted_data.json \
  --schema my_schema.json \
  --output cleaned_data.json

# Step 5: Validate with APIs
python scripts/05_validate_with_apis.py \
  --input cleaned_data.json \
  --apis my_api_config.json \
  --output validated_data.json

# Step 6: Export to analysis format
python scripts/06_export_database.py \
  --input validated_data.json \
  --format python \
  --output results
```

### Validation (Optional but Recommended)

Calculate extraction quality metrics:

```bash
# Step 7: Sample papers for annotation
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json

# Step 8: Manually annotate (edit validation_set.json)
# Fill ground_truth field for each sampled paper

# Step 9: Calculate metrics
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt
```

Validation produces precision, recall, and F1 metrics per field and overall.

## Detailed Documentation

Access comprehensive guides in the `references/` directory:

**Setup and installation:**
```bash
cat references/setup_guide.md
```

**Complete workflow with examples:**
```bash
cat references/workflow_guide.md
```

**Validation methodology:**
```bash
cat references/validation_guide.md
```

**API integration details:**
```bash
cat references/api_reference.md
```

## Customization

### Schema Customization

Modify `my_schema.json` to match the research domain:

1. **Objective:** Describe what data to extract
2. **Instructions:** Step-by-step extraction guidance
3. **Output schema:** JSON schema defining structure
4. **Important notes:** Domain-specific rules
5. **Examples:** Show desired output format

Use imperative language in instructions. Be specific about data types, required vs optional fields, and edge cases.

### API Configuration

Configure external database validation in `my_api_config.json`:

Map extracted fields to validation APIs:
- `gbif_taxonomy` - Biological taxonomy
- `wfo_plants` - Plant names specifically
- `geonames` - Geographic locations
- `geocode` - Address to coordinates
- `pubchem` - Chemical compounds
- `ncbi_gene` - Gene identifiers

See `assets/example_api_config_ecology.json` for ecology-specific example.

### Filtering Customization

Edit filtering criteria in `scripts/02_filter_abstracts.py` (line 74):

Replace TODO section with domain-specific criteria:
- What constitutes primary data vs review?
- What data types are relevant?
- What scope (geographic, temporal, taxonomic) is needed?

Use conservative criteria (when in doubt, include paper) to avoid false negatives.

## Cost Optimization

**Backend selection for filtering (Step 2):**
- Ollama (local): $0 - Best for privacy and high volume
- Haiku (API): ~$0.25/M tokens - Best balance of cost/quality
- Sonnet (API): ~$3/M tokens - Best for complex filtering

**Typical costs for 100 papers:**
- With filtering (Haiku + Sonnet): ~$4
- With local Ollama + Sonnet: ~$3.75
- Without filtering (Sonnet only): ~$7.50

**Optimization strategies:**
- Use abstract filtering to reduce PDF processing
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Quality Assurance

**Validation workflow provides:**
- Precision: % of extracted items that are correct
- Recall: % of true items that were extracted
- F1 score: Harmonic mean of precision and recall
- Per-field metrics: Identify weak fields

**Use metrics to:**
- Establish baseline extraction quality
- Compare different approaches (models, prompts, schemas)
- Identify areas for improvement
- Report extraction quality in publications

**Recommended sample sizes:**
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Iterative Improvement

1. Run initial extraction with baseline schema
2. Validate on sample using Steps 7-9
3. Analyze field-level metrics and error patterns
4. Revise schema, prompts, or model selection
5. Re-extract and re-validate
6. Compare metrics to verify improvement
7. Repeat until acceptable quality achieved

See `references/validation_guide.md` for detailed guidance on interpreting metrics and improving extraction quality.

## Available Scripts

**Data organization:**
- `scripts/01_organize_metadata.py` - Standardize PDFs and metadata

**Filtering:**
- `scripts/02_filter_abstracts.py` - Filter by abstract (Haiku/Sonnet/Ollama)

**Extraction:**
- `scripts/03_extract_from_pdfs.py` - Extract from PDFs with Claude vision

**Processing:**
- `scripts/04_repair_json.py` - Repair and validate JSON
- `scripts/05_validate_with_apis.py` - Enrich with external databases
- `scripts/06_export_data

Files: 20

Size: 156.7 KB

Complexity: 88/100

Category: General

Source: https://github.com/brunoasm/my_claude_skills/tree/main/extract_from_pdfs

Related in General

modeling-omnistudio-epc-catalog

Included

Salesforce Industries CME EPC product-modeling skill for Product2-based catalog creation. Use when creating EPC products, configuring product attributes, building offer bundles with Product Child Items, or reviewing EPC DataPack JSON metadata for product catalog changes. TRIGGER when: user creates or updates Product2 EPC records, AttributeAssignment payloads, AttributeMetadata/AttributeDefaultValues, Offer bundles, or ProductChildItem relationships. DO NOT TRIGGER when: designing OmniScripts/FlexCards/Integration Procedures (use building-omnistudio-omniscript, building-omnistudio-flexcard, or building-omnistudio-integration-procedure), implementing Apex business logic (use generating-apex), or troubleshooting deployment pipelines (use deploying-metadata).

Generalscripts

relationship-science-coach

Included

Use this skill for direct, practical adult relationship coaching: couples conflict, repair, trust, marriage, dating, flirting, attachment patterns, emotional connection, sex, desire differences, eroticism, kink negotiation, affection, love languages, breakups, and long-term passion. Draw on Gottman, EFT and Hold Me Tight, attachment science, modern sex research, Perel, Nagoski, Kerner, Schnarch, Love and Stosny, and flexible love-language tools. Be concrete and low-hedge. Redirect only for imminent danger, abuse, coercive control, minors, non-consent, self-harm, stalking, or medical/legal/psychiatric decisions.

Generalscripts

building-sf-integrations

Included

Salesforce integration architecture and runtime plumbing with 120-point scoring. Use this skill to set up Named Credentials, External Credentials, External Services, REST/SOAP callout patterns, Platform Events, and Change Data Capture. TRIGGER when: user sets up Named Credentials, External Services, REST/SOAP callouts, Platform Events, CDC, or touches .namedCredential-meta.xml files. DO NOT TRIGGER when: Connected App/OAuth config (use configuring-connected-apps), Apex-only logic (use generating-apex), or data import/export (use handling-sf-data).

Generalscripts

venue-templates

Included

Access comprehensive LaTeX templates, formatting requirements, and submission guidelines for major scientific publication venues (Nature, Science, PLOS, IEEE, ACM), academic conferences (NeurIPS, ICML, CVPR, CHI), research posters, and grant proposals (NSF, NIH, DOE, DARPA). This skill should be used when preparing manuscripts for journal submission, conference papers, research posters, or grant proposals and need venue-specific formatting requirements and templates.

Generalscripts

let-fate-decide

Included

Draws the 12 Houses of the Zodiac Tarot spread to inject entropy into planning when prompts are vague, ambiguous, or casually delegated. Interprets the spread to guide next steps. Use when the user says 'let fate decide', 'YOLO', 'whatever', 'idk', or other nonchalant phrases, makes Yu-Gi-Oh references, or when you are about to arbitrarily pick between multiple reasonable approaches. Prefer over ask-questions-if-underspecified when the user's tone is casual or playful rather than precision-seeking.

Generalscripts

net-ops

Included

Cross-platform network troubleshooting (Windows, macOS, Linux) via local or remote shell. Use for: DNS broken, can't resolve hostnames, nslookup/dig works but apps fail, NRPT, WFP, scutil, /etc/resolver, systemd-resolved, /etc/resolv.conf, NetworkManager, VPN DNS leak residue (ProtonVPN/Mullvad/WireGuard/AnyConnect), AV/firewall blocking DNS or DoH, Tailscale DNS interaction, intermittent connectivity, remote diagnostics over SSH.

Generalscripts