table-extractor
Included with Lifetime
$97 forever
>
parsingtableextractionpdfcamelot
What this skill does
# Table Extractor Skill
## Overview
This skill enables precise extraction of tables from PDF documents using **camelot** - the gold standard for PDF table extraction. Handle complex tables with merged cells, borderless tables, and multi-page layouts with high accuracy.
## How to Use
1. Provide the PDF containing tables
2. Optionally specify pages or table detection method
3. I'll extract tables as pandas DataFrames
**Example prompts:**
- "Extract all tables from this PDF"
- "Get the table on page 5 of this report"
- "Extract borderless tables from this document"
- "Convert PDF tables to Excel format"
## Domain Knowledge
### camelot Fundamentals
```python
import camelot
# Extract tables from PDF
tables = camelot.read_pdf('document.pdf')
# Access results
print(f"Found {len(tables)} tables")
# Get first table as DataFrame
df = tables[0].df
print(df)
```
### Extraction Methods
| Method | Use Case | Description |
|--------|----------|-------------|
| `lattice` | Bordered tables | Detects table by lines/borders |
| `stream` | Borderless tables | Uses text positioning |
```python
# Lattice method (default) - for tables with visible borders
tables = camelot.read_pdf('document.pdf', flavor='lattice')
# Stream method - for borderless tables
tables = camelot.read_pdf('document.pdf', flavor='stream')
```
### Page Selection
```python
# Single page
tables = camelot.read_pdf('document.pdf', pages='1')
# Multiple pages
tables = camelot.read_pdf('document.pdf', pages='1,3,5')
# Page range
tables = camelot.read_pdf('document.pdf', pages='1-5')
# All pages
tables = camelot.read_pdf('document.pdf', pages='all')
```
### Advanced Options
#### Lattice Options
```python
tables = camelot.read_pdf(
'document.pdf',
flavor='lattice',
line_scale=40, # Line detection sensitivity
copy_text=['h', 'v'], # Copy text across merged cells
shift_text=['l', 't'], # Shift text alignment
split_text=True, # Split text at newlines
flag_size=True, # Flag super/subscripts
strip_text='\n', # Characters to strip
process_background=False, # Process background lines
)
```
#### Stream Options
```python
tables = camelot.read_pdf(
'document.pdf',
flavor='stream',
edge_tol=500, # Edge tolerance
row_tol=10, # Row tolerance
column_tol=0, # Column tolerance
strip_text='\n', # Characters to strip
)
```
### Table Area Specification
```python
# Extract from specific area (x1, y1, x2, y2)
# Coordinates from bottom-left, in PDF points (72 points = 1 inch)
tables = camelot.read_pdf(
'document.pdf',
table_areas=['72,720,540,400'], # One area
)
# Multiple areas
tables = camelot.read_pdf(
'document.pdf',
table_areas=['72,720,540,400', '72,380,540,200'],
)
```
### Column Specification
```python
# Manually specify column positions (for stream method)
tables = camelot.read_pdf(
'document.pdf',
flavor='stream',
columns=['100,200,300,400'], # X positions of column separators
)
```
### Working with Results
```python
import camelot
tables = camelot.read_pdf('document.pdf')
for i, table in enumerate(tables):
# Access DataFrame
df = table.df
# Table metadata
print(f"Table {i+1}:")
print(f" Page: {table.page}")
print(f" Accuracy: {table.accuracy}")
print(f" Whitespace: {table.whitespace}")
print(f" Order: {table.order}")
print(f" Shape: {df.shape}")
# Parsing report
report = table.parsing_report
print(f" Report: {report}")
```
### Export Options
```python
import camelot
tables = camelot.read_pdf('document.pdf')
# Export to CSV
tables[0].to_csv('table.csv')
# Export to Excel
tables[0].to_excel('table.xlsx')
# Export to JSON
tables[0].to_json('table.json')
# Export to HTML
tables[0].to_html('table.html')
# Export all tables
for i, table in enumerate(tables):
table.to_excel(f'table_{i+1}.xlsx')
```
### Visual Debugging
```python
import camelot
# Enable visual debugging
tables = camelot.read_pdf('document.pdf')
# Plot detected table areas
camelot.plot(tables[0], kind='contour').show()
# Plot text on table
camelot.plot(tables[0], kind='text').show()
# Plot detected lines (lattice only)
camelot.plot(tables[0], kind='joint').show()
camelot.plot(tables[0], kind='line').show()
# Save plot
fig = camelot.plot(tables[0])
fig.savefig('debug.png')
```
### Handling Multi-page Tables
```python
import camelot
import pandas as pd
def extract_multipage_table(pdf_path, pages='all'):
"""Extract and combine tables that span multiple pages."""
tables = camelot.read_pdf(pdf_path, pages=pages)
# Group tables by similar structure (columns)
table_groups = {}
for table in tables:
cols = tuple(table.df.columns)
if cols not in table_groups:
table_groups[cols] = []
table_groups[cols].append(table.df)
# Combine similar tables
combined = []
for cols, dfs in table_groups.items():
if len(dfs) > 1:
# Combine and deduplicate header rows
combined_df = pd.concat(dfs, ignore_index=True)
combined.append(combined_df)
else:
combined.append(dfs[0])
return combined
```
## Best Practices
1. **Try Both Methods**: Lattice for bordered, stream for borderless
2. **Check Accuracy Score**: Above 90% is usually good
3. **Use Visual Debugging**: Understand extraction results
4. **Specify Areas**: For PDFs with multiple table types
5. **Handle Headers**: First row often needs special treatment
## Common Patterns
### Batch Table Extraction
```python
import camelot
from pathlib import Path
import pandas as pd
def batch_extract_tables(input_dir, output_dir):
"""Extract tables from all PDFs in directory."""
input_path = Path(input_dir)
output_path = Path(output_dir)
output_path.mkdir(exist_ok=True)
results = []
for pdf_file in input_path.glob('*.pdf'):
try:
tables = camelot.read_pdf(str(pdf_file), pages='all')
for i, table in enumerate(tables):
# Skip low accuracy tables
if table.accuracy < 80:
continue
output_file = output_path / f"{pdf_file.stem}_table_{i+1}.xlsx"
table.to_excel(str(output_file))
results.append({
'source': str(pdf_file),
'table': i + 1,
'page': table.page,
'accuracy': table.accuracy,
'output': str(output_file)
})
except Exception as e:
results.append({
'source': str(pdf_file),
'error': str(e)
})
return results
```
### Auto-detect Table Method
```python
import camelot
def smart_extract_tables(pdf_path, pages='1'):
"""Try both methods and return best results."""
# Try lattice first
lattice_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
# Try stream
stream_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
# Compare and return best
results = []
if lattice_tables and lattice_tables[0].accuracy > 70:
results.extend(lattice_tables)
elif stream_tables:
results.extend(stream_tables)
return results
```
## Examples
### Example 1: Financial Statement Extraction
```python
import camelot
import pandas as pd
def extract_financial_tables(pdf_path):
"""Extract financial tables from annual report."""
# Extract all tables
tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
financial_data = {
'income_statement': None,
'balance_sheet': None,
'cash_flow': None,
'other_tables': []
}
for tabl