
polars

High-performance DataFrame library for fast data processing with lazy evaluation, parallel execution, and memory efficiency

data-analysis, polars, dataframe, performance, parallel, lazy-evaluation, arrow, rust, data-processing

What this skill does


# Polars High-Performance DataFrame Skill

Master Polars for blazing-fast data processing with lazy evaluation, parallel execution, and memory-efficient operations on datasets of any size.

## When to Use This Skill

### USE Polars when:
- **Large datasets** - Working with data too large for pandas (10GB+)
- **Performance critical** - Need maximum speed for data transformations
- **Memory constrained** - Limited RAM requires efficient memory usage
- **Parallel processing** - Want to utilize all CPU cores automatically
- **Complex aggregations** - Group by, window functions, rolling calculations
- **Lazy evaluation** - Query optimization before execution matters
- **ETL pipelines** - Building production data pipelines
- **Larger-than-memory data** - Processing data that does not fit in RAM by streaming it in batches

### DON'T USE Polars when:
- **Pandas ecosystem required** - Need specific pandas-only libraries
- **Small datasets** - Under 100MB where pandas is sufficient
- **Legacy code** - Extensive existing pandas codebase
- **Matplotlib/Seaborn direct integration** - These work better with pandas
- **Time series with specialized needs** - Some pandas time series features are more mature

## Prerequisites

```bash
# Basic installation
pip install polars

# With all optional dependencies
pip install 'polars[all]'

# Specific extras
pip install 'polars[numpy,pandas,pyarrow,fsspec,connectorx,xlsx2csv,deltalake,timezone]'

# Using uv (recommended)
uv pip install polars pyarrow connectorx
```

## Core Capabilities

### 1. DataFrame Creation and I/O

**Creating DataFrames:**
```python
import polars as pl
import numpy as np
from datetime import datetime, date

# From Python dictionaries
df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "value": [100.5, 200.3, 150.7, 300.2, 250.8],
    "category": ["A", "B", "A", "C", "B"],
    "timestamp": [
        datetime(2025, 1, 1, 10, 0),
        datetime(2025, 1, 2, 11, 30),
        datetime(2025, 1, 3, 9, 15),
        datetime(2025, 1, 4, 14, 45),
        datetime(2025, 1, 5, 16, 0),
    ]
})

print(df)
print(f"Shape: {df.shape}")
print(f"Schema: {df.schema}")

# From NumPy arrays
np_data = np.random.randn(1000, 5)
df_numpy = pl.DataFrame(
    np_data,
    schema=["col_a", "col_b", "col_c", "col_d", "col_e"]
)

# From list of dictionaries
records = [
    {"x": 1, "y": "a"},
    {"x": 2, "y": "b"},
    {"x": 3, "y": "c"}
]
df_records = pl.DataFrame(records)

# Specify schema explicitly
df_typed = pl.DataFrame(
    {
        "integers": [1, 2, 3],
        "floats": [1.0, 2.0, 3.0],
        "strings": ["a", "b", "c"]
    },
    schema={
        "integers": pl.Int32,
        "floats": pl.Float64,
        "strings": pl.Utf8
    }
)
```

**Reading Files:**
```python
# CSV files
df = pl.read_csv("data.csv")

# With options
df = pl.read_csv(
    "data.csv",
    separator=",",
    has_header=True,
    skip_rows=0,
    n_rows=10000,  # Read only first N rows
    columns=["col1", "col2", "col3"],  # Select columns
    schema_overrides={"id": pl.Int64, "value": pl.Float32},  # Specify types (named `dtypes` in older releases)
    null_values=["NA", "N/A", ""],
    ignore_errors=True,
    try_parse_dates=True,
    encoding="utf8"
)

# Parquet files (recommended for large data)
df = pl.read_parquet("data.parquet")

# Multiple Parquet files with globbing
df = pl.read_parquet("data/*.parquet")

# Parquet with column selection, a row limit, and an added row index column
df = pl.read_parquet(
    "large_data.parquet",
    columns=["id", "value", "date"],
    n_rows=100000,
    row_index_name="row_nr"  # Named `row_count_name` in older releases
)

# JSON files
df = pl.read_json("data.json")

# JSON Lines (newline-delimited JSON)
df = pl.read_ndjson("data.jsonl")

# Excel files
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")

# Delta Lake
df = pl.read_delta("delta_table/")

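# From SQL database via a connection URI (uses ConnectorX by default, fast!)
# Note: pl.read_database expects a connection object, not a URI string
df = pl.read_database_uri(
    query="SELECT * FROM sales WHERE date > '2025-01-01'",
    uri="postgresql://user:pass@localhost/db"
)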

# From URL
df = pl.read_csv("https://example.com/data.csv")
```

**Writing Files:**
```python
# CSV
df.write_csv("output.csv")

# Parquet (recommended)
df.write_parquet(
    "output.parquet",
    compression="zstd",  # zstd, lz4, snappy, gzip, brotli
    compression_level=3,
    statistics=True,
    row_group_size=100000
)

# JSON (Polars 1.x always writes row-oriented JSON; older releases took row_oriented=True)
df.write_json("output.json")
df.write_ndjson("output.jsonl")

# Delta Lake
df.write_delta("delta_table/", mode="overwrite")

# IPC/Arrow format (fastest for inter-process communication)
df.write_ipc("output.arrow")
```

### 2. Lazy Evaluation and Query Optimization

**LazyFrame Basics:**
```python
import polars as pl

# Create lazy frame (no computation yet); parse date columns while scanning
lf = pl.scan_csv("large_data.csv", try_parse_dates=True)

# Or convert from eager DataFrame
df = pl.DataFrame({"x": [1, 2, 3]})
lf_from_df = df.lazy()

# Chain operations on the scanned file (still no computation)
result_lf = (
    lf
    # Compare against a date expression rather than a string literal
    .filter(pl.col("date") >= pl.date(2025, 1, 1))
    .with_columns([
        (pl.col("revenue") - pl.col("cost")).alias("profit"),
        pl.col("category").cast(pl.Categorical)
    ])
    .group_by("category")
    .agg([
        pl.col("profit").sum().alias("total_profit"),
        pl.col("profit").mean().alias("avg_profit"),
        pl.len().alias("count")  # pl.count() is deprecated in favor of pl.len()
    ])
    .sort("total_profit", descending=True)
)

# View the query plan
print(result_lf.explain())

# Execute and collect results
result_df = result_lf.collect()

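# Execute with the streaming engine (for very large data)
# Older releases spell this collect(streaming=True)
result_df = result_lf.collect(engine="streaming")

# Collect only the first N rows of the result (fetch() is deprecated)
sample = result_lf.head(1000).collect()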
```

**Query Optimization Benefits:**
```python
# Polars optimizes this automatically:
lf = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("country") == "USA")  # Predicate pushdown
    .select(["id", "name", "revenue"])   # Projection pushdown
    .filter(pl.col("revenue") > 1000)    # Combined with first filter
)

# View optimized plan
print("Naive plan:")
print(lf.explain(optimized=False))

print("\nOptimized plan:")
print(lf.explain(optimized=True))

# The optimizer will:
# 1. Push filters to data source (read less data)
# 2. Select only needed columns (reduce memory)
# 3. Combine/reorder operations for efficiency
# 4. Eliminate redundant operations
```

**Streaming Large Files:**
```python
# Process files larger than memory
def process_large_file(input_path: str, output_path: str):
    """Process file that doesn't fit in memory."""
    result = (
        pl.scan_csv(input_path)
        .filter(pl.col("status") == "active")
        .group_by("region")
        .agg([
            pl.col("sales").sum(),
            pl.col("customers").n_unique()
        ])
        .collect(engine="streaming")  # Streaming engine (older releases: streaming=True)
    )

    result.write_parquet(output_path)
    return result

# Sink directly to file (even more memory efficient)
(
    pl.scan_csv("huge_file.csv")
    .filter(pl.col("value") > 0)
    .sink_parquet("filtered_output.parquet")
)
```

### 3. Expression API

**Basic Expressions:**
```python
import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [10, 20, 30, 40, 50],
    "c": ["x", "y", "x", "y", "x"],
    "d": [1.5, 2.5, 3.5, 4.5, 5.5]
})

# Column selection
df.select(pl.col("a"))
df.select(pl.col("a", "b", "c"))
df.select(pl.col("^a.*$"))  # Regex pattern
df.select(pl.all())
df.select(pl.exclude("c"))

# Arithmetic operations
df.select([
    pl.col("a"),
    (pl.col("a") + pl.col("b")).alias("sum"),
    (pl.col("a") * pl.col("d")).alias("product"),
    (pl.col("b") / pl.col("a")).alias("ratio"),
    (pl.col("a") ** 2).alias("squared"),
    (pl.col("a") % 2).alias("modulo")
])

# Conditional expressions
df.select([
    pl.col("a"),
    pl.when(pl.col("a") > 3)
      .then(pl.lit("high"))
      .otherwise(pl.lit("low"))
      .alias("category"),

    pl.when(pl.col("a") < 2)
      .then(pl.lit("low"))
      .when(pl.col("a") < 4)
      .then(pl.lit("medium"))
      .otherwise(pl.lit("high"))
      .alias("tier")
])

# String operations
df_str = pl.DataFrame({
    "text": ["Hello World", "Polars is Fast", "D
