polars
High-performance DataFrame library for fast data processing with lazy evaluation, parallel execution, and memory efficiency
# Polars High-Performance DataFrame Skill
Master Polars for blazing-fast data processing with lazy evaluation, parallel execution, and memory-efficient operations on datasets of any size.
## When to Use This Skill
### USE Polars when:
- **Large datasets** - Working with data too large for pandas (10GB+)
- **Performance critical** - Need maximum speed for data transformations
- **Memory constrained** - Limited RAM requires efficient memory usage
- **Parallel processing** - Want to utilize all CPU cores automatically
- **Complex aggregations** - Group by, window functions, rolling calculations
- **Lazy evaluation** - Query optimization before execution matters
- **ETL pipelines** - Building production data pipelines
- **Streaming data** - Processing data larger than memory
### DON'T USE Polars when:
- **Pandas ecosystem required** - Need specific pandas-only libraries
- **Small datasets** - Under 100MB where pandas is sufficient
- **Legacy code** - Extensive existing pandas codebase
- **Matplotlib/Seaborn direct integration** - These work better with pandas
- **Time series with specialized needs** - Some pandas time series features are more mature
## Prerequisites
```bash
# Basic installation
pip install polars
# With all optional dependencies
pip install 'polars[all]'
# Specific extras
pip install 'polars[numpy,pandas,pyarrow,fsspec,connectorx,xlsx2csv,deltalake,timezone]'
# Using uv (recommended)
uv pip install polars pyarrow connectorx
```
## Core Capabilities
### 1. DataFrame Creation and I/O
**Creating DataFrames:**
```python
import polars as pl
import numpy as np
from datetime import datetime, date

# From Python dictionaries
df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "value": [100.5, 200.3, 150.7, 300.2, 250.8],
    "category": ["A", "B", "A", "C", "B"],
    "timestamp": [
        datetime(2025, 1, 1, 10, 0),
        datetime(2025, 1, 2, 11, 30),
        datetime(2025, 1, 3, 9, 15),
        datetime(2025, 1, 4, 14, 45),
        datetime(2025, 1, 5, 16, 0),
    ]
})
print(df)
print(f"Shape: {df.shape}")
print(f"Schema: {df.schema}")

# From NumPy arrays
np_data = np.random.randn(1000, 5)
df_numpy = pl.DataFrame(
    np_data,
    schema=["col_a", "col_b", "col_c", "col_d", "col_e"]
)

# From list of dictionaries
records = [
    {"x": 1, "y": "a"},
    {"x": 2, "y": "b"},
    {"x": 3, "y": "c"}
]
df_records = pl.DataFrame(records)

# Specify schema explicitly
df_typed = pl.DataFrame(
    {
        "integers": [1, 2, 3],
        "floats": [1.0, 2.0, 3.0],
        "strings": ["a", "b", "c"]
    },
    schema={
        "integers": pl.Int32,
        "floats": pl.Float64,
        "strings": pl.Utf8
    }
)
```
**Reading Files:**
```python
# CSV files
df = pl.read_csv("data.csv")

# With options
df = pl.read_csv(
    "data.csv",
    separator=",",
    has_header=True,
    skip_rows=0,
    n_rows=10000,  # Read only first N rows
    columns=["id", "value", "date"],  # Select columns
    schema_overrides={"id": pl.Int64, "value": pl.Float32},  # Specify types (`dtypes` in older releases)
    null_values=["NA", "N/A", ""],
    ignore_errors=True,
    try_parse_dates=True,
    encoding="utf8"
)

# Parquet files (recommended for large data)
df = pl.read_parquet("data.parquet")

# Multiple Parquet files with globbing
df = pl.read_parquet("data/*.parquet")

# Parquet with column selection, a row limit, and a row index column
df = pl.read_parquet(
    "large_data.parquet",
    columns=["id", "value", "date"],
    n_rows=100000,
    row_index_name="row_nr"  # `row_count_name` in older releases
)

# JSON files
df = pl.read_json("data.json")

# JSON Lines (newline-delimited JSON)
df = pl.read_ndjson("data.jsonl")

# Excel files
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")

# Delta Lake
df = pl.read_delta("delta_table/")

# From a SQL database via a connection URI (uses ConnectorX by default, fast!)
# Note: pl.read_database() expects a connection object, not a URI string.
df = pl.read_database_uri(
    query="SELECT * FROM sales WHERE date > '2025-01-01'",
    uri="postgresql://user:pass@localhost/db"
)

# From URL
df = pl.read_csv("https://example.com/data.csv")
```
**Writing Files:**
```python
# CSV
df.write_csv("output.csv")

# Parquet (recommended)
df.write_parquet(
    "output.parquet",
    compression="zstd",  # zstd, lz4, snappy, gzip, brotli
    compression_level=3,
    statistics=True,
    row_group_size=100000
)

# JSON (written row-oriented; older releases took row_oriented=True)
df.write_json("output.json")
df.write_ndjson("output.jsonl")

# Delta Lake
df.write_delta("delta_table/", mode="overwrite")

# IPC/Arrow format (fastest for inter-process communication)
df.write_ipc("output.arrow")
```
### 2. Lazy Evaluation and Query Optimization
**LazyFrame Basics:**
```python
import polars as pl

# Create lazy frame (no computation yet)
lf = pl.scan_csv("large_data.csv")

# Or convert from an eager DataFrame (kept under a separate name so the
# CSV-backed `lf` above is still the one used in the query below)
df = pl.DataFrame({"x": [1, 2, 3]})
lf_from_df = df.lazy()

# Chain operations (still no computation)
result_lf = (
    lf
    .filter(pl.col("date") >= "2025-01-01")
    .with_columns([
        (pl.col("revenue") - pl.col("cost")).alias("profit"),
        pl.col("category").cast(pl.Categorical)
    ])
    .group_by("category")
    .agg([
        pl.col("profit").sum().alias("total_profit"),
        pl.col("profit").mean().alias("avg_profit"),
        pl.len().alias("count")  # pl.count() is deprecated in favor of pl.len()
    ])
    .sort("total_profit", descending=True)
)

# View the query plan
print(result_lf.explain())

# Execute and collect results
result_df = result_lf.collect()

# Execute with streaming (for very large data)
# Recent Polars 1.x releases spell this collect(engine="streaming")
result_df = result_lf.collect(streaming=True)

# First N rows of the result (LazyFrame.fetch is deprecated in Polars 1.x)
sample = result_lf.head(1000).collect()
```
**Query Optimization Benefits:**
```python
# Polars optimizes this automatically:
lf = (
    pl.scan_parquet("data/*.parquet")
    .filter(pl.col("country") == "USA")  # Predicate pushdown
    .select(["id", "name", "revenue"])   # Projection pushdown
    .filter(pl.col("revenue") > 1000)    # Combined with first filter
)

# View optimized plan
print("Naive plan:")
print(lf.explain(optimized=False))
print("\nOptimized plan:")
print(lf.explain(optimized=True))

# The optimizer will:
# 1. Push filters to data source (read less data)
# 2. Select only needed columns (reduce memory)
# 3. Combine/reorder operations for efficiency
# 4. Eliminate redundant operations
```
**Streaming Large Files:**
```python
import polars as pl

# Process files larger than memory
def process_large_file(input_path: str, output_path: str) -> pl.DataFrame:
    """Process a file that doesn't fit in memory."""
    result = (
        pl.scan_csv(input_path)
        .filter(pl.col("status") == "active")
        .group_by("region")
        .agg([
            pl.col("sales").sum(),
            pl.col("customers").n_unique()
        ])
        # Stream processing; recent Polars 1.x releases spell this
        # collect(engine="streaming")
        .collect(streaming=True)
    )
    result.write_parquet(output_path)
    return result

# Sink directly to file (even more memory efficient)
(
    pl.scan_csv("huge_file.csv")
    .filter(pl.col("value") > 0)
    .sink_parquet("filtered_output.parquet")
)
```
### 3. Expression API
**Basic Expressions:**
```python
import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [10, 20, 30, 40, 50],
    "c": ["x", "y", "x", "y", "x"],
    "d": [1.5, 2.5, 3.5, 4.5, 5.5]
})

# Column selection
df.select(pl.col("a"))
df.select(pl.col("a", "b", "c"))
df.select(pl.col("^a.*$"))  # Regex pattern
df.select(pl.all())
df.select(pl.exclude("c"))

# Arithmetic operations
df.select([
    pl.col("a"),
    (pl.col("a") + pl.col("b")).alias("sum"),
    (pl.col("a") * pl.col("d")).alias("product"),
    (pl.col("b") / pl.col("a")).alias("ratio"),
    (pl.col("a") ** 2).alias("squared"),
    (pl.col("a") % 2).alias("modulo")
])

# Conditional expressions
df.select([
    pl.col("a"),
    pl.when(pl.col("a") > 3)
    .then(pl.lit("high"))
    .otherwise(pl.lit("low"))
    .alias("category"),
    pl.when(pl.col("a") < 2)
    .then(pl.lit("low"))
    .when(pl.col("a") < 4)
    .then(pl.lit("medium"))
    .otherwise(pl.lit("high"))
    .alias("tier")
])

# String operations
df_str = pl.DataFrame({
    "text": ["Hello World", "Polars is Fast", "D