sweetviz
Automated EDA comparison reports with target analysis, feature comparison, and HTML report generation for pandas DataFrames
# Sweetviz EDA Comparison Skill
Use Sweetviz for automated exploratory data analysis: dataset comparison, target variable analysis, and shareable HTML reports. It is strongest at comparing two datasets (for example, train vs test) and at showing how each feature relates to a target variable.
## When to Use This Skill
### USE Sweetviz when:
- **Dataset comparison** - Comparing train vs test, before vs after, or any two datasets
- **Target variable analysis** - Understanding how features relate to a boolean or numerical target
- **Quick EDA reports** - Need comprehensive EDA in one line of code
- **Feature comparison** - Analyzing feature distributions across subsets
- **HTML reports** - Creating shareable, interactive analysis reports
- **Intra-set analysis** - Comparing subpopulations within a dataset
- **Data validation** - Checking for data drift between datasets
- **Feature selection** - Identifying important features for modeling
### DON'T USE Sweetviz when:
- **Very large datasets** - Over 1M rows, unless you sample the data first
- **Streaming data** - Need real-time analysis
- **Deep statistical tests** - Need p-values and hypothesis testing
- **Custom visualizations** - Specific chart requirements
- **Interactive dashboards** - Use Streamlit or Dash instead
- **Text/NLP analysis** - Use dedicated NLP tools
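For tables above the row count mentioned in the list, one option is to profile a random sample rather than the full dataset. The sketch below is a minimal illustration using only pandas; `sample_for_report` and its `max_rows` default are hypothetical names and values chosen for this example, not part of Sweetviz.

```python
import numpy as np
import pandas as pd


def sample_for_report(df: pd.DataFrame, max_rows: int = 100_000, seed: int = 42) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a random sample of max_rows rows.

    Hypothetical helper for illustration; the 100_000 default is an assumption,
    not a Sweetviz recommendation.
    """
    if len(df) <= max_rows:
        return df
    # Sampling without replacement keeps every selected row distinct
    return df.sample(n=max_rows, random_state=seed)


# Synthetic stand-in for a large table
big_df = pd.DataFrame({"value": np.arange(250_000)})
sampled = sample_for_report(big_df, max_rows=50_000)
print(f"{len(big_df)} rows reduced to {len(sampled)} rows")

# The sampled frame can then be passed to sv.analyze() in place of the full table.
```

A fixed `seed` keeps the sample reproducible, so two reports built from the same table stay comparable.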
## Prerequisites
```bash
# Basic installation
pip install sweetviz
# Using uv (recommended)
uv pip install sweetviz pandas numpy
# With Jupyter support
pip install sweetviz pandas numpy jupyter
# Verify installation
python -c "import sweetviz as sv; print(f'Sweetviz version: {sv.__version__}')"
```
### System Requirements
- Python 3.6 or higher
- pandas 0.25.3 or higher
- numpy
- matplotlib (for internal plotting)
- Modern web browser (for viewing HTML reports)
## Core Capabilities
### 1. Basic EDA Report (Analyze)
**Single Dataset Analysis:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv("data.csv")
# Generate basic EDA report
report = sv.analyze(df)
# Save HTML report (opens in the default browser unless open_browser=False)
report.show_html("sweetviz_report.html")
# Display the report inline when running inside a Jupyter notebook
report.show_notebook()
```
**With Source Name:**
```python
import sweetviz as sv
import pandas as pd
df = pd.read_csv("sales_data.csv")
# Generate report with a custom source name: pass [DataFrame, "name"]
report = sv.analyze(
    source=[df, "Sales Data"],
    pairwise_analysis="auto"  # "on", "off", or "auto"
)
report.show_html("sales_analysis.html", open_browser=True)
```
**Sample Dataset for Examples:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create comprehensive sample dataset
np.random.seed(42)
n = 5000
df = pd.DataFrame({
# Numeric features
"age": np.random.randint(18, 80, n),
"income": np.random.exponential(50000, n),
# Rounded but kept as float so NaN can be assigned when missing values are added
"credit_score": np.random.normal(700, 50, n).clip(300, 850).round(),
"account_balance": np.random.exponential(10000, n),
"transaction_count": np.random.poisson(15, n),
# Categorical features
"gender": np.random.choice(["Male", "Female", "Other"], n, p=[0.48, 0.48, 0.04]),
"education": np.random.choice(
["High School", "Bachelor", "Master", "PhD"],
n, p=[0.3, 0.4, 0.2, 0.1]
),
"employment_status": np.random.choice(
["Employed", "Self-employed", "Unemployed", "Retired"],
n, p=[0.6, 0.2, 0.1, 0.1]
),
"region": np.random.choice(["North", "South", "East", "West"], n),
# Date feature
"join_date": [
datetime(2020, 1, 1) + timedelta(days=int(d))
for d in np.random.uniform(0, 1460, n)
],
# Target variable (binary classification)
"churned": np.random.choice([0, 1], n, p=[0.8, 0.2])
})
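# Add some missing values (replace=False guarantees distinct rows, so the counts are exact)
df.loc[np.random.choice(n, 200, replace=False), "income"] = np.nan
df.loc[np.random.choice(n, 100, replace=False), "credit_score"] = np.nan
df.loc[np.random.choice(n, 150, replace=False), "education"] = np.nan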
# Basic analysis
report = sv.analyze(df)
report.show_html("customer_analysis.html")
```
### 2. Target Variable Analysis
**Binary Target Analysis:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
# Create dataset with target variable
np.random.seed(42)
n = 3000
df = pd.DataFrame({
"feature_1": np.random.randn(n),
"feature_2": np.random.exponential(10, n),
"feature_3": np.random.choice(["A", "B", "C"], n),
"feature_4": np.random.randint(1, 100, n),
"target": np.random.choice([0, 1], n, p=[0.7, 0.3])
})
# Analyze with target variable
# Shows how each feature relates to the target
report = sv.analyze(
source=df,
target_feat="target" # Specify target column
)
report.show_html("target_analysis.html")
```
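The relationship a target-aware report displays for a categorical feature can be cross-checked by hand. The sketch below uses only pandas and a small synthetic frame to compute the target rate per category. It is an independent sanity check under those assumptions, not a description of how Sweetviz computes its own figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 3000

df = pd.DataFrame({
    "feature_3": rng.choice(["A", "B", "C"], n),
    "target": rng.choice([0, 1], n, p=[0.7, 0.3]),
})

# Share of rows with target == 1 within each category, plus the row count
target_rate = (
    df.groupby("feature_3")["target"]
    .agg(rate="mean", rows="size")
    .sort_values("rate", ascending=False)
)
print(target_rate)
print(f"Overall target rate: {df['target'].mean():.3f}")
```

Because the target here is drawn independently of `feature_3`, the per-category rates should all sit close to the overall rate; a large gap on real data would suggest the feature carries signal.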
**Continuous Target Analysis:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
# Regression target example
np.random.seed(42)
n = 2000
# Features that affect the target
x1 = np.random.randn(n)
x2 = np.random.exponential(5, n)
x3 = np.random.choice([0, 1], n)
# Target is a function of features + noise
target = 10 + 2*x1 + 0.5*x2 + 3*x3 + np.random.randn(n)
df = pd.DataFrame({
"feature_linear": x1,
"feature_exp": x2,
"feature_binary": x3,
"feature_noise": np.random.randn(n), # Unrelated feature
"price": target # Continuous target
})
# Analyze with continuous target
report = sv.analyze(
source=df,
target_feat="price"
)
report.show_html("regression_target_analysis.html")
```
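For a continuous target, a quick numeric cross-check is the Pearson correlation of each feature with the target. The sketch below rebuilds a dataset with the same structure as the example above (using `np.random.default_rng`, so the exact values differ) and ranks features by absolute correlation. It is a pandas-only illustration, not Sweetviz's own association measure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

x1 = rng.standard_normal(n)
x2 = rng.exponential(5, n)
x3 = rng.choice([0, 1], n)

df = pd.DataFrame({
    "feature_linear": x1,
    "feature_exp": x2,
    "feature_binary": x3,
    "feature_noise": rng.standard_normal(n),  # unrelated to the target
    "price": 10 + 2 * x1 + 0.5 * x2 + 3 * x3 + rng.standard_normal(n),
})

# Pearson correlation of every feature with the continuous target,
# sorted from strongest to weakest absolute correlation
corr_with_target = (
    df.corr()["price"]
    .drop("price")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target.round(3))
```

The unrelated `feature_noise` column should land at the bottom of the ranking with a correlation near zero.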
**Multi-class Target:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
np.random.seed(42)
n = 2500
df = pd.DataFrame({
"feature_1": np.random.randn(n),
"feature_2": np.random.uniform(0, 100, n),
"category": np.random.choice(["A", "B", "C"], n),
"class_label": np.random.choice(
["Class_A", "Class_B", "Class_C", "Class_D"],
n, p=[0.4, 0.3, 0.2, 0.1]
)
})
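# Sweetviz accepts only boolean or numerical targets, so a string label
# such as "class_label" cannot be passed as target_feat directly.
# One workaround: a one-vs-rest boolean target for a class of interest.
df["is_class_a"] = df["class_label"] == "Class_A"
report = sv.analyze(
    source=df.drop(columns="class_label"),
    target_feat="is_class_a"
)
report.show_html("multiclass_analysis.html")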
```
### 3. Dataset Comparison (Compare)
**Train vs Test Comparison:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Create sample dataset
np.random.seed(42)
n = 5000
df = pd.DataFrame({
"feature_1": np.random.randn(n),
"feature_2": np.random.exponential(50, n),
"feature_3": np.random.choice(["X", "Y", "Z"], n),
"feature_4": np.random.randint(1, 100, n),
"target": np.random.choice([0, 1], n, p=[0.75, 0.25])
})
# Split into train and test
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
# Compare train vs test datasets
comparison_report = sv.compare(
source=[train_df, "Training Data"],
compare=[test_df, "Test Data"],
target_feat="target"
)
comparison_report.show_html("train_test_comparison.html")
```
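Alongside the visual comparison, a small numeric summary can flag which columns shifted between the two datasets. The sketch below is a pandas-only illustration; `numeric_drift_summary` is a hypothetical helper, and the standardized mean difference it reports is one simple drift indicator among many, not something Sweetviz produces.

```python
import numpy as np
import pandas as pd


def numeric_drift_summary(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Compare the means of numeric columns shared by two frames.

    Hypothetical helper: std_diff is (test mean - train mean) divided by
    the train standard deviation.
    """
    shared = [
        c for c in train.select_dtypes("number").columns
        if c in test.columns and pd.api.types.is_numeric_dtype(test[c])
    ]
    summary = pd.DataFrame({
        "train_mean": train[shared].mean(),
        "test_mean": test[shared].mean(),
    })
    summary["std_diff"] = (summary["test_mean"] - summary["train_mean"]) / train[shared].std()
    return summary


rng = np.random.default_rng(42)
# Synthetic data: "shifted" has its test mean moved by one standard deviation
train_df = pd.DataFrame({
    "stable": rng.standard_normal(4000),
    "shifted": rng.standard_normal(4000),
})
test_df = pd.DataFrame({
    "stable": rng.standard_normal(1000),
    "shifted": rng.standard_normal(1000) + 1.0,
})

print(numeric_drift_summary(train_df, test_df).round(3))
```

A random split such as `train_test_split` should give `std_diff` values near zero for every column; larger values are a cue to inspect that feature in the comparison report.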
**Before vs After Comparison:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
np.random.seed(42)
# Original data with issues
df_before = pd.DataFrame({
"value": np.concatenate([
np.random.randn(900),
np.array([50, -30, 100, 75, -50]) # Outliers
]),
"category": np.random.choice(["A", "B", "C"], 905),
"score": np.random.uniform(0, 100, 905)
})
# Add missing values
df_before.loc[np.random.choice(905, 80), "value"] = np.nan
# Cleaned data
df_after = df_before.copy()
# Remove outliers using IQR
Q1 = df_after["value"].quantile(0.25)
Q3 = df_after["value"].quantile(0.75)
IQR = Q3 - Q1
df_after = df_after[
    (df_after["value"].isna()) |  # Keep NaN for now
    ((df_after["value"] >= Q1 - 1.5 * IQR) &
     (df_after["value"] <= Q3 + 1.5 * IQR))
].copy()  # Explicit copy avoids SettingWithCopyWarning on the assignment below
# Fill missing values
df_after["value"] = df_after["value"].fillna(df_after["value"].median())
# Compare before vs after cleaning
comparison = sv.compare(
source=[df_before, "Before Cleaning"],
compare=[df_after, "After Cleaning"]
)
comparison.show_html("cleaning_comparison.html")
```