sweetviz

Automated EDA comparison reports with target analysis, feature comparison, and HTML report generation for pandas DataFrames

Tags: data-analysis, sweetviz, eda, comparison, target-analysis, html-report, feature-comparison, visualization

# Sweetviz EDA Comparison Skill

Use Sweetviz for automated exploratory data analysis (EDA). It generates self-contained HTML reports, compares two datasets side by side (for example, train vs test), and shows how each feature relates to a target variable.

## When to Use This Skill

### USE Sweetviz when:
- **Dataset comparison** - Comparing train vs test, before vs after, or any two datasets
- **Target variable analysis** - Understanding how features relate to a target
- **Quick EDA reports** - Need comprehensive EDA in one line of code
- **Feature comparison** - Analyzing feature distributions across subsets
- **HTML reports** - Creating shareable, interactive analysis reports
- **Intra-set analysis** - Comparing subpopulations within a dataset
- **Data validation** - Checking for data drift between datasets
- **Feature selection** - Identifying important features for modeling

### DON'T USE Sweetviz when:
- **Very large datasets** - Over roughly 1M rows (profile a random sample instead)
- **Streaming data** - Need real-time analysis
- **Deep statistical tests** - Need p-values and hypothesis testing
- **Custom visualizations** - Specific chart requirements
- **Interactive dashboards** - Use Streamlit or Dash instead
- **Text/NLP analysis** - Use dedicated NLP tools
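
The sampling suggested above for very large datasets can be sketched with pandas alone. The helper name `sample_for_eda` and the `max_rows` cap of 100,000 are illustrative assumptions, not part of Sweetviz:

```python
import numpy as np
import pandas as pd


def sample_for_eda(df: pd.DataFrame, max_rows: int = 100_000, seed: int = 42) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Illustrative data: 250,000 rows reduced to 100,000 before profiling
big_df = pd.DataFrame({"value": np.arange(250_000), "flag": np.arange(250_000) % 2})
small_df = sample_for_eda(big_df)
print(small_df.shape)  # (100000, 2)

# The sampled frame can then be passed to Sweetviz as usual:
# import sweetviz as sv
# sv.analyze(small_df).show_html("sampled_report.html")
```

A fixed `random_state` keeps the sample, and therefore the report, reproducible between runs.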

## Prerequisites

```bash
# Basic installation
pip install sweetviz

# Using uv (recommended)
uv pip install sweetviz pandas numpy

# With Jupyter support
pip install sweetviz pandas numpy jupyter

# Verify installation
python -c "import sweetviz as sv; print(f'Sweetviz version: {sv.__version__}')"
```

### System Requirements

- Python 3.6 or higher
- pandas 0.25.3 or higher
- numpy
- matplotlib (for internal plotting)
- Modern web browser (for viewing HTML reports)
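
To check the whole requirements list at once rather than only Sweetviz, a small standard-library script along these lines may help. It assumes Python 3.8+ (for `importlib.metadata`), and `installed_versions` is a hypothetical helper name:

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, version in installed_versions(["sweetviz", "pandas", "numpy", "matplotlib"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```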

## Core Capabilities

### 1. Basic EDA Report (Analyze)

**Single Dataset Analysis:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("data.csv")

# Generate basic EDA report
report = sv.analyze(df)

# Save HTML report (opens in the default browser unless open_browser=False)
report.show_html("sweetviz_report.html")

# Or embed the report inline in a Jupyter notebook cell
report.show_notebook()
```

**With Source Name:**
```python
import sweetviz as sv
import pandas as pd

df = pd.read_csv("sales_data.csv")

# Generate report with a custom display name for the dataset
report = sv.analyze(
    source=[df, "Sales Data"],  # [DataFrame, name shown in the report]
    pairwise_analysis="auto"  # "on", "off", or "auto"
)

report.show_html("sales_analysis.html", open_browser=True)
```
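
The Sweetviz documentation describes pairwise association analysis as scaling quadratically with the number of features, which is why `pairwise_analysis` can be switched off for wide tables. The sketch below picks a setting from the column count. The helper `pick_pairwise_mode` and its cutoff of 50 columns are illustrative assumptions, not library constants:

```python
import pandas as pd


def pick_pairwise_mode(df: pd.DataFrame, max_columns: int = 50) -> str:
    """Return "on" for narrow frames and "off" for wide ones (hypothetical rule)."""
    return "on" if df.shape[1] <= max_columns else "off"


narrow = pd.DataFrame({f"col_{i}": range(10) for i in range(5)})
wide = pd.DataFrame({f"col_{i}": range(10) for i in range(120)})

print(pick_pairwise_mode(narrow))  # on
print(pick_pairwise_mode(wide))    # off

# The result is meant to be passed straight through, for example:
# import sweetviz as sv
# sv.analyze(source=wide, pairwise_analysis=pick_pairwise_mode(wide))
```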

**Sample Dataset for Examples:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create comprehensive sample dataset
np.random.seed(42)
n = 5000

df = pd.DataFrame({
    # Numeric features
    "age": np.random.randint(18, 80, n),
    "income": np.random.exponential(50000, n),
    "credit_score": np.random.normal(700, 50, n).clip(300, 850).astype(int),
    "account_balance": np.random.exponential(10000, n),
    "transaction_count": np.random.poisson(15, n),

    # Categorical features
    "gender": np.random.choice(["Male", "Female", "Other"], n, p=[0.48, 0.48, 0.04]),
    "education": np.random.choice(
        ["High School", "Bachelor", "Master", "PhD"],
        n, p=[0.3, 0.4, 0.2, 0.1]
    ),
    "employment_status": np.random.choice(
        ["Employed", "Self-employed", "Unemployed", "Retired"],
        n, p=[0.6, 0.2, 0.1, 0.1]
    ),
    "region": np.random.choice(["North", "South", "East", "West"], n),

    # Date feature
    "join_date": [
        datetime(2020, 1, 1) + timedelta(days=int(d))
        for d in np.random.uniform(0, 1460, n)
    ],

    # Target variable (binary classification)
    "churned": np.random.choice([0, 1], n, p=[0.8, 0.2])
})

# Add some missing values
df.loc[np.random.choice(n, 200), "income"] = np.nan
df.loc[np.random.choice(n, 100), "credit_score"] = np.nan
df.loc[np.random.choice(n, 150), "education"] = np.nan

# Basic analysis
report = sv.analyze(df)
report.show_html("customer_analysis.html")
```
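
Note that `np.random.choice(n, 200)` samples with replacement, so slightly fewer than 200 distinct rows may end up missing. If exact counts matter, one option is to sample without replacement and cross-check the totals with pandas before reading the report. This is a reduced sketch of the dataset above, using NumPy's `default_rng` generator as an assumption of convenience:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    "income": rng.exponential(50000, n),
    "education": rng.choice(["High School", "Bachelor", "Master", "PhD"], n),
})

# replace=False guarantees distinct row positions, so the counts are exact
df.loc[rng.choice(n, size=200, replace=False), "income"] = np.nan
df.loc[rng.choice(n, size=150, replace=False), "education"] = np.nan

# Missing-value summary to cross-check against the HTML report
missing = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(2),
})
print(missing)
```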

### 2. Target Variable Analysis

**Binary Target Analysis:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np

# Create dataset with target variable
np.random.seed(42)
n = 3000

df = pd.DataFrame({
    "feature_1": np.random.randn(n),
    "feature_2": np.random.exponential(10, n),
    "feature_3": np.random.choice(["A", "B", "C"], n),
    "feature_4": np.random.randint(1, 100, n),
    "target": np.random.choice([0, 1], n, p=[0.7, 0.3])
})

# Analyze with target variable
# Shows how each feature relates to the target
report = sv.analyze(
    source=df,
    target_feat="target"  # Specify target column
)

report.show_html("target_analysis.html")
```
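
As a rough cross-check of what the target view summarizes, the share of positive targets per category or per bin can be computed directly with pandas. This is a sketch of the idea, not Sweetviz's internal method. The four equal-width bins are an arbitrary choice, and because the features here are random the rates should all sit near the overall 30%:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 3000
df = pd.DataFrame({
    "feature_3": np.random.choice(["A", "B", "C"], n),
    "feature_4": np.random.randint(1, 100, n),
    "target": np.random.choice([0, 1], n, p=[0.7, 0.3]),
})

# Share of target == 1 within each category, with the group size alongside
rate_by_category = df.groupby("feature_3")["target"].agg(["mean", "count"])
print(rate_by_category)

# Share of target == 1 within four equal-width bins of a numeric feature
bins = pd.cut(df["feature_4"], bins=4)
rate_by_bin = df.groupby(bins, observed=True)["target"].mean()
print(rate_by_bin)
```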

**Continuous Target Analysis:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np

# Regression target example
np.random.seed(42)
n = 2000

# Features that affect the target
x1 = np.random.randn(n)
x2 = np.random.exponential(5, n)
x3 = np.random.choice([0, 1], n)

# Target is a function of features + noise
target = 10 + 2*x1 + 0.5*x2 + 3*x3 + np.random.randn(n)

df = pd.DataFrame({
    "feature_linear": x1,
    "feature_exp": x2,
    "feature_binary": x3,
    "feature_noise": np.random.randn(n),  # Unrelated feature
    "price": target  # Continuous target
})

# Analyze with continuous target
report = sv.analyze(
    source=df,
    target_feat="price"
)

report.show_html("regression_target_analysis.html")
```
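
For a continuous target, a quick sanity check on the report is to compute each feature's Pearson correlation with the target in pandas. The sketch below rebuilds the same synthetic data. The unrelated `feature_noise` column should come out near zero, while the three features used to construct `price` should not:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 2000
x1 = np.random.randn(n)
x2 = np.random.exponential(5, n)
x3 = np.random.choice([0, 1], n)

df = pd.DataFrame({
    "feature_linear": x1,
    "feature_exp": x2,
    "feature_binary": x3,
    "feature_noise": np.random.randn(n),
    "price": 10 + 2 * x1 + 0.5 * x2 + 3 * x3 + np.random.randn(n),
})

# Pearson correlation of every feature with the continuous target
correlations = df.corr()["price"].drop("price").sort_values(ascending=False)
print(correlations.round(3))
```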

**Multi-class Target:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np

np.random.seed(42)
n = 2500

df = pd.DataFrame({
    "feature_1": np.random.randn(n),
    "feature_2": np.random.uniform(0, 100, n),
    "category": np.random.choice(["A", "B", "C"], n),
    "class_label": np.random.choice(
        ["Class_A", "Class_B", "Class_C", "Class_D"],
        n, p=[0.4, 0.3, 0.2, 0.1]
    )
})

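# Sweetviz documents only boolean and numerical targets, so a string-valued
# class column is converted to a one-vs-rest boolean indicator first
df["is_class_a"] = df["class_label"] == "Class_A"

# Analyze one class against the rest; drop the original label column
report = sv.analyze(
    source=df.drop(columns=["class_label"]),
    target_feat="is_class_a"
)

report.show_html("multiclass_analysis.html")
```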
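
A one-vs-rest split is one way to examine every class of a label column in turn, which also sidesteps Sweetviz's documented restriction to boolean and numerical targets. The generator `one_vs_rest_frames` below is a hypothetical helper written for this sketch:

```python
import numpy as np
import pandas as pd


def one_vs_rest_frames(df: pd.DataFrame, target_col: str):
    """Yield (label, frame) pairs where the target is replaced by a boolean indicator."""
    for label in sorted(df[target_col].dropna().unique()):
        frame = df.drop(columns=[target_col])
        frame[f"is_{label}"] = df[target_col] == label
        yield label, frame


np.random.seed(42)
n = 2500
df = pd.DataFrame({
    "feature_1": np.random.randn(n),
    "class_label": np.random.choice(
        ["Class_A", "Class_B", "Class_C", "Class_D"], n, p=[0.4, 0.3, 0.2, 0.1]
    ),
})

for label, frame in one_vs_rest_frames(df, "class_label"):
    # Print the share of rows belonging to this class
    print(label, round(frame[f"is_{label}"].mean(), 3))
    # Each frame can be profiled separately, for example:
    # import sweetviz as sv
    # sv.analyze(source=frame, target_feat=f"is_{label}").show_html(f"{label}_report.html")
```

This produces one report per class, which is manageable for a handful of classes but becomes unwieldy for dozens.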

### 3. Dataset Comparison (Compare)

**Train vs Test Comparison:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create sample dataset
np.random.seed(42)
n = 5000

df = pd.DataFrame({
    "feature_1": np.random.randn(n),
    "feature_2": np.random.exponential(50, n),
    "feature_3": np.random.choice(["X", "Y", "Z"], n),
    "feature_4": np.random.randint(1, 100, n),
    "target": np.random.choice([0, 1], n, p=[0.75, 0.25])
})

# Split into train and test
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

# Compare train vs test datasets
comparison_report = sv.compare(
    source=[train_df, "Training Data"],
    compare=[test_df, "Test Data"],
    target_feat="target"
)

comparison_report.show_html("train_test_comparison.html")
```
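
The comparison report is visual. When a numeric summary of drift is also wanted, a simple complement is to express the difference in column means in units of the training standard deviation. The function `numeric_drift` and the synthetic `stable` and `drifted` columns are assumptions made for this sketch, and the measure is a rough heuristic rather than a statistical test:

```python
import numpy as np
import pandas as pd


def numeric_drift(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Compare numeric column means, scaled by the training standard deviation."""
    numeric_cols = train.select_dtypes(include="number").columns
    summary = pd.DataFrame({
        "train_mean": train[numeric_cols].mean(),
        "test_mean": test[numeric_cols].mean(),
    })
    summary["shift_in_std"] = (
        (summary["test_mean"] - summary["train_mean"]) / train[numeric_cols].std()
    )
    return summary


rng = np.random.default_rng(42)
train_df = pd.DataFrame({"stable": rng.normal(0, 1, 4000), "drifted": rng.normal(0, 1, 4000)})
test_df = pd.DataFrame({"stable": rng.normal(0, 1, 1000), "drifted": rng.normal(2, 1, 1000)})

# "stable" should show a shift near 0, "drifted" a shift near 2 standard deviations
print(numeric_drift(train_df, test_df).round(2))
```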

**Before vs After Comparison:**
```python
import sweetviz as sv
import pandas as pd
import numpy as np

np.random.seed(42)

# Original data with issues
df_before = pd.DataFrame({
    "value": np.concatenate([
        np.random.randn(900),
        np.array([50, -30, 100, 75, -50])  # Outliers
    ]),
    "category": np.random.choice(["A", "B", "C"], 905),
    "score": np.random.uniform(0, 100, 905)
})

# Add missing values
df_before.loc[np.random.choice(905, 80), "value"] = np.nan

# Cleaned data
df_after = df_before.copy()

# Remove outliers using IQR
Q1 = df_after["value"].quantile(0.25)
Q3 = df_after["value"].quantile(0.75)
IQR = Q3 - Q1
df_after = df_after[
    (df_after["value"].isna()) |  # Keep NaN for now
    ((df_after["value"] >= Q1 - 1.5 * IQR) &
     (df_after["value"] <= Q3 + 1.5 * IQR))
].copy()  # Independent copy so the fillna assignment below does not warn

# Fill missing values
df_after["value"] = df_after["value"].fillna(df_after["value"].median())

# Compare before vs after cleaning
comparison = sv.compare(
    source=[df_before, "Before Cleaning"],
    compare=[df_after, "After Cleaning"]
)

comparison.show_html("cleaning_comparison.html")
```
