ydata-profiling
Automated data quality reports with comprehensive variable analysis, missing value detection, correlations, and HTML report generation - formerly pandas-profiling
What this skill does
# YData Profiling Data Quality Skill
Master YData Profiling (formerly pandas-profiling) for automated data quality reports with comprehensive variable analysis, missing value patterns, correlation detection, and publication-ready HTML reports.
## When to Use This Skill
### USE YData Profiling when:
- **Data quality assessment** - Evaluating dataset health and completeness
- **Initial data exploration** - Understanding a new dataset quickly
- **Missing value analysis** - Detecting patterns in missing data
- **Variable analysis** - Understanding distributions and characteristics
- **Data documentation** - Creating shareable data quality reports
- **Dataset comparison** - Comparing training vs test data, or before/after
- **Stakeholder reporting** - Generating professional HTML reports
- **Data validation** - Checking data before ML model training
### DON'T USE YData Profiling when:
- **Real-time analysis** - Need streaming data profiling
- **Custom visualizations** - Specific chart requirements
- **Interactive dashboards** - Use Streamlit or Dash instead
- **Very large datasets** - Over 10M rows (use sampling or minimal mode)
- **Production pipelines** - Need lightweight validation (use Great Expectations)
## Prerequisites
```bash
# Basic installation
pip install ydata-profiling
# With all optional dependencies
pip install 'ydata-profiling[all]'
# Using uv (recommended)
uv pip install ydata-profiling pandas numpy
# Jupyter notebook support
pip install ydata-profiling ipywidgets notebook
# Verify installation
python -c "from ydata_profiling import ProfileReport; print('YData Profiling ready!')"
```
## Core Capabilities
### 1. Basic Profile Report Generation
**Simplest Usage:**
```python
from ydata_profiling import ProfileReport
import pandas as pd
# Load data
df = pd.read_csv("data.csv")
# Generate profile report
profile = ProfileReport(df, title="Data Quality Report")
# Save to HTML file
profile.to_file("report.html")
# Display in Jupyter notebook
profile.to_notebook_iframe()
```
**With Configuration:**
```python
from ydata_profiling import ProfileReport
import pandas as pd
df = pd.read_csv("sales_data.csv")
# Customized profile report
profile = ProfileReport(
df,
title="Sales Data Quality Report",
explorative=True, # Enable all analyses
dark_mode=False,
orange_mode=False,
config_file=None, # Or path to custom config
lazy=True # Defer computation
)
# Access specific sections
print(profile.description_set) # Variable descriptions
print(profile.get_description()) # Full description
# Save report
profile.to_file("sales_report.html")
```
**From DataFrame with Sample Data:**
```python
from ydata_profiling import ProfileReport
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create sample dataset
np.random.seed(42)
n = 5000
df = pd.DataFrame({
"customer_id": range(1, n + 1),
"name": [f"Customer_{i}" for i in range(n)],
"age": np.random.normal(40, 15, n).astype(int),
"income": np.random.exponential(50000, n),
"category": np.random.choice(["A", "B", "C", "D"], n, p=[0.4, 0.3, 0.2, 0.1]),
"registration_date": [
datetime(2020, 1, 1) + timedelta(days=int(d))
for d in np.random.uniform(0, 1825, n)
],
"is_active": np.random.choice([True, False], n, p=[0.8, 0.2]),
"score": np.random.uniform(0, 100, n),
"email": [f"customer_{i}@example.com" for i in range(n)]
})
# Add some missing values
df.loc[np.random.choice(n, 200), "income"] = np.nan
df.loc[np.random.choice(n, 150), "age"] = np.nan
# Generate report
profile = ProfileReport(df, title="Customer Data Profile")
profile.to_file("customer_profile.html")
```
### 2. Variable Analysis
**Understanding Variable Types:**
```python
from ydata_profiling import ProfileReport
import pandas as pd
import numpy as np
# Dataset with various variable types
df = pd.DataFrame({
# Numeric variables
"integer_col": np.random.randint(1, 100, 1000),
"float_col": np.random.randn(1000) * 100,
# Categorical variables
"category_high_card": [f"cat_{i}" for i in np.random.randint(1, 100, 1000)],
"category_low_card": np.random.choice(["A", "B", "C"], 1000),
# Boolean
"boolean_col": np.random.choice([True, False], 1000),
# Date/Time
"date_col": pd.date_range("2020-01-01", periods=1000, freq="H"),
# Text
"text_col": ["Sample text " * np.random.randint(1, 10) for _ in range(1000)],
# URL
"url_col": [f"https://example.com/page/{i}" for i in range(1000)],
# Constant
"constant_col": ["constant"] * 1000,
# Unique
"unique_col": range(1000)
})
profile = ProfileReport(
df,
title="Variable Types Analysis",
explorative=True
)
# The report will automatically detect:
# - Numeric: integer_col, float_col
# - Categorical: category_high_card, category_low_card
# - Boolean: boolean_col
# - DateTime: date_col
# - Text: text_col
# - URL: url_col
# - Constant: constant_col
# - Unique: unique_col (potentially ID column)
profile.to_file("variable_types_report.html")
```
**Detailed Variable Statistics:**
```python
from ydata_profiling import ProfileReport
import pandas as pd
import numpy as np
df = pd.DataFrame({
"revenue": np.random.exponential(1000, 5000),
"quantity": np.random.randint(1, 100, 5000),
"discount": np.random.uniform(0, 0.5, 5000),
"category": np.random.choice(["Electronics", "Clothing", "Food"], 5000)
})
profile = ProfileReport(df, title="Sales Variables Analysis")
# Access variable-level statistics programmatically
description = profile.get_description()
# Numeric variable statistics
for var_name, var_data in description.variables.items():
print(f"\n{var_name}:")
print(f" Type: {var_data['type']}")
if "mean" in var_data:
print(f" Mean: {var_data['mean']:.2f}")
print(f" Std: {var_data['std']:.2f}")
print(f" Min: {var_data['min']:.2f}")
print(f" Max: {var_data['max']:.2f}")
```
### 3. Missing Value Analysis
**Detecting Missing Patterns:**
```python
from ydata_profiling import ProfileReport
import pandas as pd
import numpy as np
# Create dataset with various missing patterns
np.random.seed(42)
n = 5000
df = pd.DataFrame({
"complete": np.random.randn(n), # No missing
"random_missing": np.where(
np.random.random(n) < 0.1,
np.nan,
np.random.randn(n)
),
"conditional_missing": np.where(
np.random.randn(n) > 1.5,
np.nan,
np.random.randn(n)
),
"block_missing": np.concatenate([
np.random.randn(4000),
np.full(1000, np.nan)
]),
"highly_missing": np.where(
np.random.random(n) < 0.7,
np.nan,
np.random.randn(n)
)
})
# Profile with missing value analysis
profile = ProfileReport(
df,
title="Missing Value Analysis",
missing_diagrams={
"bar": True,
"matrix": True,
"heatmap": True
}
)
profile.to_file("missing_analysis.html")
# Programmatic access to missing info
description = profile.get_description()
print("\nMissing Value Summary:")
for var_name, var_data in description.variables.items():
missing_count = var_data.get("n_missing", 0)
missing_pct = var_data.get("p_missing", 0) * 100
print(f" {var_name}: {missing_count} ({missing_pct:.1f}%)")
```
**Missing Value Configuration:**
```python
from ydata_profiling import ProfileReport
import pandas as pd
df = pd.read_csv("data_with_missing.csv")
# Detailed missing value analysis
profile = ProfileReport(
df,
title="Missing Value Deep Dive",
missing_diagrams={
"bar": True, # Bar chart of missing values per variable
"matrix": True, # Nullity matrix (pattern visualization)
"heatmap": True # Nullity correlation heatmap
},
# Treat certain values as missing
vars={
"num": {
"low_categorical_threshold": 0
}
}
)
profile.to_file("missing_deep_dive.html")
Related in data-analysis
pandas
IncludedExpert data analysis and manipulation for customer support operations using pandas
autoviz
IncludedAutomatic exploratory data analysis and visualization with a single line of code - generates comprehensive charts, detects patterns, and exports to HTML/notebooks
dash
IncludedBuild production-grade interactive dashboards with Plotly Dash - enterprise features, callbacks, and scalable deployment
great-tables
IncludedPublication-quality tables in Python with rich styling, formatting, conditional formatting, and export to HTML/images - inspired by R's gt package
polars
IncludedHigh-performance DataFrame library for fast data processing with lazy evaluation, parallel execution, and memory efficiency
streamlit
IncludedBuild interactive data applications and dashboards with pure Python - no frontend experience required