Claude
Skills
Sign in
Back

databricks-synthetic-data-gen

Included with Lifetime
$97 forever

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.

Data & Analyticsscriptsassets

What this skill does


> Catalog and schema are **always user-supplied** — never default to any value. If the user hasn't provided them, ask. For any UC write, **always create the schema if it doesn't exist** before writing data.

# Databricks Synthetic Data Generation

Generate realistic, story-driven synthetic data for Databricks using **Spark + Faker + Pandas UDFs** (strongly recommended).

## Data Must Tell a Business Story

Synthetic data should demonstrate how Databricks helps solve real business problems.

**The pattern:** Something goes wrong → business impact ($) → analyze root cause → identify affected customers → fix and prevent.

**Key principles:**
- **Problem → Impact → Analysis → Solution** — Include an incident, anomaly, or issue that causes measurable business impact. The data lets you find the root cause and act on it.
- **Industry-relevant but simple** — Use domain terms (e.g., "SLA breach", "churn", "stockout") but keep the schema easy to understand. A few tables, clear relationships.
- **Business metrics with $ impact** — Revenue, MRR, cost, conversion rate. Every story needs a dollar sign to show why it matters.
- **Tables explain each other** — Ticket spike? Incident table shows the outage. Revenue drop? Churn table shows who left and why. All data connects.
- **Actionable insights** — Data should answer: What happened? Who's affected? How much did it cost? How do we prevent it?

**Why no flat distributions:** Uniform data has no story — no spikes, no anomalies, no cohort, no 20/80, no skew, nothing to investigate. It can't show Databricks' value for root cause analysis.

## References

| When | Guide |
|------|-------|
| User mentions **ML model training** or complex time patterns | [references/1-data-patterns.md](references/1-data-patterns.md) — ML-ready data, time multipliers, row coherence |
| Errors during generation | [references/2-troubleshooting.md](references/2-troubleshooting.md) — Fixing common issues |

## Critical Rules

1. **Data tells a story** — Something goes wrong, impacts $, can be analyzed and fixed. Show Databricks value.
2. **All data serves the story** — Every table and column must be coherent and usable in dashboards or ML models. No orphan data, no random noise — if it doesn't help explain or plot a futur dashboard or predict, don't generate it.
3. **Industry terms, simple schema** — Use domain-specific vocabulary but keep it easy to understand (few tables, clear relationships)
4. **Never uniform distributions** — Skewed categories, log-normal amounts, 80/20 patterns. Flat = no story = useless
5. **Enough data for trends** — ~100K+ rows for main tables so patterns survive aggregation
6. **Ask for catalog/schema** — Never default, always confirm before generating
7. **Present plan for approval** — Show tables, distributions, assumptions before writing code
8. **Master tables first** — Generate parent tables, write to Delta, then create children with valid FKs
9. **Use Spark + Faker + Pandas UDFs** — Scalable, parallel. Polars only if user explicitly wants local + <30K rows
10. **Use Databricks Connect Serverless by default to generate data** — Update databricks-connect on python 3.12 if required (avoid using execute_code unless instructed to not use Databricks Connect)
11. **No `.cache()` or `.persist()`** — Not supported on serverless. Write to Delta, read back for joins
12. **No Python loops or `.collect()`** — Use Spark parallelism. No driver-side iteration, avoid Pandas↔Spark conversions

## Generation Planning Workflow

**Before generating any code, you MUST present a plan for user approval.**

### ⚠️ MUST DO: Confirm Catalog Before Proceeding

**You MUST explicitly ask the user which catalog to use.** Do not assume or proceed without confirmation.

Example prompt to user:
> "Which Unity Catalog should I use for this data?"

When presenting your plan, always show the selected catalog prominently:
```
📍 Output Location: catalog_name.schema_name
   Volume: /Volumes/catalog_name/schema_name/raw_data/
```

This makes it easy for the user to spot and correct if needed.

### Step 1: Gather Requirements

Ask the user about:
- **Catalog/Schema** — Which catalog to use?
- **Domain** — E-commerce, support tickets, IoT, financial? (Use industry terms)

**If user doesn't specify a story:** Propose one. Don't generate bland data — suggest an incident, anomaly, or trend that shows Databricks value (e.g., "I'll include a system outage that causes ticket spike and churn — this lets you demo root cause analysis").

### Step 2: Present Plan with Story

Show a clear specification with **the business story and your assumptions surfaced**:

```
📍 Output Location: {user_catalog}.support_demo
   Volume: /Volumes/{user_catalog}/support_demo/raw_data/

📖 Story: A payment system outage causes support ticket spike. Resolution times
   degrade, enterprise customers churn, revenue drops $2.3M. With Databricks we
   identify the root cause, affected customers, and prevent future impact.
```

| Table | Description | Rows | Key Assumptions |
|-------|-------------|------|-----------------|
| customers | Customer profiles with tier, MRR | 10,000 | Enterprise 10% but 60% of revenue |
| tickets | Support tickets with priority, resolution_time | 80,000 | Spike during outage, SLA breaches |
| incidents | System events (outages, deployments) | 50 | Payment outage mid-month |
| churn_events | Customer cancellations with reason | 500 | Spike after poor support experience |

**Business metrics:**
- `customers.mrr` — Revenue at risk ($)
- `tickets.resolution_hours` — SLA performance
- `churn_events.lost_mrr` — Churn impact ($)

**The story this data tells:**
- Incident table shows payment outage on March 15
- Tickets spike 5x during outage, resolution time degrades from 4h → 18h
- Enterprise customers with SLA breaches churn 3 weeks later
- Total impact: $2.3M lost MRR, traceable to one incident
- **Databricks value:** Root cause analysis, identify at-risk customers, build alerting

**Ask user**: "Does this story work? Any adjustments?"

### Step 3: Ask About Data Features

- [x] Skew (non-uniform distributions) - **Enabled by default**
- [x] Joins (referential integrity) - **Enabled by default**
- [ ] Bad data injection (for data quality testing)
- [ ] Multi-language text
- [ ] Incremental mode (append instead of overwrite)

### Pre-Generation Checklist

- [ ] **Catalog confirmed** - User explicitly approved which catalog to use
- [ ] Output location shown prominently in plan (easy to spot/change)
- [ ] Table specification shown and approved
- [ ] Assumptions about distributions confirmed
- [ ] User confirmed compute preference (Databricks Connect on serverless recommended)
- [ ] Data features selected

**Do NOT proceed to code generation until user approves the plan, including the catalog.**

### Post-Generation Validation

Use `databricks experimental aitools tools query` to validate generated data (row counts, distributions, referential integrity). Query parquet files directly:

```bash
databricks experimental aitools tools query --warehouse $WAREHOUSE_ID "
SELECT COUNT(*) FROM parquet.\`/Volumes/CATALOG/SCHEMA/raw_data/customers\`
"
```

See [references/2-troubleshooting.md](references/2-troubleshooting.md) for full validation examples.

## Use Databricks Connect Spark + Faker Pattern

```python
from databricks.connect import DatabricksSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import pandas as pd

# Setup serverless Spark session
spark = DatabricksSession.builder.serverless(True).getOrCreate()

# Pandas UDF pattern - import lib INSIDE the function (libs must be installed locally)
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    from faker import Faker  # Import inside UDF
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

# Generate with spark.range, apply UDFs
customers_df = spark.range(0, 10000, numPartitions=16).select(
    F.concat(F.lit("CUST-"),

Related in Data & Analytics