
monte-carlo-push-ingestion


Expert guide for pushing metadata, lineage, and query logs to Monte Carlo from any data warehouse.

Tags: data, data-observability, ingestion, monte-carlo, pycarlo, metadata, scripts

What this skill does


# Monte Carlo Push Ingestion

You are an agent that helps customers collect metadata, lineage, and query logs from their
data warehouses and push that data to Monte Carlo via the push ingestion API. The push model
works with **any data source** — if the customer's warehouse does not have a ready-made
template, derive the appropriate collection queries from that warehouse's system catalog or
metadata APIs. The push format and pycarlo SDK calls are the same regardless of source.

Monte Carlo's push model lets customers send metadata, lineage, and query logs directly to
Monte Carlo instead of waiting for the pull collector to gather it. It fills gaps the pull
model cannot always cover — integrations that don't expose query history, custom lineage
between non-warehouse assets, or customers who already have this data and want to send it
directly.

## When to Use

Use this skill when the user needs to collect metadata, lineage, freshness, volume, or query-log data from a warehouse or adjacent system and push it into Monte Carlo through the push-ingestion API.

Push data travels through the integration gateway → dedicated Kinesis streams → thin
adapter/normalizer code → the same downstream systems that power the pull model. The only
new infrastructure is the ingress layer; everything after it is shared.

## MANDATORY — Always start from templates

When generating any push-ingestion script, you MUST:

1. **Read the corresponding template** before writing any code. Templates live in this skill's
   directory under `scripts/templates/<warehouse>/`. To find them, glob for
   `**/push-ingestion/scripts/templates/<warehouse>/*.py` — this works regardless of where the
   skill is installed. Do NOT search from the current working directory alone.
2. **Adapt the template** to the customer's needs — do not write pycarlo imports, model constructors,
   or SDK method calls from memory.
3. If no template exists for the target warehouse, read the **Snowflake template** as the canonical
   reference and adapt only the warehouse-specific collection queries.
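
The lookup in steps 1 and 3 can be sketched as follows. This is a minimal illustration, not part of the skill: `find_templates` and `search_root` are hypothetical names, and the caller is assumed to pass a root that is broad enough to contain the skill's install location (not just the current working directory).

```python
from pathlib import Path


def find_templates(search_root: Path, warehouse: str, fallback: str = "snowflake") -> list[Path]:
    """Return template scripts for `warehouse`, or for `fallback` if none exist.

    Hypothetical helper: the glob pattern is the one quoted in step 1, and the
    Snowflake fallback mirrors step 3.
    """
    for name in (warehouse, fallback):
        pattern = f"**/push-ingestion/scripts/templates/{name}/*.py"
        matches = sorted(search_root.glob(pattern))
        if matches:
            return matches
    return []
```

Calling `find_templates(Path.home(), "redshift")` would return the Redshift templates if they are installed somewhere under the home directory, and the Snowflake templates otherwise.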

Template files follow this naming pattern:
- `collect_<flow>.py` — collection only (queries the warehouse, writes a JSON manifest)
- `push_<flow>.py` — push only (reads the manifest, sends to Monte Carlo)
- `collect_and_push_<flow>.py` — combined (imports from both, runs in sequence)
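
The collect/push split hands data over through a JSON manifest. A rough sketch of that handoff is shown here; the manifest layout (`flow` and `records` keys) and both function names are assumptions made for illustration, so check the actual templates for the real format.

```python
import json
from pathlib import Path


def write_manifest(path: Path, flow: str, records: list[dict]) -> None:
    """Collection step: persist collected rows as a JSON manifest.

    The {"flow": ..., "records": ...} layout is a hypothetical example.
    """
    path.write_text(json.dumps({"flow": flow, "records": records}, indent=2))


def read_manifest(path: Path, expected_flow: str) -> list[dict]:
    """Push step: load the manifest and confirm it belongs to this flow."""
    manifest = json.loads(path.read_text())
    if manifest["flow"] != expected_flow:
        raise ValueError(
            f"manifest is for {manifest['flow']!r}, expected {expected_flow!r}"
        )
    return manifest["records"]
```

Keeping the two steps separate means a failed push can be retried from the manifest without querying the warehouse again.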

**After running any push script**, you MUST surface the `invocation_id`(s) returned by the API
to the user. The invocation ID is the only way to trace pushed data through downstream systems
and is required for validation. Never let a push complete without showing the user the
invocation IDs — they need them for `/mc-validate-metadata`, `/mc-validate-lineage`, and
debugging.
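
One way to make sure invocation IDs are never lost is to route every push response through a single reporting helper. The sketch below assumes `service` is an `IngestionService` (or anything exposing `extract_invocation_id`, as listed in the canonical API section); `surface_invocation_ids` itself is a hypothetical name, and failing on a missing ID is a design assumption rather than documented behavior.

```python
def surface_invocation_ids(service, results: list) -> list:
    """Print and return the invocation ID for every push response."""
    invocation_ids = []
    for result in results:
        invocation_id = service.extract_invocation_id(result)
        if not invocation_id:
            # Assumption: a missing ID should stop the script rather than pass silently.
            raise RuntimeError("push response did not contain an invocation ID")
        print(f"invocation_id: {invocation_id}")
        invocation_ids.append(invocation_id)
    return invocation_ids
```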

## Canonical pycarlo API — authoritative reference

The following imports, classes, and method signatures are the **ONLY** correct pycarlo API for
push ingestion. If your training data suggests different names, **it is wrong**. Use exactly
what is listed here.

### Imports and client setup

```python
import os

from pycarlo.core import Client, Session
from pycarlo.features.ingestion import IngestionService
from pycarlo.features.ingestion.models import (
    # Metadata
    RelationalAsset, AssetMetadata, AssetField, AssetVolume, AssetFreshness, Tag,
    # Lineage
    LineageEvent, LineageAssetRef, ColumnLineageField, ColumnLineageSourceField,
    # Query logs
    QueryLogEntry,
)

# Ingestion key (scope=Ingestion), read from the variables defined under
# "Environment variable conventions".
key_id = os.environ["MCD_INGEST_ID"]
key_token = os.environ["MCD_INGEST_TOKEN"]

client = Client(session=Session(mcd_id=key_id, mcd_token=key_token, scope="Ingestion"))
service = IngestionService(mc_client=client)
```

### Method signatures

```python
# Metadata
service.send_metadata(resource_uuid=..., resource_type=..., events=[RelationalAsset(...)])

# Lineage (table or column)
service.send_lineage(resource_uuid=..., resource_type=..., events=[LineageEvent(...)])

# Query logs — note: log_type, NOT resource_type
service.send_query_logs(resource_uuid=..., log_type=..., events=[QueryLogEntry(...)])

# Extract invocation ID from any response
service.extract_invocation_id(result)
```
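
Because query logs take `log_type` where the other two flows take `resource_type`, a small dispatcher can keep that difference in one place. This is only a sketch: `push_events`, `flow`, and `type_value` are hypothetical names, `service` is assumed to expose exactly the methods listed above, and the valid values for `resource_type` and `log_type` should be taken from the templates.

```python
def push_events(service, flow: str, resource_uuid: str, type_value: str, events: list):
    """Send events for one flow and return the invocation ID of the response."""
    if flow == "metadata":
        result = service.send_metadata(
            resource_uuid=resource_uuid, resource_type=type_value, events=events
        )
    elif flow == "lineage":
        result = service.send_lineage(
            resource_uuid=resource_uuid, resource_type=type_value, events=events
        )
    elif flow == "query_logs":
        # Query logs use log_type, not resource_type.
        result = service.send_query_logs(
            resource_uuid=resource_uuid, log_type=type_value, events=events
        )
    else:
        raise ValueError(f"unknown flow: {flow!r}")
    return service.extract_invocation_id(result)
```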

### RelationalAsset structure (nested, NOT flat)

```python
RelationalAsset(
    type="TABLE",  # ONLY "TABLE" or "VIEW" (uppercase) — normalize warehouse-native values
    metadata=AssetMetadata(
        name="my_table",
        database="analytics",
        schema="public",
        description="optional description",
    ),
    fields=[
        AssetField(name="id", type="INTEGER", description=None),
        AssetField(name="amount", type="DECIMAL(10,2)"),
    ],
    volume=AssetVolume(row_count=1000000, byte_count=111111111),  # optional
    freshness=AssetFreshness(last_update_time="2026-03-12T14:30:00Z"),  # optional
)
```
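
Since `type` accepts only `"TABLE"` or `"VIEW"`, warehouse-native values need to be mapped before building the asset. The sketch below covers the standard `INFORMATION_SCHEMA.TABLES` values `BASE TABLE` and `VIEW`; `normalize_asset_type` is a hypothetical helper, and how to classify other native types (materialized views, external tables, and so on) is deliberately left to the caller rather than guessed.

```python
_TYPE_MAP = {
    "TABLE": "TABLE",
    "BASE TABLE": "TABLE",  # standard INFORMATION_SCHEMA.TABLES value
    "VIEW": "VIEW",
}


def normalize_asset_type(native_type: str) -> str:
    """Map a warehouse-native table type to "TABLE" or "VIEW"."""
    key = " ".join(native_type.upper().split())  # uppercase, collapse whitespace
    try:
        return _TYPE_MAP[key]
    except KeyError:
        # Assumption: unknown types should be decided explicitly, not defaulted.
        raise ValueError(f"no mapping for warehouse table type {native_type!r}") from None
```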

## Environment variable conventions

All generated scripts MUST use these exact variable names. Do NOT invent alternatives like
`MCD_KEY_ID`, `MC_TOKEN`, `MONTE_CARLO_KEY`, etc.

| Variable | Purpose | Used by |
|---|---|---|
| `MCD_INGEST_ID` | Ingestion key ID (scope=Ingestion) | push scripts |
| `MCD_INGEST_TOKEN` | Ingestion key secret | push scripts |
| `MCD_ID` | GraphQL API key ID | verification scripts |
| `MCD_TOKEN` | GraphQL API key secret | verification scripts |
| `MCD_RESOURCE_UUID` | Warehouse resource UUID | all scripts |
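
A script can fail fast when any of these variables is unset. The following is a minimal sketch using the names from the table; `load_env`, `PUSH_VARS`, and `VERIFY_VARS` are hypothetical names introduced for illustration.

```python
import os

PUSH_VARS = ("MCD_INGEST_ID", "MCD_INGEST_TOKEN", "MCD_RESOURCE_UUID")
VERIFY_VARS = ("MCD_ID", "MCD_TOKEN", "MCD_RESOURCE_UUID")


def load_env(required, env=None) -> dict:
    """Return the required variables, raising if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

A push script would call `load_env(PUSH_VARS)` at startup, and a verification script `load_env(VERIFY_VARS)`.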

## What this skill can build for you

Tell Claude your warehouse or data platform and Monte Carlo resource UUID and this skill will
generate a ready-to-run Python script that:
- Connects to your warehouse using the idiomatic driver for that platform
- Discovers databases, schemas, and tables
- Extracts the relevant metadata: column names and types, row counts, byte counts, last-modified time, and descriptions
- Builds the correct pycarlo `RelationalAsset`, `LineageEvent`, or `QueryLogEntry` objects
- Pushes to Monte Carlo and saves an output manifest with the `invocation_id` for tracing

Templates are available for common warehouses (Snowflake, BigQuery, BigQuery Iceberg,
Databricks, Redshift, Hive). For any other platform, Claude will derive the appropriate
collection queries from the warehouse's system catalog or metadata APIs and generate an
equivalent script.

### Ready-to-run examples

Production-ready example scripts built from these templates are published in the
[mcd-public-resources](https://github.com/monte-carlo-data/mcd-public-resources) repo:

- **[BigQuery Iceberg (BigLake) tables](https://github.com/monte-carlo-data/mcd-public-resources/tree/main/examples/push-ingestion/bigquery/push-iceberg-tables)** —
  metadata and query log collection for BigQuery Iceberg tables that are invisible to Monte
  Carlo's standard pull collector (which uses `__TABLES__`). Includes a `--only-freshness-and-volume`
  flag for fast periodic pushes that skip the schema/fields query — useful for hourly cron jobs
  after the initial full metadata push.
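
To illustrate how a flag like this can gate the expensive query, here is a rough sketch. It is not the published script's implementation: only the flag name comes from the description above, while `parse_args`, `planned_queries`, and the query labels are hypothetical.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Push table metadata (sketch).")
    parser.add_argument(
        "--only-freshness-and-volume",
        action="store_true",
        help="skip the schema/fields query and push only freshness and volume",
    )
    return parser.parse_args(argv)


def planned_queries(args: argparse.Namespace) -> list[str]:
    """Return the (hypothetical) collection steps to run for this invocation."""
    queries = ["freshness_and_volume"]
    if not args.only_freshness_and_volume:
        queries.insert(0, "schema_fields")
    return queries
```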

## Reference docs — when to load

| Reference file | Load when… |
|---|---|
| `references/prerequisites.md` | Customer is setting up for the first time, has auth errors, or needs help creating API keys |
| `references/push-metadata.md` | Building or debugging a metadata collection script |
| `references/push-lineage.md` | Building or debugging a lineage collection script |
| `references/push-query-logs.md` | Building or debugging a query log collection script |
| `references/custom-lineage.md` | Customer needs custom lineage nodes or edges via GraphQL |
| `references/validation.md` | Verifying pushed data, running GraphQL checks, or deleting push-ingested tables |
| `references/direct-http-api.md` | Customer wants to call push APIs directly via curl/HTTP without pycarlo |
| `references/anomaly-detection.md` | Customer asks why freshness or volume detectors aren't firing |

## Prerequisites — read this first

→ Load `references/prerequisites.md`

Two separate API keys are required. This is the most common setup stumbling block:
- **Ingestion key** (scope=Ingestion) — for pushing data
- **GraphQL API key** — for verification queries

Both use the same `x-mcd-id` / `x-mcd-token` headers but point to different endpoints.
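
The pairing between keys and purposes can be sketched as below. The header names and environment variable names come from this document; `auth_headers` and the `"push"` / `"verify"` labels are hypothetical, and endpoint URLs are intentionally left out because they belong to the prerequisites reference.

```python
_KEY_VARS = {
    "push": ("MCD_INGEST_ID", "MCD_INGEST_TOKEN"),  # Ingestion key
    "verify": ("MCD_ID", "MCD_TOKEN"),              # GraphQL API key
}


def auth_headers(purpose: str, env: dict) -> dict:
    """Build the x-mcd-id / x-mcd-token headers for the given purpose."""
    if purpose not in _KEY_VARS:
        raise ValueError(f"purpose must be one of {sorted(_KEY_VARS)}, got {purpose!r}")
    id_var, token_var = _KEY_VARS[purpose]
    return {"x-mcd-id": env[id_var], "x-mcd-token": env[token_var]}
```

Sending the ingestion key to the verification endpoint (or the reverse) is the kind of mix-up this separation is meant to prevent.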
