Claude
Skills
Sign in
Back

podium-contact-dedup

Included with Lifetime
$97 forever

Deduplicate Podium contacts in production and survive the data-quality failures — phone-format inconsistency producing four contacts for one phone, merge-api ordering that silently discards the richer record, opt-out flags lost on merge re-enabling marketing on an opted-out person, soft-delete confusion, cross-location duplicate blind spots, and simultaneous-merge race conditions. Use when you build a Podium dedup pipeline, scan and detect duplicate clusters, normalize phones to e164 form across an entire contact corpus, preserve opt-out state through merges, or validate cross-location duplicate resolution. Trigger with "podium contact dedup", "podium phone normalization", "podium e164", "podium duplicate contacts", "podium merge contacts", "podium opt-out preservation", "podium cross-location dedup".

Backend & APIspodiumcontact-dedupe164-normalizationdata-qualityopt-out-preservationmerge-orchestrationscripts

What this skill does


# Podium Contact Dedup

## Overview

Deduplicate Podium contacts in production and operate the dedup pipeline at scale. This is not a one-shot cleanup script — it is the data-quality layer your integration runs continuously to keep the contact corpus sane while messages, calls, webchats, and reviews keep mutating it. Run it once and Sydney's "0412 345 678" walk-in stops creating a fifth contact next to the four that already exist as `+61 412 345 678`, `(04) 1234-5678`, `+61412345678`, and `0412345678`.

The six production failures this skill prevents:

1. **Phone format inconsistency** — `+61 412 345 678`, `0412 345 678`, `(04) 1234-5678`, `+61412345678` are all the same phone but produce four contacts. Operators paste numbers from a CRM, a phone screen, a written form, and a stored fragment; Podium dedups on exact string match, so all four survive and the next caller appears as a fifth.
2. **Merge API ordering loses fields** — Podium's merge endpoint takes a `primary` and a `duplicate`; whichever you pick as `primary` keeps its own fields, the other's fields are discarded. Pick the wrong record (newer but emptier) as primary and the older, richer record's name, tags, and conversation links vanish silently.
3. **Opt-out flag lost on merge** — duplicate had `marketing_opt_out=true`, primary had `marketing_opt_out=false`; naive merge keeps primary's flag and re-enables marketing on a person who explicitly opted out. This is a compliance incident (TCPA, GDPR Article 21, ACMA Spam Act) and a trust incident — the customer opted out, you marketed at them anyway.
4. **Soft-delete vs hard-delete semantic confusion** — Podium's `DELETE /contacts/{uid}` is reversible; the record is hidden, not destroyed. Treat it as terminal and you ship a "contact reappeared after we deleted them" support ticket every time an admin restores a contact via the UI. Hard-delete (purge) is a separate, irreversible endpoint with different scopes.
5. **Duplicate detection across locations** — same phone calls Sydney AND Burleigh Heads, two contacts created (one per location), per-location dedup misses it entirely. Cross-location dedup needs a separate routine keyed by `phone_natural_key` across the union of contacts in every location_uid, not just within one.
6. **Merge conflicts on simultaneous edits** — two operators (or one operator + one automated job) merge overlapping clusters at the same time; the second merge's `primary` may have already been merged into another record, the API silently merges into a now-stale target, and one operator's intent is dropped without surfacing the conflict.

## Prerequisites

- Python 3.10+ with the `phonenumbers` library (`pip install phonenumbers`)
- A working `podium-auth` integration (this skill calls Podium with an authenticated client)
- Read scope: `contacts.read`. Write scope: `contacts.write`. (`contacts.delete` only if hard-purge is in scope.)
- A local SQLite database for the natural-key index and merge state file (default `./podium-dedup.sqlite`)
- A default region for E.164 parsing (`AU` for Australian deployments, `US` for US — set per-tenant)

## Instructions

Build in this order. Each section neutralizes one production failure mode.

### 1. E.164 normalization with natural-key emission (neutralizes phone-format inconsistency)

Every contact's phone is parsed by the `phonenumbers` library into E.164 form, then hashed into a stable "natural key" suitable for an index lookup. Same human-readable phone → same key, regardless of formatting input.

```python
import phonenumbers
from phonenumbers import NumberParseException

def normalize_phone(raw: str, default_region: str = "AU") -> dict:
    """Return {e164, national, country, natural_key, valid} for any input format."""
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except NumberParseException as e:
        return {"valid": False, "reason": f"parse_failed: {e}"}
    if not phonenumbers.is_valid_number(parsed):
        return {"valid": False, "reason": "not_a_valid_number"}
    e164 = phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    national = phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.NATIONAL)
    return {
        "valid": True,
        "e164": e164,                         # +61412345678
        "national": national,                 # 0412 345 678
        "country": phonenumbers.region_code_for_number(parsed),
        "natural_key": e164,                  # the E.164 IS the natural key — no further hashing needed
    }
```

The natural key is the E.164 string itself. Hashing it adds nothing — E.164 is already canonical and bounded in length. Use the E.164 directly as the SQLite primary key on the natural-key index.

### 2. SQLite-backed natural-key index (neutralizes O(N²) duplicate scans)

A naive dedup scans every pair of contacts — O(N²) on a 50k-contact corpus is hours. Instead, build a `(natural_key → [contact_uid, ...])` index in SQLite once, then duplicate detection is O(N) over the index.

```sql
CREATE TABLE IF NOT EXISTS contact_index (
    contact_uid    TEXT PRIMARY KEY,
    location_uid   TEXT NOT NULL,
    natural_key    TEXT NOT NULL,   -- E.164
    raw_phone      TEXT,
    name           TEXT,
    field_count    INTEGER NOT NULL DEFAULT 0,
    marketing_opt_out  INTEGER NOT NULL DEFAULT 0,
    sms_opt_out        INTEGER NOT NULL DEFAULT 0,
    email_opt_out      INTEGER NOT NULL DEFAULT 0,
    deleted_at_podium  TEXT,        -- ISO8601, NULL if live
    updated_at_podium  TEXT NOT NULL,
    indexed_at         TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_natural_key ON contact_index(natural_key);
CREATE INDEX IF NOT EXISTS idx_natural_key_per_location ON contact_index(natural_key, location_uid);
```

The `field_count` column is precomputed at index time so the merge orchestrator picks the richer record as `primary` without re-fetching every record.

### 3. Duplicate detection with confidence scoring (neutralizes blind merges)

A cluster is a set of contacts sharing the same `natural_key`. Within a cluster, each pair gets a confidence score in `[0.0, 1.0]`:

| Factor | Weight |
|---|---|
| Same E.164 (always true within a cluster) | 0.60 — required floor |
| Same name (case-insensitive, normalized) | +0.20 |
| Same email | +0.15 |
| Overlapping tags | +0.05 |

Only clusters with all pairwise scores `>= 0.80` auto-merge by default. Lower-scored clusters surface for human review.

```python
def cluster_confidence(a: dict, b: dict) -> float:
    score = 0.60   # same natural_key by construction
    if a.get("name") and a["name"].strip().lower() == (b.get("name") or "").strip().lower():
        score += 0.20
    if a.get("email") and a["email"].strip().lower() == (b.get("email") or "").strip().lower():
        score += 0.15
    a_tags, b_tags = set(a.get("tags") or []), set(b.get("tags") or [])
    if a_tags & b_tags:
        score += 0.05
    return round(min(score, 1.0), 4)
```

### 4. Merge orchestrator with primary selection (neutralizes lost richer record)

For each auto-mergeable cluster, pick the `primary` by this deterministic rule, in order:

1. **Most fields populated** (highest `field_count`) — the richer record wins.
2. **Most recently updated** (`updated_at_podium`) — break ties toward fresher state.
3. **Lowest contact_uid** (lexical) — final, deterministic, reproducible tiebreak.

Every other contact in the cluster is a `duplicate` to be merged INTO the primary. Never trust caller-supplied ordering — always compute primary inside the orchestrator.

```python
def select_primary(cluster: list[dict]) -> dict:
    return max(
        cluster,
        key=lambda c: (c["field_count"], c["updated_at_podium"], -ord_key(c["contact_uid"]))
    )

def ord_key(uid: str) -> int:
    # Stable, deterministic tiebreak — lower uid sorts first
    return sum(ord(c) for c in uid)
```

### 5. Opt-out preservation by union (neutralizes compliance re-enable)

The strongest setting wins, always. If any

Related in Backend & APIs