podium-rag-context-bridge

Bridge a live Podium call transcript or webchat turn to an LLM by fetching relevant historical conversation context as a structured RAG bundle — vector search over embedded prior conversations + reranking + live Podium contact lookup, merged under a hard latency budget that keeps the call answerable in real time. Use when wiring a transcription-driven agent loop to an LLM that needs cross-channel customer memory, building the substrate that turns "transcript chunk" into "LLM-ready prompt with context", or hardening an existing RAG pipeline against staleness, reranking noise, PII leakage, token-budget overflow, and missed-the-call latency. Trigger with "podium rag", "podium llm context", "podium transcript to llm", "podium retrieval", "podium vector search", "podium real-time context", "podium agent grounding".

# Podium RAG Context Bridge

## Overview

Take a live transcript chunk (or webchat turn) and emit a structured RAG context bundle the calling LLM can drop into its prompt. This is not a chatbot. This is the substrate that sits between Podium's transcription stream and whatever LLM your agent loop is using — fetching the right historical context, fast enough to matter, in a shape the model can actually consume.

The substrate combines two retrieval surfaces: a **vector store** of past Podium conversations (embedded at ingest by `podium-conversation-history-export`) and a **live Podium API lookup** for the contact's fresh state (phone, opt-out, location, last-seen). The two surfaces answer different questions — vectors answer "what did this person ever say about this topic," the live API answers "is the phone number we're about to dial actually still their phone number." A naive RAG pipeline collapses them and ships drift to the model.

The six production failures this skill prevents:

1. **Relevance scoring picks wrong historical context** — naive cosine similarity over top-K=5 chunks surfaces the five most similar embeddings; on real Podium corpora most of those are off-topic boilerplate ("thanks for reaching out", "have a great day"). The model sees noise, generates a generic reply, and the operator loses the customer. Fix: cross-encoder reranking + per-contact filtering BEFORE the model sees anything.
2. **Vector store stale vs live Podium data drift** — the contact updated their phone yesterday; today the vector store still embeds the old number. The model answers "I'll call you back at 555-0100" using a number that hasn't worked in 16 hours. Vector recall and live state are different SLAs and must be merged with live state winning on any field that can mutate.
3. **Transcript chunk boundaries lose context** — the transcriber emits chunks every 800ms. A boundary that cuts "my order number is" / "ABC-12345" in half means neither chunk retrieves the order. Both retrieve nothing relevant. Fix: sliding-window overlap (default 200ms tail of previous chunk prepended to the embed query) plus chunk coalescing before embedding.
4. **LLM token budget overflow** — retrieved context plus system prompt plus transcript history pushes the prompt past the model's context window. Depending on the serving stack, the request is either rejected outright or truncated before it reaches the model, often without warning. Either way the agent is no longer operating on the prompt you built. Fix: hard token budget per retrieval surface (default 1500 tokens of context, summarized if over).
5. **PII reaches LLM in raw form** — the retrieved historical context still contains credit-card-like strings, full home addresses, and DOBs that were never redacted at ingest because nobody told the ingest pipeline they had to. The LLM will repeat them back. Fix: redaction filter at retrieval time as the last line of defense before emission, even if ingest is supposed to do it too.
6. **Context emission latency exceeds call duration** — by the time the bridge returns context, the customer has already hung up. The vector store p99 is 600ms, the reranker p99 is 400ms, the Podium lookup p99 is 300ms — serially that is over a second per turn and the agent never catches up. Fix: parallel fan-out + a hard 800ms wall-clock timeout that returns whatever finished, with a structured `partial: true` flag so the LLM knows it is grounding on incomplete context.
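The fan-out with a hard wall-clock budget described in failure 6 can be sketched with `asyncio.wait` and its `timeout` argument. This is an illustration, not the skill's implementation: the function name `fan_out`, the surface names, and the shape of the returned dict are assumptions. Because the reranker consumes the vector store's output, those two would normally form one chained coroutine that runs alongside the Podium lookup.

```python
import asyncio
from typing import Any, Awaitable


async def fan_out(tasks: dict[str, Awaitable[Any]], budget_s: float = 0.8) -> dict:
    """Run retrieval surfaces concurrently; return whatever finished in time."""
    running = {asyncio.ensure_future(aw): name for name, aw in tasks.items()}
    done, pending = await asyncio.wait(running.keys(), timeout=budget_s)
    for task in pending:
        task.cancel()  # do not let slow surfaces outlive the turn
    results: dict[str, Any] = {}
    missing = [running[t] for t in pending]
    for task in done:
        if task.exception() is None:
            results[running[task]] = task.result()
        else:
            missing.append(running[task])  # a failed surface counts as missing
    # partial=True tells the LLM it is grounding on incomplete context.
    return {"results": results, "missing": sorted(missing), "partial": bool(missing)}
```

A call such as `await fan_out({"history": retrieve_and_rerank(...), "contact": podium_lookup(...)})` (both coroutine names hypothetical) returns within roughly `budget_s` seconds regardless of which surface is slow.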

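Failures 4 and 5 both act on the retrieved chunks just before emission, so they can share one guard. The sketch below is a rough illustration under several assumptions: the regex patterns cover only card-like digit runs and `MM/DD/YYYY` dates, token counts are estimated at about four characters per token rather than measured with the model's tokenizer, and over-budget chunks are dropped instead of summarized. All function names are hypothetical.

```python
import re

# Deliberately simple patterns; illustrative, not a complete PII taxonomy.
_PATTERNS = {
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # 13-16 digits, optional separators
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each PII-like match with a bracketed label."""
    for label, pattern in _PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 characters per token); swap in the model's tokenizer.
    return max(1, len(text) // 4)


def fit_to_budget(chunks: list[str], budget_tokens: int = 1500) -> list[str]:
    """Redact chunks in rank order and keep them until the budget is spent."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        clean = redact(chunk)
        cost = estimate_tokens(clean)
        if used + cost > budget_tokens:
            break  # lower-ranked chunks are dropped, not summarized, in this sketch
        kept.append(clean)
        used += cost
    return kept
```

Redacting before counting matters: the labels are shorter than the strings they replace, so the budget is spent on text the model will actually see.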
## Prerequisites

- Python 3.10+
- A populated vector store of past Podium conversations. Recommended bootstrap: run `podium-conversation-history-export` against the org's full history, embed each chunk, and write to a pgvector table (schema in `references/implementation.md`)
- A working `podium-auth` instance for the live Podium contact lookup
- An embedding model available at request time. Default: `BAAI/bge-large-en-v1.5` via `sentence-transformers` (free, local) or `text-embedding-3-small` via OpenAI (hosted, ~$0.00002/embed)
- A cross-encoder reranker for the second-stage score. Default: `BAAI/bge-reranker-base` (free, local). LLM-as-reranker is also supported but adds latency and cost
- pgvector ≥ 0.5 if using the default backend. Pinecone / Weaviate adapters are documented but not the reference path

## Instructions

Build in this order. Each section neutralizes one production failure mode.

### 1. Sliding-window chunk coalescing (neutralizes boundary loss)

The first thing that goes wrong is upstream of every other thing: the transcript chunks themselves arrive truncated on a word boundary that matters. Before the chunk hits the embedding model, prepend the tail of the previous chunk and the head of the next chunk (if available) so the embed query sees a full clause, not a sentence fragment.

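```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TranscriptCoalescer:
    """Coalesce 800ms transcript chunks into overlap-window embed queries."""
    tail_window_ms: int = 200  # target overlap; approximated in characters by _tail()
    head_window_ms: int = 200  # not used by feed(): a live stream has no next chunk yet
    # Internal buffer of recent chunks; excluded from __init__ and repr.
    _recent: deque[str] = field(default_factory=lambda: deque(maxlen=3),
                                init=False, repr=False)

    def feed(self, chunk: str) -> str:
        """Return the embed query: tail of the previous chunk, then the current chunk."""
        prev_tail = self._tail(self._recent[-1]) if self._recent else ""
        self._recent.append(chunk)
        return f"{prev_tail} {chunk}".strip()

    def _tail(self, s: str) -> str:
        # Naive: last ~30 chars, trimmed so the overlap starts on a whole word.
        # Production: the last word-bounded ~200ms of text.
        if len(s) <= 30:
            return s
        tail = s[-30:]
        if not s[-31].isspace() and " " in tail:
            tail = tail.split(" ", 1)[1]  # drop the leading partial word
        return tail
```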

Without this, every embed query is doing word-boundary keyhole surgery on the corpus and missing relevant context for reasons that have nothing to do with the model.
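The production variant named in the code comment (the last word-bounded ~200ms of text) needs timing information that a plain string does not carry. A minimal sketch, assuming the transcriber exposes per-word end timestamps; the `Word` type and `tail_by_ms` function are hypothetical and not part of any Podium payload:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    end_ms: int  # assumed: end time of the word relative to stream start


def tail_by_ms(words: list[Word], window_ms: int = 200) -> str:
    """Return the whole words that ended within the last window_ms of the chunk."""
    if not words:
        return ""
    cutoff = words[-1].end_ms - window_ms
    # Keep only complete words, so the overlap never starts mid-word.
    return " ".join(w.text for w in words if w.end_ms > cutoff)
```

For example, with words ending at 100, 300, 600, and 780 ms, a 200ms window keeps the last two, which is what lets "my order number is" carry over into the chunk that holds the order number itself.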

### 2. Vector query with cross-encoder reranking (neutralizes relevance noise)

Top-K cosine similarity gives you the five most similar chunks, and most of them are off-topic. The cure is a second pass through a cross-encoder that scores `(query, candidate)` pairs directly. It is slower, but usually far more accurate at the top of the list. Pull top-20 from the vector store, rerank to top-5, return.
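The second stage on its own is small: score every pair, sort, and cut. The sketch below isolates that step with a pluggable scoring function. The names `rerank` and `overlap_score` are illustrative, and the word-overlap scorer is only a toy stand-in so the example runs without a model; in practice `score_fn` would wrap a cross-encoder such as `BAAI/bge-reranker-base` (for instance through `CrossEncoder.predict` in `sentence-transformers`).

```python
from typing import Callable


def rerank(query: str, pool: list[dict],
           score_fn: Callable[[str, str], float], final_k: int = 5) -> list[dict]:
    """Score every (query, candidate) pair and keep the final_k best."""
    # Copy each candidate so the caller's pool is left untouched.
    scored = [{**c, "rerank_score": score_fn(query, c["content"])} for c in pool]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:final_k]


def overlap_score(query: str, candidate: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words in the candidate.
    q = set(query.lower().split())
    return len(q & set(candidate.lower().split())) / max(1, len(q))
```

Keeping the scorer behind a plain callable makes it easy to compare a local cross-encoder against an LLM-as-reranker without touching the retrieval code.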

```python
import asyncio
from typing import Protocol

class VectorStore(Protocol):
    async def query(self, embedding: list[float], top_k: int,
                    filter: dict | None = None) -> list[dict]: ...

class Reranker(Protocol):
    async def score(self, query: str, candidates: list[str]) -> list[float]: ...

class PgvectorStore:
    """pgvector reference implementation. Replace with Pinecone/Weaviate as needed."""
    def __init__(self, dsn: str):
        import psycopg
        self.dsn = dsn  # e.g. "postgresql://user:pass@host/db" — load from secret store

    async def query(self, embedding: list[float], top_k: int,
                    filter: dict | None = None) -> list[dict]:
        import psycopg
        contact_uid = (filter or {}).get("contact_uid")
        sql = """
            SELECT id, contact_uid, content, channel, occurred_at,
                   1 - (embedding <=> %s::vector) AS cosine_score
            FROM podium_conversations
            WHERE (%s::text IS NULL OR contact_uid = %s)
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """
        async with await psycopg.AsyncConnection.connect(self.dsn) as conn:
            cur = await conn.execute(sql, (embedding, contact_uid, contact_uid, embedding, top_k))
            rows = await cur.fetchall()
        return [
            {"id": r[0], "contact_uid": r[1], "content": r[2],
             "channel": r[3], "occurred_at": r[4], "score": float(r[5])}
            for r in rows
        ]

async def search_with_rerank(
    query_text: str,
    embedder, vector_store: VectorStore, reranker: Reranker,
    contact_uid: str | None = None,
    pool_k: int = 20, final_k: int = 5,
) -> list[dict]:
    """Two-stage retrieval. ANN recall to pool_k, cross-encoder rerank to final_k."""
    embedding = await embedder.embed(query_text)
    pool = await vector_store.query(embedding, top_k=pool_k,
                                    filter={"contact_uid"
