podium-rag-context-bridge
Bridge a live Podium call transcript or webchat turn to an LLM by fetching relevant historical conversation context as a structured RAG bundle — vector search over embedded prior conversations + reranking + live Podium contact lookup, merged under a hard latency budget that keeps the call answerable in real time. Use when wiring a transcription-driven agent loop to an LLM that needs cross-channel customer memory, building the substrate that turns "transcript chunk" into "LLM-ready prompt with context", or hardening an existing RAG pipeline against staleness, reranking noise, PII leakage, token-budget overflow, and missed-the-call latency. Trigger with "podium rag", "podium llm context", "podium transcript to llm", "podium retrieval", "podium vector search", "podium real-time context", "podium agent grounding".
# Podium RAG Context Bridge
## Overview
Take a live transcript chunk (or webchat turn) and emit a structured RAG context bundle the calling LLM can drop into its prompt. This is not a chatbot. This is the substrate that sits between Podium's transcription stream and whatever LLM your agent loop is using — fetching the right historical context, fast enough to matter, in a shape the model can actually consume.
The substrate combines two retrieval surfaces: a **vector store** of past Podium conversations (embedded at ingest by `podium-conversation-history-export`) and a **live Podium API lookup** for the contact's fresh state (phone, opt-out, location, last-seen). The two surfaces answer different questions — vectors answer "what did this person ever say about this topic," the live API answers "is the phone number we're about to dial actually still their phone number." A naive RAG pipeline collapses them and ships drift to the model.
The six production failures this skill prevents:
1. **Relevance scoring picks wrong historical context** — naive cosine similarity over top-K=5 chunks surfaces the five most similar embeddings; on real Podium corpora most of those are off-topic boilerplate ("thanks for reaching out", "have a great day"). The model sees noise, generates a generic reply, and the operator loses the customer. Fix: cross-encoder reranking + per-contact filtering BEFORE the model sees anything.
2. **Vector store stale vs live Podium data drift** — the contact updated their phone yesterday; today the vector store still embeds the old number. The model answers "I'll call you back at (555) 0100" using a number that hasn't worked in 16 hours. Vector recall and live state are different SLAs and must be merged with live state winning on any field that can mutate.
3. **Transcript chunk boundaries lose context** — the transcriber emits chunks every 800ms. A boundary that cuts "my order number is" / "ABC-12345" in half means neither chunk retrieves the order. Both retrieve nothing relevant. Fix: sliding-window overlap (default 200ms tail of previous chunk prepended to the embed query) plus chunk coalescing before embedding.
4. **LLM token budget overflow** — retrieved context plus system prompt plus transcript history pushes the user prompt past the model's window. Most models silently truncate from the middle; some refuse the request. Either way the model is operating on a corrupted prompt. Fix: hard token budget per retrieval surface (default 1500 tokens of context, summarized if over).
5. **PII reaches LLM in raw form** — the retrieved historical context still contains credit-card-like strings, full home addresses, and DOBs that were never redacted at ingest because nobody told the ingest pipeline they had to. The LLM will repeat them back. Fix: redaction filter at retrieval time as the last line of defense before emission, even if ingest is supposed to do it too.
6. **Context emission latency exceeds call duration** — by the time the bridge returns context, the customer has already hung up. The vector store p99 is 600ms, the reranker p99 is 400ms, the Podium lookup p99 is 300ms — serially that is over a second per turn and the agent never catches up. Fix: parallel fan-out + a hard 800ms wall-clock timeout that returns whatever finished, with a structured `partial: true` flag so the LLM knows it is grounding on incomplete context.
## Prerequisites
- Python 3.10+
- A populated vector store of past Podium conversations. Recommended bootstrap: run `podium-conversation-history-export` against the org's full history, embed each chunk, and write to a pgvector table (schema in `references/implementation.md`)
- A working `podium-auth` instance for the live Podium contact lookup
- An embedding model available at request time. Default: `BAAI/bge-large-en-v1.5` via `sentence-transformers` (free, local) or `text-embedding-3-small` via OpenAI (hosted, ~$0.00002/embed)
- A cross-encoder reranker for the second-stage score. Default: `BAAI/bge-reranker-base` (free, local). LLM-as-reranker is also supported but adds latency and cost
- pgvector ≥ 0.5 if using the default backend. Pinecone / Weaviate adapters are documented but not the reference path
## Instructions
Build in this order. Each section neutralizes one production failure mode.
### 1. Sliding-window chunk coalescing (neutralizes boundary loss)
The first thing that goes wrong is upstream of everything else: the transcript chunks themselves arrive cut at a boundary that matters. Before the chunk hits the embedding model, prepend the tail of the previous chunk and, if it is already available, append the head of the next chunk, so the embed query sees a full clause rather than a sentence fragment. The snippet below implements only the previous-chunk tail.
```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class TranscriptCoalescer:
    """Coalesce 800ms transcript chunks into overlap-window embed queries."""

    tail_window_ms: int = 200
    head_window_ms: int = 200  # reserved for next-chunk lookahead; unused in this snippet
    _recent: Deque[str] = field(default_factory=lambda: deque(maxlen=3))

    def feed(self, chunk: str) -> str:
        # tail of the previous chunk, then current chunk
        prev_tail = self._tail(self._recent[-1]) if self._recent else ""
        self._recent.append(chunk)
        return f"{prev_tail} {chunk}".strip()

    def _tail(self, s: str) -> str:
        # naive: last ~30 chars; production: last word-bounded ~200ms of text
        return s[-30:] if len(s) > 30 else s
```
Without this, every embed query is doing word-boundary keyhole surgery on the corpus and missing relevant context for reasons that have nothing to do with the model.
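The naive character slice can cut a word in half. A word-bounded tail is one possible refinement; the sketch below is an assumption-laden illustration, not the skill's production code. `word_bounded_tail` is a hypothetical helper, and a character budget is used as a rough stand-in for "about 200ms of speech" because the transcriber's word timings are not shown here.

```python
def word_bounded_tail(text: str, max_chars: int = 30) -> str:
    """Return the last whole words of `text` that fit within `max_chars`."""
    kept: list[str] = []
    used = 0
    for word in reversed(text.split()):
        extra = len(word) + (1 if kept else 0)  # +1 for the joining space
        if used + extra > max_chars:
            break
        kept.append(word)
        used += extra
    return " ".join(reversed(kept))


prev_chunk = "yeah so I was calling because my order number is"
next_chunk = "ABC-12345 and it never arrived"
query = f"{word_bounded_tail(prev_chunk)} {next_chunk}".strip()
print(query)  # because my order number is ABC-12345 and it never arrived
```

With the tail attached, the embed query carries both "order number" and the identifier, so the boundary no longer hides the order from retrieval.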
### 2. Vector query with cross-encoder reranking (neutralizes relevance noise)
Top-K cosine similarity gives you the five most-similar chunks. Most are off-topic. The cure is a second pass through a cross-encoder that scores `(query, candidate)` pairs directly — slower but ~10x more accurate at the top of the list. Pull top-20 from the vector store, rerank to top-5, return.
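Before the pgvector-backed version, the two-stage shape can be seen in a toy in-memory sketch. Everything here is an assumption for illustration: the two-dimensional embeddings are made up, and `toy_pair_score` is a token-overlap placeholder standing in for a real cross-encoder such as `BAAI/bge-reranker-base`. It only shows how reranking the recall pool can promote a relevant chunk over boilerplate that sits closer in embedding space.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_pair_score(query: str, candidate: str) -> float:
    """Placeholder for a cross-encoder: fraction of query tokens found in the candidate."""
    q = set(query.lower().split())
    c = set(candidate.lower().split())
    return len(q & c) / len(q) if q else 0.0


def two_stage(query_text, query_vec, corpus, pool_k=20, final_k=5):
    # Stage 1: cheap recall by embedding similarity.
    pool = sorted(corpus, key=lambda d: cosine(query_vec, d["embedding"]),
                  reverse=True)[:pool_k]
    # Stage 2: score each (query, candidate) pair directly and keep the best.
    scored = [{**d, "rerank_score": toy_pair_score(query_text, d["content"])}
              for d in pool]
    return sorted(scored, key=lambda d: d["rerank_score"], reverse=True)[:final_k]


corpus = [
    {"id": 1, "content": "thanks for reaching out", "embedding": [0.9, 0.1]},
    {"id": 2, "content": "my order ABC-12345 never arrived", "embedding": [0.8, 0.3]},
    {"id": 3, "content": "have a great day", "embedding": [0.1, 0.9]},
]
top = two_stage("where is my order", [1.0, 0.0], corpus, pool_k=2, final_k=1)
print(top[0]["id"])  # 2
```

Chunk 1 wins stage one on cosine similarity, but the pairwise scorer ranks chunk 2 first because it actually shares terms with the query.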
```python
import asyncio
from typing import Protocol


class VectorStore(Protocol):
    async def query(self, embedding: list[float], top_k: int,
                    filter: dict | None = None) -> list[dict]: ...


class Reranker(Protocol):
    async def score(self, query: str, candidates: list[str]) -> list[float]: ...


class PgvectorStore:
    """pgvector reference implementation. Replace with Pinecone/Weaviate as needed."""

    def __init__(self, dsn: str):
        self.dsn = dsn  # e.g. "postgresql://user:pass@host/db"; load from a secret store

    async def query(self, embedding: list[float], top_k: int,
                    filter: dict | None = None) -> list[dict]:
        import psycopg  # imported lazily so the module loads without the driver

        contact_uid = (filter or {}).get("contact_uid")
        # The ::text cast lets Postgres infer the parameter's type in the IS NULL
        # check; psycopg 3 binds parameters server-side and sends no type for it.
        sql = """
            SELECT id, contact_uid, content, channel, occurred_at,
                   1 - (embedding <=> %s::vector) AS cosine_score
            FROM podium_conversations
            WHERE (%s::text IS NULL OR contact_uid = %s)
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """
        async with await psycopg.AsyncConnection.connect(self.dsn) as conn:
            cur = await conn.execute(
                sql, (embedding, contact_uid, contact_uid, embedding, top_k))
            rows = await cur.fetchall()
        return [
            {"id": r[0], "contact_uid": r[1], "content": r[2],
             "channel": r[3], "occurred_at": r[4], "score": float(r[5])}
            for r in rows
        ]


async def search_with_rerank(
    query_text: str,
    embedder, vector_store: VectorStore, reranker: Reranker,
    contact_uid: str | None = None,
    pool_k: int = 20, final_k: int = 5,
) -> list[dict]:
    """Two-stage retrieval. ANN recall to pool_k, cross-encoder rerank to final_k."""
    embedding = await embedder.embed(query_text)
    pool = await vector_store.query(embedding, top_k=pool_k,
                                    filter={"contact_uid"
```