groq-observability
Set up observability for Groq integrations: latency histograms, token throughput, rate limit gauges, cost tracking, and Prometheus alerts. Trigger with phrases like "groq monitoring", "groq metrics", "groq observability", "monitor groq", "groq alerts", "groq dashboard".
What this skill does
# Groq Observability
## Overview
Monitor Groq LPU inference for latency, token throughput, rate limit utilization, and cost. Groq's defining advantage is speed (280-560 tok/s), so latency degradation is the highest-priority signal. The API returns rich timing metadata (`queue_time`, `prompt_time`, `completion_time`) and rate limit headers on every response.
## Key Metrics to Track
| Metric | Type | Source | Why |
|--------|------|--------|-----|
| TTFT (time to first token) | Histogram | Client-side timing | Groq's main value prop |
| Tokens/second | Gauge | `usage.completion_time` | Throughput degradation |
| Total latency | Histogram | Client-side timing | End-to-end performance |
| Rate limit remaining | Gauge | `x-ratelimit-remaining-*` headers | Prevent 429s |
| Token usage | Counter | `usage.total_tokens` | Cost attribution |
| Error rate by code | Counter | Error handler | Availability |
| Estimated cost | Counter | Tokens * model price | Budget tracking |
## Instructions
### Step 1: Instrumented Groq Client
```typescript
import Groq from "groq-sdk";
const groq = new Groq();
interface GroqMetrics {
model: string;
latencyMs: number;
ttftMs: number;
tokensPerSec: number;
promptTokens: number;
completionTokens: number;
totalTokens: number;
queueTimeMs: number;
estimatedCostUsd: number;
}
const PRICE_PER_1M: Record<string, { input: number; output: number }> = {
"llama-3.1-8b-instant": { input: 0.05, output: 0.08 },
"llama-3.3-70b-versatile": { input: 0.59, output: 0.79 },
"llama-3.3-70b-specdec": { input: 0.59, output: 0.99 },
"meta-llama/llama-4-scout-17b-16e-instruct": { input: 0.11, output: 0.34 },
};
async function trackedCompletion(
model: string,
messages: any[],
options?: { maxTokens?: number; temperature?: number }
): Promise<{ result: any; metrics: GroqMetrics }> {
const start = performance.now();
const result = await groq.chat.completions.create({
model,
messages,
max_tokens: options?.maxTokens ?? 1024,
temperature: options?.temperature ?? 0.7,
});
const latencyMs = performance.now() - start;
const usage = result.usage!;
const pricing = PRICE_PER_1M[model] || { input: 0.10, output: 0.10 };
const metrics: GroqMetrics = {
model,
latencyMs: Math.round(latencyMs),
ttftMs: Math.round(((usage as any).prompt_time ?? 0) * 1000),
tokensPerSec: Math.round(
usage.completion_tokens / ((usage as any).completion_time || latencyMs / 1000)
),
promptTokens: usage.prompt_tokens,
completionTokens: usage.completion_tokens,
totalTokens: usage.total_tokens,
queueTimeMs: Math.round(((usage as any).queue_time ?? 0) * 1000),
estimatedCostUsd:
(usage.prompt_tokens / 1_000_000) * pricing.input +
(usage.completion_tokens / 1_000_000) * pricing.output,
};
emitMetrics(metrics);
return { result, metrics };
}
```
### Step 2: Prometheus Metrics
```typescript
import { Histogram, Counter, Gauge } from "prom-client";
const groqLatency = new Histogram({
name: "groq_latency_ms",
help: "Groq API latency in milliseconds",
labelNames: ["model"],
buckets: [50, 100, 200, 500, 1000, 2000, 5000],
});
const groqTokens = new Counter({
name: "groq_tokens_total",
help: "Total tokens processed",
labelNames: ["model", "direction"],
});
const groqThroughput = new Gauge({
name: "groq_tokens_per_second",
help: "Current tokens per second",
labelNames: ["model"],
});
const groqRateLimitRemaining = new Gauge({
name: "groq_ratelimit_remaining",
help: "Remaining rate limit quota",
labelNames: ["type"],
});
const groqCost = new Counter({
name: "groq_cost_usd",
help: "Estimated cost in USD",
labelNames: ["model"],
});
const groqErrors = new Counter({
name: "groq_errors_total",
help: "API errors by status code",
labelNames: ["model", "status_code"],
});
function emitMetrics(m: GroqMetrics) {
groqLatency.labels(m.model).observe(m.latencyMs);
groqTokens.labels(m.model, "input").inc(m.promptTokens);
groqTokens.labels(m.model, "output").inc(m.completionTokens);
groqThroughput.labels(m.model).set(m.tokensPerSec);
groqCost.labels(m.model).inc(m.estimatedCostUsd);
}
```
### Step 3: Rate Limit Header Tracking
```typescript
// Parse rate limit headers from any Groq response
function trackRateLimitHeaders(headers: Record<string, string>) {
const remaining = {
requests: parseInt(headers["x-ratelimit-remaining-requests"] || "0"),
tokens: parseInt(headers["x-ratelimit-remaining-tokens"] || "0"),
};
groqRateLimitRemaining.labels("requests").set(remaining.requests);
groqRateLimitRemaining.labels("tokens").set(remaining.tokens);
return remaining;
}
```
### Step 4: Prometheus Alert Rules
```yaml
# prometheus/groq-alerts.yml
groups:
- name: groq
rules:
- alert: GroqLatencyHigh
expr: histogram_quantile(0.95, rate(groq_latency_ms_bucket[5m])) > 1000
for: 2m
labels:
severity: warning
annotations:
summary: "Groq P95 latency > 1s (normally < 200ms)"
- alert: GroqRateLimitCritical
expr: groq_ratelimit_remaining{type="requests"} < 5
for: 1m
labels:
severity: critical
annotations:
summary: "Groq rate limit nearly exhausted (< 5 requests remaining)"
- alert: GroqThroughputDrop
expr: groq_tokens_per_second < 100
for: 5m
labels:
severity: warning
annotations:
summary: "Groq throughput dropped below 100 tok/s (expected 280+)"
- alert: GroqErrorRateHigh
expr: rate(groq_errors_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "Groq API error rate elevated (> 5% of requests)"
- alert: GroqCostSpike
expr: increase(groq_cost_usd[1h]) > 10
labels:
severity: warning
annotations:
summary: "Groq spend exceeded $10 in the past hour"
```
### Step 5: Structured Request Logging
```typescript
// Structured JSON log for each Groq request
function logGroqRequest(metrics: GroqMetrics, requestId?: string) {
const logEntry = {
ts: new Date().toISOString(),
service: "groq",
model: metrics.model,
latency_ms: metrics.latencyMs,
ttft_ms: metrics.ttftMs,
tokens_per_sec: metrics.tokensPerSec,
prompt_tokens: metrics.promptTokens,
completion_tokens: metrics.completionTokens,
queue_time_ms: metrics.queueTimeMs,
cost_usd: metrics.estimatedCostUsd.toFixed(6),
request_id: requestId,
};
// Output as structured JSON for log aggregation
console.log(JSON.stringify(logEntry));
}
```
### Step 6: Dashboard Panels
Key Grafana/dashboard panels for Groq monitoring:
1. **TTFT Distribution** (histogram) -- Groq's main value; alert if > 500ms
2. **Tokens/Second by Model** (time series) -- should be 280-560 range
3. **Rate Limit Utilization** (gauge, 0-100%) -- alert at 90%
4. **Request Volume** (counter rate) -- by model
5. **Error Rate** (counter rate) -- by status code (429, 5xx)
6. **Cumulative Cost** (counter) -- by model, daily/weekly/monthly
7. **Queue Time** (histogram) -- Groq-specific, should be < 50ms
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| 429 with high retry-after | RPM or TPM exhausted | Implement request queuing |
| Latency spike > 2s | Model overloaded or large prompt | Reduce prompt size or switch to lighter model |
| 503 Service Unavailable | Groq capacity issue | Enable fallback to alternative provider |
| Tokens/sec drop | Streaming disabled or large prompts | Enable streaming for better perceived performance |
## Resources
- [Groq API Reference (usage fields)](https://console.groq.com/docs/api-reference)
- [Groq Rate Limits](https://console.groq.com/docs/rate-limits)
- [prom-client on npm](https://www.npmjs.com/package/prom-client)
## Next Steps
For incident response procedures, see `groq-incident-runbook`.
Related in Data & Analytics
clawarr-suite
IncludedComprehensive management for self-hosted media stacks (Sonarr, Radarr, Lidarr, Readarr, Prowlarr, Bazarr, Overseerr, Plex, Tautulli, SABnzbd, Recyclarr, Unpackerr, Notifiarr, Maintainerr, Kometa, FlareSolverr). Deep library exploration, analytics, dashboard generation, content management, request handling, subtitle management, indexer control, download monitoring, quality profile sync, library cleanup automation, notification routing, collection/overlay management, and media tracker integration (Trakt, Letterboxd, Simkl).
querying-soql
IncludedSOQL query generation, optimization, and analysis with 100-point scoring. Use this skill when the user needs SOQL/SOSL authoring or optimization: natural-language-to-query generation, relationship queries, aggregates, query-plan analysis, and performance or safety improvements for Salesforce queries. TRIGGER when: user writes, optimizes, or debugs SOQL/SOSL queries, touches .soql files, or asks about relationship queries, aggregates, or query performance. DO NOT TRIGGER when: bulk data operations (use handling-sf-data), Apex DML logic (use generating-apex), or report/dashboard queries.
app-store-optimization
IncludedApp Store Optimization (ASO) toolkit for researching keywords, analyzing competitor rankings, generating metadata suggestions, and improving app visibility on Apple App Store and Google Play Store. Use when the user asks about ASO, app store rankings, app metadata, app titles and descriptions, app store listings, app visibility, or mobile app marketing on iOS or Android. Supports keyword research and scoring, competitor keyword analysis, metadata optimization, A/B test planning, launch checklists, and tracking ranking changes.
habit-flow
IncludedAI-powered atomic habit tracker with natural language logging, streak tracking, smart reminders, and coaching. Use for creating habits, logging completions naturally ("I meditated today"), viewing progress, and getting personalized coaching.
app-store-optimization
IncludedApp Store Optimization (ASO) toolkit for researching keywords, analyzing competitor rankings, generating metadata suggestions, and improving app visibility on Apple App Store and Google Play Store. Use when the user asks about ASO, app store rankings, app metadata, app titles and descriptions, app store listings, app visibility, or mobile app marketing on iOS or Android. Supports keyword research and scoring, competitor keyword analysis, metadata optimization, A/B test planning, launch checklists, and tracking ranking changes.
visualizing-data
IncludedBuilds dashboards, reports, and data-driven interfaces requiring charts, graphs, or visual analytics. Provides systematic framework for selecting appropriate visualizations based on data characteristics and analytical purpose. Includes 24+ visualization types organized by purpose (trends, comparisons, distributions, relationships, flows, hierarchies, geospatial), accessibility patterns (WCAG 2.1 AA compliance), colorblind-safe palettes, and performance optimization strategies. Use when creating visualizations, choosing chart types, displaying data graphically, or designing data interfaces.