clickhouse-prod-checklist
Production readiness checklist for ClickHouse — server tuning, backup, monitoring, and deployment verification. Use when launching a ClickHouse deployment, doing go-live reviews, or auditing production readiness. Trigger: "clickhouse production", "clickhouse go-live", "clickhouse launch checklist", "production clickhouse", "clickhouse prod ready".
What this skill does
# ClickHouse Production Checklist
## Overview
Comprehensive go-live checklist for ClickHouse covering server tuning, schema design,
backup configuration, monitoring, and operational readiness.
## Prerequisites
- ClickHouse instance provisioned (Cloud or self-hosted)
- Application integration code tested in staging
## Checklist
### 1. Schema & Engine Design
- [ ] Tables use `MergeTree` family engines (not `Memory`, `Log`, or `TinyLog`)
- [ ] `ORDER BY` columns match primary filter/group patterns
- [ ] `PARTITION BY` is coarse (monthly or weekly, never by ID)
- [ ] `TTL` configured for data retention policy
- [ ] `LowCardinality(String)` used for low-cardinality columns
- [ ] `CODEC(ZSTD)` applied to large String/JSON columns
- [ ] ReplacingMergeTree used with `FINAL` or dedup logic if upserts needed
### 2. Server Configuration (Self-Hosted)
```xml
<!-- Key production settings in config.xml / users.xml -->
<!-- Memory: set to ~80% of available RAM -->
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>
<!-- Query limits -->
<max_concurrent_queries>150</max_concurrent_queries>
<max_memory_usage>10000000000</max_memory_usage> <!-- 10GB per query -->
<max_execution_time>300</max_execution_time> <!-- 5 min timeout -->
<!-- Merge settings -->
<background_pool_size>16</background_pool_size>
<background_schedule_pool_size>16</background_schedule_pool_size>
<!-- Logging -->
<query_log>
<database>system</database>
<table>query_log</table>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
```
### 3. Backup Configuration
```sql
-- ClickHouse native BACKUP to S3
BACKUP TABLE analytics.events
TO S3(
'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events',
'ACCESS_KEY',
'SECRET_KEY'
)
SETTINGS compression_method = 'zstd';
-- Incremental backup (base + delta)
BACKUP TABLE analytics.events
TO S3('s3://my-bucket/backups/events-incremental')
SETTINGS base_backup = S3('s3://my-bucket/backups/events-base');
```
**ClickHouse Cloud:** Backups are automatic. Configure retention and frequency
in the Cloud console under Service Settings.
- [ ] Backup schedule configured (daily minimum)
- [ ] Backup restore tested and documented
- [ ] Point-in-time recovery possible (incremental backups)
- [ ] Backup stored in different region/account from primary
### 4. Monitoring & Alerting
```sql
-- Key metrics to monitor
-- 1. Query latency (p50/p95/p99)
-- 2. Active parts per table (alert > 300)
-- 3. Merge queue depth
-- 4. Memory usage
-- 5. Disk space free
-- 6. Replication lag (if clustered)
-- Health check query (use for load balancer probes)
SELECT 1;
-- More thorough health check
SELECT
(SELECT count() FROM system.processes) AS running_queries,
(SELECT sum(value) FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes,
(SELECT count() FROM system.merges) AS active_merges;
```
- [ ] Prometheus endpoint configured (`/metrics` via ClickHouse Exporter or Cloud endpoint)
- [ ] Grafana dashboard with key panels (QPS, latency, memory, parts, merges)
- [ ] Alerts on: error rate > 5%, p95 latency > 5s, parts > 300, disk < 20%
- [ ] Query log monitoring for slow/failed queries
### 5. Security
- [ ] Default user password changed or user disabled
- [ ] Application user has minimal privileges (see `clickhouse-security-basics`)
- [ ] TLS enabled (HTTPS on port 8443)
- [ ] IP allowlist configured
- [ ] Secrets in environment variables or secret manager (not code)
### 6. Application Integration
- [ ] Connection pooling configured (`max_open_connections`)
- [ ] Graceful shutdown calls `client.close()`
- [ ] Insert batching in place (10K+ rows per INSERT)
- [ ] Retry logic for transient errors (see `clickhouse-rate-limits`)
- [ ] Health check endpoint includes ClickHouse ping
```typescript
// Health check endpoint
app.get('/health', async (req, res) => {
try {
const { success } = await client.ping();
res.json({ status: success ? 'healthy' : 'degraded', clickhouse: success });
} catch {
res.status(503).json({ status: 'unhealthy', clickhouse: false });
}
});
```
### 7. Operational Readiness
- [ ] Incident runbook documented (see `clickhouse-incident-runbook`)
- [ ] On-call escalation path defined
- [ ] Key rotation procedure documented
- [ ] Schema migration process in place (see `clickhouse-migration-deep-dive`)
- [ ] Load testing completed at expected peak traffic
### 8. Verification Queries
```sql
-- Verify table sizes and part health
SELECT database, table, count() AS parts, sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts WHERE active
GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;
-- Verify no pending mutations
SELECT * FROM system.mutations WHERE NOT is_done;
-- Verify no replication lag
SELECT database, table, queue_size, inserts_in_queue
FROM system.replicas WHERE queue_size > 0;
```
## Error Handling
| Issue | Detection | Action |
|-------|-----------|--------|
| Parts > 300 | Monitoring alert | Review insert patterns, wait for merges |
| Disk > 80% | Disk alert | Add storage, drop old partitions |
| Query p95 > 5s | Latency alert | Check `system.query_log` for slow queries |
| Replication lag | Replica check | Investigate network, merge backlog |
## Resources
- [ClickHouse Operations Guide](https://clickhouse.com/docs/operations)
- [Backup & Restore](https://clickhouse.com/docs/operations/backup)
- [Monitoring with Prometheus](https://clickhouse.com/docs/integrations/prometheus)
## Next Steps
For SDK version upgrades, see `clickhouse-upgrade-migration`.
Related in Data & Analytics
clawarr-suite
IncludedComprehensive management for self-hosted media stacks (Sonarr, Radarr, Lidarr, Readarr, Prowlarr, Bazarr, Overseerr, Plex, Tautulli, SABnzbd, Recyclarr, Unpackerr, Notifiarr, Maintainerr, Kometa, FlareSolverr). Deep library exploration, analytics, dashboard generation, content management, request handling, subtitle management, indexer control, download monitoring, quality profile sync, library cleanup automation, notification routing, collection/overlay management, and media tracker integration (Trakt, Letterboxd, Simkl).
querying-soql
IncludedSOQL query generation, optimization, and analysis with 100-point scoring. Use this skill when the user needs SOQL/SOSL authoring or optimization: natural-language-to-query generation, relationship queries, aggregates, query-plan analysis, and performance or safety improvements for Salesforce queries. TRIGGER when: user writes, optimizes, or debugs SOQL/SOSL queries, touches .soql files, or asks about relationship queries, aggregates, or query performance. DO NOT TRIGGER when: bulk data operations (use handling-sf-data), Apex DML logic (use generating-apex), or report/dashboard queries.
app-store-optimization
IncludedApp Store Optimization (ASO) toolkit for researching keywords, analyzing competitor rankings, generating metadata suggestions, and improving app visibility on Apple App Store and Google Play Store. Use when the user asks about ASO, app store rankings, app metadata, app titles and descriptions, app store listings, app visibility, or mobile app marketing on iOS or Android. Supports keyword research and scoring, competitor keyword analysis, metadata optimization, A/B test planning, launch checklists, and tracking ranking changes.
habit-flow
IncludedAI-powered atomic habit tracker with natural language logging, streak tracking, smart reminders, and coaching. Use for creating habits, logging completions naturally ("I meditated today"), viewing progress, and getting personalized coaching.
app-store-optimization
IncludedApp Store Optimization (ASO) toolkit for researching keywords, analyzing competitor rankings, generating metadata suggestions, and improving app visibility on Apple App Store and Google Play Store. Use when the user asks about ASO, app store rankings, app metadata, app titles and descriptions, app store listings, app visibility, or mobile app marketing on iOS or Android. Supports keyword research and scoring, competitor keyword analysis, metadata optimization, A/B test planning, launch checklists, and tracking ranking changes.
visualizing-data
IncludedBuilds dashboards, reports, and data-driven interfaces requiring charts, graphs, or visual analytics. Provides systematic framework for selecting appropriate visualizations based on data characteristics and analytical purpose. Includes 24+ visualization types organized by purpose (trends, comparisons, distributions, relationships, flows, hierarchies, geospatial), accessibility patterns (WCAG 2.1 AA compliance), colorblind-safe palettes, and performance optimization strategies. Use when creating visualizations, choosing chart types, displaying data graphically, or designing data interfaces.