diagnose

Included with Lifetime

$97 forever

Use when the user reports that a Zilliz Cloud cluster or Milvus collection is unhealthy, slow, stuck, returning errors, hitting quotas, or otherwise misbehaving — or when they ask "what's wrong with...", "why is ... slow", "diagnose ...", "troubleshoot ...".

Cloud & DevOps

What this skill does


## Prerequisites

1. CLI installed and logged in (see setup skill).
2. For cluster diagnosis: a cluster context, or pass `--cluster-id` explicitly.
3. For collection diagnosis: the cluster context must point at the cluster that owns the collection (see setup skill).

## Scope

This skill performs **read-only diagnosis** using existing `zilliz` commands. It does NOT mutate clusters, collections, indexes, or data. Every recommendation is presented as a suggestion plus the exact next command the user can run themselves.

Two entry points:

| User says | Use section |
|---|---|
| "diagnose my cluster", "cluster is slow / stuck / unhealthy" | [Cluster Diagnosis](#cluster-diagnosis) |
| "diagnose this collection", "search is slow on X", "X won't load" | [Collection Diagnosis](#collection-diagnosis) |

When the user's intent spans both (e.g. "everything is slow"), run cluster diagnosis first — a collection-level symptom often has a cluster-level cause.

## Cluster Diagnosis

Follow this checklist in order. Stop early only if you find a P0 problem (cluster not RUNNING, quota hard-stop) — surface it before running the rest.

### 1. Collect

Run these in parallel where possible and prefer `-o json` for machine parsing:

```bash
zilliz context current
zilliz cluster describe --cluster-id <id> -o json
zilliz cluster list --all -o json                       # peer clusters for context
zilliz billing usage -o json                            # quota / spend headroom
```

Then pull time-series metrics covering the last hour and last 24h. Use the chart output for human review and `-o json` if you need to compute thresholds:

```bash
zilliz cluster metrics --cluster-id <id> \
  -m CU_COMPUTATION -m CU_CAPACITY -m CU_SIZE -m REPLICA_COUNT \
  -m STORAGE -m SLOW_QUERIES \
  -m SEARCH_QPS -m SEARCH_LATENCY_P99 \
  -m SEARCH_FAIL_RATE -m INSERT_FAIL_RATE \
  --period 1h
```

Repeat with `--period 24h` for trend context.

### 2. Analyze

Walk each rule. Each finding must cite the evidence (command + observed value).

| Check | Evidence | If true, suggest |
|---|---|---|
| Cluster status ≠ RUNNING | `cluster describe` `.status` | Pause diagnosis; explain status; if SUSPENDED suggest `cluster resume` |
| Plan vs. observed QPS mismatch (e.g. Serverless under sustained high QPS) | plan + `SEARCH_QPS` | Discuss plan upgrade tradeoff |
| `CU_COMPUTATION` p95 > 80% of `CU_CAPACITY` for sustained windows | metric series | Recommend scaling CU or adding replicas |
| `SEARCH_LATENCY_P99` rising while QPS flat | latency vs qps | Likely index/CU pressure; drill into top collections |
| Non-zero `*_FAIL_RATE` | fail-rate series | Cross-check with recent user-reported errors |
| `SLOW_QUERIES` non-zero | metric | Drill into per-collection diagnosis for the offenders |
| Storage trending toward quota | `STORAGE` + billing | Suggest cleanup / plan change before hard cap |
| Region likely far from client | endpoint + user-reported RTT | Note possible region mismatch — cannot measure RTT from CLI |

### 3. Present

Render a single report with three sections, in this order:

1. **Summary** — one-line health verdict (healthy / degraded / critical) + 1-3 sentence rationale.
2. **Findings** — table with columns `Severity | Finding | Evidence | Suggested Next Step`.
3. **Suggested commands** — copy-pasteable `zilliz ...` commands matching each suggestion. Never run mutating commands yourself — let the user run them.

## Collection Diagnosis

### 1. Collect

```bash
zilliz collection describe --name <coll> -o json
zilliz collection get-stats --name <coll> -o json
zilliz collection get-load-state --name <coll> -o json
zilliz index list --collection-name <coll> -o json
# For each vector index reported above:
zilliz index describe --collection-name <coll> --field-name <field> -o json
zilliz partition list --collection-name <coll> -o json
```

Pull per-collection metrics for the last hour and 24h:

```bash
zilliz collection metrics -c <coll> \
  -m SEARCH_QPS -m SEARCH_LATENCY_P99 -m SEARCH_FAIL_RATE \
  -m QUERY_QPS -m QUERY_LATENCY_P99 \
  -m INSERT_QPS -m INSERT_LATENCY_P99 \
  -m ENTITIES -m ENTITIES_LOADED -m ENTITIES_INDEXED \
  --period 1h
```

If the cluster context is wrong or the collection lives in a non-default database, pass `--database <db>` on every command.

### 2. Analyze

| Check | Evidence | If true, suggest |
|---|---|---|
| Load state ≠ Loaded (or partially loaded) | `get-load-state` | `collection load --name <coll>`; explain that unloaded collections cannot serve search |
| `ENTITIES_LOADED` ≪ `ENTITIES` | metrics | Load incomplete or recently grew; wait or reload |
| `ENTITIES_INDEXED` ≪ `ENTITIES` | metrics | Indexing lag; investigate before tuning |
| Vector field present but no vector index, or FLAT on large row count | `index list` + `get-stats` | Recommend HNSW / IVF_* with a parameter range appropriate for row count; mark as recommendation, not absolute |
| Index params clearly off (e.g. `nlist` ≪ √N for IVF) | `index describe` + row count | Suggest revised values; note that benchmarking is needed for the final number |
| Replica count = 1 with sustained high QPS | `collection describe` replicas + `SEARCH_QPS` | Suggest more replicas, conditioned on CU headroom from cluster diagnosis |
| Schema reasons: very wide varchar, many un-indexed scalar fields used in filters | `collection describe` schema | Note impact; suggest scalar indexes where applicable |
| No partition key but row count is large and queries are naturally partitionable | schema | Mention partition-key option (cannot be added in place — design-time decision) |
| Non-zero `SEARCH_FAIL_RATE` | metric | Cross-check with cluster-level fail-rate metrics |
| Latency rising while QPS flat | latency vs qps | Index pressure or growing segment count; consider compaction; note that segment-level state is not exposed via CLI |

### 3. Present

Same three-section format as cluster diagnosis. When a collection finding's true root cause is at the cluster level (e.g. CU saturation), say so explicitly and reference the cluster report.

## Guidance

- **Read-only.** Never run `delete`, `drop`, `release`, `resume`, `suspend`, `create`, `update`, index create/drop, or any mutating data-plane command as part of diagnosis. Present them as suggestions only.
- **Cite evidence.** Every finding must reference the command that produced it and the observed value. No unsourced claims.
- **Mark uncertainty.** Index parameter recommendations, replica counts, and CU sizing depend on workload specifics the CLI cannot observe. Phrase as "starting point, benchmark to confirm."
- **Know the limits.** The CLI cannot see per-query plans, segment-level state, compaction backlog, GC, server-internal queues, or alert-rule state. If a question requires those, say so plainly rather than guessing.
- **Prefer parallel collection.** When multiple read-only commands are independent, run them in parallel to keep the diagnosis fast.
- **JSON for parsing, charts for humans.** Use `-o json` (optionally with `--query`) when comparing values against thresholds; use the default chart output when surfacing trends back to the user.
- **Honor context.** If the user has multiple clusters, confirm which one before collecting. If the collection is in a non-default database, thread `--database` through every command.

Files: 1

Size: 7.5 KB

Complexity: 12/100

Category: Cloud & DevOps

Source: https://github.com/zilliztech/zilliz-plugin/tree/main/plugins/zilliz/skills/diagnose

Related in Cloud & DevOps

appbuilder-action-scaffolder

Included

Create, implement, deploy, and debug Adobe Runtime actions with consistent layout, validation, and error handling. Use this skill whenever the user needs to add actions to an App Builder project, understand action structure (params, response format, web/raw actions), configure actions in the manifest, use App Builder SDKs (State, Files, Events, database), deploy and invoke actions via CLI, debug action issues, or implement patterns such as webhook receivers, custom event providers, journaling consumers, large payload redirects, action sequence pipelines, and Asset Compute workers. Also trigger when users mention serverless functions in Adobe context, action logging, IMS authentication for actions, or cron-style scheduled actions.

Cloud & DevOpsscripts

orchestrating-datacloud

Included

Salesforce Data Cloud product orchestrator for connect→prepare→harmonize→segment→act workflows. Use this skill when the user needs a multi-step Data Cloud pipeline, cross-phase troubleshooting, or data space and data kit management. TRIGGER when: user needs a multi-step Data Cloud pipeline, asks to set up or troubleshoot Data Cloud across phases, manages data spaces or data kits, or wants a cross-phase sf data360 workflow. DO NOT TRIGGER when: work is isolated to a single phase (use the matching phase-specific skill), the task is STDM/session tracing/parquet telemetry (use observing-agentforce), standard CRM SOQL (use querying-soql), or Apex implementation (use generating-apex).

Cloud & DevOpsscripts

github-project-automation

Included

Automate GitHub repository setup with CI/CD workflows, issue templates, Dependabot, and CodeQL security scanning. Includes 12 production-tested workflows and prevents 18 errors: YAML syntax, action pinning, and configuration. Use when: setting up GitHub Actions CI/CD, creating issue/PR templates, enabling Dependabot or CodeQL scanning, deploying to Cloudflare Workers, implementing matrix testing, or troubleshooting YAML indentation, action version pinning, secrets syntax, runner versions, or CodeQL configuration. Keywords: github actions, github workflow, ci/cd, issue templates, pull request templates, dependabot, codeql, security scanning, yaml syntax, github automation, repository setup, workflow templates, github actions matrix, secrets management, branch protection, codeowners, github projects, continuous integration, continuous deployment, workflow syntax error, action version pinning, runner version, github context, yaml indentation error

Cloud & DevOpsscripts

sf-datacloud

Included

Salesforce Data Cloud product orchestrator for connect→prepare→harmonize→segment→act workflows. TRIGGER when: user needs a multi-step Data Cloud pipeline, asks to set up or troubleshoot Data Cloud across phases, manages data spaces or data kits, or wants a cross-phase `sf data360` workflow. DO NOT TRIGGER when: work is isolated to a single phase (use the matching sf-datacloud-* skill), the task is STDM/session tracing/parquet telemetry (use sf-ai-agentforce-observability), standard CRM SOQL (use sf-soql), or Apex implementation (use sf-apex).

Cloud & DevOpsscripts

fabric-cli

Included

Use this skill for Fabric.so CLI workflows with the `fabric` terminal command: diagnose/install/login, search or browse a Fabric library, save notes/links/files, create folders, ask the Fabric AI assistant, manage tasks/workspaces, generate shell completion, check subscription usage, produce JSON output, and use Fabric as persistent agent memory. Do not use for Microsoft Fabric/Azure/Power BI `fab`, Daniel Miessler's Fabric framework, Python Fabric SSH, Fabric.js, or textile/fashion fabric.

Cloud & DevOpsscripts

lark

Included

Lark/Feishu CLI skills: lark-cli operations for docs, markdown, sheets, base, calendar, im, mail, task, okr, drive, wiki, slides, whiteboard, apps, approval, attendance, contact, vc, minutes, event. Use when the user needs to operate Lark/Feishu resources via lark-cli, send messages, manage documents, spreadsheets, calendars, tasks, OKRs, deploy web pages, or any Feishu/Lark workspace operations.

Cloud & DevOpsscripts