temporal-cloud

Fix Temporal Cloud connection, auth, and config problems. Use when users hit login failures, can't connect to Cloud, get x509/TLS errors, have namespace or endpoint mismatches, paste broken SDK connection snippets, are confused about which endpoint to use, see "no pollers" or RESOURCE_EXHAUSTED, struggle with PrivateLink/PSC, or need help setting up a new namespace. Also use for HA namespace failover and DNS issues. Not for worker performance tuning or scaling.

Backend & APIs

What this skill does


# Temporal Cloud Skill

Help users diagnose and resolve Temporal Cloud connectivity, authentication, and configuration issues using `tcld` and the `temporal` CLI.

## Core Philosophy

Cloud issues are frustrating because they sit at the intersection of configuration, networking, authentication, and Temporal-specific code. Most problems fall into predictable patterns. This skill provides systematic diagnosis to quickly identify root causes and prescribe fixes.

**References:**
- See `references/cloud-troubleshooting-reference.md` for full CLI command reference and error codes
- See `references/common-scenarios.md` for step-by-step setup walkthroughs
- [Environment configuration docs](https://docs.temporal.io/develop/environment-configuration) - SDK setup for connecting to Cloud
- [HA namespace connectivity](https://docs.temporal.io/cloud/high-availability/ha-connectivity) - multi-region endpoint and DNS setup
- [Dev Success troubleshooting guide](https://github.com/temporalio/dev-success/blob/main/troubleshooting-connection-issues-to-temporal-cloud.md) - companion connection troubleshooting guide

**Out of scope:** Worker performance tuning, scaling, metrics interpretation, SDK-specific config, deployment patterns. Those topics are covered by separate worker-focused skills.

## Issue Classification

| Category | Key Symptoms | First Check |
|----------|--------------|-------------|
| **tcld Login** | login failed, token refresh failed, wrong account | `tcld account get` |
| **Connection/Auth** | can't connect, access denied, handshake failures | Endpoint format + DNS + port connectivity |
| **Ambiguous Runtime Errors** | `context deadline exceeded`, `workflow is busy` | Identify the operation and layer first |
| **mTLS/Certs** | x509 errors, unknown authority, expired | `openssl x509 -enddate -noout -in <cert>` |
| **Namespace** | namespace not found, SNI mismatch | Namespace name format |
| **HA / Failover** | failover not working, wrong region, stale DNS | DNS CNAME resolution |
| **Worker** | tasks not picked up, stale connections | `temporal task-queue describe` |
| **Private Connectivity** | PrivateLink/PSC errors | VPC endpoint status |
| **Rate Limiting** | RESOURCE_EXHAUSTED | Actions per second (APS) limits |
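
The classification table can be read as a keyword lookup. The following is a minimal Python sketch, assuming plain substring matching on the pasted error text. `CATEGORY_HINTS` and `classify_symptom` are hypothetical names for illustration, not part of `tcld` or the `temporal` CLI, and the keywords come only from the symptoms listed in the table.

```python
# Hypothetical helper: map a pasted error message to a category from the table.
# Order matters: ambiguous runtime errors are checked first so that
# "RESOURCE_EXHAUSTED: Workflow is busy" is not treated as plain rate limiting.
CATEGORY_HINTS = [
    ("Ambiguous Runtime Errors", ("context deadline exceeded", "workflow is busy")),
    ("tcld Login", ("login failed", "token refresh failed")),
    ("mTLS/Certs", ("x509", "unknown authority", "expired")),
    ("Namespace", ("namespace not found", "sni mismatch")),
    ("Private Connectivity", ("privatelink", "private service connect")),
    ("Rate Limiting", ("resource_exhausted",)),
    ("Worker", ("tasks not picked up", "stale connection")),
    ("Connection/Auth", ("can't connect", "access denied", "handshake")),
]


def classify_symptom(message: str) -> str:
    """Return the first category whose keywords appear in the message."""
    text = message.lower()
    for category, keywords in CATEGORY_HINTS:
        if any(keyword in text for keyword in keywords):
            return category
    return "Unclassified"
```

A result of `Unclassified` means the message matched none of the listed symptoms, so the questions in Step 1 still need to be asked.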

## The Process

### Step 1: Identify the Category

Ask the user:
- **What's the exact error message?** (copy-paste if possible)
- **What are you trying to do?** (tcld command, starting workers, running workflows)
- **What changed recently?** (new certs, new namespace, new region)

### Step 2: Gather Context

**For SDK/client snippet reviews:**
- Which auth method are you using: API key or mTLS?
- Which SDK and version are you using?
- What exact `HostPort` / address are you using?
- What exact Namespace are you using?
- Is this SDK code, `temporal` CLI, or `tcld`?

**For tcld issues:**
- Can you run `tcld account get`?
- Do you have multiple Temporal accounts?

**For connection issues:**
- What's your exact address / `HostPort`?
- Are you using mTLS or API keys?
- Which SDK and version are you using?
- Is there a firewall or proxy between you and Cloud?

**For ambiguous runtime errors:**
- Where exactly do you see the error: workflow start, signal/update, polling, querying, logs?
- Is this happening before work starts, while polling, or while workflow code is already running?
- Are pollers present on the relevant task queue?
- Did this start after a traffic spike, deploy, or config change?

**For certificate issues:**
- When were the certs generated?
- Which CA was used?
- Is the CA certificate uploaded to the namespace?
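
When certificate age is in question, the `notAfter=...` line printed by `openssl x509 -enddate -noout -in <cert>` can be converted into days remaining. This is a minimal sketch using only the Python standard library. `days_until_expiry` is a hypothetical helper name, and it assumes the input is the single `notAfter=` line that OpenSSL prints.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(enddate_line: str, now: Optional[float] = None) -> float:
    """Days until the certificate expires (negative if already expired).

    `enddate_line` is expected to look like:
        notAfter=Jan  5 09:34:43 2018 GMT
    """
    # Keep only the timestamp after the "notAfter=" prefix.
    _, _, not_after = enddate_line.strip().partition("=")
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

A negative result points at an expired certificate. A positive result rules out expiry and leaves the CA questions above as the next check.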

**For worker issues:**
- Are workers running? How many?
- What does `temporal task-queue describe` show?
- Are there any errors in the worker logs?

### Step 3: Apply Decision Tree

Use the appropriate decision tree based on category (see below).

### Step 4: Provide Fix

Give specific commands to resolve the issue, with verification steps.

Always include a confidence score for the proposed diagnosis or fix:
- `Confidence: 9-10/10` when the symptom, operation, and confirming signals line up cleanly
- `Confidence: 6-8/10` when the evidence is good but one plausible alternative remains
- `Confidence: 1-5/10` when the issue is still ambiguous and the "fix" is really the next discriminating check

If the problem is ambiguous, say so explicitly and keep the recommendation scoped to the next check rather than presenting a speculative root cause as settled.
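
The confidence bands above can be sketched as a small formatter. `confidence_line` is a hypothetical helper for illustration. The band descriptions paraphrase the three bullets and the thresholds match them.

```python
def confidence_line(score: int) -> str:
    """Return a `Confidence: N/10` line annotated with the band it falls in."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 9:
        band = "symptom, operation, and confirming signals line up"
    elif score >= 6:
        band = "good evidence, one plausible alternative remains"
    else:
        band = "still ambiguous; recommend the next discriminating check"
    return f"Confidence: {score}/10 ({band})"
```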

## Decision Trees

### tcld Login Issues

```
Symptom: tcld login not working
│
├─ Can `tcld account get` run?
│  ├─ Yes → Login is valid; continue with account verification
│  └─ No → Run `tcld login`
│
├─ Token refresh failed?
│  └─ tcld logout && tcld login
│
├─ Wrong organization/account?
│  ├─ tcld account get
│  └─ Verify the expected namespace appears in `tcld namespace list`
│
└─ "unauthorized" or auth errors?
   └─ tcld logout && tcld login
```
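
The first branch of this tree can be sketched in Python. This is a hedged illustration: `run_ok` and `next_login_step` are hypothetical helpers, and the sketch assumes `tcld account get` exits non-zero when the login is invalid, which should be confirmed against your `tcld` version.

```python
import subprocess
from typing import Callable, Sequence


def run_ok(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True).returncode == 0
    except FileNotFoundError:
        # The binary is not installed or not on PATH.
        return False


def next_login_step(run: Callable[[Sequence[str]], bool] = run_ok) -> str:
    """Suggest the next step from the login decision tree."""
    if run(["tcld", "account", "get"]):
        return "Login is valid; verify the expected namespace with `tcld namespace list`"
    return "Run `tcld logout && tcld login`, then retry `tcld account get`"
```

The runner is passed in as a parameter so the logic can be exercised without `tcld` installed.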

### Connection Failures

**Docs:** [Environment configuration](https://docs.temporal.io/develop/environment-configuration) - SDK connection options

**Endpoint check before network debugging:**

| Use case | Recommended endpoint | Notes |
|----------|---------------------|-------|
| Workers & clients (all auth) | `<namespace>.<account>.tmprl.cloud:7233` | **Namespace Endpoint** - works for both mTLS and API key auth. Recommended for all namespaces. |
| Multi-region HA (advanced) | `<region>.<cloud_provider>.api.temporal.io:7233` | Regional Endpoint - only needed for advanced HA routing. See [namespace access docs](https://docs.temporal.io/cloud/namespaces#access-namespaces). |
| tcld / Cloud Ops API | `saas-api.tmprl.cloud` | Control plane |

**Exception:** Namespaces using Flexible Auth (pre-release) cannot use Namespace Endpoints yet.
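
The endpoint table can be turned into a quick format check to run before any network debugging. The following is a minimal sketch, assuming that namespace, account, region, and provider labels contain only lowercase letters, digits, and hyphens. That character set is an assumption, and `classify_endpoint` is a hypothetical helper name.

```python
import re

# Patterns follow the formats in the endpoint table; the allowed characters
# in each label are an assumption made for illustration.
NAMESPACE_ENDPOINT = re.compile(r"[a-z0-9-]+\.[a-z0-9-]+\.tmprl\.cloud:7233")
REGIONAL_ENDPOINT = re.compile(r"[a-z0-9-]+\.[a-z0-9-]+\.api\.temporal\.io:7233")
CONTROL_PLANE_HOST = "saas-api.tmprl.cloud"


def classify_endpoint(address: str) -> str:
    """Classify an address as namespace, regional, control-plane, or unrecognized."""
    addr = address.strip().lower()
    host = addr.split(":", 1)[0]
    if host == CONTROL_PLANE_HOST:
        return "control-plane"
    if NAMESPACE_ENDPOINT.fullmatch(addr):
        return "namespace"
    if REGIONAL_ENDPOINT.fullmatch(addr):
        return "regional"
    return "unrecognized"
```

An address without the `:7233` port, or with a stale format, comes back as `unrecognized`. That is the cue to fix the address before checking DNS or firewalls.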

```
Symptom: Can't connect to Temporal Cloud
│
├─ Check: Using Namespace Endpoint?
│  ├─ Using regional endpoint (`*.api.temporal.io`) without HA need?
│  │  └─ Switch to Namespace Endpoint (`<ns>.<acct>.tmprl.cloud:7233`)
│  ├─ Using old/stale endpoint format?
│  │  └─ Switch to Namespace Endpoint
│  └─ Endpoint looks correct → Continue
│
├─ Check: DNS resolution
│  └─ nslookup <host-from-address>
│     ├─ Fails → DNS issue (check network, VPN)
│     └─ Succeeds → Continue
│
├─ Check: Port connectivity
│  └─ nc -zv <host-from-address> 7233
│     ├─ Fails → Firewall blocking port 7233
│     └─ Succeeds → Continue
│
├─ Check: TLS handshake
│  └─ openssl s_client -connect <address>
│     ├─ Fails → Certificate issue (see mTLS tree)
│     └─ Succeeds → Continue
│
└─ Check: Temporal CLI test
   └─ temporal workflow list --limit 1 --address ...
      ├─ PERMISSION_DENIED → Check namespace name format
      ├─ UNAUTHENTICATED → Certificate not accepted
      └─ Works → Connection OK, issue elsewhere
```
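
The DNS, port, and TLS checks in this tree can also be scripted where `nslookup`, `nc`, or `openssl` are unavailable. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and the TLS step presents no client certificate, so it only shows whether a handshake with the server can be attempted. It does not prove that mTLS or API key auth will be accepted.

```python
import socket
import ssl


def split_address(address: str) -> tuple[str, int]:
    """Split `host:port` into its parts, rejecting addresses without a port."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def check_dns(host: str) -> bool:
    """True if the host name resolves."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def check_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_tls(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TLS handshake (server certificate verified) completes."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:  # ssl.SSLError is a subclass of OSError
        return False


def first_failing_check(address: str) -> str:
    """Walk the checks in tree order and name the first one that fails."""
    host, port = split_address(address)
    if not check_dns(host):
        return "dns"
    if not check_port(host, port):
        return "port"
    if not check_tls(host, port):
        return "tls"
    return "ok"
```

A result of `ok` corresponds to reaching the final branch of the tree, where the `temporal workflow list` test separates namespace and credential problems from network ones.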

### Ambiguous Runtime Errors

Do not assume these are pure connectivity failures. Classify them by operation first.

| Error text | Common interpretations | First discriminator |
|------------|------------------------|---------------------|
| `context deadline exceeded` | wrong endpoint, network timeout, oversized payload, blocked execution path, client-side timeout | Where in the flow does it occur? |
| `workflow is busy` / `RESOURCE_EXHAUSTED: Workflow is busy` | operation-level contention, workload pressure, confusing user-facing error semantics | Which operation returned it? |
| `no pollers` | no connected workers, workers present but misconfigured, stale/misleading metrics | Does `temporal task-queue describe` show pollers? |

Use this decision sequence:

```
Symptom: ambiguous runtime error
│
├─ Check: Which operation returned the error?
│  ├─ start / signal / update / query request
│  ├─ poll loop / worker logs
│  └─ UI / metrics only
│
├─ Check: Is work reaching a task queue?
│  ├─ No pollers listed
│  │  └─ Treat as worker connectivity / config until proven otherwise
│  ├─ Pollers listed, backlog growing
│  │  └─ Worker capacity / tuning issue (out of scope for this skill)
│  └─ Pollers listed, no backlog issue
│     └─ Continue
│
├─ For `context deadline exceeded`
│  ├─ Happens before any work 