115 changes: 115 additions & 0 deletions docs/talos/CLAUDE.md
# Documentation Instructions

## JSON Processing

Use `jq` instead of `python3` for all JSON operations in code examples:

- **Pretty-print:** `| jq .` not `| python3 -m json.tool`
- **Extract required fields:** `| jq -er '.field'` (the `-e` flag exits non-zero on `null` so `set -e` aborts the snippet instead
of silently exporting an empty value).
- **Extract optional fields:** `| jq -r '.field'` is fine when the field may legitimately be missing.
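The `-e` behavior is easy to see in isolation (a minimal illustration; the JSON payloads are made up):

```shell
# `-e` ties jq's exit status to its last output: null or false exits non-zero,
# so under `set -e` a missing required field aborts the block.
echo '{"key_id": "abc"}' | jq -er '.key_id'    # prints "abc", exits 0
echo '{"name": "my-key"}' | jq -r '.key_id'    # prints "null", exits 0
echo '{"name": "my-key"}' | jq -er '.key_id'   # prints "null", exits 1
```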

**Never write curl output to temporary files.** Capture responses in shell variables instead. File-based operations fail when
`/tmp` doesn't exist or isn't writable.

## Passing state between doctest blocks

Doctest runs each code block in a fresh `bash -eu -o pipefail` subprocess and auto-captures the exported environment after each
successful block. To make a value available to the next block, just `export` it — no manual write to `$DOCTEST_ENV_FILE` is
needed.

```bash
# Good: variable-based, exported for the next block, asserts the field is present
RESPONSE=$(curl -s -X POST "$URL/v2alpha1/admin/issuedApiKeys" \
-H "Content-Type: application/json" \
-d '{"name": "my-key"}')
echo "$RESPONSE" | jq .
export KEY_ID=$(echo "$RESPONSE" | jq -er '.key_id')

# Bad: file-based
curl -s ... -o /tmp/response.json
jq . /tmp/response.json
KEY_ID=$(jq -r '.key_id' /tmp/response.json)
rm -f /tmp/response.json

# Bad: redirecting to $DOCTEST_ENV_FILE (legacy; auto-capture handles this now)
KEY_ID=$(echo "$RESPONSE" | jq -r '.key_id')
echo "export KEY_ID=$KEY_ID" >> "$DOCTEST_ENV_FILE"
```
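A later block in the same document can then consume the exported value directly (a minimal sketch; `KEY_ID` is assumed to come from a preceding block like the one above):

```shell
# Doctest auto-captured the exported environment of the previous block,
# so KEY_ID is already set here -- no $DOCTEST_ENV_FILE plumbing needed.
echo "issued key: $KEY_ID"
```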

## API Field Documentation

Integration guides under `integrate/` must NOT duplicate API field tables, error code tables, or enum tables. These are maintained
in the canonical references:

- **Field tables** -> auto-generated API reference at `reference/api/*.api.mdx`
- **Error codes** -> `reference/error-codes.md`

### What belongs in integration guides

- **Workflow and examples**: curl commands, step-by-step instructions, the "how" and "why"
- **Brief inline mentions**: 1-3 sentences highlighting the most important fields (e.g., "The response includes a `secret` field
-- store it securely")
- **Conceptual comparisons**: tables comparing patterns, trade-offs, or usage scenarios (e.g., JWT vs macaroon)
- **Operational constraints**: limits, cache control headers, retry strategies
- **Links to reference**: always link to the canonical source for complete field/error details

### What does NOT belong in integration guides

- Full request/response field tables (use API reference link instead)
- Error code enum tables (use error codes reference link instead)
- Query parameter tables (use API reference link instead)
- Revocation reason enum tables (use API reference link instead)

### Link format

**All links MUST be relative links to markdown/mdx files with the file extension.** Never use absolute links (starting with `/`)
or links without a file extension. Hashbang anchors are allowed after the file extension.

- Links to `.md` files: `[text](../reference/error-codes.md#section)`
- Links to `.api.mdx` files: `[text](../reference/api/admin-issue-api-key.api.mdx)`
- Links to directory index pages: `[text](../operate/cache/index.md)` (never `../operate/cache/`)
- Links within the same directory: `[text](./sibling-page.md)`

```text
# Good: relative links with file extensions
For the complete field reference, see the [IssueAPIKey API reference](../reference/api/admin-issue-api-key.api.mdx).
For the full list of error codes, see the [error codes reference](../reference/error-codes.md#verification-error-codes).

# Bad: absolute links without file extensions
For the complete field reference, see the [IssueAPIKey API reference](/reference/api/admin-issue-api-key).
For the full list of error codes, see the [error codes reference](/reference/error-codes#verification-error-codes).
```
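A rough way to catch violations mechanically (a hypothetical lint sketch, not part of the toolchain; the pattern only flags absolute link targets):

```shell
# Flag markdown links whose target starts with "/" (absolute) in a stream
# of doc lines; grep exits 0 when at least one offender is found.
printf '%s\n' \
  'Good: [text](../reference/error-codes.md#section)' \
  'Bad: [text](/reference/error-codes#verification-error-codes)' \
  | grep -nE '\]\(/'
# → 2:Bad: [text](/reference/error-codes#verification-error-codes)
```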

### API reference URL pattern

API reference pages are `.api.mdx` files at `reference/api/{plane}-{method}.api.mdx` where:

- `{plane}` is `admin` or `data`
- `{method}` is the kebab-case method name (e.g., `issue-api-key`, `verify-api-key`)

The API overview page is `reference/api/ory-talos-api.info.mdx`.
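The pattern composes mechanically; a sketch with illustrative values:

```shell
# Build the reference path from plane ("admin" or "data") and the
# kebab-case method name, per the pattern above.
plane="admin"
method="issue-api-key"
echo "reference/api/${plane}-${method}.api.mdx"
# → reference/api/admin-issue-api-key.api.mdx
```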

### Notes and callouts

Separate the body of a note/callout from its `:::` delimiters with blank lines, or it will render incorrectly.

**Incorrect:**

```md
:::note Internal package The Go client is in an `internal/` package and cannot be imported by external Go modules. :::
```


**Correct:**

```md
:::note Internal package

The Go client is in an `internal/` package and cannot be imported by external Go modules.

:::
```
213 changes: 213 additions & 0 deletions docs/talos/concepts/architecture.md
---
title: Architecture
---

# Architecture

Talos separates API key management into two planes.

## Admin plane

The admin plane handles all key management and verification operations: key issuance, rotation, revocation, token derivation,
JWKS, and verification (single and batch). It is exposed only to internal services and clients with admin credentials.

Endpoints: `/v2alpha1/admin/`, including `/v2alpha1/admin/apiKeys:verify` and `/v2alpha1/admin/apiKeys:batchVerify`.

For low-latency verification close to clients, deploy the commercial [edge proxy](../operate/deploy/edge-proxy.md) as a sidecar.
The proxy caches admin verify responses locally, so applications get sub-millisecond cache hits without exposing the admin plane
publicly.

## Data plane

The data plane handles self-service operations that credential holders perform with proof of possession of the credential itself; no admin authentication is required.

Endpoint: `POST /v2alpha1/apiKeys:selfRevoke`

## Verification flow

```
Client --> Verifier --> Cache (hit?) --> Database --> Response
| ^
+-- cache hit ---------------+
```

1. Client sends credential to `POST /v2alpha1/admin/apiKeys:verify`
2. Talos identifies the credential type (generated, imported, JWT, macaroon)
3. For generated keys, the UUID is extracted from the token identifier
4. For imported keys, a tenant-scoped SHA-512/256 hash is computed
5. Database lookup (or cache hit) returns key metadata
6. Response includes key status, owner, scopes, and metadata

## Deployment topologies

| Topology | Edition | Description |
| ------------ | ---------- | -------------------------------------------------------------------- |
| Single-node | OSS | One process serves both planes |
| Split planes | Commercial | Admin and data planes as separate deployments |
| Edge proxy | Commercial | Sidecar proxy at the edge that caches admin verify responses locally |

Both planes share the same database. Verification uses caching (memory or Redis) to minimize database load.

## Ports

| Port | Purpose |
| ---- | ------------------ |
| 4420 | HTTP API (default) |
| 4422 | Prometheus metrics |

## Design philosophy

### Separation of concerns

The system is divided into distinct layers:

- **Admin plane**: Management operations (CRUD for keys, rotation, import, token derivation)
- **Data plane**: High-throughput verification operations
- **Persistence layer**: Database abstraction with pluggable drivers
- **Cache layer**: Performance optimization with multiple backends

This separation allows independent scaling of components, different SLOs for different operations (admin targets \<100ms p99, data
plane targets \<3ms p99), and clear boundaries between responsibilities.

### Production-first design

- Hard isolation between admin and data operations
- Metrics, traces, and structured logs are emitted by default
- Graceful degradation when the database or cache backend is unavailable
- Zero-downtime deployments via rolling updates and stateless verification

### Performance characteristics

- Self-contained tokens (JWT/macaroon) enable stateless verification
- HMAC-SHA256 keeps the revocation check on the order of microseconds; bcrypt would cap a single core at roughly 10 verifications
per second
- LRU caching for hot paths
- Minimal allocations in the verification path
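The primitive behind the revocation check can be sketched with openssl (the message and secret below are placeholders; the real input layout and key handling are internal to Talos):

```shell
# One HMAC-SHA256 over presented key material costs microseconds of CPU,
# versus bcrypt's deliberate ~100ms per hash (about 10 verifications/sec/core).
secret="server-side-hmac-secret"   # placeholder value, not a real Talos secret
printf '%s' "api-key-material" | openssl dgst -sha256 -hmac "$secret"
```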

## System architecture

```
Clients (CLI, SDK, HTTP)
|
v
+----------------------------------+
| HTTP Server (grpc-gateway) |
| Port: 4420 |
+----------------------------------+
|
v
+----------------------------------+
| Middleware |
| Logging, Metrics, Tracing |
+----------------------------------+
|
+-----+----------+
| |
v v
+-----------+ +-----------+
| Admin | | Data |
| Plane | | Plane |
| <100ms | | <3ms p99 |
+-----------+ +-----------+
| |
v v
+----------------------------------+
| Service Layer |
| Business logic, Validation |
+----------------------------------+
|
+-----+----------+
| |
v v
+-----------+ +-----------+
| Persist. | | Cache |
| SQLite | | Memory |
| PG/MySQL | | LRU |
| CRDB | | Redis |
+-----------+ +-----------+
```

All requests enter through a single HTTP server built on grpc-gateway (port 4420) and pass through middleware for logging,
metrics, and tracing before being routed to the appropriate plane.

## Component overview

### HTTP server

The API layer uses grpc-gateway for HTTP/JSON routing with protobuf-based schemas. It serves both planes through a single port,
handles CORS and compression, and exposes OpenAPI documentation.

### Service layer

Business logic is split between the admin plane service (key lifecycle, import, token derivation, input validation) and the data
plane verifier (token parsing, signature verification, revocation checking, cache management). The verifier is optimized for the
hot path with minimal allocations.

### Persistence

Database access uses sqlc-generated type-safe queries with pluggable drivers:

- **SQLite** -- OSS edition, zero-config, suitable for millions of keys
- **PostgreSQL** -- production workloads
- **MySQL** -- production workloads
- **CockroachDB** -- distributed deployments

Schema changes are managed through versioned migrations using golang-migrate.

### Cache

The cache layer reduces database load on the verification path:

- **Memory LRU** (OSS) -- local to each instance, configurable size limits
- **Redis** (Commercial) -- distributed, supports cluster and sentinel modes
- **Hierarchical L1+L2** (Commercial) -- memory for speed, Redis for shared state

### Crypto

Talos supports multiple JWT signing algorithms and a separate API key hashing mechanism:

- **JWT signing algorithms**
- `Ed25519 (EdDSA)` -- default, fastest signing and smallest keys
- `RSA-2048/4096 (RS256)` -- legacy compatibility
- **API key hashing**
- `HMAC-SHA256` -- used for API key revocation checks (\<1ms with constant-time comparison)

The JWT signing algorithm is determined per JWK by its `alg` field, so one JWKS can contain keys for multiple signing algorithms
at the same time.
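For illustration, a JWKS of this shape is possible (standard RFC 7517 fields only; the key material is elided with placeholders):

```json
{
  "keys": [
    { "kty": "OKP", "crv": "Ed25519", "alg": "EdDSA", "kid": "ed25519-key", "x": "..." },
    { "kty": "RSA", "alg": "RS256", "kid": "rsa-key", "n": "...", "e": "AQAB" }
  ]
}
```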

### Observability

Built-in instrumentation across three pillars:

- **Metrics** -- Prometheus exposition on port 4422 with request latency histograms and error rate counters
- **Tracing** -- OpenTelemetry with W3C Trace Context propagation, configurable sampling, OTLP and Jaeger exporters
- **Logging** -- structured JSON logging via slog with correlation IDs and contextual fields

## Scalability

### Small (\<1k RPS)

A single Talos instance handles both planes with SQLite and an in-memory LRU cache. No external dependencies required.

- OSS edition sufficient
- 1 CPU, 512MB RAM
- Cost: $5-10/month

### Medium (10-50k RPS)

Separate admin and data plane deployments behind a load balancer. PostgreSQL replaces SQLite for durability. Redis provides shared
caching across data plane instances.

- Commercial edition
- Auto-scaling for data plane
- Cost: $100-500/month

### Large (200k+ RPS)

A cluster of 10-50+ stateless data plane instances with auto-scaling, backed by a distributed Redis cache and PostgreSQL with read
replicas and connection pooling. Supports multi-region deployment.

- Commercial edition
- Regional data plane deployment
- Cost: $1-5k/month
53 changes: 53 additions & 0 deletions docs/talos/concepts/caching.md
---
title: Caching and consistency
---

# Caching and consistency

Talos caches verification results to reduce database load and improve latency. The OSS edition ships a no-op cache; in-memory and
Redis backends are commercial-only — see [Caching](../operate/cache/index.md) for backend selection.

## How it works

When caching is enabled, the first verification request for a key hits the database. Subsequent requests within the cache TTL are
served from cache without a database lookup.

## Cache types

| Type | Scope | Use case |
| ------ | ----------- | ----------------------------------- |
| Memory | Per-process | Single node or per-instance caching |
| Redis | Shared | Multi-instance deployments |

## Eventual consistency

Caching introduces eventual consistency for revocation:

1. Admin revokes a key via `POST /v2alpha1/admin/apiKeys/{key_id}:revoke`
2. The revocation takes effect in the database immediately
3. Cached verification results for that key remain valid until the cache entry expires
4. After TTL expiry, the next verification hits the database and returns `is_active: false`

## Cache bypass

To force a database lookup (bypassing cache), include the `Cache-Control: no-cache` header:

```bash
curl -X POST http://localhost:4420/v2alpha1/admin/apiKeys:verify \
-H "Content-Type: application/json" \
-H "Cache-Control: no-cache" \
-d '{"credential": "..."}'
```

See the [quickstart revocation check](../quickstart/index.mdx) and the [curl SDK reference](../integrate/sdk/curl.md) for tested
examples using cache bypass.

## TTL guidelines

| TTL | Trade-off |
| ----- | ------------------------------------------------- |
| `1m` | Fast revocation propagation, higher database load |
| `5m` | Balanced (recommended default) |
| `30m` | Low database load, slower revocation propagation |

See [Cache operations guide](../operate/cache/index.md) for configuration details.