141 changes: 141 additions & 0 deletions internal/embed/skills/cluster-bootstrap/SKILL.md
@@ -0,0 +1,141 @@
---
name: cluster-bootstrap
description: "Bootstrap and join obol-stack across multiple hosts on LAN or cloud. Wraps k3sup over SSH to install k3s on a server node and join agent nodes, then prepares the cluster for `obol stack up`. Single-host, LAN multi-node, and cloud multi-node topologies."
metadata: { "openclaw": { "emoji": "🪴", "requires": { "bins": ["k3sup", "ssh", "python3"] } } }
---

# Cluster Bootstrap

Bootstrap a k3s cluster across one or more hosts (LAN or cloud) and prepare it
to run obol-stack. Wraps [k3sup](https://github.com/alexellis/k3sup) over SSH.

This skill is a **scaffold**. The single-host and LAN-join flows are wired up
and the multi-node design is decided (storage-primary node label, cloudflared
pools — see `references/multi-node-design.md`), but the chart changes that
consume the bootstrap output are tracked in a follow-up ticket. Don't run this
against a production cluster yet.

## When to Use

- Standing up obol-stack on a single Linux host (no Docker / k3d)
- Joining a second/third host on the LAN as an agent node
- Bootstrapping on cloud VMs reachable via SSH

## When NOT to Use

- Local Mac dev — keep using `obol stack up` (k3d + Docker)
- Existing managed k8s (EKS/GKE) — point `KUBECONFIG` at it directly
- Single-host where `obolup.sh` already works

## Topologies

### single-host

One Linux box. k3sup installs k3s, writes kubeconfig locally, done. Equivalent
to `curl -sfL https://get.k3s.io | sh` plus kubeconfig export.

### lan-multi

One server + N agent nodes on the same L2 network. Server is reachable by IP
from all agents.

### cloud-multi

One server + N agents across cloud VMs. Same shape as lan-multi, but the server
must have a routable IP and the security group / firewall must allow 6443/tcp
from the agents.

## Quick Start

```bash
# Single host (current dev box)
python3 scripts/bootstrap.py single --host 192.168.1.50 --user obol \
  --ssh-key ~/.ssh/id_ed25519

# LAN: server + 2 agents
python3 scripts/bootstrap.py server --host 192.168.1.50 --user obol \
  --ssh-key ~/.ssh/id_ed25519
python3 scripts/bootstrap.py join --server-host 192.168.1.50 \
  --host 192.168.1.51 --user obol --ssh-key ~/.ssh/id_ed25519
python3 scripts/bootstrap.py join --server-host 192.168.1.50 \
  --host 192.168.1.52 --user obol --ssh-key ~/.ssh/id_ed25519

# After bootstrap, point obol at the kubeconfig and run stack up:
export KUBECONFIG=$(python3 scripts/bootstrap.py kubeconfig-path)
obol stack up
```

## Subcommands

```
single --host --user --ssh-key [--k3s-channel stable]
       [--storage-primary] [--cloudflared-pool <name>]
    Install k3s on one host. Equivalent to `server` with no agents.

server --host --user --ssh-key [--k3s-channel stable] [--cluster-cidr]
       [--storage-primary] [--no-storage-primary]
       [--cloudflared-pool <name>]
    Install k3s server on the target host. Writes kubeconfig to
    $OBOL_CONFIG_DIR/kubeconfig.yaml with API rewritten to --host.
    Records `obol.org/storage=primary` on the server by default and
    `obol.org/cloudflared-pool=<name>` (default `default`) into
    topology.json.

join --server-host --host --user --ssh-key
     [--cloudflared-pool <name>]
    Install k3s agent on --host and join to --server-host. Records
    `obol.org/cloudflared-pool=<name>` (default `default`).

kubeconfig-path
    Print the absolute path of the kubeconfig this skill writes to.

label --host <name> --label key=value [--label key=value ...]
    Apply ad-hoc node labels (used when storage/tunnel placement
    needs more than the bootstrap conveniences cover).

status
    List nodes, their roles, and the labels relevant to obol-stack.
```
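
As a concrete, illustrative use of `label` (the host and label value are borrowed
from the Quick Start and the pool examples below — adjust to your own inventory):

```bash
# Illustrative only: apply an ad-hoc node label through the skill's `label` subcommand.
python3 scripts/bootstrap.py label --host 192.168.1.51 \
  --label obol.org/cloudflared-pool=edge
```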

## Design Notes (decided)

Full rationale and rejected alternatives in `references/multi-node-design.md`.

### Storage — single primary node

One node carries `obol.org/storage=primary` (the bootstrap server by default).
Stateful Deployments — LiteLLM, Hermes, default obol-agent, OpenClaw — add
`nodeAffinity` to that label so PVCs always land on the same node. Lose the
primary, restore from PVC backup. This is controlled by `--storage-primary`
(on by default) on `bootstrap.py server` / `single`.

### Cloudflared — `pools` list

The cloudflared chart will render one Deployment per entry in
`cloudflared.pools`. Each pool has its own `replicas`, `nodeSelector`, and
Cloudflare credentials, with hostname `PodAntiAffinity` ensuring at most one
replica per node within a pool. Default values ship a single `default` pool
preserving today's behavior; advanced topologies opt in by adding more pools
(e.g. `edge` + `cloud` with separate tunnel tokens).

`bootstrap.py server --cloudflared-pool <name>` and `bootstrap.py join
--cloudflared-pool <name>` record per-node pool labels into `topology.json`.

## Files Written by the Skill

| Path | Purpose |
|------|---------|
| `$OBOL_CONFIG_DIR/kubeconfig.yaml` | k3s admin kubeconfig (API rewritten to server host IP) |
| `$OBOL_CONFIG_DIR/cluster-bootstrap/topology.json` | Inventory of bootstrapped nodes (host, role, labels) |
| `$OBOL_CONFIG_DIR/cluster-bootstrap/server-token` | k3s node token (mode 0600) — used to join agents |
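
The exact `topology.json` schema is owned by `bootstrap.py` and may evolve; a
plausible shape for the Quick Start inventory (field names illustrative, not a
contract) looks like:

```json
{
  "nodes": [
    {
      "host": "192.168.1.50",
      "role": "server",
      "labels": {
        "obol.org/storage": "primary",
        "obol.org/cloudflared-pool": "default"
      }
    },
    {
      "host": "192.168.1.51",
      "role": "agent",
      "labels": { "obol.org/cloudflared-pool": "default" }
    }
  ]
}
```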

## Caveats

- **Not for k3d/local Mac.** Use `obol stack up` for that — k3d-on-Docker is
still the canonical local dev path.
- **Firewalls.** Server: 6443/tcp inbound from agents. All nodes: 8472/udp
(flannel VXLAN) between each other. Cloud: configure SGs accordingly.
- **`OBOL_DEVELOPMENT=true` registry caches** are k3d-only today — they don't
run on the k3sup-bootstrapped k3s cluster yet.
- **`obol stack up` on a real k3s cluster** has not been validated end to end
on this branch; the `obol stack` lifecycle today expects the k3d cluster
name written by `obol stack init`. Treat the post-bootstrap `obol stack up`
as the next milestone, not a finished path.
155 changes: 155 additions & 0 deletions internal/embed/skills/cluster-bootstrap/references/multi-node-design.md
@@ -0,0 +1,155 @@
# Multi-node design notes

Decisions for the multi-node behavior of obol-stack. The `cluster-bootstrap`
skill carries the bootstrap-time flags; the actual chart changes that consume
them live in a separate ticket (see "Implementation status" at the bottom).

## Storage — DECIDED: Option A (storage-primary node)

Today: `internal/embed/infrastructure/base/templates/local-path.yaml` installs
the rancher local-path provisioner with `volumeBindingMode: WaitForFirstConsumer`
and `pathPattern: "{{ .PVC.Namespace }}/{{ .PVC.Name }}"` under
`{{ .Values.dataDir }}`. PVCs pin to whichever node first schedules a consumer
pod; reschedule to a different node breaks the mount.

### Decision: A — single storage-primary node

- One node carries `obol.org/storage=primary`. By default this is the
bootstrap (server) node.
- Every Deployment that owns a PVC adds a soft `nodeAffinity` preferring the
primary, and a hard `nodeAffinity` requiring it for true single-writer state
(LiteLLM, Hermes, the default obol-agent, OpenClaw instances).
- Failure mode is identical to today's single-host k3d: lose the primary,
restore from PVC backup. A single point of failure for state is acceptable
given our small LAN / cloud topologies.
- Helm values gain `storage.primaryLabel` (default `obol.org/storage=primary`)
so charts can opt in via a shared values key.
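
As a sketch of the chart-side change (the actual templates are part of the
follow-up ticket; this assumes the default `storage.primaryLabel`), the hard
variant for single-writer state is an ordinary required node affinity:

```yaml
# Sketch only — pin a single-writer workload to the storage-primary node.
# The soft variant uses preferredDuringSchedulingIgnoredDuringExecution with a
# weight instead of the required block below.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: obol.org/storage
              operator: In
              values: ["primary"]
```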

### Rejected

- **B — Longhorn / OpenEBS Mayastor.** Real PVC migration but ≥3 nodes,
~500MiB RAM/node baseline, new failure modes (stuck volumes, replica
rebalance IO). Reconsider if a deployment actually needs HA state.
- **C — NFS export + dual StorageClass.** SPOF on NFS host; fsync/lease
semantics differ from local disk and would silently break SQLite-style state
(LiteLLM logs DB, BoltDB-backed services). Reconsider if a deployment
only needs to centralize bulk, read-mostly storage.

### Bootstrap surface

- `bootstrap.py server --storage-primary` records `obol.org/storage=primary`
on the server node in topology.json. Apply with `kubectl label node …`
printed by `bootstrap.py label`.
- `bootstrap.py server --no-storage-primary` opts out (e.g. when a separate
storage node will be added later).
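
The label itself is ordinary `kubectl` node labeling; with a placeholder node
name it looks like:

```bash
# Placeholder node name — substitute the server node from `kubectl get nodes`.
kubectl label node <server-node> obol.org/storage=primary
```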

## Cloudflared — DECIDED: Shape 2 (pools)

Today: `internal/embed/infrastructure/cloudflared/templates/deployment.yaml`
renders one Deployment with `replicas: 1` (or 0 when no token/credentials).
Modes: `quickTunnel`, `remoteManaged` (token), `localManaged` (credentials +
config).

### Decision: Shape 2 — `cloudflared.pools` list

Values gain a `pools` list. Each pool is its own Deployment with hostname
PodAntiAffinity so within a pool there is at most one replica per node, and
each pool gets its own Cloudflare credentials (edge vs cloud usually map to
different zones / accounts).

```yaml
# Default values.yaml — single pool, backwards compatible with today.
pools:
  - name: default
    replicas: 1
    # nodeSelector omitted -> any schedulable node
    mode: auto   # auto | local | remote | quick
    quickTunnel:
      url: "http://traefik.traefik.svc.cluster.local:80"
    remoteManaged:
      tokenSecretName: cloudflared-tunnel-token
      tokenSecretKey: TUNNEL_TOKEN
    localManaged:
      secretName: cloudflared-local-credentials
      configMapName: cloudflared-local-config
      tunnelIDKey: tunnel_id
```

Per-pool example for an edge+cloud topology:

```yaml
pools:
  - name: edge
    replicas: 2
    nodeSelector:
      obol.org/cloudflared-pool: edge
    mode: remote
    remoteManaged:
      tokenSecretName: cloudflared-edge-token
      tokenSecretKey: TUNNEL_TOKEN
  - name: cloud
    replicas: 1
    nodeSelector:
      obol.org/cloudflared-pool: cloud
    mode: local
    localManaged:
      secretName: cloudflared-cloud-credentials
      configMapName: cloudflared-cloud-config
      tunnelIDKey: tunnel_id
```

Invariants the chart must enforce:
- Per-pool `requiredDuringSchedulingIgnoredDuringExecution` PodAntiAffinity by
`kubernetes.io/hostname` — at most one replica per node within a pool.
- `quickTunnel` mode caps at `replicas: 1` (per-replica trycloudflare URL).
- Resource names get a per-pool suffix: `cloudflared-<pool>` for the
Deployment. The suffix is omitted only when the single pool is named
`default` and no migration is in flight.
- Validation: each pool must have exactly one of `quickTunnel` (when
`mode=quick`), `remoteManaged` (`mode=remote`), `localManaged`
(`mode=local`), or any of the three when `mode=auto`.
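
A rough sketch of how the per-pool rendering could satisfy these invariants
(hypothetical template fragment — the real chart rewrite is the next-ticket item
below, label keys are illustrative, and mode-specific args/credential wiring is
omitted):

```yaml
{{- /* Sketch only: one Deployment per pool, hostname anti-affinity within the pool. */}}
{{- range .Values.pools }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared-{{ .name }}
spec:
  replicas: {{ .replicas }}
  selector:
    matchLabels:
      app: cloudflared
      pool: {{ .name }}
  template:
    metadata:
      labels:
        app: cloudflared
        pool: {{ .name }}
    spec:
      {{- with .nodeSelector }}
      nodeSelector: {{- toYaml . | nindent 8 }}
      {{- end }}
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: cloudflared
                  pool: {{ .name }}
      containers:
        - name: cloudflared
          image: cloudflare/cloudflared:latest   # pin in real values
          args: ["tunnel", "--no-autoupdate", "run"]   # mode-specific flags omitted
{{- end }}
```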

Footgun documented for users: if `replicas` exceeds the count of nodes
matching `nodeSelector`, the surplus pods stay Pending. The chart NOTES.txt
should print a warning at install time.

### Rejected

- **Shape 1 — DaemonSet per labeled pool.** Hard-caps at one tunnel per
labeled node, which means "more tunnels on edge" requires labeling more
nodes. Doesn't compose when one beefy edge box wants two tunnels.
- **Shape 3 — single Deployment, hostname antiaffinity, replicas knob.** No
way to differentiate edge vs cloud tunnels (different Cloudflare
credentials, different zones). Replicas-exceeds-nodes footgun is the same
but with no value to offset it.

### Bootstrap surface

- `bootstrap.py server --cloudflared-pool <name>` records the pool label on
the server. Default is `default`.
- `bootstrap.py join --cloudflared-pool <name>` records the pool label on
the agent. Repeat with different pool names to build edge/cloud topology.
- The recorded labels are written into `topology.json` so the chart-rewrite
ticket can read them when generating per-pool values.
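
A worked, illustrative pair of joins for the edge/cloud example above, reusing
the Quick Start server and placeholder agent addresses:

```bash
# Illustrative hosts: a LAN box joins the `edge` pool, a cloud VM joins `cloud`.
python3 scripts/bootstrap.py join --server-host 192.168.1.50 \
  --host 192.168.1.60 --user obol --ssh-key ~/.ssh/id_ed25519 \
  --cloudflared-pool edge
python3 scripts/bootstrap.py join --server-host 192.168.1.50 \
  --host 203.0.113.10 --user obol --ssh-key ~/.ssh/id_ed25519 \
  --cloudflared-pool cloud
```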

## Other multi-node concerns (out of scope for this skill, tracked here)

- **Dev registry cache**: today configured per-cluster in `registries.yaml`,
scoped to a single localhost cache on the dev box. Multi-node needs each
agent to either reach the cache over LAN or have its own cache.
- **Host Ollama auto-detection**: `autoConfigureLLM` detects models on the
host where `obol stack up` ran. In multi-node we need to either disable
this (require `obol model setup custom`) or aggregate across nodes.
- **Traefik / Gateway**: a single Service IP works fine on a multi-node cluster
out of the box; nothing to do unless we want active-active ingress per region.

## Implementation status

| Piece | Status |
|------------------------------------------------------------|--------|
| `bootstrap.py` records storage-primary + cloudflared-pool | done (this PR) |
| `local-path.yaml` chart honors `storage.primaryLabel` | next ticket |
| Stateful Deployments add `nodeAffinity` to primary label | next ticket |
| `cloudflared` chart `range` over `pools` | next ticket |
| `obol stack up` consumes `topology.json` for chart values | next ticket |
| End-to-end multi-node smoke test | follow-up |