diff --git a/internal/embed/skills/cluster-bootstrap/SKILL.md b/internal/embed/skills/cluster-bootstrap/SKILL.md
new file mode 100644
index 00000000..500bdbc5
--- /dev/null
+++ b/internal/embed/skills/cluster-bootstrap/SKILL.md
@@ -0,0 +1,141 @@
+---
+name: cluster-bootstrap
+description: "Bootstrap and join obol-stack across multiple hosts on LAN or cloud. Wraps k3sup over SSH to install k3s on a server node and join agent nodes, then prepares the cluster for `obol stack up`. Supports single-host, LAN multi-node, and cloud multi-node topologies."
+metadata: { "openclaw": { "emoji": "🊴", "requires": { "bins": ["k3sup", "ssh", "python3"] } } }
+---
+
+# Cluster Bootstrap
+
+Bootstrap a k3s cluster across one or more hosts (LAN or cloud) and prepare it
+to run obol-stack. Wraps [k3sup](https://github.com/alexellis/k3sup) over SSH.
+
+This skill is a **scaffold**. The single-host and LAN-join flows are wired up
+and the multi-node design is decided (storage-primary node label, cloudflared
+pools — see `references/multi-node-design.md`), but the chart changes that
+consume the bootstrap output are tracked in a follow-up ticket. Don't run this
+against a production cluster yet.
+
+## When to Use
+
+- Standing up obol-stack on a single Linux host (no Docker / k3d)
+- Joining a second/third host on the LAN as an agent node
+- Bootstrapping on cloud VMs reachable via SSH
+
+## When NOT to Use
+
+- Local Mac dev — keep using `obol stack up` (k3d + Docker)
+- Existing managed k8s (EKS/GKE) — point `KUBECONFIG` at it directly
+- Single-host where `obolup.sh` already works
+
+## Topologies
+
+### single-host
+
+One Linux box. k3sup installs k3s, writes kubeconfig locally, done. Equivalent
+to `curl -sfL https://get.k3s.io | sh` plus kubeconfig export.
+
+### lan-multi
+
+One server + N agent nodes on the same L2 network. Server is reachable by IP
+from all agents.
+
+### cloud-multi
+
+One server + N agents across cloud VMs. Same shape as lan-multi, but the
+server must have a routable IP and the SG/firewall must allow 6443/tcp from
+agents.
+
+## Quick Start
+
+```bash
+# Single host (current dev box)
+python3 scripts/bootstrap.py single --host 192.168.1.50 --user obol \
+  --ssh-key ~/.ssh/id_ed25519
+
+# LAN: server + 2 agents
+python3 scripts/bootstrap.py server --host 192.168.1.50 --user obol \
+  --ssh-key ~/.ssh/id_ed25519
+python3 scripts/bootstrap.py join --server-host 192.168.1.50 \
+  --host 192.168.1.51 --user obol --ssh-key ~/.ssh/id_ed25519
+python3 scripts/bootstrap.py join --server-host 192.168.1.50 \
+  --host 192.168.1.52 --user obol --ssh-key ~/.ssh/id_ed25519
+
+# After bootstrap, point obol at the kubeconfig and run stack up:
+export KUBECONFIG=$(python3 scripts/bootstrap.py kubeconfig-path)
+obol stack up
+```
+
+## Subcommands
+
+```
+single --host --user --ssh-key [--k3s-channel stable]
+       [--storage-primary] [--cloudflared-pool <name>]
+    Install k3s on one host. Equivalent to `server` with no agents.
+
+server --host --user --ssh-key [--k3s-channel stable] [--cluster-cidr]
+       [--storage-primary] [--no-storage-primary]
+       [--cloudflared-pool <name>]
+    Install k3s server on the target host. Writes kubeconfig to
+    $OBOL_CONFIG_DIR/kubeconfig.yaml with the API endpoint rewritten to
+    --host. Records `obol.org/storage=primary` on the server by default and
+    `obol.org/cloudflared-pool=<name>` (default `default`) into
+    topology.json.
+
+join --server-host --host --user --ssh-key
+     [--cloudflared-pool <name>]
+    Install k3s agent on --host and join it to --server-host. Records
+    `obol.org/cloudflared-pool=<name>` (default `default`).
+
+kubeconfig-path
+    Print the absolute path of the kubeconfig this skill writes to.
+
+label --host --label key=value [--label key=value ...]
+    Apply ad-hoc node labels (used when storage/tunnel placement
+    needs more than the bootstrap conveniences cover).
+
+status
+    List nodes, their roles, and the labels relevant to obol-stack.
+```
+
+## Design Notes (decided)
+
+Full rationale and rejected alternatives in `references/multi-node-design.md`.
+
+### Storage — single primary node
+
+One node carries `obol.org/storage=primary` (the bootstrap server by default).
+Stateful Deployments — LiteLLM, Hermes, default obol-agent, OpenClaw — add
+`nodeAffinity` to that label so PVCs always land on the same node. Lose the
+primary, restore from PVC backup. This is `--storage-primary` (default on)
+on `bootstrap.py server` / `single`.
+
+### Cloudflared — `pools` list
+
+The cloudflared chart will render one Deployment per entry in
+`cloudflared.pools`. Each pool has its own `replicas`, `nodeSelector`, and
+Cloudflare credentials, with hostname `PodAntiAffinity` ensuring at most one
+replica per node within a pool. Default values ship a single `default` pool
+preserving today's behavior; advanced topologies opt in by adding more pools
+(e.g. `edge` + `cloud` with separate tunnel tokens).
+
+`bootstrap.py server --cloudflared-pool <name>` and `bootstrap.py join
+--cloudflared-pool <name>` record per-node pool labels into `topology.json`.
+
+## Files Written by the Skill
+
+| Path | Purpose |
+|------|---------|
+| `$OBOL_CONFIG_DIR/kubeconfig.yaml` | k3s admin kubeconfig (API rewritten to server host IP) |
+| `$OBOL_CONFIG_DIR/cluster-bootstrap/topology.json` | Inventory of bootstrapped nodes (host, role, labels) |
+| `$OBOL_CONFIG_DIR/cluster-bootstrap/server-token` | k3s node token (mode 0600) — used to join agents |
+
+## Caveats
+
+- **Not for k3d/local Mac.** Use `obol stack up` for that — k3d-on-Docker is
+  still the canonical local dev path.
+- **Firewalls.** Server: 6443/tcp inbound from agents. All nodes: 8472/udp
+  (flannel VXLAN) between each other. Cloud: configure SGs accordingly.
+- **`OBOL_DEVELOPMENT=true` registry caches** are k3d-only today — they don't
+  run on the k3sup-bootstrapped k3s cluster yet.
+- **`obol stack up` on a real k3s cluster** has not been validated end to end
+  on this branch; the `obol stack` lifecycle today expects the k3d cluster
+  name written by `obol stack init`. Treat the post-bootstrap `obol stack up`
+  as the next milestone, not a finished path.
diff --git a/internal/embed/skills/cluster-bootstrap/references/multi-node-design.md b/internal/embed/skills/cluster-bootstrap/references/multi-node-design.md
new file mode 100644
index 00000000..a5d10523
--- /dev/null
+++ b/internal/embed/skills/cluster-bootstrap/references/multi-node-design.md
@@ -0,0 +1,155 @@
+# Multi-node design notes
+
+Decisions for the multi-node behavior of obol-stack. The `cluster-bootstrap`
+skill carries the bootstrap-time flags; the actual chart changes that consume
+them live in a separate ticket (see "Implementation status" at the bottom).
+
+## Storage — DECIDED: Option A (storage-primary node)
+
+Today: `internal/embed/infrastructure/base/templates/local-path.yaml` installs
+the rancher local-path provisioner with `volumeBindingMode: WaitForFirstConsumer`
+and `pathPattern: "{{ .PVC.Namespace }}/{{ .PVC.Name }}"` under
+`{{ .Values.dataDir }}`. PVCs pin to whichever node first schedules a consumer
+pod; rescheduling that consumer to a different node breaks the mount.
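+
+For concreteness, a sketch of the pinning mechanism: once the first consumer
+pod is scheduled, local-path creates a `hostPath` PV with a hard node
+affinity to that node. A hypothetical example — names and sizes are
+illustrative, not taken from the chart:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: pvc-1234                  # provisioner-generated name (illustrative)
+spec:
+  storageClassName: local-path
+  capacity:
+    storage: 1Gi
+  accessModes: ["ReadWriteOnce"]
+  hostPath:
+    path: /data/obol/litellm/litellm-db   # dataDir + pathPattern
+  nodeAffinity:                   # this is what breaks cross-node reschedules
+    required:
+      nodeSelectorTerms:
+        - matchExpressions:
+            - key: kubernetes.io/hostname
+              operator: In
+              values: ["node-a"]  # whichever node ran the first consumer
+```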
+### Decision: A — single storage-primary node
+
+- One node carries `obol.org/storage=primary`. By default this is the
+  bootstrap (server) node.
+- Every Deployment that owns a PVC adds a soft `nodeAffinity` preferring the
+  primary, and a hard `nodeAffinity` requiring it for true single-writer
+  state (LiteLLM, Hermes, default obol-agent, OpenClaw instances).
+- Failure mode is identical to today's single-host k3d: lose the primary,
+  restore from PVC backup. A single point of failure for state is acceptable
+  given our LAN/cloud-small topologies.
+- Helm values gain `storage.primaryLabel` (default `obol.org/storage=primary`)
+  so charts can opt in via a shared values key.
+
+### Rejected
+
+- **B — Longhorn / OpenEBS Mayastor.** Real PVC migration but ≥3 nodes,
+  ~500MiB RAM/node baseline, new failure modes (stuck volumes, replica
+  rebalance IO). Reconsider if a deployment actually needs HA state.
+- **C — NFS export + dual StorageClass.** SPOF on NFS host; fsync/lease
+  semantics differ from local disk and would silently break SQLite-style
+  state (LiteLLM logs DB, BoltDB-backed services). Reconsider if a deployment
+  only needs to centralize bulk, read-mostly storage.
+
+### Bootstrap surface
+
+- `bootstrap.py server --storage-primary` records `obol.org/storage=primary`
+  on the server node in topology.json. Apply with the `kubectl label node …`
+  commands printed by `bootstrap.py label`.
+- `bootstrap.py server --no-storage-primary` opts out (e.g. when a separate
+  storage node will be added later).
+
+## Cloudflared — DECIDED: Shape 2 (pools)
+
+Today: `internal/embed/infrastructure/cloudflared/templates/deployment.yaml`
+renders one Deployment with `replicas: 1` (or 0 when no token/credentials).
+Modes: `quickTunnel`, `remoteManaged` (token), `localManaged` (credentials +
+config).
+
+### Decision: Shape 2 — `cloudflared.pools` list
+
+Values gain a `pools` list. Each pool is its own Deployment with hostname
+PodAntiAffinity so that within a pool there is at most one replica per node,
+and each pool gets its own Cloudflare credentials (edge vs cloud usually map
+to different zones / accounts).
+
+```yaml
+# Default values.yaml — single pool, backwards compatible with today.
+pools:
+  - name: default
+    replicas: 1
+    # nodeSelector omitted -> any schedulable node
+    mode: auto # auto | local | remote | quick
+    quickTunnel:
+      url: "http://traefik.traefik.svc.cluster.local:80"
+    remoteManaged:
+      tokenSecretName: cloudflared-tunnel-token
+      tokenSecretKey: TUNNEL_TOKEN
+    localManaged:
+      secretName: cloudflared-local-credentials
+      configMapName: cloudflared-local-config
+      tunnelIDKey: tunnel_id
+```
+
+Per-pool example for an edge+cloud topology:
+
+```yaml
+pools:
+  - name: edge
+    replicas: 2
+    nodeSelector:
+      obol.org/cloudflared-pool: edge
+    mode: remote
+    remoteManaged:
+      tokenSecretName: cloudflared-edge-token
+      tokenSecretKey: TUNNEL_TOKEN
+  - name: cloud
+    replicas: 1
+    nodeSelector:
+      obol.org/cloudflared-pool: cloud
+    mode: local
+    localManaged:
+      secretName: cloudflared-cloud-credentials
+      configMapName: cloudflared-cloud-config
+      tunnelIDKey: tunnel_id
+```
+
+Invariants the chart must enforce:
+- Per-pool `requiredDuringSchedulingIgnoredDuringExecution` PodAntiAffinity by
+  `kubernetes.io/hostname` — at most one replica per node within a pool.
+- `quickTunnel` mode caps at `replicas: 1` (per-replica trycloudflare URL).
+- Resource names get a per-pool suffix: `cloudflared-<pool>` for the
+  Deployment; the suffix is omitted only when the single pool is named
+  `default` and no migration is in flight.
+- Validation: each pool must set exactly one of `quickTunnel` (when
+  `mode=quick`), `remoteManaged` (`mode=remote`), or `localManaged`
+  (`mode=local`); when `mode=auto`, any one of the three is accepted and
+  selects the mode.
+
+Footgun to document for users: if `replicas` exceeds the count of nodes
+matching `nodeSelector`, the surplus pods stay Pending. The chart NOTES.txt
+should print a warning at install time.
+
+### Rejected
+
+- **Shape 1 — DaemonSet per labeled pool.** Hard-caps at one tunnel per
+  labeled node, which means "more tunnels on edge" requires labeling more
+  nodes. Doesn't compose when one beefy edge box wants two tunnels.
+- **Shape 3 — single Deployment, hostname anti-affinity, replicas knob.** No
+  way to differentiate edge vs cloud tunnels (different Cloudflare
+  credentials, different zones). Replicas-exceeds-nodes footgun is the same
+  but with no value to offset it.
+
+### Bootstrap surface
+
+- `bootstrap.py server --cloudflared-pool <name>` records the pool label on
+  the server. Default is `default`.
+- `bootstrap.py join --cloudflared-pool <name>` records the pool label on
+  the agent. Repeat with different pool names to build an edge/cloud topology.
+- The recorded labels are written into `topology.json` so the chart-rewrite
+  ticket can read them when generating per-pool values.
+
+## Other multi-node concerns (out of scope for this skill, tracked here)
+
+- **Dev registry cache**: today configured per-cluster in `registries.yaml`,
+  scoped to a single localhost cache on the dev box. Multi-node needs each
+  agent to either reach the cache over LAN or have its own cache.
+- **Host Ollama auto-detection**: `autoConfigureLLM` detects models on the
+  host where `obol stack up` ran. In multi-node we need to either disable
+  this (require `obol model setup custom`) or aggregate across nodes.
+- **Traefik / Gateway**: a single Service IP works fine on multi-node out of
+  the box; nothing to do unless we want active-active ingress per region.
+
+## Implementation status
+
+| Piece                                                      | Status |
+|------------------------------------------------------------|--------|
+| `bootstrap.py` records storage-primary + cloudflared-pool  | done (this PR) |
+| `local-path.yaml` chart honors `storage.primaryLabel`      | next ticket |
+| Stateful Deployments add `nodeAffinity` to primary label   | next ticket |
+| `cloudflared` chart `range` over `pools`                   | next ticket |
+| `obol stack up` consumes `topology.json` for chart values  | next ticket |
+| End-to-end multi-node smoke test                           | follow-up |
diff --git a/internal/embed/skills/cluster-bootstrap/scripts/bootstrap.py b/internal/embed/skills/cluster-bootstrap/scripts/bootstrap.py
new file mode 100755
index 00000000..c69a2ad0
--- /dev/null
+++ b/internal/embed/skills/cluster-bootstrap/scripts/bootstrap.py
@@ -0,0 +1,256 @@
+#!/usr/bin/env python3
+"""Cluster bootstrap helper for obol-stack.
+
+Thin wrapper around k3sup that codifies the topology and writes inventory
+to $OBOL_CONFIG_DIR/cluster-bootstrap/. See ../SKILL.md for usage.
+
+This is a scaffold: the SSH-driven k3sup invocations are wired up, but the
+storage-primary and cloudflared-pool placement is only recorded in
+topology.json for now; the chart changes that consume it are tracked in a
+follow-up ticket (see ../references/multi-node-design.md).
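+
+Example (single host, mirroring the SKILL.md Quick Start):
+
+    python3 bootstrap.py single --host 192.168.1.50 --user obol \
+        --ssh-key ~/.ssh/id_ed25519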
+""" + +from __future__ import annotations + +import argparse +import json +import os +import shutil +import subprocess +import sys +from dataclasses import dataclass, asdict +from pathlib import Path +from typing import Optional + + +def config_dir() -> Path: + if env := os.environ.get("OBOL_CONFIG_DIR"): + return Path(env) + if xdg := os.environ.get("XDG_CONFIG_HOME"): + return Path(xdg) / "obol" + return Path.home() / ".config" / "obol" + + +def state_dir() -> Path: + p = config_dir() / "cluster-bootstrap" + p.mkdir(parents=True, exist_ok=True) + return p + + +def kubeconfig_path() -> Path: + return config_dir() / "kubeconfig.yaml" + + +def topology_path() -> Path: + return state_dir() / "topology.json" + + +def server_token_path() -> Path: + return state_dir() / "server-token" + + +@dataclass +class Node: + host: str + role: str # "server" | "agent" + user: str + labels: dict + + +def load_topology() -> dict: + p = topology_path() + if not p.exists(): + return {"nodes": []} + return json.loads(p.read_text()) + + +def save_topology(topo: dict) -> None: + topology_path().write_text(json.dumps(topo, indent=2, sort_keys=True)) + + +def upsert_node(node: Node) -> None: + topo = load_topology() + nodes = [n for n in topo["nodes"] if n["host"] != node.host] + nodes.append(asdict(node)) + topo["nodes"] = nodes + save_topology(topo) + + +def require_k3sup() -> None: + if shutil.which("k3sup") is None: + sys.exit( + "k3sup not found in PATH. Install from https://github.com/alexellis/k3sup" + ) + + +def run(cmd: list[str]) -> subprocess.CompletedProcess: + print(f"+ {' '.join(cmd)}", file=sys.stderr) + return subprocess.run(cmd, check=True) + + +STORAGE_PRIMARY_LABEL = "obol.org/storage" +CLOUDFLARED_POOL_LABEL = "obol.org/cloudflared-pool" + + +def cmd_install_server(args: argparse.Namespace) -> int: + require_k3sup() + kc = kubeconfig_path() + kc.parent.mkdir(parents=True, exist_ok=True) + + k3sup_cmd = [ + "k3sup", "install", + "--ip", args.host, + "--user", args.user, + "--ssh-key", args.ssh_key, + "--local-path", str(kc), + "--context", "obol", + "--k3s-channel", args.k3s_channel, + ] + if args.cluster_cidr: + k3sup_cmd += ["--cluster-cidr", args.cluster_cidr] + run(k3sup_cmd) + + # Pull the node token off the server so agents can join later. 
+    token = subprocess.check_output([
+        "ssh", "-i", args.ssh_key,
+        "-o", "StrictHostKeyChecking=accept-new",
+        f"{args.user}@{args.host}",
+        "sudo cat /var/lib/rancher/k3s/server/node-token",
+    ]).decode().strip()
+    p = server_token_path()
+    p.write_text(token)
+    p.chmod(0o600)
+
+    labels: dict[str, str] = {}
+    if args.storage_primary:
+        labels[STORAGE_PRIMARY_LABEL] = "primary"
+    labels[CLOUDFLARED_POOL_LABEL] = args.cloudflared_pool
+    upsert_node(Node(host=args.host, role="server", user=args.user, labels=labels))
+    print(f"server installed; kubeconfig at {kc}")
+    if labels:
+        print("recorded labels:", ", ".join(f"{k}={v}" for k, v in labels.items()))
+        # `bootstrap.py label` (not `status`) prints the kubectl commands.
+        hint = " ".join(f"--label {k}={v}" for k, v in labels.items())
+        print(f"apply with: bootstrap.py label --host {args.host} {hint}")
+    return 0
+
+
+def cmd_join(args: argparse.Namespace) -> int:
+    require_k3sup()
+    if not server_token_path().exists():
+        sys.exit("no server token on disk; run `bootstrap.py server` first")
+
+    run([
+        "k3sup", "join",
+        "--ip", args.host,
+        "--user", args.user,
+        "--ssh-key", args.ssh_key,
+        "--server-ip", args.server_host,
+        # assumes the same SSH user on the server and the agent
+        "--server-user", args.user,
+    ])
+    labels = {CLOUDFLARED_POOL_LABEL: args.cloudflared_pool}
+    upsert_node(Node(host=args.host, role="agent", user=args.user, labels=labels))
+    print(f"agent {args.host} joined to server {args.server_host}")
+    print("recorded labels:", ", ".join(f"{k}={v}" for k, v in labels.items()))
+    return 0
+
+
+def cmd_kubeconfig_path(_: argparse.Namespace) -> int:
+    print(kubeconfig_path())
+    return 0
+
+
+def cmd_status(_: argparse.Namespace) -> int:
+    topo = load_topology()
+    if not topo["nodes"]:
+        print("no nodes recorded")
+        return 0
+    for n in topo["nodes"]:
+        labels = ",".join(f"{k}={v}" for k, v in n["labels"].items()) or "-"
+        print(f"{n['host']:20} {n['role']:6} {n['user']:12} {labels}")
+    return 0
+
+
+def cmd_label(args: argparse.Namespace) -> int:
+    pairs = {}
+    for raw in args.label:
+        if "=" not in raw:
+            sys.exit(f"label must be key=value (got {raw!r})")
+        k, v = raw.split("=", 1)
+        pairs[k] = v
+
+    # Persist intent locally; actual `kubectl label node` happens via the
+    # caller because we don't want to assume a kubeconfig is loaded yet.
+    topo = load_topology()
+    found = False
+    for n in topo["nodes"]:
+        if n["host"] == args.host:
+            n["labels"].update(pairs)
+            found = True
+    if not found:
+        sys.exit(f"host {args.host!r} not in topology")
+    save_topology(topo)
+    print(f"recorded labels for {args.host}; apply with:")
+    for k, v in pairs.items():
+        # `kubectl get node -o name` already yields node/<name>, so don't
+        # repeat the resource type in the label command.
+        print(f"  kubectl label $(kubectl get node -o name | grep {args.host}) {k}={v} --overwrite")
+    return 0
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    p = argparse.ArgumentParser(prog="bootstrap.py")
+    sub = p.add_subparsers(dest="cmd", required=True)
+
+    def add_ssh(parser: argparse.ArgumentParser) -> None:
+        parser.add_argument("--host", required=True)
+        parser.add_argument("--user", required=True)
+        parser.add_argument("--ssh-key", required=True)
+
+    def add_server_topology(parser: argparse.ArgumentParser) -> None:
+        parser.add_argument("--k3s-channel", default="stable")
+        parser.add_argument("--cluster-cidr", default=None)
+        # storage-primary defaults on; --no-storage-primary opts out.
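+        # argparse.BooleanOptionalAction (Python 3.9+) derives the
+        # --no-storage-primary flag from the single definition below.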
+        parser.add_argument(
+            "--storage-primary", dest="storage_primary",
+            action=argparse.BooleanOptionalAction, default=True,
+            help="record obol.org/storage=primary on this node (default on)",
+        )
+        parser.add_argument(
+            "--cloudflared-pool", default="default",
+            help="cloudflared pool name (label obol.org/cloudflared-pool); "
+                 "default 'default'",
+        )
+
+    s = sub.add_parser("single", help="install k3s on one host")
+    add_ssh(s)
+    add_server_topology(s)
+    s.set_defaults(func=cmd_install_server)
+
+    s = sub.add_parser("server", help="install k3s server")
+    add_ssh(s)
+    add_server_topology(s)
+    s.set_defaults(func=cmd_install_server)
+
+    s = sub.add_parser("join", help="join an agent node to the server")
+    add_ssh(s)
+    s.add_argument("--server-host", required=True)
+    s.add_argument(
+        "--cloudflared-pool", default="default",
+        help="cloudflared pool name for this agent (default 'default')",
+    )
+    s.set_defaults(func=cmd_join)
+
+    s = sub.add_parser("kubeconfig-path", help="print kubeconfig path")
+    s.set_defaults(func=cmd_kubeconfig_path)
+
+    s = sub.add_parser("status", help="list known nodes")
+    s.set_defaults(func=cmd_status)
+
+    s = sub.add_parser("label", help="record node labels in topology")
+    s.add_argument("--host", required=True)
+    s.add_argument("--label", action="append", default=[],
+                   help="key=value (repeatable)")
+    s.set_defaults(func=cmd_label)
+
+    args = p.parse_args(argv)
+    return args.func(args)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())