[DIRTY] feat(skills): scaffold cluster-bootstrap for LAN/cloud k3s #423

Draft
bussyjd wants to merge 2 commits into main from feature/cluster-bootstrap-skill

Conversation


@bussyjd bussyjd commented May 5, 2026

Summary

Scaffold a new cluster-bootstrap skill that lets obol-agent stand up a k3s cluster across one or more hosts (LAN or cloud) over SSH using k3sup, then prepares it for obol stack up. Topology (nodes, roles, labels) is persisted under $OBOL_CONFIG_DIR/cluster-bootstrap/ so the chart-side wiring can consume it later.

This is a scaffold PR: the bootstrap flows are wired up and the multi-node design is decided, but the chart changes that consume the bootstrap output are intentionally deferred (see "Out of scope" below). Don't run against a production cluster yet.

What's in

  • internal/embed/skills/cluster-bootstrap/SKILL.md — topology model (single-host / lan-multi / cloud-multi), subcommand reference, caveats.
  • internal/embed/skills/cluster-bootstrap/scripts/bootstrap.py — k3sup wrapper: single, server, join, kubeconfig-path, status, label. Persists topology to topology.json, keeps server token mode 0600, writes kubeconfig to the standard $OBOL_CONFIG_DIR/kubeconfig.yaml.
  • internal/embed/skills/cluster-bootstrap/references/multi-node-design.md — full design doc with rejected alternatives.
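The bootstrap.py internals aren't shown in this PR body, but the persistence contract described above (topology to `topology.json`, server token kept at mode 0600) can be sketched roughly like this; the function names `persist_topology` and `persist_server_token` are hypothetical, not taken from the actual script:

```python
import json
import os
from pathlib import Path


def _state_dir(config_dir=None):
    """Resolve $OBOL_CONFIG_DIR/cluster-bootstrap/, creating it if needed."""
    base = Path(config_dir or os.environ.get("OBOL_CONFIG_DIR", "~/.obol")).expanduser()
    state = base / "cluster-bootstrap"
    state.mkdir(parents=True, exist_ok=True)
    return state


def persist_topology(nodes, config_dir=None):
    """Write the node list (host, role, labels) to topology.json so the
    chart-side wiring can consume it later."""
    path = _state_dir(config_dir) / "topology.json"
    path.write_text(json.dumps({"nodes": nodes}, indent=2))
    return path


def persist_server_token(token, config_dir=None):
    """Store the k3s server join token, restricted to owner read/write (0600)."""
    path = _state_dir(config_dir) / "server-token"
    path.write_text(token)
    path.chmod(0o600)
    return path
```

The default `~/.obol` fallback for `OBOL_CONFIG_DIR` is an assumption for the sketch.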

Locked decisions

Storage — Option A (single primary node). One node carries obol.org/storage=primary (server by default). Stateful Deployments will gain nodeAffinity to that label; --no-storage-primary opts out. State is single-node, recoverable from PVC backup. Rejected: Longhorn/OpenEBS (≥3 nodes, ~500MiB/node baseline) and NFS dual-StorageClass (SPOF + fsync semantics break SQLite/BoltDB workloads).
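For concreteness, the nodeAffinity stanza that stateful Deployments would gain under Option A looks roughly like the structure below (a sketch only; the helper name `storage_node_affinity` is hypothetical and the real change lands chart-side in a later ticket):

```python
STORAGE_PRIMARY_LABEL = "obol.org/storage"  # value "primary" marks the single storage node


def storage_node_affinity(label_key=STORAGE_PRIMARY_LABEL, label_value="primary"):
    """Build the pod-spec affinity block pinning a workload to the
    node labeled obol.org/storage=primary."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": [label_value],
                            }
                        ]
                    }
                ]
            }
        }
    }
```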

Cloudflared — Shape 2 (pools list). The chart will render one Deployment per cloudflared.pools entry; each pool has its own replicas, nodeSelector, and Cloudflare credentials, with hostname PodAntiAffinity capping at one replica per node within a pool. bootstrap.py server|join --cloudflared-pool <name> records obol.org/cloudflared-pool=<name> (defaulting to default). Rejected: DaemonSet (can't run >1 tunnel per beefy node), single-Deployment+replicas (no edge-vs-cloud credential separation).
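A rough sketch of the Shape 2 rendering logic (one Deployment per `cloudflared.pools` entry, hostname PodAntiAffinity within a pool). The function name and spec shape are illustrative assumptions; the real implementation will be a Helm template, not Python:

```python
def render_cloudflared_deployments(pools):
    """Produce one Deployment-like spec per pool entry. Hostname
    PodAntiAffinity keeps replicas of the same pool on distinct nodes."""
    deployments = []
    for name, cfg in pools.items():
        app_label = f"cloudflared-{name}"
        deployments.append(
            {
                "name": app_label,
                "replicas": cfg.get("replicas", 1),
                # Pools schedule onto nodes bootstrap.py labeled for them.
                "nodeSelector": {"obol.org/cloudflared-pool": name},
                "podAntiAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "labelSelector": {"matchLabels": {"app": app_label}},
                            "topologyKey": "kubernetes.io/hostname",
                        }
                    ]
                },
            }
        )
    return deployments
```

With a `pools` value of `{"default": {"replicas": 2}, "edge": {}}` this yields two independent Deployments, which is exactly the credential/placement separation the single-Deployment alternative could not provide.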

Out of scope (next ticket)

  • local-path.yaml chart honoring storage.primaryLabel
  • Stateful Deployments adding nodeAffinity to the primary label
  • cloudflared chart rewriting the single Deployment as a range over pools (with hostname-antiaffinity invariant + NOTES.txt warning when replicas > matching nodes)
  • obol stack up consuming topology.json to feed those values
  • End-to-end multi-node smoke test

Test plan

  • go build ./... clean
  • go test ./internal/embed/... passes (CRD/RBAC validation unchanged)
  • bootstrap.py --help, bootstrap.py server --help, bootstrap.py join --help parse cleanly
  • bootstrap.py status on a fresh state returns "no nodes recorded"
  • Live single-host install on a Linux box (deferred, requires non-Mac target)
  • LAN server + agent join (deferred, requires multi-node test bed)
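The fresh-state behavior checked above can be pictured with a small sketch of what a `status` subcommand would do when reading `topology.json` (hypothetical helper, assuming the persistence layout described under "What's in"):

```python
import json
from pathlib import Path


def status(state_dir):
    """Report recorded nodes from topology.json, or the fresh-state
    message when nothing has been bootstrapped yet."""
    path = Path(state_dir) / "topology.json"
    if not path.exists():
        return "no nodes recorded"
    nodes = json.loads(path.read_text()).get("nodes", [])
    if not nodes:
        return "no nodes recorded"
    return "\n".join(f"{n['host']} role={n['role']}" for n in nodes)
```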

Notes

  • Branch name is feature/cluster-bootstrap-skill per request; project convention elsewhere is feat/. Happy to rename if preferred.
  • Does not touch any production infra templates (local-path.yaml, cloudflared chart). All changes live under internal/embed/skills/cluster-bootstrap/.

Generated by Claude Code

claude added 2 commits May 5, 2026 01:03
Wraps k3sup over SSH for single-host install, server install, and agent
join. Persists topology to $OBOL_CONFIG_DIR/cluster-bootstrap/ and the
k3s admin kubeconfig to the standard location so `obol stack up` can
take over once multi-node infra concerns are resolved.

Multi-node storage placement and cloudflared HA are intentionally
deferred — the design tradeoffs are captured in
references/multi-node-design.md and the skill exposes flag stubs
(--storage-primary, --edge-node) so the CLI surface won't churn once
we pick paths.

Storage: single primary node (Option A). One node carries
obol.org/storage=primary; stateful Deployments will gain nodeAffinity to
that label. Bootstrap records the label on the server by default;
--no-storage-primary opts out.

Cloudflared: pools list (Shape 2). The chart will render one Deployment
per cloudflared.pools entry, each with its own replicas, nodeSelector,
and credentials, plus hostname PodAntiAffinity to keep one replica per
node within a pool. Bootstrap records obol.org/cloudflared-pool on each
node so the chart-rewrite ticket can read topology.json.

Rejected paths and rationale captured in references/multi-node-design.md.
The actual chart-side changes (local-path nodeAffinity, cloudflared range
over pools) remain a separate ticket.
@bussyjd bussyjd changed the title feat(skills): scaffold cluster-bootstrap for LAN/cloud k3s [DIRTY] feat(skills): scaffold cluster-bootstrap for LAN/cloud k3s May 5, 2026