feat(skills): scaffold cluster-bootstrap for LAN/cloud k3s #423
Draft
Wraps k3sup over SSH for single-host install, server install, and agent join. Persists topology to `$OBOL_CONFIG_DIR/cluster-bootstrap/` and the k3s admin kubeconfig to the standard location so `obol stack up` can take over once multi-node infra concerns are resolved. Multi-node storage placement and cloudflared HA are intentionally deferred: the design tradeoffs are captured in `references/multi-node-design.md`, and the skill exposes flag stubs (`--storage-primary`, `--edge-node`) so the CLI surface won't churn once we pick paths.
**Storage: single primary node (Option A).** One node carries `obol.org/storage=primary`; stateful Deployments will gain `nodeAffinity` to that label. Bootstrap records the label on the server by default; `--no-storage-primary` opts out.

**Cloudflared: pools list (Shape 2).** The chart will render one Deployment per `cloudflared.pools` entry, each with its own replicas, nodeSelector, and credentials, plus hostname `PodAntiAffinity` to keep one replica per node within a pool. Bootstrap records `obol.org/cloudflared-pool` on each node so the chart-rewrite ticket can read `topology.json`.

Rejected paths and rationale are captured in `references/multi-node-design.md`. The actual chart-side changes (local-path `nodeAffinity`, cloudflared `range` over pools) remain a separate ticket.
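For illustration, the values shape these decisions imply might look like the sketch below. `storage.primaryLabel` comes from the out-of-scope list; `credentialsSecret` and the pool names are hypothetical placeholders, and the final schema belongs to the chart-rewrite ticket.

```yaml
# Sketch only: the actual schema is decided in the chart-rewrite ticket.
storage:
  primaryLabel: obol.org/storage=primary    # stateful Deployments gain nodeAffinity to this
cloudflared:
  pools:                                    # chart renders one Deployment per entry
    - name: edge
      replicas: 2
      nodeSelector:
        obol.org/cloudflared-pool: edge
      credentialsSecret: cloudflared-edge   # hypothetical field: per-pool Cloudflare creds
    - name: cloud
      replicas: 1
      nodeSelector:
        obol.org/cloudflared-pool: cloud
      credentialsSecret: cloudflared-cloud
```

Hostname `PodAntiAffinity` within each pool would then cap scheduling at one replica per matching node, which is why a `replicas` value above the matching node count deserves the planned NOTES.txt warning.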
## Summary
Scaffold a new `cluster-bootstrap` skill that lets obol-agent stand up a k3s cluster across one or more hosts (LAN or cloud) over SSH using k3sup, then prepares it for `obol stack up`. Topology (nodes, roles, labels) is persisted under `$OBOL_CONFIG_DIR/cluster-bootstrap/` so the chart-side wiring can consume it later.

This is a scaffold PR: the bootstrap flows are wired up and the multi-node design is decided, but the chart changes that consume the bootstrap output are intentionally deferred (see "Out of scope" below). Don't run against a production cluster yet.
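The PR text doesn't show the `topology.json` schema, so as a rough sketch only (function and field names here are hypothetical, not the actual `bootstrap.py` internals), the persistence side presumably does something like:

```python
import json
from pathlib import Path


def record_node(state_dir: Path, host: str, role: str, labels: dict) -> dict:
    """Upsert a node entry in topology.json (illustrative schema, not the real one)."""
    state_dir.mkdir(parents=True, exist_ok=True)
    topo_path = state_dir / "topology.json"
    topo = json.loads(topo_path.read_text()) if topo_path.exists() else {"nodes": []}
    # Re-running against the same host replaces its entry rather than duplicating it.
    topo["nodes"] = [n for n in topo["nodes"] if n["host"] != host]
    topo["nodes"].append({"host": host, "role": role, "labels": labels})
    topo_path.write_text(json.dumps(topo, indent=2) + "\n")
    return topo


def write_server_token(state_dir: Path, token: str) -> None:
    """Persist the k3s server join token owner-readable only (mode 0600)."""
    token_path = state_dir / "server-token"
    token_path.write_text(token)
    token_path.chmod(0o600)
```

The upsert-by-host behavior is one reasonable way to make re-running a bootstrap subcommand idempotent with respect to the recorded topology.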
## What's in
- `internal/embed/skills/cluster-bootstrap/SKILL.md` — topology model (single-host / lan-multi / cloud-multi), subcommand reference, caveats.
- `internal/embed/skills/cluster-bootstrap/scripts/bootstrap.py` — k3sup wrapper: `single`, `server`, `join`, `kubeconfig-path`, `status`, `label`. Persists topology to `topology.json`, keeps the server token at mode 0600, writes the kubeconfig to the standard `$OBOL_CONFIG_DIR/kubeconfig.yaml`.
- `internal/embed/skills/cluster-bootstrap/references/multi-node-design.md` — full design doc with rejected alternatives.

## Locked decisions
- **Storage — Option A (single primary node).** One node carries `obol.org/storage=primary` (the server by default). Stateful Deployments will gain `nodeAffinity` to that label; `--no-storage-primary` opts out. State is single-node, recoverable from PVC backup. Rejected: Longhorn/OpenEBS (≥3 nodes, ~500 MiB/node baseline) and NFS dual-StorageClass (SPOF + fsync semantics break SQLite/BoltDB workloads).
- **Cloudflared — Shape 2 (`pools` list).** The chart will render one Deployment per `cloudflared.pools` entry; each pool has its own `replicas`, `nodeSelector`, and Cloudflare credentials, with hostname `PodAntiAffinity` capping at one replica per node within a pool. `bootstrap.py server|join --cloudflared-pool <name>` records `obol.org/cloudflared-pool=<name>` (default `default`). Rejected: DaemonSet (can't run >1 tunnel per beefy node), single Deployment + `replicas` (no edge-vs-cloud credential separation).

## Out of scope (next ticket)
- `local-path.yaml` chart honoring `storage.primaryLabel` with `nodeAffinity` to the primary label
- cloudflared chart rewriting the single Deployment as a `range` over `pools` (with the hostname-antiaffinity invariant + a NOTES.txt warning when `replicas` > matching nodes)
- `obol stack up` consuming `topology.json` to feed those values

## Test plan
- `go build ./...` clean
- `go test ./internal/embed/...` passes (CRD/RBAC validation unchanged)
- `bootstrap.py --help`, `bootstrap.py server --help`, `bootstrap.py join --help` parse cleanly
- `bootstrap.py status` on fresh state returns "no nodes recorded"

## Notes
- Branch is `feature/cluster-bootstrap-skill` per request; project convention elsewhere is `feat/`. Happy to rename if preferred.
- No existing chart files are touched (`local-path.yaml`, cloudflared chart). All changes live under `internal/embed/skills/cluster-bootstrap/`.

Generated by Claude Code