Harmonize EC2Tags peer discovery: controller-side node_id + retire DNSEndpointsSource #370

@bdchatham

Problem

PR #369 (merged 2026-05-29) moved node_id resolution for the LabelPeerSource path from seictl's sidecar (:26657/status query) into the controller, producing fully-composed <node_id>@<host>:<port> strings in Status.ResolvedPeers and feeding them to the planner via sidecar.PeerSourceStatic. The EC2TagsPeerSource path is still split: the controller resolves the host list, but the sidecar still queries each peer's :26657/status for node_id at config-render time, via sidecar.PeerSourceDNSEndpoints.

This is the split shape that #368 identified as load-bearing. The result is mixed semantics inside one controller: Label peers preserve prior entries on transient failure, while EC2Tag peers don't; Label peers are resilient to mass restarts, while EC2Tag peers re-expose the render-time fragility on every reconcile.
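
For reference, a minimal sketch of composing the <node_id>@<host>:<port> wire format described above. The helper name composePeer, the 40-hex-character node_id check (the usual CometBFT convention) and the example port 26656 are assumptions for illustration, not the controller's actual code.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// composePeer builds a "<node_id>@<host>:<port>" entry.
// Assumption: a node_id is 20 bytes rendered as 40 hex characters.
// composePeer is a hypothetical helper, not an existing controller function.
func composePeer(nodeID, host string, port int) (string, error) {
	raw, err := hex.DecodeString(nodeID)
	if err != nil || len(raw) != 20 {
		return "", fmt.Errorf("invalid node_id %q: want 40 hex characters", nodeID)
	}
	if host == "" || port < 1 || port > 65535 {
		return "", fmt.Errorf("invalid address %q:%d", host, port)
	}
	// net.JoinHostPort brackets IPv6 literals, so the result stays parseable.
	addr := net.JoinHostPort(host, strconv.Itoa(port))
	return strings.ToLower(nodeID) + "@" + addr, nil
}

func main() {
	peer, err := composePeer("0123456789abcdef0123456789abcdef01234567", "10.0.0.1", 26656)
	fmt.Println(peer, err)
}
```

Validating at composition time keeps a malformed node_id from reaching Status.ResolvedPeers, where it would otherwise only surface when the node fails to dial the peer.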

Impact

Affects sei-infra peer discovery — pacific-1 validator nodes peer with sei-infra-managed peers via EC2 tag selectors, and those peer relationships should benefit from the controller-side resilience story #369 shipped for Labels. While DNSEndpointsSource remains live, the EC2Tag side carries the pre-#369 failure modes:

  • Sidecar config-render queries :26657/status of each peer at task-execution time; a peer mid-restart drops out of the rendered persistent_peers and the gap persists until the next config-render task fires.
  • No prior-entry preservation; each render is a fresh resolution against current peer reachability.
  • The EC2 tag query and the DNS query are temporally split, allowing membership drift between "what the controller thinks the peer set is" and "what got written to config.toml".

Proposed approach (to refine)

Mirror the #369 pattern for EC2Tags:

  1. Controller's reconcilePeers learns an EC2TagsPeerSource branch alongside the existing Label branch. Resolves EC2-tagged instances to host:port, then calls the per-peer sidecar gRPC GetNodeID for each — same per-peer-best-effort semantics (preserve prior on transient sidecar failure, skip new peer with structured log).
  2. Compose <node_id>@<host>:<port> into Status.ResolvedPeers, same wire format as the Label path.
  3. Planner maps the EC2Tags branch to sidecar.PeerSourceStatic (same as Labels post-#369), retiring sidecar.PeerSourceDNSEndpoints from the resolver→sidecar contract.
  4. Once both Label and EC2Tags use the static path, the seictl-side DNSEndpointsSource handler is dead code — remove it from the seictl repo as a follow-up.
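
Steps 1 and 2 could be sketched roughly as below. This is an illustration under assumptions, not the #369 implementation: nodeIDResolver is a hypothetical stand-in for the sidecar gRPC GetNodeID call, every error is treated as transient, and prior entries whose address is no longer in the endpoint set are simply dropped (the drain policy is out of scope here).

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// nodeIDResolver stands in for the per-peer sidecar GetNodeID call.
// Hypothetical signature; the real gRPC client will differ.
type nodeIDResolver func(hostPort string) (nodeID string, err error)

// resolvePeers composes "<node_id>@<host>:<port>" entries for endpoints.
// prior is the previous Status.ResolvedPeers content. On failure a peer's
// prior entry is kept; a peer with no prior entry is reported as skipped.
func resolvePeers(endpoints, prior []string, getNodeID nodeIDResolver) (resolved, skipped []string) {
	// Index prior entries by host:port so they can be reused on failure.
	priorByAddr := make(map[string]string, len(prior))
	for _, p := range prior {
		if _, addr, ok := strings.Cut(p, "@"); ok {
			priorByAddr[addr] = p
		}
	}
	for _, addr := range endpoints {
		id, err := getNodeID(addr)
		if err == nil {
			resolved = append(resolved, id+"@"+addr)
			continue
		}
		if p, ok := priorByAddr[addr]; ok {
			resolved = append(resolved, p) // preserve prior on failure
			continue
		}
		skipped = append(skipped, addr) // new peer, no identity yet
	}
	// Sorted output keeps the status field stable across reconciles.
	sort.Strings(resolved)
	return resolved, skipped
}

// stubResolver resolves known addresses and fails for all others,
// imitating an unreachable sidecar. Illustration only.
func stubResolver(ids map[string]string) nodeIDResolver {
	return func(hostPort string) (string, error) {
		if id, ok := ids[hostPort]; ok {
			return id, nil
		}
		return "", errors.New("sidecar unreachable")
	}
}

func main() {
	prior := []string{"bbbb@10.0.0.2:26656"}
	endpoints := []string{"10.0.0.1:26656", "10.0.0.2:26656", "10.0.0.3:26656"}
	get := stubResolver(map[string]string{"10.0.0.1:26656": "aaaa"})

	resolved, skipped := resolvePeers(endpoints, prior, get)
	fmt.Println("resolved:", resolved)
	fmt.Println("skipped:", skipped) // the caller would emit a structured log per entry
}
```

Returning the skipped addresses instead of logging inside the function keeps the resolution logic free of logger dependencies, so the reconciler can decide how to surface them (structured log, event, or condition).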

Open question for the experts: for EC2-tagged peers that aren't in the same K8s cluster (sei-infra-managed), the controller can't dial a per-peer sidecar gRPC the way it does for in-cluster SeiNodes. Resolution path options:

  • (a) Gate the controller-side GetNodeID on "is the peer an in-cluster SeiNode"; fall back to leaving DNSEndpointsSource live for out-of-cluster EC2 peers.
  • (b) Controller queries :26657/status directly for out-of-cluster peers.
  • (c) Require sei-infra peers to also publish node_id via a discoverable surface (tag, S3, on-chain).
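
To make option (b) concrete, here is a rough sketch of reading node_id from a peer's /status endpoint. It assumes the reply follows the standard CometBFT JSON-RPC shape (result.node_info.id); whether Sei's RPC wraps the reply identically should be checked. fetchNodeID and parseNodeID are hypothetical names, and the example runs against a local test server rather than a real peer.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusResponse models only the field needed from the /status reply.
// Assumption: the CometBFT-style envelope {"result":{"node_info":{"id":...}}}.
type statusResponse struct {
	Result struct {
		NodeInfo struct {
			ID string `json:"id"`
		} `json:"node_info"`
	} `json:"result"`
}

// parseNodeID extracts node_info.id from a raw /status body.
func parseNodeID(body []byte) (string, error) {
	var s statusResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return "", fmt.Errorf("decode status: %w", err)
	}
	if s.Result.NodeInfo.ID == "" {
		return "", errors.New("status reply has no node_info.id")
	}
	return s.Result.NodeInfo.ID, nil
}

// fetchNodeID issues GET <baseURL>/status and returns the reported node_id.
func fetchNodeID(ctx context.Context, client *http.Client, baseURL string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/status", nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("status endpoint returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return parseNodeID(body)
}

func main() {
	// Local stand-in for a peer's RPC port; no real peer is contacted.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"jsonrpc":"2.0","id":-1,"result":{"node_info":{"id":"0123456789abcdef0123456789abcdef01234567"}}}`)
	}))
	defer srv.Close()

	// A short timeout keeps one slow peer from stalling the reconcile.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	id, err := fetchNodeID(ctx, srv.Client(), srv.URL)
	fmt.Println(id, err)
}
```

Note that an identity obtained this way is self-reported by whatever answers on the RPC port, which is the trade-off against option (c).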

Worth a coral round before implementation.

Out of scope

  • Anything that changes the Spec.Peers user surface (ec2Tags, static, label union stays).
  • Genesis ceremony peer logic (controller-side genesis assembly is a different code path).
  • Drain policy for stale Status.ResolvedPeers entries (separate concern, deferred until prod signal warrants).

Relevant experts

  • kubernetes-specialist — extending reconcilePeers with a new branch; reusing the per-peer-best-effort + prior-preserve pattern from #369.
  • platform-engineer — DNSEndpointsSource retirement on the seictl side; the out-of-cluster identity-resolution question.
  • sei-network-specialist — CometBFT node_id resolution for out-of-cluster (sei-infra) peers; whether :26657/status direct-query is acceptable or if identity should come from a more authoritative surface.
