Skip to content

feat(fleetnode): discovery scan + ReportDiscoveredDevices agent#256

Draft
ankitgoswami wants to merge 18 commits into
mainfrom
ankitg/fleetnode-agent-discovery
Draft

feat(fleetnode): discovery scan + ReportDiscoveredDevices agent#256
ankitgoswami wants to merge 18 commits into
mainfrom
ankitg/fleetnode-agent-discovery

Conversation

@ankitgoswami
Copy link
Copy Markdown
Contributor

@ankitgoswami ankitgoswami commented May 18, 2026

Summary

Adds the on-prem fleet-node agent's control-stream consumer (RFC 0001 phase 2). After enrollment is CONFIRMED, the agent opens ControlStream, executes the pairing.v1.DiscoverRequest the server dispatches, streams results via ReportDiscoveredDevices correlated by command_id, and acks.

Pairs with #235 (server-side ControlStream handler, DiscoverOnFleetNode admin RPC, in-memory registry). Wire contracts merged in #257. Until #235 lands, the agent's control loop receives Unimplemented and reconnects with backoff — degrades quietly. PR order between #235 and this one doesn't matter.

mission control

Agent invocation

fleetnode run — no scan flags needed. Plugins resolved from <exe-dir>/plugins; nmap from <exe-dir>/nmap or PATH at scan time. State, enrollment, and heartbeat configuration via existing flags.

Control loop (control.go)

  • runControlLoop opens the stream and reconnects with exponential backoff (1s → 30s cap). Permanent errors (CodeNotFound, ErrBeginAuthRejected) propagate so the daemon exits rather than spinning.
  • runControlSession performs HelloAccepted handshake, then receives ControlCommands. A watcher goroutine closes the stream on ctx.Done because http2.pipe.Read blocks on a sync.Cond that ctx alone cannot unblock.
  • handleCommand decodes the payload as pairing.v1.DiscoverRequest and dispatches:
    • IPListModeRequest → bounded-concurrency probes via the plugin manager (32 in-flight, 3s per probe).
    • NmapModeRequestrunNmapDiscovery shells out via Ullaakut/nmap/v3, then runs plugin probes against every open host:port (16 in-flight). 10-minute overall scan budget.
    • MDNSModeRequest → rejected via ack error_message.
  • streamReports chunks at the server's max_items = 1024 limit and sets command_id so the operator stream wakes up.
  • Runs alongside the heartbeat loop. sync.Mutex guards SessionToken; refreshAndSave handshakes against a shallow State copy so the 2-RPC refresh does not stall token reads.

Plugins

Loaded from <exe-dir>/plugins (fixed; no CLI flag). Missing dir → heartbeat-only mode. Present-but-unsafe → hard error: the plugin manager execs every file there as the agent uid, so non-owner write capability is RCE-equivalent. Unix checkPluginsDirPerms refuses non-root/non-self owners or any group/world write bits. Windows production deployments must install the binary under an Administrator-only directory (e.g., %ProgramFiles%\fleetnode\) so plugins\ inherits a safe ACL.

Nmap

<exe-dir>/nmap if present, otherwise nmap from PATH at scan time. Server-supplied targets pass through validateNmapTarget, which accepts only IP / CIDR / A.B.C.D-N range / hostname. Leading dashes, whitespace, and shell metacharacters are rejected — a compromised server cannot pivot via flags like -iL/etc/passwd or -oN /tmp/loot.

Transport security

newGatewayHTTPClient is scheme-aware: http:// uses h2c (loopback streaming), https:// uses a TLS-validating http2.Transport. A shared AllowHTTP + DialTLSContext shim would silently downgrade https:// to plaintext, defeating ValidateServerURL's non-loopback HTTPS requirement.

Identifier synthesis

Production SDK drivers leave DeviceIdentifier empty. The agent synthesizes:

  • mac:<addr> when MAC is exposed
  • serial:<sn> when only serial is exposed
  • auto:<uuid> otherwise (server reconciles by (fleet_node, ip, port))

Docs

server/cmd/fleetnode/README.md covers overview, subcommands, state/lock semantics, plugins, nmap, control-stream flow, build, run, enrollment, security model, development, troubleshooting.

Tests

  • control_test.go: happy path (probe → report → ack with correlated command_id), MDNS rejection ack, reconnect after stream EOF.
  • nmap_test.go: TestValidateNmapTarget (21 cases covering IP/CIDR/range/hostname acceptance and -iL, whitespace, null-byte, shell-metachar, leading-dash rejection); resolver tests.
  • plugins_dir_test.go: binary-adjacent default discovery, missing-default-returns-empty, present-but-unsafe hard error.
  • client_test.go: TLS round-trip with pinned cert succeeds; untrusted cert fails with x509; non-http(s) scheme rejected; redirects refused.
  • run_test.go: stubGatewayClient satisfies the extended gatewayClient interface (ReportDiscoveredDevices, ControlStream).

Test plan

  • go build ./server/...
  • go vet ./server/...
  • golangci-lint run -c .golangci.yaml ./cmd/fleetnode/... ./internal/fleetnodebootstrap/...
  • go test ./server/cmd/fleetnode/... ./server/internal/fleetnodebootstrap/... -race -count=1
  • E2E (post-feat(fleetnode): discovery + pairing E2E (server) #235): fleetnode enroll, confirm, then fleetnode run. Verify control stream opened log. With the fake antminer on :4028, hit DiscoverOnFleetNode from the operator side with an ipList request and confirm streaming DiscoverResponse containing the device + a discovered_device row with discovered_by_fleet_node_id set. Re-run with an nmap request and confirm the same.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 18, 2026 22:55
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 18, 2026 22:55
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 18, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e357c802ad

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

rpc UploadTelemetry(stream UploadTelemetryRequest) returns (UploadTelemetryResponse);
rpc UploadEvents(stream UploadEventsRequest) returns (UploadEventsResponse);
rpc UploadHeartbeat(UploadHeartbeatRequest) returns (UploadHeartbeatResponse);
rpc ReportDiscoveredDevices(ReportDiscoveredDevicesRequest) returns (ReportDiscoveredDevicesResponse);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Implement the discovered-device RPC before enabling scans

With this RPC added to the gateway contract, the fleetnode daemon now calls ReportDiscoveredDevices, but the fleetd handler still only embeds UnimplementedFleetNodeGatewayServiceHandler and a repo-wide search found no concrete ReportDiscoveredDevices method outside generated/test code. In any environment running discovery (--scan-cidr with ports/plugins), every scan report will receive CodeUnimplemented from the server and sendDiscoveryChunk logs then drops the batch, so discovered devices are never persisted or pairable.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implemented in #235

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements the on-prem fleet-node agent side of RFC 0001 phase 2: a periodic discovery scan that probes user-supplied CIDRs with plugin-loaded discoverers and batches results into the new ReportDiscoveredDevices gateway RPC, running in parallel with the existing heartbeat loop. Adds the matching protos (also duplicated in #235) and exports ipscanner.GenerateIPsFromCIDR so the agent reuses the upstream CIDR enumerator.

Changes:

  • New fleetnode run flags (--scan-cidr, --scan-port, --discovery-interval) and a discovery goroutine with bounded-concurrency probes, 60s tick timeout, 1024-per-batch chunking, and auto:/mac:/serial: identifier synthesis.
  • Heartbeat and discovery goroutines share RunCmd.stateMu; refreshAndSave handshakes against a shallow State copy so token reads don't block on the 2-RPC refresh.
  • ipscanner.generateIPsFromCIDR is exported as GenerateIPsFromCIDR and consumed by the agent's enumerateCIDR, which layers a 4096-host cap on top of the existing 16-host-bit safety cap.

Reviewed changes

Copilot reviewed 10 out of 17 changed files in this pull request and generated no comments.

Show a summary per file
File Description
proto/fleetnodegateway/v1/fleetnodegateway.proto Adds ReportDiscoveredDevices RPC and DiscoveredDeviceReport message with validation.
proto/fleetnodeadmin/v1/fleetnodeadmin.proto Adds pairing/unpairing/list admin RPCs (duplicated in #235).
server/cmd/fleetnode/run.go Adds discovery flags, plugin bootstrap, parallel heartbeat/discovery goroutines, mutex-guarded session refresh.
server/cmd/fleetnode/discovery.go New file: probe orchestration, chunked report sending, identifier synthesis, plugin discoverer wiring.
server/cmd/fleetnode/discovery_test.go New table-driven and integration tests covering CIDR enumeration, batching, error paths, and parallel heartbeat+discovery.
server/cmd/fleetnode/run_test.go Extends stubGatewayClient with discovery call tracking.
server/cmd/fleetnode/fake_gateway_test.go Adds ReportDiscoveredDevices handler to the fake gateway.
server/internal/domain/ipscanner/network_utils.go Exports GenerateIPsFromCIDR and clarifies its doc comment.
server/internal/domain/ipscanner/scanner.go Updates internal caller to the exported name.
server/internal/domain/ipscanner/network_utils_test.go Renames test references to the exported function.
server/generated/grpc/fleetnodegateway/v1/*.go Regenerated for the new RPC and messages.
server/generated/grpc/fleetnodeadmin/v1/*.go Regenerated for the new pairing RPCs.
client/src/protoFleet/api/generated/**/*.ts Regenerated TS bindings mirroring the proto additions.
server/sdk/v1/python/proto_fleet_sdk/generated/pb/driver_pb2_grpc.py Regenerated GRPC version constant update.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (19a03050f8be5f51375a7b83c1e6f93d9e820873...f7a9656b3efe8f40bf0267355a6fdf6e1f44a7d8, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Fleetnode Executes Discovery Against Unrestricted Targets

  • Category: Network Discovery
  • Location: server/cmd/fleetnode/control.go:168
  • Description: The new control path accepts server-supplied DiscoverRequest targets, resolves IP-list hostnames, and probes them without enforcing an agent-side network policy. This allows loopback, IPv4 link-local metadata addresses such as 169.254.169.254, public IPs, and arbitrary private networks. The Nmap path similarly accepts any syntactically valid IP/CIDR/hostname, including very large CIDRs, before passing it to Nmap.
  • Impact: A compromised fleet server, server-side authorization bug, or abused discovery workflow can turn enrolled fleet nodes into SSRF/lateral-scanning pivots inside operator networks, or trigger expensive scans from the agent host.
  • Recommendation: Enforce target policy on the agent, not only server-side. Resolve hostnames first, then reject loopback, link-local/metadata, multicast, and public ranges by default; cap CIDR host counts; and allow only locally configured/approved subnets unless the operator explicitly opts in.

[MEDIUM] Windows Plugin Loading Skips ACL and File Validation

  • Category: Plugin | Infrastructure
  • Location: server/cmd/fleetnode/plugins_dir_windows.go:25
  • Description: The Unix path rejects unsafe plugin directories/files because the plugin manager executes binaries from <exe-dir>/plugins, but the Windows implementation only checks that the directory exists and makes per-file validation a no-op.
  • Impact: If a Windows install places plugins\ somewhere writable by non-admin users, an attacker can plant or replace a plugin and get code execution as the fleetnode service account.
  • Recommendation: Fail closed on Windows unless ACLs can be verified. Use golang.org/x/sys/windows to reject writable ACEs for unprivileged principals, reject reparse points/symlinks, and require an Administrator/SYSTEM/service-owned install path.

Notes

No SQL, frontend rendering, Rust ASIC plugin, pool/wallet, or protobuf wire-format changes were present in the reviewed diff. The changed TLS/h2c client split and redirect rejection look security-positive.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9216c9d33f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

rpc UploadTelemetry(stream UploadTelemetryRequest) returns (UploadTelemetryResponse);
rpc UploadEvents(stream UploadEventsRequest) returns (UploadEventsResponse);
rpc UploadHeartbeat(UploadHeartbeatRequest) returns (UploadHeartbeatResponse);
rpc ReportDiscoveredDevices(ReportDiscoveredDevicesRequest) returns (ReportDiscoveredDevicesResponse);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add discovery reports to fleet-node auth

This new RPC is called by the fleetnode agent with its session token via NewAuthenticatedGatewayClient, but server/internal/handlers/interceptors/config.go only lists telemetry/events/heartbeat/control in FleetNodeAuthenticatedProcedures. For ReportDiscoveredDevices, FleetNodeAuthInterceptor therefore never resolves the fleet-node session and the normal auth interceptor treats the bearer value as a user API key, so discovery reports are rejected before they can reach the handler; sendDiscoveryChunk then drops Unauthenticated batches and the feature never persists scanned devices.

Useful? React with 👍 / 👎.

Replace the autonomous CIDR-scanning daemon with a control-stream
consumer. The agent opens a ControlStream after enrollment is
CONFIRMED, sends a Hello, waits for Accepted, then runs each
DiscoverRequest the server dispatches as a ControlCommand and
streams results back via ReportDiscoveredDevices with the matching
command_id. A ControlAck closes each command.

- New control.go: runControlLoop with exponential backoff
  reconnect, runControlSession handshake + receive pump,
  handleCommand decode + probe + report + ack, runProbes with
  bounded concurrency, streamReports chunked at server's max,
  plus the discoverer interface + pluginDiscoverer implementation
  (moved from the old discovery.go).
- Only IPList mode is accepted on the agent. The server expands
  CIDR/IPRange into IPList before dispatching (see PR B). MDNS and
  Nmap are rejected with a descriptive ack error.
- run.go: drop --scan-cidr/--scan-port/--discovery-interval and
  the autonomous discovery loop. Add --plugins-dir-driven control
  loop alongside the heartbeat loop in the same goroutine pattern.
  Permanent errors (NotFound, ErrBeginAuthRejected) propagate so
  the daemon exits rather than reconnecting forever.

Tests cover happy path (probe + report + ack with correlated
command_id), MDNS rejection, and reconnect after stream EOF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-agent-discovery branch from 9216c9d to 977b22d Compare May 19, 2026 21:15
@github-actions github-actions Bot removed javascript Pull requests that update javascript code client shared labels May 19, 2026
@ankitgoswami ankitgoswami marked this pull request as draft May 19, 2026 21:21
Add zero-config defaults so \`fleetnode run\` works out of the box for the
shipped layout, with safe fallback behavior when the layout isn't there.

--plugins-dir
- Default to <dir-of-binary>/plugins. Missing dir means the control loop
  stays off (heartbeat only) instead of failing — operators can use the
  agent without bundling plugins.
- A present-but-unsafe dir is a hard error rather than a silent downgrade:
  the plugin manager execs anything in the dir as the agent's user, so
  any non-owner write capability there is RCE-equivalent.
- Perm check on Unix: owner must be root or the running uid; no group- or
  world-writable bits. Windows relies on the MSI installer placing the dir
  under an Administrator-only path and only checks existence here.

--nmap-path
- Adds nmap support to the agent. NmapModeRequest now executes via the
  Ullaakut nmap wrapper, then the plugin probe runs against each open
  host:port (mirrors combined-mode pairing.service).
- Default path resolution: explicit flag wins; otherwise prefer
  <dir-of-binary>/nmap (for the bundled-binary install) if executable;
  otherwise return "nmap" and let exec.LookPath resolve from PATH at scan
  time.
- Flag must be absolute and point at an executable file.

handleCommand now switches on the request mode: IPList runs the existing
plugin probe sweep, Nmap runs the new nmap path, and MDNS is rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 19, 2026
Drop the Nmap rejection in DiscoverOnFleetNode. With the agent shipping
a bundled nmap binary (#256), NmapModeRequest now passes through unchanged
so the agent can execute it. MDNS still rejected; the agent doesn't run
an mDNS listener.

Server-side validation: require nmap.target to be non-empty so an
all-zero NmapModeRequest is caught before round-tripping through the
control stream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami and others added 2 commits May 19, 2026 15:31
Apply ce-simplify findings:

- Extract fanOutProbes helper so the bounded-concurrency probe loop is
  shared between IPList mode (runProbes inline) and the nmap follow-up
  loop (runNmapDiscovery). The two had identical sem/wg/mu structure
  with different concurrency caps; now both use the same helper with
  their own constants.
- Introduce endpoint{ip, port} struct so IPList ports-x-ips and nmap's
  open-host list both feed fanOutProbes through one type.
- Drop fleetNodeID parameter from streamReports / handleCommand. The
  ID was used only in log lines; runControlLoop now enriches the
  logger with fleet_node_id once per session and downstream callers
  inherit it.
- Use nmap.Open constant instead of "open" string literal.
- Cache executableDir() once in run() instead of calling it twice
  (once for resolvePluginsDir, once for resolveNmapPath).
- Reset backoff after a stable control session. Previously a single
  successful reconnect followed by an immediate disconnect kept
  growing backoff toward the 30s cap; now a session that lives past
  stableSessionThreshold resets backoff to 1s on the next failure.
- Strip narrative doc comments (control loop, nmap, plugins-dir) that
  restated the implementation. Keep WHY comments where the rationale
  isn't obvious (plugin-manager exec security, nmap PATH-fallback
  contract, plugins-dir RCE).

Build, vet, lint, and tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`just build-fleetnode` now wires up the layout the agent expects out of the
box: server/plugins/ is already adjacent (populated by `just build-plugins`),
and this recipe additionally symlinks server/nmap from whatever is on PATH so
the agent's NmapMode probe path works without --nmap-path. Both have soft
warnings rather than hard failures when the underlying tools aren't present
(plugins not yet built, nmap not installed), so developers without the full
stack can still build the binary.

.gitignore picks up server/nmap (server/fleetnode and server/plugins/* were
already ignored).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami and others added 5 commits May 19, 2026 16:14
The previous build-fleetnode recipe dropped the binary at server/fleetnode
and relied on server/plugins/ being adjacent. That breaks the common dev
workflow: `just dev` cross-compiles plugins to linux/arm64 inside
server/plugins/ for the Docker server container, and the native-host
agent can't exec ELF binaries (`exec format error` on every load).

Move the agent's full local layout into server/.fleetnode/:
- ./.fleetnode/fleetnode (native binary)
- ./.fleetnode/plugins/ (native Go plugins via _build-go-plugins-native)
- ./.fleetnode/nmap (symlink to system nmap)

`server/plugins/` is left untouched for `just dev`'s container, and the
agent's binary-adjacent default still Just Works (`./.fleetnode/fleetnode
run --state-dir=...`) without any flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ControlStream RPC is bidi; HTTP/1.1 can't carry it, so the agent's
gateway client now uses an h2c-prior-knowledge HTTP/2 transport that
matches the fleetd server's h2c.NewHandler. Without this the agent's
control loop hits "HTTP status 505 HTTP Version Not Supported" on every
reconnect attempt while unary heartbeats keep working (HTTP/1.1 was
fine for unary).

Side effects:
- http.Client.Timeout is removed (it would cap long-lived streams).
  Per-call deadlines added: heartbeatTimeout in sendHeartbeat,
  handshakeStepTimeout in RunHandshake's two RPC calls. Existing
  per-call timeouts (discoveryReportTimeout, ControlStream ctx)
  already cover their paths.
- Test servers wrapped in h2c.NewHandler so they negotiate HTTP/2
  prior knowledge like prod does.
- TestGatewayHTTPClient_HasTimeout removed -- the constant it
  asserted against no longer exists.

Production HTTPS support is deferred until there's a target prod
deployment; the agent currently only speaks h2c plaintext.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary defenses so the agent doesn't leave go-plugin
children running when it dies.

1. Catch SIGHUP alongside SIGINT/SIGTERM. Closing the terminal sends
   SIGHUP; without it the parent dies uncaught and 'defer cleanup()'
   never fires, so the plugin manager's Shutdown is skipped and the
   children are reparented to init.

2. Add a startup reaper. On Unix, before constructing the new plugin
   manager, scan `ps -eo pid,ppid,command` for processes whose parent
   is init (ppid == 1) and whose binary lives in the resolved
   plugins directory; SIGKILL them. Covers the uncatchable cases
   (SIGKILL, panic, OOM) that the SIGHUP fix can't.

Windows is a documented no-op: go-plugin children there inherit a
job object that terminates them with the parent.

Both run early enough to keep the failure mode the user just hit
(stranded proto-plugin blocking the next agent's plugin handshake)
from happening again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ppid==1 filter was too narrow. Real-world cases where it missed:

- A previous agent's children whose zombie parent hadn't been collected
  yet, so the kernel had not re-parented them to init at the moment the
  new agent's reaper ran.
- Children adopted by an arbitrary surviving process tree (e.g. a debug
  wrapper or shell job) instead of init.

The reaper runs before newPluginDiscoverer, so by definition the current
agent has not yet spawned anything in pluginsDir. Anything we see whose
executable lives there is leftover from a prior run, regardless of ppid.
Drop the filter and exclude only the agent's own PID for safety.

Concurrent agents on the same pluginsDir are already impossible because
of state.lock contention, so widening the reaper here cannot affect a
legitimate sibling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes to prevent the silent-hang footgun we just hit.

1. Lock acquisition is non-blocking (LOCK_EX|LOCK_NB). On contention,
   read the owner PID recorded inside the lockfile and either tell the
   operator who to kill or, if the recorded PID is dead, hint that a
   subprocess inherited the FD. Previously the lock would block forever
   with no log output -- the new behavior is a clear error in <100ms.

2. Log "acquiring state lock" at the call site so the agent always has
   a visible checkpoint between plugin startup and lock acquisition.
   The user-facing symptom of the earlier hang was zero log output;
   now there's always a last-line indicating exactly how far startup
   got.

3. Move the plugin reaper + newPluginDiscoverer INSIDE the state lock.
   The reaper is broad enough to kill any plugin process in our
   pluginsDir; if a sibling agent already owns the lock, running the
   reaper would murder its plugin children. Reaping only after we own
   the lock guarantees we're the authoritative agent and any leftover
   plugins are real strays.

Verified with a smoke test: two `fleetnode run` invocations against
the same state-dir -- second errors in ~60ms with the new message and
the first agent's daemon + plugins survive untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami and others added 6 commits May 20, 2026 11:41
The bidi http2 response body parks in sync.Cond.Wait, which is not
ctx-aware. So stream.Receive() in the control loop would block past
the parent ctx's lifetime even after SIGINT triggered cancellation --
Ctrl+C left the daemon running indefinitely.

Wrap the session with a tiny watcher goroutine: when ctx fires, force
stream.CloseRequest() + CloseResponse(). That broadcasts the cond and
Receive returns with "canceled: http2: response body closed". Session
cleans up, control loop exits on its outer ctx check, runLocked
returns, plugin manager shuts down.

Also drop the plugin-manager grace period from 5s to 2s. Shutdown
waits the full timer regardless of whether the plugin already exited,
so 5s per plugin was 10s of dead time on every clean exit. 2s keeps
the safety net for genuinely stuck plugins while bringing clean
shutdown to ~4s total.

Smoke-tested: SIGINT now returns in ~4s vs hanging indefinitely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some IDE terminal panels and shell setups silently filter or buffer
stderr (FD 2), making the agent appear hung when it was logging the
whole time. Slog now writes to stdout, matching the convention used
by other CLI daemons (caddy, prometheus, vault). Redirecting with
`2>&1` and piping through tee still work; the only behavior change
is that the structured-log stream is now on FD 1 instead of FD 2.

go-plugin's hclog noise (debug `plugin address` lines, plugin
manager Shutdown messages) stays on stderr -- those are secondary
and we want the agent's own logs to win the visible-by-default slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extract testutil.NewH2CServer / testutil.NewH2CClient so the five
  inline httptest.NewUnstartedServer(h2c.NewHandler(...)) + h2c
  transport copies across cmd/fleetnode and fleetnodebootstrap test
  files share one canonical form.
- Collapse the symmetric handshake.go WithTimeout pair into a
  withHandshakeTimeout helper; the previous inline pattern was easy
  to forget on a future third step.
- Swap the defer order in runControlSession so close(done) runs
  before the stream-close defer. The watcher goroutine now retires
  via its quiet path on every normal return, avoiding a redundant
  close race when ctx happens to cancel at the same moment.
- Replace the no-op `if err := writeLockOwnerPID(f); err != nil
  { _ = err }` block with `_ = writeLockOwnerPID(f)` plus a single
  one-liner explaining the best-effort intent.
- Trim narrative comments on newGatewayHTTPClient, runControlSession,
  reapOrphanedPlugins, and the stdout-logging Run shim. Keep only
  the non-obvious WHY.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pass through the recent code aggressively removing comments that
restated implementation, narrated change history, or padded the
WHY into multi-sentence prose. Project preference is default to no
comments; non-obvious rationale stays but in one or two lines.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vnish (and the rest of the asic-rs firmware family) live in the Rust
asicrs plugin, not the Go antminer plugin. Until now build-fleetnode
only staged the Go plugins, so the agent couldn't discover any
asic-rs-only firmware on macOS dev.

- Parameterize _asicrs-build with outdir (default kept at
  server/plugins so build-plugins/test-contract are unaffected).
- Build natively via local cargo on Darwin instead of docker buildx;
  the docker path produces a Linux ELF that macOS can't exec. Linux
  hosts (CI) keep the docker path so they don't need Rust installed.
- Add the new dep to build-fleetnode so server/.fleetnode/plugins/
  contains asicrs-plugin + asicrs-config.yaml alongside the Go
  plugins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…README

- HTTPS gateway client is now scheme-aware: TLS-validating http2.Transport
  for https://, h2c for http://. Prior shared DialTLSContext shim silently
  bypassed TLS for https://, defeating ValidateServerURL.
- validateNmapTarget restricts server-supplied targets to IP/CIDR/range/
  hostname; rejects leading dashes and shell metacharacters so a compromised
  server cannot pivot via nmap flags like -iL/etc/passwd or -oN.
- Drop --plugins-dir and --nmap-path CLI flags. Installer always uses
  <exe-dir>/plugins and <exe-dir>/nmap; operator overrides only widen the
  attack surface. nmap falls back to PATH for dev machines.
- Add server/cmd/fleetnode/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 20, 2026
ankitgoswami and others added 3 commits May 20, 2026 15:28
- Per-file plugin validation: lstat each entry under <exe-dir>/plugins,
  reject symlinks, require regular file owned by root or current uid with
  no group/world write. Container-only check missed writable individual
  binaries and symlinks to elsewhere.
- Bound discovery commands: cap IPs at 1024, ports at 10 (matches
  pairing.MaxPortsPerIP), wall-clock at 10m (matches pairing default).
  Reject oversized requests with a clear error.
- IPv6 nmap: detect IPv6 literal / hostname-only-IPv6 and pass -6 via
  nmap.WithIPv6Scanning(); reject IPv6 CIDR consistently with pairing
  service. Extract buildNmapOptions for testability.
- Per-RPC timeout on Register: wrap with withHandshakeTimeout so
  blackholed servers can't hang enroll while it holds the state lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pairing service rejects scoped IPv6, rejects link-local IPv6,
canonicalizes IPv6 spelling, and resolves hostnames to IP literals before
probing. The fleetnode agent previously passed server-supplied
ip_addresses straight through, so hostname inputs probed successfully but
then failed ReportDiscoveredDevices buf.validate (ip: true), and scoped
or link-local IPv6 inputs reached the dialer with no chance of succeeding.

- New helper netutil.NormalizeIPListEntry with an IPListResolver interface
  so DNS can be stubbed in tests.
- Sentinel errors (ErrEmptyTarget, ErrScopedIPv6, ErrLinkLocalIPv6,
  ErrHostnameUnresolved) so callers can branch on cause when logging skips.
- pairing/service.go::DiscoverWithIPList: ~45 lines of inline normalization
  replaced with a single call to the helper.
- cmd/fleetnode/control.go: normalize each ip_address before fan-out, skip
  failures and log, ack failure if no entry normalizes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous cap on len(ports) didn't prevent a single entry from
expanding to thousands of probes: nmap honors range ("1-65535") and
comma ("80,443,8080,...") syntax inside a single string. Plugin-default
ports were also unchecked.

- resolveAndValidatePorts requires every entry to be a decimal TCP port
  in 1..65535. Range syntax, comma syntax, protocol prefixes ("T:80"),
  whitespace, and service names are rejected.
- Dedupes the result and re-caps at maxPortsPerIP.
- Applied uniformly across IpList and Nmap dispatch, and to the
  plugin-supplied default fallback.
- runNmapDiscovery / buildNmapOptions now take the validated port list
  so they never see the raw request.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automation documentation Improvements or additions to documentation server

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants