nssh: harden session recovery#11
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c78d07791a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| joinedPID := 0 | ||
| if cfg.Topic == "" { | ||
| cfg.Topic = generateTopic() | ||
| existing := findActiveSessionForHost(shortHost) |
There was a problem hiding this comment.
Key session reuse by full SSH target identity
The collision lookup now uses only shortHost (findActiveSessionForHost(shortHost)), but resolveShortHost normalizes hostnames to the first label (drops the domain) and ignores the SSH user. That causes unrelated targets like alice@db1.prod vs bob@db1.stage (or two different FQDNs sharing db1) to be treated as the same session, so a new run can incorrectly join/replace another connection’s topic and skip remote preparation for the actual target. This can cross-wire clipboard/open traffic and break shims on the second host/user.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Nil dereference when joining without existing session
- Added explicit nil guards in the
joinandreplaceswitch arms so the caller no longer relies onresolveSessionCollision's internal nil handling, falling back to a fresh topic ifexistingis nil.
- Added explicit nil guards in the
Or push these changes by commenting:
@cursor push 88eb385cd4
Preview (88eb385cd4)
diff --git a/cmd/nssh/main.go b/cmd/nssh/main.go
--- a/cmd/nssh/main.go
+++ b/cmd/nssh/main.go
@@ -666,11 +666,19 @@
existing := findActiveSessionForHost(shortHost)
switch resolveSessionCollision(existing, collisionFlag) {
case "join":
+ if existing == nil {
+ cfg.Topic = generateTopic()
+ break
+ }
cfg.Topic = existing.Topic
cfg.Server = existing.Server
joinedPID = existing.PID
fmt.Fprintf(os.Stderr, "nssh: joining active session for %s (PID %d)\n", shortHost, existing.PID)
case "replace":
+ if existing == nil {
+ cfg.Topic = generateTopic()
+ break
+ }
fmt.Fprintf(os.Stderr, "nssh: replacing existing session for %s (PID %d)\n", shortHost, existing.PID)
replaceSession(existing)
cfg.Topic = generateTopic()You can send follow-ups to the cloud agent here.
- skip prepareRemote when joining an existing session for the host (the original process already wrote the remote session file and seeded the log; redoing it costs an SSH and adds a misleading duplicate session-open event) - record `joined: <pid>` on the local session-start event so the log distinguishes a fresh connect from a join subscriber resilience: - log subscribe-up / subscribe-down events with reconnect gap, so silent failures are visible in `nssh status --tail` - track the most-recent ntfy message id and pass it as ?since=<id> on every reconnect — anything posted while we were asleep / on a broken connection is replayed instead of dropped on the floor liveness + collision UX: - new ping / pong wire kinds and a pingTopic helper (one-shot subscribe + publish + wait for ack) to distinguish "PID is alive and answering" from "PID is alive but wedged" - on a session collision, ping the existing peer and prompt interactively: [J]oin / [R]eplace (SIGTERM + escalate to SIGKILL) / [N]ew topic / [C]ancel. Default is join when alive, replace when not. Non-interactive shells silently join with a stderr warning if the peer didn't answer. - new flags --join / --replace / --new to skip the prompt remote cleanup: - new `nssh sweep [--all|--older <dur>] <host>` subcommand that enumerates mosh-server processes owned by $USER on the remote and offers to kill them. Safe with tmux-inside-mosh: tmux server is its own daemon, so detached sessions survive. follow-ups in #10 (auto-sweep on connect, spare-yourself in --all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: New log fields invisible in
nssh status --tail- Updated
formatEventto render the new host, joined, reconnect, gap, and since log fields in status tail output.
- Updated
Or push these changes by commenting:
@cursor push ce1939e6aa
Preview (ce1939e6aa)
diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -324,9 +324,18 @@
if e.Exit != nil {
fmt.Fprintf(&sb, " exit=%d", *e.Exit)
}
+ if e.Gap != "" {
+ fmt.Fprintf(&sb, " gap=%s", e.Gap)
+ }
+ if e.Host != "" {
+ fmt.Fprintf(&sb, " host=%s", e.Host)
+ }
if e.ID != "" {
fmt.Fprintf(&sb, " id=%s", e.ID)
}
+ if e.Joined != 0 {
+ fmt.Fprintf(&sb, " joined=%d", e.Joined)
+ }
if e.Kind != "" {
fmt.Fprintf(&sb, " kind=%s", e.Kind)
}
@@ -339,9 +348,15 @@
if e.Persona != "" {
fmt.Fprintf(&sb, " persona=%s", e.Persona)
}
+ if e.Reconnect {
+ fmt.Fprintf(&sb, " reconnect=%v", e.Reconnect)
+ }
if e.Server != "" {
fmt.Fprintf(&sb, " server=%s", e.Server)
}
+ if e.Since != "" {
+ fmt.Fprintf(&sb, " since=%s", e.Since)
+ }
if e.Size > 0 {
fmt.Fprintf(&sb, " size=%d", e.Size)
}You can send follow-ups to the cloud agent here.
- collapse --join/--replace/--new flag parsing onto one case arm - flatten the collision switch so the existing==nil branch hoists out of all three "case" bodies - drop the redundant joinedPID guard in session-start (omitempty handles 0) - replaceSession: extract pidfile path once, single deferred os.Remove - subscribeNtfy: hoist subscribe-down logging into a markDown closure shared by the connect-failure and stream-end paths - sweep: collapse --older/--older= branches; extract pidsToStrings
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: New log fields invisible in
nssh status --tail- Added the missing LogEvent fields to status tail formatting so host, joined, reconnect, gap, and since are printed when present.
Or push these changes by commenting:
@cursor push 636d9e59b0
Preview (636d9e59b0)
diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -324,9 +324,18 @@
if e.Exit != nil {
fmt.Fprintf(&sb, " exit=%d", *e.Exit)
}
+ if e.Gap != "" {
+ fmt.Fprintf(&sb, " gap=%s", e.Gap)
+ }
+ if e.Host != "" {
+ fmt.Fprintf(&sb, " host=%s", e.Host)
+ }
if e.ID != "" {
fmt.Fprintf(&sb, " id=%s", e.ID)
}
+ if e.Joined != 0 {
+ fmt.Fprintf(&sb, " joined=%d", e.Joined)
+ }
if e.Kind != "" {
fmt.Fprintf(&sb, " kind=%s", e.Kind)
}
@@ -339,9 +348,15 @@
if e.Persona != "" {
fmt.Fprintf(&sb, " persona=%s", e.Persona)
}
+ if e.Reconnect {
+ fmt.Fprintf(&sb, " reconnect=%v", e.Reconnect)
+ }
if e.Server != "" {
fmt.Fprintf(&sb, " server=%s", e.Server)
}
+ if e.Since != "" {
+ fmt.Fprintf(&sb, " since=%s", e.Since)
+ }
if e.Size > 0 {
fmt.Fprintf(&sb, " size=%d", e.Size)
}You can send follow-ups to the cloud agent here.
The session-recovery changes added Host / Joined / Reconnect / Gap / Since to LogEvent but the formatter in status.go was never updated, so the reconnect-gap and replay-since data that was the whole point of the visibility work was silently dropped from tail output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Self-SIGTERM via stale pidfile with PID reuse
- activeSessions now treats a pidfile for the current PID as stale PID reuse and removes it before collision handling can replace it.
Or push these changes by commenting:
@cursor push 1326bdb7db
Preview (1326bdb7db)
diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -90,6 +90,7 @@
if err != nil {
return nil
}
+ currentPID := os.Getpid()
var out []sessionInfo
for _, e := range entries {
if e.IsDir() || !strings.HasSuffix(e.Name(), ".json") {
@@ -104,6 +105,11 @@
if err := json.Unmarshal(data, &info); err != nil {
continue
}
+ // A pidfile for our own PID is stale PID reuse; don't let replace kill us.
+ if info.PID == currentPID {
+ _ = os.Remove(path)
+ continue
+ }
// Signal 0 is a permission/existence probe on POSIX.
if err := syscall.Kill(info.PID, 0); err != nil {
_ = os.Remove(path)
diff --git a/cmd/nssh/status_test.go b/cmd/nssh/status_test.go
new file mode 100644
--- /dev/null
+++ b/cmd/nssh/status_test.go
@@ -1,0 +1,42 @@
+package main
+
+import (
+ "encoding/json"
+ "fmt"
+ "os"
+ "path/filepath"
+ "testing"
+ "time"
+)
+
+func TestActiveSessionsDropsStaleSelfPID(t *testing.T) {
+ t.Setenv("XDG_STATE_HOME", t.TempDir())
+
+ dir := sessionsDir()
+ if err := os.MkdirAll(dir, 0o755); err != nil {
+ t.Fatal(err)
+ }
+
+ path := filepath.Join(dir, fmt.Sprintf("%d.json", os.Getpid()))
+ data, err := json.Marshal(sessionInfo{
+ PID: os.Getpid(),
+ Target: "devbox",
+ Host: "devbox",
+ Topic: "nssh_stale",
+ Server: defaultServer,
+ Started: time.Now().Add(-time.Hour),
+ })
+ if err != nil {
+ t.Fatal(err)
+ }
+ if err := os.WriteFile(path, data, 0o644); err != nil {
+ t.Fatal(err)
+ }
+
+ if sessions := activeSessions(); len(sessions) != 0 {
+ t.Fatalf("activeSessions() = %d sessions, want 0", len(sessions))
+ }
+ if _, err := os.Stat(path); !os.IsNotExist(err) {
+ t.Fatalf("stale self pidfile still exists: %v", err)
+ }
+}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 148702d. Configure here.
| } | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Self-SIGTERM via stale pidfile with PID reuse
Medium Severity
findActiveSessionForHost calls activeSessions(), which considers a PID alive if syscall.Kill(pid, 0) succeeds. If a previous nssh crashed without cleanup and the OS reuses that PID for the current process, the stale session passes the liveness check (it's us!). When the host matches and the user picks "replace" (the default for an unresponsive peer), replaceSession sends syscall.Kill(ourOwnPID, SIGTERM), terminating the current process before the signal handler is installed. Filtering out os.Getpid() in findActiveSessionForHost or activeSessions would prevent this.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 148702d. Configure here.



Summary
Hardens the per-host session-recovery mechanism that the
session-recoverycheckpoint introduced. Two earlier branch commits stayed, plus four new behaviors:prepareRemotewhen joining. The original process for the host already wrote the remote session file and seeded the log. Redoing it wastes an SSH and produces a misleading secondsession-openevent.?since=<id>. Track the latest ntfy message id we processed and ask for everything since that point on reconnect — solves the laptop-closed-overnight case directly. Closed-up TCP sockets timeout viadeadlineConn, then the next subscribe burst-replays anything posted while we were unreachable.subscribe-up/subscribe-downevents in the jsonl with reconnect gap, so silent failures are visible fromnssh status --tail.ping/pongliveness probe between nssh peers sharing a topic.pingTopicis a one-shot subscribe + publish + 1.5 s wait — distinguishes "PID is alive" from "PID is alive but wedged".[J]oin / [R]eplace (SIGTERM, then SIGKILL after 1 s) / [N]ew topic / [C]ancel. Default depends on liveness. Non-interactive shells join silently with a stderr warning if the peer didn't answer. New--join / --replace / --newflags skip the prompt.nssh sweep <host>new subcommand. Lists mosh-server processes for$USERon the remote (oldest first) and offers to kill them.--allfor unattended,--older 168hto keep the last week. Safe with tmux-inside-mosh — the tmux server is its own daemon, sessions detach instead of dying.Wire-protocol additions:
pingandpongkinds. Bothlocal ↔ local. Documented inCLAUDE.md.Follow-ups (auto-sweep on connect, spare-yourself in
--all) are captured in #10.Test plan
just testpasses (addedparseEtimetests)go vetclean,gofmtcleannssh --helpshows new flags + sweepnssh sweeparg validation (no host, conflicting flags, bad duration) behavesnssh hostAinstances → second one prompts; pick join → both share topicnssh sweep hostAlists mosh-servers; pick by PID; pickold; SIGKILL escalation🤖 Generated with Claude Code
Note
Medium Risk
Changes core session startup/reuse logic and ntfy subscription behavior, plus adds a remote process-killing command (
sweep), so bugs could disrupt session routing or terminate the wrong process if host/PID detection is off.Overview
Hardens local session reuse and recovery.
nssh <host>now detects existing local sessions for the same canonical host and either joins, replaces (SIGTERM→SIGKILL), or starts a new topic, with--join/--replace/--newflags and an interactive prompt; joining also skips redundantprepareRemotework.Improves ntfy subscriber resilience. The subscriber tracks the last processed ntfy message
idand reconnects with?since=<id>to replay missed messages, emits newsubscribe-up/subscribe-downlog events (with gap/since metadata), and addsping/pongwire kinds used for liveness checks during collision resolution.Adds maintenance tooling and docs updates. Introduces
nssh sweep <host>to list and optionally kill remotemosh-serverprocesses (with--all/--olderand tests forps etimeparsing), updates log schema fields (host,joined, reconnect metadata), and refreshes protocol/CLAUDE documentation;.gitignorenow ignores/.claude/.Reviewed by Cursor Bugbot for commit 148702d. Bugbot is set up for automated code reviews on this repo. Configure here.