nssh: harden session recovery by abizer · Pull Request #11 · abizer/nssh

abizer · 2026-05-12T00:05:24Z

Summary

Hardens the per-host session-recovery mechanism that the session-recovery checkpoint introduced. Two earlier branch commits stayed, plus four new behaviors:

Skip prepareRemote when joining. The original process for the host already wrote the remote session file and seeded the log. Redoing it wastes an SSH and produces a misleading second session-open event.
Replay missed messages via ?since=<id>. Track the latest ntfy message id we processed and ask for everything since that point on reconnect — solves the laptop-closed-overnight case directly. Closed-up TCP sockets timeout via deadlineConn, then the next subscribe burst-replays anything posted while we were unreachable.
subscribe-up / subscribe-down events in the jsonl with reconnect gap, so silent failures are visible from nssh status --tail.
ping / pong liveness probe between nssh peers sharing a topic. pingTopic is a one-shot subscribe + publish + 1.5 s wait — distinguishes "PID is alive" from "PID is alive but wedged".
Collision prompt when an existing session is found: [J]oin / [R]eplace (SIGTERM, then SIGKILL after 1 s) / [N]ew topic / [C]ancel. Default depends on liveness. Non-interactive shells join silently with a stderr warning if the peer didn't answer. New --join / --replace / --new flags skip the prompt.
nssh sweep <host> new subcommand. Lists mosh-server processes for $USER on the remote (oldest first) and offers to kill them. --all for unattended, --older 168h to keep the last week. Safe with tmux-inside-mosh — the tmux server is its own daemon, sessions detach instead of dying.

Wire-protocol additions: ping and pong kinds. Both local ↔ local. Documented in CLAUDE.md.

Follow-ups (auto-sweep on connect, spare-yourself in --all) are captured in #10.

Test plan

just test passes (added parseEtime tests)
go vet clean, gofmt clean
nssh --help shows new flags + sweep
nssh sweep arg validation (no host, conflicting flags, bad duration) behaves
Manual: two nssh hostA instances → second one prompts; pick join → both share topic
Manual: replace path → first instance gets SIGTERM, second takes over with fresh topic
Manual: laptop sleep → ntfy reconnect logs replay count + gap
Manual: nssh sweep hostA lists mosh-servers; pick by PID; pick old; SIGKILL escalation

🤖 Generated with Claude Code

Note

Medium Risk
Changes core session startup/reuse logic and ntfy subscription behavior, plus adds a remote process-killing command (sweep), so bugs could disrupt session routing or terminate the wrong process if host/PID detection is off.

Overview
Hardens local session reuse and recovery. nssh <host> now detects existing local sessions for the same canonical host and either joins, replaces (SIGTERM→SIGKILL), or starts a new topic, with --join/--replace/--new flags and an interactive prompt; joining also skips redundant prepareRemote work.

Improves ntfy subscriber resilience. The subscriber tracks the last processed ntfy message id and reconnects with ?since=<id> to replay missed messages, emits new subscribe-up/subscribe-down log events (with gap/since metadata), and adds ping/pong wire kinds used for liveness checks during collision resolution.

Adds maintenance tooling and docs updates. Introduces nssh sweep <host> to list and optionally kill remote mosh-server processes (with --all/--older and tests for ps etime parsing), updates log schema fields (host, joined, reconnect metadata), and refreshes protocol/CLAUDE documentation; .gitignore now ignores /.claude/.

^{Reviewed by Cursor Bugbot for commit 148702d. Bugbot is set up for automated code reviews on this repo. Configure here.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c78d07791a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-12T00:10:28Z

+	joinedPID := 0
 	if cfg.Topic == "" {
-		cfg.Topic = generateTopic()
+		existing := findActiveSessionForHost(shortHost)


Key session reuse by full SSH target identity

The collision lookup now uses only shortHost (findActiveSessionForHost(shortHost)), but resolveShortHost normalizes hostnames to the first label (drops the domain) and ignores the SSH user. That causes unrelated targets like alice@db1.prod vs bob@db1.stage (or two different FQDNs sharing db1) to be treated as the same session, so a new run can incorrectly join/replace another connection’s topic and skip remote preparation for the actual target. This can cross-wire clipboard/open traffic and break shims on the second host/user.

Useful? React with 👍 / 👎.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Nil dereference when joining without existing session
- Added explicit nil guards in the join and replace switch arms so the caller no longer relies on resolveSessionCollision's internal nil handling, falling back to a fresh topic if existing is nil.

Or push these changes by commenting:

@cursor push 88eb385cd4

Preview (88eb385cd4)

diff --git a/cmd/nssh/main.go b/cmd/nssh/main.go
--- a/cmd/nssh/main.go
+++ b/cmd/nssh/main.go
@@ -666,11 +666,19 @@
 		existing := findActiveSessionForHost(shortHost)
 		switch resolveSessionCollision(existing, collisionFlag) {
 		case "join":
+			if existing == nil {
+				cfg.Topic = generateTopic()
+				break
+			}
 			cfg.Topic = existing.Topic
 			cfg.Server = existing.Server
 			joinedPID = existing.PID
 			fmt.Fprintf(os.Stderr, "nssh: joining active session for %s (PID %d)\n", shortHost, existing.PID)
 		case "replace":
+			if existing == nil {
+				cfg.Topic = generateTopic()
+				break
+			}
 			fmt.Fprintf(os.Stderr, "nssh: replacing existing session for %s (PID %d)\n", shortHost, existing.PID)
 			replaceSession(existing)
 			cfg.Topic = generateTopic()

_{You can send follow-ups to the cloud agent here.}

- skip prepareRemote when joining an existing session for the host (the original process already wrote the remote session file and seeded the log; redoing it costs an SSH and adds a misleading duplicate session-open event) - record `joined: <pid>` on the local session-start event so the log distinguishes a fresh connect from a join subscriber resilience: - log subscribe-up / subscribe-down events with reconnect gap, so silent failures are visible in `nssh status --tail` - track the most-recent ntfy message id and pass it as ?since=<id> on every reconnect — anything posted while we were asleep / on a broken connection is replayed instead of dropped on the floor liveness + collision UX: - new ping / pong wire kinds and a pingTopic helper (one-shot subscribe + publish + wait for ack) to distinguish "PID is alive and answering" from "PID is alive but wedged" - on a session collision, ping the existing peer and prompt interactively: [J]oin / [R]eplace (SIGTERM + escalate to SIGKILL) / [N]ew topic / [C]ancel. Default is join when alive, replace when not. Non-interactive shells silently join with a stderr warning if the peer didn't answer. - new flags --join / --replace / --new to skip the prompt remote cleanup: - new `nssh sweep [--all|--older <dur>] <host>` subcommand that enumerates mosh-server processes owned by $USER on the remote and offers to kill them. Safe with tmux-inside-mosh: tmux server is its own daemon, so detached sessions survive. follow-ups in #10 (auto-sweep on connect, spare-yourself in --all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: New log fields invisible in nssh status --tail
- Updated formatEvent to render the new host, joined, reconnect, gap, and since log fields in status tail output.

Or push these changes by commenting:

@cursor push ce1939e6aa

Preview (ce1939e6aa)

diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -324,9 +324,18 @@
 	if e.Exit != nil {
 		fmt.Fprintf(&sb, " exit=%d", *e.Exit)
 	}
+	if e.Gap != "" {
+		fmt.Fprintf(&sb, " gap=%s", e.Gap)
+	}
+	if e.Host != "" {
+		fmt.Fprintf(&sb, " host=%s", e.Host)
+	}
 	if e.ID != "" {
 		fmt.Fprintf(&sb, " id=%s", e.ID)
 	}
+	if e.Joined != 0 {
+		fmt.Fprintf(&sb, " joined=%d", e.Joined)
+	}
 	if e.Kind != "" {
 		fmt.Fprintf(&sb, " kind=%s", e.Kind)
 	}
@@ -339,9 +348,15 @@
 	if e.Persona != "" {
 		fmt.Fprintf(&sb, " persona=%s", e.Persona)
 	}
+	if e.Reconnect {
+		fmt.Fprintf(&sb, " reconnect=%v", e.Reconnect)
+	}
 	if e.Server != "" {
 		fmt.Fprintf(&sb, " server=%s", e.Server)
 	}
+	if e.Since != "" {
+		fmt.Fprintf(&sb, " since=%s", e.Since)
+	}
 	if e.Size > 0 {
 		fmt.Fprintf(&sb, " size=%d", e.Size)
 	}

_{You can send follow-ups to the cloud agent here.}

- collapse --join/--replace/--new flag parsing onto one case arm - flatten the collision switch so the existing==nil branch hoists out of all three "case" bodies - drop the redundant joinedPID guard in session-start (omitempty handles 0) - replaceSession: extract pidfile path once, single deferred os.Remove - subscribeNtfy: hoist subscribe-down logging into a markDown closure shared by the connect-failure and stream-end paths - sweep: collapse --older/--older= branches; extract pidsToStrings

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: New log fields invisible in nssh status --tail
- Added the missing LogEvent fields to status tail formatting so host, joined, reconnect, gap, and since are printed when present.

Or push these changes by commenting:

@cursor push 636d9e59b0

Preview (636d9e59b0)

diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -324,9 +324,18 @@
 	if e.Exit != nil {
 		fmt.Fprintf(&sb, " exit=%d", *e.Exit)
 	}
+	if e.Gap != "" {
+		fmt.Fprintf(&sb, " gap=%s", e.Gap)
+	}
+	if e.Host != "" {
+		fmt.Fprintf(&sb, " host=%s", e.Host)
+	}
 	if e.ID != "" {
 		fmt.Fprintf(&sb, " id=%s", e.ID)
 	}
+	if e.Joined != 0 {
+		fmt.Fprintf(&sb, " joined=%d", e.Joined)
+	}
 	if e.Kind != "" {
 		fmt.Fprintf(&sb, " kind=%s", e.Kind)
 	}
@@ -339,9 +348,15 @@
 	if e.Persona != "" {
 		fmt.Fprintf(&sb, " persona=%s", e.Persona)
 	}
+	if e.Reconnect {
+		fmt.Fprintf(&sb, " reconnect=%v", e.Reconnect)
+	}
 	if e.Server != "" {
 		fmt.Fprintf(&sb, " server=%s", e.Server)
 	}
+	if e.Since != "" {
+		fmt.Fprintf(&sb, " since=%s", e.Since)
+	}
 	if e.Size > 0 {
 		fmt.Fprintf(&sb, " size=%d", e.Size)
 	}

_{You can send follow-ups to the cloud agent here.}

The session-recovery changes added Host / Joined / Reconnect / Gap / Since to LogEvent but the formatter in status.go was never updated, so the reconnect-gap and replay-since data that was the whole point of the visibility work was silently dropped from tail output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Self-SIGTERM via stale pidfile with PID reuse
- activeSessions now treats a pidfile for the current PID as stale PID reuse and removes it before collision handling can replace it.

Or push these changes by commenting:

@cursor push 1326bdb7db

Preview (1326bdb7db)

diff --git a/cmd/nssh/status.go b/cmd/nssh/status.go
--- a/cmd/nssh/status.go
+++ b/cmd/nssh/status.go
@@ -90,6 +90,7 @@
 	if err != nil {
 		return nil
 	}
+	currentPID := os.Getpid()
 	var out []sessionInfo
 	for _, e := range entries {
 		if e.IsDir() || !strings.HasSuffix(e.Name(), ".json") {
@@ -104,6 +105,11 @@
 		if err := json.Unmarshal(data, &info); err != nil {
 			continue
 		}
+		// A pidfile for our own PID is stale PID reuse; don't let replace kill us.
+		if info.PID == currentPID {
+			_ = os.Remove(path)
+			continue
+		}
 		// Signal 0 is a permission/existence probe on POSIX.
 		if err := syscall.Kill(info.PID, 0); err != nil {
 			_ = os.Remove(path)

diff --git a/cmd/nssh/status_test.go b/cmd/nssh/status_test.go
new file mode 100644
--- /dev/null
+++ b/cmd/nssh/status_test.go
@@ -1,0 +1,42 @@
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"os"
+	"path/filepath"
+	"testing"
+	"time"
+)
+
+func TestActiveSessionsDropsStaleSelfPID(t *testing.T) {
+	t.Setenv("XDG_STATE_HOME", t.TempDir())
+
+	dir := sessionsDir()
+	if err := os.MkdirAll(dir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+
+	path := filepath.Join(dir, fmt.Sprintf("%d.json", os.Getpid()))
+	data, err := json.Marshal(sessionInfo{
+		PID:     os.Getpid(),
+		Target:  "devbox",
+		Host:    "devbox",
+		Topic:   "nssh_stale",
+		Server:  defaultServer,
+		Started: time.Now().Add(-time.Hour),
+	})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(path, data, 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	if sessions := activeSessions(); len(sessions) != 0 {
+		t.Fatalf("activeSessions() = %d sessions, want 0", len(sessions))
+	}
+	if _, err := os.Stat(path); !os.IsNotExist(err) {
+		t.Fatalf("stale self pidfile still exists: %v", err)
+	}
+}

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit 148702d. Configure here.}

cursor · 2026-05-12T03:48:54Z

+		}
+	}
+	return nil
+}


Self-SIGTERM via stale pidfile with PID reuse

Medium Severity

findActiveSessionForHost calls activeSessions(), which considers a PID alive if syscall.Kill(pid, 0) succeeds. If a previous nssh crashed without cleanup and the OS reuses that PID for the current process, the stale session passes the liveness check (it's us!). When the host matches and the user picks "replace" (the default for an unresponsive peer), replaceSession sends syscall.Kill(ourOwnPID, SIGTERM), terminating the current process before the signal handler is installed. Filtering out os.Getpid() in findActiveSessionForHost or activeSessions would prevent this.

Additional Locations (1)

cmd/nssh/session.go#L215-L236

^{Reviewed by Cursor Bugbot for commit 148702d. Configure here.}

chatgpt-codex-connector Bot reviewed May 12, 2026

View reviewed changes

cursor Bot reviewed May 12, 2026

View reviewed changes

Comment thread cmd/nssh/main.go Outdated

abizer and others added 3 commits May 11, 2026 19:37

session recovery checkpoint

7e87af1

fix(nssh): guard against nil existing session in collision switch

ec5dc3a

abizer force-pushed the session-recovery branch from c78d077 to ec5dc3a Compare May 12, 2026 02:43

cursor Bot reviewed May 12, 2026

View reviewed changes

Comment thread cmd/nssh/log.go

cursor Bot reviewed May 12, 2026

View reviewed changes

Comment thread cmd/nssh/status.go

cursor Bot reviewed May 12, 2026

View reviewed changes

abizer mentioned this pull request May 12, 2026

design: per-session config via NSSH_SESSION env injection #12

Open

abizer merged commit bfa3180 into master May 12, 2026
3 checks passed

abizer deleted the session-recovery branch May 12, 2026 03:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nssh: harden session recovery#11

nssh: harden session recovery#11
abizer merged 5 commits into
masterfrom
session-recovery

abizer commented May 12, 2026 •

edited by cursor Bot

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 12, 2026

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

cursor Bot May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

abizer commented May 12, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 12, 2026

Choose a reason for hiding this comment

Self-SIGTERM via stale pidfile with PID reuse

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

abizer commented May 12, 2026 •

edited by cursor Bot

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading