Skip to content

feat(xdbg): add healthcheck command for local XMTP clients#3351

Open
insipx wants to merge 2 commits intomainfrom
feat/xdbg-healthcheck
Open

feat(xdbg): add healthcheck command for local XMTP clients#3351
insipx wants to merge 2 commits intomainfrom
feat/xdbg-healthcheck

Conversation

@insipx
Copy link
Copy Markdown
Contributor

@insipx insipx commented Mar 21, 2026

Summary

  • Adds xdbg healthcheck subcommand that runs 5 health checks per client in the local identity store
  • Checks: missing messages, can-send, can-receive, no commit log fork, identity reachable on network
  • Supports --fail-fast (-f) flag to stop at the first unhealthy client
  • Prints a per-client/per-group report and exits non-zero if any client is unhealthy

Test plan

  • Run xdbg healthcheck against a local node with generated identities and verify output
  • Run xdbg healthcheck --fail-fast and verify it exits early on the first failure
  • Verify xdbg --help shows the new healthcheck subcommand with correct description

🤖 Generated with Claude Code

Note

Add healthcheck command to xmtp_debug CLI for local XMTP clients

  • Adds a new healthcheck subcommand to args.rs with a --fail-fast (-f) flag to stop at the first unhealthy client.
  • Implements the health check workflow in healthcheck.rs: loads identities from the local DB, checks network reachability via get_identity_updates_v2, enumerates groups, compares network vs. local messages to detect missing ones, reads fork indicators, and attempts to send a test message (xdbg-healthcheck) to an active group.
  • Logs per-client and per-group results with colored output and prints a summary on completion.
  • Risk: the healthcheck sends a real message to a group on the network; it also calls std::process::exit(1) directly on unhealthy findings rather than returning an error.

Macroscope summarized e61b440.

Adds `xdbg healthcheck` which runs 5 checks per client stored in the
local identity store: missing messages, can-send, can-receive, no fork,
and identity reachability. Supports --fail-fast to stop on first failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@insipx insipx requested a review from a team as a code owner March 21, 2026 02:26
@claude
Copy link
Copy Markdown

claude bot commented Mar 21, 2026

Claude finished @insipx's task —— View job


Code Review Complete

Review Date: March 21, 2026 02:37 UTC

I've reviewed PR #3351 and found the following issues beyond what Macroscope already identified:

🔴 Critical Issues

1. Incorrect is_healthy() logic for groups (apps/xmtp_debug/src/app/healthcheck.rs:54-60)

  • Macroscope correctly identified this: is_healthy() requires ALL groups to have can_send=true, but the code only sets it on the first successful send (line 238)
  • This causes clients with multiple groups to always be reported as unhealthy
  • Recommendation: Change line 56 to remove the can_send requirement from the .all() check, or track whether ANY group can send

⚠️ Other Issues

2. Potential panic on unwrap (apps/xmtp_debug/src/app/healthcheck.rs:191)

let stored: Option<StoredGroup> = conn.find_group(&group.group_id).unwrap_or(None);
  • Using unwrap_or(None) suggests find_group returns Result<Option<T>>, but unwrap_or would panic on Err
  • Recommendation: Use .ok().flatten() or proper error handling

3. Inconsistent error handling in fail-fast mode (apps/xmtp_debug/src/app/healthcheck.rs:151-155)

  • When query_group_messages fails and fail_fast is enabled, the function returns an error
  • However, when client_from_identity fails (line 100-108), it only returns if fail_fast is true
  • But when a test message send fails (line 236-248), it continues even in fail-fast mode
  • Recommendation: Clarify the fail-fast semantics - should it exit on ANY failure or only on unhealthy clients?

4. Test message spam (apps/xmtp_debug/src/app/healthcheck.rs:234)

  • The healthcheck sends a test message b"xdbg-healthcheck" to verify send capability
  • These messages persist in the group and may clutter production/staging groups
  • Recommendation: Document this behavior in the command help text, or consider a cleanup mechanism

5. Missing context on missing messages (apps/xmtp_debug/src/app/healthcheck.rs:286-293)

  • The summary logs missing sequence IDs but doesn't explain what they mean or how to fix them
  • Recommendation: Add a hint about running sync or investigating network issues

6. No test coverage

  • The test plan checkboxes are unchecked, and there are no unit tests for the new functionality
  • Per CLAUDE.md: "Add test coverage for new functionality"
  • Recommendation: Add tests or document why manual testing is sufficient

✅ Good Practices Observed

  • Clear documentation at the top of the file explaining the 5 health checks
  • Good use of structured logging with context fields
  • Color-coded output for readability
  • Proper error messages for user guidance

Comment on lines +52 to +59
impl ClientHealthResult {
fn is_healthy(&self) -> bool {
self.identity_reachable
&& self
.groups
.iter()
.all(|g| g.missing_message_count == 0 && g.can_send && g.can_receive && g.is_forked != Some(true))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical app/healthcheck.rs:52

In is_healthy(), g.can_send is required to be true for all groups, but the usage code only sets can_send = true on the first active group that successfully sends and then breaks. A client with multiple groups will always be reported as unhealthy because at most one group can have can_send = true. Consider changing the logic to check whether the client can send to ANY group, or track which groups are expected to be sendable.

     fn is_healthy(&self) -> bool {
         self.identity_reachable
             && self
                 .groups
                 .iter()
-                .all(|g| g.missing_message_count == 0 && g.can_send && g.can_receive && g.is_forked != Some(true))
+                .all(|g| g.missing_message_count == 0 && g.can_receive && g.is_forked != Some(true))
     }
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file apps/xmtp_debug/src/app/healthcheck.rs around lines 52-59:

In `is_healthy()`, `g.can_send` is required to be `true` for all groups, but the usage code only sets `can_send = true` on the first active group that successfully sends and then breaks. A client with multiple groups will always be reported as unhealthy because at most one group can have `can_send = true`. Consider changing the logic to check whether the client can send to ANY group, or track which groups are expected to be sendable.

Evidence trail:
apps/xmtp_debug/src/app/healthcheck.rs lines 51-56 (is_healthy uses .all() requiring can_send=true for ALL groups), lines 182-201 (only sets can_send=true on first successful send then breaks), line 5 (documented intent says 'at least one active group')

@macroscopeapp
Copy link
Copy Markdown
Contributor

macroscopeapp bot commented Mar 21, 2026

Approvability

Verdict: Approved

This PR adds a new diagnostic healthcheck command to the xmtp_debug tool, which is a self-contained developer utility that doesn't affect production code paths. The author owns all modified files and has extensive tenure, making this a low-risk addition within their designated domain.

You can customize Macroscope's approvability policy. Learn more.

- Print missing sequence IDs (sequence_id, originator_id) in summary
- Check all fork signals: `maybe_forked` + `fork_details` in addition to `is_commit_log_forked`
- Replace all println! with info! for consistent structured logging
- Use owo_colors red+bold for unhealthy status, error counts, and failure messages
- Exit with code 1 on failure so CI can detect unhealthy state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.93%. Comparing base (0dff96f) to head (e61b440).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3351      +/-   ##
==========================================
+ Coverage   82.91%   82.93%   +0.02%     
==========================================
  Files         376      376              
  Lines       50896    50896              
==========================================
+ Hits        42198    42209      +11     
+ Misses       8698     8687      -11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@insipx insipx mentioned this pull request Mar 30, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant