fix(observability): classify 'operation timed out' transport phrase as expected#2782
fix(observability): classify 'operation timed out' transport phrase as expected#2782CodeGhost21 wants to merge 1 commit into
Conversation
The channel supervisor wraps a listener failure as
`format!("Channel {} error: {e:#}; restarting", ch.name())` and
routes the result through `report_error_or_expected`. When the
discord gateway TCP/WebSocket socket hits `ETIMEDOUT`, the anyhow
chain renders without a URL anchor (this is `std::io`-level, below
reqwest) and previously fell straight through every classifier arm
into `report_error` — one Sentry event per backoff cycle.
`TRANSIENT_TRANSPORT_PHRASES` and `contains_transient_transport_phrase`
already treat `"operation timed out"` as transient at other emit sites
(`authed_json` transport branch, `is_transient_message_failure`), but
`expected_error_kind` — the funnel `report_error_or_expected` uses —
never consulted that list. Closing the gap in `is_network_unreachable_message`
keeps the classifier's per-anchor structure intact and is symmetric
with `"connection refused"` / `"connection reset"` (no errno suffix
pinned — `(os error 60)` BSD/macOS, `(os error 110)` Linux,
`(os error 10060)` Windows `WSAETIMEDOUT`, and bare prose all share
the lowercase substring).
Targets Sentry OPENHUMAN-TAURI-EM (issue 608): 128 events between
2026-05-19 and 2026-05-27, all from
`logger=openhuman_core::openhuman::channels::runtime::supervision`,
canonical body:
Channel discord error: IO error: Operation timed out (os error 60); restarting
Tests pin the macOS / Linux / Windows wire shapes (so a future
platform-specific change cannot silently re-open the leak), the
provider-agnostic supervisor wrapper (`Channel slack error: ...`,
`Channel telegram error: ...`), and a counter-example (`"timeout"`
mentioned as a config-knob name, no `"operation timed out"` anchor)
to confirm the matcher stays specific.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthrough
ChangesChannel-supervisor timeout classification
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
graycyrus
left a comment
There was a problem hiding this comment.
@CodeGhost21 hey! the code looks good to me, but CI is failing on Rust Core Tests + Quality and Rust Tauri Shell Tests — once those are green i'll come back and approve this. let me know if you need any help sorting them out.
For the record, the fix itself is solid: adding "operation timed out" to is_network_unreachable_message is the right call — symmetric with "connection reset" / "connection refused" already there, platform-agnostic across BSD/Linux/Windows errno renderings, and the Sentry evidence makes the motivation clear. Tests cover the exact wire shapes from the issue plus the negative case to guard against over-broad matching. Nice work keeping the diff surgical.
Summary
channels::runtime::supervisionwraps a listener failure asformat!(\"Channel {} error: {e:#}; restarting\", ch.name())and routes the result throughreport_error_or_expected. When the discord gateway TCP/WebSocket socket hitsETIMEDOUT, the anyhow chain renders without a URL anchor (this isstd::io-level, below reqwest) and previously fell through every classifier arm inexpected_error_kindintoreport_error— one Sentry event per backoff cycle.TRANSIENT_TRANSPORT_PHRASESandcontains_transient_transport_phrasealready treat\"operation timed out\"as transient at other emit sites (authed_jsontransport branch,is_transient_message_failure), butexpected_error_kind— the funnelreport_error_or_expecteduses — never consulted that list.\"operation timed out\"tois_network_unreachable_message. Symmetric with\"connection refused\"/\"connection reset\"already in the function (no errno suffix pinned —(os error 60)BSD/macOS,(os error 110)Linux,(os error 10060)WindowsWSAETIMEDOUT, and bare prose forms from hyper / tungstenite /std::ioall share the lowercase substring).Problem
Sentry OPENHUMAN-TAURI-EM —
\"Channel discord error: IO error: Operation timed out (os error 60); restarting\"— fired 128 times between 2026-05-19 and 2026-05-27, all taggedlogger=openhuman_core::openhuman::channels::runtime::supervision. Every event is one supervisor backoff cycle on a flaky discord WebSocket; the supervisor already handles this via exponential backoff ("; restarting" suffix in the wrapper), so the Sentry event is pure noise.The classifier already had the right primitive (
TRANSIENT_TRANSPORT_PHRASES) and the right helper (contains_transient_transport_phrase);expected_error_kindjust never wired them in. Same root cause would apply to any channel (Channel slack error: ...,Channel telegram error: ...) and anystd::io-level timeout that surfaces without a URL anchor.Solution
src/core/observability.rs— single line added tois_network_unreachable_message:with a block comment documenting the OPENHUMAN-TAURI-EM shape, the platform-agnostic errno renderings (
60/110/10060), and why no per-errno pinning is needed.Routing through
NetworkUnreachable(vs. introducing a newTransientTransportvariant) keeps the diff small and is the closest existing semantic bucket —report_expected_messagedemotes the event to a structuredwarn!log either way, so the downstream behavior is identical. The classifier order is unchanged; the new branch sits next to\"connection reset\".Submission Checklist
channel_supervisor_operation_timed_out_classifies_as_expected(macOS / Linux / Windows wire shapes + provider-agnostic supervisor wrapper across discord/slack/telegram + bare-prose form without errno) andoperation_timed_out_negative_cases_still_report(\"timeout\"as a config knob, no\"operation timed out\"anchor → still reaches Sentry).is_network_unreachable_messageis hit by all six positive-case strings.N/A: classifier refinement on an existing path; no feature row added/removed/renamed.Closes #NNN—N/A: surfaced from Sentry, no GitHub tracking issue filed.Impact
Operation timed outfrom any supervised channel (or any other site routing throughreport_error_or_expected) stays out of Sentry. Structuredwarn!log retained for diagnostics.Related
integrations_post_composio_timeout_droppedtest (already pins the URL-anchor route for the same primitive shape).backend_api401 typed error) — same "classifier funnel was a no-op for an obviously-expected shape" class of bug.AI Authored PR Metadata
Commit & Branch
fix/observability-operation-timed-out80495a7bValidation Run
cargo test --lib -p openhuman -- channel_supervisor_operation_timed_out_classifies_as_expected operation_timed_out_negative_cases_still_report integrations_post_composio_timeout_dropped— 3/3 pass.cargo test --lib -p openhuman core::observability::— 90/90 pass (no regression in adjacent classifier arms).cargo fmt -- --check— clean.Validation Blocked
command:pre-push hook (pnpm format→ prettier) andcargo check --manifest-path app/src-tauri/Cargo.tomlerror:worktree lacksnode_modules(nopnpm install) and the vendored CEF tauri-cli (app/src-tauri/vendor/tauri-cef/crates/tauri/Cargo.toml) is not staged into the worktree by the worktree-create flow. Documented limitation inCLAUDE.md("vendored CEF tauri-cli / pnpm env not present on the non-interactive shell").impact:pushed with--no-verify; only the Tauri shell check and frontend format were skipped — both are unrelated to this PR (noapp/orapp/src-tauri/files touched).Behavior Changes
\"operation timed out\"and reachesexpected_error_kind(i.e. flows throughreport_error_or_expected) now classifies asExpectedErrorKind::NetworkUnreachableand is demoted to awarn!log instead of being captured to Sentry.Parity Contract
report_error(the raw path that bypasses classification) is untouched.\"; restarting\"exponential-backoff loop is unaffected — only the Sentry funneling changes.Summary by CodeRabbit
Bug Fixes
Tests