fix: Harden metrics and event subsystem under stress #720

Open
nikooo777 wants to merge 25 commits into impr/gun from fix/neo-logging

Conversation

@nikooo777 nikooo777 commented Mar 2, 2026

Actions

  • Update production config to use "only-ids": true option in remote stores.

Tests

rebar3 eunit --module dev_node_process still fails on 2 tests, but the same failures happen on edge as well.

Summary

  • Fix stale PID caching in event server causing frozen metrics in long-lived processes
  • Add timeout to cpu_sup:avg5 scrape to prevent indefinite blocking
  • Use selective ETS query for /events endpoint instead of full table scan
  • Remove unnecessary spawn-per-request in metric recording paths
  • Guard all unprotected Prometheus writes that could crash critical server processes
  • Cache Prometheus startup failure with 60s retry backoff to avoid hot-path blocking
  • Bound event server Prometheus wait to ~30s with lazy counter redeclaration
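
The "guard all unprotected Prometheus writes" item above can be sketched as follows. This is an illustrative Erlang sketch, not the PR's actual code; the function and metric names are assumptions.

```erlang
%% Illustrative sketch: wrap a Prometheus write in try-catch so a
%% metrics failure can never crash the calling server process.
inc_prometheus_counter(Name, Labels) ->
    try
        prometheus_counter:inc(Name, Labels, 1)
    catch
        _Class:_Reason ->
            %% Metrics are best-effort; swallow the error and move on.
            ok
    end.
```

The same shape applies to gauge decrements and histogram observations: the metric write is never allowed to propagate an exception into the hot path.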

Commits

  • 3fc9e78 fix: Harden event server against crash and stale PID caching
  • e872f1f fix: Add timeout to cpu_sup:avg5 in metrics collector
  • 6e1a978 fix: Use selective ETS query for event counters
  • da2c207 fix: Remove unnecessary spawn in metric recording paths
  • 1c25e27 fix: Guard unprotected Prometheus writes in hot paths
  • c0b8005 fix: Cache Prometheus startup failure with retry backoff
  • 535c2c6 fix: Bound event server Prometheus wait and handle late startup

@speeddragon speeddragon changed the base branch from neo/edge to impr/gun March 5, 2026 14:12
@speeddragon

hb_store_gateway: remote_hyperbeam_node_ans104_test test is fixed in #736

@speeddragon

Only tests now failing are:

  • hb_client (5 failing)
  • dev_arweave (2 failing)

These fail due to how the upload is done to up.arweave.net.

samcamwilliams and others added 15 commits March 9, 2026 18:26
…st-hook"

This reverts commit de6262b, reversing
changes made to 37c508f.
- Wrap prometheus_counter:inc/3 in try-catch in handle_events/0 to prevent server crash on Prometheus errors
- Add is_process_alive/1 check in find_event_server/0 so long-lived processes recover from a dead cached PID
- Wrap cpu_sup:avg5/0 in spawn_monitor with 2s timeout
- Kill worker on timeout to prevent process leaks
- Replace ets:tab2list with ets:match_object filtering for event rows
- Avoids materializing the entire prometheus_counter_table
- Inline Prometheus ETS writes in hb_http and hb_store_arweave
- Guard observe in hb_http_client record_duration spawn
- Eliminates per-request process overhead under load
- Add try-catch to dec_prometheus_gauge and inc_prometheus_counter
- Guard retry path in inc_prometheus_gauge
- Wrap hb_store_lmdb:name_hit_metrics to prevent LMDB read crashes
- Check ensure_all_started result instead of ignoring it
- Cache failure timestamp in persistent_term, retry after 60s
- Callers degrade gracefully when Prometheus is unavailable
- Cap await_prometheus_started to ~30s to prevent unbounded mailbox growth
- Messages stay in mailbox during wait instead of being consumed and lost
- Wrap declare in try-catch, redeclare lazily on inc failure via ensure_event_counter
- Keep metric writes off the HTTP reply path to isolate request tail latency
- Retain try-catch inside the spawn to guard against metric-path hiccups
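
The bounded cpu_sup:avg5/0 scrape described in the commit notes above can be sketched like this (illustrative only; the real implementation may differ in names and return shape):

```erlang
%% Illustrative sketch: run cpu_sup:avg5/0 in a monitored worker and
%% give up after 2 seconds so a hung scrape cannot block the caller.
cpu_avg5_with_timeout() ->
    {Pid, Ref} = spawn_monitor(fun() -> exit({result, cpu_sup:avg5()}) end),
    receive
        {'DOWN', Ref, process, Pid, {result, Avg}} -> {ok, Avg};
        {'DOWN', Ref, process, Pid, _Reason} -> {error, scrape_failed}
    after 2000 ->
        %% Kill the worker on timeout to prevent process leaks.
        erlang:demonitor(Ref, [flush]),
        exit(Pid, kill),
        {error, timeout}
    end.
```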
Comment on lines +153 to +156
ets:match_object(
prometheus_counter_table,
{{default, <<"event">>, '_', '_'}, '_', '_'}
).

Instead of returning all items from the table, we now return only the event rows.
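
For contrast, a sketch of the two query styles, using the table name and match pattern from the hunk above:

```erlang
%% Full scan (old): materializes every row of the counter table.
all_rows() ->
    ets:tab2list(prometheus_counter_table).

%% Selective query (new): let ETS filter in place, returning only the
%% rows whose key matches the event pattern.
event_rows() ->
    ets:match_object(
        prometheus_counter_table,
        {{default, <<"event">>, '_', '_'}, '_', '_'}
    ).
```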


Does this materially affect the results on ~hyperbuddy@1.0/events/format~hyperbuddy@1.0?

}
end,
WithPrivIP = hb_private:set(NormalBody, <<"ip">>, RealIP, Opts),
WithPrivIP = hb_private:set(WithPeer, <<"ip">>, RealIP, Opts),

WithPeer wasn't being used.

src/hb_http.erl Outdated
Comment on lines +1098 to +1100
% Add host
Host = cowboy_req:host(Req),
WithDevice#{<<"host">> => Host}.

Fixes the dev_name:arns_json_snapshot_test test. The Host header isn't included in the request, so we cannot resolve the name.

%% returns only the name component of the host, if it is present. If no name is
%% present, an empty binary is returned.
name_from_host(Host, no_host) ->
name_from_host(Host, RawNodeHost) when RawNodeHost =:= no_host; RawNodeHost =:= <<"localhost">> ->

With the change of adding <<"host">> to the header, name_from_host is now called with <<"localhost">>, which is parsed into #{ path => <<"localhost">>} by uri_string:parse(RawNodeHost).

With this, we avoid resolving <<"localhost">>.


This isn't going to be resilient, I think? On some platforms don't you get 127.0.0.1 or the IPv6 equivalent?

speeddragon commented Mar 10, 2026

Failed: 2.  Skipped: 0.  Passed: 3000.
dev_node_process: lookup_spawn_test...*failed*
dev_node_process: lookup_execute_test...*failed*

@speeddragon speeddragon marked this pull request as ready for review March 10, 2026 18:44
NodeOpts2 = maps:merge(NodeOpts, #{ bundler_max_items => 3 }),
Node = hb_http_server:start_node(NodeOpts2#{
priv_wallet => hb:wallet(),
priv_wallet => ar_wallet:new(),
@speeddragon speeddragon Mar 10, 2026

Since we now start hb for every single test, we cannot use hb:wallet because the ServerID address will conflict when starting a new node inside the test. By using ar_wallet:new(), a new ServerID will be generated based on the private wallet, and it will not conflict with the existing one.


This is because hb:wallet() only creates a wallet if one hasn't been cached in the persistent_term store?

% There is a race condition where we write to the store and a
% reset happens before the written value can be read.
% Lowering this number can still produce a flaky test.
timer:sleep(100),

In general it is better to use hb_util:until for this type of thing.
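
hb_util:until's exact signature isn't shown in this thread; a generic polling helper of roughly this shape (purely illustrative, not the project's actual API) captures the idea of re-checking a condition instead of relying on one fixed sleep:

```erlang
%% Illustrative polling helper: re-check a condition every 10 ms until
%% it holds or the overall timeout budget is exhausted.
wait_until(Fun, TimeoutMs) when TimeoutMs =< 0 ->
    case Fun() of
        true -> ok;
        false -> {error, timeout}
    end;
wait_until(Fun, TimeoutMs) ->
    case Fun() of
        true -> ok;
        false ->
            timer:sleep(10),
            wait_until(Fun, TimeoutMs - 10)
    end.
```

Compared with a bare timer:sleep(100), this returns as soon as the condition becomes true and fails loudly when it never does.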

hb_http_client:response_status_to_atom(Status),
http_response_to_httpsig(Status, Headers, Body, Opts)
};
outbound_result_to_message(<<"json@1.0">>, Status, Headers, Body, Opts) ->

Why does this need a custom handler? This could just match outbound_result_to_message(Codec, Status, Headers, Body, Opts) and convert to the given codec.
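
A sketch of that suggestion (the hb_message:convert/4 call and codec names here are assumptions for illustration, not the repository's verified API):

```erlang
%% Illustrative sketch of the reviewer's suggestion: one generic clause
%% that builds the httpsig message and converts it to whichever codec
%% the caller requested, instead of a per-codec handler.
outbound_result_to_message(Codec, Status, Headers, Body, Opts) ->
    Msg = http_response_to_httpsig(Status, Headers, Body, Opts),
    {Status, hb_message:convert(Msg, Codec, <<"httpsig@1.0">>, Opts)}.
```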

[prometheus, prometheus_cowboy, prometheus_ranch]
),
ok;
case persistent_term:get(hb_prometheus_start_failed, undefined) of

We definitely don't want this route. I don't think it works and it may also return non-ok, when it is called ensure_. Why not 38ccd088ff4cdd1c79bfeb09fd88fde0bb7e0d6e? (Also, we should make sure 14827ad3ee2a3f53803f795cc6ce250794e1240a or equivalent makes it in from this dead branch).


%% @doc Reading an unsupported signature type transaction should fail
failure_to_process_message_test() ->
%% TODO: Enable when we find a TX that we don't support

Low-key flex 😂
