fix: Harden metrics and event subsystem under stress #720
Only tests now failing are:
Due to how the upload is done to |
- Wrap prometheus_counter:inc/3 in try-catch in handle_events/0 to prevent server crash on Prometheus errors
- Add is_process_alive/1 check in find_event_server/0 so long-lived processes recover from a dead cached PID

- Wrap cpu_sup:avg5/0 in spawn_monitor with 2s timeout
- Kill worker on timeout to prevent process leaks

- Replace ets:tab2list with ets:match_object filtering for event rows
- Avoids materializing the entire prometheus_counter_table

- Inline Prometheus ETS writes in hb_http and hb_store_arweave
- Guard observe in hb_http_client record_duration spawn
- Eliminates per-request process overhead under load

- Add try-catch to dec_prometheus_gauge and inc_prometheus_counter
- Guard retry path in inc_prometheus_gauge
- Wrap hb_store_lmdb:name_hit_metrics to prevent LMDB read crashes

- Check ensure_all_started result instead of ignoring it
- Cache failure timestamp in persistent_term, retry after 60s
- Callers degrade gracefully when Prometheus is unavailable

- Cap await_prometheus_started to ~30s to prevent unbounded mailbox growth
- Messages stay in mailbox during wait instead of being consumed and lost
- Wrap declare in try-catch, redeclare lazily on inc failure via ensure_event_counter

- Keep metric writes off the HTTP reply path to isolate request tail latency
- Retain try-catch inside the spawn to guard against metric-path hiccups
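The spawn_monitor-with-timeout pattern named in the commit messages above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the module name `timed_call_sketch` and the function `with_timeout/2` are hypothetical.

```erlang
-module(timed_call_sketch).
-export([with_timeout/2]).

%% Hypothetical helper: run Fun in a monitored worker process.
%% Returns {ok, Result}, {error, Reason} if the worker dies, or
%% {error, timeout}. The worker is killed on timeout so it cannot leak.
with_timeout(Fun, TimeoutMs) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MonRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} ->
            %% Result arrived; drop the monitor and any pending 'DOWN'.
            erlang:demonitor(MonRef, [flush]),
            {ok, Result};
        {'DOWN', MonRef, process, Pid, Reason} ->
            %% Worker died before sending a result.
            {error, Reason}
    after TimeoutMs ->
        exit(Pid, kill),
        erlang:demonitor(MonRef, [flush]),
        %% Discard a result that may have arrived just before the kill.
        receive {Ref, _} -> ok after 0 -> ok end,
        {error, timeout}
    end.
```

Under these assumptions, a call such as `timed_call_sketch:with_timeout(fun cpu_sup:avg5/0, 2000)` would bound the CPU probe to two seconds, provided os_mon is running.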
…est-hook" This reverts commit 9f0d114.
ets:match_object(
    prometheus_counter_table,
    {{default, <<"event">>, '_', '_'}, '_', '_'}
).
Instead of returning all items from the list, we only return events.
Does this materially affect the results on ~hyperbuddy@1.0/events/format~hyperbuddy@1.0?
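To make the difference between the two calls concrete, here is a small self-contained sketch on a throwaway table. The row shape mirrors the match pattern in the diff, but the table name, the sample rows, and the meaning of the fields are assumptions for illustration only.

```erlang
-module(match_object_sketch).
-export([demo/0]).

%% Builds a temporary ETS table with two "event" rows and one unrelated
%% row, then compares ets:tab2list/1 with a filtered ets:match_object/2.
demo() ->
    Tab = ets:new(sketch_counters, [set, public]),
    true = ets:insert(Tab, [
        {{default, <<"event">>, <<"http">>, <<"request">>}, 3, 0},
        {{default, <<"event">>, <<"store">>, <<"write">>}, 5, 0},
        {{default, <<"other_metric">>, <<"x">>, <<"y">>}, 9, 0}
    ]),
    %% Copies every row of the table into the calling process.
    All = ets:tab2list(Tab),
    %% Copies only rows whose key has <<"event">> in the second position.
    Events = ets:match_object(
        Tab,
        {{default, <<"event">>, '_', '_'}, '_', '_'}
    ),
    ets:delete(Tab),
    {length(All), lists:sort(Events)}.
```

Because the key is only partially bound, the match still scans the table; the saving is that non-matching rows are never copied onto the caller's heap. Any consumer that previously filtered the full list down to event rows should see the same rows either way.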
    }
end,
- WithPrivIP = hb_private:set(NormalBody, <<"ip">>, RealIP, Opts),
+ WithPrivIP = hb_private:set(WithPeer, <<"ip">>, RealIP, Opts),
WithPeer wasn't being used.
src/hb_http.erl
Outdated
% Add host
Host = cowboy_req:host(Req),
WithDevice#{<<"host">> => Host}.
Fix dev_name:arns_json_snapshot_test test. Host header isn't included in the request, so we cannot resolve the name.
%% returns only the name component of the host, if it is present. If no name is
%% present, an empty binary is returned.
- name_from_host(Host, no_host) ->
+ name_from_host(Host, RawNodeHost) when RawNodeHost =:= no_host; RawNodeHost =:= <<"localhost">> ->
With the change of adding <<"host">> to the header, name_from_host is now called with <<"localhost">>, which is parsed into #{ path => <<"localhost">>} by uri_string:parse(RawNodeHost).
With this, we avoid resolving <<"localhost">>.
This isn't going to be resilient, I think? On some platforms don't you get 127.0.0.1 or the IPv6 equivalent?
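One way the guard could be widened to cover loopback addresses is sketched below. This is only an illustration of the concern raised here, not a proposal taken from the PR: the module and function names are hypothetical, and it assumes the host arrives as a bare binary without a port or brackets.

```erlang
-module(loopback_host_sketch).
-export([is_local_host/1]).

%% Hypothetical predicate: true when the node host carries no resolvable
%% name, i.e. it is absent, "localhost", or a loopback IP literal.
is_local_host(no_host) -> true;
is_local_host(<<"localhost">>) -> true;
is_local_host(Host) when is_binary(Host) ->
    case inet:parse_address(binary_to_list(Host)) of
        {ok, {127, _, _, _}} -> true;             % IPv4 loopback range
        {ok, {0, 0, 0, 0, 0, 0, 0, 1}} -> true;   % IPv6 loopback (::1)
        _ -> false                                % other IPs or hostnames
    end.
```

A guard clause cannot call `inet:parse_address/1`, so a check like this would have to run in the function body or before dispatching to `name_from_host`.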
NodeOpts2 = maps:merge(NodeOpts, #{ bundler_max_items => 3 }),
Node = hb_http_server:start_node(NodeOpts2#{
-     priv_wallet => hb:wallet(),
+     priv_wallet => ar_wallet:new(),
Since we now start hb for every single test, we cannot use hb:wallet() because the ServerID address will conflict when starting a new node inside the test. By using ar_wallet:new(), a new ServerID is generated based on the private wallet, so it will not conflict with the existing one.
This is because hb:wallet() only creates a wallet if one hasn't been cached in the persistent_term store?
% There is a race condition where we write to the store and a
% reset happens, so the written value is not read back.
% Lowering this number can still produce a flaky test.
timer:sleep(100),
In general it is better to use hb_util:until for this type of thing.
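As a rough illustration of polling for a condition instead of sleeping for a fixed interval, consider the sketch below. It does not use `hb_util:until`, whose exact signature is not shown in this thread; `wait_until/3` and its module are hypothetical stand-ins.

```erlang
-module(wait_until_sketch).
-export([wait_until/3]).

%% Hypothetical helper: call Cond() every IntervalMs milliseconds until
%% it returns true (-> ok) or TimeoutMs has elapsed (-> {error, timeout}).
wait_until(Cond, IntervalMs, TimeoutMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    loop(Cond, IntervalMs, Deadline).

loop(Cond, IntervalMs, Deadline) ->
    case Cond() of
        true ->
            ok;
        _ ->
            case erlang:monotonic_time(millisecond) >= Deadline of
                true ->
                    {error, timeout};
                false ->
                    timer:sleep(IntervalMs),
                    loop(Cond, IntervalMs, Deadline)
            end
    end.
```

In a test, the condition would be a fun that reads the written value back from the store; the test then proceeds as soon as the value is visible and fails with a clear timeout otherwise, rather than depending on a hand-tuned sleep.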
    hb_http_client:response_status_to_atom(Status),
    http_response_to_httpsig(Status, Headers, Body, Opts)
};
outbound_result_to_message(<<"json@1.0">>, Status, Headers, Body, Opts) ->
Why does this need a custom handler? This could just match outbound_result_to_message(Codec, Status, Headers, Body, Opts) and convert to the given codec.
    [prometheus, prometheus_cowboy, prometheus_ranch]
),
ok;
case persistent_term:get(hb_prometheus_start_failed, undefined) of
We definitely don't want this route. I don't think it works and it may also return non-ok, when it is called ensure_. Why not 38ccd088ff4cdd1c79bfeb09fd88fde0bb7e0d6e? (Also, we should make sure 14827ad3ee2a3f53803f795cc6ce250794e1240a or equivalent makes it in from this dead branch).
%% @doc Reading an unsupported signature type transaction should fail
failure_to_process_message_test() ->
    %% TODO: Enable when we find a TX that we don't support
Actions
"only-ids": true option in remote stores.
Tests
rebar3 eunit --module dev_node_process still fails on 2 tests, but this is happening in edge as well.
Summary
Commits