chore: avoid hammering store at ntx builder startup#2064
chore: avoid hammering store at ntx builder startup#2064SantiagoPittella wants to merge 5 commits into
Conversation
| /// Bounds the burst of `GetNetworkAccountDetailsById` and `GetUnconsumedNetworkNotes` calls the | ||
| /// ntx-builder fires when catching up after a restart. | ||
| const DEFAULT_STORE_HYDRATION_CONCURRENCY: NonZeroUsize = | ||
| NonZeroUsize::new(4).expect("literal is non-zero"); |
There was a problem hiding this comment.
Do we know how slow things could be because of this?
There was a problem hiding this comment.
Not really, because my local DB isn't big enough to actually measure this. It would be really nice to have a copy of the DB in a server with similar specs to the devnet/testnet ones to compare it. Or at least running it locally with the testnet database.
Do you know if I can get access to it? maybe @Mirko-von-Leipzig ?
There was a problem hiding this comment.
I do have a 0.13 testnet database snapshot. I can upload it somewhere for you if that helps (but since it's for 0.13 I'm not sure it does...)
There was a problem hiding this comment.
yeah, probably wont be of much use
sergerad
left a comment
There was a problem hiding this comment.
LGTM but I think we need to make DEFAULT_STORE_HYDRATION_CONCURRENCY configurable even if we do test performance locally. Feels like the type of thing that might need changing in the live environment.
| -- Store-sync checkpoint: the highest block whose state has been fully ingested into the local | ||
| -- DB by the startup catch-up sync. | ||
| -- | ||
| -- NULL means "never sync'd": treated as a full GENESIS-onward sync on first boot after upgrade. |
There was a problem hiding this comment.
Could we avoid this NULL by reframing this to the next block to sync, rather than the latest sync'd block?
There was a problem hiding this comment.
hmm yes, that should work
| /// ```sql | ||
| /// SELECT account_id FROM accounts WHERE transaction_id IS NULL | ||
| /// ``` |
There was a problem hiding this comment.
This is going to become a very large list one day.
There was a problem hiding this comment.
should we paginate it?
| block_num = excluded.block_num, \ | ||
| block_header = excluded.block_header", |
There was a problem hiding this comment.
Where does excluded come from?
There was a problem hiding this comment.
It basically holds the row that the INSERT tried to add. Probably we can simplify it by doing something like:
INSERT INTO chain_state (id, block_num, block_header)
VALUES (0, ?1, ?2)
ON CONFLICT(id) DO UPDATE SET
block_num = ?1,
block_header = ?2
| // For each account that had an inflight row at startup, do a full hydration | ||
| // (account-details + unconsumed-notes) from the store. The inflight tx may have | ||
| // landed in a block we didn't witnessed, so locally-committed state cannot be | ||
| // trusted for these specific accounts. The set is bounded by `max_concurrent_txs` | ||
| // so this is small and we run it sequentially before opening the main loop. |
There was a problem hiding this comment.
Oh this is neat; I hadn't actually considered this. Nice!
| let kickoff_block = if catch_up_started { | ||
| None | ||
| } else { | ||
| match &event { | ||
| MempoolEvent::BlockCommitted { header, .. } => { | ||
| Some(header.block_num()) | ||
| }, | ||
| _ => None, | ||
| } | ||
| }; |
There was a problem hiding this comment.
We can simplify this by first waiting for the first block committed event outside of this main loop. Buffer the processed items into a Vec until then.
Once found, kickoff the catch up stuff, and then enter this main loop but chain the vec + mempool stream:
let mut events = futures::stream::iter(buffered_events).chain(mempool_events);There was a problem hiding this comment.
I think this works. But I must admit this more complex than I anticipated (not an impl fault; overlaps + concurrency is a pain).
I'm wondering if this is worth it. What are some alternate options?
We could revert to always requiring a full sync before continuing on.. this is a bit wasteful and means we cannot immediately begin producing network txs, but does simplify this to a: resync from last_sync..=tip and only then begin producing txs.. This also simplifies the code quite a bit.
I'm a bit on the fence..
wdyt?
There was a problem hiding this comment.
I mostly agree, if we can get at testnet snapshot we can start checking (though probably not enough) if this complexity is worth having or not
closes #1952
This PR:
unconsumed_notesdelta per known account, dropping the per-accountGetNetworkAccountDetailsByIdcall.StoreClient(defaulted to 4) that bounds concurrent in-flight startup RPCs.store_sync_checkpointcolumn on chain_state (required a new SQL migration), distinct from the mempool-driven block_num.I tried it by:
mainversion of the node.