feat: F3 e2e lifecycle #1469
Conversation
self.verifier
    .verify_proof_bundle_with_tipsets(&proof_bundle, &finalized_tipsets)
    .with_context(|| format!("Failed to verify proof for epoch {}", parent_epoch))?;
Apparently, there's no verification of continuity of top-down event nonces, yet.
Yes, there is not. My understanding was that you were suggesting to skip it for now, but I can add it here.
Maybe also skip the verification of proof bundles for now and tackle both in a separate PR? Or complete it in this PR. Up to you.
Hmm. I have a strong desire to merge, but I also want to see whether the proofs and everything are going to work. I might implement the check tomorrow.
Is it hard to check the nonces? If it's relatively easy then let's do it in this PR. Otherwise, it would appear that everything is fully verified whereas it's not quite, and we'd need to make sure we don't forget about that.
It is not hard, but the nonce needs to be stored somewhere; that is the annoying bit. But I will do it in this PR.
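For illustration only, a minimal sketch of what such a continuity check could look like (the function and the way the last nonce is persisted are hypothetical, not the code in this PR):

// Illustrative only: verify that the nonces of incoming top-down messages are
// contiguous, continuing from the last nonce committed so far, and return the
// next expected nonce so the caller can persist it alongside the rest of the state.
fn check_nonce_continuity(next_expected: u64, nonces: &[u64]) -> anyhow::Result<u64> {
    let mut expected = next_expected;
    for &nonce in nonces {
        if nonce != expected {
            anyhow::bail!("top-down nonce gap: expected {expected}, got {nonce}");
        }
        expected += 1;
    }
    Ok(expected)
}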
Cursor Bugbot has reviewed your changes and found 1 potential issue.
}
if f3_enabled_in_config && f3_state_in_genesis.is_none() {
    bail!("F3 is enabled in config but initial F3 state is missing in genesis");
}
Fresh node with F3 config fails to start
High Severity
The F3 state validation check prevents a fresh node with F3 configuration from starting. start_topdown_if_enabled is called before App::new(), and for a fresh node, query_f3_state_in_genesis returns None because no database state exists yet. The check at line 121-123 then fails with "F3 is enabled in config but initial F3 state is missing in genesis". However, the F3 state is only created when genesis is applied during the InitChain ABCI call, which requires the node to start first. This creates a chicken-and-egg situation that blocks startup.
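One possible shape of a fix, sketched here only as an illustration (the node_has_existing_state flag is hypothetical; the other names mirror the report above):

// Sketch only: allow a fresh node (no state applied yet) to start even though the
// F3 genesis state cannot be queried before InitChain has run; only enforce the
// config-vs-genesis consistency check once some chain state actually exists.
if f3_enabled_in_config && f3_state_in_genesis.is_none() {
    if node_has_existing_state {
        bail!("F3 is enabled in config but initial F3 state is missing in genesis");
    }
    // Fresh node: genesis has not been applied yet, so there is nothing to
    // validate against; defer the check until after InitChain.
    tracing::info!("no F3 state in store yet; assuming fresh node and deferring the F3 check");
}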
if b.len() > 32 {
    anyhow::bail!("expected <= 32 bytes, got {}", b.len());
}
if b.len() < 32 {
    let mut padded = vec![0u8; 32 - b.len()];
    padded.append(&mut b);
    b = padded;
}
Is it even allowed to be not exactly 32 bytes?
    padded.append(&mut b);
    b = padded;
}
let tail: [u8; 8] = b[24..32].try_into().expect("slice is 8 bytes");
Maybe we need to check that the higher bits are all zero.
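For illustration, a hedged sketch of a stricter conversion that enforces both points raised above, rejecting anything that is not exactly 32 bytes or that does not fit into a u64 (the helper name is hypothetical):

// Hypothetical helper illustrating the stricter checks discussed above.
fn be32_to_u64(b: &[u8]) -> anyhow::Result<u64> {
    // Require exactly 32 bytes rather than padding shorter inputs.
    if b.len() != 32 {
        anyhow::bail!("expected exactly 32 bytes, got {}", b.len());
    }
    // The value must fit into a u64, i.e. the high 24 bytes must all be zero.
    if b[..24].iter().any(|&byte| byte != 0) {
        anyhow::bail!("value does not fit into u64: high bytes are non-zero");
    }
    let tail: [u8; 8] = b[24..32].try_into().expect("slice is 8 bytes");
    Ok(u64::from_be_bytes(tail))
}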
// This path may be hit during catch-up for a node that did not have the local proof cache
// entry during attestation. In that case, wait for the cache to be filled by the proof-service.
let extracted = Self::extract_top_down_effects_retry_cache_miss(
    &self.f3_execution_cache_retry,
    f3,
    &msg,
)
.await?;
I think this should be infallible.
Hmm. Not sure. I think it should not fail because of a cache miss, but it should probably fail if the data are not extractable?
But that would be fatal, no?
Yes, and it is propagated as fatal.
No, in the current code it only causes apply_message to return with an error, which CometBFT treats as an ordinary transaction failure.
/// - Latest Instance ID: The latest F3 instance that has been committed
/// - Latest Finalized Height: The highest epoch that has been finalized
/// - Power Table: Current validator power table (can change between instances)
///
/// This state is extracted from F3 certificates received from the parent chain
/// and stored by the actor for use in finality proofs.
#[derive(Deserialize_tuple, Serialize_tuple, Debug, Clone, PartialEq, Eq)]
pub struct LightClientState {
    /// Current F3 instance ID
    pub instance_id: u64,
    /// Finalized chain - full list of finalized epochs
    /// Matches ECChain from F3 certificates
    /// Empty initially at genesis until first update
    pub finalized_epochs: Vec<ChainEpoch>,
    /// Current power table for this instance
    /// Power table can change between instances
    pub power_table: Vec<PowerEntry>,
    /// Latest F3 instance ID that has been committed
    pub latest_instance_id: u64,
    /// The latest finalized height
    pub latest_finalized_height: ChainEpoch,
    /// Root CID of the on-chain power table (HAMT).
    ///
    /// The actual entries are stored in the actor's blockstore and reachable from this root.
    pub power_table_root: Cid,
}
I think the former comment on the F3 instance ID was clearer: in many regards it is the current instance, because the same cert can be used to justify and commit parent chain updates for multiple epochs; one can consider a cert "committed" only when the last epoch it certifies is also "committed". The epoch number (height) then signifies the latest accepted parent chain extension. (In principle, the latest finalized epoch is the last one in the cert.) Maybe we should call those fields instance_id (or current_instance_id) and latest_height (the latest height for which the parent chain updates were applied)?
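Purely as an illustration of the suggested renaming (not the code in this PR; derives and types as in the excerpt above), the state could end up looking roughly like this:

// Illustrative only: the renaming suggested above.
#[derive(Deserialize_tuple, Serialize_tuple, Debug, Clone, PartialEq, Eq)]
pub struct LightClientState {
    /// The F3 instance whose cert currently justifies parent chain updates
    /// (a single cert may cover, and be committed across, several epochs).
    pub current_instance_id: u64,
    /// The latest parent height for which parent chain updates were applied.
    pub latest_height: ChainEpoch,
    /// Root CID of the on-chain power table (HAMT).
    pub power_table_root: Cid,
}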
BTW, there seems to already be something very much like latest_finalized_height in the gateway contract, see commit_finality.
Makes sense.
// Store validator changes in gateway
self.gateway_caller
    .store_validator_changes(state, extracted.validator_changes)
    .context("failed to store validator changes")?;

// Execute topdown messages
let ret = self
    .execute_topdown_msgs(state, extracted.topdown_msgs)
    .await
    .context("failed to execute top down messages")?;

// Finalize F3 execution only after all effects were applied successfully.
f3.finalize_after_execution(state, msg.height, extracted.instance_id)
    .context("failed to finalize F3 execution")?;
I'm wondering what may cause store_validator_changes or execute_topdown_msgs to fail? Should that happen, finalize_after_execution won't be called, and, IIUC, no further update from the parent chain will ever make it to the subnet chain.
let service = ProofGeneratorService::new(
    proof_config.clone(),
    proof_cache.clone(),
    &subnet_id,
    initial_instance,
    fendermint_vm_topdown_proof_service::power_entries_from_actor(&f3_state.power_table),
)
The proof generator service initializes the F3 client with initial_instance, which will start fetching certificates from initial_instance+1; therefore, we won't generate any proof bundles for initial_instance. If a validator joins late, it may find itself in a situation where an F3 cert is partially committed (some epochs already committed, but some not yet). In fact, this may even happen at genesis because we initialize the F3 client actor with the base epoch number.
Wait, but the initial_instance + 1 is by design. With F3 you can't validate the current cert with the current power table: you validate cert N with the power table from N - 1. Hmm. So really the initial instance should not be called "initial", tbh; it should be N - 1 of where we actually want to start. Otherwise we would need to store the previous power table or something like that. It is pretty complicated... Hmm. WDYT?
Perhaps we should update the instance number in the F3 actor state only once all epochs certified by that instance are committed. And we should initialize it in the genesis accordingly. Maybe also reconsider what we mark as "committed" in the proof cache, to make the logic consistent.
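A tiny sketch of the dependency discussed in this thread, with hypothetical types and function, just to make the N / N-1 relationship explicit:

// Illustrative only: cert N is verified against the power table produced by
// instance N-1, which is why the client has to be initialized one instance
// "behind" the first cert it is expected to verify.
struct Cert { instance: u64 /* signatures, ec_chain, ... */ }
struct PowerTable { instance: u64 /* entries, ... */ }

fn verify_cert(cert: &Cert, table: &PowerTable) -> anyhow::Result<()> {
    if table.instance + 1 != cert.instance {
        anyhow::bail!(
            "cert {} must be verified with power table {}, got table {}",
            cert.instance,
            cert.instance.saturating_sub(1),
            table.instance
        );
    }
    // ... signature checks against `table` would go here ...
    Ok(())
}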
})?;

// Get base power table for the specified instance
let power_table_response = lotus_client.f3_get_power_table(instance_id).await?;
Hmm, we should keep in mind that we trust the endpoint here, when generating the genesis block, in that it provides us the correct initial power table for F3.
Also, when we create a new subnet, we trust that the parent endpoint doesn't provide us an F3 instance ID "from future".
Yes, we do. But there is no other way around it, IMO. We should mention it in the docs.
I think, in principle, we could derive it upon initialization from the EC chain after 900 epochs (or earlier using the finality calculator), the same way Filecoin nodes are supposed to do, ultimately anchoring trust in the drand beacon.
Closes #1441 and #1442
Note
High Risk
Introduces a new F3 proof-based parent finality path and changes on-chain light-client state layout; mistakes could break top-down finality progression or block execution on upgraded networks.
Overview
Adds an end-to-end F3 proof-based top-down finality flow alongside the existing legacy vote-based path. Node startup now chooses between legacy and F3 modes via new ipc.topdown.f3 settings, validates config vs genesis state, initializes a persistent proof cache, and (when enabled) runs a background proof generator service; legacy resolver/voting/polling syncer setup is refactored into a dedicated service/topdown.rs.
Updates the on-chain f3-light-client actor to store only the latest finalized height/instance and a HAMT-backed power table root (with power_be as big-endian bytes), adds monotonicity checks for updates, and materializes the power table on GetState. Genesis-from-parent now fetches the F3 certificate to derive base_epoch and parses parent power as BigInt, and the interpreter gains shared EVM log decoding utilities plus bundle event extraction for top-down messages and validator power changes.
Written by Cursor Bugbot for commit 0e3593c.