
Conversation

@karlem (Contributor) commented Oct 28, 2025

Closes #1441 and #1442


Note (High Risk): Introduces a new F3 proof-based parent finality path and changes the on-chain light-client state layout; mistakes could break top-down finality progression or block execution on upgraded networks.

Overview
Adds an end-to-end F3 proof-based top-down finality flow alongside the existing legacy vote-based path. Node startup now chooses between legacy and F3 modes via new ipc.topdown.f3 settings, validates config vs genesis state, initializes a persistent proof cache, and (when enabled) runs a background proof generator service; legacy resolver/voting/polling syncer setup is refactored into a dedicated service/topdown.rs.

Updates the on-chain f3-light-client actor to store only the latest finalized height/instance and a HAMT-backed power table root (with power_be as big-endian bytes), adds monotonicity checks for updates, and materializes the power table on GetState. Genesis-from-parent now fetches the F3 certificate to derive base_epoch and parses parent power as BigInt, and the interpreter gains shared EVM log decoding utilities plus bundle event extraction for top-down messages and validator power changes.

Written by Cursor Bugbot for commit 0e3593c.

@karlem changed the title from "feat: init lifecycle" to "feat: F3 e2e lifecycle" on Oct 29, 2025
@karlem force-pushed the f3-lifecycle branch 2 times, most recently from 91db005 to cbce51c, on November 4, 2025 at 17:20
Base automatically changed from f3-proofs-cache to main on December 18, 2025 at 16:15
@karlem marked this pull request as ready for review on January 16, 2026 at 19:52
@karlem requested a review from a team as a code owner on January 16, 2026 at 19:52
Comment on lines +276 to +278
self.verifier
.verify_proof_bundle_with_tipsets(&proof_bundle, &finalized_tipsets)
.with_context(|| format!("Failed to verify proof for epoch {}", parent_epoch))?;

Reviewer:
Apparently, there's no verification of continuity of top-down event nonces, yet.

@karlem (author) replied:
Yes, there is not. My understanding was that you were suggesting to skip it for now. But I can add it here.

Reviewer:
Maybe also skip the verification of proof bundles for now and tackle both in a separate PR? Or complete it in this PR. Up to you

@karlem (author) replied:
Hmm. I have a strong desire to merge, but I also want to see if the proofs and everything is going to work. I might implement the check tomorrow.

Reviewer:
Is it hard to check the nonces? If it's relatively easy, then let's do it in this PR. Otherwise, it would appear that everything is fully verified whereas it's not quite, and we'd need to make sure we don't forget about that.

@karlem (author) replied:
It is not hard, but the nonce needs to be stored somewhere. That is the annoying bit. But I will do it in this PR.
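
For reference, a minimal sketch of the nonce-continuity check being discussed, assuming the executor keeps track of the last applied top-down nonce (where that value is persisted is exactly the annoying bit; all names here are illustrative, not the actual crate API):

use anyhow::bail;

/// Sketch only: verify that a batch of top-down message nonces continues
/// directly from the last applied nonce, with no gaps or reordering.
fn check_topdown_nonce_continuity(
    last_applied_nonce: u64,
    batch_nonces: &[u64],
) -> anyhow::Result<()> {
    let mut expected = last_applied_nonce + 1;
    for &nonce in batch_nonces {
        if nonce != expected {
            bail!("top-down nonce gap: expected {expected}, got {nonce}");
        }
        expected += 1;
    }
    Ok(())
}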

@cursor (bot) left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.

}
if f3_enabled_in_config && f3_state_in_genesis.is_none() {
bail!("F3 is enabled in config but initial F3 state is missing in genesis");
}

Fresh node with F3 config fails to start

High Severity

The F3 state validation check prevents a fresh node with F3 configuration from starting. start_topdown_if_enabled is called before App::new(), and for a fresh node, query_f3_state_in_genesis returns None because no database state exists yet. The check at line 121-123 then fails with "F3 is enabled in config but initial F3 state is missing in genesis". However, the F3 state is only created when genesis is applied during the InitChain ABCI call, which requires the node to start first. This creates a chicken-and-egg situation that blocks startup.

Comment on lines +29 to +36
if b.len() > 32 {
anyhow::bail!("expected <= 32 bytes, got {}", b.len());
}
if b.len() < 32 {
let mut padded = vec![0u8; 32 - b.len()];
padded.append(&mut b);
b = padded;
}

Reviewer:
Is it even allowed to be not exactly 32 bytes?

padded.append(&mut b);
b = padded;
}
let tail: [u8; 8] = b[24..32].try_into().expect("slice is 8 bytes");

Reviewer:
Maybe we need to check that the higher bits are all zero.
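
For illustration, one way the conversion could reject values that do not fit in a u64 instead of silently truncating; this is a sketch based on the snippet above, not the PR's actual code:

use anyhow::bail;

/// Sketch: interpret up to 32 big-endian bytes as a u64, rejecting over-long
/// inputs as well as values whose high 24 bytes are non-zero.
fn be_bytes_to_u64(b: &[u8]) -> anyhow::Result<u64> {
    if b.len() > 32 {
        bail!("expected <= 32 bytes, got {}", b.len());
    }
    // Left-pad into a fixed 32-byte buffer.
    let mut padded = [0u8; 32];
    padded[32 - b.len()..].copy_from_slice(b);
    // Check that the higher bits are all zero, as suggested above.
    if padded[..24].iter().any(|&byte| byte != 0) {
        bail!("value does not fit into u64");
    }
    let tail: [u8; 8] = padded[24..32].try_into().expect("slice is 8 bytes");
    Ok(u64::from_be_bytes(tail))
}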

Comment on lines +251 to +258
// This path may be hit during catch-up for a node that did not have the local proof cache
// entry during attestation. In that case, wait for the cache to be filled by the proof-service.
let extracted = Self::extract_top_down_effects_retry_cache_miss(
&self.f3_execution_cache_retry,
f3,
&msg,
)
.await?;

Reviewer:
I think this should be infallible.

@karlem (author) replied:
Hmm. Not sure. I think that it should not fail because of cache miss, but it should probably fail if the data are not extractable?

Reviewer:
But that would be fatal, no?

@karlem (author) replied:
Yes, and it is propagated as fatal.

Reviewer:
No, in the current code it only causes apply_message to return an error, which CometBFT treats as an ordinary transaction failure.

Comment on lines +16 to 32
/// - Latest Instance ID: The latest F3 instance that has been committed
/// - Latest Finalized Height: The highest epoch that has been finalized
/// - Power Table: Current validator power table (can change between instances)
///
/// This state is extracted from F3 certificates received from the parent chain
/// and stored by the actor for use in finality proofs.
#[derive(Deserialize_tuple, Serialize_tuple, Debug, Clone, PartialEq, Eq)]
pub struct LightClientState {
/// Current F3 instance ID
pub instance_id: u64,
/// Finalized chain - full list of finalized epochs
/// Matches ECChain from F3 certificates
/// Empty initially at genesis until first update
pub finalized_epochs: Vec<ChainEpoch>,
/// Current power table for this instance
/// Power table can change between instances
pub power_table: Vec<PowerEntry>,
/// Latest F3 instance ID that has been committed
pub latest_instance_id: u64,
/// The latest finalized height
pub latest_finalized_height: ChainEpoch,
/// Root CID of the on-chain power table (HAMT).
///
/// The actual entries are stored in the actor's blockstore and reachable from this root.
pub power_table_root: Cid,
}

Reviewer:
I think the former comment on the F3 instance ID was clearer: in many regards it is the current instance, because the same cert can be used to justify and commit parent chain updates for multiple epochs; one can consider a cert "committed" only when the last epoch it certifies is also "committed". The epoch number (height) then signifies the latest accepted parent chain extension. (In principle, the latest finalized epoch is the last one in the cert.) Maybe we should call those fields instance_id (or current_instance_id) and latest_height (the latest height for which the parent chain updates were applied)?

Reviewer:
BTW, there seems to already be something very much like latest_finalized_height in the gateway contract, see commit_finality.

@karlem (author) replied:
Makes sense.
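
For illustration, the reviewer's suggested shape could look roughly like the following; this is a sketch only, and the field names and import paths are assumptions, not the committed layout:

use cid::Cid;
use fvm_ipld_encoding::tuple::*;
use fvm_shared::clock::ChainEpoch;

/// Sketch of the renamed light-client state discussed above.
#[derive(Deserialize_tuple, Serialize_tuple, Debug, Clone, PartialEq, Eq)]
pub struct LightClientState {
    /// Current F3 instance ID; the same cert may justify and commit parent
    /// chain updates for multiple epochs.
    pub instance_id: u64,
    /// Latest parent height for which the chain updates were applied.
    pub latest_height: ChainEpoch,
    /// Root CID of the on-chain power table (HAMT); entries live in the
    /// actor's blockstore and are reachable from this root.
    pub power_table_root: Cid,
}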

Comment on lines +278 to +291
// Store validator changes in gateway
self.gateway_caller
.store_validator_changes(state, extracted.validator_changes)
.context("failed to store validator changes")?;

// Execute topdown messages
let ret = self
.execute_topdown_msgs(state, extracted.topdown_msgs)
.await
.context("failed to execute top down messages")?;

// Finalize F3 execution only after all effects were applied successfully.
f3.finalize_after_execution(state, msg.height, extracted.instance_id)
.context("failed to finalize F3 execution")?;

Reviewer:
I'm wondering what may cause store_validator_changes or execute_topdown_msgs to fail? Should that happen, finalize_after_execution won't be called, and, IIUC, no further update from the parent chain will ever make it to the subnet chain.

Comment on lines +343 to +349
let service = ProofGeneratorService::new(
proof_config.clone(),
proof_cache.clone(),
&subnet_id,
initial_instance,
fendermint_vm_topdown_proof_service::power_entries_from_actor(&f3_state.power_table),
)

Reviewer:
The proof generator service initializes the F3 client with initial_instance, which will start fetching certificates from initial_instance+1; therefore, we won't generate any proof bundles for initial_instance. If a validator joins late, it may find itself in a situation where an F3 cert is partially committed (some epochs already committed, but some not yet). In fact, this may even happen at genesis because we initialize the F3 client actor with the base epoch number.

@karlem (author) replied:
Wait, but the initial_instance + 1 is by design. With F3 you can't validate the current cert with the current power table; you validate cert N with the power table from N - 1. So really the initial instance should not be called "initial", tbh. It should be N - 1 of where we actually want to start. Otherwise we would need to store the previous power table or something like that. It is pretty complicated... WDYT?
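
To illustrate the offset described here (all names are hypothetical, not the proof service's actual API): the certificate for instance N is validated with the power table from instance N - 1, so bootstrapping happens one instance earlier than the first instance we actually want to prove.

/// Hypothetical sketch of the bootstrap offset discussed above.
struct F3ClientBootstrap {
    /// Instance whose power table is supplied at startup.
    initial_instance: u64,
}

impl F3ClientBootstrap {
    /// To prove `first_needed`, hold the power table of `first_needed - 1`.
    fn for_first_needed(first_needed: u64) -> Self {
        Self {
            initial_instance: first_needed.saturating_sub(1),
        }
    }

    /// First instance for which a certificate is fetched and a proof bundle
    /// produced: `initial_instance + 1`.
    fn first_proved_instance(&self) -> u64 {
        self.initial_instance + 1
    }
}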

Reviewer:
Perhaps we should update the instance number in the F3 actor state only once all epochs certified by that instance are committed. And we should initialize it in the genesis accordingly. Maybe also reconsider what we mark as "committed" in the proof cache, to make the logic consistent.

})?;

// Get base power table for the specified instance
let power_table_response = lotus_client.f3_get_power_table(instance_id).await?;

Reviewer:
Hmm, we should keep in mind that we trust the endpoint here, when generating the genesis block, in that it provides us the correct initial power table for F3.

Reviewer:
Also, when we create a new subnet, we trust that the parent endpoint doesn't provide us an F3 instance ID "from the future".

@karlem (author) replied:
Yes, we do. But there is no other way around it, IMO. We should mention it in the docs.

Reviewer:
I think, in principle, we could derive it upon initialization from the EC chain after 900 epochs (or earlier, using the finality calculator), the same way Filecoin nodes are supposed to, ultimately anchoring trust in the drand beacon.

Successfully merging this pull request may close: F3 topdown: Proof Verification & Completeness Enforcement.