Skip to content

ISSUE-190: fix the issue that leader coredump once follower asks for …#191

Open
jingyichen1223 wants to merge 1 commit into
eBay:masterfrom
jingyichen1223:ISSUE-190
Open

ISSUE-190: fix the issue that leader coredump once follower asks for …#191
jingyichen1223 wants to merge 1 commit into
eBay:masterfrom
jingyichen1223:ISSUE-190

Conversation

@jingyichen1223
Copy link
Copy Markdown
Collaborator

…raftlog that has been truncated

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a Raft leader coredump scenario triggered when a follower responds based on raft logs that have been truncated (or otherwise become inconsistent with the leader’s cached peer indices), and adds a detailed scenario matrix documenting AppendEntries index-handling behaviors.

Changes:

  • Add early-return reset handling in RaftCore::handleAppendEntriesResponse() for (a) follower log rollback vs cached matchIndex, and (b) nextIndex falling at/below the leader’s firstLogIndex after truncation.
  • Improve logging around these abnormal index shapes to aid diagnosis.
  • Add a comprehensive AppendEntries scenario walkthrough document.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/infra/raft/v2/RaftCore.cpp Adds defensive handling for abnormal AE rejection responses to prevent crashes when indices fall into truncated/rolled-back regions.
docs/Gringofts_AE_Scenario_Matrix.md New documentation describing AE index-shape scenarios and expected behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +685 to +687
/// Two special cases out of raft consensus but practical happenings:
/// 1. follower's log is rolled back due to disk crash, so response.last_log_index() is less than peer.mMatchIndex.
/// 2. leader's log is truncated due to disk usage, so response.last_log_index() is less than leader's firstLogIndex
// Suppress bulk data until we find proper peer.mMatchIndex and peer.mNextIndex.
peer.mSuppressBulkData = true;
SPDLOG_WARN("{} reset Follower {}: matchIndex from {} to 0, nextIndex from {} to {} ",
SPDLOG_WARN("Follower {} last_log_index({}) fall behind its matchIndex({})", response.id(),
}

/// there should be a gap between matchIndex and nextIndex
assert(peer.mMatchIndex + 1 < peer.mNextIndex);
- `handleAppendEntriesRequest()`
- `handleAppendEntriesResponse()`

in `core/third_party/Gengar/third_party/Gringofts/src/infra/raft/v2/RaftCore.cpp`.
Comment on lines +49 to +54
This walkthrough cover these scenarios:
- Normal cases
- New leader cases
- Follower disk was lost cases
- Truncate cases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants