Clean up stale resources when BR destroys PG #394

@Besroy

SH tests found 2 issues. When a member is resynced by BR, all PG resources (blob allocation, logs, status, ...) are expected to be reset to a fresh state. However, pg_destroy misses some resources, and the stale state left behind can cause corrupted blobs or system crashes. We need to:

  1. Clean up all currently identified stale resources;
  2. Discuss how to ensure resources are reliably cleaned up when BR destroys a PG, since more resources may be added in the future.
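
As a starting point for item 2, one possible approach is a per-PG registry where every resource registers its own reset hook, so that the destroy path runs all of them instead of listing resources by hand. This is only a sketch: `PgCleanupRegistry`, `register_resource` and `destroy_all` are hypothetical names and do not exist in the codebase.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical registry (not an existing class): each per-PG resource
// registers a reset hook when it is created.
class PgCleanupRegistry {
public:
    void register_resource(std::string name, std::function<void()> reset) {
        hooks_.emplace_back(std::move(name), std::move(reset));
    }

    // Runs every hook in reverse registration order and returns the names
    // of the resources that were reset, e.g. for logging or tests.
    std::vector<std::string> destroy_all() {
        std::vector<std::string> done;
        for (auto it = hooks_.rbegin(); it != hooks_.rend(); ++it) {
            it->second();
            done.push_back(it->first);
        }
        return done;
    }

    std::size_t size() const { return hooks_.size(); }

private:
    std::vector<std::pair<std::string, std::function<void()>>> hooks_;
};
```

With this shape, a newly added resource that forgets to register is still missed, so it would likely need to be paired with a test that compares PG state after BR destroy against a freshly created PG.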

Two resources are currently known to need cleanup during BR PG destroy, one per issue:

  1. Stale rreqs (SH issue#91)
    BR resets the chunk block index, but the associated rreqs in m_repl_key_req_map are not cleared. After BR completes, when the log is appended from Raft, the stale in-memory rreq is reused, leading to an incorrect block.
  2. Stale no_space_left_error_info (SH issue#95)
    EGC handle_no_space_left waits for commits up to LSN1; BR then occurs and advances commit_lsn to LSN2 (LSN2 > LSN1). Because the stale no_space_left_error_info is not reset and its LSN is less than commit_lsn, the assert in notify_committed_lsn (called by flush_durable_commit_lsn) fires.
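
The two cases above can be modeled with a small standalone sketch. Everything here is a simplified stand-in, not the real code: `PgState`, `Rreq`, `NoSpaceLeftErrorInfo`, `reset_stale_pg_state`, `baseline_resync` and `commit_invariant_holds` are hypothetical names, and the invariant is only an assumed approximation of the assert in notify_committed_lsn.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <unordered_map>

// Hypothetical stand-ins for the real replication-layer types.
struct Rreq {
    uint64_t blk_index;  // block chosen when the request was first created
};

struct NoSpaceLeftErrorInfo {
    int64_t wait_commit_lsn;  // LSN the handler is waiting on (LSN1 above)
};

struct PgState {
    // Models m_repl_key_req_map: key -> in-memory rreq.
    std::unordered_map<uint64_t, std::shared_ptr<Rreq>> repl_key_req_map;
    std::optional<NoSpaceLeftErrorInfo> no_space_left_error_info;
    int64_t commit_lsn = 0;
};

// Assumed invariant: pending error info must not be behind commit_lsn.
inline bool commit_invariant_holds(const PgState& pg) {
    return !pg.no_space_left_error_info ||
           pg.no_space_left_error_info->wait_commit_lsn >= pg.commit_lsn;
}

// Drops the in-memory state that currently survives pg_destroy.
inline void reset_stale_pg_state(PgState& pg) {
    pg.repl_key_req_map.clear();
    pg.no_space_left_error_info.reset();
}

// Simulates BR: optionally clean up, then advance commit_lsn (LSN2 above).
inline void baseline_resync(PgState& pg, int64_t snapshot_lsn, bool cleanup) {
    if (cleanup) reset_stale_pg_state(pg);
    pg.commit_lsn = snapshot_lsn;
}
```

Calling `baseline_resync` with `cleanup = false` reproduces both symptoms in the model: the old rreq stays in the map and the invariant no longer holds. With `cleanup = true`, the map is empty and the invariant holds after commit_lsn advances.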
