Skip to content

LogDev Recovery CRC Mismatch #869

@Besroy

Description

@Besroy

The issue happen in my local build SM 0.5.5 image, lets put here to see if it need more discussion or not.

Environment

Parameter Value
Cluster 908
Namespace nuobject2sh-dev
Deployment sm-long-running1-1015
HomeObjVer homeobject/4.1.0@oss/v4
HomeStoreVer homestore/7.4.4@oss/master

Test Scenario

  1. Start write 1 million blobs in 2 PGs
    • blob_size: 150K
    • replica_count: 3
    • write:delete:read ratio: 87:13
  2. Randomly select one member to kill every 3 minutes
  3. After SM1 was killed and restarted, it crashed during recovery

Problem

LogDev recovery crashes when reading stale data from a previous journal cycle that coincidentally has matching prev_grp_crc.

[03/07/26 18:23:11.242] [storage_mgr] [error] [11] [log_stream.cpp:140:next_group] crc doesn't match 268939264 log_dev=1
[03/07/26 18:23:11.242] [storage_mgr] [critical] [11] data is corrupted 1
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] [stacktrace.cpp:133:crash_handler]
 * ****Received fatal SIGNAL : SIGABRT(6)	PID : 11
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] Thread num: 123956172627968 entered exit handler

Crash Scenario

Before Crash

  • NuRaft lsn=65651 fully flushed → LogDev log_idx=65650 was written
  • NuRaft lsn=65652 proposing → LogDev log_idx=65651 NOT written yet
  • SIGKILL → offset 268939264 still contains old data from previous journal round

After Restart (Recovery)

log_dev=1 - Crashes:

By adding more logging and restarting again, the detailed log is:

[03/10/26 03:44:51.544] [error] CRC mismatch: computed=3101365257 expected=511191246
    offset=268939264 is_at_last_chunk=true start_log_idx=65651
    prev_grp_crc=4037465583 m_prev_crc=4037465583 log_dev=1
[03/10/26 03:44:51.545] [critical] data is corrupted 1

Recovery reads:

  1. ✓ log_idx=65650 successfully → sets m_prev_crc=4037465583
  2. ✗ Reads offset 268939264 (old stale data):
    • Header shows prev_grp_crc=4037465583 → matches m_prev_crc ✓
    • Passes prev_crc check, footer check
    • CRC verification fails → CRASH

Root Cause

This appears to be a coincidence issue where the data PVC is reused across multiple rounds of storage hammer testing. After discussion with AI, no additional evidence was found to confirm this is a code bug.

The coincidence: Old data's prev_grp_crc happens to match the newly written group's cur_grp_crc (both 4037465583).

SH test Round N-1 (long ago):
  offset 268939264 written with log_idx=M, prev_crc=4037465583

SH test Round N (now):
  offset 268935168: log_idx=65650, cur_crc=4037465583  ← just written
  offset 268939264: NOT written yet → still has old data from Round N-1
    → Old header: prev_crc=4037465583 (coincidentally matches current!)
    → But data content is stale → CRC mismatch

GDB Evidence

(gdb) p (homestore::log_group_header)(*0x6006709aba00)
magic = 15731998          ✓ Valid
logdev_id = 1             ✓ Matches
start_log_idx = 65651     ✓ Looks correct
prev_grp_crc = 4037465583 ✓ Matches m_prev_crc (coincidence!)
cur_grp_crc = 511191246   ✗ But data CRC doesn't match

(gdb) p (homestore::log_group_footer)(*105580775258883)
magic = 11537694          ✓ Valid
start_log_idx = 65651     ✓ Consistent with header

(gdb) p (homestore::serialized_log_record)(*105580775258672)
$9 = {size = 111, offset = 148, is_inlined = 1, store_seq_num = 65651, store_id = 0}

Code Bug Discussion

File: src/lib/logstore/log_stream.cpp, line 138-148

CRC mismatch handling is inconsistent with other validation checks. What is the background for this difference? Can we add last_chunk tolerance here as well?

// prev_crc check (line 87-100) - has last_chunk tolerance ✓
if (m_prev_crc != 0 && m_prev_crc != header->prev_grp_crc) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf;  // Last chunk → graceful end
}

// Footer check (line 118-129) - has last_chunk tolerance ✓
if (footer->magic != LOG_GROUP_FOOTER_MAGIC || ...) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf;  // Last chunk → graceful end
}

// CRC check (line 138-148) - NO last_chunk tolerance ✗
if (cur_crc != header->cur_grp_crc) {
    LOGERROR("CRC mismatch: ... is_at_last_chunk={}", is_at_last_chunk, ...);
    HS_REL_ASSERT(0, "data is corrupted");  // Always crashes!
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions