Skip to content

Conversation

@poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jan 14, 2026

Motivation

Background

  • Ledger states management
    • In BK server, ledgers are managed by HandleFactoryImpl, which maintains the ledger state.
    • When fencing entries, BK change the state of the Ledger Handle to fenced
    • When adding entries, BK checks fencing state first, then apply writing.
  • Open ledger with recovery
    • BK fence ledger first, then read LAC to ensure no entries missed
  • Threads design
    • Opening ledger with recovery works on the thread BookieHighPriorityThread
    • Adding entries works on the thread BookieWriteThreadPool

To avoid the following multi-threading competition issue,BK uses a lock[1]

(BookieWriteThreadPool)Add entry (BookieHighPriorityThread)Opening ledger with recovery
get LedgerDescriptor from HandleFactoryImpl
check fenced status: non-fenced
get LedgerDescriptor from HandleFactoryImpl
fence ledger
read Lac from write cache
add entry into write cache
add entry into Journal Disk
success
Lac is 1 Lac is 0
public void addEntry(ByteBuf entry, boolean ackBeforeSync, WriteCallback cb, Object ctx, byte[] masterKey)
            throws IOException, BookieException, InterruptedException {
        try {
            LedgerDescriptor handle = getLedgerForEntry(entry, masterKey);
            synchronized (handle) {
                if (handle.isFenced()) {
                    throw BookieException
                            .create(BookieException.Code.LedgerFencedException);
                }
                addEntryInternal(handle, entry, ackBeforeSync, cb, ctx, masterKey);
            }
        } 
}
synchronized CompletableFuture<Boolean> fenceAndLogInJournal(Journal journal) throws IOException {
    boolean success = this.setFenced();
    return logFenceEntryInJournal(journal);
}

Issue: there is a race condition breaks the lock above

https://github.com/apache/bookkeeper/blob/release-4.17.2/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/HandleFactoryImpl.java#L54-L68

    public LedgerDescriptor getHandle(final long ledgerId, final byte[] masterKey, boolean journalReplay)
            throws IOException, BookieException {
        LedgerDescriptor handle = ledgers.get(ledgerId);

        if (handle == null) {
            handle = LedgerDescriptor.create(masterKey, ledgerId, ledgerStorage);
            ledgers.putIfAbsent(ledgerId, handle);
        }

        handle.checkAccess(masterKey);
        return handle;
    }
(BookieWriteThreadPool)Add entry (BookieHighPriorityThread)Opening ledger with recovery
get LedgerDescriptor from HandleFactoryImpl get LedgerDescriptor from HandleFactoryImpl
check if created: false check if created: false
creates a new one creates a new one

Changes

Fix the bug

@zymap zymap requested a review from Copilot January 15, 2026 09:20
}
}
handle = LedgerDescriptor.create(masterKey, ledgerId, ledgerStorage);
ledgers.putIfAbsent(ledgerId, handle);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can avoid the synchronization by handling the result of the putIfAbsent? I understand the main issue is that here we create two LedgerDescriptor, then used by different threads. But the putIfAbsent will reduce one. So we can handle the result to ensure the same handle is returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved, please review again

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in HandleFactoryImpl.getHandle() that could lead to entry loss when multiple threads simultaneously create a LedgerDescriptor for the same ledger. The issue occurs when both the write thread and the recovery thread attempt to create a new descriptor concurrently, potentially breaking the synchronization lock that prevents fencing races.

Changes:

  • Implemented double-checked locking pattern in getHandle() to ensure only one LedgerDescriptor is created per ledger ID
  • Synchronized on the ledgers map during descriptor creation to prevent race conditions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

throw BookieException.create(BookieException.Code.LedgerFencedAndDeletedException);
}
LedgerDescriptor handle = LedgerDescriptor.create(masterKey, ledgerId, ledgerStorage);
ledgers.putIfAbsent(ledgerId, handle);
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The putIfAbsent on line 63 is unnecessary and potentially wasteful. Since we're already inside a synchronized block with a double-check, the key is guaranteed not to exist. Use put() instead of putIfAbsent() to avoid the redundant check and potential creation of unused LedgerDescriptor instances.

Suggested change
ledgers.putIfAbsent(ledgerId, handle);
ledgers.put(ledgerId, handle);

Copilot uses AI. Check for mistakes.
@poorbarcode poorbarcode requested a review from zymap January 16, 2026 03:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants