Skip to content

Conversation

@oneby-wang
Copy link
Contributor

Motivation

Fix ManagedLedgerTest.testLockReleaseWhenTrimLedger() flaky test.

This test has the following problems:

  1. The test doesn't wait for new ledger created, so assertEquals(ledger.ledgers.size() - 1, entries) or assertEquals(ledger.ledgerCache.size() - 1, entries - 1) may be flaky, see code comments.
java.lang.AssertionError:
Expected :10
Actual   :9
<Click to see difference>
	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1577)
	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
	at org.testng.Assert.assertEquals(Assert.java:131)
	at org.testng.Assert.assertEquals(Assert.java:1418)
	at org.testng.Assert.assertEquals(Assert.java:1382)
	at org.testng.Assert.assertEquals(Assert.java:1428)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerTest.testLockReleaseWhenTrimLedger(ManagedLedgerTest.java:3982)
  1. The test spies an instance that contains volatile fields, which also cause flakiness, see PR comments below.

Modifications

  1. Waiting for new ledger created to avoid flaky test.
  2. Use overriding instead of spying to avoid flaky test.

Verifying this change

  • Make sure that the change passes the CI checks.

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: oneby-wang#24

@oneby-wang
Copy link
Contributor Author

I occasionally found ManagedLedgerTest.testLockReleaseWhenTrimLedger() test was flaky. Here is a full log when the test throws ManagedLedgerException.

2026-01-13T11:07:22,410 - INFO  - [main:ManagedLedgerImpl@418] - Opening managed ledger testLockReleaseWhenTrimLedger
2026-01-13T11:07:22,410 - INFO  - [bookkeeper-ml-scheduler-OrderedScheduler-7-0:MetaStoreImpl@113] - Creating '/managed-ledgers/testLockReleaseWhenTrimLedger'
2026-01-13T11:07:22,415 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 3
2026-01-13T11:07:22,416 - INFO  - [test-OrderedScheduler-1-0:ManagedLedgerImpl@614] - [testLockReleaseWhenTrimLedger] Created ledger 3 after closed null
2026-01-13T11:07:22,416 - INFO  - [bookkeeper-ml-scheduler-OrderedScheduler-7-0:ManagedLedgerFactoryImpl$2@482] - [testLockReleaseWhenTrimLedger] Successfully initialize managed ledger
2026-01-13T11:07:22,419 - INFO  - [main:ManagedCursorImpl@831] - [testLockReleaseWhenTrimLedger] Cursor test-cursor8784046c-0e28-4f04-8c8b-315231fb9b57 recovered to position 3:-1
2026-01-13T11:07:22,426 - INFO  - [bookkeeper-ml-scheduler-OrderedScheduler-7-0:ManagedLedgerImpl$6@1049] - [testLockReleaseWhenTrimLedger] Opened new cursor: ManagedCursorImpl{ledger=testLockReleaseWhenTrimLedger, name=test-cursor8784046c-0e28-4f04-8c8b-315231fb9b57, ackPos=3:-1, readPos=3:0}
2026-01-13T11:07:22,780 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 3 for being full
2026-01-13T11:07:22,808 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 3
2026-01-13T11:07:22,810 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 4
2026-01-13T11:07:22,811 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 4
2026-01-13T11:07:22,842 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 4 for being full
2026-01-13T11:07:22,844 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 4
2026-01-13T11:07:22,845 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 5
2026-01-13T11:07:22,845 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 5
2026-01-13T11:07:23,159 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 5 for being full
2026-01-13T11:07:23,161 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 5
2026-01-13T11:07:23,161 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 6
2026-01-13T11:07:23,162 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 6
2026-01-13T11:07:23,170 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 6 for being full
2026-01-13T11:07:23,171 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 6
2026-01-13T11:07:23,172 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 7
2026-01-13T11:07:23,173 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 7
2026-01-13T11:07:23,249 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 7 for being full
2026-01-13T11:07:23,250 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 7
2026-01-13T11:07:23,251 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 8
2026-01-13T11:07:23,252 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 8
2026-01-13T11:07:23,254 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 8 for being full
2026-01-13T11:07:23,255 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 8
2026-01-13T11:07:23,256 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 9
2026-01-13T11:07:23,257 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 9
2026-01-13T11:07:23,259 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 9 for being full
2026-01-13T11:07:23,260 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 9
2026-01-13T11:07:23,261 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 10
2026-01-13T11:07:23,262 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 10
2026-01-13T11:07:23,265 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 10 for being full
2026-01-13T11:07:23,266 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 10
2026-01-13T11:07:23,272 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 11
2026-01-13T11:07:23,273 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 11
2026-01-13T11:07:23,556 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 11 for being full
2026-01-13T11:07:23,565 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 11
2026-01-13T11:07:23,566 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 12
2026-01-13T11:07:23,567 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 12
2026-01-13T11:07:23,650 - INFO  - [test-OrderedScheduler-1-0:OpAddEntry@284] - [testLockReleaseWhenTrimLedger] Closing ledger 12 for being full
2026-01-13T11:07:23,651 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1976] - [testLockReleaseWhenTrimLedger] Creating a new ledger after closed 12
2026-01-13T11:07:23,748 - INFO  - [test-OrderedScheduler-0-0:PulsarMockBookKeeper@138] - Creating ledger 13
2026-01-13T11:07:23,801 - INFO  - [test-OrderedScheduler-0-0:ManagedLedgerImpl@1745] - [testLockReleaseWhenTrimLedger] Created new ledger 13
2026-01-13T11:07:24,782 - WARN  - [test-OrderedScheduler-1-0:OpReadEntry@168] - [testLockReleaseWhenTrimLedger][test-cursor8784046c-0e28-4f04-8c8b-315231fb9b57] read failed from ledger at position:3:0
org.apache.bookkeeper.mledger.ManagedLedgerException: LastConfirmedEntry is 3:-1 when reading entry 0

org.apache.bookkeeper.mledger.ManagedLedgerException: LastConfirmedEntry is 3:-1 when reading entry 0

I was every confused since ledger.addEntry(String.valueOf(i).getBytes(Encoding)) is an sync method. The lastConfirmedEntry should not be 3:-1(which is the initial value) when executing List<Entry> entryList = cursor.readEntries(entries).

final var lastConfirmedEntry = ml.getLastConfirmedEntry();
if (lastConfirmedEntry == null) {
return CompletableFuture.failedFuture(new ManagedLedgerException(
"LastConfirmedEntry is null when reading ledger " + handle.getId()));
}
if (handle.getId() > lastConfirmedEntry.getLedgerId()) {
return CompletableFuture.failedFuture(new ManagedLedgerException("LastConfirmedEntry is "
+ lastConfirmedEntry + " when reading ledger " + handle.getId()));
}
if (handle.getId() == lastConfirmedEntry.getLedgerId() && lastEntry > lastConfirmedEntry.getEntryId()) {
return CompletableFuture.failedFuture(new ManagedLedgerException("LastConfirmedEntry is "
+ lastConfirmedEntry + " when reading entry " + lastEntry));
}

The entryCache is initialized using ManagedLedgerImpl.this.

public ManagedLedgerImpl(ManagedLedgerFactoryImpl factory, BookKeeper bookKeeper, MetaStore store,
ManagedLedgerConfig config, OrderedScheduler scheduledExecutor,
final String name, final Supplier<CompletableFuture<Boolean>> mlOwnershipChecker) {
if (config.isCacheEvictionByExpectedReadCount()) {
activeCursors = new ActiveManagedCursorContainerImpl(
config.getContinueCachingAddedEntriesAfterLastActiveCursorLeavesMillis());
} else {
activeCursors = new ManagedCursorContainerImpl();
}
this.factory = factory;
this.bookKeeper = bookKeeper;
this.config = config;
this.store = store;
this.name = name;
this.ledgerMetadata = LedgerMetadataUtils.buildBaseManagedLedgerMetadata(name);
this.digestType = BookKeeper.DigestType.fromApiDigestType(config.getDigestType());
this.scheduledExecutor = scheduledExecutor;
this.executor = bookKeeper.getMainWorkerPool().chooseThread(name);
TOTAL_SIZE_UPDATER.set(this, 0);
NUMBER_OF_ENTRIES_UPDATER.set(this, 0);
ENTRIES_ADDED_COUNTER_UPDATER.set(this, 0);
STATE_UPDATER.set(this, State.None);
this.ledgersStat = null;
this.mbean = new ManagedLedgerMBeanImpl(this);
if (config.getManagedLedgerInterceptor() != null) {
this.managedLedgerInterceptor = config.getManagedLedgerInterceptor();
}
this.entryCache = factory.getEntryCacheManager().getEntryCache(this);

The only inconsistent cause in code is here, we returns old currentEntryCache and ignore new ml in newEntryCache.

public EntryCache getEntryCache(ManagedLedger ml) {
if (maxSize == 0) {
// Cache is disabled
return new EntryCacheDisabled((ManagedLedgerImpl) ml);
}
EntryCache newEntryCache =
new RangeEntryCacheImpl(this, (ManagedLedgerImpl) ml, mlFactory.getConfig().isCopyEntriesInCache(),
rangeCacheRemovalQueue, entryLengthFunction);
EntryCache currentEntryCache = caches.putIfAbsent(ml.getName(), newEntryCache);
if (currentEntryCache != null) {
return currentEntryCache;
} else {
return newEntryCache;
}
}

I searched all the call chains to this method, and came out to a reason: there should be exactly one invocation of the RangeEntryCacheManagerImpl.getEntryCache() method. So I add a warn log in if (currentEntryCache != null) block(see PR code change), and found no warn log printed when ManagedLedgerException throwed.

Finally, I found that the root cause of the issue was the interaction between the spy and volatile fields. I'll explain in the following comments.

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 15, 2026
@oneby-wang
Copy link
Contributor Author

I am not familiar with mockito or bytecode manipulation technology, but after some deeper debug on this test, I'd suggest that we should avoid spying an instance that contains volatile fields in tests, especially when these volatile fields are not accessed using setter and getter methods.

Here is the debug details:

All fields of originalLedger and ledger are identical after initialization.

image

lastConfirmedEntry is also identical.

image

Except entryCahce.ml field:

  • entryCahce.ml is originalLedger in originalLedger.
  • entryCahce.ml is originalLedger in ledger.

This is the inconsistency root cause.

image

After adding 10 entries, lastConfirmedEntry in ledger is updated to 12:0, but in originalLedger, lastConfirmedEntry is still 3:-1, which is not updated.

image

After adding 10 entries, currentLedger in originalLedger is also not updated.

image

Here comes the ManagedLedgerException.

image

@oneby-wang
Copy link
Contributor Author

The flaky test was very hard to reproduce, but I occasionally found a way to easily reproduce it by adding some breakpoints.

I don't know the reason, maybe breakpoints in IntelliJ IDEA cause further de-synchronization between the volatile fields of the spy object and the original object, and it seems that breakpoints added after ledger.addEntry(String.valueOf(i).getBytes(Encoding)) method are more likely to cause flakiness.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant