Fix JVM <clinit> deadlock in Cosmos SDK accessor initialization#48667
Fix JVM <clinit> deadlock in Cosmos SDK accessor initialization#48667jeet1995 wants to merge 12 commits intoAzure:mainfrom
Conversation
Replace broad initializeAllAccessors() fallback calls in each getXxxAccessor() method with targeted Class.forName() that loads only the specific class registering that accessor. This eliminates the circular class initialization dependency chain that caused permanent JVM-level deadlocks when multiple threads concurrently triggered Cosmos SDK class loading for the first time. Root cause: When any getXxxAccessor() found its accessor unset, it called initializeAllAccessors() which eagerly loaded 40+ classes via ModelBridgeInternal, UtilBridgeInternal, and BridgeInternal. If two threads entered different <clinit> methods simultaneously, the JVM class initialization monitors created an AB/BA deadlock that was permanent and unrecoverable. Fix: Each of the 35 getXxxAccessor() methods now calls ensureClassInitialized() with just the fully-qualified name of the class that registers that specific accessor. This dramatically narrows the class loading scope per accessor call, making circular dependencies practically impossible. The public initializeAllAccessors() method is preserved for explicit eager bootstrap use cases (e.g., Kafka connector, Spring @PostConstruct workaround). Fixes: Azure#48622 Fixes: Azure#48585 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR mitigates a JVM <clinit> deadlock in the Cosmos Java SDK by replacing broad “initialize everything” fallback behavior in accessor getters with targeted class initialization, reducing circular class-init dependency chains during concurrent first-touch class loading.
Changes:
- Added a targeted class-initialization helper and updated accessor getters to initialize only the specific registering class.
- Preserved the existing eager
initializeAllAccessors()entry point for explicit bootstrap scenarios. - Added a Bugs Fixed entry to the
azure-cosmoschangelog documenting the deadlock fix.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java |
Introduces targeted Class.forName initialization and replaces broad fallback initialization in many accessor getters. |
sdk/cosmos/azure-cosmos/CHANGELOG.md |
Documents the deadlock fix in the unreleased Bugs Fixed section. |
.../azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java
Outdated
Show resolved
Hide resolved
.../azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java
Outdated
Show resolved
Hide resolved
.../azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java
Show resolved
Hide resolved
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ensureClassInitialized: fail fast with IllegalStateException instead of swallowing ClassNotFoundException; use explicit 3-arg Class.forName with classloader - CHANGELOG.md: fix bullet formatting for consistency - Add concurrent accessor initialization regression test in ImplementationBridgeHelpersTest to guard against deadlock reintroduction Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sholdsHelper Rename to getCosmosDiagnosticsThresholdsAccessor() which correctly reflects the return type and initialized class. The old method is kept as a @deprecated delegating alias for binary compatibility. Updated all 3 internal call sites to use the new name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1532dd9 to
2b5a98b
Compare
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@sdkReviewAgent-2 |
|
⏳ PR Review Agent — Starting review... |
The concurrentAccessorInitializationShouldNotDeadlock test reset all 35 accessors to null but only re-initialized 6 concurrently. The remaining 29 were left null, causing NoSuchMethodError/NPE cascades in every subsequent test running in the same JVM. Fix: wrap the test body in try/finally and call initializeAllAccessors() in the finally block to restore all accessors for downstream tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
.../azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java
Show resolved
Hide resolved
...mos-tests/src/test/java/com/azure/cosmos/implementation/ImplementationBridgeHelpersTest.java
Show resolved
Hide resolved
|
/azp run java - cosmos - ci |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
Class.forName() is a no-op when the target class is already being initialized by the current thread (JLS 12.4.2). This broke classes like CosmosItemSerializer whose <clinit> transitively calls getCosmosItemSerializerAccessor() before the accessor is registered. Fix: after Class.forName(), also call the class's initialize() method reflectively. This static method explicitly registers the accessor and is allowed to execute during recursive <clinit> from the same thread, which was the behavior of the previous initializeAllAccessors() chain. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 'static { initialize(); }' to CosmosRequestContext,
CosmosOperationDetails, and CosmosDiagnosticsContext which had
initialize() methods but no <clinit> invocation, so Class.forName()
alone would not register their accessors.
- Improve regression test: assert accessor return values are non-null,
catch ExecutionException for clearer failure messages, and document
the inherent <clinit>-once-per-JVM limitation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The <clinit> deadlock can only be reproduced in a fresh JVM where classes haven't been loaded yet. Replace the reflection-based in-process test with one that forks child JVM processes using ProcessBuilder, each running 6 concurrent threads that trigger <clinit> of different Cosmos classes simultaneously. A 30-second timeout detects deadlocks. Runs 3 iterations for reliability. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
8e70f51 to
87d5fcd
Compare
|
/azp run java - cosmos - ci |
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - kafka |
|
/azp run java - cosmos - spark |
|
/azp run java - spring - ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
4 similar comments
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
- Fix potential test hang: drain child JVM stdout on a daemon gobbler
thread so the timeout check is always reached even if the child
deadlocks (Copilot review comment).
- Add allAccessorClassesMustHaveStaticInitializerBlock test that
scans ImplementationBridgeHelpers source for ensureClassInitialized
targets and verifies each has 'static { initialize(); }' — catches
missing blocks at build time.
- Shorten CHANGELOG entry to 1-2 lines per convention.
- Revert Spring README whitespace (only needed to trigger pipeline).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
87d5fcd to
b406765
Compare
|
/azp run java - cosmos - ci |
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - kafka |
|
/azp run java - cosmos - spark |
|
/azp run java - spring - ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
4 similar comments
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - spark |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Alternative approach to Azure#48667 that avoids Class.forName() entirely. Each getXxxAccessor() fallback now calls a targeted bridge method (e.g., BridgeInternal.initializeCosmosAsyncClientAccessor()) that directly invokes the specific class's initialize() method. This eliminates classloader risks and is compile-time checked. Adds per-class initialize methods to BridgeInternal (17), ModelBridgeInternal (20), and UtilBridgeInternal (1). Also includes: - Missing static { initialize(); } blocks for 3 classes - CosmosItemSerializer <clinit> ordering fix - Renamed misnamed accessor in CosmosDiagnosticsThresholdsHelper - Forked-JVM deadlock regression test (invocationCount=5, 12 threads) - Behavioral enforcement test for static { initialize(); } blocks Fixes: Azure#48622 Fixes: Azure#48585 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Closing in favor of #48689 (lazy accessor approach). Class.forName() has a fundamental limitation — it can't force accessor registration during recursive (JLS §12.4.2 no-op on same-thread re-initialization). The lazy approach removes static final accessor fields entirely, so never triggers accessor initialization. No Class.forName(), no classloader risk, no recursive edge cases. |
Summary
Fixes a JVM-level
<clinit>deadlock that occurs when multiple threads concurrently trigger Cosmos SDK class loading for the first time. This is a permanent, unrecoverable deadlock that hangs all affected threads indefinitely.Fixes: #48622, #48585
Root Cause
Each
getXxxAccessor()method inImplementationBridgeHelperscalledinitializeAllAccessors()as a fallback when its accessor was not yet set. This eagerly loaded 40+ classes viaModelBridgeInternal,UtilBridgeInternal, andBridgeInternal. When two threads entered different<clinit>methods simultaneously, the JVM class initialization monitors created an AB/BA deadlock:CosmosAsyncClientinit lock → needsFeedResponse(held by Thread B's chain)JsonSerializableinit lock → needsCosmosAsyncClient(held by Thread A)Fix
Replace the broad
initializeAllAccessors()fallback in each of the 35getXxxAccessor()methods with a targetedensureClassInitialized()that loads only the specific class registering that accessor.Before (deadlock-prone):
After (targeted):
Where
ensureClassInitializedis:The public
initializeAllAccessors()method is preserved for explicit eager bootstrap use cases (e.g., Kafka connector static block, Spring@PostConstructworkaround).Additional fixes
Added missing
static { initialize(); }blocks toCosmosRequestContext,CosmosOperationDetails, andCosmosDiagnosticsContext. Without these,Class.forName()triggers<clinit>but<clinit>never callsinitialize()to register the accessor — so the accessor stays null. Previously masked becauseinitializeAllAccessors()calledinitialize()directly as a regular static method, bypassing<clinit>entirely.Fixed
<clinit>ordering inCosmosItemSerializer— movedstatic { initialize(); }before theDEFAULT_SERIALIZERfield. This class has a unique self-referencing cycle:DEFAULT_SERIALIZERtriggersDefaultCosmosItemSerializer.<clinit>, which needs theCosmosItemSerializeraccessor — the same class whose<clinit>is currently running. With the oldinitializeAllAccessors()fallback, this worked by accident: the fallback calledCosmosItemSerializer.initialize()as a regular static method (allowed by the JVM during recursive<clinit>from the same thread), which registered the accessor before it was needed. WithClass.forName(), this recursive call is a no-op per JLS §12.4.2 (class already being initialized by current thread), so the accessor stayed null. Moving the static block before the field ensures the accessor is registered beforeDefaultCosmosItemSerializerneeds it — no reflection, no accidental side effects.Renamed misnamed accessor —
getCosmosAsyncClientAccessor()→getCosmosDiagnosticsThresholdsAccessor()inCosmosDiagnosticsThresholdsHelper, updated all 3 internal call sites.Files Changed
ImplementationBridgeHelpers.javaensureClassInitialized(); replacedinitializeAllAccessors()in 35 getter methods; renamed misnamed accessorCosmosItemSerializer.javastatic { initialize(); }beforeDEFAULT_SERIALIZERCosmosRequestContext.javastatic { initialize(); }CosmosOperationDetails.javastatic { initialize(); }CosmosDiagnosticsContext.javastatic { initialize(); }CosmosItemRequestOptions.javaCosmosQueryRequestOptionsImpl.javaCosmosQueryRequestOptionsBase.javaImplementationBridgeHelpersTest.javaCHANGELOG.mdTests
Three tests in
ImplementationBridgeHelpersTest:1.
concurrentAccessorInitializationShouldNotDeadlock(invocationCount = 5)Forks a fresh child JVM (where no Cosmos classes are loaded) and concurrently triggers
<clinit>of 12 different Cosmos classes from 12 threads synchronized via aCyclicBarrier. The barrier ensures all threads release at the exact same instant, maximizing the probability of concurrent<clinit>collisions. A 30-second timeout detects deadlocks. Runs 5 invocations via TestNG (5 fresh JVMs × 12 threads = 60 concurrent<clinit>races).2.
allAccessorClassesMustHaveStaticInitializerBlockForks a fresh child JVM that iterates every
*Helperinner class, calls eachgetXxxAccessor()getter, and verifies the accessor is non-null via reflection. If a class is missingstatic { initialize(); },Class.forName()won't register its accessor and this test fails. Purely behavioral — no source scanning or regex.3.
accessorInitialization(existing, unchanged)Resets all accessors via reflection, calls
initializeAllAccessors(), and verifies all are re-registered. Validates the explicit bootstrap path.