Fix JVM <clinit> deadlock in Cosmos SDK accessor initialization #48667

Closed
jeet1995 wants to merge 12 commits into Azure:main from jeet1995:fix/clinit-deadlock-targeted-accessor-init

Conversation

@jeet1995
Member

@jeet1995 jeet1995 commented Apr 1, 2026

Summary

Fixes a JVM-level <clinit> deadlock that occurs when multiple threads concurrently trigger Cosmos SDK class loading for the first time. This is a permanent, unrecoverable deadlock that hangs all affected threads indefinitely.

Fixes: #48622, #48585

Root Cause

Each getXxxAccessor() method in ImplementationBridgeHelpers called initializeAllAccessors() as a fallback when its accessor was not yet set. This eagerly loaded 40+ classes via ModelBridgeInternal, UtilBridgeInternal, and BridgeInternal. When two threads entered different <clinit> methods simultaneously, the JVM class initialization monitors created an AB/BA deadlock:

  • Thread A holds CosmosAsyncClient init lock → needs FeedResponse (held by Thread B's chain)
  • Thread B holds JsonSerializable init lock → needs CosmosAsyncClient (held by Thread A)
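The AB/BA pattern above can be reproduced in isolation. The following is a minimal sketch, not SDK code: the classes `A` and `B` and the `rendezvous()` helper are hypothetical stand-ins for two classes whose static initializers reference each other. A latch forces both threads to be inside their initializers before either touches the other class, so the cycle is hit deterministically instead of by chance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class ClinitDeadlockDemo {
    // Opens once both threads are inside their static initializers.
    private static final CountDownLatch BOTH_INSIDE = new CountDownLatch(2);

    private static void rendezvous() {
        BOTH_INSIDE.countDown();
        try {
            BOTH_INSIDE.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical class whose initializer depends on B.
    static final class A {
        static final int VALUE;
        static {
            rendezvous();         // this thread is now the one initializing A
            VALUE = B.VALUE + 1;  // blocks: B is being initialized by the other thread
        }
    }

    // Hypothetical class whose initializer depends on A.
    static final class B {
        static final int VALUE;
        static {
            rendezvous();         // this thread is now the one initializing B
            VALUE = A.VALUE + 1;  // blocks: A is being initialized by the other thread
        }
    }

    /** Returns true if both initializing threads are still blocked after waiting. */
    static boolean bothThreadsStuck(long waitMillis) {
        Thread t1 = new Thread(() -> { int ignored = A.VALUE; }, "init-A");
        Thread t2 = new Thread(() -> { int ignored = B.VALUE; }, "init-B");
        // Daemon threads, so the hung initializers do not keep the JVM alive.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(waitMillis);
            t2.join(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return t1.isAlive() && t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + bothThreadsStuck(2000));
    }
}
```

Neither thread can be interrupted out of the wait, which matches the "permanent, unrecoverable" behavior described in the summary.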

Fix

Replace the broad initializeAllAccessors() fallback in each of the 35 getXxxAccessor() methods with a targeted ensureClassInitialized() that loads only the specific class registering that accessor.

Before (deadlock-prone):

public static CosmosClientBuilderAccessor getCosmosClientBuilderAccessor() {
    if (!cosmosClientBuilderClassLoaded.get()) {
        logger.debug("Initializing CosmosClientBuilderAccessor...");
        initializeAllAccessors(); // loads ALL 40+ classes — circular dependency
    }

    CosmosClientBuilderAccessor snapshot = accessor.get();
    if (snapshot == null) {
        logger.error("CosmosClientBuilderAccessor is not initialized yet!");
    }

    return snapshot;
}

After (targeted):

public static CosmosClientBuilderAccessor getCosmosClientBuilderAccessor() {
    if (!cosmosClientBuilderClassLoaded.get()) {
        logger.debug("Initializing CosmosClientBuilderAccessor...");
        ensureClassInitialized("com.azure.cosmos.CosmosClientBuilder"); // loads only this class
    }

    CosmosClientBuilderAccessor snapshot = accessor.get();
    if (snapshot == null) {
        logger.error("CosmosClientBuilderAccessor is not initialized yet!");
    }

    return snapshot;
}

Where ensureClassInitialized is:

private static void ensureClassInitialized(String className) {
    try {
        Class.forName(className, true, ImplementationBridgeHelpers.class.getClassLoader());
    } catch (ClassNotFoundException e) {
        logger.error("Failed to load class for accessor initialization: {}", className, e);
        throw new IllegalStateException(
            "Unable to load class for accessor initialization: " + className, e);
    }
}

The public initializeAllAccessors() method is preserved for explicit eager bootstrap use cases (e.g., Kafka connector static block, Spring @PostConstruct workaround).

Additional fixes

Added missing static { initialize(); } blocks to CosmosRequestContext, CosmosOperationDetails, and CosmosDiagnosticsContext. Without these, Class.forName() triggers <clinit> but <clinit> never calls initialize() to register the accessor — so the accessor stays null. Previously masked because initializeAllAccessors() called initialize() directly as a regular static method, bypassing <clinit> entirely.

Fixed <clinit> ordering in CosmosItemSerializer — moved static { initialize(); } before the DEFAULT_SERIALIZER field. This class has a unique self-referencing cycle: DEFAULT_SERIALIZER triggers DefaultCosmosItemSerializer.<clinit>, which needs the CosmosItemSerializer accessor — the same class whose <clinit> is currently running. With the old initializeAllAccessors() fallback, this worked by accident: the fallback called CosmosItemSerializer.initialize() as a regular static method (allowed by the JVM during recursive <clinit> from the same thread), which registered the accessor before it was needed. With Class.forName(), this recursive call is a no-op per JLS §12.4.2 (class already being initialized by current thread), so the accessor stayed null. Moving the static block before the field ensures the accessor is registered before DefaultCosmosItemSerializer needs it — no reflection, no accidental side effects.
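The ordering problem can be shown with a small self-contained sketch. Everything here is hypothetical (`FieldFirst`, `BlockFirst`, `Dependent`, and the two `AtomicReference` slots stand in for the serializer class, its default instance, and the accessor registry); it only illustrates that static initializers run in textual order and that a recursive `Class.forName()` from the initializing thread does not run the rest of the initializer.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ClinitOrderingDemo {
    // Hypothetical accessor slots, one per ordering variant.
    static final AtomicReference<String> FIELD_FIRST_ACCESSOR = new AtomicReference<>();
    static final AtomicReference<String> BLOCK_FIRST_ACCESSOR = new AtomicReference<>();

    /** Records what the accessor slot held at the moment the dependent object was built. */
    static final class Dependent {
        final String seen;

        Dependent(Class<?> owner, AtomicReference<String> slot) {
            try {
                // The owner is already being initialized by this thread, so this
                // request returns immediately (JLS 12.4.2) and registers nothing.
                Class.forName(owner.getName());
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
            this.seen = slot.get();
        }
    }

    // Field declared before the static block: the dependent runs too early.
    static final class FieldFirst {
        static final Dependent DEFAULT = new Dependent(FieldFirst.class, FIELD_FIRST_ACCESSOR);
        static { FIELD_FIRST_ACCESSOR.set("registered"); }
    }

    // Static block declared before the field: the accessor is ready in time.
    static final class BlockFirst {
        static { BLOCK_FIRST_ACCESSOR.set("registered"); }
        static final Dependent DEFAULT = new Dependent(BlockFirst.class, BLOCK_FIRST_ACCESSOR);
    }

    public static void main(String[] args) {
        System.out.println("field first: " + FieldFirst.DEFAULT.seen); // prints "field first: null"
        System.out.println("block first: " + BlockFirst.DEFAULT.seen); // prints "block first: registered"
    }
}
```

In the sketch the dependency is a constructor rather than another class's <clinit>, but the effect is the same: whatever runs during the field initializer sees only the static state set up by the lines above it.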

Renamed misnamed accessor getCosmosAsyncClientAccessor() → getCosmosDiagnosticsThresholdsAccessor() in CosmosDiagnosticsThresholdsHelper and updated all 3 internal call sites.

Files Changed

| File | Change |
| --- | --- |
| ImplementationBridgeHelpers.java | Added ensureClassInitialized(); replaced initializeAllAccessors() in 35 getter methods; renamed misnamed accessor |
| CosmosItemSerializer.java | Reordered static { initialize(); } before DEFAULT_SERIALIZER |
| CosmosRequestContext.java | Added static { initialize(); } |
| CosmosOperationDetails.java | Added static { initialize(); } |
| CosmosDiagnosticsContext.java | Added static { initialize(); } |
| CosmosItemRequestOptions.java | Updated accessor call site |
| CosmosQueryRequestOptionsImpl.java | Updated accessor call site |
| CosmosQueryRequestOptionsBase.java | Updated accessor call site |
| ImplementationBridgeHelpersTest.java | Added 2 forked-JVM regression tests |
| CHANGELOG.md | Bug fix entry |

Tests

Three tests in ImplementationBridgeHelpersTest:

1. concurrentAccessorInitializationShouldNotDeadlock (invocationCount = 5)

Forks a fresh child JVM (where no Cosmos classes are loaded) and concurrently triggers <clinit> of 12 different Cosmos classes from 12 threads synchronized via a CyclicBarrier. The barrier ensures all threads release at the exact same instant, maximizing the probability of concurrent <clinit> collisions. A 30-second timeout detects deadlocks. Runs 5 invocations via TestNG (5 fresh JVMs × 12 threads = 60 concurrent <clinit> races).

2. allAccessorClassesMustHaveStaticInitializerBlock

Forks a fresh child JVM that iterates every *Helper inner class, calls each getXxxAccessor() getter, and verifies the accessor is non-null via reflection. If a class is missing static { initialize(); }, Class.forName() won't register its accessor and this test fails. Purely behavioral — no source scanning or regex.

3. accessorInitialization (existing, unchanged)

Resets all accessors via reflection, calls initializeAllAccessors(), and verifies all are re-registered. Validates the explicit bootstrap path.
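The barrier-plus-timeout technique used by the first test can be sketched as follows. This is an illustration under assumptions, not the test from this PR: `BarrierStart` and `allFinishWithin` are hypothetical names, and the tasks would be the individual class-touching actions. It assumes at least one task is passed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BarrierStart {
    /**
     * Runs every task on its own thread, releasing all of them together through
     * a CyclicBarrier. Returns false if they do not all finish before the timeout.
     */
    static boolean allFinishWithin(long timeoutSeconds, Runnable... tasks) {
        CyclicBarrier barrier = new CyclicBarrier(tasks.length);
        ExecutorService pool = Executors.newFixedThreadPool(tasks.length, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        List<Future<Void>> futures = new ArrayList<>();
        try {
            for (Runnable task : tasks) {
                Callable<Void> work = () -> {
                    barrier.await(); // wait until every thread is ready
                    task.run();
                    return null;
                };
                futures.add(pool.submit(work));
            }
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
            for (Future<Void> future : futures) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                future.get(remaining, TimeUnit.NANOSECONDS);
            }
            return true;
        } catch (TimeoutException e) {
            return false; // at least one task is still running: possible deadlock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A caller would pass one task per class to initialize and treat a `false` result as a suspected deadlock. Because <clinit> runs only once per JVM, a harness like this is only meaningful inside a freshly forked process.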

Replace broad initializeAllAccessors() fallback calls in each
getXxxAccessor() method with targeted Class.forName() that loads only
the specific class registering that accessor. This eliminates the
circular class initialization dependency chain that caused permanent
JVM-level deadlocks when multiple threads concurrently triggered
Cosmos SDK class loading for the first time.

Root cause: When any getXxxAccessor() found its accessor unset, it
called initializeAllAccessors() which eagerly loaded 40+ classes via
ModelBridgeInternal, UtilBridgeInternal, and BridgeInternal. If two
threads entered different <clinit> methods simultaneously, the JVM
class initialization monitors created an AB/BA deadlock that was
permanent and unrecoverable.

Fix: Each of the 35 getXxxAccessor() methods now calls
ensureClassInitialized() with just the fully-qualified name of the
class that registers that specific accessor. This dramatically
narrows the class loading scope per accessor call, making circular
dependencies practically impossible.

The public initializeAllAccessors() method is preserved for explicit
eager bootstrap use cases (e.g., Kafka connector, Spring @PostConstruct
workaround).

Fixes: Azure#48622
Fixes: Azure#48585

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 1, 2026 23:11
@jeet1995 jeet1995 requested review from a team and kirankumarkolli as code owners April 1, 2026 23:11
@github-actions github-actions bot added the Cosmos label Apr 1, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR mitigates a JVM <clinit> deadlock in the Cosmos Java SDK by replacing broad “initialize everything” fallback behavior in accessor getters with targeted class initialization, reducing circular class-init dependency chains during concurrent first-touch class loading.

Changes:

  • Added a targeted class-initialization helper and updated accessor getters to initialize only the specific registering class.
  • Preserved the existing eager initializeAllAccessors() entry point for explicit bootstrap scenarios.
  • Added a Bugs Fixed entry to the azure-cosmos changelog documenting the deadlock fix.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ImplementationBridgeHelpers.java | Introduces targeted Class.forName initialization and replaces broad fallback initialization in many accessor getters. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the deadlock fix in the unreleased Bugs Fixed section. |

jeet1995 and others added 3 commits April 1, 2026 19:43
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ensureClassInitialized: fail fast with IllegalStateException instead
  of swallowing ClassNotFoundException; use explicit 3-arg Class.forName
  with classloader
- CHANGELOG.md: fix bullet formatting for consistency
- Add concurrent accessor initialization regression test in
  ImplementationBridgeHelpersTest to guard against deadlock reintroduction

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sholdsHelper

Rename to getCosmosDiagnosticsThresholdsAccessor() which correctly
reflects the return type and initialized class. The old method is
kept as a @deprecated delegating alias for binary compatibility.

Updated all 3 internal call sites to use the new name.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 jeet1995 force-pushed the fix/clinit-deadlock-targeted-accessor-init branch from 1532dd9 to 2b5a98b Compare April 2, 2026 00:11
@jeet1995
Member Author

jeet1995 commented Apr 2, 2026

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

jeet1995 commented Apr 2, 2026

/azp run java - cosmos - ci

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member

@sdkReviewAgent-2

@xinlian12
Member

PR Review Agent — Starting review...

The concurrentAccessorInitializationShouldNotDeadlock test reset all
35 accessors to null but only re-initialized 6 concurrently. The
remaining 29 were left null, causing NoSuchMethodError/NPE cascades
in every subsequent test running in the same JVM.

Fix: wrap the test body in try/finally and call initializeAllAccessors()
in the finally block to restore all accessors for downstream tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

jeet1995 commented Apr 2, 2026

/azp run java - cosmos - ci

@jeet1995
Member Author

jeet1995 commented Apr 2, 2026

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 3 commits April 1, 2026 23:38
Class.forName() is a no-op when the target class is already being
initialized by the current thread (JLS 12.4.2). This broke classes
like CosmosItemSerializer whose <clinit> transitively calls
getCosmosItemSerializerAccessor() before the accessor is registered.

Fix: after Class.forName(), also call the class's initialize() method
reflectively. This static method explicitly registers the accessor and
is allowed to execute during recursive <clinit> from the same thread,
which was the behavior of the previous initializeAllAccessors() chain.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 'static { initialize(); }' to CosmosRequestContext,
  CosmosOperationDetails, and CosmosDiagnosticsContext which had
  initialize() methods but no <clinit> invocation, so Class.forName()
  alone would not register their accessors.
- Improve regression test: assert accessor return values are non-null,
  catch ExecutionException for clearer failure messages, and document
  the inherent <clinit>-once-per-JVM limitation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The <clinit> deadlock can only be reproduced in a fresh JVM where
classes haven't been loaded yet. Replace the reflection-based
in-process test with one that forks child JVM processes using
ProcessBuilder, each running 6 concurrent threads that trigger
<clinit> of different Cosmos classes simultaneously. A 30-second
timeout detects deadlocks. Runs 3 iterations for reliability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

jeet1995 commented Apr 2, 2026

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 force-pushed the fix/clinit-deadlock-targeted-accessor-init branch 2 times, most recently from 8e70f51 to 87d5fcd Compare April 3, 2026 17:02
@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - ci

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - tests

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - kafka

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - spark

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - spring - ci

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

- Fix potential test hang: drain child JVM stdout on a daemon gobbler
  thread so the timeout check is always reached even if the child
  deadlocks (Copilot review comment).
- Add allAccessorClassesMustHaveStaticInitializerBlock test that
  scans ImplementationBridgeHelpers source for ensureClassInitialized
  targets and verifies each has 'static { initialize(); }' — catches
  missing blocks at build time.
- Shorten CHANGELOG entry to 1-2 lines per convention.
- Revert Spring README whitespace (only needed to trigger pipeline).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
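The forked-JVM pattern with an output gobbler described in the commit above can be sketched roughly as below. This is a hedged illustration, not the test code from this PR: `ForkedJvm`, `run`, and `TIMED_OUT` are hypothetical names, and the sketch assumes the `java` launcher lives under `java.home/bin`. A real test would pass a classpath and a main class as the arguments.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class ForkedJvm {
    /** Sentinel returned when the child did not exit before the timeout. */
    static final int TIMED_OUT = Integer.MIN_VALUE;

    /** Launches a child JVM with the given arguments and returns its exit code. */
    static int run(long timeoutSeconds, String... jvmArgs) {
        List<String> command = new ArrayList<>();
        command.add(System.getProperty("java.home") + File.separator + "bin" + File.separator + "java");
        command.addAll(Arrays.asList(jvmArgs));
        try {
            Process child = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout: one stream to drain
                .start();
            // Drain output on a daemon thread so a full pipe buffer cannot block
            // the child, and so the timeout check below is always reached.
            Thread gobbler = new Thread(() -> drain(child.getInputStream()), "child-output-gobbler");
            gobbler.setDaemon(true);
            gobbler.start();
            if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                child.destroyForcibly(); // a hung child is treated as a deadlock
                return TIMED_OUT;
            }
            return child.exitValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child JVM", e);
        }
    }

    private static void drain(InputStream in) {
        byte[] buffer = new byte[4096];
        try {
            while (in.read(buffer) != -1) {
                // discard; a real test might capture this for failure messages
            }
        } catch (IOException ignored) {
            // stream closed because the child exited or was destroyed
        }
    }
}
```

Reading the child's output on the calling thread instead would block indefinitely if the child deadlocks while holding the stream open, which is the hang the commit describes.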
@jeet1995 jeet1995 force-pushed the fix/clinit-deadlock-targeted-accessor-init branch from 87d5fcd to b406765 Compare April 3, 2026 17:14
@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - ci

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - tests

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - kafka

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - spark

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - spring - ci

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

jeet1995 commented Apr 3, 2026

/azp run java - cosmos - spark

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 added a commit to jeet1995/azure-sdk-for-java that referenced this pull request Apr 3, 2026
Alternative approach to Azure#48667 that avoids Class.forName() entirely.
Each getXxxAccessor() fallback now calls a targeted bridge method
(e.g., BridgeInternal.initializeCosmosAsyncClientAccessor()) that
directly invokes the specific class's initialize() method. This
eliminates classloader risks and is compile-time checked.

Adds per-class initialize methods to BridgeInternal (17),
ModelBridgeInternal (20), and UtilBridgeInternal (1).

Also includes:
- Missing static { initialize(); } blocks for 3 classes
- CosmosItemSerializer <clinit> ordering fix
- Renamed misnamed accessor in CosmosDiagnosticsThresholdsHelper
- Forked-JVM deadlock regression test (invocationCount=5, 12 threads)
- Behavioral enforcement test for static { initialize(); } blocks

Fixes: Azure#48622
Fixes: Azure#48585

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995
Member Author

jeet1995 commented Apr 5, 2026

Closing in favor of #48689 (lazy accessor approach). Class.forName() has a fundamental limitation: it can't force accessor registration during recursive <clinit> (per JLS §12.4.2, re-initialization requested by the same thread is a no-op). The lazy approach removes static final accessor fields entirely, so <clinit> never triggers accessor initialization. No Class.forName(), no classloader risk, no recursive edge cases.

@jeet1995 jeet1995 closed this Apr 5, 2026