OpaAuthorizer creates a new HttpClient on each authorize request, causing thread exhaustion under load #142

@vikshab

Background

We observed unusual behavior in our Druid cluster (v30.0.0): the Coordinator would fail silently, producing no meaningful log output, which made troubleshooting extremely difficult. The issue was most pronounced during periods of high ingestion load. A thread analysis revealed that the leader node was spawning a massive number of threads in a short period of time, surging from approximately 5,000 to 13,000.

Thread dump analysis uncovered three interconnected issues within the OPA authorizer:

  1. Jetty threads stuck spawning new threads
    Jetty worker threads were blocked at the JVM level while attempting to start new threads. The stack trace points to the OpaAuthorizer.authorize() method, which was creating a brand new HttpClient instance on every authorization call:
"qtp1921332678-176" RUNNABLE
  at java.lang.Thread.start0(Native Method)          <-- STUCK HERE
  at java.lang.Thread.start(Thread.java:809)
  at jdk.internal.net.http.HttpClientImpl.start(HttpClientImpl.java:338)
  at HttpClient.newHttpClient(HttpClient.java:162)
  at tech.stackable.druid.opaauthorizer.OpaAuthorizer.authorize(OpaAuthorizer.java:57)
  2. Jetty threads indefinitely waiting on OPA responses
    A separate set of Jetty worker threads was in the WAITING state, blocked indefinitely on CompletableFuture.get() while awaiting HTTP responses from OPA that would never arrive, effectively hanging those threads:
"qtp1921332678-168" WAITING
  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
  at jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:554)
  at tech.stackable.druid.opaauthorizer.OpaAuthorizer.authorize(OpaAuthorizer.java:66)
  3. Massive thread churn and accumulation
  • _to_delete_list length = 5,113 (dead threads awaiting JVM cleanup)

Root cause summary:

The OpaAuthorizer was instantiating a new HttpClient on every authorize() call rather than reusing a shared instance. Each new client starts a background thread of its own (the Thread.start() call beneath HttpClientImpl.start in the first stack trace), so under high ingestion load this caused an uncontrolled explosion of short-lived threads, exhausting thread pool capacity and ultimately bringing down the Coordinator.

@Override
public Access authorize(...) {
    // ...
    var client = HttpClient.newHttpClient();  // <-- NEW CLIENT EVERY CALL (line 57)
    try {
        var request = HttpRequest.newBuilder()
                .uri(new URI(opaUri))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(msgJson))
                .build();
        var response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // ...
    }
}
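
The thread growth behind this can be reproduced outside Druid. The following is a minimal sketch, not taken from the OpaAuthorizer code: the class ClientThreadDemo and its methods are hypothetical, and it assumes the JDK's built-in client gives its background thread a name ending in "SelectorManager", which is an OpenJDK implementation detail rather than a documented contract.

```java
import java.net.http.HttpClient;
import java.util.ArrayList;
import java.util.List;

public class ClientThreadDemo {
    // Counts live threads whose name ends with "SelectorManager".
    // Assumption: OpenJDK names each HttpClient's background thread this way.
    static long selectorThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().endsWith("SelectorManager"))
                .count();
    }

    // Creates n clients, keeps them reachable via keepAlive, and returns
    // how many selector threads appeared as a result.
    static long threadsAddedBy(int n, List<HttpClient> keepAlive) {
        long before = selectorThreads();
        for (int i = 0; i < n; i++) {
            keepAlive.add(HttpClient.newHttpClient());
        }
        return selectorThreads() - before;
    }

    public static void main(String[] args) {
        List<HttpClient> clients = new ArrayList<>();
        long added = threadsAddedBy(20, clients);
        System.out.println("selector threads added by 20 clients: " + added);
    }
}
```

In this sketch the clients stay reachable, so their threads stay alive. In authorize() the client is a local variable, so its thread becomes eligible for shutdown once the call returns, which would be consistent with the churn and the long _to_delete_list seen in the dump.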

Proposed solution:

Promote HttpClient from local variable to instance field:

@JsonTypeName("opa")
public class OpaAuthorizer implements Authorizer {
    private final HttpClient httpClient = HttpClient.newHttpClient();  // ADD: shared instance

    @Override
    public Access authorize(...) {
        // REMOVE: var client = HttpClient.newHttpClient();
        // USE: this.httpClient
        var response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
    }
}

java.net.http.HttpClient is thread-safe and designed for concurrent reuse. This is the same pattern used in Stackable's own Trino OPA authorizer.
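
Sharing the client addresses the thread explosion. The second symptom, threads waiting indefinitely in CompletableFuture.get(), may additionally call for timeouts. Below is a minimal sketch of a shared client with a connect timeout and a per-request timeout. The class SharedOpaClient, the buildRequest helper, the timeout values, and the example URL are all hypothetical and are not part of the existing OpaAuthorizer.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class SharedOpaClient {
    // One client for the lifetime of the class. The 5 s connect timeout
    // is an illustrative value, not a recommendation.
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    // Builds the POST request with a per-request timeout. If no response
    // arrives within that time, HttpClient.send() throws HttpTimeoutException
    // instead of blocking the calling thread forever.
    static HttpRequest buildRequest(String opaUri, String msgJson) {
        return HttpRequest.newBuilder()
                .uri(URI.create(opaUri))
                .header("Content-Type", "application/json")
                .timeout(Duration.ofSeconds(10)) // illustrative value
                .POST(HttpRequest.BodyPublishers.ofString(msgJson))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical OPA endpoint; nothing is sent over the network here.
        HttpRequest request = buildRequest(
                "http://localhost:8181/v1/data/example", "{\"input\":{}}");
        System.out.println(request.method() + " " + request.uri());
        System.out.println("request timeout: " + request.timeout());
        System.out.println("connect timeout: " + CLIENT.connectTimeout());
    }
}
```

With a timeout in place, authorize() would need to decide how to treat an HttpTimeoutException, for example by denying access. That policy choice is outside the scope of this sketch.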
