OpaAuthorizer creates a new HttpClient on each authorize request, causing thread exhaustion under load #142
Description
Background
We observed unusual behavior in our Druid cluster (v30.0.0): the Coordinator would fail silently, producing no meaningful log output, which made troubleshooting extremely difficult. The problem was most pronounced during periods of high ingestion load. A thread analysis showed that the leader node was spawning a very large number of threads in a short period of time, with the thread count surging from approximately 5,000 to 13,000.
Thread dump analysis uncovered three interconnected issues within the OPA authorizer:
- Jetty threads stuck spawning new threads
Jetty worker threads were blocked at the JVM level while trying to start new threads. The stack trace points to OpaAuthorizer.authorize(), which was creating a brand new HttpClient instance on every authorization call:
"qtp1921332678-176" RUNNABLE
    at java.lang.Thread.start0(Native Method) <-- STUCK HERE
    at java.lang.Thread.start(Thread.java:809)
    at jdk.internal.net.http.HttpClientImpl.start(HttpClientImpl.java:338)
    at HttpClient.newHttpClient(HttpClient.java:162)
    at tech.stackable.druid.opaauthorizer.OpaAuthorizer.authorize(OpaAuthorizer.java:57)
- Jetty threads indefinitely waiting on OPA responses
A separate set of Jetty worker threads was in a WAITING state, blocked indefinitely on CompletableFuture.get() while awaiting HTTP responses from OPA that never arrived. Those threads were effectively hung and never returned to the pool:
"qtp1921332678-168" WAITING
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
    at jdk.internal.net.http.HttpClientImpl.send(HttpClientImpl.java:554)
    at tech.stackable.druid.opaauthorizer.OpaAuthorizer.authorize(OpaAuthorizer.java:66)
- Massive thread churn and accumulation
_to_delete_list length = 5,113 (dead threads awaiting JVM cleanup)
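The first finding can be reproduced in isolation. The sketch below is not taken from the authorizer; the class and method names are hypothetical. It counts live threads whose name starts with "HttpClient-", which relies on the thread naming used by current OpenJDK builds (an implementation detail, so treat it as an assumption), and compares the count before and after creating several clients.

```java
import java.net.http.HttpClient;
import java.util.ArrayList;
import java.util.List;

public class ClientThreadDemo {

    // Counts live threads that appear to belong to java.net.http clients.
    // ASSUMPTION: OpenJDK names these threads "HttpClient-<id>-...";
    // this is an implementation detail, not a documented contract.
    static long countHttpClientThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith("HttpClient-"))
                .count();
    }

    // Creates `clients` new HttpClient instances and returns how many
    // HttpClient threads were added. The clients are stored in `keepAlive`
    // so they cannot be garbage collected while we count.
    static long threadsAddedBy(int clients, List<HttpClient> keepAlive) {
        long before = countHttpClientThreads();
        for (int i = 0; i < clients; i++) {
            keepAlive.add(HttpClient.newHttpClient());
        }
        return countHttpClientThreads() - before;
    }

    public static void main(String[] args) {
        List<HttpClient> held = new ArrayList<>();
        // One client per "request", as in the current authorize() code path.
        System.out.println("threads added by 10 clients: " + threadsAddedBy(10, held));
        // A single shared client, as in the proposed fix.
        System.out.println("threads added by 1 client:   " + threadsAddedBy(1, held));
    }
}
```

Each new client starts at least one background thread of its own, which matches the Thread.start0 frames in the dump above. Under thousands of authorize() calls per second this adds up quickly.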
Root cause summary:
The OpaAuthorizer was instantiating a new HttpClient on every authorize() call rather than reusing a shared instance. Under high ingestion load, this caused an uncontrolled explosion of short-lived threads, exhausting thread pool capacity and ultimately bringing down the Coordinator.
@Override
public Access authorize(...) {
    // ...
    var client = HttpClient.newHttpClient(); // <-- NEW CLIENT EVERY CALL (line 57)
    try {
        var request = HttpRequest.newBuilder()
            .uri(new URI(opaUri))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(msgJson))
            .build();
        var response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // ...
    }
}
Proposed solution:
Promote HttpClient from a local variable to an instance field:
@JsonTypeName("opa")
public class OpaAuthorizer implements Authorizer {
    private final HttpClient httpClient = HttpClient.newHttpClient(); // ADD: shared instance
    @Override
    public Access authorize(...) {
        // REMOVE: var client = HttpClient.newHttpClient();
        // USE: this.httpClient
        var response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
java.net.http.HttpClient is thread-safe and designed for concurrent reuse. This is the same pattern used in Stackable's own Trino OPA authorizer.
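A shared client fixes the thread explosion but not the second finding (threads waiting forever on OPA). One possible addition, beyond the fix proposed here, is a per-request timeout. The following is a minimal standalone sketch, not the authorizer's actual code: the class name, the timeout values, and the URL path are all assumptions, and a local socket that never answers stands in for an unresponsive OPA.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class SharedClientSketch {

    // One client for the whole process, reused by every request.
    // The connect timeout value is an assumption for illustration.
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    // Builds the same kind of POST request as authorize(), plus a
    // request timeout (assumed value) so send() cannot block forever.
    static HttpRequest buildRequest(URI opaUri, String msgJson) {
        return HttpRequest.newBuilder()
                .uri(opaUri)
                .timeout(Duration.ofMillis(500))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(msgJson))
                .build();
    }

    // Sends a request to a local socket that accepts connections at the
    // TCP level but never replies. Returns true if send() gave up with
    // an HttpTimeoutException instead of waiting indefinitely.
    static boolean timesOutAgainstSilentServer() throws IOException, InterruptedException {
        try (ServerSocket silent = new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"))) {
            // Hypothetical path; only the host and port matter here.
            URI uri = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/hypothetical");
            try {
                CLIENT.send(buildRequest(uri, "{}"), HttpResponse.BodyHandlers.ofString());
                return false;
            } catch (HttpTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed out: " + timesOutAgainstSilentServer());
    }
}
```

With a timeout in place, the authorizer would still need to decide how to treat a timed-out request (for example, deny access). That policy choice is outside the scope of this sketch.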