[PECOBLR-2321] Result Set Heartbeat / Keep-Alive for Ongoing Query Executions #1415

Open

gopalldb wants to merge 11 commits into main from design/heartbeat-keep-alive

Conversation


@gopalldb gopalldb commented Apr 22, 2026

Summary

Design + implementation for PECOBLR-2321: periodic heartbeat polling to keep server-side result state alive while the client consumes results slowly.

Problem

When users read query results slowly (pausing between next() calls), the warehouse can auto-stop after its idle timeout. For inline results (data only on cluster, not uploaded to cloud storage), this means permanent data loss. The user gets errors like INVALID_HANDLE_STATUS or "operation not found".

Solution

A ResultHeartbeatManager that periodically calls GetStatementStatus (SEA) or GetOperationStatus (Thrift) to signal the server that results are still being consumed. Opt-in via EnableHeartbeat=1 connection parameter (default false due to cost implications).
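For example, opting in via the connection URL might look like the following sketch. The host and httpPath values are placeholders; the semicolon-separated property syntax is the standard Databricks JDBC URL form, and the two parameter names are the ones this PR introduces:

```
jdbc:databricks://<host>:443/default;transportMode=http;httpPath=<http-path>;EnableHeartbeat=1;HeartbeatIntervalSeconds=60
```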

Design doc

docs/design/HEARTBEAT_KEEP_ALIVE.md — includes cross-driver survey, Mermaid diagrams (sequence flows, state machine, class diagram), and detailed lifecycle analysis.

Heartbeat eligibility (skipped when not needed)

| Scenario | Heartbeat? | Reason |
| --- | --- | --- |
| SEA cloud fetch (Arrow) | Yes | Statement must stay alive for URL refresh |
| Thrift inline (columnar) | Yes | Data fetched on-demand; server can evict |
| Thrift cloud fetch | Yes | Operation handle must stay alive |
| SEA inline (JSON) | No | All data loaded into memory at construction |
| Direct results (CLOSED state) | No | Server already closed; data fully delivered |
| Update count (DML) | No | No result rows; execution polling already kept it alive |
| Async execution wait | No | User controls polling via getExecutionResult() — heartbeat starts only when ResultSet is constructed |
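The table above can be sketched as a simple predicate. This is a hypothetical stand-in for the driver's internal `isHeartbeatEligible()` logic — the enum names are illustrative, not the actual result-type identifiers:

```java
// Sketch of the eligibility rules from the table above. ResultType and the
// method signature are hypothetical stand-ins, not the driver's real API.
enum ResultType { SEA_ARROW_CLOUD, THRIFT_INLINE, THRIFT_CLOUD, SEA_INLINE_JSON, UPDATE_COUNT }

public class HeartbeatEligibility {
  // directResultsClosed: server delivered everything inline and already closed the operation
  static boolean isHeartbeatEligible(ResultType type, boolean directResultsClosed) {
    if (directResultsClosed) return false;   // nothing left on the server to keep alive
    switch (type) {
      case SEA_ARROW_CLOUD:  // statement must stay alive for URL refresh
      case THRIFT_INLINE:    // data fetched on-demand; server can evict
      case THRIFT_CLOUD:     // operation handle must stay alive
        return true;
      case SEA_INLINE_JSON:  // all data loaded into memory at construction
      case UPDATE_COUNT:     // DML: no result rows to keep alive
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isHeartbeatEligible(ResultType.THRIFT_INLINE, false));  // true
    System.out.println(isHeartbeatEligible(ResultType.SEA_INLINE_JSON, false)); // false
    System.out.println(isHeartbeatEligible(ResultType.SEA_ARROW_CLOUD, true));  // false
  }
}
```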

Error resilience

  • 10 consecutive failures before self-stop
  • Terminal states (CLOSED/ERROR/CANCELED/TIMEDOUT) auto-stop
  • Single success resets failure counter
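A minimal sketch of the failure-counter behavior described above (class and method names are hypothetical; in the driver this logic runs inside the scheduled heartbeat task):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the error-resilience rules above (names hypothetical):
// 10 consecutive failures stop the heartbeat; a single success resets the counter.
public class FailureCounterDemo {
  static final int MAX_CONSECUTIVE_FAILURES = 10;
  final AtomicInteger consecutiveFailures = new AtomicInteger();
  boolean stopped = false;

  // Called once per heartbeat tick with the outcome of the status RPC.
  void onTick(boolean success) {
    if (stopped) return;
    if (success) {
      consecutiveFailures.set(0);  // single success resets the counter
    } else if (consecutiveFailures.incrementAndGet() >= MAX_CONSECUTIVE_FAILURES) {
      stopped = true;              // self-stop after 10 consecutive failures
    }
  }

  public static void main(String[] args) {
    FailureCounterDemo d = new FailureCounterDemo();
    for (int i = 0; i < 9; i++) d.onTick(false);  // 9 failures: still running
    d.onTick(true);                               // success resets the counter
    for (int i = 0; i < 9; i++) d.onTick(false);  // 9 more: still running
    System.out.println(d.stopped);                // false
    d.onTick(false);                              // 10th consecutive failure
    System.out.println(d.stopped);                // true
  }
}
```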

Zero-leak guarantee

Heartbeat stops in 4 places: next() returns false, ResultSet.close(), Statement.close(), Connection.close()

Implementation

  • New: ResultHeartbeatManager — per-connection manager with ScheduledExecutorService (daemon thread)
  • New: ResultHeartbeatManagerTest — 7 unit tests
  • Modified: DatabricksJdbcUrlParams — EnableHeartbeat (default 0), HeartbeatIntervalSeconds (default 60)
  • Modified: DatabricksConnectionContext — getter methods
  • Modified: DatabricksConnection — creates/shuts down manager
  • Modified: DatabricksResultSet — starts heartbeat in constructor, stops on close/next-false
  • Modified: DatabricksStatement — safety net stop in close()
  • Modified: IDatabricksClient — checkStatementAlive() default method
  • Modified: DatabricksSdkClient — SEA heartbeat via GET /sql/statements/{id}
  • Modified: DatabricksThriftServiceClient — Thrift heartbeat via GetOperationStatus

Design doc for PECOBLR-2321: periodic heartbeat polling to keep
server-side result state alive while the client consumes results
slowly.

Key design points:
- Periodic GetStatementStatus (SEA) / GetOperationStatus (Thrift)
- Opt-in via EnableHeartbeat connection parameter (default false)
- Configurable interval (default 60s, aligned with ADBC C# driver)
- Zero-leak guarantee: stops on ResultSet.close, Statement.close,
  Connection.close, end-of-results, or server terminal state
- Error resilience: 10 consecutive failures before self-stop
- Includes cross-driver survey (ADBC C#, Python, Go, Node.js)
- Mermaid diagrams: sequence flows, state machine, class diagram

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
Add periodic heartbeat polling to keep server-side result state alive
while the client consumes results slowly. Prevents warehouse auto-stop
from destroying in-progress results.

New files:
- ResultHeartbeatManager: per-connection manager with shared
  ScheduledExecutorService (daemon thread). Manages start/stop/shutdown
  lifecycle for heartbeats across all statements.
- ResultHeartbeatManagerTest: 7 unit tests covering lifecycle,
  idempotency, interval, shutdown, re-execution.

Connection parameters:
- EnableHeartbeat (default 0): opt-in to enable heartbeat polling
- HeartbeatIntervalSeconds (default 60): polling interval

Protocol support:
- SEA: GET /sql/statements/{id} via checkStatementAlive()
- Thrift: GetOperationStatus via checkStatementAlive()

Heartbeat eligibility (skipped when not needed):
- SEA inline (InlineJsonResult): all data in memory, no server state
- Update count / metadata results: no data to keep alive
- Direct results: server already closed the operation
- Null execution result: nothing to fetch

Error resilience:
- 10 consecutive failures before self-stop (transient error tolerance)
- Single success resets failure counter
- Terminal states (CLOSED/ERROR/CANCELED/TIMEDOUT) stop heartbeat

Cleanup guarantees (zero leak):
- next() returns false → stop
- ResultSet.close() → stop
- Statement.close() → safety net stop
- Connection.close() → shutdown all

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
Metadata operations (getTables, getColumns, etc.) can return large
result sets that may need heartbeat. Only UPDATE statements are
excluded (they return update count, no data to keep alive).

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
Direct results mean the server already closed the operation and
delivered all data inline. No heartbeat needed — detect via
ExecutionState.CLOSED in the constructor rather than waiting for
the first poll to discover it.

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
- No heartbeat during executeAsync() wait — user controls polling
  via getExecutionResult(). Heartbeat starts only when ResultSet is
  constructed and user begins consuming results.
- Updated eligibility table: SEA inline doesn't need heartbeat
  (all data in memory), added update count row.
- Added execution phase vs consumption phase diagram.

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
@gopalldb gopalldb changed the title from "[Design] Result Set Heartbeat / Keep-Alive for Ongoing Query Executions" to "[PECOBLR-2321] Result Set Heartbeat / Keep-Alive for Ongoing Query Executions" on Apr 22, 2026
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
Extract isHeartbeatEligible() as package-visible method for testing.
Add 10 tests covering all eligibility/ineligibility scenarios:

Eligible (heartbeat starts):
- SEA cloud fetch (Arrow) — statement alive for URL refresh
- Thrift inline — data fetched on-demand, server can evict
- Thrift cloud fetch (Arrow) — operation handle alive
- Metadata queries — can return large result sets

Ineligible (heartbeat skipped):
- SEA inline (JSON) — all data in memory
- Direct results (CLOSED) — server already closed
- Update count (DML) — no result rows
- Null execution result — nothing to fetch
- Async PENDING — user controls polling
- Async RUNNING — user controls polling

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
1. consecutiveFailures → AtomicInteger (was plain int written by
   scheduler thread, no happens-before guarantee)

2. Stopped flag prevents RPC on closed client/session. stopHeartbeat
   sets AtomicBoolean flag BEFORE cancel(false). In-flight heartbeat
   tick checks flag before RPC, skips if set. Exceptions during
   shutdown don't count as consecutive failures.

3. Constructor leak: verified startHeartbeatIfEnabled() is already
   the last line of the constructor, after all throwing code. No
   change needed — already safe.

4. HeartbeatIntervalSeconds bounds check: reject <= 0 (use default
   60), warn for > 3600 (heartbeat may not keep operation alive).
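The bounds check in point 4 can be sketched as follows. The resolver method name is hypothetical; in the driver this parsing lives in the connection context:

```java
// Sketch of the interval bounds check described above (names hypothetical):
// reject non-positive or non-numeric values (fall back to the default), and
// warn on intervals above 1h, where a heartbeat may not keep the operation alive.
public class IntervalCheck {
  static final int DEFAULT_INTERVAL_SECONDS = 60;
  static final int WARN_THRESHOLD_SECONDS = 3600;

  static int resolveHeartbeatInterval(String raw) {
    int value;
    try {
      value = Integer.parseInt(raw);
    } catch (NumberFormatException e) {
      return DEFAULT_INTERVAL_SECONDS;   // non-numeric: fall back, don't crash connection open
    }
    if (value <= 0) return DEFAULT_INTERVAL_SECONDS;  // reject non-positive
    if (value > WARN_THRESHOLD_SECONDS) {
      System.err.println("Warning: heartbeat interval > 1h may not keep the operation alive");
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(resolveHeartbeatInterval("30"));   // 30
    System.out.println(resolveHeartbeatInterval("-5"));   // 60
    System.out.println(resolveHeartbeatInterval("abc"));  // 60
  }
}
```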

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
1. Null-defense on Thrift response: if getOperationState() returns
   null, assume alive and log warning (prevents NPE)

2. Better logging for heartbeat failures: first failure at INFO,
   terminal (10th) failure at WARN with statement ID and error
   message. Users will now see early signals instead of cryptic
   "operation not found" on next()

3. Stop old heartbeat on re-execute: resetForNewExecution() now
   explicitly calls stopHeartbeat(oldStatementId) before clearing
   state. Prevents wasteful 10-failure self-termination of orphaned
   heartbeats

4. Document cloud-fetch prefetch interaction: noted that
   StreamingChunkProvider/RemoteChunkProvider background RPCs act as
   implicit heartbeat. Explicit heartbeat is still useful for gaps
   (all chunks downloaded, prefetch paused, sliding window full)

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
Tests rewritten to be deterministic (CountDownLatch, no Thread.sleep):
- testStoppedFlagSetOnStop: get flag after start, verify set on stop
- testStoppedFlagSetOnShutdown: same pattern, verify set on shutdown
- testStopRacingWithScheduledTick: verify stopped flag prevents RPC
- testShutdownWithBlockedTask: verify shutdownNow fires after 5s
- testReExecutionReplacesHeartbeat: verify old task stops

Fix: startHeartbeat resets stopped flag on new start (was stale
after stop-then-restart cycle).

Add DEBUG log on successful heartbeat start with statementId,
resultType, and interval for support diagnostics.

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
Verified against dogfood warehouse:
- testHeartbeatKeepsResultsAliveDuringSlowConsumption: execute query,
  read first row, pause 15s (3 heartbeats at 5s interval), read
  remaining 99 rows successfully. All 100 rows returned.
- testHeartbeatStopsOnResultSetClose: verify clean shutdown after close

Run with:
  DATABRICKS_HOST=... DATABRICKS_TOKEN=... DATABRICKS_HTTP_PATH=... \
  mvn -pl jdbc-core test -Dtest="HeartbeatIntegrationTest"

Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Isaac
// Get the stopped flag from the manager — shared between the heartbeat task and
// stopHeartbeat(). Prevents RPC on a just-closed client/session: stopHeartbeat sets
// the flag before cancel(false), so an in-flight tick sees it and skips the RPC.
final java.util.concurrent.atomic.AtomicBoolean stopped = mgr.getStoppedFlag(statementId);

[CRITICAL] Orphan stopped flag — heartbeat RPC never actually fires

The stopped flag is captured here at line 328 before mgr.startHeartbeat(...) is called at line 378. Inside ResultHeartbeatManager.startHeartbeat():

// ResultHeartbeatManager.java
void startHeartbeat(StatementId statementId, Runnable heartbeatTask) {
  ...
  stopHeartbeat(statementId);              // line 63 — REMOVES this flag from map AND sets it to true
  getStoppedFlag(statementId).set(false);  // line 66 — computeIfAbsent creates a NEW AtomicBoolean
  ...
}

So the AtomicBoolean captured by the closure here is the removed/orphaned one — permanently set to true. The new flag in the map (which mgr.stopHeartbeat(...) later mutates from DatabricksResultSet.stopHeartbeat, Statement.close, Connection.close) is invisible to the closure.

Net effect: every tick, if (stopped.get()) return; short-circuits → client.checkStatementAlive(statementId) is never called. The whole feature is non-functional.

The integration test only passes because warehouses don't actually expire results in 15s — so the absence of heartbeats isn't observed.

Fix options (any one):

  1. Capture the flag after mgr.startHeartbeat(...) returns.
  2. Reuse the same AtomicBoolean in startHeartbeat/stopHeartbeat (don't remove from the map — just set(true)/set(false)).
  3. Have the closure call mgr.getStoppedFlag(statementId).get() per tick instead of holding a captured reference.

Add a unit test that asserts client.checkStatementAlive is invoked at least once via the production wiring — currently no such test exists.
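A self-contained model of the bug and of fix option 3. The map stands in for ResultHeartbeatManager's internal flag registry; all names are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal model of the orphan-flag bug and of fix option 3 (per-tick lookup).
// The map stands in for ResultHeartbeatManager's internal flag registry.
public class OrphanFlagDemo {
  static final ConcurrentMap<String, AtomicBoolean> flags = new ConcurrentHashMap<>();

  static AtomicBoolean getStoppedFlag(String id) {
    return flags.computeIfAbsent(id, k -> new AtomicBoolean());
  }

  // Mirrors stopHeartbeat(): removes the flag from the map AND sets it to true.
  static void stopHeartbeat(String id) {
    AtomicBoolean old = flags.remove(id);
    if (old != null) old.set(true);
  }

  // Mirrors startHeartbeat(): orphans any previously captured flag, then
  // computeIfAbsent creates a brand-new AtomicBoolean for the map.
  static void startHeartbeat(String id) {
    stopHeartbeat(id);
    getStoppedFlag(id).set(false);
  }

  public static void main(String[] args) {
    // Buggy wiring: capture the flag BEFORE startHeartbeat()
    AtomicBoolean captured = getStoppedFlag("stmt-1");
    startHeartbeat("stmt-1");
    System.out.println(captured.get());                 // true — orphaned; tick always skips the RPC
    // Fix option 3: look the flag up per tick instead of holding a captured reference
    System.out.println(getStoppedFlag("stmt-1").get()); // false — tick proceeds
  }
}
```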

// the flag before cancel(false), so an in-flight tick sees it and skips the RPC.
final java.util.concurrent.atomic.AtomicBoolean stopped = mgr.getStoppedFlag(statementId);

Runnable heartbeatTask =

[CRITICAL] Lambda strong-captures this — abandoned ResultSet keeps warehouse alive forever

This lambda invokes stopHeartbeat() (instance method, line 342, 373) and reads statementId (instance field, lines 336/340/352/353/358/367/369). Both implicitly capture this — the entire DatabricksResultSet, including executionResult (Arrow buffers, chunk providers, potentially MB of cached row data).

The future is held in ResultHeartbeatManager.activeHeartbeats for the connection's lifetime. So:

  • A user that does stmt.executeQuery(...).next() once and abandons the ResultSet reference (a real-world bug, but a JDBC driver shouldn't amplify it) will:
    • Never trigger next()→false or close() (the only auto-stop paths)
    • Have the entire ResultSet and its data retained until Connection.close() — typically hours in pooled environments
    • Have the heartbeat poll forever, holding the warehouse open and accumulating cost
  • This is the exact "cost forever" failure mode the design doc Requirements §3 explicitly tries to prevent.
  • It is also a denial-of-service amplifier: an app opening 10k orphaned result sets per hour holds 10k Arrow batches in heap until Connection.close().

The C# ADBC reference avoids this: its poller is per-statement with linked cancellation, so even GC of the statement helps. The Java implementation here is connection-scoped, so GC of the ResultSet alone won't help — the future keeps a hard reference back to the ResultSet.

Fix: Don't capture this. Pull statementId and mgr (or just Runnable stopFn = () -> mgr.stopHeartbeat(localStatementId)) into locals so the lambda has no implicit this reference. Verify with javap -p -c (no synthetic this$0 field on the lambda class) or a simple unit test that holds a WeakReference<DatabricksResultSet> and asserts it's collectable after the strong reference is dropped.
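A sketch of the suggested fix: a static factory copies what the task needs into locals, so the returned lambda holds no implicit `this` reference back to the ResultSet (all names hypothetical):

```java
// Sketch of the suggested fix: only locals are captured by the lambdas below,
// so no implicit this (and no ResultSet data) is retained by the scheduled task.
public class NoThisCaptureDemo {
  interface HeartbeatManager { void stopHeartbeat(String statementId); }

  // Static factory: there is no enclosing instance to capture.
  static Runnable buildHeartbeatTask(HeartbeatManager mgr, String statementIdField) {
    final String localStatementId = statementIdField;
    final Runnable stopFn = () -> mgr.stopHeartbeat(localStatementId);
    return () -> {
      // ... poll checkStatementAlive(localStatementId) here ...
      stopFn.run(); // terminal state: stop without touching the ResultSet instance
    };
  }

  public static void main(String[] args) {
    StringBuilder log = new StringBuilder();
    Runnable task = buildHeartbeatTask(id -> log.append("stopped:").append(id), "stmt-42");
    task.run();
    System.out.println(log); // stopped:stmt-42
  }
}
```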


try {
DatabricksConnection conn =
(DatabricksConnection) parentStatement.getStatement().getConnection();

[CRITICAL] Pooled connections (HikariCP, DBCP, DatabricksPooledConnection) silently get NO heartbeat

This direct cast (DatabricksConnection) parentStatement.getStatement().getConnection() will throw ClassCastException for any pooled connection wrapper:

  • DatabricksPooledConnection returns a JDK dynamic Proxy declaring Connection.class, IDatabricksConnectionInternal.class (see DatabricksPooledConnection.java:155-158) — not DatabricksConnection.
  • HikariCP returns HikariProxyConnection; DBCP returns PoolGuardConnectionWrapper — same story.

The exception is swallowed by the outer catch (Exception e) { LOGGER.debug(...) } at line 384-386 (and again at line 401-402 for stopHeartbeat). Result: users opt in to EnableHeartbeat=1 on the most common Java connection pool deployment, get no protection, and see no error — just a DEBUG line they have to enable to find.

Fix (one of):

  1. connection.unwrap(DatabricksConnection.class) — works through the proxy via IDatabricksConnectionInternal.
  2. Add getHeartbeatManager() to IDatabricksConnectionInternal so the pool proxy forwards it transparently.

Option 2 is cleaner and matches how the rest of the driver handles pooled access.
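A minimal model of fix option 1, showing how `unwrap()` reaches the real connection through a pool-style dynamic proxy where a direct cast would throw ClassCastException. `InnerConn` is a stand-in for DatabricksConnection, the proxy for the pool wrapper:

```java
import java.lang.reflect.Proxy;
import java.sql.SQLException;
import java.sql.Wrapper;

// Model of fix option 1: a pool-style dynamic proxy exposes only the Wrapper
// interface, so a direct cast to the concrete class fails, but unwrap() works.
public class UnwrapDemo {
  static class InnerConn implements Wrapper {  // stand-in for DatabricksConnection
    public <T> T unwrap(Class<T> iface) throws SQLException {
      if (iface.isInstance(this)) return iface.cast(this);
      throw new SQLException("not a wrapper for " + iface);
    }
    public boolean isWrapperFor(Class<?> iface) { return iface.isInstance(this); }
    String ping() { return "alive"; }
  }

  // Pool-style proxy (like DatabricksPooledConnection's JDK Proxy): forwards
  // every interface call to the real connection.
  static Wrapper pooledProxy(InnerConn real) {
    return (Wrapper) Proxy.newProxyInstance(
        Wrapper.class.getClassLoader(),
        new Class<?>[] {Wrapper.class},
        (proxy, method, methodArgs) -> method.invoke(real, methodArgs));
  }

  public static void main(String[] args) throws Exception {
    InnerConn real = new InnerConn();
    Wrapper pooled = pooledProxy(real);
    System.out.println(pooled instanceof InnerConn);           // false — direct cast would throw
    System.out.println(pooled.unwrap(InnerConn.class).ping()); // alive — unwrap goes through the proxy
  }
}
```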

this.cachedTelemetryCollector = resolveTelemetryCollector(parentStatement);
this.isClosed = false;
this.wasNull = false;
startHeartbeatIfEnabled();

[CRITICAL] Heartbeat never starts on Thrift result sets — feature is dead-on-arrival on the Thrift path

The Thrift constructor (this method, lines 153-196) does not call startHeartbeatIfEnabled(). Only the SEA constructor at line 127 does.

All Thrift result sets are constructed via DatabricksThriftAccessor (executeStatement, getStatementResult, etc.) using this constructor — so on a transportMode=thrift connection with EnableHeartbeat=1, the manager is created and the eligibility logic correctly returns true for THRIFT_INLINE / THRIFT_ARROW_ENABLED, but no heartbeat ever starts.

Per the design doc's eligibility table, Thrift inline (data only on cluster, server-evictable) is one of the most critical scenarios this feature is meant to cover. It's silently broken.

The eligibility tests in ResultSetHeartbeatEligibilityTest.testThriftInlineIsEligible / testThriftArrowIsEligible mock the instance via reflection and bypass the constructor entirely, so they pass while production reality is broken.

Fix: Add startHeartbeatIfEnabled(); at the end of this constructor (line 196). Add a real-constructor smoke test that builds a Thrift DatabricksResultSet via the production constructor and asserts mgr.getActiveHeartbeatCount() == 1.

statementSet.remove(statement);
}
if (heartbeatManager != null) {
heartbeatManager.shutdown();

[HIGH] heartbeatManager.shutdown() is skipped if any statement.close() throws — scheduler + thread leak

for (IDatabricksStatementInternal statement : statementSet) {
  statement.close(false);          // makes RPCs — can throw
  statementSet.remove(statement);
}
if (heartbeatManager != null) {
  heartbeatManager.shutdown();     // never reached on throw above
}

statement.close(false) issues a closeStatement RPC — any network/server error throws SQLException out of this loop. The heartbeatManager.shutdown() and session.close() calls below it are skipped, leaking:

  • The ScheduledExecutorService daemon thread (yes, daemon — but still leaks until JVM exit)
  • All scheduled futures and references they hold (see the this-capture issue on DatabricksResultSet.java:330-376)

Fix: Wrap in try/finally so heartbeatManager.shutdown() always runs. Also catch per-statement exceptions so the loop completes:

try {
  for (IDatabricksStatementInternal statement : statementSet) {
    try { statement.close(false); } catch (Exception e) {
      LOGGER.warn("Error closing statement: {}", e.getMessage());
    }
    statementSet.remove(statement);
  }
} finally {
  if (heartbeatManager != null) {
    heartbeatManager.shutdown();
  }
}


private static ResultHeartbeatManager createHeartbeatManager(
IDatabricksConnectionContext connectionContext) {
if (connectionContext instanceof DatabricksConnectionContext) {

[HIGH] instanceof DatabricksConnectionContext silently disables heartbeat for any other context impl

isHeartbeatEnabled() and getHeartbeatIntervalSeconds() live on the concrete class DatabricksConnectionContext, not on the IDatabricksConnectionContext interface. Any test mock, test double, or alternate implementation of IDatabricksConnectionContext falls through to return null — heartbeat silently disabled.

This pattern also makes the feature impossible to enable from any future context implementation (e.g., a wrapped/decorated context for telemetry or testing) without modifying this exact instanceof check.

Fix: Add the two methods to IDatabricksConnectionContext with default impls and drop the instanceof:

// IDatabricksConnectionContext.java
default boolean isHeartbeatEnabled() { return false; }
default int getHeartbeatIntervalSeconds() { return 60; }

Then this method becomes:

private static ResultHeartbeatManager createHeartbeatManager(IDatabricksConnectionContext ctx) {
  if (ctx.isHeartbeatEnabled()) {
    return new ResultHeartbeatManager(ctx.getHeartbeatIntervalSeconds());
  }
  return null;
}

* @param statementId statement to check status for
* @return true if the statement is still in a non-terminal state (alive), false if terminal
*/
default boolean checkStatementAlive(StatementId statementId) throws SQLException {

[HIGH] Default checkStatementAlive returns false — caller treats this as terminal and self-stops

The default return false is interpreted as "terminal state" by the heartbeat task at DatabricksResultSet.java:336-343:

boolean alive = client.checkStatementAlive(statementId);
...
if (!alive) {
  LOGGER.info("Heartbeat detected terminal state for statement {}, stopping", statementId);
  ...
  stopHeartbeat();
}

Any future IDatabricksClient implementation (test fakes, custom transports, third-party impls) that doesn't override this method will:

  1. Stop on the first heartbeat tick
  2. Emit a misleading INFO log saying the statement is in terminal state — when in reality the client doesn't support heartbeat

The two production impls do override, so this is academic for production today, but the default semantics are wrong and surprising.

Fix (one of):

  1. Make this method abstract (no default) — forces every IDatabricksClient implementer to deal with it explicitly.
  2. Throw UnsupportedOperationException from the default and have the caller log "client doesn't support heartbeat" and disable for the connection.
  3. Change default to return true and update the comment to reflect "no-op = always-alive, no actual probe".

Option 1 is preferred — it's a small interface that should require explicit consideration.

GetStatementRequest request = new GetStatementRequest().setStatementId(statementId);
Request req = new Request(Request.GET, getStatusPath, apiClient.serialize(request));
req.withHeaders(getHeaders("getStatement"));
GetStatementResponse response = apiClient.execute(req, GetStatementResponse.class);

[HIGH] No per-RPC timeout — HeartbeatRequestTimeoutSeconds is documented but never implemented

The design doc (docs/design/HEARTBEAT_KEEP_ALIVE.md:423) lists HeartbeatRequestTimeoutSeconds with default 30s. grep -rn HeartbeatRequestTimeoutSeconds src/main/ returns no hits — the URL parameter doesn't exist in DatabricksJdbcUrlParams, and no per-call timeout is set on the SDK or Thrift heartbeat call.

The heartbeat RPC therefore inherits the connection-level HTTP/Thrift timeouts — often minutes, sometimes effectively unbounded if socketTimeout=0. Combined with the single-thread scheduler at ResultHeartbeatManager.java:42, this has three concrete consequences:

  1. Single-point starvation: one stuck heartbeat blocks every other heartbeat on the connection — every other registered statement misses ticks → results expire while the warehouse is still being kept alive (wrong outcome on both axes).
  2. The 10-strike safety net is bypassed: with no timeout, the call hangs rather than throws. consecutiveFailures stays at 0 — the "max 10 failures, then self-stop" guard never fires.
  3. Connection.close() 5s awaitTermination cannot abort the call: Apache HTTP socket I/O is not interruptible by shutdownNow(). Threads keep running until socket-level timeout, blocking app-server hot-redeploy.

Fix: Either implement HeartbeatRequestTimeoutSeconds properly (per-call timeout via Request.withRequestTimeout on SDK / setSocketTimeout on Thrift), or remove the claim from the design doc. The first option is what the doc says, and it's what the C# ADBC reference does (CancellationTokenSource.CancelAfter(_requestTimeoutSeconds)).

"Starting heartbeat for statement {} with interval {}s", statementId, intervalSeconds);

ScheduledFuture<?> future =
scheduler.scheduleAtFixedRate(

[HIGH] Use scheduleWithFixedDelay instead of scheduleAtFixedRate — current code bursts on slow ticks

scheduleAtFixedRate semantics: if a tick takes longer than the interval (e.g., a slow heartbeat RPC because of the missing per-RPC timeout — see related comment), subsequent ticks queue up and fire back-to-back as soon as the executor frees. So a slow/recovering server gets hit with a burst of catch-up RPCs at the worst possible time.

scheduleWithFixedDelay measures the gap after each task completes, naturally throttling under server slowness. It's a one-line change and matches the C# ADBC reference (await Task.Delay(...) AFTER each poll completes).

// before
scheduler.scheduleAtFixedRate(heartbeatTask, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
// after
scheduler.scheduleWithFixedDelay(heartbeatTask, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);

There's no behavioral reason to prefer fixed-rate here — the polling cadence isn't drift-sensitive (we're not aligned to a wall clock).

return state != StatementState.CANCELED
&& state != StatementState.CLOSED
&& state != StatementState.FAILED;
} catch (IOException e) {

[HIGH] catch (IOException e) is too narrow — SDK runtime exceptions bypass the wrapping

apiClient.execute(...) does NOT only throw IOException. It also throws:

  • DatabricksException / DatabricksError — for HTTP 4xx/5xx responses (e.g., 401 token expired returns DatabricksError, not IOException)
  • RuntimeException — for serialization / NPE on malformed responses

These propagate uncaught past this try/catch. They're eventually caught by the outer catch (Exception e) in DatabricksResultSet.java:344, but:

  1. They bypass the wrapping into DatabricksSQLException(SDK_CLIENT_ERROR) — losing the structured error code surface.
  2. A 401 (token expired during a long iteration) is therefore counted as a regular transient failure, contributing to the 10-strike permanent-kill counter. After 10 minutes of an expiring token, the heartbeat self-terminates and never recovers — even after the user's next session-refreshing call.

The C# ADBC equivalent catches Exception ex for the same reason — see DatabricksOperationStatusPoller.cs:149.

Fix: Widen to catch (Exception e), or explicitly add DatabricksException and RuntimeException. Same goes for DatabricksThriftServiceClient.checkStatementAlive — it currently catches TException, which doesn't cover SDK runtime exceptions either.
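A sketch of the suggested widening — `checkAlive()` stands in for the `apiClient.execute(...)` status call, and the method names are hypothetical:

```java
import java.sql.SQLException;

// Sketch of the suggested fix: catch Exception (not just IOException) so SDK
// runtime exceptions are also wrapped into the structured SQLException surface.
public class WideCatchDemo {
  static boolean checkAlive() {
    // Simulates an SDK runtime exception, e.g. a 401 surfacing as a non-IOException.
    throw new IllegalStateException("HTTP 401: token expired");
  }

  static boolean checkStatementAliveWrapped() throws SQLException {
    try {
      return checkAlive();
    } catch (Exception e) { // widened from IOException: catches SDK runtime exceptions too
      throw new SQLException("heartbeat status check failed", e);
    }
  }

  public static void main(String[] args) {
    try {
      checkStatementAliveWrapped();
    } catch (SQLException e) {
      System.out.println(e.getCause().getMessage()); // HTTP 401: token expired
    }
  }
}
```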

if (statementId != null) {
ResultHeartbeatManager mgr = connection.getHeartbeatManager();
if (mgr != null) {
mgr.stopHeartbeat(statementId);

[HIGH] Statement.cancel() does not stop the heartbeat

This cancel() calls cancelStatement on the server but does not call mgr.stopHeartbeat(statementId). Only close() (line 175-181) and resetForNewExecution() (line 982-988) clear the heartbeat.

After cancel() returns, the heartbeat keeps polling against a cancelled operation. In the happy path the server returns CANCELED_STATE and the heartbeat task self-stops on the terminal-state check — fine. But if there's a race or "operation not found" before the server registers the cancel, those errors count as transient failures, churning the 10-strike counter and emitting WARN/INFO log noise for up to ~10 minutes after a successful cancel.

Fix: Add a heartbeat stop to cancel(), mirroring the pattern in close():

public void cancel() throws SQLException {
  ...
  if (statementId != null) {
    ResultHeartbeatManager mgr = connection.getHeartbeatManager();
    if (mgr != null) {
      mgr.stopHeartbeat(statementId);
    }
  }
  this.connection.getSession().getDatabricksClient().cancelStatement(statementId);
  ...
}

@msrathore-db

Code Review Squad — Critical findings

I ran a multi-perspective AI review (security, architecture, language, ops, performance, tests, maintainability, agent-compat, devil's advocate) and verified the findings below by reading the actual code on this branch (commit a51bccc). I posted 11 inline comments at specific file:line locations — please address each thread directly.

Most important: the feature does not work as written

Two independent verified bugs make the heartbeat permanently no-op on every code path:

  1. Orphan-flag bug — DatabricksResultSet.startHeartbeatIfEnabled captures mgr.getStoppedFlag(statementId) at line 328 before calling mgr.startHeartbeat(...) at line 378. Inside startHeartbeat, the very first thing it does is stopHeartbeat(statementId), which removes that AtomicBoolean from the map and sets it to true; then getStoppedFlag(...).set(false) computeIfAbsents a brand-new AtomicBoolean. The closure now references the orphaned true-forever flag, so the if (stopped.get()) return; guard fires on every tick and client.checkStatementAlive(...) is never called.
  2. Thrift constructor never wires it — only the SEA constructor calls startHeartbeatIfEnabled(). The Thrift constructor at lines 153-196 doesn't, so the entire Thrift path (the protocol the design doc and ADBC reference prioritize) gets nothing even if (1) is fixed.

The integration test passes only because real warehouses don't actually expire results in 15s, so the absence of heartbeats isn't observed. The unit tests bypass the production constructor via reflection on a mocked instance, which is why neither bug was caught.

What I posted as inline comments

Critical (4):

  • Orphan-flag bug → DatabricksResultSet.java:328
  • Thrift constructor missing wiring → DatabricksResultSet.java:127 (referencing the missing call at :196)
  • Lambda strong-captures this (memory leak / cost-forever for abandoned ResultSets) → DatabricksResultSet.java:330
  • Pooled connections (HikariCP, DBCP, DatabricksPooledConnection) silently get no heartbeat due to direct (DatabricksConnection) cast → DatabricksResultSet.java:315

High (7):

  • Connection.close() skips heartbeatManager.shutdown() if any statement.close() throws → DatabricksConnection.java:443
  • instanceof DatabricksConnectionContext silently disables heartbeat for any other context impl → DatabricksConnection.java:70
  • Default IDatabricksClient.checkStatementAlive returns false → caller treats as terminal and self-stops → IDatabricksClient.java:116
  • HeartbeatRequestTimeoutSeconds documented in design doc but never implemented; no per-RPC timeout on heartbeat → DatabricksSdkClient.java:415
  • scheduleAtFixedRate queues bursts on slow ticks; should be scheduleWithFixedDelay → ResultHeartbeatManager.java:72
  • catch (IOException e) is too narrow — SDK runtime exceptions (e.g., 401 → DatabricksError) bypass wrapping and feed the 10-strike permanent-kill counter → DatabricksSdkClient.java:421
  • Statement.cancel() does not stop the heartbeat → DatabricksStatement.java:179

Other concerns worth addressing (not posted inline to keep this thread focused)

  • No telemetry / metrics for heartbeat operations — IDatabricksClient.checkStatementAlive lacks @DatabricksMetricsTimed; no poll_success/poll_error/max_failures_reached events. C# ADBC has all of these. Fleet-wide "all heartbeats failing" is undetectable.
  • Single-thread scheduler per connection at ResultHeartbeatManager.java:42 — 1000 pooled connections = 1000 OS threads. C# uses a shared ThreadPool. Combined with the missing per-RPC timeout, one stuck heartbeat blocks all others on the connection.
  • Hard-coded maxConsecutiveFailures = 10 inside the lambda at DatabricksResultSet.java:322 — should be a centralized constant or URL param.
  • getHeartbeatIntervalSeconds() validation — non-numeric URL value crashes connection open (raw NumberFormatException); <= 0 silently coerces to default. Use validateAndParsePositiveInteger.
  • Thread name lacks connectionId — design doc promised databricks-jdbc-heartbeat-{connectionId}; code at ResultHeartbeatManager.java:44 is just databricks-jdbc-heartbeat.
  • HeartbeatIntegrationTest will fail (not skip) without env vars; @Tag("e2e") is not excluded by pom.xml surefire config; assertions don't verify behavior (sleep + "no exception thrown").
  • Test coverage gaps: zero unit tests for checkStatementAlive on either client; no test exercises the 10-failure self-stop, terminal-state self-stop, or null-state branch; reflection-based eligibility tests bypass the constructor (which is why the Thrift wiring bug went undetected).
  • Design doc drift: HeartbeatTask/SeaHeartbeatTask/ThriftHeartbeatTask interfaces (design doc lines 285-315) — never built. HeartbeatRequestTimeoutSeconds parameter (line 423) — not implemented. "Retry once after 30s" (line 480) — no such path exists.

Recommendation

This PR is not safe to merge until at least the four critical bugs (especially the orphan-flag one, which makes the entire feature dead code) are fixed and there are real tests that would have caught them — i.e., a test that actually verifies client.checkStatementAlive is invoked through the production wiring on both SEA and Thrift paths.


Feedback? Drop it in #code-review-squad-feedback.
