Telemetry: stop double-retrying, fix user-agent, tune circuit breaker#354

Open
samikshya-db wants to merge 1 commit into main from telemetry-429-circuit-breaker-and-user-agent

@samikshya-db (Collaborator) commented May 5, 2026

Summary

After v1.11.0 enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on /telemetry-ext. Three issues compounded:

  1. Double-retry. The telemetry exporter ran its own retry loop on top of the retryablehttp-wrapped HTTP client (internal/client.RetryableClient), which already retries 429/5xx with Retry-After. Result: up to RetryMax × (MaxRetries+1) HTTP attempts per export, all collapsed into a single circuit-breaker outcome — so the breaker barely opened against persistent throttling.
  2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs sent no User-Agent, so 429s landed in access logs tagged as Go-http-client/1.1 and could not be attributed to godatabrickssqlconnector by driver version.
  3. High request volume. FlushInterval=5s / BatchSize=100.
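
To make the amplification in item 1 concrete, here is a minimal sketch that layers an outer retry loop over an inner retrying client and counts attempts against an endpoint that always throttles. The names `innerDo`, `exportWithOuterRetry`, and `countAttempts` are hypothetical, and the inner client is modeled simply as "a fixed number of attempts per call"; it does not reproduce the driver's actual retry configuration.

```go
package main

import (
	"errors"
	"fmt"
)

var errThrottled = errors.New("429 Too Many Requests")

// innerDo stands in for a retrying HTTP client: it makes up to
// innerAttempts attempts per call and reports only the final result.
func innerDo(innerAttempts int, attempt func() error) error {
	var err error
	for i := 0; i < innerAttempts; i++ {
		if err = attempt(); err == nil {
			return nil
		}
	}
	return err
}

// exportWithOuterRetry stands in for the old exporter loop: one initial
// call plus up to maxRetries further calls, each going through innerDo.
func exportWithOuterRetry(maxRetries, innerAttempts int, attempt func() error) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = innerDo(innerAttempts, attempt); err == nil {
			return nil
		}
	}
	return err
}

// countAttempts returns how many HTTP attempts a single export makes
// against an endpoint that always throttles.
func countAttempts(maxRetries, innerAttempts int) int {
	n := 0
	_ = exportWithOuterRetry(maxRetries, innerAttempts, func() error {
		n++
		return errThrottled
	})
	return n
}

func main() {
	// Illustrative values only.
	fmt.Println("layered retries:", countAttempts(3, 4)) // 4 * (3+1) attempts
	fmt.Println("outer loop gone:", countAttempts(0, 4)) // 4 * (0+1) attempts
}
```

In both cases the caller sees a single error, which is why the breaker received only one signal for many throttled requests.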

Changes

Retry behavior

  • telemetry/exporter.go — Removed doExport's retry loop entirely. doExport now makes a single HTTP request; transient retries (429/5xx, Retry-After) are owned by the underlying retryablehttp client. Each export() call now corresponds to exactly one HTTP transaction = one breaker outcome.
  • telemetry/config.go, telemetry/driver_integration.go — Removed MaxRetries / RetryDelay from telemetry.Config and TelemetryInitOptions. telemetry_retry_count / telemetry_retry_delay DSN params still parse without error for backwards compatibility but are no-ops.
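
One way to keep deprecated DSN parameters parsing cleanly while making them no-ops is sketched below. `splitDeprecated` is a hypothetical helper, not the driver's parser; only the two parameter names are taken from this PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// deprecatedTelemetryParams lists DSN keys that are still accepted but no
// longer change behavior. The two names come from the PR description.
var deprecatedTelemetryParams = []string{
	"telemetry_retry_count",
	"telemetry_retry_delay",
}

// splitDeprecated (hypothetical helper) parses a DSN query string and
// returns the parameters that still take effect, plus the names of any
// deprecated parameters that were present and dropped.
func splitDeprecated(rawQuery string) (url.Values, []string, error) {
	params, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, nil, err
	}
	var ignored []string
	for _, name := range deprecatedTelemetryParams {
		if params.Has(name) {
			ignored = append(ignored, name)
			params.Del(name) // accepted for compatibility, then discarded
		}
	}
	return params, ignored, nil
}

func main() {
	effective, ignored, err := splitDeprecated(
		"enableTelemetry=true&telemetry_retry_count=5&telemetry_retry_delay=200ms")
	if err != nil {
		panic(err)
	}
	fmt.Println("effective:", effective.Encode())
	fmt.Println("ignored:  ", ignored)
}
```

Returning the ignored names lets a caller log a deprecation notice without failing the connection.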

Identifiability

  • connector.go — New buildUserAgent helper mirroring internal/client/client.go:295-302 exactly: DriverName/DriverVersion + optional UserAgentEntry + agent product.
  • telemetry/exporter.go, telemetry/featureflag.go — Set User-Agent on telemetry POST and feature-flag GET. Plumbed via TelemetryInitOptions.UserAgent.
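
A rough sketch of the User-Agent plumbing follows. The exact string layout in internal/client/client.go is not shown in this PR, so the format below (`name/version`, then an optional parenthesized entry, then an optional agent product) is an assumption, and this `buildUserAgent` signature is hypothetical. A fake `http.RoundTripper` captures the header so the example needs no network.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// buildUserAgent composes the header value. The layout is an assumed
// format for illustration, not a copy of the driver's implementation.
func buildUserAgent(driverName, driverVersion, userAgentEntry, agentProduct string) string {
	ua := driverName + "/" + driverVersion
	if e := strings.TrimSpace(userAgentEntry); e != "" {
		ua += " (" + e + ")"
	}
	if a := strings.TrimSpace(agentProduct); a != "" {
		ua += " " + a
	}
	return ua
}

// roundTripFunc adapts a function to http.RoundTripper for in-memory tests.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// sentUserAgent issues one POST with the given User-Agent through a fake
// transport and returns the header value the transport observed.
func sentUserAgent(ua string) (string, error) {
	var seen string
	client := &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("User-Agent")
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("")),
			Request:    r,
		}, nil
	})}
	// Placeholder host; only the /telemetry-ext path comes from the PR.
	req, err := http.NewRequest(http.MethodPost, "http://telemetry.invalid/telemetry-ext", strings.NewReader("{}"))
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", ua)
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	ua := buildUserAgent("godatabrickssqlconnector", "1.11.0", "my-app", "")
	seen, err := sentUserAgent(ua)
	if err != nil {
		panic(err)
	}
	fmt.Println(seen)
}
```

Building the string once at the connector and passing it down keeps the telemetry and feature-flag requests in step with the main client's header.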

Cadence and breaker tuning

  • telemetry/config.go: FlushInterval 5s → 30s, BatchSize 100 → 200.
  • telemetry/circuitbreaker.go: minimumNumberOfCalls 20 → 10 (so low-traffic clients can trip the breaker now that each export is one signal), waitDurationInOpenState 30s → 60s (respect typical Retry-After).
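
The effect of the two breaker settings can be illustrated with a deliberately simplified breaker. This is a sketch, not the code in telemetry/circuitbreaker.go: the 50% failure-rate threshold is an assumption, there is no sliding window or half-open state, and the `breaker` type and `failuresUntilOpen` helper are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// breaker is a simplified count-based circuit breaker.
type breaker struct {
	minimumNumberOfCalls    int
	failureRateThreshold    float64 // assumed value; not stated in the PR
	waitDurationInOpenState time.Duration

	calls, failures int
	openUntil       time.Time
}

// Allow reports whether a call may proceed at the given time.
func (b *breaker) Allow(now time.Time) bool {
	return !now.Before(b.openUntil)
}

// Record registers one export outcome and opens the breaker once enough
// calls have been seen and the failure rate reaches the threshold.
func (b *breaker) Record(now time.Time, success bool) {
	b.calls++
	if !success {
		b.failures++
	}
	if b.calls < b.minimumNumberOfCalls {
		return // not enough signal yet
	}
	if float64(b.failures)/float64(b.calls) >= b.failureRateThreshold {
		b.openUntil = now.Add(b.waitDurationInOpenState)
		b.calls, b.failures = 0, 0
	}
}

// failuresUntilOpen feeds consecutive failed exports, one per 30s flush,
// and returns how many were needed before the breaker rejected a call.
func failuresUntilOpen(minCalls int) int {
	b := &breaker{
		minimumNumberOfCalls:    minCalls,
		failureRateThreshold:    0.5,
		waitDurationInOpenState: 60 * time.Second,
	}
	now := time.Unix(0, 0)
	n := 0
	for b.Allow(now) {
		b.Record(now, false)
		n++
		now = now.Add(30 * time.Second)
	}
	return n
}

func main() {
	fmt.Println("old minimumNumberOfCalls=20:", failuresUntilOpen(20))
	fmt.Println("new minimumNumberOfCalls=10:", failuresUntilOpen(10))
}
```

Because each export now produces exactly one outcome, halving the minimum halves the number of failed exports a quiet client must accumulate before the breaker reacts.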

Tests

  • Removed obsolete retry/backoff tests (TestExport_RetryOn5xx, TestExport_ExponentialBackoff, TestIsRetryableStatus, retry-config parsing tests).
  • Added TestExport_SingleAttemptPerExport covering 4xx/429/5xx, asserting the exporter never retries.
  • Added TestExport_SetsUserAgent and TestFetchFeatureFlag_SetsUserAgent.
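
The single-attempt property can be checked with a fake transport that counts requests, roughly as sketched here. `exportOnce` and `attemptsFor` are hypothetical stand-ins, not the exporter or the tests added in this PR, and the fake transport sits below any retrying layer, so it only verifies that the export code itself does not loop.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripFunc adapts a function to http.RoundTripper for in-memory tests.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// exportOnce (hypothetical) sends one POST and treats any non-2xx status
// as a failure, without retrying.
func exportOnce(client *http.Client, endpoint, payload string) error {
	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("telemetry export failed: HTTP %d", resp.StatusCode)
	}
	return nil
}

// attemptsFor returns how many requests reach the transport for one
// exportOnce call when every response carries the given status.
func attemptsFor(status int) (int, error) {
	attempts := 0
	client := &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		attempts++
		return &http.Response{
			StatusCode: status,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("")),
			Request:    r,
		}, nil
	})}
	// Placeholder host; only the /telemetry-ext path comes from the PR.
	err := exportOnce(client, "http://telemetry.invalid/telemetry-ext", "{}")
	return attempts, err
}

func main() {
	for _, status := range []int{400, 429, 500, 503} {
		n, err := attemptsFor(status)
		fmt.Printf("status=%d attempts=%d err=%v\n", status, n, err)
	}
}
```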

Mitigation

While this rolls out, users can opt out via DSN: enableTelemetry=false. Server-side: disable enableTelemetryForGoDriver for affected workspaces.
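
For the client-side opt-out, the parameter can be appended to an existing DSN along these lines. `withTelemetryDisabled` is a hypothetical helper, and the host, path, token, and `catalog` parameter in the example DSN are placeholders; only `enableTelemetry=false` comes from this PR.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withTelemetryDisabled (hypothetical helper) returns the DSN with
// enableTelemetry=false set, preserving any other query parameters.
func withTelemetryDisabled(dsn string) (string, error) {
	base, rawQuery, _ := strings.Cut(dsn, "?")
	params, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", err
	}
	params.Set("enableTelemetry", "false")
	return base + "?" + params.Encode(), nil
}

func main() {
	// Placeholder DSN for illustration only.
	dsn := "token:REDACTED@example.cloud.databricks.com:443/sql/1.0/warehouses/abc?catalog=main"
	out, err := withTelemetryDisabled(dsn)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```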

Test plan

  • go test ./... — all green locally.
  • Exporter never retries (4xx, 429, 500, 503).
  • User-Agent set on telemetry POST and feature-flag GET.
  • Verify in Lumberjack post-deploy that /telemetry-ext and /api/2.0/connector-service/feature-flags/GOLANG/... requests carry godatabrickssqlconnector/<version> in http_user_agent.
  • Confirm 429 rate against /telemetry-ext drops after rollout.

This pull request and its description were written by Isaac.

@samikshya-db changed the title from "Stop telemetry 429s from amplifying load and identify telemetry traffic" to "Stop telemetry 429s from retrying with backoff only for telemetry endpoint + fix userAgent on telemetry" on May 5, 2026
@samikshya-db force-pushed the telemetry-429-circuit-breaker-and-user-agent branch from 1300f1a to fbb0263 on May 5, 2026 10:19

After v1.11.0 enabled telemetry by default via the server feature flag,
high-QPS workloads produced excessive 429s on /telemetry-ext. Three
issues compounded:

1. Double-retry. The exporter ran its own retry loop on top of the
   retryablehttp-wrapped HTTP client (internal/client.RetryableClient),
   which already retries 429/5xx with Retry-After. Result: up to
   RetryMax * (MaxRetries+1) HTTP attempts per export, all collapsed
   into one circuit-breaker outcome — so the breaker barely opened.
2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs
   sent no User-Agent, so 429s were tagged Go-http-client/1.1 and
   could not be attributed to godatabrickssqlconnector by version.
3. High request volume. FlushInterval=5s, BatchSize=100.

Changes:

- telemetry/exporter.go: drop the retry loop entirely. doExport now
  makes a single HTTP request; transient retries (429/5xx, Retry-After)
  are owned by the underlying retryablehttp client. Each export call
  → exactly one breaker outcome.
- telemetry/exporter.go, telemetry/featureflag.go: set User-Agent
  header on telemetry POST and feature-flag GET. Built once at the
  connector site (buildUserAgent in connector.go), mirroring
  internal/client/client.go format
  (DriverName/DriverVersion + UserAgentEntry + agent product), and
  plumbed via TelemetryInitOptions.UserAgent.
- telemetry/config.go: FlushInterval 5s → 30s, BatchSize 100 → 200.
  Remove MaxRetries/RetryDelay from telemetry.Config and
  TelemetryInitOptions; telemetry_retry_count/_delay DSN params still
  parse for backwards compat but are no-ops.
- telemetry/circuitbreaker.go: lower minimumNumberOfCalls 20 → 10
  (so low-traffic clients can still trip the breaker on a sustained
  outage now that each export is one signal), and raise
  waitDurationInOpenState 30s → 60s (respect typical Retry-After).
- Tests: removed obsolete retry/backoff tests; added single-attempt
  assertion across 4xx/429/5xx; added User-Agent assertions on both
  endpoints.

Co-authored-by: Isaac

@samikshya-db force-pushed the telemetry-429-circuit-breaker-and-user-agent branch from fbb0263 to d47bf99 on May 5, 2026 10:40
@samikshya-db changed the title from "Stop telemetry 429s from retrying with backoff only for telemetry endpoint + fix userAgent on telemetry" to "Telemetry: stop double-retrying, identify traffic, tune breaker" on May 5, 2026
@samikshya-db changed the title from "Telemetry: stop double-retrying, identify traffic, tune breaker" to "Telemetry: stop double-retrying, fix user-agent, tune circuit breaker" on May 5, 2026