Telemetry: stop double-retrying, fix user-agent, tune circuit breaker #354
Open
samikshya-db wants to merge 1 commit into main
After v1.11.0 enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on /telemetry-ext. Three issues compounded:

1. Double-retry. The exporter ran its own retry loop on top of the retryablehttp-wrapped HTTP client (internal/client.RetryableClient), which already retries 429/5xx with Retry-After. Result: up to RetryMax * (MaxRetries+1) HTTP attempts per export, all collapsed into one circuit-breaker outcome, so the breaker barely opened.
2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs sent no User-Agent, so 429s were tagged Go-http-client/1.1 and could not be attributed to godatabrickssqlconnector by version.
3. High request volume. FlushInterval=5s, BatchSize=100.

Changes:

- telemetry/exporter.go: drop the retry loop entirely. doExport now makes a single HTTP request; transient retries (429/5xx, Retry-After) are owned by the underlying retryablehttp client. Each export call → exactly one breaker outcome.
- telemetry/exporter.go, telemetry/featureflag.go: set User-Agent header on telemetry POST and feature-flag GET. Built once at the connector site (buildUserAgent in connector.go), mirroring internal/client/client.go format (DriverName/DriverVersion + UserAgentEntry + agent product), and plumbed via TelemetryInitOptions.UserAgent.
- telemetry/config.go: FlushInterval 5s → 30s, BatchSize 100 → 200. Remove MaxRetries/RetryDelay from telemetry.Config and TelemetryInitOptions; telemetry_retry_count/_delay DSN params still parse for backwards compat but are no-ops.
- telemetry/circuitbreaker.go: lower minimumNumberOfCalls 20 → 10 (so low-traffic clients can still trip the breaker on a sustained outage now that each export is one signal), and raise waitDurationInOpenState 30s → 60s (respect typical Retry-After).
- Tests: removed obsolete retry/backoff tests; added single-attempt assertion across 4xx/429/5xx; added User-Agent assertions on both endpoints.

Co-authored-by: Isaac
Summary

After `v1.11.0` enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on `/telemetry-ext`. Three issues compounded:

1. Double-retry. The exporter ran its own retry loop on top of the retryablehttp-wrapped HTTP client (`internal/client.RetryableClient`), which already retries 429/5xx with `Retry-After`. Result: up to `RetryMax × (MaxRetries+1)` HTTP attempts per export, all collapsed into a single circuit-breaker outcome, so the breaker barely opened against persistent throttling.
2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs sent no `User-Agent`, so 429s landed in access logs tagged as `Go-http-client/1.1` and could not be attributed to `godatabrickssqlconnector` by driver version.
3. High request volume. `FlushInterval=5s` / `BatchSize=100`.

Changes
Retry behavior

- `telemetry/exporter.go`: Removed `doExport`'s retry loop entirely. `doExport` now makes a single HTTP request; transient retries (429/5xx, `Retry-After`) are owned by the underlying retryablehttp client. Each `export()` call now corresponds to exactly one HTTP transaction and therefore one breaker outcome.
- `telemetry/config.go`, `telemetry/driver_integration.go`: Removed `MaxRetries`/`RetryDelay` from `telemetry.Config` and `TelemetryInitOptions`. The `telemetry_retry_count`/`telemetry_retry_delay` DSN params still parse without error for backwards compatibility but are no-ops.

Identifiability
- `connector.go`: New `buildUserAgent` helper mirroring `internal/client/client.go:295-302` exactly: `DriverName/DriverVersion` + optional `UserAgentEntry` + agent product.
- `telemetry/exporter.go`, `telemetry/featureflag.go`: Set `User-Agent` on telemetry POST and feature-flag GET. Plumbed via `TelemetryInitOptions.UserAgent`.

Cadence and breaker tuning
- `telemetry/config.go`: `FlushInterval` 5s → 30s, `BatchSize` 100 → 200.
- `telemetry/circuitbreaker.go`: `minimumNumberOfCalls` 20 → 10 (so low-traffic clients can trip the breaker now that each export is one signal), `waitDurationInOpenState` 30s → 60s (respect typical `Retry-After`).

Tests
- Removed obsolete retry/backoff tests (`TestExport_RetryOn5xx`, `TestExport_ExponentialBackoff`, `TestIsRetryableStatus`, retry-config parsing tests).
- Added `TestExport_SingleAttemptPerExport` covering 4xx/429/5xx, asserting the exporter never retries.
- Added `TestExport_SetsUserAgent` and `TestFetchFeatureFlag_SetsUserAgent`.

Mitigation
While this rolls out, users can opt out via DSN: `enableTelemetry=false`. Server-side: disable `enableTelemetryForGoDriver` for affected workspaces.

Test plan
- `go test ./...`: all green locally.
- `/telemetry-ext` and `/api/2.0/connector-service/feature-flags/GOLANG/...` requests carry `godatabrickssqlconnector/<version>` in `http_user_agent`.
- 429 rate on `/telemetry-ext` drops after rollout.

This pull request and its description were written by Isaac.