Support opentelemetry observability phase3#31
Merged
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements Phase 3 of the observability framework, adding per-attempt CLIENT spans for proxied requests, INTERNAL spans for IdP introspection, and per-message WebSocket spans. It introduces a centralized metrics catalog for §7 instruments and a graceful shutdown helper for route handlers. Feedback indicates that trace propagation for IdP introspection is incorrectly gated by the sampling state, which should be decoupled to maintain trace continuity.
Owner
Author
|
LGTM |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Phase 3 — OpenTelemetry observability follow-up
Implements Groups 2–7 of the original Phase 2 plan that were carried over to a follow-up PR after the foundation (Group 1 + Group 8) shipped in PR #29. Together with Phase 1 (PR #26) and Phase 2 (PR #29), the gateway now produces a complete inbound + auth + proxy + WebSocket trace tree with a §7-aligned metrics catalog, plus the operational hooks (self-handler shutdown, sampler self-noise) needed for production rollout.
Summary
Six capability groups + a polish pass + a doc/CI sweep. Every behavior is gated either by a config knob (default OFF where load-shape-sensitive) or by the existing
observability.enabledmaster switch — no behavior change for operators who haven't enabled observability.Test surface:
1336/1336 passon a cleanmakerebuild. Six new test suites land (obs_proxy_client,obs_auth_trace,obs_self_handler,obs_catalog,obs_kill_marshal,obs_ws_messages) covering happy path, edge cases, race conditions, and config round-trip.What's in this PR
Group 2 — Per-attempt CLIENT span on proxy
Each upstream attempt for a proxied request gets its own CLIENT span. Retries produce distinct spans linked to the same SERVER parent; the per-attempt span_id is stamped into the outbound
traceparentso each attempt is independently identifiable in the collector.error.typeis set on retry-rejecting and terminal outcomes.ProxyTransaction::AttemptCheckoutallocates the fresh CLIENT span; header rewrite + serialization moved here fromStart()so retries inject the new span_id.obs_manager(),inbound_span(),RebuildOutboundTraceHeaders(),FinalizeAttemptSpan(status, error_type),ErrorTypeForResult(int)(RESULT_* → string).OnResponseComplete(status),DeliverTerminalError(error_type),MaybeRetry(prev outcome before next attempt_++),Cancel(client_disconnect).obs_proxy_client(4 tests — successful proxy, upstream 5xx, retry attempts, observability disabled).Group 3 — Auth-path traceparent +
auth.idp_checkINTERNAL spanWhen
traces.auth_idp_span = true(default, live-reloadable), every deferred IdP introspection POST is wrapped by an INTERNAL span parented at the SERVER span. Setting it tofalsefalls back to recordingauth.pending_start/auth.pending_endevents on the SERVER span — useful when collector cardinality is a concern.traces.auth_idp_span(default true).IssueTraceContextgainsconst Propagator* propagatorfield.AuthManagerconstructor extended withObservabilityManager*;MakeIntrospectionDoneCallbackends the span / emits the pending_end event beforestate->Complete.IntrospectionClient::Verifygains an optional 13th param (std::optional<IssueTraceContext>) carried through to the upstream request.HttpRouter::AsyncPendingStatecarriesauth_idp_check_span,emit_pending_end_event,issue_ctx,inbound_server_spanfor the deferred resume.auth_upstream_http_client.cc::ApplyOutboundTraceContextstrips the W3C+Jaeger union (case-insensitive) before injecting via the bound propagator (or W3C fallback).test/mock_introspection_server.h) gainsreceived_header(name)for header-injection assertions.obs_auth_trace(3 tests).Group 4 — Self-handler shutdown helper
Replaces the broken pattern of calling
HttpServer::Stop()synchronously from inside a route handler (which deadlocks the dispatcher's drain barrier).HttpServer::ScheduleStopAfterCurrentResponse()— idempotent CAS viastop_scheduled_. SchedulesStop()throughNetServer::EnQueueOnConnDispatcher.Stop()runs onconn_dispatcher_so the drain barrier engages cleanly against the natural decrement ofactive_requests_.Stop()from a handler remains documented-undefined; the helper is the only supported pattern.obs_self_handler(3 tests including a sibling-in-flight drain test).Group 6 — Full §7 metrics catalog
Programmatic registration of every §7.1–§7.4 instrument at boot, owned by the manager (lifetime-safe across test create/destroy cycles).
MetricsCatalogstruct (include/observability/metrics_catalog.h) with Counter / UpDownCounter / Histogram pointers for every catalogued instrument.MetricsCatalog::Build(ObservabilityManager&, MetricsCatalog& out)registers them with the manager's MeterProvider; called once fromObservabilityManager::Init().active_requestsUpDownCounter (incremented at request entry, decremented inOnFinalizeWinner).reactor.otel.snapshots_killed_on_timeoutonce per un-finalized survivor.kBytesBuckets,kLatencyBuckets,kTokensBuckets.obs_catalog(3 tests).Group 5 — Kill-loop invariant guards
Documents and enforces the existing FinalizeFromSnapshot CAS contract via a regression suite. The
kill_marshals_in_flight_counter stays at 0 (RESERVED) — kept inWaitForAllAsyncDrain's predicate for forward-compat with a future per-dispatcher EnQueue marshal.kill_marshals_in_flight_field-declaration docstring with the RESERVED contract + safety invariant.obs_kill_marshal(3 tests):kill_marshals_in_flight_stays at 0 before AND after kill (RESERVED contract).reactor.otel.snapshots_killed_on_timeoutCounter delta = N for un-finalized survivors.Group 7 — Polish
7.1 — Sampler self-noise auto-derivation.
ConfigLoader::LoadFromStringcalls a newApplySamplerSelfNoiseDefaultshelper that auto-appendsalways_offroutes for the gateway's own probes:/healthand/statsunconditionally.prometheus.pathwhenmetrics.exporter == "prometheus_pull".7.2 —
metrics.prometheus.pathreload-warn.ObservabilityManager::Reloadnow detects path changes between live and staged config and warns ("restart to apply"); the liveconfig_.metrics.prometheus.pathis NOT mutated. Mirrors the pattern fortraces.exporter,metrics.exporter.7.3 — Per-message WebSocket spans.
traces.websocket_messages(defaultfalse).ObservabilityManager::websocket_messages_enabled_.WebSocketConnectiongainsSetObservabilitySnapshot+MaybeEmitMessageSpan.ws.recvon FIN-true Text/Binary AND on the final continuation frame.ws.sendfromSendText/SendBinary.HttpConnectionHandler.Still deferred (follow-up PRs)
connections.active/connections.totalneed protocol-label plumbing across 6+ inbound/outbound sites).max_value_cardinality_per_labelSIGHUP — Phase 1 deferred, still restart-only.OtlpHttpExporter::traces.otlp.upstream— Phase 1 deferred.Tests
New suites
obs_proxy_clienttest/observability_proxy_client_test.hobs_auth_tracetest/observability_auth_trace_test.hauth.idp_checkallocated with parent + outcome,auth_idp_span=falsefalls back to eventsobs_self_handlertest/observability_self_handler_test.hobs_catalogtest/observability_catalog_test.hobs_kill_marshaltest/observability_kill_marshal_test.hobs_ws_messagestest/observability_ws_messages_test.htraces.websocket_messagesgate, control-frame skip, fragmented-message single span at FIN, install-once rebind rejectVerification
CI workflow updates (
.github/workflows/)ci.yml—build-linux-tsan-restenumeration extended with all 6 new obs suites. macOS subset (build-macos) extended with the 4 socket-using ones (obs_self_handler,obs_proxy_client,obs_auth_trace,obs_ws_messages).weekly-valgrind.yml— extended with 4 memory-safety candidates (obs_self_handler,obs_proxy_client,obs_auth_trace,obs_catalog);obs_kill_marshalskipped (timing-sensitive concurrent test, valgrind interpreter would mask races).Documentation
docs/observability.md— new "Phase 3 features" section explaining each capability for operators; two new rows in the traces field reference (auth_idp_span,websocket_messages); pruned "Out of scope" list to only-still-deferred items..codex/mirror — every shared doc updated to match.Files changed
Test plan
build-macos)./test_runner obs_proxy_client— 4/4./test_runner obs_auth_trace— 3/3./test_runner obs_self_handler— 3/3./test_runner obs_catalog— 3/3./test_runner obs_kill_marshal— 3/3./test_runner obs_ws_messages— 6/6make clean && make -j4 && ./test_runner— 1336/1336error.typeon retries,auth.idp_checkshows up as a child of inbound SERVER,/healthand/metricsproduce no spanstraces.websocket_messages = trueagainst a WS endpoint and verifyws.recv/ws.sendspans appear withws.opcode+ws.payload_sizetraces.auth_idp_span,traces.websocket_messages,metrics.prometheus.pathproduce expected behavior (live flip vs. restart-only warn)