Support opentelemetry observability phase3 #31

Merged: mwfj merged 16 commits into main from support-opentelemetry-observability-phase3 on May 11, 2026

@mwfj mwfj commented May 10, 2026

Phase 3 — OpenTelemetry observability follow-up

Implements Groups 2–7 of the original Phase 2 plan that were carried over to a follow-up PR after the foundation (Group 1 + Group 8) shipped in PR #29. Together with Phase 1 (PR #26) and Phase 2 (PR #29), the gateway now produces a complete inbound + auth + proxy + WebSocket trace tree with a §7-aligned metrics catalog, plus the operational hooks (self-handler shutdown, sampler self-noise) needed for production rollout.


Summary

Five capability groups (Groups 2–6), a polish pass (Group 7), and a doc/CI sweep. Every behavior is gated either by a config knob (default OFF where load-shape-sensitive) or by the existing observability.enabled master switch, so nothing changes for operators who haven't enabled observability.

Test surface: 1336/1336 pass on a clean make rebuild. Six new test suites land (obs_proxy_client, obs_auth_trace, obs_self_handler, obs_catalog, obs_kill_marshal, obs_ws_messages) covering happy path, edge cases, race conditions, and config round-trip.


What's in this PR

Group 2 — Per-attempt CLIENT span on proxy

Each upstream attempt for a proxied request gets its own CLIENT span. Retries produce distinct spans linked to the same SERVER parent; the per-attempt span_id is stamped into the outbound traceparent so each attempt is independently identifiable in the collector. error.type is set on retry-rejecting and terminal outcomes.

  • ProxyTransaction::AttemptCheckout allocates the fresh CLIENT span; header rewrite + serialization moved here from Start() so retries inject the new span_id.
  • New helpers: obs_manager(), inbound_span(), RebuildOutboundTraceHeaders(), FinalizeAttemptSpan(status, error_type), ErrorTypeForResult(int) (RESULT_* → string).
  • Terminal finalize sites: OnResponseComplete (status), DeliverTerminalError (error_type), MaybeRetry (prev outcome before next attempt_++), Cancel (client_disconnect).
  • Suite: obs_proxy_client (4 tests — successful proxy, upstream 5xx, retry attempts, observability disabled).
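
The per-attempt header rewrite can be illustrated with a small sketch. It is not the gateway's RebuildOutboundTraceHeaders(); the helper names NewSpanIdHex, BuildTraceparent, and TraceparentForAttempt are hypothetical. It only shows the W3C traceparent layout (version 00) and the idea that each attempt keeps the trace id but carries a fresh span id.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <random>
#include <string>

// Hypothetical helper: 16 lowercase hex chars for a fresh 8-byte span id.
std::string NewSpanIdHex(std::mt19937_64& rng) {
    std::uint64_t id = 0;
    while (id == 0) id = rng();  // the all-zero span id is invalid in W3C Trace Context
    char buf[17];
    std::snprintf(buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(id));
    return std::string(buf, 16);
}

// Builds a version-00 traceparent value: 00-<trace-id>-<parent-id>-<flags>.
std::string BuildTraceparent(const std::string& trace_id_hex,
                             const std::string& span_id_hex, bool sampled) {
    assert(trace_id_hex.size() == 32 && span_id_hex.size() == 16);
    return "00-" + trace_id_hex + "-" + span_id_hex + (sampled ? "-01" : "-00");
}

// One call per upstream attempt: same trace id, new span id every time.
std::string TraceparentForAttempt(const std::string& trace_id_hex,
                                  std::mt19937_64& rng, bool sampled) {
    return BuildTraceparent(trace_id_hex, NewSpanIdHex(rng), sampled);
}
```

Calling TraceparentForAttempt once per retry yields values that share the 32-hex trace id and differ only in the 16-hex parent-id field, which is what lets a collector tell attempts apart.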

Group 3 — Auth-path traceparent + auth.idp_check INTERNAL span

When traces.auth_idp_span = true (default, live-reloadable), every deferred IdP introspection POST is wrapped by an INTERNAL span parented at the SERVER span. Setting it to false falls back to recording auth.pending_start / auth.pending_end events on the SERVER span — useful when collector cardinality is a concern.

  • New live-reloadable config: traces.auth_idp_span (default true).
  • IssueTraceContext gains const Propagator* propagator field.
  • AuthManager constructor extended with ObservabilityManager*; MakeIntrospectionDoneCallback ends the span / emits the pending_end event before state->Complete.
  • IntrospectionClient::Verify gains an optional 13th param (std::optional<IssueTraceContext>) carried through to the upstream request.
  • HttpRouter::AsyncPendingState carries auth_idp_check_span, emit_pending_end_event, issue_ctx, inbound_server_span for the deferred resume.
  • auth_upstream_http_client.cc::ApplyOutboundTraceContext strips the W3C+Jaeger union (case-insensitive) before injecting via the bound propagator (or W3C fallback).
  • Mock IdP fixture (test/mock_introspection_server.h) gains received_header(name) for header-injection assertions.
  • Suite: obs_auth_trace (3 tests).
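
A rough sketch of the strip-then-inject step follows. It assumes the "W3C+Jaeger union" means traceparent, tracestate, and uber-trace-id; the PR does not list the exact header set, and the Headers alias and function names here are illustrative rather than the code in auth_upstream_http_client.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

using Headers = std::vector<std::pair<std::string, std::string>>;

std::string ToLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Assumed header set: W3C (traceparent, tracestate) plus Jaeger (uber-trace-id).
bool IsTraceHeader(const std::string& name) {
    const std::string n = ToLower(name);
    return n == "traceparent" || n == "tracestate" || n == "uber-trace-id";
}

// Removes any caller-supplied trace headers, whatever their letter case.
void StripTraceHeaders(Headers& h) {
    h.erase(std::remove_if(h.begin(), h.end(),
                           [](const auto& kv) { return IsTraceHeader(kv.first); }),
            h.end());
}

// Strip first, then inject exactly one traceparent (the W3C fallback path).
void ApplyTraceparent(Headers& h, const std::string& traceparent) {
    assert(!traceparent.empty());
    StripTraceHeaders(h);
    h.emplace_back("traceparent", traceparent);
}
```

Stripping before injecting keeps a stale or spoofed inbound header from reaching the IdP alongside the gateway's own context.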

Group 4 — Self-handler shutdown helper

Replaces the broken pattern of calling HttpServer::Stop() synchronously from inside a route handler (which deadlocks the dispatcher's drain barrier).

  • HttpServer::ScheduleStopAfterCurrentResponse() — idempotent CAS via stop_scheduled_. Schedules Stop() through NetServer::EnQueueOnConnDispatcher.
  • The handler returns its response normally; the deferred Stop() runs on conn_dispatcher_ so the drain barrier engages cleanly against the natural decrement of active_requests_.
  • Direct Stop() from a handler remains documented-undefined; the helper is the only supported pattern.
  • Suite: obs_self_handler (3 tests including a sibling-in-flight drain test).
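
The idempotent-CAS shape of the helper might look roughly like the sketch below. FakeDispatcher and ToyServer are stand-ins invented for illustration, and the bool return value is an assumption (the PR does not state the real signature). The point is only that the first caller wins the compare-exchange and enqueues Stop(), while later callers become no-ops.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Stand-in for the connection dispatcher's task queue (single-threaded here).
struct FakeDispatcher {
    std::deque<std::function<void()>> tasks;
    void EnQueue(std::function<void()> fn) {
        assert(fn != nullptr);
        tasks.push_back(std::move(fn));
    }
    void RunAll() {
        while (!tasks.empty()) {
            auto fn = std::move(tasks.front());
            tasks.pop_front();
            fn();
        }
    }
};

class ToyServer {
public:
    explicit ToyServer(FakeDispatcher& d) : dispatcher_(d) {}

    // Returns true only for the caller that won the CAS and scheduled the stop.
    bool ScheduleStopAfterCurrentResponse() {
        bool expected = false;
        if (!stop_scheduled_.compare_exchange_strong(expected, true)) return false;
        dispatcher_.EnQueue([this] { Stop(); });  // deferred, not run inline
        return true;
    }
    int stop_calls() const { return stop_calls_; }

private:
    void Stop() { ++stop_calls_; }
    FakeDispatcher& dispatcher_;
    std::atomic<bool> stop_scheduled_{false};
    int stop_calls_ = 0;
};
```

Because Stop() is only queued, the handler that called the helper can finish writing its response before the drain begins.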

Group 6 — Full §7 metrics catalog

Programmatic registration of every §7.1–§7.4 instrument at boot, owned by the manager (lifetime-safe across test create/destroy cycles).

  • New MetricsCatalog struct (include/observability/metrics_catalog.h) with Counter / UpDownCounter / Histogram pointers for every catalogued instrument.
  • MetricsCatalog::Build(ObservabilityManager&, MetricsCatalog& out) registers them with the manager's MeterProvider; called once from ObservabilityManager::Init().
  • HTTP server emit sites surface request body size + active_requests UpDownCounter (incremented at request entry, decremented in OnFinalizeWinner).
  • Kill loop bumps reactor.otel.snapshots_killed_on_timeout once per un-finalized survivor.
  • Constants: kBytesBuckets, kLatencyBuckets, kTokensBuckets.
  • Suite: obs_catalog (3 tests).
  • Connection-level counters (protocol-label plumbing across 6+ sites) and BSP/PMR/Tracer self-metrics deferred — see "Still deferred" below.
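
As a simplified picture of the ownership split described above, the sketch below has a registry own the instruments while the catalog holds non-owning pointers. ToyMeterRegistry and ToyCatalog are hypothetical, and the instrument name is borrowed from the OpenTelemetry HTTP semantic conventions as an assumption, not taken from the gateway's §7 list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct UpDownCounter {
    std::int64_t value = 0;
    void Add(std::int64_t delta) { value += delta; }
};

// Owner: instruments live exactly as long as the registry (the "manager").
class ToyMeterRegistry {
public:
    UpDownCounter* GetOrCreateUpDownCounter(const std::string& name) {
        auto& slot = instruments_[name];
        if (!slot) slot = std::make_unique<UpDownCounter>();
        return slot.get();
    }
    std::size_t size() const { return instruments_.size(); }

private:
    std::map<std::string, std::unique_ptr<UpDownCounter>> instruments_;
};

// Non-owning view: plain pointers that emit sites can use without lookups.
struct ToyCatalog {
    UpDownCounter* active_requests = nullptr;

    static void Build(ToyMeterRegistry& reg, ToyCatalog& out) {
        // Instrument name is an assumption made for this sketch.
        out.active_requests = reg.GetOrCreateUpDownCounter("http.server.active_requests");
        assert(out.active_requests != nullptr);
    }
};
```

An emit site would then call active_requests->Add(1) at request entry and Add(-1) at finalize, mirroring the increment/decrement pairing listed above.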

Group 5 — Kill-loop invariant guards

Documents and enforces the existing FinalizeFromSnapshot CAS contract via a regression suite. The kill_marshals_in_flight_ counter stays at 0 (RESERVED) — kept in WaitForAllAsyncDrain's predicate for forward-compat with a future per-dispatcher EnQueue marshal.

  • Tightened the kill_marshals_in_flight_ field-declaration docstring with the RESERVED contract + safety invariant.
  • New suite obs_kill_marshal (3 tests):
    • kill_marshals_in_flight_ stays at 0 before AND after kill (RESERVED contract).
    • N=256 snapshots × 4 finalizer threads × 1 kill thread → exactly N finalized (the CAS resolves cleanly).
    • reactor.otel.snapshots_killed_on_timeout Counter delta = N for un-finalized survivors.
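
The "exactly N finalized" property can be reproduced in miniature. The sketch below is not the gateway's FinalizeFromSnapshot; ToySnapshot and RunFinalizeRace are made-up names. It races several finalizer threads against one kill thread over the same snapshots and counts who won each compare-exchange.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct ToySnapshot {
    std::atomic<bool> finalized{false};
    // Exactly one caller per snapshot observes false -> true.
    bool TryFinalize() {
        bool expected = false;
        return finalized.compare_exchange_strong(expected, true);
    }
};

struct RaceResult {
    std::size_t by_finalizers;
    std::size_t by_kill;
};

RaceResult RunFinalizeRace(std::size_t n, unsigned finalizer_threads) {
    assert(finalizer_threads > 0);
    std::vector<ToySnapshot> snaps(n);
    std::atomic<std::size_t> by_finalizers{0};
    std::atomic<std::size_t> by_kill{0};

    // Every thread sweeps all snapshots; the CAS decides the single winner.
    auto sweep = [&snaps](std::atomic<std::size_t>& wins) {
        for (auto& s : snaps)
            if (s.TryFinalize()) wins.fetch_add(1, std::memory_order_relaxed);
    };

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < finalizer_threads; ++i)
        threads.emplace_back([&] { sweep(by_finalizers); });
    threads.emplace_back([&] { sweep(by_kill); });  // the "kill loop"
    for (auto& t : threads) t.join();

    return {by_finalizers.load(), by_kill.load()};
}
```

How the wins split between finalizers and the kill thread varies from run to run, but their sum always equals the snapshot count, which is the invariant the regression suite pins down.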

Group 7 — Polish

7.1 — Sampler self-noise auto-derivation. ConfigLoader::LoadFromString calls a new ApplySamplerSelfNoiseDefaults helper that auto-appends always_off routes for the gateway's own probes:

  • /health and /stats unconditionally.
  • The configured prometheus.path when metrics.exporter == "prometheus_pull".
  • Operator-supplied entries with the same path are preserved verbatim (explicit override always wins).
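
The three rules above could be sketched as follows. ToyObsConfig, RouteSampler, and the field names are hypothetical stand-ins for the real config structs; only the paths, the "prometheus_pull" exporter value, and the always_off sampler name come from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct RouteSampler {
    std::string path;
    std::string sampler;
};

struct ToyObsConfig {
    std::string metrics_exporter;              // e.g. "prometheus_pull"
    std::string prometheus_path;               // e.g. "/metrics"
    std::vector<RouteSampler> route_samplers;  // operator-supplied entries
};

// Appends always_off entries for the gateway's own probe paths,
// leaving any operator entry for the same path untouched.
void ApplySelfNoiseDefaults(ToyObsConfig& cfg) {
    auto has_path = [&cfg](const std::string& p) {
        return std::any_of(cfg.route_samplers.begin(), cfg.route_samplers.end(),
                           [&p](const RouteSampler& r) { return r.path == p; });
    };

    std::vector<std::string> probes = {"/health", "/stats"};
    if (cfg.metrics_exporter == "prometheus_pull" && !cfg.prometheus_path.empty())
        probes.push_back(cfg.prometheus_path);

    for (const auto& p : probes)
        if (!has_path(p)) cfg.route_samplers.push_back({p, "always_off"});

    assert(has_path("/health") && has_path("/stats"));
}
```

Running the helper twice adds nothing the second time, so it is safe to call on every config load.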

7.2 — metrics.prometheus.path reload-warn. ObservabilityManager::Reload now detects path changes between live and staged config and warns ("restart to apply"); the live config_.metrics.prometheus.path is NOT mutated. Mirrors the pattern for traces.exporter, metrics.exporter.
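
A minimal sketch of the warn-but-don't-mutate pattern, assuming a toy config with one restart-only field and one live-reloadable field. ToyConfig, ToyManager, and the returned warning list are illustrative; the real Reload signature is not shown in this PR.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct ToyConfig {
    std::string prometheus_path = "/metrics";  // restart-only
    bool websocket_messages = false;           // live-reloadable
};

class ToyManager {
public:
    explicit ToyManager(ToyConfig c) : live_(std::move(c)) {
        assert(!live_.prometheus_path.empty());
    }

    // Applies live-reloadable fields and reports restart-only changes.
    std::vector<std::string> Reload(const ToyConfig& staged) {
        std::vector<std::string> warnings;
        if (staged.prometheus_path != live_.prometheus_path) {
            // Warn only: live_.prometheus_path is deliberately left as-is.
            warnings.push_back("metrics.prometheus.path changed; restart to apply");
        }
        live_.websocket_messages = staged.websocket_messages;  // live flip
        return warnings;
    }

    const ToyConfig& live() const { return live_; }

private:
    ToyConfig live_;
};
```

Leaving the live value untouched means later reloads keep comparing against what the process is actually serving, so the warning repeats until a restart.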

7.3 — Per-message WebSocket spans.

  • New live-reloadable config: traces.websocket_messages (default false).
  • Atomic mirror in ObservabilityManager::websocket_messages_enabled_.
  • WebSocketConnection gains SetObservabilitySnapshot + MaybeEmitMessageSpan.
  • Inbound: emits ws.recv on FIN-true Text/Binary AND on the final continuation frame.
  • Outbound: emits ws.send from SendText / SendBinary.
  • Control frames (Ping / Pong / Close) are NOT spanned.
  • Wired from both sync + async-resume upgrade paths in HttpConnectionHandler.
  • Default OFF — high-throughput WS connections produce far more messages than HTTP requests; operators opt in when they want the visibility.
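
The inbound rules reduce to a small predicate, sketched below. ShouldEmitRecvSpan, Frame, and CountRecvSpans are hypothetical names rather than the body of MaybeEmitMessageSpan; the opcode values are those defined by RFC 6455.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Opcode values from RFC 6455, section 5.2.
enum class WsOpcode : std::uint8_t {
    Continuation = 0x0, Text = 0x1, Binary = 0x2,
    Close = 0x8, Ping = 0x9, Pong = 0xA
};

// Control opcodes have the high bit of the 4-bit opcode set.
bool IsControl(WsOpcode op) {
    assert(static_cast<std::uint8_t>(op) <= 0xF);
    return (static_cast<std::uint8_t>(op) & 0x8) != 0;
}

// A ws.recv span is due when a data message completes: an unfragmented
// Text/Binary frame (FIN set) or the final Continuation frame (FIN set).
bool ShouldEmitRecvSpan(WsOpcode op, bool fin, bool enabled) {
    if (!enabled || IsControl(op)) return false;
    return fin;
}

struct Frame {
    WsOpcode op;
    bool fin;
};

int CountRecvSpans(const std::vector<Frame>& frames, bool enabled) {
    int spans = 0;
    for (const Frame& f : frames)
        if (ShouldEmitRecvSpan(f.op, f.fin, enabled)) ++spans;
    return spans;
}
```

A three-fragment text message with a Ping interleaved therefore produces one span, emitted at the frame that carries FIN.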

Still deferred (follow-up PRs)

  • Per-dispatcher EnQueue kill marshal (the RESERVED counter is wired into the drain predicate today; future PR adds the bump-pre-EnQueue / decrement-in-closure pair).
  • Connection-level metrics with protocol label (connections.active / connections.total need protocol-label plumbing across 6+ inbound/outbound sites).
  • BSP / PMR / Tracer self-metrics (workers need a manager pointer for self-instrumentation).
  • Native OTLP/protobuf serializer — Phase 1 deferred, still adequate via OTLP/JSON.
  • Per-key max_value_cardinality_per_label SIGHUP — Phase 1 deferred, still restart-only.
  • Live-swap of OtlpHttpExporter::traces.otlp.upstream — Phase 1 deferred.

Tests

New suites

| Suite | Where | Coverage |
| --- | --- | --- |
| obs_proxy_client | test/observability_proxy_client_test.h | Per-attempt CLIENT span: happy path, 5xx error.type, retry tree, observability-disabled passthrough |
| obs_auth_trace | test/observability_auth_trace_test.h | traceparent injection on the IdP hop, auth.idp_check allocated with parent + outcome, auth_idp_span=false falls back to events |
| obs_self_handler | test/observability_self_handler_test.h | Self-handler shutdown helper: happy path, idempotent CAS, drain waits for sibling in-flight requests |
| obs_catalog | test/observability_catalog_test.h | Every §7 instrument registered after Init, HTTP server body+active_requests emit, kill-loop self-metric |
| obs_kill_marshal | test/observability_kill_marshal_test.h | RESERVED counter stays zero, FinalizeFromSnapshot CAS race resolves cleanly, kill counter deltas |
| obs_ws_messages | test/observability_ws_messages_test.h | traces.websocket_messages gate, control-frame skip, fragmented-message single span at FIN, install-once rebind reject |

Verification

make clean && make -j4 && ./test_runner
→ Total Tests: 1336 | Passed: 1336 | Failed: 0

CI workflow updates (.github/workflows/)

  • ci.yml: the build-linux-tsan-rest enumeration is extended with all 6 new obs suites. The macOS subset (build-macos) is extended with the 4 socket-using ones (obs_self_handler, obs_proxy_client, obs_auth_trace, obs_ws_messages).
  • weekly-valgrind.yml: extended with 4 memory-safety candidates (obs_self_handler, obs_proxy_client, obs_auth_trace, obs_catalog); obs_kill_marshal is skipped (timing-sensitive concurrent test, the valgrind interpreter would mask races).

Documentation

  • docs/observability.md — new "Phase 3 features" section explaining each capability for operators; two new rows in the traces field reference (auth_idp_span, websocket_messages); pruned "Out of scope" list to only-still-deferred items.
  • .codex/ mirror — every shared doc updated to match.

Files changed

 16 files changed (excluding docs and CI workflows)

 Configuration & manager
  include/observability/observability_config.h  +9
  include/observability/observability_manager.h +18
  server/observability_manager.cc               +60
  server/observability_middleware.cc            +30
  server/config_loader.cc                       +44

 Proxy CLIENT span (Group 2)
  include/upstream/proxy_transaction.h          +27
  server/proxy_transaction.cc                   +180

 Auth path (Group 3)
  include/auth/auth_manager.h                   +12
  include/auth/introspection_client.h           +5
  include/observability/trace_context.h         +3
  server/auth_manager.cc                        +120
  server/introspection_client.cc                +6
  server/auth_upstream_http_client.cc           +30
  test/mock_introspection_server.h              +20

 Self-handler shutdown (Group 4)
  include/http/http_server.h                    +5
  include/net_server.h                          +3
  server/http_server.cc                         +35
  server/net_server.cc                          +12

 §7 catalog (Group 6)
  include/observability/metrics_catalog.h       +90 (new)
  server/metrics_catalog.cc                     +150 (new)

 WS message spans (Group 7)
  include/ws/websocket_connection.h             +24
  server/websocket_connection.cc                +40
  server/http_connection_handler.cc             +2

 Tests (5 new suites + run_test.cc + Makefile)
  test/observability_proxy_client_test.h        +451 (new)
  test/observability_auth_trace_test.h          +432 (new)
  test/observability_self_handler_test.h        +213 (new)
  test/observability_catalog_test.h             +280 (new)
  test/observability_kill_marshal_test.h        +197 (new)
  test/observability_config_test.h              +281
  test/run_test.cc                              +24
  Makefile                                      +12

 CI workflows
  .github/workflows/ci.yml                      +14
  .github/workflows/weekly-valgrind.yml         +4

Test plan

  • CI all-suites jobs pass on Linux (gcc, clang, ASan+UBSan, TSan-heavy, TSan-rest)
  • CI macOS subset passes (the 4 new socket-using suites added to build-macos)
  • Manual: ./test_runner obs_proxy_client — 4/4
  • Manual: ./test_runner obs_auth_trace — 3/3
  • Manual: ./test_runner obs_self_handler — 3/3
  • Manual: ./test_runner obs_catalog — 3/3
  • Manual: ./test_runner obs_kill_marshal — 3/3
  • Manual: ./test_runner obs_ws_messages — 6/6
  • Manual: full sweep make clean && make -j4 && ./test_runner — 1336/1336
  • Manual smoke against a real OTel collector — verify per-attempt CLIENT spans appear with error.type on retries, auth.idp_check shows up as a child of inbound SERVER, /health and /metrics produce no spans
  • Manual: enable traces.websocket_messages = true against a WS endpoint and verify ws.recv / ws.send spans appear with ws.opcode + ws.payload_size
  • Manual: SIGHUP reloads of traces.auth_idp_span, traces.websocket_messages, metrics.prometheus.path produce expected behavior (live flip vs. restart-only warn)

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements Phase 3 of the observability framework, adding per-attempt CLIENT spans for proxied requests, INTERNAL spans for IdP introspection, and per-message WebSocket spans. It introduces a centralized metrics catalog for §7 instruments and a graceful shutdown helper for route handlers. Feedback indicates that trace propagation for IdP introspection is incorrectly gated by the sampling state, which should be decoupled to maintain trace continuity.

Comment thread server/auth_manager.cc Outdated
Repository owner deleted a comment from chatgpt-codex-connector Bot May 10, 2026

mwfj commented May 11, 2026

LGTM

@mwfj mwfj merged commit 4794779 into main May 11, 2026
6 checks passed
@mwfj mwfj deleted the support-opentelemetry-observability-phase3 branch May 11, 2026 12:29