OpenTelemetry-based observability for the gateway: distributed traces (W3C and Jaeger propagation, OTLP/HTTP push), metrics (Prometheus pull or OTLP push), and structured logs (via spdlog, with trace correlation). All disabled by default — enable per signal.
Enable tracing + Prometheus metrics:
```json
{
  "observability": {
    "enabled": true,
    "resource": {
      "service.name": "edge-gateway",
      "service.version": "1.4.0"
    },
    "traces": {
      "enabled": true,
      "exporter": "otlp_http",
      "otlp": { "upstream": "otel-collector" },
      "sampler": { "kind": "trace_id_ratio", "ratio": 0.05 },
      "propagators": ["w3c"]
    },
    "metrics": {
      "enabled": true,
      "exporter": "prometheus_pull",
      "prometheus": { "path": "/metrics" }
    }
  },
  "upstreams": [
    {
      "name": "otel-collector",
      "host": "otel.svc.cluster.local",
      "port": 4318,
      "pool": { "max_connections": 4, "max_idle_connections": 4 }
    }
  ]
}
```

`observability.enabled` is the master switch. When false, every per-request hook short-circuits at zero cost; nothing is allocated and no propagator headers are injected. The master switch is restart-required: turning observability on for the first time means a restart. The sub-switches (`traces.enabled`, `metrics.enabled`) are live-reloadable.
| Field | Restart required? | Behavior when off |
|---|---|---|
| `observability.enabled` | YES | Subsystem not allocated. Every hook is an `if (!manager_) return` short-circuit. |
| `traces.enabled` | NO (live-reloadable) | Tracer returns non-recording spans; the propagator does not inject `sampled=1`. Boot-time-path allocations still happen so a SIGHUP `false` → `true` works. |
| `metrics.enabled` | NO (live-reloadable) | The `/metrics` handler returns 404. The PMR skips `Export()` calls. Counters keep accumulating in memory so cumulative totals stay coherent across toggles. |
The master switch governs allocation; the sub-switches govern emission.
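The allocation-vs-emission split can be sketched roughly like this (class and member names are illustrative, not the gateway's actual API):

```cpp
#include <atomic>
#include <memory>

// Hypothetical sketch: master switch governs whether the subsystem exists
// at all; sub-switches only gate emission on an already-allocated manager.
struct ObservabilityManager {
    std::atomic<bool> traces_enabled{true};   // live-reloadable sub-switch
    void OnRequestStart() { /* start span, inject headers, ... */ }
};

class Gateway {
public:
    explicit Gateway(bool observability_enabled) {
        // Master switch: the subsystem is only allocated when enabled at boot.
        if (observability_enabled)
            manager_ = std::make_unique<ObservabilityManager>();
    }

    void HandleRequest() {
        // Per-request hook: a null check is the entire cost when disabled.
        if (!manager_) return;
        if (!manager_->traces_enabled.load(std::memory_order_relaxed)) return;
        manager_->OnRequestStart();
    }

    bool allocated() const { return manager_ != nullptr; }

private:
    std::unique_ptr<ObservabilityManager> manager_;  // null when master off
};
```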
traces.sampler.kind selects the root sampler:
| Kind | Meaning |
|---|---|
| `always_on` | Sample every trace. |
| `always_off` | Sample no traces. |
| `trace_id_ratio` | Sample by hashing the trace-id; configured via `traces.sampler.ratio` (0.0–1.0). |
| `parent_based` | Honor the inbound parent's sampled flag; falls back to `traces.sampler.default_root` when there is no parent. |
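A minimal sketch of how a `trace_id_ratio` decision could work, assuming the common approach of comparing the trace-id's low 64 bits against a ratio-derived threshold (the gateway's actual hash is not documented here):

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch: treat the low 8 bytes (last 16 hex chars) of the
// 32-hex trace-id as a uint64 and sample when it falls below
// ratio * 2^64. Deterministic per trace-id, so every span of a given
// trace gets the same decision.
bool SampleTraceIdRatio(const std::string& trace_id_hex, double ratio) {
    if (ratio >= 1.0) return true;
    if (ratio <= 0.0) return false;
    uint64_t low = std::stoull(
        trace_id_hex.substr(trace_id_hex.size() - 16), nullptr, 16);
    auto threshold =
        static_cast<uint64_t>(ratio * static_cast<double>(UINT64_MAX));
    return low < threshold;
}
```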
Per-route overrides live in traces.sampler.routes:
```json
"sampler": {
  "kind": "trace_id_ratio",
  "ratio": 0.05,
  "routes": [
    { "pattern": "/health",     "kind": "always_off" },
    { "pattern": "/checkout/*", "kind": "always_on" }
  ]
}
```

Route overrides are evaluated directly (not wrapped in `ParentBased`). `/health: always_off` will drop a sampled-parent request and propagate `sampled=0` downstream; that is the documented intent. The default root falls back to `ParentBased`.
All sampler fields above are live-reloadable. In-flight spans keep their original sampler.
Set traces.exporter = "otlp_http". The exporter posts OTLP/JSON v1.10 to the upstream named in traces.otlp.upstream at path /v1/traces (the Collector default). Headers and timeout:
```json
"traces": {
  "exporter": "otlp_http",
  "otlp": {
    "upstream": "otel-collector",
    "path": "/v1/traces",
    "headers": { "authorization": "Bearer ${OTEL_EXPORTER_OTLP_TOKEN}" },
    "timeout_ms": 10000
  }
}
```

Headers and timeout are live-reloadable; `upstream` and `exporter` are restart-required.
```json
"traces": {
  "batch": {
    "max_queue_size": 2048,
    "max_export_batch_size": 512,
    "schedule_delay_ms": 5000,
    "retries": {
      "max_attempts": 3,
      "initial_backoff_ms": 1000,
      "max_backoff_ms": 10000
    }
  }
}
```

The per-batch export deadline is sourced from `traces.otlp.timeout_ms`; there is no separate `batch.export_timeout_ms` field. Batch shape (`max_export_batch_size`, `schedule_delay_ms`) and retry policy (`retries.*`) are all live-reloadable; the worker re-reads them on the next iteration after `cv_.notify_all()`. `max_queue_size` allocates the queue at construction and is restart-only.
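The retry delay implied by the `retries.*` fields could look like the following sketch; only the field names come from the config above, and the doubling-with-clamp policy is an assumption:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative exponential backoff: start at initial_backoff_ms, double
// on each subsequent attempt, and clamp at max_backoff_ms. attempt is
// 1-based (attempt 1 waits the initial delay).
int64_t BackoffMs(int attempt, int64_t initial_ms, int64_t max_ms) {
    int64_t delay = initial_ms;
    for (int i = 1; i < attempt && delay < max_ms; ++i)
        delay *= 2;                      // double per retry
    return std::min(delay, max_ms);      // clamp at max_backoff_ms
}
```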
traces.propagators is an ordered list of formats the gateway extracts and injects. Default ["w3c"]. Recognised tokens: w3c, jaeger. Live-reloadable.
```json
"traces": { "propagators": ["w3c", "jaeger"] }
```

`CompositePropagator::Extract` iterates the list in order and returns the first child that produced a valid context. With `["w3c", "jaeger"]` configured, a request carrying both `traceparent` and `uber-trace-id` is parented by the `traceparent` value; the Jaeger header is ignored on that request. Reverse the list to flip the precedence.
CompositePropagator::Inject calls every child, so a single SpanContext is emitted in every wire format the operator configured. Each propagator strips its owned headers before injecting, so client-supplied trace headers never leak through the gateway.
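The first-valid-wins extraction order can be illustrated with a simplified sketch (the real propagator interface deals in `SpanContext`, not strings; the extractor names here are illustrative):

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Headers = std::map<std::string, std::string>;
using Extractor = std::function<std::optional<std::string>(const Headers&)>;

// Each child extractor returns a context when its owned header is present.
std::optional<std::string> ExtractW3C(const Headers& h) {
    auto it = h.find("traceparent");
    if (it == h.end()) return std::nullopt;
    return it->second;
}

std::optional<std::string> ExtractJaeger(const Headers& h) {
    auto it = h.find("uber-trace-id");
    if (it == h.end()) return std::nullopt;
    return it->second;
}

// Walk the configured order; the first child producing a valid context
// wins. No match means no parent, so a fresh trace is started.
std::optional<std::string> CompositeExtract(
    const std::vector<Extractor>& children, const Headers& h) {
    for (const auto& child : children)
        if (auto ctx = child(h)) return ctx;
    return std::nullopt;
}
```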
| Propagator | Owned headers | Format |
|---|---|---|
| `w3c` | `traceparent`, `tracestate` | `00-{32-hex-trace}-{16-hex-span}-{2-hex-flags}` |
| `jaeger` | `uber-trace-id` | `{trace-id}:{span-id}:{parent-span-id}:{flags}`. The trace-id is 16-hex (legacy 64-bit, left-padded with zeros) or 32-hex; the flags' sampled bit (`0x01`) is honored on extract; debug/firehose bits are dropped. |
Validation rejects an empty propagators list and any unknown token at startup and on SIGHUP.
```json
"metrics": {
  "enabled": true,
  "exporter": "prometheus_pull",
  "prometheus": { "path": "/metrics", "include_target_info": true }
}
```

The gateway registers a `GET /metrics` handler at startup if `cli.metrics_endpoint` and `cli.health_endpoint` are configured. The handler returns Prometheus exposition format by default; if the request carries `Accept: application/openmetrics-text`, it returns OpenMetrics 1.0 instead.
When metrics.enabled = false the route is still registered (so a SIGHUP false → true works without restart) but the handler returns 404 Not Found.
OTel instrument names are sanitized to Prometheus naming rules: any character outside `[a-zA-Z0-9_]` becomes `_`, and a leading digit gets a `_` prepended. Counter names get a `_total` suffix. Sanitization collisions across distinct OTel names are detected per-render: the first instrument wins, and subsequent colliding instruments are suppressed so the output never contains conflicting `# TYPE` blocks. Each distinct collision pair is logged at most once per process.
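The rules above, as a standalone sketch (`SanitizePromName` is an illustrative name, not the gateway's function):

```cpp
#include <cctype>
#include <string>

// Apply the stated sanitization rules: characters outside [a-zA-Z0-9_]
// become '_', a leading digit gets '_' prepended, and counters gain a
// "_total" suffix.
std::string SanitizePromName(std::string name, bool is_counter) {
    for (char& c : name)
        if (!std::isalnum(static_cast<unsigned char>(c)) && c != '_')
            c = '_';
    if (!name.empty() && std::isdigit(static_cast<unsigned char>(name[0])))
        name.insert(name.begin(), '_');
    if (is_counter) name += "_total";
    return name;
}
```

Note that distinct OTel names such as `http.server.duration` and `http_server_duration` sanitize to the same Prometheus name, which is exactly the collision case described above.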
```json
"metrics": {
  "enabled": true,
  "exporter": "otlp_http",
  "otlp": { "upstream": "otel-collector", "path": "/v1/metrics" },
  "export_interval_ms": 10000,
  "export_timeout_ms": 10000
}
```

`export_interval_ms` and `export_timeout_ms` are live-reloadable. `exporter` and `otlp.upstream` are restart-required.
When traces and metrics both target otlp_http and the same upstream is configured, the gateway shares one OtlpHttpExporter instance between the BatchSpanProcessor and the PeriodicMetricReader. The shared-exporter shutdown coordinator ensures the first-finishing worker doesn't drop the other's final batch (see "Shutdown drain" below).
| Aspect | Value |
|---|---|
| Default path | `/metrics` (configurable via `metrics.prometheus.path`) |
| Method | `GET` |
| Default content-type | `text/plain; version=0.0.4; charset=utf-8` (Prometheus exposition) |
| OpenMetrics content-type | `application/openmetrics-text; version=1.0.0; charset=utf-8` (when the `Accept` header requests it) |
| Live-reload of path | Yes while `/metrics` is not yet registered; after registration, restart-only. |
| Auth | None by default. Pair with a route-level auth policy to gate scrapes. |
Tail-sampling is out of scope (see "Out of scope" below). For high-RPS deployments:
- Set `traces.sampler.kind = "trace_id_ratio"` with a low default ratio (e.g. `0.01`).
- Use route overrides to raise sampling on critical paths and drop sampling on health endpoints:

  ```json
  "sampler": {
    "kind": "trace_id_ratio",
    "ratio": 0.01,
    "routes": [
      { "pattern": "/health",         "kind": "always_off" },
      { "pattern": "/api/checkout/*", "kind": "always_on" }
    ]
  }
  ```

- Route-override sampling is evaluated directly, not wrapped by `ParentBased`. `always_off` will drop sampled-parent traces; that is the operator's documented intent (the `/health` row above is the canonical use case).
- Tune the batch processor: under high RPS, raise `max_queue_size` (default 2048) and `max_export_batch_size` (default 512). A queue-overflow drop increments the `dropped_on_overflow_` counter on the BSP; surface it via the self-metrics catalog once that lands.
The gateway's four-phase shutdown ensures observability data finalizes before workers are joined:
1. StopAccepting + drain inbound. `WaitForAllAsyncDrain(budget)` waits for in-flight inbound and proxy transactions. Dispatchers and the upstream pool stay alive.
2. FlushObservabilityForShutdown. Calls `WaitForAllAsyncDrain` first so span finalizes land in the BSP queue, then `ObservabilityManager::FlushAll(deadline)` blocks on the BSP until queue depth hits zero (or the deadline) and on the PMR until its in-flight cycle completes.
3. KillAndShutdownObservability. If the flush did not drain in time, `KillOutstandingSnapshots` CAS-finalizes survivors with `error_type="shutdown"`. Then the manager joins the BSP and PMR workers within the remaining budget. When the BSP and PMR share an `OtlpHttpExporter` instance, the manager calls `DisableExporterShutdownOnDrain` on each before signalling them, then signals the exporter exactly once after both workers drain.
4. StopDispatchers.
Total shutdown time is bounded by cli.shutdown_drain_timeout_sec (default 30s).
| Field | Live | Notes |
|---|---|---|
| `observability.enabled` (master) | NO | Restart-required. |
| `traces.enabled`, `metrics.enabled` | YES | Sub-switches govern emission, not allocation. |
| `traces.sampler.kind` / `ratio` / `routes` | YES | Atomic-swap; in-flight spans keep their original sampler. |
| `traces.otlp.headers` / `timeout_ms` | YES | `OtlpHttpExporter::ReloadHeaders`. |
| `traces.otlp.upstream`, `traces.exporter` | NO | Restart-required. The boot-time hot-swap from `NoopProcessor` to `BatchSpanProcessor` lands automatically once `MarkServerReady` wires the OTLP upstream. |
| `traces.propagators` | YES | Atomic-swap on `propagator_`. New requests immediately use the new composite. |
| `traces.batch.*` | YES | Worker re-reads after `cv_.notify_all()` (`max_queue_size` excepted; it is allocated at construction). |
| `metrics.export_interval_ms` / `export_timeout_ms` | YES | `MeterProvider::Reload`. |
| `metrics.prometheus.path`, `metrics.prometheus.include_target_info` | YES (`path`: only while `/metrics` is not yet registered) | |
| `metrics.exporter`, `metrics.otlp.upstream` | NO | Restart-required. |
| `resource.*` (`service.name`, `service.version`, `service.instance.id`) | NO | Span identity baseline. |
- Confirm `observability.enabled = true` AND `traces.enabled = true`.
- Confirm `traces.exporter = "otlp_http"` and that the `traces.otlp.upstream` name maps to a configured upstream.
- Check that the Collector receives traffic from the gateway upstream IP/port (`tcpdump -i any -n port 4318`).
- Inspect the gateway's structured logs for `OtlpHttpExporter` failures; non-2xx responses are logged at `warn` with the response status.
- If the sampler is `trace_id_ratio` with a low ratio, generate enough traffic, or raise the ratio temporarily.
- The BSP drops spans on queue overflow rather than blocking. Two drop counters are accessible programmatically: `BatchSpanProcessor::dropped_on_overflow()` (queue full) and `BatchSpanProcessor::dropped_on_export_failure()` (non-retryable export, retry budget exhausted, or exporter exception). Until they're surfaced via Prometheus, watch the structured logs:
  - `BatchSpanProcessor: non-retryable export failure; dropping batch (N spans, attempt=K)`: the collector returned 4xx (excluding 429) or rejected the payload outright.
  - `BatchSpanProcessor: retryable export failed after N attempts; dropping batch (M spans)`: retry budget exhausted (network blips, repeated 5xx).
  - `BatchSpanProcessor::Export threw: ... (dropping N spans)`: the exporter raised; treated as a non-retryable failure.
- Confirm `metrics.enabled = true` AND `metrics.exporter = "prometheus_pull"`.
- Confirm the scrape path matches `metrics.prometheus.path` (default `/metrics`).
- Confirm `cli.health_endpoint` and `cli.metrics_endpoint` are not disabled (`--no-metrics-endpoint`). The route is only registered when both are enabled at startup.
`MetricLabelRegistry` enforces a per-instrument allowlist plus a cardinality cap. When the number of distinct values seen for a label key exceeds the cap, subsequent values for that key route to `__overflow__`:

```
http_server_request_duration_seconds{route="/api/checkout/*",method="GET",status_code="__overflow__"}
```

If you see `__overflow__` series in `/metrics`, either the upstream emitting the offending value has unbounded cardinality (a status code from an external API; a user-id label) or the configured cap is too low. The cap is a process-startup field today (changing it requires a restart; SIGHUP does not apply it).
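A minimal sketch of the cap-then-`__overflow__` behavior (`MetricLabelRegistry`'s real API is not shown in this document; the names here are illustrative):

```cpp
#include <map>
#include <set>
#include <string>

// Track distinct values per label key; once a key's budget is spent,
// route every previously-unseen value to the "__overflow__" bucket.
class LabelCap {
public:
    explicit LabelCap(size_t cap) : cap_(cap) {}

    // Returns the value to record: the original while under the cap,
    // "__overflow__" once the distinct-value budget for the key is spent.
    std::string Admit(const std::string& key, const std::string& value) {
        auto& seen = seen_[key];
        if (seen.count(value)) return value;      // already-known value
        if (seen.size() >= cap_) return "__overflow__";
        seen.insert(value);
        return value;
    }

private:
    size_t cap_;
    std::map<std::string, std::set<std::string>> seen_;
};
```

Note that values admitted before the cap was hit keep reporting under their own series, so existing time series stay stable while the overflow bucket absorbs the long tail.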
ConfigLoader::Validate rejects an empty propagators list and any unknown token. Recognised tokens are w3c and jaeger. Check the SIGHUP target file for typos like "jeager" or "W3C" (the value is case-sensitive).
Verify propagators includes "w3c". If it does, the inbound traceparent value is malformed — the W3C parser rejects:
- length other than 55 chars,
- version field other than
00, - non-lowercase hex in any field,
- all-zero trace-id or span-id.
The gateway treats the inbound as no parent and starts a fresh trace. Run with debug logging to surface the per-request reason.
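The four rejection rules translate into a validator along these lines (a sketch, not the gateway's parser):

```cpp
#include <string>

// Lowercase-hex check for one traceparent field.
bool IsLowerHex(const std::string& s) {
    for (char c : s)
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))) return false;
    return !s.empty();
}

// Validate 00-{32-hex-trace}-{16-hex-span}-{2-hex-flags}, 55 chars total,
// per the rejection rules listed above.
bool ValidTraceparent(const std::string& tp) {
    if (tp.size() != 55) return false;                  // length
    if (tp.compare(0, 3, "00-") != 0) return false;     // version must be 00
    if (tp[35] != '-' || tp[52] != '-') return false;   // field delimiters
    std::string trace = tp.substr(3, 32);
    std::string span  = tp.substr(36, 16);
    std::string flags = tp.substr(53, 2);
    if (!IsLowerHex(trace) || !IsLowerHex(span) || !IsLowerHex(flags))
        return false;                                   // lowercase hex only
    if (trace == std::string(32, '0') || span == std::string(16, '0'))
        return false;                                   // all-zero ids
    return true;
}
```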
| Field | Type | Default | Live-reloadable | Notes |
|---|---|---|---|---|
| `enabled` | bool | `false` | NO | Master switch. |
| `resource.service.name` | string | (none) | NO | Required when `enabled = true`. |
| `resource.service.version` | string | (none) | NO | |
| `resource.service.instance.id` | string | auto | NO | Defaults to `${HOSTNAME}-${PID}`. |
| Field | Type | Default | Live-reloadable |
|---|---|---|---|
| `enabled` | bool | `false` | YES |
| `exporter` | string | (empty) | NO |
| `otlp.upstream` | string | (none) | NO |
| `otlp.path` | string | `/v1/traces` | NO |
| `otlp.headers` | object | `{}` | YES |
| `otlp.timeout_ms` | int | 10000 | YES |
| `sampler.kind` | string | `parent_based` | YES |
| `sampler.ratio` | number | 1.0 | YES |
| `sampler.default_root` | string | `always_on` | YES |
| `sampler.routes` | array | `[]` | YES |
| `propagators` | array | `["w3c"]` | YES |
| `auth_idp_span` | bool | `true` | YES |
| `websocket_messages` | bool | `false` | YES |
| `batch.max_queue_size` | int | 2048 | NO (allocated at construction) |
| `batch.max_export_batch_size` | int | 512 | YES |
| `batch.schedule_delay_ms` | int | 5000 | YES |
| `batch.retries.max_attempts` | int | 3 | YES |
| `batch.retries.initial_backoff_ms` | int | 1000 | YES |
| `batch.retries.max_backoff_ms` | int | 10000 | YES |
| Field | Type | Default | Live-reloadable |
|---|---|---|---|
| `enabled` | bool | `false` | YES |
| `exporter` | string | (empty) | NO |
| `otlp.upstream` | string | (none) | NO |
| `otlp.path` | string | `/v1/metrics` | NO |
| `export_interval_ms` | int | 10000 | YES |
| `export_timeout_ms` | int | 10000 | YES |
| `prometheus.path` | string | `/metrics` | YES (while the route is not yet registered) |
| `prometheus.include_target_info` | bool | `true` | YES |
Every upstream attempt for a proxied request gets its own CLIENT span. Retries produce distinct spans linked to the same SERVER parent; the per-attempt span_id is stamped into the outbound traceparent so each attempt is independently identifiable in the collector. Terminal outcomes set http.response.status_code (success) or error.type (e.g. upstream_timeout, connect_failed, circuit_open, client_disconnect).
When traces.auth_idp_span = true (default), every deferred IdP introspection POST is wrapped by an INTERNAL span parented at the SERVER span. Setting it to false falls back to recording auth.pending_start / auth.pending_end events on the SERVER span — useful when collector cardinality is a concern. Live-reloadable.
```json
"traces": { "websocket_messages": true }
```

Default `false`. When enabled, every text/binary frame produces a short `ws.recv` (inbound) or `ws.send` (outbound) INTERNAL span parented at the upgrade SERVER span, with `ws.opcode` and `ws.payload_size` attributes. Control frames (Ping / Pong / Close) are NOT spanned. Live-reloadable. Caveat: WS connections produce far more messages than HTTP requests; enabling this on a high-throughput WS-heavy workload will significantly increase span volume.
The gateway's own /health, /stats, and configured Prometheus path are auto-added to traces.sampler.routes with always_off so operator-side probes never pollute traces. Operator-supplied entries with the same path are preserved verbatim — explicit override always wins.
When SIGHUP changes metrics.prometheus.path, the gateway logs a warn ("restart to apply") and keeps the live value. The HTTP route bound at startup remains the only path served. Restart to register the new path.
Breaking (Phase 3): `metrics.prometheus.path = "/"` is now rejected at boot when `metrics.exporter = "prometheus_pull"`. Previously the sampler self-noise auto-prepend silently no-op'd on this value, leaving `/metrics` traffic to feed its own traces; the loud failure makes the misconfiguration visible immediately. Set a distinct path (the default `/metrics` is the canonical choice).
A route handler that needs to terminate the server (e.g. an admin endpoint exposing `/shutdown`) must NOT call `HttpServer::Stop()` synchronously; that deadlocks the dispatcher. Use `HttpServer::ScheduleStopAfterCurrentResponse()` instead. The helper:

- Populates the response normally and returns from the handler.
- Schedules `Stop()` on the conn dispatcher via `NetServer::EnQueueOnConnDispatcher`.
- Lets the calling handler's `active_requests_` decrement drain naturally before the deferred `Stop` runs.
- Is idempotent: repeated/concurrent calls collapse via an internal CAS.
A new test suite ratchets the documented kill-loop contract:

- `kill_marshals_in_flight_` stays at 0 (RESERVED for a future per-dispatcher `EnQueue` marshal).
- The `FinalizeFromSnapshot` CAS resolves multi-thread races: every snapshot is finalized exactly once.
- The `reactor.otel.snapshots_killed_on_timeout` counter delta = N for N un-finalized survivors.
The following are deferred and not currently implemented; they are listed so future readers don't expect them.

- Connection-level metrics with a protocol label. `connections.active` / `connections.total` need protocol-label plumbing across 6+ inbound/outbound sites.
- BSP / PMR / Tracer self-metrics. The `BatchSpanProcessor` and `PeriodicMetricReader` workers need a manager pointer for self-instrumentation.
- Per-dispatcher `EnQueue` kill marshal. The `kill_marshals_in_flight_` counter is already wired into `WaitForAllAsyncDrain`'s predicate as RESERVED; the actual bump-pre-EnQueue / decrement-in-closure pair is deferred.
- Tail sampling. Out of scope; deploy a Collector with the tail-sampling processor between the gateway and the trace backend.
- Histogram exemplars. Out of scope.
- OTLP/protobuf. OTLP/JSON is the only serializer; OTLP/protobuf may be added later for collector-side parity.
- B3 and X-Ray propagators. Only W3C and Jaeger ship today.
- Logs signal. Logs continue via spdlog with trace correlation in the log format; the OpenTelemetry logs SDK is not wired.