Add OpenTelemetry distributed tracing across all Salt hops#69145
Open
dwoz wants to merge 7 commits into
Open
Conversation
twangboy
previously approved these changes
May 14, 2026
added 7 commits
May 16, 2026 01:09
Instrument every inter-process hop (network and IPC) with W3C TraceContext-propagated spans exported via OTLP. Tracing is opt-in via ``opts['tracing']['enabled']`` (default ``false``) — when disabled the new ``salt.utils.tracing`` module is a complete no-op: no exporter, no background threads, no payload bloat. The channel layer is the single chokepoint for all three transports (zeromq / tcp / ws): ``_package_load`` injects the traceparent inside the AES-encrypted load before ``session_crypticle.dumps``, and ``ReqServerChannel.handle_message`` extracts it after decryption and wraps ``payload_handler`` dispatch in a server span. This keeps trace context invisible to wire eavesdroppers while making it available to authenticated participants on both sides. Hops covered: CLI / salt-call root spans, LocalClient → master, master → minion publish, minion → master return, minion execution, event bus (``fire_event`` injects into ``data`` dict — covers IPC and TCP-IPC uniformly), reactor dispatch, syndic forwarding, salt-ssh (``TRACEPARENT`` env var prepended to the shim invocation), and salt-api (extracts ``traceparent`` HTTP header; webhooks propagate into fired events). ``BatchSpanProcessor``'s background thread does not survive ``fork``, so the provider is rebuilt per-PID: every public tracing API calls ``_ensure_tracer`` which detects a PID change and rebuilds. Forked workers (MWorker, minion executors, reactor) explicitly call ``tracing.configure(opts)`` at the top of their entry point. Configuration block under ``tracing``: ``enabled``, ``exporter`` (otlp-grpc / otlp-http / console), ``endpoint``, ``service_name`` (auto-derived from role when empty), ``sampler``, ``sampler_arg``, ``resource_attributes``, ``insecure``, ``headers``. Defaults added to both ``DEFAULT_MASTER_OPTS`` and ``DEFAULT_MINION_OPTS``. Tests cover the module surface, no-op fallback, fork isolation, channel-layer injection inside the AES envelope (not on the outer wrapper), event-bus round-trip and reactor extraction.
The Generate MAN Pages step builds sphinx with ``-W`` (warnings as errors). The new ``doc/topics/tracing/index.rst`` triggered two: * The title overline/underline were one character shorter than the title text, which sphinx reports as ``WARNING: Title overline too short``. Extend both delimiter lines to the full length of the title. * The page wasn't referenced in any toctree, producing ``WARNING: document isn't included in any toctree``. Add it to the main ``contents.rst`` between the event-bus and orchestrate sections.
The CI onedir / source-package builds (run 25888742458) failed on every matrix slot that pins Python 3.14: ``grpcio`` 1.80.0 has no manylinux wheel for 3.14, so pip falls back to a source build inside the salt-onedir build environment and the bundled BoringSSL ASM trips the toolchain's ``-Werror=implicit-function-declaration``. ``grpcio`` was pulled in transitively via ``opentelemetry-exporter-otlp-proto-grpc``, which we had listed in ``requirements/base.txt``. Remove the gRPC exporter from base requirements and keep only ``opentelemetry-exporter-otlp-proto-http``, which is pure Python, has no native deps, and ships wheels for every supported interpreter. Most collectors (Jaeger included) ingest OTLP/HTTP natively, so this is a near-zero-cost UX change. The gRPC exporter is still selectable: users can install ``opentelemetry-exporter-otlp-proto-grpc`` themselves and set ``tracing.exporter: otlp-grpc``. The import now lives inside ``_build_exporter`` as a lazy ``try/except ImportError`` so a missing optional package degrades to a logged error instead of breaking module import or daemon startup. Default exporter switched to ``otlp-http`` in both ``DEFAULT_MASTER_OPTS`` and ``DEFAULT_MINION_OPTS``. Docs and example Jaeger demo updated (port 4318/v1/traces). New unit test covers the graceful-when-grpc-missing path. Static lockfiles regenerated by pre-commit to drop ``grpcio`` and the gRPC exporter package.
CI run 25901418243 surfaced 39 failures spread across three buckets
(Windows MSI/NSIS upgrade-downgrade tests, unit zeromq shard 4, and
integration zeromq/tcp shard 1). All of them share the same backtrace:
File ".../salt/utils/tracing.py", line 46, in <module>
from opentelemetry import context as otel_context
ModuleNotFoundError: No module named 'opentelemetry'
The hard import at module top broke three real environments:
* the salt-ssh **thin tarball**, which by design only ships salt's own
source — no third-party dependencies follow it to the remote host;
* **older installed onedirs** (3007.14 / 3008.0rc3) whose bundled Python
does not yet contain opentelemetry, exercised by the upgrade /
downgrade test matrix;
* **test_thin_dir** which builds the same thin tarball locally and
invokes it via the host Python.
Wrap every opentelemetry import in ``try / except ImportError`` and
guard module state with an ``_OTEL_AVAILABLE`` flag. When the package
is missing:
* ``is_enabled()`` always returns False, regardless of opts.
* ``start_span(...)`` returns the singleton ``_NOOP_SPAN``.
* ``inject()`` / ``extract()`` short-circuit.
* ``configure()`` logs one warning if a user has ``enabled: true`` but
opentelemetry is absent, then becomes a no-op.
* ``SpanKind`` is a duck-typed stub (``SpanKind.SERVER == "SERVER"``)
so the call sites that pass it through continue to work.
* ``_NoopSpan.get_span_context()`` returns a duck-typed invalid context.
The "tracing is fully opt-in, never affects performance unless
configured" contract is preserved.
A subprocess-isolated unit test
(``test_module_works_when_opentelemetry_missing``) installs a
meta_path finder that raises ImportError for any opentelemetry import,
re-imports ``salt.utils.tracing``, and asserts every public API stays
silent. Subprocess isolation is necessary because reloading the
module in-process would leave other modules holding references to the
otel-blocked variant.
The rebase onto ``origin/3008.x`` resolved its conflicts with ``-X theirs``, which preserved my local view of the lockfiles instead of letting pip-compile reconcile them with the upstream urllib3==2.7.0 bump. CI's Pre-Commit job flagged every Py3.13 and Py3.14 packaging / CI / lint requirements file as needing regeneration. Re-run pip-compile via pre-commit locally; the resulting files agree with both the new urllib3 pin and the trimmed opentelemetry-without-grpc top-level set, so the Pre-Commit job will now pass.
``Build Source Packages / DEB (arm64)`` consistently fails on CI run
25947204634 (and the prior 25941045430) at the ``pip install`` step
inside ``debuild`` with::
/tmp/ccuWvGgq.s:653: Error: unknown mnemonic `bti' -- `bti jc'
...
Failed to build protobuf
``protobuf`` is a new transitive dependency introduced by
``opentelemetry-exporter-otlp-proto-http``
(via ``opentelemetry-proto`` → ``protobuf<7.0,>=5.0``). The onedir
install in :mod:`tools.pkg.build` passes ``--no-binary=:all:`` plus an
explicit ``--only-binary`` allowlist, and ``protobuf`` isn't on that
allowlist — so pip falls back to building the C/C++ extension from
source. On Python 3.14 aarch64 the bundled BoringSSL assembly uses the
ARMv8.5 ``bti`` instruction; the GAS shipped with the relenv aarch64
toolchain is too old to recognise that mnemonic, and the build fails.
``protobuf`` 6.33.x ships a ``cp39-abi3-manylinux2014_aarch64.whl``
(stable-ABI wheel that works on every supported interpreter including
3.14), so allowing the binary wheel resolves the failure cleanly without
toolchain or runtime changes. x86_64 source builds happen to succeed
today but go through the same fragile ASM path, so adding the package
to the allowlist also reduces blast radius there.
Add ``protobuf`` to both the cross-platform and the Linux/macOS
``--only-binary`` lists in :func:`onedir_dependencies`.
The rebase onto current ``origin/3008.x`` used ``-X theirs`` to auto- resolve conflicts in the static requirements lockfiles (origin bumped ``certifi``, ``croniter`` and a handful of other transitive pins on top of the earlier urllib3 update). That preserves my view of the lockfiles instead of letting ``pip-compile`` reconcile them, so the Pre-Commit CI hook would flag drift. Re-run pip-compile via pre-commit locally; the resulting files agree with both the new upstream pins and the trimmed opentelemetry-without- grpc top-level set, so the Pre-Commit job stays green.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Instrument every inter-process hop (network and IPC) with W3C TraceContext-propagated spans exported via OTLP. Tracing is opt-in via
opts['tracing']['enabled'](defaultfalse) — when disabled the newsalt.utils.tracingmodule is a complete no-op: no exporter, no background threads, no payload bloat.The channel layer is the single chokepoint for all three transports (zeromq / tcp / ws):
_package_loadinjects the traceparent inside the AES-encrypted load beforesession_crypticle.dumps, andReqServerChannel.handle_messageextracts it after decryption and wrapspayload_handlerdispatch in a server span. This keeps trace context invisible to wire eavesdroppers while making it available to authenticated participants on both sides.Hops covered: CLI / salt-call root spans, LocalClient → master, master → minion publish, minion → master return, minion execution, event bus (
fire_eventinjects intodatadict — covers IPC and TCP-IPC uniformly), reactor dispatch, syndic forwarding, salt-ssh (TRACEPARENTenv var prepended to the shim invocation), and salt-api (extractstraceparentHTTP header; webhooks propagate into fired events).BatchSpanProcessor's background thread does not survivefork, so the provider is rebuilt per-PID: every public tracing API calls_ensure_tracerwhich detects a PID change and rebuilds. Forked workers (MWorker, minion executors, reactor) explicitly calltracing.configure(opts)at the top of their entry point.Configuration block under
tracing:enabled,exporter(otlp-grpc / otlp-http / console),endpoint,service_name(auto-derived from role when empty),sampler,sampler_arg,resource_attributes,insecure,headers. Defaults added to bothDEFAULT_MASTER_OPTSandDEFAULT_MINION_OPTS.Tests cover the module surface, no-op fallback, fork isolation, channel-layer injection inside the AES envelope (not on the outer wrapper), event-bus round-trip and reactor extraction.