8 changes: 8 additions & 0 deletions changelog/68999.added.md
@@ -0,0 +1,8 @@
Added OpenTelemetry distributed-tracing support across all Salt
inter-process hops (network and IPC). When `tracing.enabled` is true in the
master/minion config, Salt emits spans through an OTLP exporter and
propagates their context using W3C TraceContext. Coverage includes the CLI,
channel layer, master workers, minion command execution, event bus, reactor,
syndic forwarding, salt-ssh, and salt-api. Trace context travels inside the
AES-encrypted Salt envelope, so it remains opaque on the wire. Tracing is
opt-in and a complete no-op when disabled.
1 change: 1 addition & 0 deletions doc/contents.rst
@@ -20,6 +20,7 @@ Salt Table of Contents
topics/return_codes/index
topics/utils/index
topics/event/index
topics/tracing/index
topics/orchestrate/index
topics/solaris/index
topics/ssh/index
160 changes: 160 additions & 0 deletions doc/topics/tracing/index.rst
@@ -0,0 +1,160 @@
.. _tracing:

===================================
Distributed Tracing (OpenTelemetry)
===================================

Salt can emit OpenTelemetry spans for every inter-process hop, so a single
job (``salt '*' test.ping``) becomes a single distributed trace that crosses
the CLI, the master, the minion, the return path, and any reactor or syndic
forwarding in between.

The implementation uses standard W3C TraceContext (``traceparent`` /
``tracestate``) for propagation and ships spans through an OTLP exporter.
Jaeger ingests OTLP natively, as do most modern tracing backends
(Tempo, Honeycomb, Datadog OTLP, etc.).

Trace context propagates **inside** the AES-encrypted Salt envelope: an
attacker on the wire cannot see the trace headers, and authenticated
participants (master / minion / syndic) decode them after AES decryption.
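
As a rough illustration of what travels inside the envelope, the sketch below
builds and parses a version-00 W3C ``traceparent`` value using only the
standard library. The function names are hypothetical and are not taken from
Salt; Salt itself presumably relies on the OpenTelemetry propagator for this.

```python
import re

# Version-00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


def format_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a version-00 traceparent value (hypothetical helper)."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(value: str):
    """Return (trace_id, span_id, sampled), or None when the value is invalid.

    This sketch only accepts the four-field layout and rejects version "ff".
    """
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None or match["version"] == "ff":
        return None
    trace_id = int(match["trace_id"], 16)
    span_id = int(match["span_id"], 16)
    if trace_id == 0 or span_id == 0:  # all-zero IDs are invalid
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return trace_id, span_id, sampled
```

A receiver that gets ``None`` back would normally start a fresh root span
rather than fail the request.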

Tracing is **disabled by default** and is a complete no-op when not
configured. No spans are created, no exporter is initialised, and no
background threads are started.

Configuration
-------------

Add a ``tracing`` block to the master and minion configs. The block is
identical on both daemons, and applies to ``salt-cli``, ``salt-call``,
``salt-api`` and ``salt-ssh`` as well.

.. code-block:: yaml

    tracing:
      enabled: true
      exporter: otlp-http          # otlp-http | otlp-grpc | console
      endpoint: ""                 # OTel SDK default endpoint when empty
      service_name: ""             # auto-derived when empty
      sampler: parent_based        # parent_based | always_on | always_off | trace_id_ratio
      sampler_arg: 1.0
      resource_attributes: {}
      insecure: true               # gRPC TLS disabled (ignored for HTTP)
      headers: {}                  # OTLP authentication headers
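
To make the shape of the block concrete, here is a minimal sketch of how a
loaded config dict could be merged with defaults and validated. The function
name ``resolve_tracing_config`` is hypothetical. Only the defaults for
``enabled``, ``exporter`` and ``sampler`` are stated in this document; the
remaining values are copied from the sample block above and are assumptions.

```python
from copy import deepcopy

# enabled/exporter/sampler defaults are documented; the rest are assumed
# from the sample YAML block and may not match Salt's real defaults.
TRACING_DEFAULTS = {
    "enabled": False,
    "exporter": "otlp-http",
    "endpoint": "",
    "service_name": "",
    "sampler": "parent_based",
    "sampler_arg": 1.0,
    "resource_attributes": {},
    "insecure": True,
    "headers": {},
}
VALID_EXPORTERS = {"otlp-http", "otlp-grpc", "console"}
VALID_SAMPLERS = {"parent_based", "always_on", "always_off", "trace_id_ratio"}


def resolve_tracing_config(opts: dict) -> dict:
    """Merge opts['tracing'] over the defaults and check the enum-like keys."""
    cfg = deepcopy(TRACING_DEFAULTS)
    cfg.update(opts.get("tracing") or {})
    if cfg["exporter"] not in VALID_EXPORTERS:
        raise ValueError(f"unknown tracing exporter: {cfg['exporter']!r}")
    if cfg["sampler"] not in VALID_SAMPLERS:
        raise ValueError(f"unknown tracing sampler: {cfg['sampler']!r}")
    if not 0.0 <= float(cfg["sampler_arg"]) <= 1.0:
        raise ValueError("sampler_arg must be within [0.0, 1.0]")
    return cfg
```

Calling ``resolve_tracing_config({})`` yields a config with ``enabled`` set to
``False``, which matches the opt-in behaviour described earlier.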

``enabled``
    Master switch. When ``false`` (the default), tracing is a complete
    no-op.

``exporter``
    Which span exporter to use.

    - ``otlp-http`` (default): sends spans via HTTP/protobuf to
      ``endpoint``. Pure Python; ships in Salt's base requirements and
      works on every supported interpreter.
    - ``otlp-grpc``: sends spans via gRPC. Requires
      ``opentelemetry-exporter-otlp-proto-grpc`` to be installed separately
      (it pulls in ``grpcio``, which lacks prebuilt wheels for some
      platform / interpreter combinations).
    - ``console``: prints spans to stdout for debugging.

``endpoint``
    OTLP collector URL. When empty, the OTel SDK default is used
    (``http://localhost:4318/v1/traces`` for HTTP,
    ``http://localhost:4317`` for gRPC).

``service_name``
    The ``service.name`` resource attribute. When empty, Salt fills this in
    automatically: ``salt-master``, ``salt-minion-<id>``, ``salt-cli``,
    ``salt-call``, ``salt-api``.

``sampler``
    Which sampler to install on the ``TracerProvider``.

    - ``parent_based`` (default): follow the parent's sampling decision;
      root spans are sampled. Set ``sampler_arg`` below 1.0 to apply a
      ratio to root spans.
    - ``always_on``: sample every span.
    - ``always_off``: drop every span (testing only).
    - ``trace_id_ratio``: sample a ``sampler_arg`` fraction of trace IDs.

``resource_attributes``
    Extra attributes merged into the OTel Resource (e.g.
    ``deployment.environment: prod``).

``insecure``
    Disable gRPC TLS to the collector. Ignored for the HTTP exporter.

``headers``
    Additional headers sent on every OTLP request, e.g.
    ``Authorization: Bearer <token>`` for a hosted collector.
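
The ``trace_id_ratio`` and ``parent_based`` samplers described above can be
pictured with the following sketch. It follows the usual OpenTelemetry
approach of comparing the low 64 bits of the trace ID against a bound derived
from the ratio, so every process reaches the same decision for a given trace
without coordination. The function names are hypothetical, and this is an
illustration rather than Salt's or the SDK's actual code.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the trace ID are compared


def ratio_bound(ratio: float) -> int:
    """Translate a sampling ratio into an exclusive upper bound on the low bits."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0.0, 1.0]")
    return round(ratio * (TRACE_ID_LIMIT + 1))


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gives the same answer."""
    return (trace_id & TRACE_ID_LIMIT) < ratio_bound(ratio)


def parent_based_sampled(parent_sampled, trace_id: int, root_ratio: float = 1.0) -> bool:
    """Follow the parent's decision; fall back to the ratio for root spans.

    parent_sampled is None for a root span, otherwise the parent's sampled flag.
    """
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, root_ratio)
```

With ``root_ratio=1.0`` every root span is sampled, which corresponds to the
default ``parent_based`` behaviour with ``sampler_arg: 1.0``.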

Hops covered
------------

A single ``salt '*' test.ping`` produces a trace spanning at least:

1. ``salt.cli.test.ping`` — root span on the CLI.
2. ``salt.req.send.publish`` — CLI → master request.
3. ``salt.req.recv.publish`` — master receives the request.
4. ``salt.pub.send`` — master publishes the job.
5. ``salt.minion.recv.test.ping`` — minion receives the published command.
6. ``salt.minion.exec.test.ping`` — minion executes the function.
7. ``salt.req.send._return`` — minion returns to master.
8. ``salt.req.recv._return`` — master receives the return.
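
The hop list above can be modelled as a chain of spans that all share one
trace ID. The sketch below assumes, purely for illustration, that each hop is
parented on the previous one; the real parent/child topology may branch (for
example, one minion span per targeted minion). ``Span``, ``start_span`` and
``build_chain`` are hypothetical names, not Salt APIs.

```python
import secrets
from dataclasses import dataclass
from typing import List, Optional

HOPS = [
    "salt.cli.test.ping",
    "salt.req.send.publish",
    "salt.req.recv.publish",
    "salt.pub.send",
    "salt.minion.recv.test.ping",
    "salt.minion.exec.test.ping",
    "salt.req.send._return",
    "salt.req.recv._return",
]


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: int
    span_id: int
    parent_id: Optional[int]


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Root spans get a fresh trace ID; child spans inherit the parent's."""
    trace_id = parent.trace_id if parent else (secrets.randbits(128) or 1)
    span_id = secrets.randbits(64) or 1  # zero is not a valid span ID
    return Span(name, trace_id, span_id, parent.span_id if parent else None)


def build_chain(names: List[str]) -> List[Span]:
    """Simplified model: every hop is a child of the hop before it."""
    spans: List[Span] = []
    parent: Optional[Span] = None
    for name in names:
        parent = start_span(name, parent)
        spans.append(parent)
    return spans
```

Because only the trace ID and the parent's span ID need to cross each process
boundary, a single ``traceparent`` value is enough to stitch the hops
together.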

Other instrumented hops:

- Event bus (``fire_event`` / ``get_event``): every IPC and TCP-IPC event
  carries trace context in its data dict.
- Reactor: extracts trace context from incoming events and parents the
  reaction span on the span that fired the event.
- Syndic forwarding: both inbound (from upstream master) and outbound (to
  downstream minions).
- Salt-SSH: propagates trace context as the ``TRACEPARENT`` environment
  variable on the remote shim.
- Salt-API: extracts the ``traceparent`` HTTP header from incoming
  requests; webhooks inject context into the events they fire.
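
The hops above differ mainly in the carrier that holds the context: an event
data dict, an environment mapping, or HTTP headers. A minimal sketch of one
inject/extract pair that works across all three is shown below. The helper
names are hypothetical, and the key used inside Salt's event data dict is not
given in this document, so ``traceparent`` is an assumption there.

```python
from typing import Mapping, Optional


def inject(carrier: Mapping[str, str], traceparent: str, key: str = "traceparent") -> dict:
    """Return a copy of the carrier with the trace context added.

    Use key="TRACEPARENT" for an environment mapping (the Salt-SSH case).
    """
    out = dict(carrier)
    out[key] = traceparent
    return out


def extract(carrier: Mapping[str, str]) -> Optional[str]:
    """Find the trace context regardless of key case.

    HTTP header names are case-insensitive and the environment variable is
    upper-case, so the lookup ignores case.
    """
    for key, value in carrier.items():
        if isinstance(key, str) and key.lower() == "traceparent" and value:
            return value
    return None
```

For example, ``inject(os.environ, tp, key="TRACEPARENT")`` would produce an
environment suitable for passing to a child process, and ``extract`` on the
receiving side returns ``None`` when no context was propagated.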

Running a quick demo
--------------------

Spin up an all-in-one Jaeger:

.. code-block:: bash

    docker run -d --name jaeger \
      -p 16686:16686 -p 4318:4318 \
      jaegertracing/all-in-one:latest

Configure master + minion with:

.. code-block:: yaml

    tracing:
      enabled: true
      exporter: otlp-http
      endpoint: http://localhost:4318/v1/traces
      sampler: always_on

Start them, run ``salt '*' test.ping``, then visit
``http://localhost:16686`` and search for the ``salt-cli`` service. You
should see a single trace with spans hanging off three services:
``salt-cli``, ``salt-master`` and ``salt-minion-<id>``.

Fork handling
-------------

The OTel ``BatchSpanProcessor`` runs a background thread that does not
survive ``fork()``. Salt rebuilds the provider in every forked child the
first time a tracing API is invoked, so worker processes spun up by the
master / minion get their own functioning exporter without any caller
action. Unflushed spans queued by the parent at the instant of fork may
be lost; for short-lived spans this is rarely visible, but if you observe
gaps, consider shortening the ``BatchSpanProcessor`` export interval through
the standard OTel environment variables (for example
``OTEL_BSP_SCHEDULE_DELAY``).
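
One common way to get this "rebuild on first use after fork" behaviour is to
remember the PID that built the provider and compare it on every access. The
sketch below shows that pattern with a hypothetical ``ForkAwareProvider``
class; it is an assumption about the general technique, not a copy of Salt's
implementation.

```python
import os
from typing import Any, Callable, Optional


class ForkAwareProvider:
    """Lazily build a per-process object and rebuild it when the PID changes."""

    def __init__(self, factory: Callable[[], Any], getpid: Callable[[], int] = os.getpid):
        self._factory = factory
        self._getpid = getpid  # injectable so the behaviour can be tested
        self._pid: Optional[int] = None
        self._instance: Any = None

    def get(self) -> Any:
        pid = self._getpid()
        if self._instance is None or pid != self._pid:
            # First use in this process: the parent's exporter thread did not
            # survive fork(), so build a fresh instance here.
            self._instance = self._factory()
            self._pid = pid
        return self._instance
```

The PID check costs one system call per access; registering a hook with
``os.register_at_fork`` is an alternative that moves the work to fork time
instead.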

Payload overhead
----------------

When tracing is enabled and a recording span is active, every Salt request
and event grows by roughly 60 bytes (the W3C ``traceparent`` string).
When no recording span is active — for example, an internal periodic event
fired outside a request handler — no headers are added.
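
The size estimate can be checked with a little arithmetic. The sketch below
derives the length of a version-00 ``traceparent`` value from its field
widths and measures how much a payload grows when one entry is added. JSON is
used only as a standard-library stand-in for Salt's real serialisation, and
the ``traceparent`` key name is an assumption, so the measured growth is
indicative rather than exact.

```python
import json

# Field widths of a version-00 traceparent: version, trace-id, parent-id, flags.
FIELD_WIDTHS = (2, 32, 16, 2)
# Fields are joined by single dashes, so add one separator between each pair.
TRACEPARENT_LEN = sum(FIELD_WIDTHS) + (len(FIELD_WIDTHS) - 1)

EXAMPLE = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"


def added_bytes(payload: dict, key: str = "traceparent") -> int:
    """Bytes added to a JSON-encoded payload by one trace-context entry."""
    before = len(json.dumps(payload).encode("utf-8"))
    after = len(json.dumps({**payload, key: EXAMPLE}).encode("utf-8"))
    return after - before
```

The value alone is 55 ASCII characters; the key name and the encoding's
framing account for the rest of the overhead.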
3 changes: 3 additions & 0 deletions requirements/base.txt
@@ -30,6 +30,9 @@ MarkupSafe<3.0.0
more-itertools>=9.1.0
msgpack>=1.0.0 ; python_version < '3.13'
msgpack>=1.1.0 ; python_version >= '3.13'
opentelemetry-api>=1.30.0
opentelemetry-sdk>=1.30.0
opentelemetry-exporter-otlp-proto-http>=1.30.0
xxhash>=3.0.0
# Packaging 24.1 imports annotations from __future__ which breaks salt ssh
# tests on target hosts with older python versions.
52 changes: 52 additions & 0 deletions requirements/static/ci/py3.10/cloud.txt
@@ -211,6 +211,11 @@ gitpython==3.1.46
# -c requirements/static/pkg/py3.10/linux.txt
# -r requirements/base.txt
# -r requirements/static/ci/common.in
googleapis-common-protos==1.75.0
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# opentelemetry-exporter-otlp-proto-http
idna==3.7
# via
# -c requirements/static/ci/py3.10/linux.txt
@@ -226,6 +231,7 @@ importlib-metadata==8.7.0
# -c requirements/static/pkg/py3.10/linux.txt
# -r requirements/base.txt
# -r requirements/static/pkg/linux.in
# opentelemetry-api
iniconfig==2.0.0
# via
# -c requirements/static/ci/py3.10/linux.txt
@@ -371,6 +377,41 @@ oauthlib==3.3.1
# via
# -c requirements/static/ci/py3.10/linux.txt
# requests-oauthlib
opentelemetry-api==1.41.1
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# -r requirements/base.txt
# opentelemetry-exporter-otlp-proto-http
# opentelemetry-sdk
# opentelemetry-semantic-conventions
opentelemetry-exporter-otlp-proto-common==1.41.1
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# opentelemetry-exporter-otlp-proto-http
opentelemetry-exporter-otlp-proto-http==1.41.1
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# -r requirements/base.txt
opentelemetry-proto==1.41.1
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# opentelemetry-exporter-otlp-proto-common
# opentelemetry-exporter-otlp-proto-http
opentelemetry-sdk==1.41.1
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# -r requirements/base.txt
# opentelemetry-exporter-otlp-proto-http
opentelemetry-semantic-conventions==0.62b1
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# opentelemetry-sdk
oscrypto==1.3.0
# via
# -c requirements/static/ci/py3.10/linux.txt
@@ -415,6 +456,12 @@ propcache==0.4.1
# -c requirements/static/pkg/py3.10/linux.txt
# aiohttp
# yarl
protobuf==6.33.6
# via
# -c requirements/static/ci/py3.10/linux.txt
# -c requirements/static/pkg/py3.10/linux.txt
# googleapis-common-protos
# opentelemetry-proto
psutil==5.9.6
# via
# -c requirements/static/ci/py3.10/linux.txt
@@ -610,6 +657,7 @@ requests==2.31.0
# etcd3-py
# kubernetes
# moto
# opentelemetry-exporter-otlp-proto-http
# profitbricks
# pywinrm
# requests-ntlm
@@ -747,6 +795,10 @@ typing-extensions==4.15.0
# aiosignal
# cryptography
# multidict
# opentelemetry-api
# opentelemetry-exporter-otlp-proto-http
# opentelemetry-sdk
# opentelemetry-semantic-conventions
# pyopenssl
# pytest-system-statistics
# virtualenv
44 changes: 44 additions & 0 deletions requirements/static/ci/py3.10/darwin.txt
@@ -159,6 +159,10 @@ gitpython==3.1.46
# -r requirements/base.txt
# -r requirements/static/ci/common.in
# -r requirements/static/ci/darwin.in
googleapis-common-protos==1.75.0
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# opentelemetry-exporter-otlp-proto-http
hglib==2.6.2
# via -r requirements/static/ci/darwin.in
idna==3.7
@@ -173,6 +177,7 @@ importlib-metadata==8.7.0
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# -r requirements/base.txt
# opentelemetry-api
iniconfig==2.0.0
# via pytest
invoke==2.2.1
@@ -272,6 +277,35 @@ ncclient==0.7.0
# junos-eznc
oauthlib==3.3.1
# via requests-oauthlib
opentelemetry-api==1.41.1
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# -r requirements/base.txt
# opentelemetry-exporter-otlp-proto-http
# opentelemetry-sdk
# opentelemetry-semantic-conventions
opentelemetry-exporter-otlp-proto-common==1.41.1
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# opentelemetry-exporter-otlp-proto-http
opentelemetry-exporter-otlp-proto-http==1.41.1
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# -r requirements/base.txt
opentelemetry-proto==1.41.1
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# opentelemetry-exporter-otlp-proto-common
# opentelemetry-exporter-otlp-proto-http
opentelemetry-sdk==1.41.1
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# -r requirements/base.txt
# opentelemetry-exporter-otlp-proto-http
opentelemetry-semantic-conventions==0.62b1
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# opentelemetry-sdk
oscrypto==1.3.0
# via certvalidator
packaging==24.0
@@ -304,6 +338,11 @@ propcache==0.4.1
# -c requirements/static/pkg/py3.10/darwin.txt
# aiohttp
# yarl
protobuf==6.33.6
# via
# -c requirements/static/pkg/py3.10/darwin.txt
# googleapis-common-protos
# opentelemetry-proto
psutil==5.9.6
# via
# -c requirements/static/pkg/py3.10/darwin.txt
@@ -440,6 +479,7 @@ requests==2.31.0
# etcd3-py
# kubernetes
# moto
# opentelemetry-exporter-otlp-proto-http
# requests-oauthlib
# responses
# vcert
@@ -519,6 +559,10 @@ typing-extensions==4.15.0
# aiosignal
# cryptography
# multidict
# opentelemetry-api
# opentelemetry-exporter-otlp-proto-http
# opentelemetry-sdk
# opentelemetry-semantic-conventions
# pyopenssl
# pytest-system-statistics
# virtualenv