Epic: Enhanced distributed tracing #444

@psschwei

Overview

This epic refreshes the tracing telemetry plan to bring it in line with the canonical plugin/hook pattern established by the metrics (#443) and logging (#442) epics, both closed 2026-04-27.

The original plan (2026-02-10) and its nine sub-issues (#469–#477) predate two major refactors that reshape tracing's design surface:

The original sub-issues have been retired and replaced by the phased plan below. See the companion comment on this epic for a feature-level mapping of retired issues to their replacements.

Current state (2026-05-07)

  • Tracing lives in mellea/telemetry/tracing.py and mellea/telemetry/backend_instrumentation.py.
  • Uses direct inline instrumentation in all 5 backends (~20 call sites) — no plugin/hook usage.
  • No tracing_plugins.py; tracing is the only telemetry pillar still on pre-plugin patterns.
  • Stores the active backend span in mot._meta["_telemetry_span"] so it survives coroutine boundaries. This collides with item 2 of #909 (ModelOutputThunk structural cleanup: ._meta partitioning, raw responses, _thinking).
  • Reads raw provider responses for attribute extraction instead of mot.generation (the normalized GenerationMetadata).
  • Uses the deprecated gen_ai.system attribute (current GenAI semconv is gen_ai.provider.name, which metrics uses). @planetf1's PR #1036 (feat(telemetry): close five OTel GenAI semantic convention emission gaps (#1035)) is adding dual emission.
  • Env vars split and non-uniform: MELLEA_TRACE_APPLICATION, MELLEA_TRACE_BACKEND, MELLEA_TRACE_CONSOLE. No single umbrella flag. Only honors generic OTEL_EXPORTER_OTLP_ENDPOINT; no OTEL_EXPORTER_OTLP_TRACES_ENDPOINT support.
  • Initialized at module import; tests require importlib.reload (vs logging's lazy init).
  • Error-path span closure happens in core/base.py:520–541 before the generation_error hook fires — would block a tracing plugin hooking that event.
  • Missing span coverage per current semconv: tool calls, streaming events (first-token / chunk / complete), chat content events, gen_ai.conversation.id, most gen_ai.request.* parameters, SpanKind.CLIENT on backend calls.
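
To illustrate the import-time initialization problem noted above, here is a minimal sketch of the lazy alternative, where the environment is read on first use and tests reset a cache instead of calling importlib.reload. The names TracingConfig, get_tracing_config, and reset_tracing_config are hypothetical and are not Mellea's actual API; the set of truthy values is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TracingConfig:
    """Hypothetical snapshot of tracing settings (not Mellea's real type)."""

    application: bool
    backend: bool
    console: bool


# Cached config; None means "not initialized yet".
_config: Optional[TracingConfig] = None


def _flag(name: str) -> bool:
    # Assumption: "1", "true", or "yes" (case-insensitive) enables a flag.
    return os.environ.get(name, "").strip().lower() in {"1", "true", "yes"}


def get_tracing_config() -> TracingConfig:
    """Read the environment on first call instead of at module import."""
    global _config
    if _config is None:
        _config = TracingConfig(
            application=_flag("MELLEA_TRACE_APPLICATION"),
            backend=_flag("MELLEA_TRACE_BACKEND"),
            console=_flag("MELLEA_TRACE_CONSOLE"),
        )
    return _config


def reset_tracing_config() -> None:
    """Test hook: drop the cache so the next call re-reads the environment."""
    global _config
    _config = None
```

With this shape, a test can set an environment variable, call reset_tracing_config(), and observe the new value without reloading the module.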

Scope

Phase 1 — Foundation (serial, blocking)

These must land before Phase 2/3. Likely one PR, separate tracking issues for clarity.

Phase 2 — Coverage (parallelizable once Phase 1 lands)

Phase 3 — Polish

Acceptance Criteria

  • mellea/telemetry/tracing_plugins.py exists and parallels metrics_plugins.py in shape.
  • No backend file imports tracing helpers directly.
  • mot._meta["_telemetry_span"] is removed.
  • Env vars renamed to plural MELLEA_TRACES_* (aligned with the plural TRACES in the OTel standard variables, e.g. OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, and with MELLEA_METRICS_*). Old MELLEA_TRACE_* names emit deprecation warnings for one release.
  • All emitted gen_ai.* attributes match current OTel GenAI semconv (gen_ai.provider.name, not gen_ai.system).
  • Span events emitted for streaming milestones (TTFB, first token, complete).
  • Tool calls produce spans via tool_post_invoke.
  • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is respected when set.
  • Docs example for full traces lands (#945, docs: add Telemetry example for getting full traces).
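
The env var criteria above could be met along these lines. This is a sketch under assumptions, not the planned implementation: the helper names read_trace_var and resolve_traces_endpoint are hypothetical, and it assumes each new name is simply the old one with TRACE replaced by TRACES (the issue only specifies the MELLEA_TRACES_* prefix).

```python
import os
import warnings
from typing import Mapping, Optional


def read_trace_var(suffix: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer MELLEA_TRACES_<suffix>; fall back to the old name with a warning.

    Assumption: suffixes carry over unchanged (e.g. BACKEND, CONSOLE).
    """
    new_name = f"MELLEA_TRACES_{suffix}"
    old_name = f"MELLEA_TRACE_{suffix}"
    if new_name in env:
        return env[new_name]
    if old_name in env:
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[old_name]
    return None


def resolve_traces_endpoint(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the traces-specific OTLP endpoint if set, else the generic one."""
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific
    # Empty strings are treated as unset.
    return env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None
```

Passing the environment as a mapping keeps both helpers testable without mutating os.environ. The sketch only covers precedence; under the OTLP exporter specification, an OTLP/HTTP exporter appends a per-signal path to the generic endpoint but uses the signal-specific endpoint as given, and that handling is left out here.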

Coordination

Related / Retired
