.NET: Fixing issue where OpenTelemetry span is never exported in .NET in-process workflow execution #4196
…ity never stopped in streaming OffThread path

The WorkflowRunActivity_IsStopped_Streaming_OffThread test demonstrates that the workflow.run OpenTelemetry Activity created in StreamingRunEventStream.RunLoopAsync is started but never stopped when using the OffThread/Default streaming execution. The background run loop keeps running after event consumption completes, so the using Activity? declaration never disposes until explicit StopAsync() is called.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2. Fix workflow.run Activity never stopped in streaming OffThread execution (microsoft#4155)

The workflow.run OpenTelemetry Activity in StreamingRunEventStream.RunLoopAsync was scoped to the method lifetime via 'using'. Since the run loop only exits on cancellation, the Activity was never stopped/exported until explicit disposal.

Fix: Remove 'using' and explicitly dispose the Activity when the workflow reaches Idle status (all supersteps complete). A safety-net disposal in the finally block handles cancellation and error paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
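The disposal-timing problem described in the commit message can be illustrated with a minimal, synchronous sketch. It is not the PR's code: `ActivityScopeSketch`, the source name `Sketch.Workflows.Run`, and the `disposeAtIdle` switch are hypothetical, and the real run loop is asynchronous. The sketch only shows that a span left to end-of-scope disposal is still open at the moment the loop goes idle, while an explicitly disposed one has already been reported to listeners.

```csharp
using System;
using System.Diagnostics;

public static class ActivityScopeSketch
{
    // Hypothetical source name, used only for this illustration.
    private static readonly ActivitySource Source = new("Sketch.Workflows.Run");

    // Returns whether the span had already stopped when the loop reached "idle".
    public static bool StoppedAtIdle(bool disposeAtIdle)
    {
        bool stopped = false;
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
            ActivityStopped = _ => { stopped = true; },
        };
        ActivitySource.AddActivityListener(listener);

        Activity? activity = Source.StartActivity("workflow.run");
        try
        {
            // ... supersteps would run here; assume the workflow is now idle ...
            if (disposeAtIdle)
            {
                activity?.Dispose(); // stops the span now, so exporters can see it
                activity = null;
            }

            // Observed while a long-lived loop would still be running.
            return stopped;
        }
        finally
        {
            activity?.Dispose(); // safety net for cancellation and error paths
        }
    }
}
```

With `disposeAtIdle: false` the span only stops in the `finally` block, which in a loop that exits only on cancellation may be arbitrarily late.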
Pull request overview
This PR aims to ensure OpenTelemetry workflow-run spans (Activity) are reliably stopped/disposed (and therefore exported) during .NET in-process workflow execution, including streaming scenarios, and adds regression tests around activity lifecycle behavior.
Changes:
- Updated `StreamingRunEventStream.RunLoopAsync` to manually manage the workflow-run `Activity` lifecycle (stop on `Idle` and ensure disposal on loop exit).
- Added `WorkflowRunActivityStopTests` to assert workflow-run activities are started and stopped across multiple execution modes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| dotnet/src/Microsoft.Agents.AI.Workflows/Execution/StreamingRunEventStream.cs | Changes workflow-run Activity disposal timing to stop/export spans earlier and adds a safety-net disposal on exit. |
| dotnet/tests/Microsoft.Agents.AI.Workflows.UnitTests/WorkflowRunActivityStopTests.cs | Adds regression coverage validating workflow-run activities are stopped/disposed in lockstep, off-thread, and streaming usage. |
…

Implements two-level telemetry hierarchy per PR feedback from lokitoth:
- workflow.session: spans the entire run loop / stream lifetime
- workflow_invoke: per input-to-halt cycle, nested within the session

This ensures the session activity stays open across multiple turns,
while individual run activities are created and disposed per cycle.

Also fixes linkedSource CancellationTokenSource disposal leak in
StreamingRunEventStream (added using declaration).
@copilot open a new pull request to apply changes based on the comments in this thread
…dd error tag

1. LockstepRunEventStream: Remove 'using' from Activity in async iterator
   and manually dispose in finally block (fixes microsoft#4155 pattern). Also dispose
   linkedSource CTS in finally to prevent leak.
2. Tags.cs: Add ErrorMessage ("error.message") tag for runtime errors,
   distinct from BuildErrorMessage ("build.error.message").
3. ActivityNames: Rename WorkflowRun from "workflow_invoke" to "workflow.run"
   for cross-language consistency.
4. WorkflowTelemetryContext: Fix XML doc to say "outer/parent span" instead
   of "root-level span".
5. ObservabilityTests: Assert WorkflowSession absence when DisableWorkflowRun
   is true.
6. WorkflowRunActivityStopTests: Fix streaming test race by disposing
   StreamingRun before asserting activities are stopped.
7. StreamingRunEventStream/LockstepRunEventStream: Use Tags.ErrorMessage
   instead of Tags.BuildErrorMessage for runtime error events.
…urce, move SessionStarted earlier

- Revert ActivityNames.WorkflowRun back to "workflow_invoke" (OTEL semantic convention contract)
- Use 'using' declaration for linkedSource CTS in LockstepRunEventStream (no timing sensitivity)
- Move SessionStarted event before WaitForInputAsync in StreamingRunEventStream to match Lockstep behavior
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
dotnet/src/Microsoft.Agents.AI.Workflows/Execution/StreamingRunEventStream.cs:123
- In `RunLoopAsync`, a new `runActivity` is created on every iteration of the outer `while` loop, but `InputWaiter.WaitForInputAsync(TimeSpan.FromSeconds(1), ...)` returns after the timeout even when no input was signaled. This can emit repeated `workflow_invoke` spans (and flip `_runStatus` back to `Running`) during idle periods with no new input, creating noisy/incorrect telemetry and potential performance issues. Consider having `WaitForInputAsync` surface whether it was actually signaled (e.g., return the `bool` from `SemaphoreSlim.WaitAsync(timeout, token)`), and only start a new run-stage activity / set `_runStatus = Running` when real input arrives (or when there are unprocessed messages).
    while (!linkedSource.Token.IsCancellationRequested)
    {
        // Start a new run-stage activity for this input→processing→halt cycle
        runActivity = this._stepRunner.TelemetryContext.StartWorkflowRunActivity();
        runActivity?.SetTag(Tags.WorkflowId, this._stepRunner.StartExecutorId)
            .SetTag(Tags.SessionId, this._stepRunner.SessionId);
        runActivity?.AddEvent(new ActivityEvent(EventNames.WorkflowStarted));
        // Run all available supersteps continuously
        // Events are streamed out in real-time as they happen via the event handler
        while (this._stepRunner.HasUnprocessedMessages && !linkedSource.Token.IsCancellationRequested)
        {
            await this._stepRunner.RunSuperStepAsync(linkedSource.Token).ConfigureAwait(false);
        }
        // Update status based on what's waiting
        this._runStatus = this._stepRunner.HasUnservicedRequests
            ? RunStatus.PendingRequests
            : RunStatus.Idle;
        // Signal completion to consumer so they can check status and decide whether to continue
        // Increment epoch so next consumer iteration gets a new completion signal
        // Capture the status at this moment to avoid race conditions with event reading
        int currentEpoch = Interlocked.Increment(ref this._completionEpoch);
        RunStatus capturedStatus = this._runStatus;
        await this._eventChannel.Writer.WriteAsync(new InternalHaltSignal(currentEpoch, capturedStatus), linkedSource.Token).ConfigureAwait(false);
        // Close the run-stage activity when processing halts.
        // A new run activity will be created when the next input arrives.
        if (runActivity is not null)
        {
            runActivity.AddEvent(new ActivityEvent(EventNames.WorkflowCompleted));
            runActivity.Dispose();
            runActivity = null;
        }
        // Wait for next input from the consumer
        // Works for both Idle (no work) and PendingRequests (waiting for responses)
        await this._inputWaiter.WaitForInputAsync(TimeSpan.FromSeconds(1), linkedSource.Token).ConfigureAwait(false);
        // When signaled, resume running
        this._runStatus = RunStatus.Running;
    }
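The suppressed suggestion above (surface whether the wait was actually signaled) might look roughly like the following. This is a sketch under assumptions: the PR's real `InputWaiter` type is not shown in this thread, so `InputWaiterSketch`, `SignalInput`, and `CountCyclesAsync` are hypothetical names. The only library behavior relied on is that `SemaphoreSlim.WaitAsync(TimeSpan, CancellationToken)` returns `true` when the semaphore was entered and `false` on timeout.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the PR's InputWaiter.
public sealed class InputWaiterSketch : IDisposable
{
    private readonly SemaphoreSlim _signal = new(0, 1);

    public void SignalInput()
    {
        // Collapse repeated signals into a single pending wake-up.
        try { _signal.Release(); } catch (SemaphoreFullException) { }
    }

    // True only when input was signaled before the timeout elapsed.
    public Task<bool> WaitForInputAsync(TimeSpan timeout, CancellationToken token)
        => _signal.WaitAsync(timeout, token);

    public void Dispose() => _signal.Dispose();

    // Counts how many polls would start a new run-stage activity.
    public static async Task<int> CountCyclesAsync(InputWaiterSketch waiter, int polls, TimeSpan timeout)
    {
        int cycles = 0;
        for (int i = 0; i < polls; i++)
        {
            bool signaled = await waiter.WaitForInputAsync(timeout, CancellationToken.None).ConfigureAwait(false);
            if (signaled)
            {
                cycles++; // only here would a workflow_invoke span be started
            }
        }
        return cycles;
    }
}
```

In a loop shaped like the excerpt, a timed-out wait would then go back to waiting instead of opening a new span and resetting the status to `Running`.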
Save and restore Activity.Current in LockstepRunEventStream.Start() so the session activity doesn't leak into caller code via AsyncLocal. Re-establish Activity.Current = sessionActivity before creating the run activity in TakeEventStreamAsync to preserve parent-child nesting. Add test verifying app activities after RunAsync are not parented under the session, and that the workflow_invoke activity nests under the session.
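The save/restore pattern in this commit message can be sketched as follows. The class `SessionNestingSketch` and the source name are hypothetical, and the sketch runs everything in one method rather than across `Start()` and `TakeEventStreamAsync`. It assumes only standard `System.Diagnostics` behavior: `StartActivity` makes the new activity `Activity.Current` and parents it under the previous current activity.

```csharp
using System;
using System.Diagnostics;

public static class SessionNestingSketch
{
    // Hypothetical source name, used only for this illustration.
    private static readonly ActivitySource Source = new("Sketch.Workflows.Session");

    public static (bool Leaked, bool Nested) Run()
    {
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == Source.Name,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);

        // Start the long-lived session span, then restore the caller's ambient activity.
        Activity? previous = Activity.Current;
        Activity? session = Source.StartActivity("workflow.session");
        Activity.Current = previous;

        // Caller code running now should not see the session as its ambient parent.
        bool leaked = session is not null && ReferenceEquals(Activity.Current, session);

        // When a run cycle begins, re-establish the session as the parent.
        Activity.Current = session;
        Activity? run = Source.StartActivity("workflow_invoke");
        bool nested = run is not null && ReferenceEquals(run.Parent, session);

        run?.Dispose();
        session?.Dispose();
        Activity.Current = previous; // leave the ambient context as it was found

        return (leaked, nested);
    }
}
```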
lokitoth left a comment
I wonder if there is stitching that we need to do across Checkpoint restore boundaries...
Improve naming and comments in WorkflowRunActivityStopTests
Fix stale XML doc: WorkflowRun -> WorkflowInvoke in ObservabilityTests
This pull request addresses the issue where workflow run telemetry spans (`Activity` objects) were not always properly stopped and exported, particularly in streaming and lockstep execution environments. The changes ensure that workflow run activities are disposed as soon as the workflow reaches the idle state or when the run loop exits, preventing telemetry data from being lost. Additionally, comprehensive regression tests are added to verify correct activity lifecycle management.

Improvements to Activity Lifecycle Management:
- The `workflow.run` `Activity` is disposed immediately when the workflow reaches the `Idle` state, so telemetry spans are promptly exported rather than waiting for cancellation or disposal.
- A safety-net disposal stops the `workflow.run` `Activity` if it was not already stopped when the run loop exits, covering cancellation and error scenarios.
- The `using` statement is removed from the activity initialization to allow manual control over the activity's disposal timing.

Testing and Regression Coverage:
- `WorkflowRunActivityStopTests.cs` is added to verify that workflow run activities are always properly stopped and exported to telemetry backends, covering lockstep, off-thread, and streaming execution environments, as well as ensuring that all started activities are stopped.

Closes #4155
Contribution Checklist