Add failed capability IDs to workflow execution failed metric#21735
Add failed capability IDs to workflow execution failed metric#21735
Conversation
- Track failed capability IDs in ExecutionHelper via sync.Map - Store capability ID on each callCapability error - Expose FailedCapabilityIDs() as comma-joined string - Add capabilityID label to IncrementWorkflowExecutionFailedCounter - Add capabilityID to 'Workflow execution failed' log line - Add unit tests for FailedCapabilityIDs
|
👋 vreff, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
|
✅ No conflicts with other open PRs targeting |
|
I see you updated files related to
|
There was a problem hiding this comment.
Pull request overview
Risk Rating: MEDIUM — changes affect observability/metrics dimensions on a workflow-level counter and introduce new concurrent state tracking that can impact metric cardinality and operational cost.
This PR aims to attribute workflow execution failures to the capability IDs that failed during the run, surfacing that attribution in both logs and the platform_engine_workflow_execution_failed_count metric.
Changes:
- Track failed capability IDs during execution via
ExecutionHelper. - Add a
capabilityIDlabel (comma-joined IDs) to the workflow execution failed metric and error log line. - Add unit tests covering empty/single/dedup/multiple failed capability ID tracking.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
core/services/workflows/v2/engine.go |
Adds failed capability ID label to workflow execution failure metric/log. |
core/services/workflows/v2/capability_executor.go |
Tracks capability IDs on execution errors and exposes a getter for failed IDs. |
core/services/workflows/v2/capability_executor_test.go |
Adds tests for the failed capability ID aggregation behavior. |
Areas requiring scrupulous human review:
- The determinism and cardinality impact of using a comma-joined list as a metric label value.
- Ensuring failed capability IDs are recorded for all relevant error paths (not only
capability.Executeerrors).
Recommended reviewers (per CODEOWNERS):
@smartcontractkit/keystone(owners for/core/services/workflows)
Comments suppressed due to low confidence (1)
core/services/workflows/v2/capability_executor.go:217
- failedCapabilityIDs is only updated when capability.Execute(...) returns an error. Errors returned earlier in callCapability (e.g., GetExecutable/Info/ConfigForCapability failures) or in CallCapability (rate limit / semaphore waits) won't be attributed to the capability ID, so the new workflow-execution-failed metric label may be empty/misleading in those cases. Consider recording request.Id for any non-nil error returned from CallCapability/callCapability (e.g., via a deferred function that checks the named return err).
executionDuration := time.Since(executionStart)
c.metrics.With(platform.KeyCapabilityID, request.Id).UpdateCapabilityExecutionDurationHistogram(ctx, int64(executionDuration.Seconds()))
if err != nil {
c.failedCapabilityIDs.Store(request.Id, struct{}{})
var capabilityError caperrors.Error
if errors.As(err, &capabilityError) {
if capabilityError.Origin() == caperrors.OriginUser {
execLogger.Debugw("Capability execution failed with user error", "userErr", err)
- Add new platform_engine_workflow_execution_failed_capability_error_count metric - Emit one metric increment per failed capability ID (avoids high-cardinality label) - Change FailedCapabilityIDs() to return sorted []string - Keep comma-joined list in log line for debugging - Remove capabilityID label from existing execution_failed_count metric
|




Track which capabilities failed during a workflow execution and include them as a comma-joined capabilityID label on the platform_engine_workflow_execution_failed_count metric and in the "Workflow execution failed" log line.
Changes