Add failed capability IDs to workflow execution failed metric #21735

Closed
vreff wants to merge 3 commits into develop from feat/capability-id-in-execution-failed-metric-v2

@vreff vreff commented Mar 26, 2026

Track which capabilities failed during a workflow execution and include them as a comma-joined capabilityID label on the platform_engine_workflow_execution_failed_count metric and in the "Workflow execution failed" log line.

Changes

  • capability_executor.go: Add failedCapabilityIDs sync.Map to ExecutionHelper, store capability ID on each callCapability error, expose via FailedCapabilityIDs() getter
  • engine.go: Add capabilityID label to IncrementWorkflowExecutionFailedCounter and to the error log line in the result.GetError() path
  • capability_executor_test.go: Unit tests for empty, single, dedup, and multiple capability ID tracking
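
The tracking described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `failureTracker` type and `recordFailure` method are hypothetical stand-ins for the state added to ExecutionHelper, the capability IDs are made up, and sorting before joining is an assumption made here so that the output is deterministic (`sync.Map` iteration order is not).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// failureTracker is a hypothetical stand-in for the failed-capability state
// that the PR adds to ExecutionHelper.
type failureTracker struct {
	failedCapabilityIDs sync.Map // capability ID (string) -> struct{}
}

// recordFailure stores the ID; storing the same ID twice is a no-op in effect,
// which gives the dedup behavior the unit tests mention.
func (t *failureTracker) recordFailure(capabilityID string) {
	t.failedCapabilityIDs.Store(capabilityID, struct{}{})
}

// FailedCapabilityIDs returns the recorded IDs as a comma-joined string.
// Sorting is an assumption of this sketch, added for stable output.
func (t *failureTracker) FailedCapabilityIDs() string {
	var ids []string
	t.failedCapabilityIDs.Range(func(key, _ any) bool {
		if id, ok := key.(string); ok {
			ids = append(ids, id)
		}
		return true
	})
	sort.Strings(ids)
	return strings.Join(ids, ",")
}

func main() {
	t := &failureTracker{}
	t.recordFailure("write-target@1.0.0") // hypothetical capability IDs
	t.recordFailure("http-fetch@1.0.0")
	t.recordFailure("write-target@1.0.0") // duplicate, recorded once
	fmt.Println(t.FailedCapabilityIDs())  // prints "http-fetch@1.0.0,write-target@1.0.0"
}
```

Because the map is keyed by capability ID, concurrent capability calls can record failures without extra locking, and an execution with no failures yields an empty string.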

- Track failed capability IDs in ExecutionHelper via sync.Map
- Store capability ID on each callCapability error
- Expose FailedCapabilityIDs() as comma-joined string
- Add capabilityID label to IncrementWorkflowExecutionFailedCounter
- Add capabilityID to 'Workflow execution failed' log line
- Add unit tests for FailedCapabilityIDs
@vreff vreff requested a review from a team as a code owner March 26, 2026 19:40
Copilot AI review requested due to automatic review settings March 26, 2026 19:40
@github-actions

👋 vreff, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

github-actions bot commented Mar 26, 2026

✅ No conflicts with other open PRs targeting develop

@github-actions

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in its text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

Copilot AI left a comment

Pull request overview

Risk Rating: MEDIUM — changes affect observability/metrics dimensions on a workflow-level counter and introduce new concurrent state tracking that can impact metric cardinality and operational cost.

This PR aims to attribute workflow execution failures to the capability IDs that failed during the run, surfacing that attribution in both logs and the platform_engine_workflow_execution_failed_count metric.

Changes:

  • Track failed capability IDs during execution via ExecutionHelper.
  • Add a capabilityID label (comma-joined IDs) to the workflow execution failed metric and error log line.
  • Add unit tests covering empty/single/dedup/multiple failed capability ID tracking.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| core/services/workflows/v2/engine.go | Adds failed capability ID label to workflow execution failure metric/log. |
| core/services/workflows/v2/capability_executor.go | Tracks capability IDs on execution errors and exposes a getter for failed IDs. |
| core/services/workflows/v2/capability_executor_test.go | Adds tests for the failed capability ID aggregation behavior. |

Areas requiring scrupulous human review:

  • The determinism and cardinality impact of using a comma-joined list as a metric label value.
  • Ensuring failed capability IDs are recorded for all relevant error paths (not only capability.Execute errors).

Recommended reviewers (per CODEOWNERS):

  • @smartcontractkit/keystone (owners for /core/services/workflows)
Comments suppressed due to low confidence (1)

core/services/workflows/v2/capability_executor.go:217

  • failedCapabilityIDs is only updated when capability.Execute(...) returns an error. Errors returned earlier in callCapability (e.g., GetExecutable/Info/ConfigForCapability failures) or in CallCapability (rate limit / semaphore waits) won't be attributed to the capability ID, so the new workflow-execution-failed metric label may be empty/misleading in those cases. Consider recording request.Id for any non-nil error returned from CallCapability/callCapability (e.g., via a deferred function that checks the named return err).
	executionDuration := time.Since(executionStart)
	c.metrics.With(platform.KeyCapabilityID, request.Id).UpdateCapabilityExecutionDurationHistogram(ctx, int64(executionDuration.Seconds()))
	if err != nil {
		c.failedCapabilityIDs.Store(request.Id, struct{}{})
		var capabilityError caperrors.Error
		if errors.As(err, &capabilityError) {
			if capabilityError.Origin() == caperrors.OriginUser {
				execLogger.Debugw("Capability execution failed with user error", "userErr", err)

- Add new platform_engine_workflow_execution_failed_capability_error_count metric
- Emit one metric increment per failed capability ID (avoids high-cardinality label)
- Change FailedCapabilityIDs() to return sorted []string
- Keep comma-joined list in log line for debugging
- Remove capabilityID label from existing execution_failed_count metric
@vreff vreff requested a review from bolekk March 26, 2026 20:52
@vreff vreff closed this Mar 27, 2026
@vreff vreff deleted the feat/capability-id-in-execution-failed-metric-v2 branch March 27, 2026 16:03
2 participants