[refactor] Semantic function clustering: duplicates and outliers in pkg/workflow

Automated semantic analysis of all non-test Go files in `pkg/workflow/` (240+ files) and utility packages (`pkg/stringutil`, `pkg/sliceutil`, etc.) identified several concrete refactoring opportunities through function clustering and duplicate detection.

### Overview

| Metric | Value |
|--------|-------|
| Go files analyzed | 240+ (pkg/workflow) + 7 utility packages |
| Duplicate patterns found | 4 concrete cases |
| Outlier functions found | 2 |
| Estimated code reduction | ~700 lines |
| Detection method | Serena semantic analysis + naming pattern clustering |

---

### Critical Issues

#### Issue 1: Duplicate Mount Syntax Validation

Two functions implement nearly identical mount string validation (`source:dest:mode` format) in separate files:

- **`pkg/workflow/sandbox_validation.go`** → `validateMountsSyntax()`
- **`pkg/workflow/mcp_config_validation.go`** → `validateMCPMountsSyntax()`

Both split on `:`, check for exactly 3 parts, and validate that `mode` is `"ro"` or `"rw"`. The only differences are error types (`NewValidationError` vs `fmt.Errorf`) and that the sandbox version also validates non-empty source/dest fields.

**Recommendation:** Extract a shared `validateMountFormat(mounts []string, toolName string) error` function into `validation_helpers.go` or a new `mounts_validation.go`, then call it from both files with appropriate error wrapping.

#### Issue 2: Near-Identical missing_data.go and missing_tool.go

`pkg/workflow/missing_data.go` and `pkg/workflow/missing_tool.go` are structurally identical (~164 lines each, ~250 duplicated lines):

<details>
<summary>View duplicate struct definitions</summary>

```go
// missing_data.go
type MissingDataConfig struct {
 BaseSafeOutputConfig `yaml:",inline"`
 CreateIssue bool `yaml:"create-issue,omitempty"`
 TitlePrefix string `yaml:"title-prefix,omitempty"`
 Labels []string `yaml:"labels,omitempty"`
}

// missing_tool.go (IDENTICAL structure, only type name differs)
type MissingToolConfig struct {
 BaseSafeOutputConfig `yaml:",inline"`
 CreateIssue bool `yaml:"create-issue,omitempty"`
 TitlePrefix string `yaml:"title-prefix,omitempty"`
 Labels []string `yaml:"labels,omitempty"`
}
```

</details>

Both `parseMissingDataConfig()` / `parseMissingToolConfig()` and `buildCreateOutputMissingDataJob()` / `buildCreateOutputMissingToolJob()` follow exactly the same pattern with only the output type name and environment variable prefix differing (e.g., `GH_AW_MISSING_DATA_MAX` vs `GH_AW_MISSING_TOOL_MAX`).

**Recommendation:** Consider a shared `BaseMissingOutputConfig` type and a generic `buildCreateOutputMissingJob(outputType string, config BaseMissingOutputConfig)` builder, reducing ~250 lines to ~80 lines of shared code. `noop.go` follows the same outer parse pattern and could share the same boilerplate helper.

#### Issue 3: Safe Output Handler Parse Function Boilerplate (11 files)

The following files all implement the same three-part pattern — a config struct embedding `BaseSafeOutputConfig` + `SafeOutputTargetConfig`, a `parseXxxConfig()` method on `*Compiler`, and a build function:

`add_comment.go`, `add_labels.go`, `add_reviewer.go`, `assign_milestone.go`, `assign_to_agent.go`, `assign_to_user.go`, `hide_comment.go`, `link_sub_issue.go`, `remove_labels.go`, `reply_to_pr_review_comment.go`, `resolve_pr_review_thread.go`

<details>
<summary>View duplicated parse function pattern</summary>

```go
// add_labels.go (lines 20-40)
func (c *Compiler) parseAddLabelsConfig(outputMap map[string]any) *AddLabelsConfig {
 if _, exists := outputMap["add-labels"]; !exists {
 return nil
 }
 addLabelsLog.Print("Parsing add-labels configuration")
 var config AddLabelsConfig
 if err := unmarshalConfig(outputMap, "add-labels", &config, addLabelsLog); err != nil {
 addLabelsLog.Printf("Failed to unmarshal config: %v", err)
 return &AddLabelsConfig{}
 }
 return &config
}

// remove_labels.go (lines 18-38) — near-identical
func (c *Compiler) parseRemoveLabelsConfig(outputMap map[string]any) *RemoveLabelsConfig {
 if _, exists := outputMap["remove-labels"]; !exists {
 return nil
 }
 removeLabelsLog.Print("Parsing remove-labels configuration")
 var config RemoveLabelsConfig
 if err := unmarshalConfig(outputMap, "remove-labels", &config, removeLabelsLog); err != nil {
 removeLabelsLog.Printf("Failed to unmarshal config: %v", err)
 return &RemoveLabelsConfig{}
 }
 return &config
}
```

</details>

The `AssignToUserConfig` and `UnassignFromUserConfig` structs are also nearly identical:

```go
// Both have: BaseSafeOutputConfig, SafeOutputTargetConfig, Allowed []string, Blocked []string
// UnassignFromUser has one extra field: UnassignFirst *string
```

**Recommendation:** Extract a generic parse helper (similar to existing `unmarshalConfig`) that handles the nil-check, logging, and unmarshal boilerplate. Consider also whether `assign_to_user.go` and `unassign_from_user.go` could share a common base struct or live in one file.

#### Issue 4: SanitizeWorkflowName Duplication Across Packages

`pkg/workflow/strings.go` defines `SanitizeWorkflowName()` and `SanitizeName()`, while `pkg/stringutil/identifiers.go` defines `NormalizeWorkflowName()` — both sanitize workflow names for use in filenames/identifiers. The `workflow` package has a general-purpose `SanitizeName()` function with configurable options that overlaps with `pkg/stringutil`'s sanitization family (`SanitizeParameterName`, `SanitizePythonVariableName`, `SanitizeToolID`).

**Recommendation:** Evaluate whether `SanitizeName()` with its options pattern should be moved to `pkg/stringutil` as the canonical sanitization primitive, with `pkg/workflow/strings.go` importing it. This would centralize sanitization logic.

---

### Outlier Functions

<details>
<summary>View outlier details</summary>

#### Outlier 1: `extractAddDirPaths()` in copilot_engine.go

The function `extractAddDirPaths()` is a standalone package-level function in `copilot_engine.go`. All other engine files keep only constructor (`NewXxxEngine()`) and interface method implementations. This private helper function is an outlier for the file.

**Recommendation:** Move to `copilot_engine_execution.go` (where the directory add behavior is likely used), or to `engine_helpers.go`.

#### Outlier 2: `generateCopilotSessionFileCopyStep()` in copilot_logs.go

This function generates a YAML step for copying session files, but it lives in the log parsing file. It is closer in nature to execution/installation step generation than log parsing.

**Recommendation:** Move to `copilot_engine_execution.go` where other Copilot execution steps are generated.

</details>

---

### Function Clusters (Well-Organized Areas)

<details>
<summary>View cluster analysis</summary>

The following areas show good organization and do **not** need refactoring:

**Engine Interface Pattern (4 files):** `claude_engine.go`, `codex_engine.go`, `gemini_engine.go`, `copilot_engine.go` all correctly implement the same engine interface. The per-engine split is intentional and appropriate.

**Bundler Validation (3 files):** `bundler_runtime_validation.go`, `bundler_safety_validation.go`, `bundler_script_validation.go` each validate genuinely distinct concerns (runtime mixing, bundle completeness, API compatibility) and are correctly separated.

**Firewall Validation (2 files):** `firewall_validation.go` (log-level validation) and `network_firewall_validation.go` (cross-field dependency validation like `allow-urls requires ssl-bump`) serve different validation dimensions and are appropriately split.

**compiler_safe_outputs_* series (8 files):** Clear responsibility split by concern (job building, step building, environment setup, specialized handlers). Well-organized.

**Utility Packages:** `pkg/sliceutil`, `pkg/mathutil`, `pkg/fileutil`, `pkg/gitutil`, `pkg/envutil`, `pkg/timeutil` are all cleanly scoped with no cross-package duplication detected.

</details>

---

### Recommendations by Priority

#### Priority 1 — High Impact, Low Risk

1. **Extract shared mount validation** (`validateMountsSyntax` / `validateMCPMountsSyntax`) into a single function in `validation_helpers.go`
2. **Move `extractAddDirPaths()`** from `copilot_engine.go` to `copilot_engine_execution.go`
3. **Move `generateCopilotSessionFileCopyStep()`** from `copilot_logs.go` to `copilot_engine_execution.go`

#### Priority 2 — Medium Impact

4. **Extract generic parse helper** for the repeated `parseXxxConfig()` boilerplate across the 11 safe output handler files
5. **Consolidate `missing_data.go` and `missing_tool.go`** build functions via a shared base type or factory

#### Priority 3 — Long-term

6. **Evaluate moving `SanitizeName()`** to `pkg/stringutil` as the canonical sanitization primitive

---

### Implementation Checklist

- [ ] Extract `validateMountFormat()` helper from sandbox and MCP mount validators
- [ ] Relocate `extractAddDirPaths()` to `copilot_engine_execution.go`
- [ ] Relocate `generateCopilotSessionFileCopyStep()` to `copilot_engine_execution.go`
- [ ] Extract generic `parseOutputConfig[T]()` helper or similar to reduce parse boilerplate
- [ ] Explore shared base for `MissingDataConfig` / `MissingToolConfig`
- [ ] Evaluate consolidating sanitization primitives in `pkg/stringutil`
- [ ] Ensure all tests pass after each refactoring step

---

### References

- Workflow run: [§22281474970](https://github.com/github/gh-aw/actions/runs/22281474970)
- Analysis date: 2026-02-22
- Analysis scope: 240+ non-test `.go` files in `pkg/workflow/`, 7 utility packages




> Generated by [Semantic Function Refactoring](https://github.com/github/gh-aw/actions/runs/22281474970)
> - [x] expires  on Feb 24, 2026, 5:12 PM UTC

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[refactor] Semantic function clustering: duplicates and outliers in pkg/workflow #17762

Overview

Critical Issues

Issue 1: Duplicate Mount Syntax Validation

Issue 2: Near-Identical missing_data.go and missing_tool.go

Issue 3: Safe Output Handler Parse Function Boilerplate (11 files)

Issue 4: SanitizeWorkflowName Duplication Across Packages

Outlier Functions

Outlier 1: `extractAddDirPaths()` in copilot_engine.go

Outlier 2: `generateCopilotSessionFileCopyStep()` in copilot_logs.go

Function Clusters (Well-Organized Areas)

Recommendations by Priority

Priority 1 — High Impact, Low Risk

Priority 2 — Medium Impact

Priority 3 — Long-term

Implementation Checklist

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Metric	Value
Go files analyzed	240+ (pkg/workflow) + 7 utility packages
Duplicate patterns found	4 concrete cases
Outlier functions found	2
Estimated code reduction	~700 lines
Detection method	Serena semantic analysis + naming pattern clustering

[refactor] Semantic function clustering: duplicates and outliers in pkg/workflow #17762

Description

Overview

Critical Issues

Issue 1: Duplicate Mount Syntax Validation

Issue 2: Near-Identical missing_data.go and missing_tool.go

Issue 3: Safe Output Handler Parse Function Boilerplate (11 files)

Issue 4: SanitizeWorkflowName Duplication Across Packages

Outlier Functions

Outlier 1: extractAddDirPaths() in copilot_engine.go

Outlier 2: generateCopilotSessionFileCopyStep() in copilot_logs.go

Function Clusters (Well-Organized Areas)

Recommendations by Priority

Priority 1 — High Impact, Low Risk

Priority 2 — Medium Impact

Priority 3 — Long-term

Implementation Checklist

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Outlier 1: `extractAddDirPaths()` in copilot_engine.go

Outlier 2: `generateCopilotSessionFileCopyStep()` in copilot_logs.go