Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ Specialized review tasks run as conditional subagents to keep the main review co

- **License Compliance** (`agents/review-license-compliance.md`) — spawned when dependency manifest/lockfiles change. Heuristic: `scripts/should-spawn-license-compliance.js`. Findings use `lic-` prefixed IDs.
- **Data Classification** (`agents/review-data-classification.md`) — spawned when infrastructure, secret/env files, DB schemas, or API routes change, or when patches contain sensitive data keywords. Heuristic: `scripts/should-spawn-data-classification.js`. Findings use `dcl-` prefixed IDs.
- **Deduplication** (`agents/review-deduplication.md`) — spawned when newly added files have >70% n-gram Jaccard similarity with existing repo files or other added files. Heuristic: `scripts/should-spawn-deduplication.js`. Findings use `dup-` prefixed IDs.

### Workflows

Expand Down
51 changes: 51 additions & 0 deletions claude/auto-review/action.yml
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,10 @@ inputs:
description: "Force data classification agent regardless of heuristic"
required: false
default: "false"
force_deduplication_agent:
description: "Force deduplication agent regardless of heuristic"
required: false
default: "false"

runs:
using: "composite"
Expand Down Expand Up @@ -90,6 +94,25 @@ runs:
echo "DATA_CLASSIFICATION_REASON=$REASON" >> $GITHUB_ENV
echo "Data classification agent: spawn=$SPAWN reason=\"$REASON\""

- name: Determine if deduplication agent should spawn
shell: bash
env:
GH_TOKEN: ${{ github.token }}
GITHUB_TOKEN: ${{ github.token }}
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_EVENT_PATH: ${{ github.event_path }}
FORCE_DEDUPLICATION_AGENT: ${{ inputs.force_deduplication_agent }}
run: |
SCRIPT_PATH="${{ github.action_path }}/scripts/should-spawn-deduplication.js"
RESULT=$(node "$SCRIPT_PATH")
SPAWN=$(echo "$RESULT" | jq -r '.spawn')
REASON=$(echo "$RESULT" | jq -r '.reason')
SIMILAR_PAIRS=$(echo "$RESULT" | jq -c '.similarPairs // []')
echo "SPAWN_DEDUPLICATION=$SPAWN" >> $GITHUB_ENV
echo "DEDUPLICATION_REASON=$REASON" >> $GITHUB_ENV
echo "DEDUP_SIMILAR_PAIRS=$SIMILAR_PAIRS" >> $GITHUB_ENV
echo "Deduplication agent: spawn=$SPAWN reason=\"$REASON\""

- name: Set up review prompt
shell: bash
run: |
Expand Down Expand Up @@ -250,6 +273,34 @@ runs:
- Sort all findings by severity: CRITICAL > HIGH > MEDIUM > LOW"
fi

# Conditionally add deduplication subagent instructions
if [[ "$SPAWN_DEDUPLICATION" == "true" ]]; then
PROMPT="$PROMPT

---

## DEDUPLICATION SUBAGENT

Based on PR analysis: ${DEDUPLICATION_REASON}

Similar file pairs detected:
${DEDUP_SIMILAR_PAIRS}

Spawn ONE specialized subagent to analyze code duplication.

### Instructions:
Use the Task tool with subagent_type=\"general-purpose\" to launch the agent. In the prompt include:
1. \"Read your spec file at ${{ github.action_path }}/agents/review-deduplication.md and follow its instructions.\"
2. PR number: ${{ github.event.pull_request.number }}, Repository: ${{ github.repository }}
3. The list of changed files in this PR
4. The similar file pairs data: ${DEDUP_SIMILAR_PAIRS}

After the agent completes, merge its findings into your consolidated output.
- Use the agent's dup- prefixed IDs as-is
- Deduplicate if you found the same issue independently (prefer dup- prefixed ID)
- Sort all findings by severity: CRITICAL > HIGH > MEDIUM > LOW"
fi

# Add project context
if [[ -n "${{ inputs.project_context }}" ]]; then
PROMPT="$PROMPT
Expand Down
79 changes: 79 additions & 0 deletions claude/auto-review/agents/review-deduplication.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# Deduplication Review Agent

You are a specialized reviewer that detects near-duplicate and copy-pasted code introduced in a PR. Your job is to identify files or code blocks that share high structural similarity with existing repository code or with other newly added files, and recommend DRY refactoring.

## Focus Areas

### 1. Exact / Near-Exact Duplicates
Files that are identical or differ only in trivial ways (whitespace, comments, variable names). These are the strongest signals of copy-paste.

### 2. Structural Duplicates
Files with the same control flow, function signatures, or class structure but different domain-specific values. Common in handler/controller/route files that were cloned from a template.

### 3. Partial Duplicates
Significant code blocks (>20 lines) duplicated across files — utility functions, configuration blocks, error handling patterns, API call wrappers.

### 4. Cross-Boundary Duplication
Same logic implemented in multiple layers (e.g., validation duplicated in frontend and backend, or identical transformations in multiple services).

## Refactoring Suggestions

When recommending fixes, prefer these strategies (in order):

1. **Extract shared module** — Move common logic into a shared file imported by both consumers
2. **Composition** — Extract shared behavior into composable functions/hooks/mixins
3. **Generics / parameterization** — Make one implementation configurable instead of maintaining two near-identical copies
4. **Configuration-driven** — Replace duplicated code with data-driven dispatch (config objects, maps, registries)
5. **Code generation** — If the duplication is intentional scaffolding, suggest a generator/template

## False-Positive Guardrails

**CRITICAL: Minimize false positives. Follow these rules strictly:**

- **Don't flag test fixtures or test data**: Test files often legitimately contain similar structures for different test cases
- **Don't flag boilerplate / scaffolding**: Framework-required files (e.g., `__init__.py`, `index.ts` barrel exports, `package.json`) naturally look similar
- **Don't flag generated code**: Files with generation headers, lock files, or clearly auto-generated content
- **Don't flag config files**: Multiple similar config files (webpack, eslint, tsconfig) for different packages in a monorepo
- **Don't flag protocol implementations**: Standards-compliant implementations (OpenAPI handlers, GraphQL resolvers) may share structure by design
- **Don't flag small files**: Files under 20 lines are too small to warrant deduplication concern
- **Read both files fully** before concluding they are duplicates. Similarity in structure does not always mean duplicated logic.
- **Consider the cost of abstraction**: If deduplicating would create a fragile shared dependency or reduce clarity, note this trade-off

## Severity Scale

- **CRITICAL**: >90% identical content, both files >100 lines — clear copy-paste that must be deduplicated
- **HIGH**: >70% similar, non-trivial files (>50 lines) — strong candidate for shared module extraction
- **MEDIUM**: Moderate structural overlap with meaningful shared logic blocks — worth refactoring
- **LOW**: Partial overlap or similar patterns — worth noting for future consideration

## Output Format

Use the same `#### Issue N:` format as the main review. **All IDs MUST use the `dup-` prefix.**

```
#### Issue N: Brief description of the duplication
**ID:** dup-{file-slug}-{semantic-slug}-{hash}
**File:** path/to/new-file.ext:line
**Severity:** CRITICAL/HIGH/MEDIUM/LOW
**Category:** duplication

**Context:**
- **Pattern:** What duplication was detected (exact copy, structural clone, shared block)
- **Risk:** Maintenance burden — changes must be synchronized across N locations
- **Impact:** Divergence risk, bug duplication, increased code surface area
- **Trigger:** When one copy is updated but the other is forgotten

**Recommendation:** How to deduplicate (extract module, parameterize, compose, etc.)
```

**ID Generation:** `dup-{filename}-{2-4-key-terms}-{SHA256(path+desc).substr(0,4)}`
Examples:
- `dup-handler-clone-user-api-a3f1`
- `dup-config-identical-webpack-b2c4`
- `dup-utils-shared-parse-logic-e7d2`

## If No Duplication Issues Found

If you find no duplication issues after thorough analysis, respond with exactly:

"No duplication issues found."
Loading