
@liobrasil liobrasil commented Feb 10, 2026

Summary

This PR adds a new multi-round, multi-LLM debate review workflow for PRs in FlowYieldVaultsEVM.

The workflow runs multiple models on the same PR diff, lets them cross-review each other across rounds, and stops only when they converge (or fails safe to manual review if they do not).

What’s Added

  • New workflow: .github/workflows/multi-llm-debate-review.yml
  • New script: scripts/multi-llm-debate-review.sh

Behavior

  1. Trigger on pull_request events (opened, synchronize, ready_for_review).
  2. Build PR diff against base branch.
  3. Skip if diff is too large (configurable).
  4. Run debate rounds with multiple providers (claude, openai, optional gemini).
  5. In each round, providers run in parallel (see the sketch after this list).
  6. Round 1 is independent assessment; later rounds include peer positions.
  7. Converge when all providers agree and minimum-round threshold is met.
  8. Post/update a single sticky PR comment with the debate transcript and result.
  9. Fail the job if debate execution fails or convergence is not reached.
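
A minimal sketch of how such a round loop could look (function names like run_provider_round and all_providers_agree are illustrative, not taken from scripts/multi-llm-debate-review.sh):

  round=1
  converged="false"
  while (( round <= AI_DEBATE_MAX_ROUNDS )); do
    pids=()
    for provider in "${PROVIDERS[@]}"; do
      # round 1 is an independent assessment; later rounds receive peer positions
      run_provider_round "$provider" "$round" > "out/${provider}_round${round}.json" &
      pids+=("$!")
    done
    wait "${pids[@]}"                      # providers within a round run in parallel

    if all_providers_agree "out" "$round" && (( round >= AI_DEBATE_MIN_ROUNDS )); then
      converged="true"
      break
    fi
    round=$(( round + 1 ))
  done

  [[ "$converged" == "true" ]] || exit 1   # fail safe to manual review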

Configuration

Repository Secrets:

  • CLAUDE_API_KEY (or ANTHROPIC_API_KEY)
  • OPENAI_API_KEY
  • GEMINI_API_KEY (optional)

Repository Variables (defaults sketched below):

  • AI_DEBATE_PROVIDERS (default: claude,openai)
  • AI_DEBATE_MAX_DIFF_SIZE (default: 150000)
  • AI_DEBATE_MAX_ROUNDS (default: 4)
  • AI_DEBATE_MIN_ROUNDS (default: 2)
  • AI_REVIEW_RISK_TOLERANCE (default: moderate)
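
A sketch of how the script could resolve these variables with their defaults (bash parameter expansion; variable names match the list above, the rest is illustrative):

  PROVIDERS_CSV="${AI_DEBATE_PROVIDERS:-claude,openai}"
  MAX_DIFF_SIZE="${AI_DEBATE_MAX_DIFF_SIZE:-150000}"
  MAX_ROUNDS="${AI_DEBATE_MAX_ROUNDS:-4}"
  MIN_ROUNDS="${AI_DEBATE_MIN_ROUNDS:-2}"
  RISK_TOLERANCE="${AI_REVIEW_RISK_TOLERANCE:-moderate}"

  # split the provider list into an array, e.g. (claude openai gemini)
  IFS=',' read -r -a PROVIDERS <<< "$PROVIDERS_CSV"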

Reference Implementation Sources

This PR was informed by patterns in the Dapper actions repository.

Notes

  • This is intentionally conservative: no convergence means high-risk/manual follow-up.
  • Existing Claude workflows remain unchanged; this is additive.


claude bot commented Feb 10, 2026


Multi-LLM Debate Review Workflow - Code Review

Summary

This PR adds a sophisticated multi-round debate system using multiple LLM providers (Claude, OpenAI, Gemini) to collaboratively review pull requests. The workflow orchestrates parallel AI reviews, aggregates feedback through debate rounds, and enforces a convergence gate before allowing merges.

🟢 Strengths

  1. Well-structured shell scripting - Proper use of set -euo pipefail, comprehensive error handling, and input validation
  2. Parallel execution - Efficient concurrent LLM calls to minimize latency
  3. Sanitization - Good credential/secret redaction patterns before sending diffs to external APIs
  4. Flexible configuration - Environment-driven parameters with sensible defaults
  5. Transcript logging - Full JSON transcript of debate rounds for auditability

🔴 Critical Issues

1. Blocking Merge Gate is Too Strict (High Severity)

Location: .github/workflows/multi-llm-debate-review.yml:142-148

The final enforcement step blocks ALL PRs where debate does not converge or any API call fails. This will halt merges for network issues, API downtime, rate limits, and legitimate complex changes where LLMs disagree.

Problem: Blocks every PR where:

  • LLMs don't reach consensus within max rounds
  • Any API call fails (network, rate limits, downtime)
  • Diff is too large (skipped diffs don't converge)

Impact: Significant workflow friction. Manual intervention required even for low-risk changes.

Recommendation: Only block when there is actual high risk:

if: |
  steps.diff.outputs.skip == 'false' &&
  steps.debate.outcome == 'success' &&
  steps.debate.outputs.final_risk == 'high' &&
  steps.debate.outputs.converged != 'true'

2. Sensitive Data Exposure Risk (High Severity)

Location: scripts/multi-llm-debate-review.sh:121-125

Sanitization patterns are incomplete. Missing:

  • Private keys (-----BEGIN RSA PRIVATE KEY-----)
  • AWS credentials (AKIA...)
  • Database connection strings
  • Flow blockchain addresses (0x + 40 hex chars)
  • JWTs

Impact: Contract changes with test keys or sentinel addresses (like 0xFFfFfFffFFfffFFfFFfFFFFFffFFFffffFfFFFfF) leak to external APIs.

Recommendation: Add patterns for private keys, AWS keys, Flow addresses, and connection strings.
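
For example, a sed-based redaction pass could be extended roughly like this (patterns are illustrative, not the script's existing sanitize step; sed is line-oriented, so multi-line key bodies are caught by the long-base64 rule):

  sanitize_diff() {
    sed -E \
      -e 's/-----(BEGIN|END) [A-Z ]*PRIVATE KEY-----/[REDACTED_KEY_MARKER]/g' \
      -e 's/AKIA[0-9A-Z]{16}/[REDACTED_AWS_KEY]/g' \
      -e 's#(postgres|postgresql|mysql|mongodb(\+srv)?)://[^[:space:]]+#[REDACTED_CONNECTION_STRING]#g' \
      -e 's/0x[0-9a-fA-F]{40}/[REDACTED_ADDRESS]/g' \
      -e 's/eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/[REDACTED_JWT]/g' \
      -e 's#[A-Za-z0-9+/=]{64,}#[REDACTED_BASE64_BLOB]#g'
  }

  # usage: sanitize_diff < pr.diff > pr.sanitized.diff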


3. API Key Validation Missing (Medium Severity)

Location: scripts/multi-llm-debate-review.sh:109-116

The script only checks that the keys exist, not that they are valid or have quota, so the first failure surfaces mid-debate after a timeout.

Recommendation: Add pre-flight validation with minimal test prompt to each API endpoint.
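
A possible pre-flight check is one cheap authenticated call per provider before any round starts, so an invalid key fails fast instead of mid-debate (endpoints below are the public model-listing routes; function names are illustrative):

  preflight_openai() {
    curl -sf -o /dev/null --max-time 15 \
      -H "Authorization: Bearer ${OPENAI_API_KEY}" \
      https://api.openai.com/v1/models \
      || { echo "OPENAI_API_KEY failed pre-flight check" >&2; return 1; }
  }

  preflight_claude() {
    curl -sf -o /dev/null --max-time 15 \
      -H "x-api-key: ${CLAUDE_API_KEY:-$ANTHROPIC_API_KEY}" \
      -H "anthropic-version: 2023-06-01" \
      https://api.anthropic.com/v1/models \
      || { echo "Claude API key failed pre-flight check" >&2; return 1; }
  }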


🟡 Medium Priority Issues

4. No Retry Logic for API Failures

API calls have a 90s timeout but no retry on transient failures (429 rate limiting, 503 service unavailable).

Recommendation: Implement exponential backoff with 3 retry attempts.
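
One way to sketch that in bash (wrapper and delays are illustrative; with curl --fail, 429/503 responses surface as non-zero exits and trigger a retry):

  call_with_retry() {
    local attempt=1 max_attempts=3 delay=5
    while true; do
      "$@" && return 0
      if (( attempt >= max_attempts )); then
        echo "giving up after ${max_attempts} attempts: $*" >&2
        return 1
      fi
      echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$(( delay * 2 ))        # exponential backoff: 5s, then 10s
      attempt=$(( attempt + 1 ))
    done
  }

  # usage: call_with_retry curl -sf --max-time 90 ... (provider request)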


5. Hardcoded Model Versions

The pinned claude-sonnet-4-20250514 may eventually be deprecated, and models like gemini-2.0-flash-exp are experimental.

Recommendation: Use stable aliases or add graceful error handling for model_not_found errors.
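
A graceful fallback could look roughly like this (call_claude and the fallback variable are hypothetical; the error type shown matches what Anthropic's API typically returns for an unknown model, but verify against the provider docs):

  response="$(call_claude "$CLAUDE_MODEL")" || true
  if echo "$response" | jq -e '.error.type? == "not_found_error"' >/dev/null 2>&1; then
    echo "model ${CLAUDE_MODEL} unavailable, retrying with ${CLAUDE_FALLBACK_MODEL}" >&2
    response="$(call_claude "$CLAUDE_FALLBACK_MODEL")"
  fi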


6. Race Condition in Transcript Updates

Parallel provider calls could race when appending to the transcript if their mv operations happen simultaneously.

Recommendation: Use file locking or serialize appending after parallel calls complete.
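
For example, an flock-guarded append would serialize writers even if calls overlap (function and file names are illustrative):

  append_to_transcript() {
    local entry_json="$1"
    (
      flock -x 200    # exclusive lock; blocks until any other writer releases it
      jq --argjson entry "$entry_json" '.rounds += [$entry]' transcript.json \
        > transcript.json.tmp && mv transcript.json.tmp transcript.json
    ) 200> transcript.lock
  }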


🟢 Minor Improvements

  • Missing cost estimation: 4 rounds × 3 providers = $1-6 per PR
  • No workflow timeout tuning: 15 min may be excessive for simple PRs
  • Debug output missing: Limited visibility into prompts sent to APIs

🔵 Questions

  1. Why block on non-convergence instead of just flagging for review?
  2. Are there integration tests with mocked API responses?
  3. How will you track API costs and convergence rates?
  4. What is the escape hatch for emergency merges?

📊 Security Assessment

Category            Risk        Notes
Secrets Management  🟡 Medium   API keys in secrets (good), but sanitization incomplete
Data Exfiltration   🟡 Medium   Diffs sent to third-party APIs
Supply Chain        🟢 Low      Pure bash, no external deps
Injection Attacks   🟢 Low      Good jq usage, quoted variables

🎯 Overall Assessment

Code Quality: 8/10
Security: 6/10
Performance: 7/10
Maintainability: 7/10

Recommendation: ⚠️ Approve with required changes

Issues #1 (blocking gate) and #2 (sanitization) must be addressed before merge; the others can be follow-ups.


🛠️ Suggested Follow-ups

  1. Add integration tests with mocked APIs
  2. Implement cost tracking and budget alerts
  3. Add convergence metrics dashboard
  4. Create bypass mechanism for emergencies
  5. Consider hybrid approach: block only for high-risk patterns
