
@liobrasil liobrasil commented Feb 10, 2026

Summary

This PR adds a new multi-round, multi-LLM debate review workflow for PRs in FlowYieldVaultsEVM.

The workflow runs multiple models on the same PR diff, lets them cross-review each other across rounds, and stops only when they converge (or fails safe to manual review if they do not).

What’s Added

  • New workflow: .github/workflows/multi-llm-debate-review.yml
  • New script: scripts/multi-llm-debate-review.sh

Behavior

  1. Trigger on pull_request events (opened, synchronize, ready_for_review).
  2. Build PR diff against base branch.
  3. Skip if diff is too large (configurable).
  4. Run debate rounds with multiple providers (claude, openai, optional gemini).
  5. In each round, providers run in parallel (see the sketch after this list).
  6. Round 1 is independent assessment; later rounds include peer positions.
  7. Converge when all providers agree and minimum-round threshold is met.
  8. Post/update a single sticky PR comment with the debate transcript and result.
  9. Fail the job if debate execution fails or convergence is not reached.
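
A minimal sketch of how such a round loop could look (function names like run_provider_round and all_providers_agree are illustrative, not taken from scripts/multi-llm-debate-review.sh):

  round=1
  converged="false"
  while (( round <= AI_DEBATE_MAX_ROUNDS )); do
    pids=()
    for provider in "${PROVIDERS[@]}"; do
      # round 1 is an independent assessment; later rounds receive peer positions
      run_provider_round "$provider" "$round" > "out/${provider}_round${round}.json" &
      pids+=("$!")
    done
    wait "${pids[@]}"                      # providers within a round run in parallel

    if all_providers_agree "out" "$round" && (( round >= AI_DEBATE_MIN_ROUNDS )); then
      converged="true"
      break
    fi
    round=$(( round + 1 ))
  done

  [[ "$converged" == "true" ]] || exit 1   # fail safe to manual review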

Configuration

Repository Secrets:

  • CLAUDE_API_KEY (or ANTHROPIC_API_KEY)
  • OPENAI_API_KEY
  • GEMINI_API_KEY (optional)

Repository Variables (defaults sketched below):

  • AI_DEBATE_PROVIDERS (default: claude,openai)
  • AI_DEBATE_MAX_DIFF_SIZE (default: 150000)
  • AI_DEBATE_MAX_ROUNDS (default: 4)
  • AI_DEBATE_MIN_ROUNDS (default: 2)
  • AI_REVIEW_RISK_TOLERANCE (default: moderate)
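
A sketch of how the script could resolve these variables with their defaults (bash parameter expansion; variable names match the list above, the rest is illustrative):

  PROVIDERS_CSV="${AI_DEBATE_PROVIDERS:-claude,openai}"
  MAX_DIFF_SIZE="${AI_DEBATE_MAX_DIFF_SIZE:-150000}"
  MAX_ROUNDS="${AI_DEBATE_MAX_ROUNDS:-4}"
  MIN_ROUNDS="${AI_DEBATE_MIN_ROUNDS:-2}"
  RISK_TOLERANCE="${AI_REVIEW_RISK_TOLERANCE:-moderate}"

  # split the provider list into an array, e.g. (claude openai gemini)
  IFS=',' read -r -a PROVIDERS <<< "$PROVIDERS_CSV"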

Reference Implementation Sources

This PR was informed by patterns in the Dapper actions repository.

Notes

  • This is intentionally conservative: no convergence means high-risk/manual follow-up.
  • Existing Claude workflows remain unchanged; this is additive.


claude bot commented Feb 10, 2026


Multi-LLM Debate Review Workflow - Code Review

Summary

This PR adds a sophisticated multi-round debate system using multiple LLM providers (Claude, OpenAI, Gemini) to collaboratively review pull requests. The workflow orchestrates parallel AI reviews, aggregates feedback through debate rounds, and enforces a convergence gate before allowing merges.

🟢 Strengths

  1. Well-structured shell scripting - Proper use of set -euo pipefail, comprehensive error handling, and input validation
  2. Parallel execution - Efficient concurrent LLM calls to minimize latency
  3. Sanitization - Good credential/secret redaction patterns before sending diffs to external APIs
  4. Flexible configuration - Environment-driven parameters with sensible defaults
  5. Transcript logging - Full JSON transcript of debate rounds for auditability

🔴 Critical Issues

1. Blocking Merge Gate is Too Strict (High Severity)

Location: .github/workflows/multi-llm-debate-review.yml:142-148

The final enforcement step blocks ALL PRs where debate does not converge or any API call fails. This will halt merges for network issues, API downtime, rate limits, and legitimate complex changes where LLMs disagree.

Problem: Blocks every PR where:

  • LLMs don't reach consensus within max rounds
  • Any API call fails (network, rate limits, downtime)
  • Diff is too large (skipped diffs don't converge)

Impact: Significant workflow friction. Manual intervention required even for low-risk changes.

Recommendation: Only block when there is actual high risk:

if: |
  steps.diff.outputs.skip == 'false' &&
  steps.debate.outcome == 'success' &&
  steps.debate.outputs.final_risk == 'high' &&
  steps.debate.outputs.converged != 'true'

2. Sensitive Data Exposure Risk (High Severity)

Location: scripts/multi-llm-debate-review.sh:121-125

Sanitization patterns are incomplete. Missing:

  • Private keys (-----BEGIN RSA PRIVATE KEY-----)
  • AWS credentials (AKIA...)
  • Database connection strings
  • Flow blockchain addresses (0x + 40 hex chars)
  • JWTs

Impact: Contract changes with test keys or sentinel addresses (like 0xFFfFfFffFFfffFFfFFfFFFFFffFFFffffFfFFFfF) leak to external APIs.

Recommendation: Add patterns for private keys, AWS keys, Flow addresses, and connection strings.
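
For example, a sed-based redaction pass could be extended roughly like this (patterns are illustrative, not the script's existing sanitize step; sed is line-oriented, so multi-line key bodies are caught by the long-base64 rule):

  sanitize_diff() {
    sed -E \
      -e 's/-----(BEGIN|END) [A-Z ]*PRIVATE KEY-----/[REDACTED_KEY_MARKER]/g' \
      -e 's/AKIA[0-9A-Z]{16}/[REDACTED_AWS_KEY]/g' \
      -e 's#(postgres|postgresql|mysql|mongodb(\+srv)?)://[^[:space:]]+#[REDACTED_CONNECTION_STRING]#g' \
      -e 's/0x[0-9a-fA-F]{40}/[REDACTED_ADDRESS]/g' \
      -e 's/eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/[REDACTED_JWT]/g' \
      -e 's#[A-Za-z0-9+/=]{64,}#[REDACTED_BASE64_BLOB]#g'
  }

  # usage: sanitize_diff < pr.diff > pr.sanitized.diff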


3. API Key Validation Missing (Medium Severity)

Location: scripts/multi-llm-debate-review.sh:109-116

The script only checks that the keys exist, not that they are valid or have quota, so the first failure surfaces mid-debate after a timeout.

Recommendation: Add pre-flight validation with minimal test prompt to each API endpoint.
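
A possible pre-flight check is one cheap authenticated call per provider before any round starts, so an invalid key fails fast instead of mid-debate (endpoints below are the public model-listing routes; function names are illustrative):

  preflight_openai() {
    curl -sf -o /dev/null --max-time 15 \
      -H "Authorization: Bearer ${OPENAI_API_KEY}" \
      https://api.openai.com/v1/models \
      || { echo "OPENAI_API_KEY failed pre-flight check" >&2; return 1; }
  }

  preflight_claude() {
    curl -sf -o /dev/null --max-time 15 \
      -H "x-api-key: ${CLAUDE_API_KEY:-$ANTHROPIC_API_KEY}" \
      -H "anthropic-version: 2023-06-01" \
      https://api.anthropic.com/v1/models \
      || { echo "Claude API key failed pre-flight check" >&2; return 1; }
  }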


🟡 Medium Priority Issues

4. No Retry Logic for API Failures

API calls have a 90s timeout but no retry on transient failures (429 rate limiting, 503 service unavailable).

Recommendation: Implement exponential backoff with 3 retry attempts.
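
One way to sketch that in bash (wrapper and delays are illustrative; with curl --fail, 429/503 responses surface as non-zero exits and trigger a retry):

  call_with_retry() {
    local attempt=1 max_attempts=3 delay=5
    while true; do
      "$@" && return 0
      if (( attempt >= max_attempts )); then
        echo "giving up after ${max_attempts} attempts: $*" >&2
        return 1
      fi
      echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
      sleep "$delay"
      delay=$(( delay * 2 ))        # exponential backoff: 5s, then 10s
      attempt=$(( attempt + 1 ))
    done
  }

  # usage: call_with_retry curl -sf --max-time 90 ... (provider request)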


5. Hardcoded Model Versions

The pinned claude-sonnet-4-20250514 may eventually be deprecated, and models like gemini-2.0-flash-exp are experimental.

Recommendation: Use stable aliases or add graceful error handling for model_not_found errors.
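
A graceful fallback could look roughly like this (call_claude and the fallback variable are hypothetical; the error type shown matches what Anthropic's API typically returns for an unknown model, but verify against the provider docs):

  response="$(call_claude "$CLAUDE_MODEL")" || true
  if echo "$response" | jq -e '.error.type? == "not_found_error"' >/dev/null 2>&1; then
    echo "model ${CLAUDE_MODEL} unavailable, retrying with ${CLAUDE_FALLBACK_MODEL}" >&2
    response="$(call_claude "$CLAUDE_FALLBACK_MODEL")"
  fi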


6. Race Condition in Transcript Updates

Parallel provider calls could race when appending to the transcript if their mv operations happen simultaneously.

Recommendation: Use file locking or serialize appending after parallel calls complete.
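
For example, an flock-guarded append would serialize writers even if calls overlap (function and file names are illustrative):

  append_to_transcript() {
    local entry_json="$1"
    (
      flock -x 200    # exclusive lock; blocks until any other writer releases it
      jq --argjson entry "$entry_json" '.rounds += [$entry]' transcript.json \
        > transcript.json.tmp && mv transcript.json.tmp transcript.json
    ) 200> transcript.lock
  }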


🟢 Minor Improvements

  • Missing cost estimation: 4 rounds × 3 providers = $1-6 per PR
  • No workflow timeout tuning: 15 min may be excessive for simple PRs
  • Debug output missing: Limited visibility into prompts sent to APIs

🔵 Questions

  1. Why block on non-convergence instead of just flagging for review?
  2. Are there integration tests with mocked API responses?
  3. How will you track API costs and convergence rates?
  4. What is the escape hatch for emergency merges?

📊 Security Assessment

Category            Risk        Notes
Secrets Management  🟡 Medium   API keys in secrets (good), but sanitization incomplete
Data Exfiltration   🟡 Medium   Diffs sent to third-party APIs
Supply Chain        🟢 Low      Pure bash, no external deps
Injection Attacks   🟢 Low      Good jq usage, quoted variables

🎯 Overall Assessment

Code Quality: 8/10
Security: 6/10
Performance: 7/10
Maintainability: 7/10

Recommendation: ⚠️ Approve with required changes

Issues #1 (blocking gate) and #2 (sanitization) must be addressed before merge; the others can be follow-ups.


🛠️ Suggested Follow-ups

  1. Add integration tests with mocked APIs
  2. Implement cost tracking and budget alerts
  3. Add convergence metrics dashboard
  4. Create bypass mechanism for emergencies
  5. Consider hybrid approach: block only for high-risk patterns
