Integrating SWE-Pro (Public) Dataset Eval #1197
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a SWE-bench Pro dataset package (config constants and a prepare script), an Alpine-based Dockerfile for SWE-bench, and evaluation updates: a gold-patch agent mode plus a new SWE-bench Pro evaluation path and dataset-type config.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Task as GenerationTask
    participant FS as FileSystem
    participant Agent as AgentFramework
    participant Harness as SWEbenchHarness
    Note over Task,Agent: For each datapoint
    Task->>Agent: check agent_framework
    alt agent_framework == "gold_patch"
        Agent->>FS: write gold_patch JSONL (output_dir/gold_patches/<id>.jsonl)
        FS-->>Task: gold_patch_path
        Task->>Harness: run pro-style evaluation (--raw_sample_path, --patch_path, --output_dir, --scripts_dir)
        Harness-->>FS: write eval-outputs
    else non-gold
        Agent->>Agent: generate predictions
        Agent->>FS: write predictions file
        Task->>Harness: run evaluation
        alt dataset_type == swe_bench_pro
            Task->>Harness: invoke pro-style CLI (raw_sample_path/patch_path/output_dir/scripts_dir)
            Harness-->>FS: write eval-outputs
        else
            Task->>Harness: invoke legacy CLI (predictions_path/instance_ids/dataset_name)
            Harness-->>FS: write logs/run_evaluation/eval-outputs
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/swe-bench-pro/prepare.py`:
- Around line 24-26: The get_dockerhub_image_uri function currently uses
repo_name.lower().split("/") which will fail if repo_name defaults to "" — make
repo_name required (remove the default "") or validate and raise a clear error
if it's empty/doesn't contain a slash; also make the split robust by using
rsplit("/", 1) and assign to repo_base and repo_name_only (keep references to
uid and hsh as-is), so update the function signature and replace
repo_name.lower().split("/") with repo_name.lower().rsplit("/", 1) and add a
guard that raises ValueError with a helpful message when repo_name is invalid.
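A minimal sketch of what the suggested guard could look like; the `uid` and `hsh` parameters and the returned URI format are assumptions for illustration, not the actual signature:

```python
# Hypothetical sketch of the suggested fix. Only the repo_name handling
# follows the review; uid, hsh, and the returned URI format are assumed.
def get_dockerhub_image_uri(repo_name: str, uid: str, hsh: str) -> str:
    # Require a non-empty "org/repo" value instead of defaulting to "".
    if not repo_name or "/" not in repo_name:
        raise ValueError(f"repo_name must look like 'org/repo', got {repo_name!r}")
    # rsplit("/", 1) splits off only the last path segment, so the split
    # stays robust even if earlier segments contain slashes.
    repo_base, repo_name_only = repo_name.lower().rsplit("/", 1)
    return f"{repo_base}/{repo_name_only}_{uid}:{hsh}"
```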
🧹 Nitpick comments (1)
nemo_skills/dataset/swe-bench-pro/prepare.py (1)
45-73: Allow an explicit output path to avoid writing into package directories. When this script is run from an installed package, `Path(__file__).parent` may be read-only. Adding an `--output_file` option keeps the default behavior but avoids permission failures.

🛠️ Proposed tweak
```diff
     parser.add_argument(
         "--dataset_name",
         type=str,
         default="ScaleAI/SWE-bench_Pro",
         help="Dataset name to load",
     )
+    parser.add_argument(
+        "--output_file",
+        type=Path,
+        default=None,
+        help="Path to write JSONL. Defaults to <script_dir>/<setup>.jsonl.",
+    )
     args = parser.parse_args()
@@
-    output_file = Path(__file__).parent / f"{args.setup}.jsonl"
+    output_file = (
+        Path(args.output_file)
+        if args.output_file is not None
+        else Path(__file__).parent / f"{args.setup}.jsonl"
+    )
```
Actionable comments posted: 2
🧹 Nitpick comments (1)
dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine (1)
25-25: Add a comment explaining why `py3-blinker` is removed. This deletion lacks context. A brief inline comment would help future maintainers understand the rationale (e.g., conflict with another package version).
Suggested comment
```diff
+# Remove py3-blinker to avoid version conflicts with Flask/Werkzeug dependencies
 RUN apk del py3-blinker
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine` at line 25, The Dockerfile removal of py3-blinker (the RUN apk del py3-blinker line) needs an inline comment explaining the rationale; update the Dockerfile to add a brief comment on that RUN instruction specifying why py3-blinker is removed (e.g., causes version/conflict with package X, is not used at runtime, or breaks dependency Y) and include any relevant context such as linked issue/PR or package name that conflicted so future maintainers understand the reason.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/inference/eval/swebench.py`:
- Around line 815-832: The _get_gold_patch function writes data_point["patch"]
directly and doesn't normalize trailing newlines like other helpers
(_run_mini_swe_agent, _run_openhands); update _get_gold_patch to ensure the
patch ends with a single newline before writing (e.g., strip any trailing
newlines then append "\n"), write that normalized string into the "model_patch"
field when dumping the JSONL, and return the file path as before.
- Around line 881-894: The code builds a command string (swe_bench_cmd) that
invokes a non-existent module swebench.harness.run_local_evaluation for
SWE-bench Pro; update the branch handling self.cfg.dataset_type ==
SupportedDatasetTypes.swe_bench_pro to call the correct SWE-bench Pro entry
point (e.g., the pro evaluator script provided in the SWE-bench Pro repo such as
swe_bench_pro_eval.py) using the same arguments (--raw_sample_path,
--patch_path, --output_dir, --scripts_dir) and the same venv python binary
(/root/SWE-bench/venv/bin/python), ensuring the repository copy steps (cp -r
/root_mount/SWE-bench /root and cp -r /root_mount/uv /root) and the final cp of
eval-outputs to /trajectories_mount/ are preserved; modify the string built in
swe_bench_cmd accordingly so the runtime invokes the correct script name and
path instead of swebench.harness.run_local_evaluation.
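To make the first comment concrete, here is a rough sketch of the suggested normalization; the `data_point` keys and JSONL schema are assumptions based on the review text, not the actual nemo_skills code:

```python
import json
from pathlib import Path

# Hypothetical sketch; field names beyond "patch" and "model_patch" are assumed.
def _get_gold_patch(data_point: dict, output_dir: str) -> str:
    # Strip all trailing newlines, then append exactly one, mirroring what
    # _run_mini_swe_agent and _run_openhands reportedly do.
    patch = data_point["patch"].rstrip("\n") + "\n"
    out_path = Path(output_dir) / "gold_patches" / f"{data_point['instance_id']}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    record = {"instance_id": data_point["instance_id"], "model_patch": patch}
    out_path.write_text(json.dumps(record) + "\n")
    return str(out_path)
```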
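And for the second comment, the corrected command string might be assembled along these lines; the script name `swe_bench_pro_eval.py` is taken from the review's example, and the function wrapper is a sketch rather than the real implementation:

```python
def build_swe_bench_pro_cmd(raw_sample_path: str, patch_path: str,
                            output_dir: str, scripts_dir: str) -> str:
    # Preserve the repo copy steps, invoke the pro evaluator script with the
    # same venv python and arguments, then export eval-outputs as before.
    return (
        "cp -r /root_mount/SWE-bench /root && "
        "cp -r /root_mount/uv /root && "
        "/root/SWE-bench/venv/bin/python /root/SWE-bench/swe_bench_pro_eval.py "
        f"--raw_sample_path {raw_sample_path} "
        f"--patch_path {patch_path} "
        f"--output_dir {output_dir} "
        f"--scripts_dir {scripts_dir} && "
        "cp -r eval-outputs /trajectories_mount/"
    )
```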
---
Nitpick comments:
In `@dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine`:
- Line 25: The Dockerfile removal of py3-blinker (the RUN apk del py3-blinker
line) needs an inline comment explaining the rationale; update the Dockerfile to
add a brief comment on that RUN instruction specifying why py3-blinker is
removed (e.g., causes version/conflict with package X, is not used at runtime,
or breaks dependency Y) and include any relevant context such as linked issue/PR
or package name that conflicted so future maintainers understand the reason.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7278c5b4-d1db-4794-b071-532ae193e29e
📒 Files selected for processing (4)
- dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine
- nemo_skills/dataset/swe-bench-pro/__init__.py
- nemo_skills/dataset/swe-bench-pro/prepare.py
- nemo_skills/inference/eval/swebench.py
🚧 Files skipped from review as they are similar to previous changes (2)
- nemo_skills/dataset/swe-bench-pro/__init__.py
- nemo_skills/dataset/swe-bench-pro/prepare.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/code.md`:
- Line 232: Replace generic link text "here" with descriptive link text for the
three occurrences; update the link at the benchmark reference to "SWE-bench Pro
leaderboard", the Dockerfile reference to "Alpine host Dockerfile", and the repo
link to "ComputeEval repository", ensuring each anchor text replaces "here"
while keeping the same URLs and Markdown link syntax (search for the three
"here" anchors in the document and swap the label only).
- Around line 245-248: Multiple fenced code blocks in the doc are missing
language tags; update each affected block containing shell commands (e.g., the
blocks with "ns prepare_data swe-bench-pro", "ns prepare_data swe-bench-pro
--container_formatter '/swe-bench-images/{docker_image}.sif'", and the various
"ns eval \" examples) to use ```bash and mark the evaluation output/table block
(the block showing "---------------------------- compute-eval
-----------------------------" and the columns like evaluation_mode |
num_entries ...) as ```text so markdownlint passes and rendering/tooling are
consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ad81ed4-5374-4f2c-80a7-ea8b9d7443fd
📒 Files selected for processing (1)
docs/evaluation/code.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/code.md`:
- Line 307: The docs line references the wrong placeholder—update the result
paths so they use the same placeholders as the commands: replace
'<OUTPUT_DIR>/eval-results/swe-bench-pro/alpine/metrics.json' and
'<OUTPUT_DIR>/eval-results/swe-bench-pro/ubuntu/metrics.json' with
'<OUTPUT_DIR_ALPINE>/eval-results/swe-bench-pro/alpine/metrics.json' and
'<OUTPUT_DIR_UBUNTU>/eval-results/swe-bench-pro/ubuntu/metrics.json'
respectively so the placeholders match the earlier commands and avoid confusing
users.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dea0ab39-ff22-497f-b6b7-54b6df27fa68
📒 Files selected for processing (1)
docs/evaluation/code.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/code.md`:
- Line 244: Update the wording "Dockerhub" to the official branding "Docker Hub"
in the sentence that explains preparing data with Dockerhub container URLs (the
line mentioning the container formatter using `{docker_image}` instead of
`{instance_id}`); ensure only the display text is changed to "Docker Hub"
without altering the example placeholders like `{docker_image}`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2cace89c-2efc-4674-9f39-28bf19eb9f50
📒 Files selected for processing (1)
docs/evaluation/code.md
Here's how to run a sample evaluation of [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) with SWE-agent on a Slurm cluster.
1. Prepare the data following similar [instructions](#data-preparation) as for SWE-bench. The container formatter format is slightly different, using `{docker_image}` instead of `{instance_id}`. To prepare the data with Dockerhub container URLs, you can simply run
Use official branding: “Docker Hub” instead of “Dockerhub”.
At Line 244, this is a minor docs polish issue but improves readability/professionalism.
🧰 Tools
🪛 LanguageTool
[grammar] ~244-~244: Ensure spelling is correct
Context: ...instance_id}`. To prepare the data with Dockerhub container URLs, you can simply run ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/evaluation/code.md` at line 244, Update the wording "Dockerhub" to the
official branding "Docker Hub" in the sentence that explains preparing data with
Dockerhub container URLs (the line mentioning the container formatter using
`{docker_image}` instead of `{instance_id}`); ensure only the display text is
changed to "Docker Hub" without altering the example placeholders like
`{docker_image}`.
@Kipok Wasi and I are ready to merge, but wanted to get your approval as well, since I'm adding a new dockerfile for this. We need it to run a subset of instances that don't work with our pipeline otherwise.
Kipok
left a comment
so this needs to be manually set instead of normal nemo-skills container? I guess we'd want to add this to automatic gitlab ci to build and upload to clusters @activatedgeek
Kipok
left a comment
I guess we need to update test_eval.py in gpu tests to not run this benchmark automatically in ci?
Updated test_eval.py to exclude this benchmark. Regarding gitlab ci - would be ideal but for now I do have a prebuilt sqsh file we can use
[Description written by @ludwig-n]
Adds the SWE-bench Pro benchmark.
This benchmark has a subset of 88 instances where the containers are based on Alpine Linux, which is incompatible with the standard Nemo-Skills container. Therefore, they have to be run in a separate job with a different host container based on Alpine. The Dockerfile is included in the PR.
Because of technical issues, this benchmark does not support OpenHands. Only SWE-agent and mini-SWE-agent are supported.
Sample evaluation scores:
This PR also adds the ability to evaluate gold (ground truth) patches by specifying `++agent_framework=gold_patch`.

Summary by CodeRabbit
New Features
Documentation
Chores