
Integrating SWE-Pro (Public) Dataset Eval #1197

Merged
wasiahmad merged 65 commits into main from wasiahmad/swe-pro
Mar 24, 2026

Conversation

@wasiahmad
Collaborator

@wasiahmad wasiahmad commented Jan 28, 2026

[Description written by @ludwig-n]

Adds the SWE-bench Pro benchmark.

This benchmark has a subset of 88 instances where the containers are based on Alpine Linux, which is incompatible with the standard Nemo-Skills container. Therefore, they have to be run in a separate job with a different host container based on Alpine. The Dockerfile is included in the PR.

Because of technical issues, this benchmark does not support OpenHands. Only SWE-agent and mini-SWE-agent are supported.

Sample evaluation scores:

| model | framework | run 1 | run 2 | run 3 | avg |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | SWE-agent | 40.1 | 40.2 | 39.4 | 39.9 |
| Qwen3-Coder-480B-A35B-Instruct | SWE-agent | 34.6 | 32.8 | 33.0 | 33.5 |
| Minimax-M2.5 | SWE-agent | 45.4 | 50.2 | 48.7 | 48.1 |
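The avg column is the mean of the three runs, rounded to one decimal place; a quick sanity check over the numbers in the table above:

```python
# Verify that each "avg" equals the mean of the three runs,
# rounded to one decimal place (scores copied from the table above).
scores = {
    "Qwen3-Coder-Next": ([40.1, 40.2, 39.4], 39.9),
    "Qwen3-Coder-480B-A35B-Instruct": ([34.6, 32.8, 33.0], 33.5),
    "Minimax-M2.5": ([45.4, 50.2, 48.7], 48.1),
}
for model, (runs, avg) in scores.items():
    assert round(sum(runs) / len(runs), 1) == avg, model
```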

This PR also adds the ability to evaluate gold (ground-truth) patches by specifying `++agent_framework=gold_patch`.

Summary by CodeRabbit

  • New Features

    • SWE-bench Pro dataset integration with sensible default evaluation/configuration values.
    • Data preparation tool that normalizes languages, enriches records with container metadata, and emits split-specific JSONL outputs (alpine vs ubuntu).
    • Alpine-based runtime image for evaluation tooling.
    • New evaluation options: gold-patch mode and a pro-style evaluation workflow producing compatible evaluation artifacts.
  • Documentation

    • Added SWE-bench Pro docs with usage examples, split handling notes, and evaluation command samples.
  • Chores

    • Command-line parameters for dataset setup and container formatting.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@wasiahmad wasiahmad requested a review from ludwig-n January 28, 2026 23:52
@wasiahmad wasiahmad marked this pull request as draft January 28, 2026 23:52
Contributor

@greptile-apps greptile-apps Bot left a comment


1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread nemo_skills/dataset/swe-bench-pro/prepare.py Outdated
@coderabbitai
Contributor

coderabbitai Bot commented Jan 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds a SWE-bench Pro dataset package (config constants and a prepare script), an Alpine-based Dockerfile for SWE-bench, and evaluation updates: a gold-patch agent mode plus a new SWE-bench Pro evaluation path and dataset-type config.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **SWE-bench Pro dataset package**<br>`nemo_skills/dataset/swe-bench-pro/__init__.py`, `nemo_skills/dataset/swe-bench-pro/prepare.py` | Adds module-level evaluation constants (EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE, GENERATION_MODULE, GENERATION_ARGS). prepare.py normalizes `repo_language` to `language` via LANGUAGE_MAP, appends requirements/interface into `problem_statement`, computes `container_formatter`, sets ALPINE_INSTANCE_IDS, enriches metadata, and emits `<setup>.alpine.jsonl` and `<setup>.ubuntu.jsonl`. |
| **SWE-bench evaluation logic**<br>`nemo_skills/inference/eval/swebench.py` | Adds SupportedAgentFrameworks.gold_patch and SupportedDatasetTypes (swe_bench, swe_bench_pro); extends SweBenchGenerationConfig with `dataset_type`; implements async `_get_gold_patch()` and routes `_process_single_datapoint_impl` for gold_patch; builds a pro-style harness CLI for swe_bench_pro and adjusts eval-output handling while preserving the legacy CLI for other datasets. |
| **Docker image for SWE-bench (Alpine)**<br>`dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine` | Adds an Alpine-based Dockerfile: installs the Python 3.10 toolchain, build deps and utilities (git, git-lfs, ffmpeg, bash), installs Apptainer from Alpine v3.19, copies project files and core requirements (removing sentence_transformers), and installs core requirements. Declares ARG CACHEBUST. |
| **Documentation (reordering + new section)**<br>`docs/evaluation/code.md` | Inserts a new swe-bench-pro section with dataset details, prepare/eval command examples, and notes on differences from SWE-bench; reorders the compute-eval section and adjusts sample table formatting. |
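The alpine/ubuntu split described for prepare.py can be sketched as follows. This is a simplified illustration, not the actual prepare.py: the real ALPINE_INSTANCE_IDS set, record fields, and enrichment steps differ.

```python
import json
from pathlib import Path

# Hypothetical stand-in for the real ALPINE_INSTANCE_IDS constant.
ALPINE_INSTANCE_IDS = {"repo-a__task-1", "repo-b__task-7"}

def write_splits(records, out_dir=".", setup="test"):
    """Emit <setup>.alpine.jsonl and <setup>.ubuntu.jsonl side by side."""
    out_dir = Path(out_dir)
    counts = {"alpine": 0, "ubuntu": 0}
    with (out_dir / f"{setup}.alpine.jsonl").open("w") as alpine_f, \
         (out_dir / f"{setup}.ubuntu.jsonl").open("w") as ubuntu_f:
        for rec in records:
            # Route each instance to the split matching its base container.
            is_alpine = rec["instance_id"] in ALPINE_INSTANCE_IDS
            target = alpine_f if is_alpine else ubuntu_f
            target.write(json.dumps(rec) + "\n")
            counts["alpine" if is_alpine else "ubuntu"] += 1
    return counts

counts = write_splits(
    [{"instance_id": "repo-a__task-1"}, {"instance_id": "repo-c__task-2"}]
)
```

The two output files can then be fed to separate eval jobs, each with the matching host container.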

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Task as GenerationTask
  participant FS as FileSystem
  participant Agent as AgentFramework
  participant Harness as SWEbenchHarness

  Note over Task,Agent: For each datapoint
  Task->>Agent: check agent_framework
  alt agent_framework == "gold_patch"
    Agent->>FS: write gold_patch JSONL (output_dir/gold_patches/<id>.jsonl)
    FS-->>Task: gold_patch_path
    Task->>Harness: run pro-style evaluation (--raw_sample_path, --patch_path, --output_dir, --scripts_dir)
    Harness-->>FS: write eval-outputs
  else non-gold
    Agent->>Agent: generate predictions
    Agent->>FS: write predictions file
    Task->>Harness: run evaluation
    alt dataset_type == swe_bench_pro
      Task->>Harness: invoke pro-style CLI (raw_sample_path/patch_path/output_dir/scripts_dir)
      Harness-->>FS: write eval-outputs
    else
      Task->>Harness: invoke legacy CLI (predictions_path/instance_ids/dataset_name)
      Harness-->>FS: write logs/run_evaluation/eval-outputs
    end
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Integrating SWE-Pro (Public) Dataset Eval' accurately reflects the main objective of adding SWE-bench Pro support to the codebase, as evidenced by new files for dataset preparation, configuration, evaluation logic, and documentation. |
| Description check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/swe-bench-pro/prepare.py`:
- Around line 24-26: The get_dockerhub_image_uri function currently uses
repo_name.lower().split("/") which will fail if repo_name defaults to "" — make
repo_name required (remove the default "") or validate and raise a clear error
if it's empty/doesn't contain a slash; also make the split robust by using
rsplit("/", 1) and assign to repo_base and repo_name_only (keep references to
uid and hsh as-is), so update the function signature and replace
repo_name.lower().split("/") with repo_name.lower().rsplit("/", 1) and add a
guard that raises ValueError with a helpful message when repo_name is invalid.
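The hardened helper the comment asks for might look like this. The signature and the returned URI format are assumptions for illustration, since the actual function body isn't shown here:

```python
def get_dockerhub_image_uri(repo_name: str, uid: str, hsh: str) -> str:
    """Build a Docker Hub image URI from an 'owner/repo' name.

    Hypothetical sketch: repo_name is now required, invalid values raise
    early, and rsplit keeps the split robust to extra slashes.
    """
    if not repo_name or "/" not in repo_name:
        raise ValueError(
            f"repo_name must look like 'owner/repo', got {repo_name!r}"
        )
    repo_base, repo_name_only = repo_name.lower().rsplit("/", 1)
    # The URI layout below is an assumed placeholder, not the real format.
    return f"{repo_base}_{repo_name_only}_{uid}:{hsh}"
```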
🧹 Nitpick comments (1)
nemo_skills/dataset/swe-bench-pro/prepare.py (1)

45-73: Allow an explicit output path to avoid writing into package directories.

When this script is run from an installed package, Path(__file__).parent may be read-only. Adding an --output_file option keeps the default behavior but avoids permission failures.

🛠️ Proposed tweak
```diff
     parser.add_argument(
         "--dataset_name",
         type=str,
         default="ScaleAI/SWE-bench_Pro",
         help="Dataset name to load",
     )
+    parser.add_argument(
+        "--output_file",
+        type=Path,
+        default=None,
+        help="Path to write JSONL. Defaults to <script_dir>/<setup>.jsonl.",
+    )
     args = parser.parse_args()
@@
-    output_file = Path(__file__).parent / f"{args.setup}.jsonl"
+    output_file = (
+        Path(args.output_file)
+        if args.output_file is not None
+        else Path(__file__).parent / f"{args.setup}.jsonl"
+    )
```

Comment thread nemo_skills/dataset/swe-bench-pro/prepare.py Outdated
wasiahmad and others added 24 commits January 30, 2026 12:12
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
@ludwig-n ludwig-n marked this pull request as ready for review March 19, 2026 15:19
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine (1)

25-25: Add a comment explaining why py3-blinker is removed.

This deletion lacks context. A brief inline comment would help future maintainers understand the rationale (e.g., conflict with another package version).

Suggested comment
```diff
+# Remove py3-blinker to avoid version conflicts with Flask/Werkzeug dependencies
 RUN apk del py3-blinker
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine` at line 25, The
Dockerfile removal of py3-blinker (the RUN apk del py3-blinker line) needs an
inline comment explaining the rationale; update the Dockerfile to add a brief
comment on that RUN instruction specifying why py3-blinker is removed (e.g.,
causes version/conflict with package X, is not used at runtime, or breaks
dependency Y) and include any relevant context such as linked issue/PR or
package name that conflicted so future maintainers understand the reason.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/inference/eval/swebench.py`:
- Around line 815-832: The _get_gold_patch function writes data_point["patch"]
directly and doesn't normalize trailing newlines like other helpers
(_run_mini_swe_agent, _run_openhands); update _get_gold_patch to ensure the
patch ends with a single newline before writing (e.g., strip any trailing
newlines then append "\n"), write that normalized string into the "model_patch"
field when dumping the JSONL, and return the file path as before.
- Around line 881-894: The code builds a command string (swe_bench_cmd) that
invokes a non-existent module swebench.harness.run_local_evaluation for
SWE-bench Pro; update the branch handling self.cfg.dataset_type ==
SupportedDatasetTypes.swe_bench_pro to call the correct SWE-bench Pro entry
point (e.g., the pro evaluator script provided in the SWE-bench Pro repo such as
swe_bench_pro_eval.py) using the same arguments (--raw_sample_path,
--patch_path, --output_dir, --scripts_dir) and the same venv python binary
(/root/SWE-bench/venv/bin/python), ensuring the repository copy steps (cp -r
/root_mount/SWE-bench /root and cp -r /root_mount/uv /root) and the final cp of
eval-outputs to /trajectories_mount/ are preserved; modify the string built in
swe_bench_cmd accordingly so the runtime invokes the correct script name and
path instead of swebench.harness.run_local_evaluation.
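The patch normalization requested in the first inline comment can be sketched as follows. This is a simplified illustration with hypothetical helper names; the `instance_id`/`model_patch` fields follow the standard SWE-bench predictions format, and the real `_get_gold_patch` has a different surrounding structure:

```python
import json

def normalize_patch(patch: str) -> str:
    """Ensure the patch ends with exactly one trailing newline."""
    return patch.rstrip("\n") + "\n"

def write_gold_patch_jsonl(data_point: dict, path: str) -> str:
    # Sketch of the JSONL record a gold-patch helper would dump:
    # the gold patch goes into "model_patch" so the harness can
    # consume it like any other prediction.
    record = {
        "instance_id": data_point["instance_id"],
        "model_patch": normalize_patch(data_point["patch"]),
    }
    with open(path, "w") as f:
        f.write(json.dumps(record) + "\n")
    return path
```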


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7278c5b4-d1db-4794-b071-532ae193e29e

📥 Commits

Reviewing files that changed from the base of the PR and between 88b6942 and 4f3a08b.

📒 Files selected for processing (4)
  • dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine
  • nemo_skills/dataset/swe-bench-pro/__init__.py
  • nemo_skills/dataset/swe-bench-pro/prepare.py
  • nemo_skills/inference/eval/swebench.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • nemo_skills/dataset/swe-bench-pro/__init__.py
  • nemo_skills/dataset/swe-bench-pro/prepare.py

Comment thread nemo_skills/inference/eval/swebench.py
Comment thread nemo_skills/inference/eval/swebench.py
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/code.md`:
- Line 232: Replace generic link text "here" with descriptive link text for the
three occurrences; update the link at the benchmark reference to "SWE-bench Pro
leaderboard", the Dockerfile reference to "Alpine host Dockerfile", and the repo
link to "ComputeEval repository", ensuring each anchor text replaces "here"
while keeping the same URLs and Markdown link syntax (search for the three
"here" anchors in the document and swap the label only).
- Around line 245-248: Multiple fenced code blocks in the doc are missing
language tags; update each affected block containing shell commands (e.g., the
blocks with "ns prepare_data swe-bench-pro", "ns prepare_data swe-bench-pro
--container_formatter '/swe-bench-images/{docker_image}.sif'", and the various
"ns eval \" examples) to use ```bash and mark the evaluation output/table block
(the block showing "---------------------------- compute-eval
-----------------------------" and the columns like evaluation_mode |
num_entries ...) as ```text so markdownlint passes and rendering/tooling are
consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2ad81ed4-5374-4f2c-80a7-ea8b9d7443fd

📥 Commits

Reviewing files that changed from the base of the PR and between 4f3a08b and fb7a07d.

📒 Files selected for processing (1)
  • docs/evaluation/code.md

Comment thread docs/evaluation/code.md
Comment thread docs/evaluation/code.md
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/code.md`:
- Line 307: The docs line references the wrong placeholder—update the result
paths so they use the same placeholders as the commands: replace
'<OUTPUT_DIR>/eval-results/swe-bench-pro/alpine/metrics.json' and
'<OUTPUT_DIR>/eval-results/swe-bench-pro/ubuntu/metrics.json' with
'<OUTPUT_DIR_ALPINE>/eval-results/swe-bench-pro/alpine/metrics.json' and
'<OUTPUT_DIR_UBUNTU>/eval-results/swe-bench-pro/ubuntu/metrics.json'
respectively so the placeholders match the earlier commands and avoid confusing
users.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dea0ab39-ff22-497f-b6b7-54b6df27fa68

📥 Commits

Reviewing files that changed from the base of the PR and between fb7a07d and 8ec51a4.

📒 Files selected for processing (1)
  • docs/evaluation/code.md

Comment thread docs/evaluation/code.md Outdated
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/code.md`:
- Line 244: Update the wording "Dockerhub" to the official branding "Docker Hub"
in the sentence that explains preparing data with Dockerhub container URLs (the
line mentioning the container formatter using `{docker_image}` instead of
`{instance_id}`); ensure only the display text is changed to "Docker Hub"
without altering the example placeholders like `{docker_image}`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2cace89c-2efc-4674-9f39-28bf19eb9f50

📥 Commits

Reviewing files that changed from the base of the PR and between 8ec51a4 and 9eeeaca.

📒 Files selected for processing (1)
  • docs/evaluation/code.md

Comment thread docs/evaluation/code.md

Here's how to run a sample evaluation of [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) with SWE-agent on a Slurm cluster.

1. Prepare the data following similar [instructions](#data-preparation) as for SWE-bench. The container formatter format is slightly different, using `{docker_image}` instead of `{instance_id}`. To prepare the data with Dockerhub container URLs, you can simply run
Contributor


⚠️ Potential issue | 🟡 Minor

Use official branding: “Docker Hub” instead of “Dockerhub”.

At Line 244, this is a minor docs polish issue but improves readability/professionalism.



@ludwig-n
Collaborator

@Kipok Wasi and I are ready to merge, but wanted to get your approval as well, since I'm adding a new dockerfile for this. We need it to run a subset of instances that don't work with our pipeline otherwise.

@ludwig-n ludwig-n requested a review from Kipok March 19, 2026 17:57
Collaborator

@Kipok Kipok left a comment


So this needs to be manually set instead of the normal nemo-skills container? I guess we'd want to add this to the automatic GitLab CI to build and upload to clusters @activatedgeek

Comment thread nemo_skills/dataset/swe-bench-pro/__init__.py Outdated
Collaborator

@Kipok Kipok left a comment


I guess we need to update test_eval.py in gpu tests to not run this benchmark automatically in ci?

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
@ludwig-n
Collaborator

Updated test_eval.py to exclude this benchmark. Regarding GitLab CI: that would be ideal, but for now I do have a prebuilt sqsh file we can use.

@wasiahmad wasiahmad enabled auto-merge (squash) March 24, 2026 18:59
@wasiahmad wasiahmad merged commit 9691710 into main Mar 24, 2026
5 checks passed
@wasiahmad wasiahmad deleted the wasiahmad/swe-pro branch March 24, 2026 19:18
jeffwillette pushed a commit that referenced this pull request Mar 30, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Co-authored-by: Nikolai Ludwig <nliudvig@nvidia.com>
mehrzads pushed a commit to mehrzads/NeMo-Skills that referenced this pull request Apr 8, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Co-authored-by: Nikolai Ludwig <nliudvig@nvidia.com>