Integrating SWE-Pro (Public) Dataset Eval #1197
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a SWE-bench Pro dataset package (config constants and a prepare script), an Alpine-based Dockerfile for SWE-bench, and evaluation updates: a gold-patch agent mode plus a new SWE-bench Pro evaluation path and dataset-type config.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Task as GenerationTask
    participant FS as FileSystem
    participant Agent as AgentFramework
    participant Harness as SWEbenchHarness
    Note over Task,Agent: For each datapoint
    Task->>Agent: check agent_framework
    alt agent_framework == "gold_patch"
        Agent->>FS: write gold_patch JSONL (output_dir/gold_patches/<id>.jsonl)
        FS-->>Task: gold_patch_path
        Task->>Harness: run pro-style evaluation (--raw_sample_path, --patch_path, --output_dir, --scripts_dir)
        Harness-->>FS: write eval-outputs
    else non-gold
        Agent->>Agent: generate predictions
        Agent->>FS: write predictions file
        Task->>Harness: run evaluation
        alt dataset_type == swe_bench_pro
            Task->>Harness: invoke pro-style CLI (raw_sample_path/patch_path/output_dir/scripts_dir)
            Harness-->>FS: write eval-outputs
        else
            Task->>Harness: invoke legacy CLI (predictions_path/instance_ids/dataset_name)
            Harness-->>FS: write logs/run_evaluation/eval-outputs
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@nemo_skills/dataset/swe-bench-pro/prepare.py`:
- Around line 24-26: The get_dockerhub_image_uri function currently uses
repo_name.lower().split("/") which will fail if repo_name defaults to "" — make
repo_name required (remove the default "") or validate and raise a clear error
if it's empty/doesn't contain a slash; also make the split robust by using
rsplit("/", 1) and assign to repo_base and repo_name_only (keep references to
uid and hsh as-is), so update the function signature and replace
repo_name.lower().split("/") with repo_name.lower().rsplit("/", 1) and add a
guard that raises ValueError with a helpful message when repo_name is invalid.
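A minimal sketch of what the suggested guard could look like; the `uid` and `hsh` parameters and the returned URI format are assumptions for illustration, not the actual signature:

```python
# Hypothetical sketch of the suggested fix. Only the repo_name handling
# follows the review; uid, hsh, and the returned URI format are assumed.
def get_dockerhub_image_uri(repo_name: str, uid: str, hsh: str) -> str:
    # Require a non-empty "org/repo" value instead of defaulting to "".
    if not repo_name or "/" not in repo_name:
        raise ValueError(f"repo_name must look like 'org/repo', got {repo_name!r}")
    # rsplit("/", 1) splits off only the last path segment, so the split
    # stays robust even if earlier segments contain slashes.
    repo_base, repo_name_only = repo_name.lower().rsplit("/", 1)
    return f"{repo_base}/{repo_name_only}_{uid}:{hsh}"
```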
🧹 Nitpick comments (1)
nemo_skills/dataset/swe-bench-pro/prepare.py (1)
45-73: Allow an explicit output path to avoid writing into package directories. When this script is run from an installed package, `Path(__file__).parent` may be read-only. Adding an `--output_file` option keeps the default behavior but avoids permission failures.

🛠️ Proposed tweak
```diff
     parser.add_argument(
         "--dataset_name",
         type=str,
         default="ScaleAI/SWE-bench_Pro",
         help="Dataset name to load",
     )
+    parser.add_argument(
+        "--output_file",
+        type=Path,
+        default=None,
+        help="Path to write JSONL. Defaults to <script_dir>/<setup>.jsonl.",
+    )
     args = parser.parse_args()
@@
-    output_file = Path(__file__).parent / f"{args.setup}.jsonl"
+    output_file = (
+        Path(args.output_file)
+        if args.output_file is not None
+        else Path(__file__).parent / f"{args.setup}.jsonl"
+    )
```
Actionable comments posted: 2
🧹 Nitpick comments (1)
dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine (1)
25-25: Add a comment explaining why `py3-blinker` is removed. This deletion lacks context. A brief inline comment would help future maintainers understand the rationale (e.g., conflict with another package version).
Suggested comment
```diff
+# Remove py3-blinker to avoid version conflicts with Flask/Werkzeug dependencies
 RUN apk del py3-blinker
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine` at line 25, The Dockerfile removal of py3-blinker (the RUN apk del py3-blinker line) needs an inline comment explaining the rationale; update the Dockerfile to add a brief comment on that RUN instruction specifying why py3-blinker is removed (e.g., causes version/conflict with package X, is not used at runtime, or breaks dependency Y) and include any relevant context such as linked issue/PR or package name that conflicted so future maintainers understand the reason.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/inference/eval/swebench.py`:
- Around line 815-832: The _get_gold_patch function writes data_point["patch"]
directly and doesn't normalize trailing newlines like other helpers
(_run_mini_swe_agent, _run_openhands); update _get_gold_patch to ensure the
patch ends with a single newline before writing (e.g., strip any trailing
newlines then append "\n"), write that normalized string into the "model_patch"
field when dumping the JSONL, and return the file path as before.
- Around line 881-894: The code builds a command string (swe_bench_cmd) that
invokes a non-existent module swebench.harness.run_local_evaluation for
SWE-bench Pro; update the branch handling self.cfg.dataset_type ==
SupportedDatasetTypes.swe_bench_pro to call the correct SWE-bench Pro entry
point (e.g., the pro evaluator script provided in the SWE-bench Pro repo such as
swe_bench_pro_eval.py) using the same arguments (--raw_sample_path,
--patch_path, --output_dir, --scripts_dir) and the same venv python binary
(/root/SWE-bench/venv/bin/python), ensuring the repository copy steps (cp -r
/root_mount/SWE-bench /root and cp -r /root_mount/uv /root) and the final cp of
eval-outputs to /trajectories_mount/ are preserved; modify the string built in
swe_bench_cmd accordingly so the runtime invokes the correct script name and
path instead of swebench.harness.run_local_evaluation.
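To make the first comment concrete, here is a rough sketch of the suggested normalization; the `data_point` keys and JSONL schema are assumptions based on the review text, not the actual nemo_skills code:

```python
import json
from pathlib import Path

# Hypothetical sketch; field names beyond "patch" and "model_patch" are assumed.
def _get_gold_patch(data_point: dict, output_dir: str) -> str:
    # Strip all trailing newlines, then append exactly one, mirroring what
    # _run_mini_swe_agent and _run_openhands reportedly do.
    patch = data_point["patch"].rstrip("\n") + "\n"
    out_path = Path(output_dir) / "gold_patches" / f"{data_point['instance_id']}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    record = {"instance_id": data_point["instance_id"], "model_patch": patch}
    out_path.write_text(json.dumps(record) + "\n")
    return str(out_path)
```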
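And for the second comment, the corrected command string might be assembled along these lines; the script name `swe_bench_pro_eval.py` is taken from the review's example, and the function wrapper is a sketch rather than the real implementation:

```python
def build_swe_bench_pro_cmd(raw_sample_path: str, patch_path: str,
                            output_dir: str, scripts_dir: str) -> str:
    # Preserve the repo copy steps, invoke the pro evaluator script with the
    # same venv python and arguments, then export eval-outputs as before.
    return (
        "cp -r /root_mount/SWE-bench /root && "
        "cp -r /root_mount/uv /root && "
        "/root/SWE-bench/venv/bin/python /root/SWE-bench/swe_bench_pro_eval.py "
        f"--raw_sample_path {raw_sample_path} "
        f"--patch_path {patch_path} "
        f"--output_dir {output_dir} "
        f"--scripts_dir {scripts_dir} && "
        "cp -r eval-outputs /trajectories_mount/"
    )
```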
---
Nitpick comments:
In `@dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine`:
- Line 25: The Dockerfile removal of py3-blinker (the RUN apk del py3-blinker
line) needs an inline comment explaining the rationale; update the Dockerfile to
add a brief comment on that RUN instruction specifying why py3-blinker is
removed (e.g., causes version/conflict with package X, is not used at runtime,
or breaks dependency Y) and include any relevant context such as linked issue/PR
or package name that conflicted so future maintainers understand the reason.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7278c5b4-d1db-4794-b071-532ae193e29e
📒 Files selected for processing (4)
- dockerfiles/swe-bench/Dockerfile.nemo-skills.alpine
- nemo_skills/dataset/swe-bench-pro/__init__.py
- nemo_skills/dataset/swe-bench-pro/prepare.py
- nemo_skills/inference/eval/swebench.py
🚧 Files skipped from review as they are similar to previous changes (2)
- nemo_skills/dataset/swe-bench-pro/__init__.py
- nemo_skills/dataset/swe-bench-pro/prepare.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/code.md`:
- Line 232: Replace generic link text "here" with descriptive link text for the
three occurrences; update the link at the benchmark reference to "SWE-bench Pro
leaderboard", the Dockerfile reference to "Alpine host Dockerfile", and the repo
link to "ComputeEval repository", ensuring each anchor text replaces "here"
while keeping the same URLs and Markdown link syntax (search for the three
"here" anchors in the document and swap the label only).
- Around line 245-248: Multiple fenced code blocks in the doc are missing
language tags; update each affected block containing shell commands (e.g., the
blocks with "ns prepare_data swe-bench-pro", "ns prepare_data swe-bench-pro
--container_formatter '/swe-bench-images/{docker_image}.sif'", and the various
"ns eval \" examples) to use ```bash and mark the evaluation output/table block
(the block showing "---------------------------- compute-eval
-----------------------------" and the columns like evaluation_mode |
num_entries ...) as ```text so markdownlint passes and rendering/tooling are
consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ad81ed4-5374-4f2c-80a7-ea8b9d7443fd
📒 Files selected for processing (1)
docs/evaluation/code.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/code.md`:
- Line 307: The docs line references the wrong placeholder—update the result
paths so they use the same placeholders as the commands: replace
'<OUTPUT_DIR>/eval-results/swe-bench-pro/alpine/metrics.json' and
'<OUTPUT_DIR>/eval-results/swe-bench-pro/ubuntu/metrics.json' with
'<OUTPUT_DIR_ALPINE>/eval-results/swe-bench-pro/alpine/metrics.json' and
'<OUTPUT_DIR_UBUNTU>/eval-results/swe-bench-pro/ubuntu/metrics.json'
respectively so the placeholders match the earlier commands and avoid confusing
users.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dea0ab39-ff22-497f-b6b7-54b6df27fa68
📒 Files selected for processing (1)
docs/evaluation/code.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/code.md`:
- Line 244: Update the wording "Dockerhub" to the official branding "Docker Hub"
in the sentence that explains preparing data with Dockerhub container URLs (the
line mentioning the container formatter using `{docker_image}` instead of
`{instance_id}`); ensure only the display text is changed to "Docker Hub"
without altering the example placeholders like `{docker_image}`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2cace89c-2efc-4674-9f39-28bf19eb9f50
📒 Files selected for processing (1)
docs/evaluation/code.md
Here's how to run a sample evaluation of [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) with SWE-agent on a Slurm cluster.
1. Prepare the data following similar [instructions](#data-preparation) as for SWE-bench. The container formatter format is slightly different, using `{docker_image}` instead of `{instance_id}`. To prepare the data with Dockerhub container URLs, you can simply run
Use official branding: “Docker Hub” instead of “Dockerhub”.
At Line 244, this is a minor docs polish issue but improves readability/professionalism.
🧰 Tools
🪛 LanguageTool
[grammar] ~244-~244: Ensure spelling is correct
Context: ...instance_id}`. To prepare the data with Dockerhub container URLs, you can simply run ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/evaluation/code.md` at line 244, Update the wording "Dockerhub" to the
official branding "Docker Hub" in the sentence that explains preparing data with
Dockerhub container URLs (the line mentioning the container formatter using
`{docker_image}` instead of `{instance_id}`); ensure only the display text is
changed to "Docker Hub" without altering the example placeholders like
`{docker_image}`.
@Kipok Wasi and I are ready to merge, but wanted to get your approval as well, since I'm adding a new dockerfile for this. We need it to run a subset of instances that don't work with our pipeline otherwise.
Kipok
left a comment
so this needs to be manually set instead of normal nemo-skills container? I guess we'd want to add this to automatic gitlab ci to build and upload to clusters @activatedgeek
Kipok
left a comment
I guess we need to update test_eval.py in gpu tests to not run this benchmark automatically in ci?
Updated test_eval.py to exclude this benchmark. Regarding gitlab ci - would be ideal but for now I do have a prebuilt sqsh file we can use
[Description written by @ludwig-n]
Adds the SWE-bench Pro benchmark.
This benchmark has a subset of 88 instances where the containers are based on Alpine Linux, which is incompatible with the standard Nemo-Skills container. Therefore, they have to be run in a separate job with a different host container based on Alpine. The Dockerfile is included in the PR.
Because of technical issues, this benchmark does not support OpenHands. Only SWE-agent and mini-SWE-agent are supported.
Sample evaluation scores:
This PR also adds the ability to evaluate gold (ground truth) patches by specifying `++agent_framework=gold_patch`.

Summary by CodeRabbit
New Features
Documentation
Chores