[arteval-bench] Integrate ae-agent into ArtEval benchmark #131
Merged
xuafeng merged 7 commits into sys-intelligence:main (Mar 5, 2026)
Conversation
Force-pushed from f5c3bab to b99ebc4
xuafeng reviewed Feb 13, 2026
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "env": "bastoica/ae-agent-ubuntu24.04:latest", "gpu": false}
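The diff above mixes two row shapes (the older evaluator/expected_score fields and the newer env/gpu fields). A minimal sketch of how a loader might tolerate both — field names follow the diff, but the benchmark's real loader may differ:

```python
import json

# Hypothetical sketch of parsing arteval_tasks.jsonl rows. Field names
# ("artifact_id", "env", "gpu"; "evaluator"/"expected_score" in older
# rows) follow the diff above; the real loader may differ.
def load_tasks(lines):
    tasks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines and a missing trailing newline
        item = json.loads(line)
        # Both row shapes appear in the diff, so optional fields get defaults.
        item.setdefault("gpu", False)
        tasks.append(item)
    return tasks
```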
Collaborator
This is a benchmark, so I think we need evaluator and expected_score, right?
Collaborator
@Couen Please continue to work on this PR.
Force-pushed from 30223e5 to 1f38e1c
Collaborator
@Couen I think this PR looks good now. Can you rebase on the main branch and resolve the conflict?
…rmat

- Add run_eval.py and main.py to ae_agent for running tasks on the host (env=local) or in Docker; run_eval(env, ...) is the single entry point.
- Expand utils.py with helpers for main/run_eval (safe_task_id, env_from_item, resolve_project_path, Tee, write_task_report, compute_and_write_summary).
- Update the ae_agent README with host-mode usage and new file descriptions.
- Unify arteval_tasks.jsonl to the new format: artifact_id, artifact_dir, artifact_readme, artifact_url, env, gpu; remove evaluator/expected_score.
- Ignore duplicate task list copies (arteval_tasks copy*.jsonl) in .gitignore.
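Two of the helpers named in this commit lend themselves to a short sketch. These are hypothetical implementations of safe_task_id and env_from_item, written from the commit message alone; the actual code in utils.py may differ:

```python
import re

# Hypothetical sketch of two helpers named in the commit message
# (safe_task_id, env_from_item); the real utils.py may differ.
def safe_task_id(artifact_id: str) -> str:
    # Reduce an artifact id to a filesystem-safe slug for output dirs.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", artifact_id).strip("_")

def env_from_item(item: dict) -> str:
    # In the unified format, "env" is either a Docker image name
    # or "local" for host runs; default to host when absent.
    return item.get("env", "local")
```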
…oke test

- Add ae_agent under benchmarks/arteval_bench/src/agents/ae_agent (main, run_eval, runner, utils, runner.sh, install.sh).
- Wire benchmark main.py and run_eval_in_env.py for ae_agent: the host path runs the agent then the evaluator and parses the score; the Docker path uses the same flow.
- Add a src/utils.py re-export for get_task when running from the benchmark root.
- SDK utils: do not overwrite existing env vars when loading env.toml (preserves the API key).
- Add a minimal smoke test: ae_agent_smoke artifact, ae_agent_smoke_test.jsonl (host + docker), run_ae_agent_smoke_test.sh.
- Remove interactive_runner.py (interactive mode is handled in the runner).
- Use English throughout (docs, comments); ruff-compliant; a single _make_eval_result for the result shape.
Force-pushed from 1f38e1c to 2c3af40
xuafeng reviewed Feb 26, 2026
@@ -0,0 +1,2 @@
{"artifact_id": "ae_agent_smoke_host", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "run_on_host": true}
{"artifact_id": "ae_agent_smoke_docker", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "docker_env": "bastoica/ae-agent-ubuntu24.04:latest", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "run_on_host": false}
Collaborator
@Couen Can we unify the input format with the artifact-agent, merging docker_env and run_on_host? Then we should update arteval_tasks.jsonl and the related code.
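The requested unification can be sketched as a small migration step: collapse the legacy docker_env/run_on_host pair into a single env field ("local" for host runs, otherwise the Docker image name). This is a hypothetical sketch of that mapping, not the code that was merged:

```python
# Hypothetical migration sketch for the unification requested above:
# fold the legacy docker_env/run_on_host pair into one "env" field.
def unify_env(item: dict) -> dict:
    item = dict(item)  # avoid mutating the caller's task row
    if "env" not in item:
        if item.pop("run_on_host", False):
            item["env"] = "local"  # host run
        else:
            # Docker run: take the image name, defaulting to host if absent
            item["env"] = item.get("docker_env", "local")
    item.pop("docker_env", None)
    item.pop("run_on_host", None)
    return item
```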
Collaborator
@bastoica Can you please take a look at the code and leave your comments? Thanks.
…env, omit optional fields

Made-with: Cursor
- main.py: Extract an _is_ae_agent(agent) helper and use it for report/summary writing; use json.dumps(..., ensure_ascii=False) for result.jsonl.
- run_eval_in_env.py: Remove the unused Path import in the interactive foreground path; reuse _get_container_id_from_runtime for the long-running agent block instead of duplicating container ID resolution.
- README: Update usage and JSONL/CLI options.
Force-pushed from 535615a to 511bfa3
The ArtEval evaluator field in the JSONL points to _agent_eval/main.py. The code was running it as a bare shell command, causing "No such file or directory" (exit 127). Now, when the path ends with .py, it is run with "cd /repo && python <evaluator_path>" in Docker and "cd <project_path> && python <evaluator_path>" on the host.

- run_eval_in_env.py: build eval_cmd with cd /repo and python for .py paths.
- ae_agent/run_eval.py: do the same in run_agent_then_eval (host path).
Summary
- Add ae_agent under benchmarks/arteval_bench/src/agents/ae_agent.
- Update benchmarks/arteval_bench/src/main.py so the agent can be selected via -a ae_agent.
- Update benchmarks/arteval_bench/src/run_eval_in_env.py to treat ae_agent as a long-running agent: pass tasks via /agent/current_task.txt, stream live logs, and support Anthropic Foundry env vars.
- Add README_ae_agent.md and run_ae_agent.sh under benchmarks/arteval_bench/data/benchmark to document and simplify running ArtEval with ae_agent.

Details
- _agent_eval is removed before the run and re-uploaded before evaluation; the container is kept running for inspection.
- Tasks are passed to the agent via /agent/current_task.txt.
- Supports ANTHROPIC_FOUNDRY_API_KEY, ANTHROPIC_FOUNDRY_BASE_URL, and CLAUDE_CODE_USE_FOUNDRY.

Testing
- python benchmarks/arteval_bench/src/main.py -i benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_smoke_test (basic smoke test).
- benchmarks/arteval_bench/data/benchmark/run_ae_agent.sh (helper wrapper around the same command).

Made with Cursor