[arteval-bench] Integrate ae-agent into ArtEval benchmark#131

Merged
xuafeng merged 7 commits into sys-intelligence:main from Couen:feature/ae-agent-arteval-bench
Mar 5, 2026
Conversation


@Couen Couen commented Feb 12, 2026

Summary

  • Sync ae-agent runner logic from the standalone ae-agent repo into benchmarks/arteval_bench/src/agents/ae_agent.
  • Wire ae_agent into benchmarks/arteval_bench/src/main.py so it can be selected via -a ae_agent.
  • Update benchmarks/arteval_bench/src/run_eval_in_env.py to treat ae_agent as a long-running agent: pass tasks via /agent/current_task.txt, stream live logs, and support Anthropic Foundry env vars.
  • Add README_ae_agent.md and run_ae_agent.sh under benchmarks/arteval_bench/data/benchmark to document and simplify running ArtEval with ae_agent.

Details

  • ae_agent now shares the same long-running behavior as claude_sdk (48h timeout, _agent_eval removal before run and re-upload before evaluation, container kept running for inspection).
  • For ae_agent we avoid passing large task strings directly through the shell by uploading the task to /agent/current_task.txt.
  • Foundry environments are supported via ANTHROPIC_FOUNDRY_API_KEY, ANTHROPIC_FOUNDRY_BASE_URL, and CLAUDE_CODE_USE_FOUNDRY.
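The task-staging and env-var handling described above can be sketched as follows. This is a minimal illustration, not the actual `run_eval_in_env.py` code; `stage_task` and `foundry_env` are hypothetical helper names, and `/agent/current_task.txt` is the in-container path named in the summary.

```python
import os

# Path the agent reads the task from inside the container (per the PR summary).
TASK_FILE = "/agent/current_task.txt"

def stage_task(task_text: str, dest: str) -> str:
    """Write the (possibly large) task string to a file instead of passing it
    through the shell command line, and return the staged path."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "w", encoding="utf-8") as f:
        f.write(task_text)
    return dest

def foundry_env() -> dict:
    """Collect the Anthropic Foundry variables to forward into the container;
    only variables already set in the host environment are passed through."""
    keys = ("ANTHROPIC_FOUNDRY_API_KEY",
            "ANTHROPIC_FOUNDRY_BASE_URL",
            "CLAUDE_CODE_USE_FOUNDRY")
    return {k: os.environ[k] for k in keys if k in os.environ}
```

Staging the task as a file sidesteps shell-quoting and argument-length limits for long task descriptions.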

Testing

  • python benchmarks/arteval_bench/src/main.py -i benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_smoke_test (basic smoke test).
  • benchmarks/arteval_bench/data/benchmark/run_ae_agent.sh (helper wrapper around the same command).

Made with Cursor

@Couen Couen requested a review from xuafeng February 12, 2026 10:01
@Couen Couen force-pushed the feature/ae-agent-arteval-bench branch 3 times, most recently from f5c3bab to b99ebc4 on February 12, 2026 10:16
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "env": "bastoica/ae-agent-ubuntu24.04:latest", "gpu": false}

This is a benchmark. I think we need evaluator and expected_score, right?


xuafeng commented Feb 13, 2026

@Couen Please continue to work on this PR.

@xuafeng xuafeng changed the title Integrate ae-agent into ArtEval benchmark [arteval-bench] Integrate ae-agent into ArtEval benchmark Feb 19, 2026
@Couen Couen force-pushed the feature/ae-agent-arteval-bench branch from 30223e5 to 1f38e1c on February 25, 2026 06:55

xuafeng commented Feb 25, 2026

@Couen I think this PR looks good now. Can you rebase on the main branch and resolve the conflicts?

…rmat

- Add run_eval.py and main.py to ae_agent for running tasks on host (env=local)
  or in Docker; run_eval(env, ...) is the single entry point.
- Expand utils.py with helpers for main/run_eval (safe_task_id, env_from_item,
  resolve_project_path, Tee, write_task_report, compute_and_write_summary).
- Update ae_agent README with host mode usage and new file descriptions.
- Unify arteval_tasks.jsonl to new format: artifact_id, artifact_dir,
  artifact_readme, artifact_url, env, gpu; remove evaluator/expected_score.
- Ignore duplicate task list copies (arteval_tasks copy*.jsonl) in .gitignore.
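The unified task format described in this commit can be loaded with a small reader. The field names (`artifact_id`, `artifact_dir`, `artifact_readme`, `env`, `gpu`) come from the commit message above; the `load_tasks` helper itself is a sketch, not code from the PR.

```python
import json

# Fields the unified format requires on every task line (per the commit message).
REQUIRED = ("artifact_id", "artifact_dir", "artifact_readme", "env")

def load_tasks(jsonl_text: str) -> list[dict]:
    """Parse arteval_tasks.jsonl content: one JSON object per non-empty line,
    validating required fields and defaulting gpu to False."""
    tasks = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        missing = [k for k in REQUIRED if k not in item]
        if missing:
            raise ValueError(f"{item.get('artifact_id', '?')}: missing {missing}")
        item.setdefault("gpu", False)  # GPU is opt-in in the new format
        tasks.append(item)
    return tasks
```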
…oke test

- Add ae_agent under benchmarks/arteval_bench/src/agents/ae_agent (main, run_eval, runner, utils, runner.sh, install.sh)
- Wire benchmark main.py and run_eval_in_env.py for ae_agent: host path runs agent then evaluator, parses score; Docker path uses same flow
- Add src/utils.py re-export for get_task when running from benchmark root
- SDK utils: do not overwrite existing env vars when loading env.toml (preserve API key)
- Add minimal smoke test: ae_agent_smoke artifact, ae_agent_smoke_test.jsonl (host + docker), run_ae_agent_smoke_test.sh
- Remove interactive_runner.py (interactive handled in runner)
- Use English throughout (docs, comments); ruff-compliant; single _make_eval_result for result shape
@Couen Couen force-pushed the feature/ae-agent-arteval-bench branch from 1f38e1c to 2c3af40 on February 26, 2026 03:21
@@ -0,0 +1,2 @@
{"artifact_id": "ae_agent_smoke_host", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "run_on_host": true}
{"artifact_id": "ae_agent_smoke_docker", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "docker_env": "bastoica/ae-agent-ubuntu24.04:latest", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "run_on_host": false}
@xuafeng xuafeng Feb 26, 2026

@Couen Can we unify the input format w/ the artifact-agent? To merge docker_env and run_on_host.
Then, we should update arteval_tasks.jsonl and the related code.
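The field merge suggested here could look like the following. This is only a sketch of the migration: the convention that `env="local"` means "run on host" comes from the later commit message in this PR ("running tasks on host (env=local) or in Docker"); `normalize_task` is a hypothetical helper name.

```python
def normalize_task(item: dict) -> dict:
    """Merge the older docker_env/run_on_host fields into a single 'env' field:
    'local' means run on the host, any other value is a Docker image name."""
    out = dict(item)
    if "env" not in out:
        on_host = out.pop("run_on_host", False)
        image = out.pop("docker_env", None)
        out["env"] = "local" if on_host else image
    # Drop legacy fields even when 'env' was already present.
    out.pop("run_on_host", None)
    out.pop("docker_env", None)
    return out
```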

@xuafeng xuafeng requested review from bastoica February 26, 2026 17:35

xuafeng commented Feb 26, 2026

@bastoica Can you please take a look at the code and leave your comments? Thanks.

- main.py: Extract _is_ae_agent(agent) helper and use it for report/summary
  writing; use json.dumps(..., ensure_ascii=False) for result.jsonl
- run_eval_in_env.py: Remove unused Path import in interactive foreground
  path; reuse _get_container_id_from_runtime for long-running agent block
  instead of duplicating container ID resolution
- README: Update usage and JSONL/CLI options
@Couen Couen force-pushed the feature/ae-agent-arteval-bench branch from 535615a to 511bfa3 on March 1, 2026 14:10
ArtEval evaluator field in JSONL points to _agent_eval/main.py. The code was
running it as a bare shell command, causing "No such file or directory" (127).
Now run with "cd /repo && python <evaluator_path>" in Docker and
"cd <project_path> && python <evaluator_path>" on host when path ends with .py.

- run_eval_in_env.py: build eval_cmd with cd /repo and python for .py paths
- ae_agent/run_eval.py: same for run_agent_then_eval (host path)

bastoica commented Mar 1, 2026

@Couen @xuafeng I added my comments and review as an issue. @Couen: We think we should hold off merging this version into the repository, for now. Please check our offline discussion.

@xuafeng xuafeng merged commit 173f695 into sys-intelligence:main Mar 5, 2026
4 checks passed
