3 changes: 3 additions & 0 deletions benchmarks/arteval_bench/.gitignore
@@ -117,3 +117,6 @@ a.out
# Build directories
build/
cmake-build-*/

# Duplicate task list copies (canonical: arteval_tasks.jsonl)
data/benchmark/arteval_tasks copy*.jsonl
62 changes: 62 additions & 0 deletions benchmarks/arteval_bench/README.md
@@ -182,5 +182,67 @@ The benchmark supports multiple AI agents:
- **Claude Code**: Anthropic's code assistant
- **Mini SWE Agent**: A compact version of the [SWE-agent](https://github.com/SWE-agent) assistant
- **OpenHands**: Open-source coding agent
- **ae_agent**: Claude Agent SDK–based agent (same logic as the standalone [artifact-agent](https://github.com/sys-intelligence/artifact-agent) repo), with full support for host/Docker, interactive mode, Skill, Sub-agent, per-task timeout, GPU, and optional container sync/commit/stop.

To add your own agent to the benchmark, see [add_agents.md](add_agents.md).

#### » ae_agent usage and options

When using the **ae_agent** (`-a ae_agent` or `-a ae-agent`), you can pass the following from the command line and/or the task JSONL.

**Command-line arguments**

| Argument | Description |
|----------|-------------|
| `-i`, `--input_file` | Input JSONL file with tasks (default: `./data/benchmark/arteval_tasks.jsonl`). |
| `-o`, `--save_path` | Directory for results (default: `./outputs/ae_<model>_ae-agent_<timestamp>`). |
| `-a`, `--agent` | Agent name; use `ae_agent` or `ae-agent` for this agent. |
| `-m`, `--model_name` | Model name (e.g. `claude-sonnet-4-5-20250929`). |
| `--interactive` | After the task completes, keep a session open so you can give more instructions (requires a TTY). In Docker mode the runner is executed in the foreground via `docker exec -it`. |
| `--enable-skill` | Enable Claude Agent SDK Skill (load from `~/.claude/skills/` and `.claude/skills/`). |
| `--enable-subagent` | Enable Claude Agent SDK Sub-agent (Task tool). |

**JSONL task fields (per line)**

| Field | Description |
|-------|-------------|
| `artifact_id` | Unique task identifier. |
| `artifact_dir` | Artifact directory name (relative to the JSONL file’s directory). |
| `artifact_readme` | Path to the README or task description file (relative to artifact root). |
| `artifact_url` | Optional. Git clone URL; used when `artifact_dir` is missing or the path does not exist. |
| `env` | `"local"` for host; Docker image name (e.g. `bastoica/ae-agent-ubuntu24.04:latest`) for Docker. |
| `evaluator` | Command to run after the agent (e.g. `python _agent_eval/main.py`). |
| `expected_score` | Expected score for this artifact (default 4). |
| `timeout` | Optional. Per-task timeout in seconds or milliseconds (see utils: values < 86400 are seconds, else milliseconds). |
| `gpu` | Optional. When `true`, pass `--gpus all` to Docker (Docker mode only). |
| `interactive` | Optional. When `true`, enable interactive mode for this task (overrides CLI default). |
| `enable_skill` | Optional. When `true`, enable Skill for this task. |
| `enable_subagent` | Optional. When `true`, enable Sub-agent for this task. |
| `keep_container` | Optional. When `false` (default for ae_agent), after the run the workspace is synced from the container to the host, the container is committed as an image, and the container is stopped. When `true`, the container is left running for inspection. |
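The `timeout` normalization described in the table (values below 86400 are treated as seconds, larger values as milliseconds) can be sketched as follows; `normalize_timeout` is a hypothetical helper for illustration, not the benchmark's actual utils function:

```python
def normalize_timeout(value):
    """Return a timeout in seconds.

    Sketch of the documented rule: values < 86400 (one day) are
    already seconds; anything larger is assumed to be milliseconds.
    """
    if value < 86400:
        return value
    return value / 1000.0

# Under this rule, a JSONL "timeout": 120 means 120 seconds,
# while "timeout": 120000 also means 120 seconds (given in ms).
```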

**Examples**

```sh
# Host mode, default options
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run1

# With interactive mode (TTY required for Docker)
python src/main.py --interactive -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run2

# Enable Skill and Sub-agent
python src/main.py --enable-skill --enable-subagent -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run3
```

**Outputs (when using ae_agent)**

Results are written under the given `save_path`:

- `result.jsonl` — One JSON object per task (task_id, status, score, agent_run_results, etc.).
- `avg_score.json` — Benchmark summary (final_score, total_tasks).
- `ae_report_<artifact_id>.md` — Per-task report (status, project path, log file, agent summary, and optional Docker image instructions).
- `summary.json` — Total and successful task counts and success rate (same format as standalone artifact-agent).
- When running via the benchmark entry, log paths and agent summary are filled from available data; standalone `python -m ae_agent.main` also produces `ae_log_<artifact_id>.log`.

**Docker + interactive**

For Docker tasks with `interactive: true` (or `--interactive`), the benchmark runs the agent in the foreground via `docker exec -it` so you can interact in the same terminal. This requires a real TTY (e.g. running `python src/main.py ...` in a terminal, not under CI or with redirected stdin). If stdin is not a TTY, the run falls back to non-interactive (background runner) and a warning is logged.
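The TTY check and fallback above can be sketched like this (hypothetical function and return values; the real runner's code may differ):

```python
def choose_run_mode(interactive_requested, stdin_is_tty):
    """Pick between foreground-interactive and background execution.

    Sketch of the documented behavior: an interactive Docker run
    (`docker exec -it`) needs a real TTY on stdin; without one, the
    benchmark falls back to the non-interactive background runner
    and logs a warning.
    """
    if interactive_requested and stdin_is_tty:
        return "interactive"  # would exec: docker exec -it <container> ...
    if interactive_requested:
        print("warning: stdin is not a TTY; falling back to non-interactive")
    return "background"
```

In practice `stdin_is_tty` would come from `sys.stdin.isatty()`.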
13 changes: 13 additions & 0 deletions benchmarks/arteval_bench/data/benchmark/ae_agent_smoke/README.md
@@ -0,0 +1,13 @@
# AE Agent Smoke Test Artifact

Minimal task for quick testing of ae_agent (host/docker + evaluation). Should complete in under a minute.

## Task

1. In this directory (the artifact root), create a file named **success.txt**.
2. The file must contain exactly the single character **1** (no newline required).
3. No other steps are required.

Example (bash): `echo -n 1 > success.txt`

After you finish, the benchmark will run an evaluation script that checks for this file and outputs a score (1 if correct, 0 otherwise).
@@ -0,0 +1,44 @@
# AE Agent smoke test

## Purpose

- Test the agent under `src/agents/ae_agent`: **host** and **docker** modes, and the **evaluation script** flow (evaluator runs after the agent and parses score).
- Task is minimal (create `success.txt` with content `1` in the artifact root); it finishes in a few minutes and avoids the long runs required by the full arteval_tasks set.

## Files

- **ae_agent_smoke/**: Minimal artifact
  - `README.md`: Task description (create success.txt with content 1)
  - `_agent_eval/check.py`: Evaluator; outputs `1` if success.txt exists and contains `1`, else `0`
- **ae_agent_smoke_test.jsonl**: Two lines
  - First line: `"env": "local"`; runs ae_agent + the evaluator on the host
  - Second line: `"env"` set to the Docker image; runs ae_agent + the evaluator in Docker

## How to run

From the **benchmarks/arteval_bench** directory:

```bash
# Set ANTHROPIC_API_KEY or ANTHROPIC_FOUNDRY_API_KEY first
python src/main.py \
  -i ./data/benchmark/ae_agent_smoke_test.jsonl \
  -a ae_agent \
  -m claude-sonnet-4-5-20250929 \
  -o ./outputs/ae_agent_smoke_$(date +%Y%m%d_%H%M%S)
```

- **Host task**: Runs the agent on the host, then runs `python3 _agent_eval/check.py` on the host to get the score.
- **Docker task**: Runs the agent in the container, then runs the evaluator in the container to get the score; the container is kept running by default for debugging.

Results are under the `-o` directory: `result.jsonl` (one JSON object per line with `score`, `status`, `test_method`, etc.) and `avg_score.json`.

## Interactive mode

The benchmark’s `src/main.py` does not read an `interactive` field from the JSONL, so the command above only covers **non-interactive** runs. To test interactive mode:

- Use ae_agent’s main entry with `--interactive`, and in the task's JSONL line set `"env": "local"` (or `"run_on_host": true`) for host mode, or `"env": "docker"` for Docker mode, for example:
```bash
cd src/agents/ae_agent
python -m ae_agent.main --interactive -i ../../../data/benchmark/ae_agent_smoke_test.jsonl -o ../../../outputs/ae_agent_smoke_int
```
- In interactive mode, after the first task completes you can keep typing instructions; type `quit` or `exit` to end.
@@ -0,0 +1,22 @@
#!/usr/bin/env python3
"""Minimal evaluator for ae_agent_smoke: output 1 if success.txt exists and contains '1', else 0.

Output must be a single digit on a line (or last line) for benchmark score parsing.
"""
import os
import sys

def main():
    root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    path = os.path.join(root, "success.txt")
    if os.path.isfile(path):
        with open(path, "r") as f:
            content = f.read().strip()
        if content == "1":
            print(1)
            sys.exit(0)
    print(0)
    sys.exit(0)

if __name__ == "__main__":
    main()
@@ -0,0 +1,2 @@
{"artifact_id": "ae_agent_smoke_host", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "env": "local"}
{"artifact_id": "ae_agent_smoke_docker", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "env": "bastoica/ae-agent-ubuntu24.04:latest", "timeout": 120000}
12 changes: 6 additions & 6 deletions benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
@@ -1,6 +1,6 @@
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_depsurf", "artifact_dir": "eurosys25_depsurf", "artifact_readme": "eurosys25_depsurf/depsurf/README.md", "artifact_url": "https://github.com/ShawnZhong/DepSurf", "evaluator": "eurosys25_depsurf/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_eet", "artifact_dir": "osdi24_eet", "artifact_readme": "osdi24_eet/eet/README.md", "artifact_url": "https://github.com/JZuming/EET", "evaluator": "osdi24_eet/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_depsurf", "artifact_dir": "eurosys25_depsurf", "artifact_readme": "eurosys25_depsurf/depsurf/README.md", "artifact_url": "https://github.com/ShawnZhong/DepSurf", "evaluator": "eurosys25_depsurf/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_eet", "artifact_dir": "osdi24_eet", "artifact_readme": "osdi24_eet/eet/README.md", "artifact_url": "https://github.com/JZuming/EET", "evaluator": "osdi24_eet/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
2 changes: 1 addition & 1 deletion benchmarks/arteval_bench/env.toml
@@ -2,7 +2,7 @@
AZURE_API_KEY = "XXX"
AZURE_API_BASE = "XXXX"
AZURE_API_VERSION = "XXX"
ANTHROPIC_API_KEY = "sk-XXXX"
ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_API_KEY"

[hardware]
use_gpu = false
21 changes: 21 additions & 0 deletions benchmarks/arteval_bench/run_ae_agent_smoke_test.sh
@@ -0,0 +1,21 @@
#!/bin/bash
# Run ae_agent smoke test under arteval_bench (host + docker, with evaluation).
# Usage: ./run_ae_agent_smoke_test.sh [model_name]
# Default model: claude-sonnet-4-5-20250929

set -e
BENCH_ROOT="$(cd "$(dirname "$0")" && pwd)"
cd "$BENCH_ROOT"
MODEL="${1:-claude-sonnet-4-5-20250929}"
OUT_DIR="./outputs/ae_agent_smoke_$(date +%Y%m%d_%H%M%S)"
echo "==> AE Agent smoke test (host + docker + evaluation)"
echo " Model: $MODEL"
echo " Output: $OUT_DIR"
echo ""
python src/main.py \
  -i ./data/benchmark/ae_agent_smoke_test.jsonl \
  -a ae_agent \
  -m "$MODEL" \
  -o "$OUT_DIR"
echo ""
echo "==> Done. Results: $OUT_DIR/result.jsonl and $OUT_DIR/avg_score.json"
62 changes: 62 additions & 0 deletions benchmarks/arteval_bench/src/agents/ae_agent/README.md
@@ -0,0 +1,62 @@
# AE Agent (ArtEval sub-agent)

This agent is the **ae_agent** for the system-intelligence-benchmark ArtEval benchmark, with the same logic as the standalone [ae-agent](https://github.com/Couen/ae-agent) repo. It runs inside the benchmark container using the Claude Agent SDK to execute artifact evaluation tasks.

## Files

- **install.sh**: Installs `claude-agent-sdk` inside the container for use by runner.py.
- **runner.sh**: Entry script; invoked as `runner.sh <model> <task_or_path>`. Uses `/agent/current_task.txt` when the benchmark passes the task via file.
- **runner.py**: Runs the task with Claude Agent SDK; supports 429 rate-limit retry; second argument can be task text or path to a task file. Artifact path in container is `/repo`.
- **run_eval.py**: Single-task orchestration: `env='local'` runs on host, otherwise runs in Docker (requires swerex/swe-rex).
- **main.py**: CLI entry for batch runs from JSONL; supports host or Docker per task.
- **utils.py**: Timeout, task/path helpers, Tee, reports, summary (used by runner, main, run_eval).
- **__init__.py**: Package marker.

## Usage from the benchmark

From the benchmark root (`benchmarks/arteval_bench/`):

```bash
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_run
```

You can also use `-a ae-agent`; it is equivalent to `ae_agent`.

The benchmark will:

1. Upload this agent to `/agent` in the container.
2. For ae_agent: write the task to `/agent/current_task.txt`, then run `runner.sh "$model" /agent/current_task.txt` (avoids shell quoting issues with large tasks).
3. Use long-running and live-log behavior (48h timeout, streamed logs, remove `_agent_eval` before run and re-upload before evaluation, container kept for debugging).
4. **Evaluation script flow** (same as claude_sdk): after the agent finishes, run the JSONL `evaluator` (test_method), e.g. `cd /repo && python _agent_eval/main.py`, parse output for `score` and write to result.
5. If set, pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY`.

**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this package's `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
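The score parsing mentioned above can be sketched as taking the last line of evaluator output that is a bare number; this is a guess at the behavior of `utils.parse_eval_score()`, not its actual implementation:

```python
def parse_eval_score_sketch(output):
    """Return the score from evaluator stdout, or None if absent.

    Assumes the convention stated for the smoke-test evaluator:
    the score appears as a single number on the last (or a late)
    line of output.
    """
    for line in reversed(output.strip().splitlines()):
        line = line.strip()
        try:
            return float(line)
        except ValueError:
            continue
    return None
```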

## Dependencies

- Python 3; `claude-agent-sdk` is installed in the container via `install.sh`.
- When running in Docker via the benchmark's `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directory's `main.py` for Docker mode standalone, you also need `swe-rex`.

## Running on host (local)

You can run tasks on the **host** from this directory (without the benchmark's Docker flow):

1. **Single or batch via main.py**
Use a JSONL where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host; others run in Docker (requires swerex).

```bash
cd benchmarks/arteval_bench/src/agents/ae_agent
python -m ae_agent.main -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/host_run
```

2. **Host mode requirements**
- Set `ANTHROPIC_API_KEY` or `ANTHROPIC_FOUNDRY_API_KEY`
- Docker installed and running (for prereq check; agent runs on host)
- `pip install claude-agent-sdk`

3. **Docker mode from this directory**
If the JSONL has `"env": "docker"` (or `run_on_host` is not set), `main.py` runs that task in Docker via `run_eval.py` (requires `swe-rex`/`swerex`).
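The per-task routing described in this section (host when `env` is `"local"` or `run_on_host` is true, Docker otherwise) could look roughly like this; the field names come from the JSONL examples above, and the function is illustrative only:

```python
import json

def route_tasks_sketch(jsonl_text):
    """Classify each JSONL task line as 'host' or 'docker'."""
    modes = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        task = json.loads(line)
        on_host = task.get("run_on_host") or task.get("env") == "local"
        modes.append("host" if on_host else "docker")
    return modes
```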

## Relation to the standalone ae-agent repo

The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark's `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.
23 changes: 23 additions & 0 deletions benchmarks/arteval_bench/src/agents/ae_agent/__init__.py
@@ -0,0 +1,23 @@
"""AE Agent - A tool for running Claude Agent SDK on artifact evaluation tasks.

Output files (under save_path):
- ae_report_<artifact_id>.md: Per-artifact report with status and agent summary
- ae_log_<artifact_id>.log: Per-artifact execution log
- result.jsonl: Per-task results (one JSON per line)
- summary.json: Overall statistics
"""

from .main import cli_main, main
from .run_eval import run_agent_then_eval, run_eval
from .runner import build_system_prompt, run_agent
from .utils import parse_eval_score

__all__ = [
    'build_system_prompt',
    'cli_main',
    'main',
    'parse_eval_score',
    'run_agent',
    'run_agent_then_eval',
    'run_eval',
]
12 changes: 12 additions & 0 deletions benchmarks/arteval_bench/src/agents/ae_agent/install.sh
@@ -0,0 +1,12 @@
#!/bin/bash
# Setup agent running environment inside Docker container.
# Ensures claude-agent-sdk is available so runner.py can import claude_agent_sdk.
set -e
if ! python3 -c "import claude_agent_sdk" 2>/dev/null; then
  echo "Installing claude-agent-sdk..."
  pip3 install claude-agent-sdk==0.1.24 || pip3 install --break-system-packages claude-agent-sdk==0.1.24 || true
  if ! python3 -c "import claude_agent_sdk"; then
    echo "WARNING: claude_agent_sdk still not importable; runner may fail."
  fi
fi
echo "Agent environment ready."