3 changes: 3 additions & 0 deletions benchmarks/arteval_bench/.gitignore
@@ -117,3 +117,6 @@ a.out
# Build directories
build/
cmake-build-*/

# Duplicate task list copies (canonical: arteval_tasks.jsonl)
data/benchmark/arteval_tasks copy*.jsonl
62 changes: 62 additions & 0 deletions benchmarks/arteval_bench/README.md
@@ -182,5 +182,67 @@ The benchmark supports multiple AI agents:
- **Claude Code**: Anthropic's code assistant
- **Mini SWE Agent**: A compact version of the [SWE-agent](https://github.com/SWE-agent) assistant
- **OpenHands**: Open-source coding agent
- **ae_agent**: Claude Agent SDK–based agent (same logic as the standalone [artifact-agent](https://github.com/sys-intelligence/artifact-agent) repo), with full support for host/Docker, interactive mode, Skill, Sub-agent, per-task timeout, GPU, and optional container sync/commit/stop.

To add your own agent to the benchmark, see [add_agents.md](add_agents.md).

#### » ae_agent usage and options

When using the **ae_agent** (`-a ae_agent` or `-a ae-agent`), you can pass the following from the command line and/or the task JSONL.

**Command-line arguments**

| Argument | Description |
|----------|-------------|
| `-i`, `--input_file` | Input JSONL file with tasks (default: `./data/benchmark/arteval_tasks.jsonl`). |
| `-o`, `--save_path` | Directory for results (default: `./outputs/ae_<model>_ae-agent_<timestamp>`). |
| `-a`, `--agent` | Agent name; use `ae_agent` or `ae-agent` for this agent. |
| `-m`, `--model_name` | Model name (e.g. `claude-sonnet-4-5-20250929`). |
| `--interactive` | After the task completes, keep a session open so you can give more instructions (requires a TTY). In Docker mode the runner is executed in the foreground via `docker exec -it`. |
| `--enable-skill` | Enable Claude Agent SDK Skill (load from `~/.claude/skills/` and `.claude/skills/`). |
| `--enable-subagent` | Enable Claude Agent SDK Sub-agent (Task tool). |

**JSONL task fields (per line)**

| Field | Description |
|-------|-------------|
| `artifact_id` | Unique task identifier. |
| `artifact_dir` | Artifact directory name (relative to the JSONL file’s directory). |
| `artifact_readme` | Path to the README or task description file (relative to artifact root). |
| `artifact_url` | Optional. Git clone URL; used when `artifact_dir` is missing or the path does not exist. |
| `env` | `"local"` for host; Docker image name (e.g. `bastoica/ae-agent-ubuntu24.04:latest`) for Docker. |
| `evaluator` | Command to run after the agent (e.g. `python _agent_eval/main.py`). |
| `expected_score` | Expected score for this artifact (default 4). |
| `timeout` | Optional. Per-task timeout in seconds or milliseconds (see utils: values < 86400 are seconds, else milliseconds). |
| `gpu` | Optional. When `true`, pass `--gpus all` to Docker (Docker mode only). |
| `interactive` | Optional. When `true`, enable interactive mode for this task (overrides CLI default). |
| `enable_skill` | Optional. When `true`, enable Skill for this task. |
| `enable_subagent` | Optional. When `true`, enable Sub-agent for this task. |
| `keep_container` | Optional. When `false` (default for ae_agent), after the run the workspace is synced from the container to the host, the container is committed as an image, and the container is stopped. When `true`, the container is left running for inspection. |
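The `timeout` normalization described in the table (values below 86400 are treated as seconds, larger values as milliseconds) can be sketched as follows; `normalize_timeout` is a hypothetical helper for illustration, not the benchmark's actual utils function:

```python
def normalize_timeout(value):
    """Return a timeout in seconds.

    Sketch of the documented rule: values < 86400 (one day) are
    already seconds; anything larger is assumed to be milliseconds.
    """
    if value < 86400:
        return value
    return value / 1000.0

# Under this rule, a JSONL "timeout": 120 means 120 seconds,
# while "timeout": 120000 also means 120 seconds (given in ms).
```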

**Examples**

```sh
# Host mode, default options
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run1

# With interactive mode (TTY required for Docker)
python src/main.py --interactive -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run2

# Enable Skill and Sub-agent
python src/main.py --enable-skill --enable-subagent -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run3
```

**Outputs (when using ae_agent)**

Results are written under the given `save_path`:

- `result.jsonl` — One JSON object per task (task_id, status, score, agent_run_results, etc.).
- `avg_score.json` — Benchmark summary (final_score, total_tasks).
- `ae_report_<artifact_id>.md` — Per-task report (status, project path, log file, agent summary, and optional Docker image instructions).
- `summary.json` — Total and successful task counts and success rate (same format as standalone artifact-agent).
- When running via the benchmark entry, log paths and agent summary are filled from available data; standalone `python -m ae_agent.main` also produces `ae_log_<artifact_id>.log`.

**Docker + interactive**

For Docker tasks with `interactive: true` (or `--interactive`), the benchmark runs the agent in the foreground via `docker exec -it` so you can interact in the same terminal. This requires a real TTY (e.g. running `python src/main.py ...` in a terminal, not under CI or with redirected stdin). If stdin is not a TTY, the run falls back to non-interactive (background runner) and a warning is logged.
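The TTY check and fallback above can be sketched like this (hypothetical function and return values; the real runner's code may differ):

```python
def choose_run_mode(interactive_requested, stdin_is_tty):
    """Pick between foreground-interactive and background execution.

    Sketch of the documented behavior: an interactive Docker run
    (`docker exec -it`) needs a real TTY on stdin; without one, the
    benchmark falls back to the non-interactive background runner
    and logs a warning.
    """
    if interactive_requested and stdin_is_tty:
        return "interactive"  # would exec: docker exec -it <container> ...
    if interactive_requested:
        print("warning: stdin is not a TTY; falling back to non-interactive")
    return "background"
```

In practice `stdin_is_tty` would come from `sys.stdin.isatty()`.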
13 changes: 13 additions & 0 deletions benchmarks/arteval_bench/data/benchmark/ae_agent_smoke/README.md
@@ -0,0 +1,13 @@
# AE Agent Smoke Test Artifact

Minimal task for quick testing of ae_agent (host/docker + evaluation). Should complete in under a minute.

## Task

1. In this directory (the artifact root), create a file named **success.txt**.
2. The file must contain exactly the single character **1** (no newline required).
3. No other steps are required.

Example (bash): `echo -n 1 > success.txt`

After you finish, the benchmark will run an evaluation script that checks for this file and outputs a score (1 if correct, 0 otherwise).
@@ -0,0 +1,44 @@
# AE Agent smoke test

## Purpose

- Test the agent under `src/agents/ae_agent`: **host** and **docker** modes, and the **evaluation script** flow (evaluator runs after the agent and parses score).
- Task is minimal (create `success.txt` with content `1` in the artifact root); it finishes in a few minutes and avoids the long runs required by the full arteval_tasks set.

## Files

- **ae_agent_smoke/**: Minimal artifact
  - `README.md`: Task description (create success.txt with content 1)
  - `_agent_eval/check.py`: Evaluator; outputs `1` if success.txt exists and contains `1`, else `0`
- **ae_agent_smoke_test.jsonl**: Two lines
  - First line: `"env": "local"`; runs ae_agent + the evaluator on the host
  - Second line: `"env"` set to the Docker image; runs ae_agent + the evaluator in Docker

## How to run

From the **benchmarks/arteval_bench** directory:

```bash
# Set ANTHROPIC_API_KEY or ANTHROPIC_FOUNDRY_API_KEY first
python src/main.py \
  -i ./data/benchmark/ae_agent_smoke_test.jsonl \
  -a ae_agent \
  -m claude-sonnet-4-5-20250929 \
  -o ./outputs/ae_agent_smoke_$(date +%Y%m%d_%H%M%S)
```

- **Host task**: Runs the agent on the host, then runs `python3 _agent_eval/check.py` on the host to get the score.
- **Docker task**: Runs the agent in the container, then runs the evaluator in the container to get the score; the container is kept running by default for debugging.

Results are under the `-o` directory: `result.jsonl` (one JSON object per line with `score`, `status`, `test_method`, etc.) and `avg_score.json`.

## Interactive mode

The benchmark’s `src/main.py` does not read an `interactive` field from the JSONL, so the command above only covers **non-interactive** runs. To test interactive mode:

- Use ae_agent’s main entry with `--interactive`, and in the task's JSONL line set `"env": "local"` (or `"run_on_host": true`) for host mode, or `"env": "docker"` for Docker mode, for example:
```bash
cd src/agents/ae_agent
python -m ae_agent.main --interactive -i ../../../data/benchmark/ae_agent_smoke_test.jsonl -o ../../../outputs/ae_agent_smoke_int
```
- In interactive mode, after the first task completes you can keep typing instructions; type `quit` or `exit` to end.
@@ -0,0 +1,22 @@
#!/usr/bin/env python3
"""Minimal evaluator for ae_agent_smoke: output 1 if success.txt exists and contains '1', else 0.

Output must be a single digit on a line (or last line) for benchmark score parsing.
"""
import os
import sys

def main():
    root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    path = os.path.join(root, "success.txt")
    if os.path.isfile(path):
        with open(path, "r") as f:
            content = f.read().strip()
        if content == "1":
            print(1)
            sys.exit(0)
    print(0)
    sys.exit(0)

if __name__ == "__main__":
    main()
@@ -0,0 +1,2 @@
{"artifact_id": "ae_agent_smoke_host", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "env": "local"}
{"artifact_id": "ae_agent_smoke_docker", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "env": "bastoica/ae-agent-ubuntu24.04:latest", "timeout": 120000}
12 changes: 6 additions & 6 deletions benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl
@@ -1,6 +1,6 @@
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_depsurf", "artifact_dir": "eurosys25_depsurf", "artifact_readme": "eurosys25_depsurf/depsurf/README.md", "artifact_url": "https://github.com/ShawnZhong/DepSurf", "evaluator": "eurosys25_depsurf/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_eet", "artifact_dir": "osdi24_eet", "artifact_readme": "osdi24_eet/eet/README.md", "artifact_url": "https://github.com/JZuming/EET", "evaluator": "osdi24_eet/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "evaluator": "sosp24_wasabi/wasabi/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_depsurf", "artifact_dir": "eurosys25_depsurf", "artifact_readme": "eurosys25_depsurf/depsurf/README.md", "artifact_url": "https://github.com/ShawnZhong/DepSurf", "evaluator": "eurosys25_depsurf/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "osdi24_eet", "artifact_dir": "osdi24_eet", "artifact_readme": "osdi24_eet/eet/README.md", "artifact_url": "https://github.com/JZuming/EET", "evaluator": "osdi24_eet/_agent_eval/main.py", "expected_score": 4, "env": "bastoica/ae-agent-ubuntu24.04:latest"}
2 changes: 1 addition & 1 deletion benchmarks/arteval_bench/env.toml
@@ -2,7 +2,7 @@
AZURE_API_KEY = "XXX"
AZURE_API_BASE = "XXXX"
AZURE_API_VERSION = "XXX"
ANTHROPIC_API_KEY = "sk-XXXX"
ANTHROPIC_API_KEY = "YOUR_ANTHROPIC_API_KEY"

[hardware]
use_gpu = false
21 changes: 21 additions & 0 deletions benchmarks/arteval_bench/run_ae_agent_smoke_test.sh
@@ -0,0 +1,21 @@
#!/bin/bash
# Run ae_agent smoke test under arteval_bench (host + docker, with evaluation).
# Usage: ./run_ae_agent_smoke_test.sh [model_name]
# Default model: claude-sonnet-4-5-20250929

set -e
BENCH_ROOT="$(cd "$(dirname "$0")" && pwd)"
cd "$BENCH_ROOT"
MODEL="${1:-claude-sonnet-4-5-20250929}"
OUT_DIR="./outputs/ae_agent_smoke_$(date +%Y%m%d_%H%M%S)"
echo "==> AE Agent smoke test (host + docker + evaluation)"
echo " Model: $MODEL"
echo " Output: $OUT_DIR"
echo ""
python src/main.py \
  -i ./data/benchmark/ae_agent_smoke_test.jsonl \
  -a ae_agent \
  -m "$MODEL" \
  -o "$OUT_DIR"
echo ""
echo "==> Done. Results: $OUT_DIR/result.jsonl and $OUT_DIR/avg_score.json"
62 changes: 62 additions & 0 deletions benchmarks/arteval_bench/src/agents/ae_agent/README.md
@@ -0,0 +1,62 @@
# AE Agent (ArtEval sub-agent)

This agent is the **ae_agent** for the system-intelligence-benchmark ArtEval benchmark, with the same logic as the standalone [ae-agent](https://github.com/Couen/ae-agent) repo. It runs inside the benchmark container using the Claude Agent SDK to execute artifact evaluation tasks.

## Files

- **install.sh**: Installs `claude-agent-sdk` inside the container for use by runner.py.
- **runner.sh**: Entry script; invoked as `runner.sh <model> <task_or_path>`. Uses `/agent/current_task.txt` when the benchmark passes the task via file.
- **runner.py**: Runs the task with Claude Agent SDK; supports 429 rate-limit retry; second argument can be task text or path to a task file. Artifact path in container is `/repo`.
- **run_eval.py**: Single-task orchestration: `env='local'` runs on host, otherwise runs in Docker (requires swerex/swe-rex).
- **main.py**: CLI entry for batch runs from JSONL; supports host or Docker per task.
- **utils.py**: Timeout, task/path helpers, Tee, reports, summary (used by runner, main, run_eval).
- **__init__.py**: Package marker.

## Usage from the benchmark

From the benchmark root (`benchmarks/arteval_bench/`):

```bash
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_run
```

You can also use `-a ae-agent`; it is equivalent to `ae_agent`.

The benchmark will:

1. Upload this agent to `/agent` in the container.
2. For ae_agent: write the task to `/agent/current_task.txt`, then run `runner.sh "$model" /agent/current_task.txt` (avoids shell quoting issues with large tasks).
3. Use long-running and live-log behavior (48h timeout, streamed logs, remove `_agent_eval` before run and re-upload before evaluation, container kept for debugging).
4. **Evaluation script flow** (same as claude_sdk): after the agent finishes, run the JSONL `evaluator` (test_method), e.g. `cd /repo && python _agent_eval/main.py`, parse output for `score` and write to result.
5. If set, pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY`.

**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this package's `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
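The score parsing mentioned above can be sketched as taking the last line of evaluator output that is a bare number; this is a guess at the behavior of `utils.parse_eval_score()`, not its actual implementation:

```python
def parse_eval_score_sketch(output):
    """Return the score from evaluator stdout, or None if absent.

    Assumes the convention stated for the smoke-test evaluator:
    the score appears as a single number on the last (or a late)
    line of output.
    """
    for line in reversed(output.strip().splitlines()):
        line = line.strip()
        try:
            return float(line)
        except ValueError:
            continue
    return None
```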

## Dependencies

- Python 3; `claude-agent-sdk` is installed in the container via `install.sh`.
- When running in Docker via the benchmark's `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directory's `main.py` for Docker mode standalone, you also need `swe-rex`.

## Running on host (local)

You can run tasks on the **host** from this directory (without the benchmark's Docker flow):

1. **Single or batch via main.py**
Use a JSONL where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host; others run in Docker (requires swerex).

```bash
cd benchmarks/arteval_bench/src/agents/ae_agent
python -m ae_agent.main -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/host_run
```

2. **Host mode requirements**
- Set `ANTHROPIC_API_KEY` or `ANTHROPIC_FOUNDRY_API_KEY`
- Docker installed and running (for prereq check; agent runs on host)
- `pip install claude-agent-sdk`

3. **Docker mode from this directory**
If the JSONL has `"env": "docker"` (or `run_on_host` is not set), `main.py` runs that task in Docker via `run_eval.py` (requires `swe-rex`/`swerex`).
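The per-task routing described in this section (host when `env` is `"local"` or `run_on_host` is true, Docker otherwise) could look roughly like this; the field names come from the JSONL examples above, and the function is illustrative only:

```python
import json

def route_tasks_sketch(jsonl_text):
    """Classify each JSONL task line as 'host' or 'docker'."""
    modes = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        task = json.loads(line)
        on_host = task.get("run_on_host") or task.get("env") == "local"
        modes.append("host" if on_host else "docker")
    return modes
```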

## Relation to the standalone ae-agent repo

The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark's `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.
23 changes: 23 additions & 0 deletions benchmarks/arteval_bench/src/agents/ae_agent/__init__.py
@@ -0,0 +1,23 @@
"""AE Agent - A tool for running Claude Agent SDK on artifact evaluation tasks.

Output files (under save_path):
- ae_report_<artifact_id>.md: Per-artifact report with status and agent summary
- ae_log_<artifact_id>.log: Per-artifact execution log
- result.jsonl: Per-task results (one JSON per line)
- summary.json: Overall statistics
"""

from .main import cli_main, main
from .run_eval import run_agent_then_eval, run_eval
from .runner import build_system_prompt, run_agent
from .utils import parse_eval_score

__all__ = [
    'build_system_prompt',
    'cli_main',
    'main',
    'parse_eval_score',
    'run_agent',
    'run_agent_then_eval',
    'run_eval',
]
12 changes: 12 additions & 0 deletions benchmarks/arteval_bench/src/agents/ae_agent/install.sh
@@ -0,0 +1,12 @@
#!/bin/bash
# Setup agent running environment inside Docker container.
# Ensures claude-agent-sdk is available so runner.py can import claude_agent_sdk.
set -e
if ! python3 -c "import claude_agent_sdk" 2>/dev/null; then
  echo "Installing claude-agent-sdk..."
  pip3 install claude-agent-sdk==0.1.24 || pip3 install --break-system-packages claude-agent-sdk==0.1.24 || true
  if ! python3 -c "import claude_agent_sdk"; then
    echo "WARNING: claude_agent_sdk still not importable; runner may fail."
  fi
fi
echo "Agent environment ready."