
🔥 FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

arXiv   GitHub   Website

FIRE-Bench is constructed through research-problem decomposition, a process that transforms high-quality empirical analysis papers into verifiable benchmark tasks. This approach balances:

  • Exploratory freedom (avoiding tasks that are too narrow to allow genuine exploration), and
  • Empirical verifiability (avoiding tasks that are too broad to benchmark).

We evaluate agent performance through claim-level analysis. Both agent conclusions C_agent and ground-truth conclusions C_gt are decomposed into atomic, verifiable claims. Overall performance is measured using Precision, Recall, and the F_1 score.
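As a rough sketch (not the repository's actual evaluator, and with hypothetical function and argument names), the scores reduce to standard precision and recall over matched atomic claims:

def claim_level_scores(n_agent_claims, n_gt_claims, n_agent_supported, n_gt_recovered):
    # Hypothetical helper: the counts come from a prior claim-matching step that
    # judges which agent claims are supported by C_gt and which ground-truth
    # claims are recovered by C_agent.
    precision = n_agent_supported / n_agent_claims if n_agent_claims else 0.0
    recall = n_gt_recovered / n_gt_claims if n_gt_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1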

Workflow and Results

#   Agent          Prec.       Recall      F1 Score
1   CC Sonnet-4    52.1±26.1   48.3±24.8   46.7±23.4
2   CX gpt-5-med   44.8±24.1   49.0±28.5   41.9±25.4
3   OH gpt-5       41.7±22.7   41.4±24.9   37.9±23.0
4   OH o4-mini     36.8±18.5   36.6±19.2   31.9±17.6

Setup

Environment

mamba create -n firebench python=3.11  # or conda
mamba activate firebench
pip install -r requirements.txt

API Keys

Create a .env file:

OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN=
ANTHROPIC_API_KEY=
# Set USE_SUBSCRIPTION=1 to use a Claude Code subscription, or 0 to use an API key.
# Note: some benchmark papers running on Claude models still require ANTHROPIC_API_KEY.
USE_SUBSCRIPTION=1

Agent Dependencies

  • Codex: npx @openai/codex@0.39.0 --version (v0.39.0 is required for timestamp logging)
  • Claude Code: follow the setup guide, or install with curl -fsSL https://claude.ai/install.sh | bash
  • OpenHands: export OPENHANDS_HOME=./.openhands; mkdir -p ./.openhands (requires Docker)

Usage

1. Benchmark Your Own Agent

You can use FIRE-Bench to evaluate any agent system beyond the built-in ones (Codex, Claude Code, OpenHands). Tasks are published as two HuggingFace datasets — pick whichever fits your evaluation goals:

Dataset                                Tasks   Status
silence-suzuki/FIRE-Bench-verified     35      Hand-curated by the FIRE-Bench team. Many tasks bundle local data files.
silence-suzuki/FIRE-Bench-unverified   153     Auto-generated end-to-end by the Paper2Bench pipeline. No human review.

End-to-end skeleton

from datasets import load_dataset
from huggingface_hub import snapshot_download
import json, os, time

ds = load_dataset("silence-suzuki/FIRE-Bench-verified", split="train")
local = snapshot_download(
    "silence-suzuki/FIRE-Bench-verified", repo_type="dataset",
)  # one-time pull of all data assets

for task in ds:
    tid = task["task_id"]
    work_dir = f"runs/{tid}"
    os.makedirs(work_dir, exist_ok=True)

    # Symlink (or copy) the task's data dir into the agent's CWD if present
    src_data = os.path.join(local, "tasks", tid, "data")
    dst_data = os.path.join(work_dir, "data")
    if os.path.isdir(src_data) and not os.path.lexists(dst_data):
        os.symlink(src_data, dst_data)

    output = my_agent.run(task["instruction"], cwd=work_dir)

    # Write the log in the expected format
    log_dir = f"log/my_agent/gpt-4o/{tid}/{time.strftime('%Y%m%d_%H%M%S')}"
    os.makedirs(log_dir, exist_ok=True)
    with open(f"{log_dir}/log.log", "w", encoding="utf-8") as f:
        f.write(f"agent_id: my_agent\ntask_id: {tid}\nllm_model: gpt-4o\n")
        f.write("=" * 40 + "\n")
        f.write(output["trajectory"])
        # json.dumps keeps the conclusion line valid JSON even if it contains quotes or newlines
        f.write("\n" + json.dumps({"result": output["final_conclusion"]}) + "\n")

Schema

Both datasets share the same row shape:

Field               Description
task_id             unique identifier (e.g. reversal_curse_rq0)
research_question   the question the agent must answer
instruction         the full prompt the agent sees (research question + resources + constraints)
instruction_gt      ground-truth procedural plan (used by the evaluator, not shown to the agent)
conclusion          ground-truth answer; what the agent's final write-up is compared against

The verified dataset adds dataset_source and has_local_data; the unverified dataset adds paper_type and task_config (typed datasets/models/constraints).
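
A quick way to see these fields is to print a single row; the snippet below assumes the field names exactly as documented above:

from datasets import load_dataset

ds = load_dataset("silence-suzuki/FIRE-Bench-verified", split="train")
row = ds[0]
# Shared fields plus the two verified-only extras
for field in ("task_id", "research_question", "dataset_source", "has_local_data"):
    print(field, "->", str(row[field])[:80])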

Step A: Load the Tasks

from datasets import load_dataset

ds = load_dataset("silence-suzuki/FIRE-Bench-verified", split="train")
# or: load_dataset("silence-suzuki/FIRE-Bench-unverified", split="train")

print(f"{len(ds)} tasks")
print(ds[0]["task_id"], "->", ds[0]["research_question"][:80])

For tasks where has_local_data is true, fetch the bundled files (JSONL, JSON, images, etc.) with snapshot_download:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    "silence-suzuki/FIRE-Bench-verified",
    repo_type="dataset",
    allow_patterns=["tasks/lost_in_the_middle/**"],
)
# files now live at <local_dir>/tasks/lost_in_the_middle/data/...

tasks/<task_id>/dataset.txt (when present) documents the upstream source if the curators left a pointer rather than bundling files.
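
To handle that case programmatically, a minimal sketch (continuing from the snapshot_download call above, so local_dir is already defined):

import os

# When present, dataset.txt points to the upstream data source
pointer = os.path.join(local_dir, "tasks", "lost_in_the_middle", "dataset.txt")
if os.path.isfile(pointer):
    with open(pointer, encoding="utf-8") as f:
        print(f.read())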

Step B: Run Your Agent

Pass task["instruction"] to your agent verbatim. The agent should:

  • Design and execute experiments using the resources listed in the instruction
  • Produce a final written conclusion summarizing its findings

Do not show the agent instruction_gt or conclusion — both are evaluation-only.
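
The end-to-end skeleton above only assumes an interface like the following; the class name and return keys are placeholders for whatever your agent actually is:

class MyAgent:
    """Hypothetical wrapper: instruction in, trajectory and final conclusion out."""

    def run(self, instruction: str, cwd: str) -> dict:
        # Design experiments, execute them inside `cwd`, analyze the results...
        trajectory = "full record of the steps the agent took"
        conclusion = "final written conclusion summarizing the findings"
        return {"trajectory": trajectory, "final_conclusion": conclusion}

my_agent = MyAgent()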

Step C: Save Output in the Expected Log Format

The evaluation pipeline reads logs from:

log/<agent_name>/<model_name>/<task_id>/<timestamp>/log.log

Each log.log must begin with three metadata lines followed by the full agent output:

agent_id: <your_agent_name>
task_id: <task_id>
llm_model: <model_name>
========================================
<full agent trajectory and output>

The evaluator extracts the agent's final conclusion from the log. It recognizes three formats — append one of these at the end of your log:

Format            How to emit
JSON (simplest)   Append a JSON line: {"result": "<final conclusion>"}
OpenHands-style   Include final_thought='<conclusion>', outputs= in the log
Codex-style       Bracket conclusions between [YYYY-MM-DDTHH:MM:SS] timestamp lines
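
For the JSON format, it is safer to serialize the final line than to build it by hand; the path and variable names below are purely illustrative:

import json, os

final_conclusion = "..."  # your agent's final write-up
log_path = "log/my_agent/gpt-4o/reversal_curse_rq0/20251016_120000/log.log"
os.makedirs(os.path.dirname(log_path), exist_ok=True)

# json.dumps keeps the line valid JSON even if the conclusion contains quotes or newlines
with open(log_path, "a", encoding="utf-8") as f:
    f.write(json.dumps({"result": final_conclusion}) + "\n")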

Step D: Evaluate

Run the evaluation pipeline on your logs:

bash run_eval.sh --agents <your_agent_name> --models <model_name> --tasks <task_id>

# Or evaluate everything at once
bash run_eval.sh --agents all --models all --tasks all

The pipeline decomposes both the agent's conclusion and the dataset's conclusion field into atomic claims, then computes Precision, Recall, and F₁ via claim-level analysis.

2. Parse Other Papers into Problem Trees

Use the tree parser to decompose a research paper (PDF) into a hierarchical research-problem tree via OpenAI:

# Single paper
bash run_tree_parser.sh --papers /path/to/paper.pdf

# Multiple papers (quote the list)
bash run_tree_parser.sh --papers "/path/to/paper1.pdf /path/to/paper2.pdf" --model gpt-4o

# Glob pattern for a whole directory
bash run_tree_parser.sh --papers "/path/to/papers/*.pdf" --output_dir benchmark/trees

Options:

  • --papers: space-separated PDF paths or a glob pattern (required)
  • --model: OpenAI model name (default: gpt-4o)
  • --output_dir: directory for output JSON trees (default: benchmark/trees)
  • --max_tokens: max output tokens (default: 16384)
  • --temperature: sampling temperature (default: 0.0)

Each paper produces a <name>_tree.json file containing the problem tree.
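
The tree files are plain JSON; their exact schema is determined by the parser prompt, so the sketch below (with a hypothetical file name) only peeks at the top level:

import json

with open("benchmark/trees/paper1_tree.json", encoding="utf-8") as f:
    tree = json.load(f)
print(list(tree) if isinstance(tree, dict) else f"{len(tree)} top-level items")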

3. Run Built-in Experiments

Edit run_experiment.sh to configure your agent/task/model combinations, then run:

bash run_experiment.sh

This iterates over all combinations of AGENT_IDS, TASK_IDS, and LLM_MODELS, calling run_agent.py for each. Results are saved to the log/ folder.

Parameters in run_experiment.sh:

  • AGENT_IDS: agents to run (e.g., codex, claude_code, openhands)
  • TASK_IDS: benchmark tasks (e.g., rational)
  • LLM_MODELS: models to use (e.g., gpt-5)

4. Evaluate Results

After experiments finish, evaluate the generated logs:

# Evaluate all agents/models/tasks
bash run_eval.sh --agents all --models all --tasks all

# Evaluate a specific run
bash run_eval.sh --agents codex --models gpt-5 --tasks rational --timestamp 20251016232701_10997

Options:

  • --agents: agent name or all
  • --models: model name or all
  • --tasks: task name or all
  • --timestamp: (optional) evaluate a specific run by timestamp
