FIRE-Bench is constructed through research-problem decomposition, a process that transforms high-quality empirical analysis papers into verifiable benchmark tasks. This approach balances:
- Exploratory freedom (avoiding tasks that are too narrow to allow genuine exploration), and
- Empirical verifiability (avoiding tasks that are too broad to benchmark).
We evaluate agent performance through claim-level analysis. Both agent conclusions C_agent and ground-truth conclusions C_gt are decomposed into atomic, verifiable claims. Overall performance is measured using Precision, Recall, and the F₁ score.
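With claims as the unit of comparison, these metrics take their standard form (a sketch of the setup; the exact criterion for when a claim counts as "supported", e.g. LLM-judged entailment, is an implementation detail of the evaluator):

$$
\text{Precision} = \frac{|\{c \in C_{\text{agent}} : c \text{ supported by } C_{\text{gt}}\}|}{|C_{\text{agent}}|}, \qquad
\text{Recall} = \frac{|\{c \in C_{\text{gt}} : c \text{ supported by } C_{\text{agent}}\}|}{|C_{\text{gt}}|}
$$

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$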
```bash
mamba create -n firebench python=3.11  # or conda
mamba activate firebench
pip install -r requirements.txt
```

Create a `.env` file:
```
OPENAI_API_KEY=
GOOGLE_API_KEY=
HF_TOKEN=
ANTHROPIC_API_KEY=
USE_SUBSCRIPTION=1
```

Set `USE_SUBSCRIPTION=1` to use a Claude Code subscription, or `0` to use an API key; note that some benchmark papers running on Claude models still require `ANTHROPIC_API_KEY`.
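As a quick sanity check that the keys are visible to Python (a minimal sketch; it assumes the `python-dotenv` package is available, which `requirements.txt` may or may not provide):

```python
# Sketch: verify the .env keys load (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "HF_TOKEN", "ANTHROPIC_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```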
- Codex (v0.39.0 required for timestamp logging):
  ```bash
  npx @openai/codex@0.39.0 --version
  ```
- Claude Code: Setup guide, or
  ```bash
  curl -fsSL https://claude.ai/install.sh | bash
  ```
- OpenHands (requires Docker):
  ```bash
  export OPENHANDS_HOME=./.openhands; mkdir -p ./.openhands
  ```
You can use FIRE-Bench to evaluate any agent system beyond the built-in ones (Codex, Claude Code, OpenHands). Tasks are published as two HuggingFace datasets — pick whichever fits your evaluation goals:
| Dataset | Tasks | Status |
|---|---|---|
| `silence-suzuki/FIRE-Bench-verified` | 35 | Hand-curated by the FIRE-Bench team. Many tasks bundle local data files. |
| `silence-suzuki/FIRE-Bench-unverified` | 153 | Auto-generated end-to-end by the Paper2Bench pipeline. No human review. |
```python
from datasets import load_dataset
from huggingface_hub import snapshot_download
import json, os, time

ds = load_dataset("silence-suzuki/FIRE-Bench-verified", split="train")
local = snapshot_download(
    "silence-suzuki/FIRE-Bench-verified", repo_type="dataset",
)  # one-time pull of all data assets

for task in ds:
    tid = task["task_id"]
    work_dir = f"runs/{tid}"
    os.makedirs(work_dir, exist_ok=True)

    # Symlink (or copy) the task's data dir into the agent's CWD if present
    src_data = os.path.join(local, "tasks", tid, "data")
    dst_data = os.path.join(work_dir, "data")
    if os.path.isdir(src_data) and not os.path.exists(dst_data):
        os.symlink(src_data, dst_data)

    output = my_agent.run(task["instruction"], cwd=work_dir)

    # Write the log in the expected format
    log_dir = f"log/my_agent/gpt-4o/{tid}/{time.strftime('%Y%m%d_%H%M%S')}"
    os.makedirs(log_dir, exist_ok=True)
    with open(f"{log_dir}/log.log", "w", encoding="utf-8") as f:
        f.write(f"agent_id: my_agent\ntask_id: {tid}\nllm_model: gpt-4o\n")
        f.write("=" * 40 + "\n")
        f.write(output["trajectory"])
        # json.dumps escapes quotes/newlines inside the conclusion
        f.write("\n" + json.dumps({"result": output["final_conclusion"]}) + "\n")
```

Both datasets share the same row shape:
| Field | Description |
|---|---|
| `task_id` | unique identifier (e.g. `reversal_curse_rq0`) |
| `research_question` | the question the agent must answer |
| `instruction` | the full prompt the agent sees (research question + resources + constraints) |
| `instruction_gt` | ground-truth procedural plan (used by the evaluator, not shown to the agent) |
| `conclusion` | ground-truth answer; what the agent's final write-up is compared against |
The verified dataset adds `dataset_source` and `has_local_data`; the unverified dataset adds `paper_type` and `task_config` (typed datasets/models/constraints).
```python
from datasets import load_dataset

ds = load_dataset("silence-suzuki/FIRE-Bench-verified", split="train")
# or: load_dataset("silence-suzuki/FIRE-Bench-unverified", split="train")
print(f"{len(ds)} tasks")
print(ds[0]["task_id"], "->", ds[0]["research_question"][:80])
```

For tasks where `has_local_data` is true, fetch the bundled files (JSONL, JSON, images, etc.) with `snapshot_download`:
```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    "silence-suzuki/FIRE-Bench-verified",
    repo_type="dataset",
    allow_patterns=["tasks/lost_in_the_middle/**"],
)
# files now live at <local_dir>/tasks/lost_in_the_middle/data/...
```

`tasks/<task_id>/dataset.txt` (when present) documents the upstream source if the curators left a pointer rather than bundling files.
Pass `task["instruction"]` to your agent verbatim. The agent should:
- Design and execute experiments using the resources listed in the instruction
- Produce a final written conclusion summarizing its findings
Do not show the agent `instruction_gt` or `conclusion`; both are evaluation-only.
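If your harness passes whole dataset rows around, one way to enforce this (a sketch, reusing `task`, `my_agent`, and `work_dir` from the integration example above):

```python
# Sketch: strip evaluation-only fields before handing the task to the agent
EVAL_ONLY = {"instruction_gt", "conclusion"}
agent_view = {k: v for k, v in task.items() if k not in EVAL_ONLY}
output = my_agent.run(agent_view["instruction"], cwd=work_dir)
```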
The evaluation pipeline reads logs from:

```
log/<agent_name>/<model_name>/<task_id>/<timestamp>/log.log
```

Each `log.log` must begin with three metadata lines followed by the full agent output:

```
agent_id: <your_agent_name>
task_id: <task_id>
llm_model: <model_name>
========================================
<full agent trajectory and output>
```
The evaluator extracts the agent's final conclusion from the log. It recognizes three formats — append one of these at the end of your log:
| Format | How to emit |
|---|---|
| JSON (simplest) | Append a JSON line: {"result": "<final conclusion>"} |
| OpenHands-style | Include final_thought='<conclusion>', outputs= in the log |
| Codex-style | Bracket conclusions between [YYYY-MM-DDTHH:MM:SS] timestamp lines |
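For the JSON format, the last line of `log.log` would look like the following (the conclusion text here is purely illustrative):

```json
{"result": "Condition A outperforms condition B on the held-out split."}
```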
Run the evaluation pipeline on your logs:
```bash
bash run_eval.sh --agents <your_agent_name> --models <model_name> --tasks <task_id>

# Or evaluate everything at once
bash run_eval.sh --agents all --models all --tasks all
```

The pipeline decomposes both the agent's conclusion and the dataset's `conclusion` field into atomic claims, then computes Precision, Recall, and F₁ via claim-level analysis.
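For intuition, a conclusion such as "Method X improves accuracy on dataset A but not on dataset B" would decompose into two atomic claims ("X improves accuracy on A"; "X does not improve accuracy on B"), each checked independently against the ground truth. This is a hypothetical example; the actual decomposition is performed by the evaluator.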
Use the tree parser to decompose a research paper (PDF) into a hierarchical research-problem tree via OpenAI:
```bash
# Single paper
bash run_tree_parser.sh --papers /path/to/paper.pdf

# Multiple papers (quote the list)
bash run_tree_parser.sh --papers "/path/to/paper1.pdf /path/to/paper2.pdf" --model gpt-4o

# Glob pattern for a whole directory
bash run_tree_parser.sh --papers "/path/to/papers/*.pdf" --output_dir benchmark/trees
```

Options:
- `--papers`: space-separated PDF paths or a glob pattern (required)
- `--model`: OpenAI model name (default: `gpt-4o`)
- `--output_dir`: directory for output JSON trees (default: `benchmark/trees`)
- `--max_tokens`: max output tokens (default: `16384`)
- `--temperature`: sampling temperature (default: `0.0`)
Each paper produces a `<name>_tree.json` file containing the problem tree.
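To eyeball a generated tree (a minimal sketch; `paper1_tree.json` is a hypothetical file name following the `<name>_tree.json` convention, and the tree's internal schema is whatever the parser emits):

```python
# Sketch: inspect a generated problem tree
import json

with open("benchmark/trees/paper1_tree.json", encoding="utf-8") as f:
    tree = json.load(f)
print(json.dumps(tree, indent=2)[:500])  # show the start of the structure
```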
Edit `run_experiment.sh` to configure your agent/task/model combinations, then run:

```bash
bash run_experiment.sh
```

This iterates over all combinations of `AGENT_IDS`, `TASK_IDS`, and `LLM_MODELS`, calling `run_agent.py` for each. Results are saved to the `log/` folder.
Parameters in `run_experiment.sh`:

- `AGENT_IDS`: agents to run (e.g., `codex`, `claude_code`, `openhands`)
- `TASK_IDS`: benchmark tasks (e.g., `rational`)
- `LLM_MODELS`: models to use (e.g., `gpt-5`)
After experiments finish, evaluate the generated logs:

```bash
# Evaluate all agents/models/tasks
bash run_eval.sh --agents all --models all --tasks all

# Evaluate a specific run
bash run_eval.sh --agents codex --models gpt-5 --tasks rational --timestamp 20251016232701_10997
```

Options:

- `--agents`: agent name or `all`
- `--models`: model name or `all`
- `--tasks`: task name or `all`
- `--timestamp`: (optional) evaluate a specific run by timestamp