Welcome to the official repository for the LLM Benchmarking Project, led by the Center for Open Science (COS). This project provides a modular framework to evaluate the capabilities of large language model (LLM) agents across key components of the scientific research lifecycle, including replication, peer review, and research design.
- Information Extraction: Automated extraction of structured metadata from PDFs and data files.
- Research Design: LLM-driven generation of replication plans and analysis scripts.
- Execution & Sandboxing: Secure execution of generated code within Docker environments.
- Scientific Interpretation: Synthesis of statistical results into human-readable research reports.
- Automated Validation: An LLM-as-judge system that benchmarks agent performance against expert-annotated ground truths.
This work builds on the conceptual structure outlined in our Open Philanthropy grant, emphasizing real-world relevance, task diversity, and community participation.
The project relies on the following core libraries:
- LLM Orchestration: `openai`, `python-dotenv`
- Data Science: `pandas`, `numpy`, `pyreadr`
- Document Parsing: `pymupdf` (fitz), `python-docx`
- Infrastructure: `docker`
- Testing: `pytest`, `pytest-cov`
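For orientation, the sketch below shows roughly how the document-parsing and data-loading libraries are typically used. The file paths are illustrative only; the framework's actual extraction logic lives in `info_extractor/`.

```python
# Illustrative use of pymupdf and pyreadr; paths are hypothetical examples.
import fitz  # pymupdf
import pyreadr

# Extract raw text from a study PDF (illustrative path)
doc = fitz.open("data/original/1/paper.pdf")
text = "\n".join(page.get_text() for page in doc)
doc.close()

# Load an R data file shipped with the original study (illustrative path)
result = pyreadr.read_r("data/original/1/study_data.rds")
df = result[None]  # .rds files are returned under the None key
print(df.head())
```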
- Clone the repository:

  ```bash
  git clone https://github.com/CenterForOpenScience/llm-benchmarking.git
  cd llm-benchmarking
  ```

- Environment Setup

  The project uses a `Makefile` to streamline setting up and executing the different components of our framework. Make sure you have Python 3.9+ and Docker installed.

  ```bash
  # Install all required dependencies
  make install-deps

  # Verify your environment and dependencies
  make check-deps
  make check-docker
  ```

- API Configuration

  Create a `.env` file in the root directory:

  ```
  OPENAI_API_KEY=your_api_key_here
  ```
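If you want to wire the key into your own scripts, a minimal sketch of loading it via `python-dotenv` is shown below. The framework's agents handle key loading internally; this is only a quick way to confirm your `.env` file is picked up.

```python
# Sketch: load OPENAI_API_KEY from .env and confirm the OpenAI client can use it.
# Not part of the framework; shown for reference only.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY from the .env file in the project root

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Reply with OK if this key works."}],
)
print(response.choices[0].message.content)
```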
You can run the full end-to-end pipeline or individual stages using `make`.
To run the full flow (Extract → Design → Execute → Interpret) for a specific study:
```bash
make pipeline-easy STUDY=./data/original/1 MODEL=gpt-4o
```

| Module / Stage | Command | Description |
|---|---|---|
| Info Extraction | `make extract-stage1` | Extracts structured metadata from the original study into `post_registration.json`. |
| Web Search | `make web-search` | Performs an open-ended web search to identify data resources needed to replicate a claim, given the original paper. |
| Research Design | `make design-easy` | Generates the replication design and analysis plan from the extracted info into `replication_info.json`. |
| Execution | `make execute-easy` | Runs the generated Python analysis script inside a secure Docker container. |
| Interpretation | `make interpret-easy` | Analyzes execution results to produce a final scientific interpretation report. |
| Validation: Extract | `make evaluate-extract` | Benchmarks the extraction stage against human-annotated ground truth. |
| Validation: Design | `make evaluate-design` | Evaluates the quality and validity of the LLM-generated research design. |
| Validation: Execute | `make evaluate-execute` | Compares the statistical output of the executed code against expected results. |
| Validation: Summary | `make evaluate-summary` | Generates a comprehensive evaluation report across all pipeline stages. |
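If you want to run the full pipeline over several studies in one go, a small driver script is one option. This is not part of the framework; the loop below simply shells out to the same `make` target shown above, assuming study folders live under `./data/original/`.

```python
# Batch-run sketch: invoke `make pipeline-easy` for each study directory.
# Not part of the framework; adjust the path and MODEL as needed.
import subprocess
from pathlib import Path

for study_dir in sorted(Path("data/original").iterdir()):
    if not study_dir.is_dir():
        continue
    subprocess.run(
        ["make", "pipeline-easy", f"STUDY=./{study_dir}", "MODEL=gpt-4o"],
        check=True,  # stop on the first failing study
    )
```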
The validator compares agent outputs against human-annotated ground truths using specific research rubrics.
- Evaluate All Stages:

  ```bash
  make evaluate-pipeline-easy STUDY=./data/original/1
  ```

- Specific Evaluations:
  - `make evaluate-extract`: Validates JSON metadata accuracy.
  - `make evaluate-design`: Checks research plan validity.
  - `make evaluate-execute`: Validates statistical outputs.
  - `make evaluate-summary`: Generates an overall performance report.
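Conceptually, the LLM-as-judge comparison can be pictured as in the sketch below. The file paths and prompt are illustrative; the actual validator in `validator/` applies the project's research rubrics and prompt templates rather than this ad-hoc prompt.

```python
# Illustrative LLM-as-judge sketch; not the project's actual validator logic.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# Hypothetical paths: agent output vs. expert-annotated ground truth
agent_output = json.loads(Path("post_registration.json").read_text())
ground_truth = json.loads(Path("ground_truth_post_registration.json").read_text())

prompt = (
    "Compare the agent-extracted metadata against the expert annotation. "
    "Give a score from 0 to 1 and note any discrepancies.\n\n"
    f"Agent output:\n{json.dumps(agent_output, indent=2)}\n\n"
    f"Ground truth:\n{json.dumps(ground_truth, indent=2)}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```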
```
llm-benchmarking/
├── core/                 # Central logic: autonomous agent, tools, prompts, and actions
├── info_extractor/       # PDF parsing and metadata extraction
├── generator/            # Research design and code generation
├── interpreter/          # Result analysis and report generation
├── validator/            # CLI tools for LLM-based evaluation
├── templates/            # JSON schemas and prompt templates
├── data/                 # Benchmark datasets and ground truth
├── Makefile              # Project automation
└── requirements-dev.txt
```
All content in this repository is shared under the Apache License 2.0.
The team comprises core members from COS and external partners from Old Dominion University, Pennsylvania State University, and the University of Notre Dame, specializing in:
- Agent development
- Benchmark design
- Open science research
This project is funded by Coefficient Giving as part of its 'Benchmarking LLM Agents on Consequential Real-World Tasks' program. We thank Anna Szabelska, Adam Gill, and Ahana Biswas for their annotation of the ground-truth post-registrations for the extraction stage.
For questions, please contact:

Shakhlo Nematova, Research Scientist
shakhlo@cos.io