Official Python implementation of the ICLR 2026 paper "Rethinking LLM Evaluation: Can We Evaluate LLMs with 200× Less Data?".
- [2026/01/26] Our paper has been accepted to ICLR 2026. Thanks!
Here's an overview of the process behind our EssenceBench method:
Notably, the method can be run on a CPU without any GPUs!
To get started with EssenceBench, follow the installation instructions below.
- Clone the repo

```shell
git clone https://github.com/gszfwsb/EssenceBench.git
```

- Install dependencies

```shell
conda create -n essencebench python=3.11
conda activate essencebench
pip install -r requirements.txt
```

- Download and preprocess data (e.g. gsm8k)
```shell
export HF_ENDPOINT=https://huggingface.co  # or https://hf-mirror.com
export HF_TOKEN=hf_**********************  # your Hugging Face token

# To process another dataset, replace "gsm8k" with one of
# "arc", "truthfulqa", "winogrande", "hellaswag", or "mmlu".
python process_data_main.py \
    --datasets gsm8k \
    --temp_dir ./ \
    --data_subdir data
```
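After preprocessing, it can be worth sanity-checking the generated file before launching selection. The sketch below is a minimal, hypothetical helper (not part of the repo) that assumes the output is standard JSONL, i.e. one JSON object per line, as the `.jsonl` extension suggests:

```python
import json


def count_jsonl_items(path):
    """Count non-empty records in a JSONL file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def peek_jsonl_keys(path):
    """Return the keys of the first record, to eyeball the schema."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                return sorted(json.loads(line).keys())
    return []
```

For example, `count_jsonl_items("./data/gsm8k/gsm8k_discard_items_transform.jsonl")` should report how many items survived preprocessing.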
- Subset selection (e.g. gsm8k)

```shell
# To run another dataset, replace "gsm8k" with one of
# "arc", "truthfulqa", "winogrande", "hellaswag", or "mmlu",
# e.g. --data ./data/arc/arc_discard_items_transform.jsonl
python code/subset_selection/train_multi_round.py \
    --data ./data/gsm8k/gsm8k_discard_items_transform.jsonl \
    -k 50 \
    --rounds 3 \
    --gens 2000 \
    --ablation full
```
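To run selection over every supported dataset in one pass, a loop like the following should work. This is a convenience sketch, not a repo script, and it assumes each dataset has already been preprocessed into the same `<name>_discard_items_transform.jsonl` layout shown above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Loop over all datasets listed in the instructions above (assumed preprocessed).
for ds in gsm8k arc truthfulqa winogrande hellaswag mmlu; do
    python code/subset_selection/train_multi_round.py \
        --data "./data/${ds}/${ds}_discard_items_transform.jsonl" \
        -k 50 \
        --rounds 3 \
        --gens 2000 \
        --ablation full
done
```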
If you have any questions, please contact us:
- Shaobo Wang (shaobowang1009@sjtu.edu.cn)
- Cong Wang (2149367891a@gmail.com)
- Wenjie Fu (22300240028@m.fudan.edu.cn)
If you find EssenceBench useful for your research and applications, please cite using this BibTeX:
```bibtex
@inproceedings{wang2026EssenceBench,
  title={Rethinking LLM Evaluation: Can We Evaluate LLMs with 200{\texttimes} Less Data?},
  author={Shaobo Wang and Cong Wang and Wenjie Fu and Yue Min and Mingquan Feng and Isabel Guan and Xuming Hu and Conghui He and Cunxiang Wang and Kexin Yang and Xingzhang Ren and Fei Huang and Dayiheng Liu and Linfeng Zhang},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```