A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing LLMs and traditional machine learning models across reasoning, math, multimodal, RAG, safety, and robustness tasks.

## Contents

- General LLM Benchmarks
- Reasoning & Math
- Multimodal Benchmarks
- RAG Benchmarks
- Safety & Robustness
- Evaluation Frameworks
- Datasets
- Learning Resources
- Related Awesome Lists

## General LLM Benchmarks

- HELM – Holistic Evaluation of Language Models (Stanford CRFM); benchmarks LLMs across dozens of scenarios, metrics, and domains.
- LM Evaluation Harness – EleutherAI's standardized benchmarking suite for language models (see the Python sketch after this list).
- Open LLM Leaderboard (Hugging Face) – Public leaderboard ranking open-source LLMs on a shared set of benchmarks.
- BIG-Bench – Broad set of challenging tasks for evaluating model generalization.
- MT-Bench – Multi-turn conversation benchmark scored with an LLM judge.
- AlpacaEval – Automatic evaluation of instruction-following models.
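
A minimal sketch of running the LM Evaluation Harness from Python, assuming lm-eval >= 0.4 (which exposes `simple_evaluate`); the checkpoint and task names below are placeholders:

```python
# Minimal sketch of the LM Evaluation Harness Python API (assumes lm-eval >= 0.4).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal-LM checkpoint (placeholder)
    tasks=["hellaswag", "arc_easy"],                 # task names registered in the harness
    num_fewshot=0,
    batch_size=8,
)

# results["results"] maps each task name to its metric dict (accuracy, stderr, ...).
for task, metrics in results["results"].items():
    print(task, metrics)
```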

## Reasoning & Math

- GSM8K – Benchmark of grade-school math word problems with step-by-step reference solutions (see the loading sketch after this list).
- MATH – Large dataset of competition-level mathematics problems with worked solutions.
- AIME Bench – Benchmark based on American Invitational Mathematics Examination questions.
- ARC (Abstraction & Reasoning Corpus) – Tests generalization and abstraction capabilities.
- AGIEval – Benchmark built from human-centric standardized exams such as the SAT, LSAT, and Gaokao.
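
For reference, a small sketch of loading GSM8K with the `datasets` library; the `"main"` config and the `####` answer delimiter follow the dataset's published format, but check the Hub card for the version you use:

```python
# Sketch: load GSM8K from the Hugging Face Hub and extract the final answer.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="test")

example = gsm8k[0]
print(example["question"])

# Reference solutions end with "#### <final answer>", which is what most
# evaluation scripts compare model outputs against.
final_answer = example["answer"].split("####")[-1].strip()
print(final_answer)
```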

## Multimodal Benchmarks

- MMBench – Large benchmark for vision–language reasoning.
- COCO Captions & VQA – Standard datasets for image captioning and visual question answering.
- Flickr30k – Image–text benchmark for captioning and retrieval.
- ImageNet – Core image classification benchmark, still widely used as a reference point (see the accuracy sketch after this list).
- LAION Benchmarks – Evaluation datasets for multimodal embeddings and retrieval.
- Video Question Answering Benchmarks – Datasets for video–text reasoning and QA.
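
A rough sketch of measuring ImageNet top-1 accuracy for a pretrained classifier, assuming torchvision >= 0.13 and a locally available copy of the validation set (ImageNet is not auto-downloadable); the root path is a placeholder:

```python
# Sketch: top-1 accuracy of a pretrained classifier on the ImageNet validation set.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # the preprocessing the weights were trained with

val_set = datasets.ImageNet(root="path/to/imagenet", split="val", transform=preprocess)
loader = DataLoader(val_set, batch_size=64, num_workers=4)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"top-1 accuracy: {correct / total:.4f}")
```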

## RAG Benchmarks

- RAGAS – Metric suite (faithfulness, answer relevancy, context precision/recall) for evaluating retrieval-augmented generation pipelines (see the sketch after this list).
- BEIR – Heterogeneous suite of retrieval datasets widely used to test zero-shot search quality in RAG systems.
- FiQA / TREC Variants – Evaluation datasets for information retrieval tasks.
- Natural Questions (NQ) – Large dataset of real search queries for retrieval + QA evaluation.
- HotpotQA – Multi-hop question answering benchmark.
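
A rough sketch of a RAGAS run against the pre-1.0 (0.1.x-style) API; the column names, metric imports, and toy question/answer strings are assumptions that may differ in newer releases:

```python
# Sketch of a RAGAS evaluation run (0.1.x-style API; column names may differ in newer versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

data = Dataset.from_dict({
    "question": ["Who wrote the evaluation report?"],
    "answer": ["The report was written by the QA team."],
    "contexts": [["The quarterly evaluation report was prepared by the QA team."]],
    "ground_truth": ["The QA team wrote the report."],
})

# RAGAS calls an LLM and an embedding model under the hood, so API credentials
# (e.g. OPENAI_API_KEY) must be configured in the environment.
scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
```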

## Safety & Robustness

- HarmBench – Standardized benchmark for automated red teaming and measuring harmful LLM behaviors.
- SafetyBench – Multiple-choice benchmark suite for LLM safety testing.
- ToxiGen – Large machine-generated dataset for implicit toxic language detection.
- Red Team Prompt Datasets – Collections of prompts to stress-test model alignment.
- RobustBench – Standardized leaderboard and model zoo of adversarially robust image classifiers (see the sketch after this list).
- Adversarial NLI (ANLI) – Adversarially collected dataset for robustness in natural language inference.
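
A small sketch of pulling a pretrained robust classifier from the RobustBench model zoo and checking its clean accuracy; the model name, dataset, and threat model are placeholders taken from the leaderboard's naming scheme:

```python
# Sketch: load a robust model from the RobustBench zoo and compute clean accuracy.
from robustbench.data import load_cifar10
from robustbench.utils import load_model, clean_accuracy

model = load_model(model_name="Carmon2019Unlabeled",  # example leaderboard entry
                   dataset="cifar10",
                   threat_model="Linf")

x_test, y_test = load_cifar10(n_examples=256)  # small slice for a quick check
print("clean accuracy:", clean_accuracy(model, x_test, y_test))
```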

## Evaluation Frameworks

- OpenAI Evals – Open-source framework for defining and running custom evals with user-specified metrics and tasks.
- TruLens – Observability and feedback-function evaluation for LLM apps and RAG pipelines.
- Arize Phoenix – Open-source toolkit for LLM/RAG evals and trace analysis.
- LightEval – Hugging Face's lightweight, fast LLM evaluation framework.
- Evalchemy – Lightweight framework for running LLM benchmarks.
- Weights & Biases Evaluation – Model comparison and metric visualization tools (see the logging sketch after this list).
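
As one way to compare models side by side, a minimal Weights & Biases logging sketch; the project name and metric values are placeholders, and a prior `wandb login` is assumed:

```python
# Sketch: log per-model evaluation metrics to Weights & Biases for side-by-side comparison.
import wandb

results = {
    "model-a": {"accuracy": 0.81, "f1": 0.78},  # placeholder scores
    "model-b": {"accuracy": 0.84, "f1": 0.80},
}

for model_name, metrics in results.items():
    run = wandb.init(project="eval-comparison", name=model_name, reinit=True)
    wandb.log(metrics)  # metrics appear as comparable charts in the W&B UI
    run.finish()
```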

## Datasets

- TruthfulQA – Benchmark measuring whether models reproduce common misconceptions in open-ended QA.
- SQuAD – Stanford Question Answering Dataset for extractive reading comprehension (see the scoring sketch after this list).
- BoolQ – Yes/no question dataset.
- SuperGLUE – General NLU evaluation suite.
- MultiRC – Multi-sentence reading comprehension benchmark.
- WikiQA – QA benchmark used in retrieval + QA systems.
- CommonsenseQA – Multiple-choice commonsense reasoning dataset.
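
A short sketch of scoring SQuAD-style predictions with the Hugging Face `evaluate` library; the ids, answers, and character offsets are toy values:

```python
# Sketch: compute exact match and F1 for SQuAD-style predictions.
import evaluate

squad_metric = evaluate.load("squad")

# Toy example in the metric's expected input format.
predictions = [{"id": "q1", "prediction_text": "Paris"}]
references = [{"id": "q1", "answers": {"text": ["Paris"], "answer_start": [42]}}]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```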

## Learning Resources

- HELM Overview – Intro to full-spectrum model benchmarking.
- LLM Evaluation Guide (Hugging Face) – End-to-end guide for evaluating language models.
- Stanford CS25 Notes – Covers benchmarking and model evaluation basics.
- MLPerf – Industry-standard benchmark suite from MLCommons for ML training and inference performance.
- DeepMind Papers on Evaluation – Research on model testing and evaluation.

## Related Awesome Lists

- Awesome AI
- Awesome AI Safety & Alignment
- Awesome AI Security
- Awesome AI Research Tools
- Awesome Machine Learning

## Contributing

Contributions are welcome!