A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing LLMs and traditional machine learning models across reasoning, math, multimodal, RAG, safety, and robustness tasks.

## Contents

- General LLM Benchmarks
- Reasoning & Math
- Multimodal Benchmarks
- RAG Benchmarks
- Safety & Robustness
- Evaluation Frameworks
- Datasets
- Learning Resources
- Related Awesome Lists

## General LLM Benchmarks

- HELM – Holistic Evaluation of Language Models (Stanford CRFM); benchmarks LLMs across dozens of scenarios, metrics, and domains.
- LM Evaluation Harness – EleutherAI's standardized benchmarking suite for language models (see the Python sketch after this list).
- Open LLM Leaderboard (Hugging Face) – Public leaderboard ranking open-source LLMs on a shared set of benchmarks.
- BIG-Bench – Broad set of challenging tasks for evaluating model generalization.
- MT-Bench – Multi-turn conversation benchmark scored with an LLM judge.
- AlpacaEval – Automatic evaluation of instruction-following models.
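
A minimal sketch of running the LM Evaluation Harness from Python, assuming lm-eval >= 0.4 (which exposes `simple_evaluate`); the checkpoint and task names below are placeholders:

```python
# Minimal sketch of the LM Evaluation Harness Python API (assumes lm-eval >= 0.4).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal-LM checkpoint (placeholder)
    tasks=["hellaswag", "arc_easy"],                 # task names registered in the harness
    num_fewshot=0,
    batch_size=8,
)

# results["results"] maps each task name to its metric dict (accuracy, stderr, ...).
for task, metrics in results["results"].items():
    print(task, metrics)
```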

## Reasoning & Math

- GSM8K – Benchmark of grade-school math word problems with step-by-step reference solutions (see the loading sketch after this list).
- MATH – Large dataset of competition-level mathematics problems with worked solutions.
- AIME Bench – Benchmark based on American Invitational Mathematics Examination questions.
- ARC (Abstraction & Reasoning Corpus) – Tests generalization and abstraction capabilities.
- AGIEval – Benchmark built from human-centric standardized exams such as the SAT, LSAT, and Gaokao.
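
For reference, a small sketch of loading GSM8K with the `datasets` library; the `"main"` config and the `####` answer delimiter follow the dataset's published format, but check the Hub card for the version you use:

```python
# Sketch: load GSM8K from the Hugging Face Hub and extract the final answer.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="test")

example = gsm8k[0]
print(example["question"])

# Reference solutions end with "#### <final answer>", which is what most
# evaluation scripts compare model outputs against.
final_answer = example["answer"].split("####")[-1].strip()
print(final_answer)
```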

## Multimodal Benchmarks

- MMBench – Large benchmark for vision–language reasoning.
- COCO Captions & VQA – Standard datasets for image captioning and visual question answering.
- Flickr30k – Image–text benchmark for captioning and retrieval.
- ImageNet – Core image classification benchmark, still widely used as a reference point (see the accuracy sketch after this list).
- LAION Benchmarks – Evaluation datasets for multimodal embeddings and retrieval.
- Video Question Answering Benchmarks – Datasets for video–text reasoning and QA.
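
A rough sketch of measuring ImageNet top-1 accuracy for a pretrained classifier, assuming torchvision >= 0.13 and a locally available copy of the validation set (ImageNet is not auto-downloadable); the root path is a placeholder:

```python
# Sketch: top-1 accuracy of a pretrained classifier on the ImageNet validation set.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models

weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # the preprocessing the weights were trained with

val_set = datasets.ImageNet(root="path/to/imagenet", split="val", transform=preprocess)
loader = DataLoader(val_set, batch_size=64, num_workers=4)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

print(f"top-1 accuracy: {correct / total:.4f}")
```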

## RAG Benchmarks

- RAGAS – Metric suite (faithfulness, answer relevancy, context precision/recall) for evaluating retrieval-augmented generation pipelines (see the sketch after this list).
- BEIR – Heterogeneous suite of retrieval datasets widely used to test zero-shot search quality in RAG systems.
- FiQA / TREC Variants – Evaluation datasets for information retrieval tasks.
- Natural Questions (NQ) – Large dataset of real search queries for retrieval + QA evaluation.
- HotpotQA – Multi-hop question answering benchmark.
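
A rough sketch of a RAGAS run against the pre-1.0 (0.1.x-style) API; the column names, metric imports, and toy question/answer strings are assumptions that may differ in newer releases:

```python
# Sketch of a RAGAS evaluation run (0.1.x-style API; column names may differ in newer versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

data = Dataset.from_dict({
    "question": ["Who wrote the evaluation report?"],
    "answer": ["The report was written by the QA team."],
    "contexts": [["The quarterly evaluation report was prepared by the QA team."]],
    "ground_truth": ["The QA team wrote the report."],
})

# RAGAS calls an LLM and an embedding model under the hood, so API credentials
# (e.g. OPENAI_API_KEY) must be configured in the environment.
scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)
```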

## Safety & Robustness

- HarmBench – Standardized benchmark for automated red teaming and measuring harmful LLM behaviors.
- SafetyBench – Multiple-choice benchmark suite for LLM safety testing.
- ToxiGen – Large machine-generated dataset for implicit toxic language detection.
- Red Team Prompt Datasets – Collections of prompts to stress-test model alignment.
- RobustBench – Standardized leaderboard and model zoo of adversarially robust image classifiers (see the sketch after this list).
- Adversarial NLI (ANLI) – Adversarially collected dataset for robustness in natural language inference.
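
A small sketch of pulling a pretrained robust classifier from the RobustBench model zoo and checking its clean accuracy; the model name, dataset, and threat model are placeholders taken from the leaderboard's naming scheme:

```python
# Sketch: load a robust model from the RobustBench zoo and compute clean accuracy.
from robustbench.data import load_cifar10
from robustbench.utils import load_model, clean_accuracy

model = load_model(model_name="Carmon2019Unlabeled",  # example leaderboard entry
                   dataset="cifar10",
                   threat_model="Linf")

x_test, y_test = load_cifar10(n_examples=256)  # small slice for a quick check
print("clean accuracy:", clean_accuracy(model, x_test, y_test))
```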

## Evaluation Frameworks

- OpenAI Evals – Open-source framework for defining and running custom evals with user-specified metrics and tasks.
- TruLens – Observability and feedback-function evaluation for LLM apps and RAG pipelines.
- Arize Phoenix – Open-source toolkit for LLM/RAG evals and trace analysis.
- LightEval – Hugging Face's lightweight, fast LLM evaluation framework.
- Evalchemy – Lightweight framework for running LLM benchmarks.
- Weights & Biases Evaluation – Model comparison and metric visualization tools (see the logging sketch after this list).
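
As one way to compare models side by side, a minimal Weights & Biases logging sketch; the project name and metric values are placeholders, and a prior `wandb login` is assumed:

```python
# Sketch: log per-model evaluation metrics to Weights & Biases for side-by-side comparison.
import wandb

results = {
    "model-a": {"accuracy": 0.81, "f1": 0.78},  # placeholder scores
    "model-b": {"accuracy": 0.84, "f1": 0.80},
}

for model_name, metrics in results.items():
    run = wandb.init(project="eval-comparison", name=model_name, reinit=True)
    wandb.log(metrics)  # metrics appear as comparable charts in the W&B UI
    run.finish()
```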

## Datasets

- TruthfulQA – Benchmark measuring whether models reproduce common misconceptions in open-ended QA.
- SQuAD – Stanford Question Answering Dataset for extractive reading comprehension (see the scoring sketch after this list).
- BoolQ – Yes/no question dataset.
- SuperGLUE – General NLU evaluation suite.
- MultiRC – Multi-sentence reading comprehension benchmark.
- WikiQA – QA benchmark used in retrieval + QA systems.
- CommonsenseQA – Multiple-choice commonsense reasoning dataset.
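
A short sketch of scoring SQuAD-style predictions with the Hugging Face `evaluate` library; the ids, answers, and character offsets are toy values:

```python
# Sketch: compute exact match and F1 for SQuAD-style predictions.
import evaluate

squad_metric = evaluate.load("squad")

# Toy example in the metric's expected input format.
predictions = [{"id": "q1", "prediction_text": "Paris"}]
references = [{"id": "q1", "answers": {"text": ["Paris"], "answer_start": [42]}}]

print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```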

## Learning Resources

- HELM Overview – Intro to full-spectrum model benchmarking.
- LLM Evaluation Guide (Hugging Face) – End-to-end guide for evaluating language models.
- Stanford CS25 Notes – Covers benchmarking and model evaluation basics.
- MLPerf – Industry-standard benchmark suite from MLCommons for ML training and inference performance.
- DeepMind Papers on Evaluation – Research on model testing and evaluation.

## Related Awesome Lists

- Awesome AI
- Awesome AI Safety & Alignment
- Awesome AI Security
- Awesome AI Research Tools
- Awesome Machine Learning

## Contributing

Contributions are welcome!