Awesome AI Benchmarks & Evaluation

A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance across reasoning, safety, robustness, multimodality, RAG, LLMs, and traditional machine learning tasks.

Contents

  • General LLM Benchmarks
  • Reasoning & Math
  • Multimodal Benchmarks
  • RAG Benchmarks
  • Safety & Robustness
  • Evaluation Frameworks
  • Datasets
  • Learning Resources
  • Related Awesome Lists
  • Contribute
  • License

General LLM Benchmarks

  • HELM – Holistic evaluation of LLMs across dozens of tasks and domains.
  • LM Evaluation Harness – Standardized benchmarking suite for language models (usage sketch after this list).
  • Open LLM Leaderboard (Hugging Face) – Public leaderboard for open-source LLM performance.
  • BIG-Bench – Broad set of challenging tasks for evaluating model generalization.
  • MT-Bench – Multi-turn conversation benchmark scored with an LLM judge.
  • AlpacaEval – Automatic evaluation of instruction-following models.

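As a concrete starting point, the sketch below drives the LM Evaluation Harness from Python. It assumes lm-eval ≥ 0.4 and a small Hugging Face checkpoint; the model name, task, and sample limit are illustrative choices rather than recommendations.

```python
# Minimal LM Evaluation Harness sketch (assumes lm-eval >= 0.4).
# "gpt2" and "hellaswag" are illustrative; any HF causal LM and built-in task work.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # checkpoint to evaluate
    tasks=["hellaswag"],           # one of the harness's built-in tasks
    num_fewshot=0,
    limit=100,                     # score only a subset to keep the run short
)

# Per-task metrics (accuracy etc.) are keyed by task name.
print(results["results"]["hellaswag"])
```

The harness also ships an `lm_eval` command-line entry point that takes the same model and task arguments.
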
Reasoning & Math

  • GSM8K – Benchmark for grade-school math word problems (scoring sketch after this list).
  • MATH – Extensive dataset of high school and competition math problems.
  • AIME Bench – Benchmark based on American Invitational Mathematics Examination questions.
  • ARC (Abstraction & Reasoning Corpus) – Tests generalization and abstraction capabilities.
  • AGIEval – Benchmarks for human-style exams and reasoning.

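GSM8K reference answers end with a final line of the form `#### <number>`, so a common scoring recipe is exact match on that extracted number. The sketch below assumes the Hugging Face copy of the dataset (`gsm8k`, config `main`); `model_answers` is a hypothetical placeholder for your model's generations.

```python
# GSM8K-style exact-match scoring sketch.
# Assumes the Hugging Face "gsm8k" dataset (config "main"); model_answers is a
# hypothetical stand-in for whatever your model generated per question.
import re
from datasets import load_dataset

def final_number(text: str) -> str:
    """Reference answers end with '#### <number>'; fall back to the last number."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match:
        return match.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else ""

test = load_dataset("gsm8k", "main", split="test")
model_answers = [""] * len(test)  # replace with your model's outputs

correct = sum(
    final_number(pred) == final_number(ref["answer"])
    for pred, ref in zip(model_answers, test)
)
print(f"exact-match accuracy: {correct / len(test):.3f}")
```
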
Multimodal Benchmarks

RAG Benchmarks

  • RAGAS – Comprehensive metrics for evaluating retrieval-augmented generation (usage sketch after this list).
  • BEIR – Retrieval benchmark widely used to test search quality in RAG systems.
  • FiQA / TREC Variants – Evaluation datasets for information retrieval tasks.
  • Natural Questions (NQ) – Large dataset for retrieval + QA evaluation.
  • HotpotQA – Multi-hop question answering benchmark.

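A minimal RAGAS sketch under its 0.1.x API is shown below. Metric imports and expected column names have shifted between RAGAS releases, and most metrics call an LLM judge, so a configured LLM (for example an OpenAI API key in the environment) is assumed.

```python
# RAGAS sketch, assuming the 0.1.x API and an LLM judge configured via the
# environment (e.g. OPENAI_API_KEY); column names vary across releases.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

samples = Dataset.from_dict({
    "question": ["Who wrote On the Origin of Species?"],
    "contexts": [["On the Origin of Species was written by Charles Darwin in 1859."]],
    "answer": ["Charles Darwin wrote it, publishing the book in 1859."],
    "ground_truth": ["Charles Darwin"],
})

scores = evaluate(samples, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores, e.g. faithfulness and answer relevancy
```
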
Safety & Robustness

  • HarmBench – Standardized benchmark for automated red teaming and harmful-behavior refusal.
  • SafetyBench – Benchmark suite for safety testing.
  • ToxiGen – Machine-generated dataset for implicit toxic-language detection (scoring sketch after this list).
  • Red Team Prompt Datasets – Collections of prompts to stress-test model alignment.
  • RobustBench – Leaderboard of robust classification models.
  • Adversarial NLI (ANLI) – Dataset for stress-testing robustness in natural language inference.

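As a generic illustration of automated safety scoring (and not the API of any single tool above), the sketch below uses the `toxicity` measurement from the Hugging Face `evaluate` library, which rates text with a pretrained hate-speech classifier; the module name and default model are assumptions about that library rather than about the benchmarks listed here.

```python
# Generic toxicity-scoring sketch using the Hugging Face `evaluate` library's
# "toxicity" measurement (downloads a hate-speech classifier on first use).
import evaluate

toxicity = evaluate.load("toxicity", module_type="measurement")

outputs = [
    "Thanks, that explanation was really helpful!",
    "You are completely useless and everyone knows it.",
]

scores = toxicity.compute(predictions=outputs)
for text, score in zip(outputs, scores["toxicity"]):
    print(f"{score:.3f}  {text}")
```
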
Evaluation Frameworks

  • OpenAI Evals – Evaluation framework for custom metrics and tasks (see the minimal eval-loop sketch after this list).
  • TruLens – Observability and feedback evaluation for LLM apps and RAG.
  • Arize Phoenix – Open-source toolkit for LLM/RAG evals and trace analysis.
  • LightEval – Fast LLM evaluation pipeline.
  • Evalchemy – Lightweight framework for running LLM benchmarks.
  • Weights & Biases Evaluation – Model comparison and metric visualization tools.

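Despite different integrations and UIs, these frameworks automate the same core loop: run the model over a set of cases, score each output, and aggregate a metric. The deliberately bare-bones sketch below shows that loop; `EvalCase` and `model_fn` are hypothetical names for illustration, not part of any framework above.

```python
# Bare-bones version of the loop that evaluation frameworks automate:
# run the model on each case, score the output, aggregate a metric.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    return float(output.strip().lower() == expected.strip().lower())

def run_eval(model_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    scores = [exact_match(model_fn(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    cases = [
        EvalCase("What is the capital of France?", "Paris"),
        EvalCase("2 + 2 = ?", "4"),
    ]
    accuracy = run_eval(lambda prompt: "Paris", cases)  # toy stand-in model
    print(f"accuracy: {accuracy:.2f}")
```
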
Datasets

  • TruthfulQA – Benchmark for truthfulness in open-ended QA.
  • SQuAD – Reading comprehension dataset (loading sketch after this list).
  • BoolQ – Yes/no question dataset.
  • SuperGLUE – General NLU evaluation suite.
  • MultiRC – Multi-sentence reasoning benchmark.
  • WikiQA – QA benchmark used in retrieval + QA systems.
  • CommonsenseQA – Commonsense reasoning dataset.

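Most of the datasets above are mirrored on the Hugging Face Hub; the loading sketch below assumes the Hub IDs `squad`, `boolq`, and `truthful_qa`, which can move or require a particular `datasets` version, so treat the names as likely rather than guaranteed.

```python
# Loading sketch for a few of the datasets above via the Hugging Face Hub.
# Hub IDs and configs are assumptions and occasionally change.
from datasets import load_dataset

squad = load_dataset("squad", split="validation")
boolq = load_dataset("boolq", split="validation")
truthfulqa = load_dataset("truthful_qa", "generation", split="validation")

print(squad[0]["question"], squad[0]["answers"]["text"])
print(boolq[0]["question"], boolq[0]["answer"])
print(truthfulqa[0]["question"])
```
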
Learning Resources

Related Awesome Lists

Contribute

Contributions are welcome!

License

CC0
