Merged
39 changes: 30 additions & 9 deletions examples/mmlu_benchmark/README.md
@@ -1,19 +1,27 @@
# MMLU Benchmark Example

-Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959).
+Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO (Diversifying Sample Condensation)](https://arxiv.org/abs/2510.07959).

## Installation

-For basic MMLU evaluation:
+Install [uv package manager](https://docs.astral.sh/uv/) as described [here](https://docs.astral.sh/uv/getting-started/installation/).
+
+Create Python environment:
+
+```bash
+uv venv --python 3.11
+```
+
+Install dependencies for basic MMLU evaluation:

```bash
-uv pip install .[mmlu]
+uv sync --extra mmlu
```

-For DISCO prediction (includes DISCO dependencies):
+Install dependencies for MMLU evaluation with DISCO:

```bash
-uv pip install .[disco]
+uv sync --extra disco
```

## Run without DISCO (full evaluation)
@@ -31,9 +39,8 @@ Full evaluation results look like:
Results Summary (Evaluated Tasks)
================================================================================
Total tasks: 14042
-Correct: 8291
-Accuracy (on anchor points): 0.5904
-Accuracy norm (on anchor points): 0.5904
+Correct: 8292
+Accuracy: 0.5905
```
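
As a quick sanity check on the summary figures, the reported accuracies agree with a plain correct-over-total ratio. The sketch below assumes that is how the script derives `Accuracy`; the diff does not show the metric code, and `accuracy` is a hypothetical helper rather than a function from `mmlu_benchmark.py`.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of evaluated tasks answered correctly (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


# Counts taken from the summaries in this README diff.
print(f"Accuracy: {accuracy(8292, 14042):.4f}")  # full run, new output
print(f"Accuracy: {accuracy(8291, 14042):.4f}")  # full run, old output
print(f"Accuracy: {accuracy(36, 100):.4f}")      # 100-task DISCO run
```

Formatted to four decimals, these ratios match the `0.5905`, `0.5904`, and `0.3600` values shown in the README summaries.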

## Run with DISCO (predicted full-benchmark score)
@@ -47,10 +54,24 @@ uv run python examples/mmlu_benchmark/mmlu_benchmark.py --model_id alignment-han
Predicted score output:

```
+================================================================================
+Results Summary (Evaluated Tasks)
+================================================================================
+Total tasks: 100
+Correct: 36
+Accuracy: 0.3600
+
+================================================================================
+DISCO Prediction
+================================================================================
+Computing embeddings and predicting full benchmark accuracy...
+Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 19171.53it/s]
+Using: DISCO predictor from Hugging Face (arubique/DISCO-MMLU)
+
----------------------------------------
DISCO Predicted Full Benchmark Accuracy:
----------------------------------------
-Model 0: 0.606739
+Model 0 (alignment-handbook/zephyr-7b-sft-full): 0.602309
```
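
The labelled prediction line comes from the reporting loop changed in `mmlu_benchmark.py`, which now iterates over the predictions in sorted order and appends the model id. A minimal sketch of that formatting, assuming the predicted accuracies arrive as a dict keyed by integer model index (the key type is an assumption, and `format_predictions` is a hypothetical helper, not part of the script):

```python
def format_predictions(predicted_accuracies: dict[int, float], model_id: str) -> list[str]:
    """Return one report line per model, ordered by model index."""
    return [
        f" Model {idx} ({model_id}): {acc:.6f}"
        # sorted() makes the order independent of dict insertion order.
        for idx, acc in sorted(predicted_accuracies.items())
    ]


# Value copied from the example output in this README diff.
for line in format_predictions({0: 0.602309}, "alignment-handbook/zephyr-7b-sft-full"):
    print(line)
```

Because the label is taken from the single `--model_id` argument, every line would carry the same model id if the dict held more than one entry.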

## Arguments
9 changes: 3 additions & 6 deletions examples/mmlu_benchmark/mmlu_benchmark.py
@@ -258,8 +258,6 @@ def extract_eval_entries(res):
print(f"Saved predictions tensor to {output_path}")
print(f" Shape: {predictions.shape}")
print(f" Dtype: {predictions.dtype}")
-else:
-    print(f"Built predictions tensor with shape: {predictions.shape}")

return predictions

@@ -723,8 +721,7 @@ def main():
print("=" * 80)
print(f"Total tasks: {metrics['total_tasks']}")
print(f"Correct: {metrics['correct_count']}")
-print(f"Accuracy (on anchor points): {metrics['acc']:.4f}")
-print(f"Accuracy norm (on anchor points): {metrics['acc_norm']:.4f}")
+print(f"Accuracy: {metrics['acc']:.4f}")

# Build predictions tensor for DISCO
predictions = None
@@ -754,8 +751,8 @@ def main():
print("\n" + "-" * 40)
print("DISCO Predicted Full Benchmark Accuracy:")
print("-" * 40)
-for model_idx, acc in disco_results["predicted_accuracies"].items():
-    print(f" Model {model_idx}: {acc:.6f}")
+for model_idx, acc in sorted(disco_results["predicted_accuracies"].items()):
+    print(f" Model {model_idx} ({args.model_id}): {acc:.6f}")

# Save summary
summary_data = {