1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -109,6 +109,7 @@ repos:
examples/speculative_decoding/main.py|
examples/speculative_decoding/medusa_utils.py|
examples/speculative_decoding/server_generate.py|
examples/puzzletron/evaluation/hf_deployable_anymodel\.py|
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
)$
38 changes: 30 additions & 8 deletions examples/puzzletron/README.md
@@ -15,11 +15,11 @@ In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/

## Environment

- Install Model-Optimizer in editable mode with the corresponding dependencies:
- Install Model-Optimizer in editable mode with the corresponding dependencies (run from the repo root):

```bash
pip install -e .[hf,puzzletron]
pip install -r requirements.txt
pip install -r examples/puzzletron/requirements.txt
```

- For this example we use 2x NVIDIA H100 80GB HBM3 to show the multi-GPU steps. You can also use a single GPU.
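
After installing, a quick import check confirms that the editable install resolves to the local checkout. A minimal sketch; the `modelopt` package name is taken from this repo's layout:

```bash
# Check that `modelopt` imports from the local editable checkout (the printed path should point into the repo)
python -c "import modelopt; print(modelopt.__file__)"
```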
@@ -199,16 +199,38 @@ block_14: attention no_op ffn intermediate_3072

## Evaluation

Once the model is ready, you can evaluate it using the [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on the [Massive Multitask Language Understanding](https://huggingface.co/datasets/cais/mmlu) benchmark.
Evaluate AnyModel checkpoints by deploying a local OpenAI-compatible completions endpoint and running benchmarks against it.

**1. Deploy the model (2-GPU example):**

```bash
# Install the AnyModel-patched deployable (first time only: backs up the original)
# /opt/Export-Deploy is the default path in NeMo containers — adjust if needed
cp /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py.bak
cp examples/puzzletron/evaluation/hf_deployable_anymodel.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py

# Start the server (blocks while running — use a separate terminal)
ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
--model_path path/to/checkpoint \
--model_id anymodel-hf \
--num_gpus 2 --num_gpus_per_replica 2 --num_cpus_per_replica 16 \
--trust_remote_code --port 8083 --device_map "auto" --cuda_visible_devices "0,1"
```
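
Before launching a full benchmark, you can smoke-test the endpoint with a single request. A minimal sketch, assuming the server accepts the standard OpenAI completions payload (the prompt and field values here are illustrative):

```bash
# Send one completion request to the freshly deployed endpoint (standard OpenAI-style payload assumed)
curl -s http://0.0.0.0:8083/v1/completions/ \
  -H "Content-Type: application/json" \
  -d '{"model": "anymodel-hf", "prompt": "The capital of France is", "max_tokens": 8}'
```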

**2. Run MMLU:**

```bash
lm_eval --model hf \
--model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
eval-factory run_eval \
--eval_type mmlu \
--model_id anymodel-hf \
--model_type completions \
--model_url http://0.0.0.0:8083/v1/completions/ \
--output_dir examples/puzzletron/evals/mmlu_anymodel
```

For a quick debug run, add `--overrides "config.params.limit_samples=5"`.
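
When evaluation is finished, you can restore the stock deployable and shut down Ray. A sketch assuming the backup created in step 1:

```bash
# Put back the original NeMo deployable from the backup made before patching
cp /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py.bak /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py
# Stop the local Ray head started for deployment
ray stop
```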

## Inference Performance Benchmarking

Now let's evaluate how much speedup the compressed model delivers in terms of throughput and latency.