Bug fixes for evo2 inference. CP at least doesn't error. TP working on 1 node. #1454
kjaniknvidia wants to merge 1 commit into NVIDIA:main from
Conversation
@balvisio could you help review?
```python
cp_group = None
cp_size = 1
```
Should we set this to 1/None in the case when we do not use CP? Forgot why we had these defaults here but there may be weird cases that fail where we divide by cp_size somewhere...
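For context, a minimal illustration of why `cp_size = 1` / `cp_group = None` are safe defaults when CP is unused (hypothetical numbers, not the actual mixer code):

```python
# Hypothetical illustration only: with the defaults above, CP-related arithmetic
# degenerates to a no-op instead of dividing by zero or touching a missing group.
cp_group = None
cp_size = 1

seq_len = 8192
local_seq_len = seq_len // cp_size  # stays 8192 when CP is off
assert local_seq_len == seq_len and cp_group is None
```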
I ran `py.test tests/bionemo/evo2/`. Were these all passing previously? From the short test summary info: the first one missed, just outside the range (`E comparison failed`). The other two look like they are related to TE not supporting NVFP4/MXFP8 precisions for hyena... so they "shouldn't" have anything to do with the inference changes?
@kjaniknvidia interesting, yeah we should mark `tests/bionemo/evo2/run/test_predict.py::test_predict_evo2_equivalent_with_log_probs[ddp=1,cp=1,pp=1,tp=2,fp8=True,wi=epoch]`.

The thing I'm more interested in is adding test coverage for the code you added. Specifically:

- Add a test to https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py that uses the same gold-standard values and test method as https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py#L735, but driven by torchrun CLI calls with TP=[1,2] and PP=[1,2].
- Look into whether Megatron inference supports CP like is being added to vLLM (https://docs.vllm.ai/en/latest/serving/context_parallel_deployment/); if CP is supported, also add CP=[1,2] coverage.
- Add skips for CI, but verify that this works manually.

The test works by taking four well-known sequences and first splitting them in half. These are the same four sequences that Arc used for internal testing of their model. Then 500 tokens (verify this number) are generated for each sequence and compared to the next 500 known tokens for each sequence. We have an expected sequence identity for each of the 4 sequences given the deterministic topk=1 sampling strategy for generation.
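A rough sketch of what that test could look like. The gold-standard sequences, identity thresholds, output parsing, and the pipeline-parallel flag name are placeholders/assumptions to be filled in from test_evo2.py and infer.py; only `--tensor-parallel-size`, `--max-new-tokens`, and `--prompt-file` are flags mentioned in this PR:

```python
import subprocess
import pytest

# Placeholder gold standard: name -> (prompt half, expected next 500 tokens, min identity).
# The real values would come from the four Arc test sequences used in test_evo2.py.
GOLD_STANDARD: dict[str, tuple[str, str, float]] = {}


@pytest.mark.multi_gpu
@pytest.mark.skip(reason="Run manually; multi-GPU tests are skipped in the PR pipeline.")
@pytest.mark.parametrize("tp,pp", [(1, 1), (2, 1), (1, 2)])
def test_infer_matches_gold_standard(tp, pp, tmp_path):
    for name, (prompt, expected, min_identity) in GOLD_STANDARD.items():
        prompt_file = tmp_path / f"{name}.txt"
        prompt_file.write_text(prompt)
        cmd = [
            "torchrun", f"--nproc-per-node={tp * pp}",
            "./src/bionemo/evo2/run/infer.py",
            f"--tensor-parallel-size={tp}",
            f"--pipeline-model-parallel-size={pp}",  # assumed flag name
            "--max-new-tokens=500",                  # 500 per the description; verify
            f"--prompt-file={prompt_file}",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        # Assumed output format: the generated sequence is the last stdout line.
        generated = result.stdout.strip().splitlines()[-1]
        identity = sum(a == b for a, b in zip(generated, expected)) / len(expected)
        assert identity >= min_identity, f"{name}: identity {identity:.3f} < {min_identity}"
```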
Description
EVO2 Dockerfile Change - Added ENVs to force causal_conv1d to be built.
CP Partial Fixes
Running Evo2 Megatron inference with --context-parallel-size 2 caused tensor shape mismatches because the inference path doesn't split sequences across CP ranks (each rank gets the full sequence), but the code was still applying CP-related AllToAll operations and CP-sharded parameter slicing.
Tested that it no longer errors and returns the generation, but enabling fully functional CP is a much heavier lift.
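As a rough illustration of the guard (names and shapes are hypothetical, not the actual Hyena mixer code): during inference each CP rank keeps the full sequence, so CP slicing/AllToAll is only applied on the training path.

```python
import torch

# Hypothetical sketch: return the slice of `x` this CP rank should operate on.
# During training, sequences are split across CP ranks; during inference each
# rank already holds the full sequence, so no CP slicing/AllToAll is applied.
def maybe_shard_for_cp(x: torch.Tensor, cp_size: int, cp_rank: int, inference: bool) -> torch.Tensor:
    if cp_size > 1 and not inference:
        chunk = x.shape[1] // cp_size  # sequence dimension assumed at dim 1
        return x[:, cp_rank * chunk : (cp_rank + 1) * chunk]
    return x  # inference path: keep the full sequence


x = torch.randn(2, 1024, 512)  # (batch, seq, hidden)
assert maybe_shard_for_cp(x, cp_size=2, cp_rank=0, inference=True).shape[1] == 1024
assert maybe_shard_for_cp(x, cp_size=2, cp_rank=0, inference=False).shape[1] == 512
```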
TP Fixes
With TP > 1, sequence_parallel was automatically set to True. Megatron's inference engine doesn't support sequence parallelism for non-MoE models, raising NotImplementedError. Sequence parallelism is a training-only activation memory optimization and isn't needed for inference.
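A minimal sketch of that fix, using a stub in place of a Megatron-style config object (the attribute names mirror Megatron's TransformerConfig, but this is illustrative, not the exact infer.py change):

```python
from dataclasses import dataclass


@dataclass
class InferenceConfigStub:  # stand-in for a Megatron-style TransformerConfig
    tensor_model_parallel_size: int = 8
    sequence_parallel: bool = True  # often flipped on automatically when TP > 1


config = InferenceConfigStub()

# Megatron's inference engine raises NotImplementedError for sequence parallelism
# with non-MoE models; it is a training-only activation-memory optimization,
# so the inference path simply turns it off.
if config.tensor_model_parallel_size > 1 and config.sequence_parallel:
    config.sequence_parallel = False

assert config.sequence_parallel is False
```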
In ParallelHyenaOperator.forward_long(), the call to engine.parallel_iir() passed self.hidden_size (the full model hidden size, e.g., 4096). With TP=8, each rank only has width_per_tp_group = 4096 / 8 = 512 channels, so the concatenated tensor u has 3 * 512 = 1536 channels. The parallel_iir function tried to split it into three chunks of 4096, causing a RuntimeError.
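To illustrate the shape mismatch, a small standalone reproduction of the channel arithmetic (this is not the actual parallel_iir code; `torch.split` stands in for the internal chunking):

```python
import torch

hidden_size = 4096                       # full model hidden size
tp = 8                                   # tensor-parallel world size
width_per_tp_group = hidden_size // tp   # 512 channels live on each TP rank

# On each rank, u concatenates three projections: 3 * 512 = 1536 channels.
u = torch.randn(1, 3 * width_per_tp_group, 16)

# Before the fix: splitting by the full hidden size cannot add up to 1536 -> RuntimeError.
try:
    x1, x2, v = torch.split(u, [hidden_size] * 3, dim=1)
except RuntimeError as err:
    print(f"shape mismatch: {err}")

# After the fix: split by the per-rank width that the TP rank actually owns.
x1, x2, v = torch.split(u, [width_per_tp_group] * 3, dim=1)
assert x1.shape[1] == width_per_tp_group
```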
Infer.py Test Enabling
I didn't know this, but apparently Unix command lines have a character limit. Even though I was generating the prompt, it evaluated to something longer than 128K characters. Added a `--prompt-file` switch that reads the prompt... from a file.
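A minimal sketch of the switch (only the `--prompt-file` flag name comes from this PR; the rest of the parser wiring here is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--prompt", type=str, default=None, help="Prompt passed directly on the CLI.")
parser.add_argument(
    "--prompt-file",
    type=str,
    default=None,
    help="Read the prompt from a file to avoid the shell's argument-length limit.",
)
args = parser.parse_args()

# Prefer the file when given; argparse exposes the flag as args.prompt_file.
if args.prompt_file is not None:
    with open(args.prompt_file) as f:
        prompt = f.read().strip()
else:
    prompt = args.prompt
```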
Usage
Tested TP=8 on one node.
A 16K prompt fits on one GPU; 32K OOMs.
Confirmed that 128K prompt fits on 8 GPUs and 256K prompt OOMs.
Generate the prompt:
python3 -c "
import random
random.seed(42)
seq = ''.join(random.choices('ACGT', k=131072))
prompt = '|d_Bacteria;p_Pseudomonadota;c_Gammaproteobacteria;o_Enterobacterales|' + seq
with open('/tmp/prompt128k.txt', 'w') as f:
f.write(prompt)
print(f'Wrote prompt of length {len(prompt)}')
"
Execute:
```bash
torchrun --nproc-per-node 8 ./src/bionemo/evo2/run/infer.py \
    --tensor-parallel-size 8 \
    --ckpt-dir $CKPT_OUT_DIR \
    --max-new-tokens 100 \
    --max-seq-length 131300 \
    --prompt-file /tmp/prompt128k.txt
```
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
It's my first real change. Run everything?
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
Pull requests from authorized users will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123). Otherwise, an `/ok to test` comment on the pull request is needed to trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from CodeRabbit, comment on the pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist