Bug fixes for evo2 inference. CP at least doesn't error. TP working on 1 node. #1454
kjaniknvidia wants to merge 1 commit into NVIDIA:main from
Conversation
@balvisio could you help review?
```python
cp_group = None
cp_size = 1
```
Should we set this to 1/None in the case when we do not use CP? Forgot why we had these defaults here but there may be weird cases that fail where we divide by cp_size somewhere...
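For context, a minimal illustration of why `cp_size = 1` / `cp_group = None` are safe defaults when CP is unused (hypothetical numbers, not the actual mixer code):

```python
# Hypothetical illustration only: with the defaults above, CP-related arithmetic
# degenerates to a no-op instead of dividing by zero or touching a missing group.
cp_group = None
cp_size = 1

seq_len = 8192
local_seq_len = seq_len // cp_size  # stays 8192 when CP is off
assert local_seq_len == seq_len and cp_group is None
```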
I ran `py.test tests/bionemo/evo2/`. Were these all passing previously? From the short test summary info: the first one missed, just outside the range (`E comparison failed`). The other two look like they are related to TE not supporting NVFP4/MXFP8 precisions for hyena... so they "shouldn't" have anything to do with the inference changes?
@kjaniknvidia interesting, yeah we should mark `tests/bionemo/evo2/run/test_predict.py::test_predict_evo2_equivalent_with_log_probs[ddp=1,cp=1,pp=1,tp=2,fp8=True,wi=epoch]`.

The thing I'm more interested in is adding test coverage for the code you added. Specifically:

- Add a test to https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py that uses the same gold-standard values and test method as https://github.com/NVIDIA/bionemo-framework/blob/main/bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py#L735, but driven by torchrun CLI calls with TP=[1,2] and PP=[1,2].
- Look into whether Megatron inference supports CP like is being added to vLLM (https://docs.vllm.ai/en/latest/serving/context_parallel_deployment/); if CP is supported, also add CP=[1,2] coverage.
- Add skips for CI, but verify that this works manually.

The test works by taking four well-known sequences and first splitting them in half. These are the same four sequences that Arc used for internal testing of their model. Then 500 tokens (verify this number) are generated for each sequence and compared to the next 500 known tokens for each sequence. We have an expected sequence identity for each of the 4 sequences given the deterministic topk=1 sampling strategy for generation.
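A rough sketch of what that test could look like. The gold-standard sequences, identity thresholds, output parsing, and the pipeline-parallel flag name are placeholders/assumptions to be filled in from test_evo2.py and infer.py; only `--tensor-parallel-size`, `--max-new-tokens`, and `--prompt-file` are flags mentioned in this PR:

```python
import subprocess
import pytest

# Placeholder gold standard: name -> (prompt half, expected next 500 tokens, min identity).
# The real values would come from the four Arc test sequences used in test_evo2.py.
GOLD_STANDARD: dict[str, tuple[str, str, float]] = {}


@pytest.mark.multi_gpu
@pytest.mark.skip(reason="Run manually; multi-GPU tests are skipped in the PR pipeline.")
@pytest.mark.parametrize("tp,pp", [(1, 1), (2, 1), (1, 2)])
def test_infer_matches_gold_standard(tp, pp, tmp_path):
    for name, (prompt, expected, min_identity) in GOLD_STANDARD.items():
        prompt_file = tmp_path / f"{name}.txt"
        prompt_file.write_text(prompt)
        cmd = [
            "torchrun", f"--nproc-per-node={tp * pp}",
            "./src/bionemo/evo2/run/infer.py",
            f"--tensor-parallel-size={tp}",
            f"--pipeline-model-parallel-size={pp}",  # assumed flag name
            "--max-new-tokens=500",                  # 500 per the description; verify
            f"--prompt-file={prompt_file}",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        # Assumed output format: the generated sequence is the last stdout line.
        generated = result.stdout.strip().splitlines()[-1]
        identity = sum(a == b for a, b in zip(generated, expected)) / len(expected)
        assert identity >= min_identity, f"{name}: identity {identity:.3f} < {min_identity}"
```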
Description
EVO2 Dockerfile Change - Added ENVs to force causal_conv1d to be built.
CP Partial Fixes
Running Evo2 Megatron inference with --context-parallel-size 2 caused tensor shape mismatches because the inference path doesn't split sequences across CP ranks (each rank gets the full sequence), but the code was still applying CP-related AllToAll operations and CP-sharded parameter slicing.
Tested that it no longer errors and returns the generation, but enabling fully functional CP is a much heavier lift.
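As a rough illustration of the guard (names and shapes are hypothetical, not the actual Hyena mixer code): during inference each CP rank keeps the full sequence, so CP slicing/AllToAll is only applied on the training path.

```python
import torch

# Hypothetical sketch: return the slice of `x` this CP rank should operate on.
# During training, sequences are split across CP ranks; during inference each
# rank already holds the full sequence, so no CP slicing/AllToAll is applied.
def maybe_shard_for_cp(x: torch.Tensor, cp_size: int, cp_rank: int, inference: bool) -> torch.Tensor:
    if cp_size > 1 and not inference:
        chunk = x.shape[1] // cp_size  # sequence dimension assumed at dim 1
        return x[:, cp_rank * chunk : (cp_rank + 1) * chunk]
    return x  # inference path: keep the full sequence


x = torch.randn(2, 1024, 512)  # (batch, seq, hidden)
assert maybe_shard_for_cp(x, cp_size=2, cp_rank=0, inference=True).shape[1] == 1024
assert maybe_shard_for_cp(x, cp_size=2, cp_rank=0, inference=False).shape[1] == 512
```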
TP Fixes
With TP > 1, sequence_parallel was automatically set to True. Megatron's inference engine doesn't support sequence parallelism for non-MoE models, raising NotImplementedError. Sequence parallelism is a training-only activation memory optimization and isn't needed for inference.
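A minimal sketch of that fix, using a stub in place of a Megatron-style config object (the attribute names mirror Megatron's TransformerConfig, but this is illustrative, not the exact infer.py change):

```python
from dataclasses import dataclass


@dataclass
class InferenceConfigStub:  # stand-in for a Megatron-style TransformerConfig
    tensor_model_parallel_size: int = 8
    sequence_parallel: bool = True  # often flipped on automatically when TP > 1


config = InferenceConfigStub()

# Megatron's inference engine raises NotImplementedError for sequence parallelism
# with non-MoE models; it is a training-only activation-memory optimization,
# so the inference path simply turns it off.
if config.tensor_model_parallel_size > 1 and config.sequence_parallel:
    config.sequence_parallel = False

assert config.sequence_parallel is False
```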
In ParallelHyenaOperator.forward_long(), the call to engine.parallel_iir() passed self.hidden_size (the full model hidden size, e.g., 4096). With TP=8, each rank only has width_per_tp_group = 4096 / 8 = 512 channels, so the concatenated tensor u has 3 * 512 = 1536 channels. The parallel_iir function tried to split it into three chunks of 4096, causing a RuntimeError.
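To illustrate the shape mismatch, a small standalone reproduction of the channel arithmetic (this is not the actual parallel_iir code; `torch.split` stands in for the internal chunking):

```python
import torch

hidden_size = 4096                       # full model hidden size
tp = 8                                   # tensor-parallel world size
width_per_tp_group = hidden_size // tp   # 512 channels live on each TP rank

# On each rank, u concatenates three projections: 3 * 512 = 1536 channels.
u = torch.randn(1, 3 * width_per_tp_group, 16)

# Before the fix: splitting by the full hidden size cannot add up to 1536 -> RuntimeError.
try:
    x1, x2, v = torch.split(u, [hidden_size] * 3, dim=1)
except RuntimeError as err:
    print(f"shape mismatch: {err}")

# After the fix: split by the per-rank width that the TP rank actually owns.
x1, x2, v = torch.split(u, [width_per_tp_group] * 3, dim=1)
assert x1.shape[1] == width_per_tp_group
```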
Infer.py Test Enabling
I didn't know this, but apparently Unix command lines have a character limit. Even though I was generating the prompt, it evaluated to something longer than 128K characters. Added a `--prompt-file` switch that reads the prompt... from a file.
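A minimal sketch of the switch (only the `--prompt-file` flag name comes from this PR; the rest of the parser wiring here is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--prompt", type=str, default=None, help="Prompt passed directly on the CLI.")
parser.add_argument(
    "--prompt-file",
    type=str,
    default=None,
    help="Read the prompt from a file to avoid the shell's argument-length limit.",
)
args = parser.parse_args()

# Prefer the file when given; argparse exposes the flag as args.prompt_file.
if args.prompt_file is not None:
    with open(args.prompt_file) as f:
        prompt = f.read().strip()
else:
    prompt = args.prompt
```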
Usage
Tested TP=8 on one node.
A 16K prompt fits on one GPU; 32K OOMs.
Confirmed that 128K prompt fits on 8 GPUs and 256K prompt OOMs.
Generate the prompt:
python3 -c "
import random
random.seed(42)
seq = ''.join(random.choices('ACGT', k=131072))
prompt = '|d_Bacteria;p_Pseudomonadota;c_Gammaproteobacteria;o_Enterobacterales|' + seq
with open('/tmp/prompt128k.txt', 'w') as f:
f.write(prompt)
print(f'Wrote prompt of length {len(prompt)}')
"
Execute:
```bash
torchrun --nproc-per-node 8 ./src/bionemo/evo2/run/infer.py \
    --tensor-parallel-size 8 \
    --ckpt-dir $CKPT_OUT_DIR \
    --max-new-tokens 100 \
    --max-seq-length 131300 \
    --prompt-file /tmp/prompt128k.txt
```
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
It's my first real change. Run everything?
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
Pull requests from authorized users will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123). Otherwise, an `/ok to test` comment on the pull request is needed to trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from CodeRabbit, comment on the pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist