
Convert DPP to PyTest #186

Open

matthew-pisano wants to merge 71 commits into main from ZSPYRT-68_convert_dpp_to_pytest

Conversation

@matthew-pisano (Collaborator) commented Feb 12, 2026

Summary

The DPP script is currently very long and complex to use, and there is no uniform way to run it within test automation.

This PR splits the original DPP script into a proper PyTest test and a frontend that keeps the old interface. The test can now be run with pytest without the complex and arbitrary parameters previously required; that logic has been internalized into the test to represent a single working path. Additionally, the --distributed parameter is now inferred from whether or not the script is run using torchrun.

Running from PyTest

torchrun --nproc-per-node 4 -m pytest -s ./tests/testing/test_drive_paged_programs.py

Note: This test must be run with 4 cards or else it may fail.

The interface for the standalone script remains unchanged.
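The inference of the --distributed setting described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; it relies on the fact that torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in each worker's environment:

```python
import os


def infer_distributed() -> bool:
    """Infer distributed mode from torchrun's environment.

    torchrun (torch.distributed.run) exports RANK, WORLD_SIZE, and
    LOCAL_RANK to every worker process, so their presence is a
    reliable signal that the script was launched under torchrun.
    """
    return all(var in os.environ for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))
```

With this approach a plain `pytest …` invocation runs single-process, while `torchrun --nproc-per-node 4 -m pytest …` runs distributed, with no extra flag needed.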

Internal changes:

  • The original ~1500-line file has been split into several files within the testing/dpp module.
  • Arbitrary strings have been replaced with enums to handle invalid inputs.
  • Error handling has been shifted towards early failure when parameters are invalid or files are not found.
  • A representative DPP program criteria JSON has been added as a fixture, so there is no need to manually specify a path.
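The string-to-enum and early-failure points above might look like the following sketch (illustrative names only, not the actual testing/dpp API):

```python
from enum import Enum


class DatasetType(Enum):
    """Replaces free-form strings so invalid inputs fail fast."""

    SHAREGPT = "sharegpt"
    RAG = "rag"


def parse_dataset_type(raw: str) -> DatasetType:
    """Fail early with a clear message instead of deep inside the run."""
    try:
        return DatasetType(raw.lower())
    except ValueError:
        valid = ", ".join(t.value for t in DatasetType)
        raise ValueError(
            f"Unknown dataset type '{raw}'; expected one of: {valid}"
        ) from None
```

An invalid `--dataset_type` is then rejected at argument-parsing time rather than partway through a multi-card run.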

Fixes:

Signed-off-by: Matthew Pisano <matthewpisano14@gmail.com>
Comment thread aiu_fms_testing_utils/utils/dpp_config.py
@Ssukriti (Contributor)

Thank you for the effort that went into splitting utilities.

After careful review, I have submitted my primary comments.

Data sampling: please ensure that users can continue to pass ShareGPT and RAG datasets from files on disk, with types "sharegpt" and "rag", as we have use cases for this and don't want to break existing behavior.
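The requested behavior, dispatching on a dataset type while still reading from a local path, could be sketched like this (a hypothetical loader; the real ShareGPT/RAG parsing lives in the DPP module, and the RAG file layout here is only an assumption):

```python
import json
from pathlib import Path


def load_dataset(dataset_type: str, dataset_path: str) -> list:
    """Load a dataset from a local file on disk, dispatched by type.

    Illustrative sketch: 'sharegpt' is assumed to be one JSON array,
    'rag' is assumed to be JSON lines. Fails early if the file is
    missing, matching the PR's early-failure approach.
    """
    path = Path(dataset_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset file not found: {path}")
    if dataset_type == "sharegpt":
        return json.loads(path.read_text())
    if dataset_type == "rag":
        return [json.loads(line) for line in path.read_text().splitlines() if line]
    raise ValueError(f"Unsupported dataset type: {dataset_type}")
```

The key point is that both types keep accepting an arbitrary `--dataset_path` on disk, so pipelines that use pre-processed HF exports are unaffected.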

On file structure: @matthew-pisano I was wondering whether we need another dpp folder under testing (aiu-fms-testing-utils/testing/dpp?), or whether all of those files can reside in /utils https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/aiu_fms_testing_utils/utils, as they are essentially utils.
Please address my review changes in the existing structure first, as submitted, so I can review them before files are moved around. I am open to leaving the files and structure as they are or moving them to utils, whichever you decide.

Furthermore, after the review changes are addressed, the DPP script should be tested before we merge to ensure it won't break any existing users; @rafvasq and @Abhishek-TAMU will need your help for that.

@matthew-pisano (Collaborator, Author) commented Feb 23, 2026

@Ssukriti Thank you very much for your thorough review!

In response to your final folder suggestion: I am flexible on where the folder lives, but I really think everything should be grouped into its own dedicated folder. Throughout the stack we tend to dump everything into a single folder, and it often causes significant confusion when loosely related code shares a folder.

@Ssukriti (Contributor)

Thanks @matthew-pisano. The only pending change is that we strongly want to support sharegpt and RAG factoid datasets from local paths on disk as before, since we have processed the HF datasets for maximum coverage of programs. The processed datasets are shared between associated teams.

@joerunde @Abhishek-TAMU @rafvasq FYI ^

@matthew-pisano (Collaborator, Author)

Thank you again @Ssukriti, I will address your final concerns.

@matthew-pisano (Collaborator, Author)

I think I have addressed all of your suggestions. Let me know if not.

@Ssukriti (Contributor)

@Abhishek-TAMU or @rafvasq, the branch is ready for testing our pipelines before we merge.

@Ssukriti (Contributor) left a comment


Approving, as the code is reviewed. We need to run some pipelines to double-check logs before we merge.

@Abhishek-TAMU (Collaborator) left a comment


Requested a small change after running DPP from the existing Jenkins pipeline using the existing interface (not pytest). Thank you.

Comment thread aiu_fms_testing_utils/scripts/drive_paged_programs.py Outdated
@Abhishek-TAMU (Collaborator) left a comment


Thanks for the PR, @matthew-pisano. Left a few comments to address.

Comment thread aiu_fms_testing_utils/testing/dpp/run_drive_paged_programs.py Outdated
Comment on lines +84 to +86
f"Could not determine tkv_limit for model variant '{model_variant}'. "
"Please set the environment variable VLLM_DT_MAX_BATCH_TKV_LIMIT or "
"run this program in distributed mode."
Collaborator


@matthew-pisano Looks like we would always get this error when running with 1 AIU using existing interface from aiu_fms_testing_utils/scripts/drive_paged_programs.py. Are we planning to add code to use VLLM_DT_MAX_BATCH_TKV_LIMIT for 1 AIU runs too ?

Collaborator Author


I will need some information on how you are running this with 1 AIU; 4 AIUs is the only good configuration that I know of. Are you running with torchrun?

Collaborator


We do run DPP with 1 AIU using torchrun; sharing the command for that in case it helps:

export VLLM_DT_CHUNK_LEN=1024
export VLLM_DT_MAX_BATCH_SIZE=16
export VLLM_DT_MAX_BATCH_TKV_LIMIT=131072
export VLLM_DT_MAX_CONTEXT_LEN=3072

torchrun --nproc-per-node=1 aiu-fms-testing-utils/scripts/drive_paged_programs.py \
--max_new_tokens=32 \
--prefill_chunk_size=1024 \
--model_variant=ibm-granite/granite-3.3-8b-instruct \
--program_criteria_json_path=program_criteria_new.json \
--dataset_path=ShareGPT_V3_unfiltered_cleaned_split.json \
--dataset_type=sharegpt \
--test_type=metrics \
--cross_entropy_threshold=2.6 \
--failure_rate_threshold=0.1 \
--attention_type=paged \
--validation_info_outputs_dir=granite_3/sharegpt_homogeneous \
--prioritize_large_batch_sizes \
--enforce_homogeneous_prompt_programs

@matthew-pisano (Collaborator, Author) commented Mar 11, 2026


I was unable to get it working with these parameters. I got an error complaining that it could not allocate a KV cache. I tried it with a few smaller values and frequently got floating point errors. I found one set of values that worked:

VLLM_DT_CHUNK_LEN="256"
VLLM_DT_MAX_BATCH_SIZE="8"
VLLM_DT_MAX_BATCH_TKV_LIMIT="1024"
VLLM_DT_MAX_CONTEXT_LEN="256"
AFTU_PAGED_KVCACHE_NUM_BLOCKS_HINT="1024"

Is there a way to decide which combination of environment variables will work? I cannot find any documentation, just things declaring "These should probably work".

Collaborator Author


@Abhishek-TAMU Any thoughts?

Collaborator


@matthew-pisano Earlier I had shared the detailed product use case internally, along with the potential fix. Summarizing it again here:

There is currently no documentation describing which parameter combination should be used. Based on the product use case, the 1-AIU case is currently running and testing successfully with the following parameters:

VLLM_DT_MAX_BATCH_SIZE=16
VLLM_DT_MAX_CONTEXT_LEN=3072
VLLM_DT_MAX_BATCH_TKV_LIMIT=131072
VLLM_DT_CHUNK_LEN=1024

AFTU_PAGED_KVCACHE_NUM_BLOCKS_HINT is not set because it is later set in the generate function using the logic here.

For the 4-AIU configuration, the current logic works fine:
AFTU_PAGED_KVCACHE_NUM_BLOCKS_HINT = 8192 if prefill_chunk_size > 0 else 2080

So in this commit, the logic added for the 1-AIU run to assign VLLM_DT_MAX_BATCH_TKV_LIMIT looks correct:

elif use_distributed and world_size == 1:
    # Only set defaults for TP=1
    context = (
        "Model granite-3.3-8b (or compatible) "
        "with tensor parallel size 1 detected"
    )
    self.tkv_limit = self._get_int_env(
        key="VLLM_DT_MAX_BATCH_TKV_LIMIT",
        default=131072,
        context=context,
    )

However, it seems better to keep the earlier behavior of assigning self.num_blocks only for 4-AIU runs, not for 1-AIU runs, since for 1 AIU num_blocks is set later and that is already working in the current pipeline.

So this code block should be removed, and this earlier code here should be restored.

This will run 1 AIU configuration successfully with the params given here.
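For context, a `_get_int_env`-style helper like the one referenced in the snippet above could plausibly look like this. This is a sketch under assumptions; the real implementation lives in aiu_fms_testing_utils/utils/dpp_config.py:

```python
import os


def get_int_env(key: str, default: int, context: str = "") -> int:
    """Read an integer environment variable, falling back to a default.

    Raises with context so a misconfigured variable surfaces early,
    in line with the PR's early-failure error handling.
    """
    raw = os.environ.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(
            f"{context}: expected an integer for {key}, got '{raw}'"
        ) from None
```

This lets the TP=1 path fall back to a default such as 131072 for VLLM_DT_MAX_BATCH_TKV_LIMIT while still honoring an explicit override from the environment.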

Collaborator Author


@Abhishek-TAMU It looks like DPP is unacceptably flaky/machine dependent with 1 AIU. We have been unable to get the 1 AIU case working for any configuration recently. Even the parameters that I measured as good do not work. We are thinking that there may be some issues beyond the environment variables, possibly dependent on container configuration.

@Abhishek-TAMU (Collaborator) commented Apr 2, 2026


@matthew-pisano

1- Can I ask what those errors are? Just to confirm, are you testing with this change in aiu_fms_testing_utils/utils/dpp_config.py ? If you are testing with this change and are unable to get it running on your machine for verification, then we can stick with this change since it works in the existing pipeline for the 1 AIU run. Once you make this change, we can proceed with the PR.

2- I was also testing the 4-AIU runs again and noticed this: the attn_type_map mapping here, and passing attn_type_map[attn_type] instead of just attn_type.value to the get_default_validation_prefix() function, changes the name of the CPU info file. Because of that, it no longer finds the existing CPU info files that we have already saved for all models and dtypes.

So, if this change is not mandatory, can we continue passing just attn_type.value? That would save the effort of re-saving all CPU info for the entire DPP pipeline across all models just because the file name changes.
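The file-naming concern above can be illustrated with a toy prefix builder (hypothetical; the real function is get_default_validation_prefix, and the mapped string below is an invented placeholder): any change to the attention-type string changes the prefix, so previously saved CPU info files stop matching.

```python
def validation_prefix(model_variant: str, dtype: str, attn_type: str) -> str:
    # Toy stand-in for get_default_validation_prefix(): the attn_type
    # string is embedded verbatim in the cached file name, so the
    # caller's choice of string determines which cache files are found.
    return f"{model_variant.replace('/', '--')}_{dtype}_{attn_type}"
```

Passing `attn_type_map[attn_type]` instead of `attn_type.value` changes the last component of the name, which is why the pipeline's existing CPU info files are no longer located.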

3- When DPP tries to save new CPU info, there are a couple of errors in that path. Please test the CPU info saving flow as well. One fix is here in aiu_fms_testing_utils/testing/dpp/generation.py due to an argument mismatch, but there seem to be additional issues in that path when saving new CPU info.

@JRosenkranz (Contributor)

bot:test
TEST_FILE=test_scripts.py

@matthew-pisano matthew-pisano requested review from cfsarmiento and removed request for jameslivulpi March 20, 2026 15:56