Conversation
Signed-off-by: Matthew Pisano <matthewpisano14@gmail.com>
Thank you for the effort that went into splitting the utilities. After careful review, my primary comments have been submitted. On the data sampling part: please ensure that users can continue to pass ShareGPT and RAG datasets from a file on disk, with types "sharegpt" and "rag", as we have use cases for this and don't want to break existing behavior. On file structure: @matthew-pisano I was wondering whether we need another dpp folder under testing (aiu-fms-testing-utils/testing/dpp?), or if all of those files can reside in /utils https://github.com/foundation-model-stack/aiu-fms-testing-utils/tree/main/aiu_fms_testing_utils/utils, since they are essentially utils. Furthermore, after the review changes are addressed, the DPP script should be tested before we merge to ensure it won't break any existing users. @rafvasq and @Abhishek-TAMU, we will need your help for that.
@Ssukriti Thank you very much for your thorough review! In response to your final folder suggestion, I am flexible on where the folder lives, but I really think everything should be grouped into a single folder. We tend to dump everything into one folder throughout the stack, and it often causes significant confusion when unrelated code ends up in the same folder.
Thanks @matthew-pisano. The only pending change is that we do strongly want to support ShareGPT and RAG factoid datasets from local paths on disk, as before, since we have processed the HF datasets for maximum coverage of programs. The processed datasets are shared between associated teams. @joerunde @Abhishek-TAMU @rafvasq FYI ^
Thank you again @Ssukriti, I will address your final concerns.
I think I have addressed all of your suggestions. Let me know if not.
@Abhishek-TAMU or @rafvasq, the branch is ready for testing our pipelines before we merge.
Ssukriti
left a comment
Approving, as the code is reviewed. We need to run some pipelines to double-check logs before we merge.
Abhishek-TAMU
left a comment
Thanks for the PR, @matthew-pisano. Left a few comments to address.
f"Could not determine tkv_limit for model variant '{model_variant}'. "
"Please set the environment variable VLLM_DT_MAX_BATCH_TKV_LIMIT or "
"run this program in distributed mode."
@matthew-pisano Looks like we would always get this error when running with 1 AIU using the existing interface from aiu_fms_testing_utils/scripts/drive_paged_programs.py. Are we planning to add code to use VLLM_DT_MAX_BATCH_TKV_LIMIT for 1 AIU runs too?
I will need some information about how you are running this with 1 AIU; 4 AIUs is the only good configuration that I know of. Are you running with torchrun?
We do run 1 AIU DPP runs using torchrun; sharing the command in case it helps:
export VLLM_DT_CHUNK_LEN=1024
export VLLM_DT_MAX_BATCH_SIZE=16
export VLLM_DT_MAX_BATCH_TKV_LIMIT=131072
export VLLM_DT_MAX_CONTEXT_LEN=3072
torchrun --nproc-per-node=1 aiu-fms-testing-utils/scripts/drive_paged_programs.py \
--max_new_tokens=32 \
--prefill_chunk_size=1024 \
--model_variant=ibm-granite/granite-3.3-8b-instruct \
--program_criteria_json_path=program_criteria_new.json \
--dataset_path=ShareGPT_V3_unfiltered_cleaned_split.json \
--dataset_type=sharegpt \
--test_type=metrics \
--cross_entropy_threshold=2.6 \
--failure_rate_threshold=0.1 \
--attention_type=paged \
--validation_info_outputs_dir=granite_3/sharegpt_homogeneous \
--prioritize_large_batch_sizes \
--enforce_homogeneous_prompt_programs
I was unable to get it working with these parameters. I got an error complaining that it could not allocate a KV cache. I tried it with a few smaller values and frequently got floating point errors. I found one set of values that worked:
VLLM_DT_CHUNK_LEN=256
VLLM_DT_MAX_BATCH_SIZE=8
VLLM_DT_MAX_BATCH_TKV_LIMIT=1024
VLLM_DT_MAX_CONTEXT_LEN=256
AFTU_PAGED_KVCACHE_NUM_BLOCKS_HINT=1024
Is there a way to decide which combination of environment variables will work? I cannot find any documentation, just comments declaring "these should probably work".
@matthew-pisano Earlier I had shared the detailed product use case internally, along with the potential fix. Summarizing it again here:
There is currently no documentation describing which parameter combination should be used. Based on the product use case, the 1-AIU case is currently running and testing successfully with the following parameters:
VLLM_DT_MAX_BATCH_SIZE=16
VLLM_DT_MAX_CONTEXT_LEN=3072
VLLM_DT_MAX_BATCH_TKV_LIMIT=131072
VLLM_DT_CHUNK_LEN=1024
AFTU_PAGED_KVCACHE_NUM_BLOCKS_HINT is not set because it is later set in the generate function using the logic here.
For the 4-AIU configuration, the current logic works fine:
AFTU_PAGED_KVCACHE_NUM_BLOCKS_HINT = 8192 if prefill_chunk_size > 0 else 2080
So in this commit, the logic added for the 1-AIU run to assign VLLM_DT_MAX_BATCH_TKV_LIMIT looks correct:
elif use_distributed and world_size == 1:
    # Only set defaults for TP=1
    context = (
        "Model granite-3.3-8b (or compatible) "
        "with tensor parallel size 1 detected"
    )
    self.tkv_limit = self._get_int_env(
        key="VLLM_DT_MAX_BATCH_TKV_LIMIT",
        default=131072,
        context=context,
    )
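For reference, a helper along these lines could back the `_get_int_env` call above (a hypothetical sketch; the name, signature, and actual implementation in `dpp_config.py` may differ):

```python
import os


def get_int_env(key: str, default: int, context: str = "") -> int:
    """Read an integer from the environment, falling back to a default.

    Hypothetical sketch of a helper like the _get_int_env referenced above;
    not the actual aiu-fms-testing-utils implementation.
    """
    raw = os.environ.get(key)
    if raw is None:
        # Env var unset: fall back to the default, noting why via the
        # context string so the choice shows up in logs.
        print(f"{context}: {key} not set, defaulting to {default}")
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"{key} must be an integer, got {raw!r}") from exc
```

With such a helper, `self.tkv_limit` resolves to `VLLM_DT_MAX_BATCH_TKV_LIMIT` when the user exports it, and to 131072 otherwise.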
However, it seems better to keep the earlier behavior of assigning self.num_blocks only for 4-AIU runs, not for 1-AIU runs, since for 1 AIU num_blocks is set later and that is already working in the current pipeline.
So this code block should be removed, and this earlier code here should be restored.
This will run 1 AIU configuration successfully with the params given here.
@Abhishek-TAMU It looks like DPP is unacceptably flaky and machine-dependent with 1 AIU. We have been unable to get the 1 AIU case working for any configuration recently. Even the parameters that I measured as good do not work. We are thinking that there may be some issues beyond the environment variables, possibly dependent on container configuration.
1- Can I ask what those errors are? Just to confirm, are you testing with this change in aiu_fms_testing_utils/utils/dpp_config.py? If you are testing with this change and are unable to get it running on your machine for verification, then we can stick with this change, since it works in the existing pipeline for the 1 AIU run. Once you make this change, we can proceed with the PR.
2- I was also testing the 4 AIU runs again and noticed this: the mapping here, attn_type_map, and passing attn_type_map[attn_type] instead of just attn_type.value in the get_default_validation_prefix() function changes the name of the CPU info file. Because of that, it no longer finds the existing CPU info files that we have already saved for all models and dtypes.
So, if this change is not mandatory, can we continue passing just attn_type.value? That would save the effort of re-saving all CPU info for the entire DPP pipeline across all models just because the file name changed.
3- When DPP tries to save new CPU info, there are a couple of errors in that path. Please test the CPU info saving flow as well. One fix is here in aiu_fms_testing_utils/testing/dpp/generation.py due to an argument mismatch, but there seem to be additional issues in that path when saving new CPU info.
bot:test
Summary
The DPP script is currently very long and complex to use. Additionally, there is no uniform way to run it within test automation.
This PR splits the original DPP script into a proper PyTest and a frontend with the old interface. Now the test can be run with pytest, without the complex and arbitrary parameters usually required. This logic has been internalized into the test to represent a single working path. Additionally, the --distributed parameter is now inferred based on whether the script is run using torchrun or not.
Running from PyTest
Note: This test must be run with 4 cards or else it may fail.
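One common way to infer a distributed launch is to check the environment variables that torchrun exports to every worker process. This is a minimal sketch of the idea, not the PR's actual detection logic, which may differ:

```python
import os


def launched_with_torchrun() -> bool:
    """Heuristic: torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for
    every worker process, so their presence implies a torchrun launch."""
    return all(var in os.environ for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK"))


def world_size() -> int:
    # Default to a single process when not launched via torchrun.
    return int(os.environ.get("WORLD_SIZE", "1"))
```

A plain `python script.py` invocation leaves these variables unset, while `torchrun --nproc-per-node=N script.py` sets them for each worker, so the script can pick the distributed code path without an explicit flag.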
The interface for the standalone script remains unchanged.
Internal changes: testing/dpp module.
Fixes: