Use separate Conda environments for each stage to avoid dependency conflicts.
Create the `llamafactory` environment:

```bash
conda create -n llamafactory python=3.12
conda activate llamafactory
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install -r requirements-llamafactory.txt
```

Create the `verl` environment:

```bash
conda create -n verl python=3.10
conda activate verl
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -r requirements-verl.txt
```

Create the `visualization` environment:

```bash
conda create -n visualization python=3.10
conda activate visualization
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -r requirements-visualization.txt
```
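As a quick optional sanity check, you can verify from inside each activated environment that the CUDA build of PyTorch is usable and that the FlashAttention wheel imports cleanly (`flash_attn` is the package's standard module name):

```bash
# Optional check inside an activated environment: prints the torch version
# and whether a CUDA device is visible, and confirms flash-attn imports.
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"
```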
Prepare the training data before starting model training.
PACS:

```bash
bash ./scripts/pacs/pacs_process.sh
```

VGG:

```bash
bash ./scripts/vgg/vgg_process.sh
```
Run the following scripts for two-stage training.

PACS:

```bash
bash ./scripts/pacs/pacs_train_stage1.sh
bash ./scripts/pacs/pacs_train_stage2.sh
```

VGG:

```bash
bash ./scripts/vgg/vgg_train_stage1.sh
bash ./scripts/vgg/vgg_train_stage2.sh
```
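Since each stage has its own Conda environment, a typical PACS run switches environments between stages. The stage-to-environment mapping below is an assumption based on the environment names (LLaMA-Factory for stage-1 SFT, verl for stage-2 RL):

```bash
# Assumed mapping: stage 1 runs in the llamafactory env, stage 2 in the verl env.
conda activate llamafactory
bash ./scripts/pacs/pacs_train_stage1.sh

conda activate verl
bash ./scripts/pacs/pacs_train_stage2.sh
```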
> [!IMPORTANT]
> If the reward stops improving during stage-2 RL training, run the script below to replace the non-weight files, then resume training:
>
> ```bash
> bash ./scripts/replace_non_weight_files.sh
> ```
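For example, a plateaued PACS run might be recovered as sketched below; that re-running the stage-2 script resumes from the latest checkpoint is an assumption, as the note above only prescribes running the replacement script before resuming:

```bash
# Recovery-flow sketch (resume behavior of the stage-2 script is assumed).
bash ./scripts/replace_non_weight_files.sh
bash ./scripts/pacs/pacs_train_stage2.sh
```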
Run the following scripts to evaluate model performance.

PACS:

```bash
bash ./scripts/pacs/pacs_test.sh
```

VGG:

```bash
bash ./scripts/vgg/vgg_test.sh
```
If you only need visualization for the face recognition scenario, you can directly use the provided checkpoint at `./model/qwen_to_clip_projector.pt`:

```bash
bash ./scripts/visualization/test.sh
```

If you want visualization support for other scenarios, train the mapping network first:

```bash
bash ./scripts/visualization/train.sh
```
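For a new scenario, a minimal end-to-end sketch is shown below; the use of the `visualization` environment here, and that `test.sh` picks up the newly trained mapping network, are both assumptions:

```bash
# Assumption: visualization scripts run in the visualization env.
conda activate visualization
bash ./scripts/visualization/train.sh   # train the mapping network for the new scenario
bash ./scripts/visualization/test.sh    # then visualize with the trained projector
```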
Our framework builds upon the excellent work of: