We introduce UI-RFT, the first framework utilizing rule-based RL to enhance VLMs' GUI grounding capabilities.
This work is the course project for CS3316 Reinforcement Learning.
- Reinforced fine-tuning with only 128 high-quality samples significantly enhances GUI grounding.
- GUI grounding is a fundamental visual ability of VLMs, and it can be improved without long reasoning chains.
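
To make "rule-based RL" for GUI grounding more concrete, here is a minimal sketch of one plausible rule-based reward: the model predicts a click point, and the reward is 1 if that point falls inside the target element's bounding box and 0 otherwise. This is an illustrative assumption, not necessarily the reward used by UI-RFT; the function name `point_in_box_reward` and the `(x_min, y_min, x_max, y_max)` box format are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]
# Assumed box format: (x_min, y_min, x_max, y_max) in the same pixel space as the point.
Box = Tuple[float, float, float, float]


def point_in_box_reward(pred: Point, target: Box) -> float:
    """Hypothetical rule-based reward: 1.0 if the predicted click point
    lies inside (or on the edge of) the target box, else 0.0."""
    x, y = pred
    x_min, y_min, x_max, y_max = target
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 1.0 if inside else 0.0


if __name__ == "__main__":
    button = (100.0, 200.0, 180.0, 240.0)
    print(point_in_box_reward((120.0, 210.0), button))  # point inside the box
    print(point_in_box_reward((10.0, 10.0), button))    # point outside the box
```

Because the reward is computed by a fixed rule rather than a learned reward model, it needs only a ground-truth box per sample, which fits training on a small set of samples.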
To train the VLM with verl:

    ./train.sh

To test the VLM on ScreenSpot:

    python ./screenspot/test.py

To test the VLM on ScreenSpot-Pro:

    python ./screenspot/test-pro.py

We would like to express our sincere gratitude to Yan Ma for his invaluable and highly insightful discussions.