FALCON | From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors (ICLR 2026)
Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu*†, Xinghang Li*, Pan Zhou*
*Corresponding Author  †Project Lead
ByteDance Seed
National University of Singapore · Nanyang Technological University · Tsinghua University · Singapore Management University
---
[26/01/2026] Thrilled to share that our paper has been accepted to ICLR 2026! Code will be open-sourced soon. Stay tuned!
---
[20/10/2025] Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head of a VLA model, enabling robust spatial understanding and SOTA performance across diverse manipulation tasks without disrupting vision-language alignment. See our paper here.
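To make the paradigm above concrete, here is a minimal, hypothetical sketch of the idea: tokens from a 3D spatial encoder are concatenated with the vision-language tokens only at the action head, so the VL backbone's alignment is untouched. All function names, token shapes, and values below are illustrative placeholders, not the actual FALCON implementation.

```python
# Toy sketch of injecting 3D spatial tokens into a VLA action head.
# Every component here is a stand-in; shapes and values are illustrative.

def spatial_encoder(observation):
    # Stand-in for a 3D spatial foundation model producing spatial tokens.
    return [[0.1, 0.2], [0.3, 0.4]]  # two spatial tokens of dim 2

def vl_backbone(image, instruction):
    # Stand-in for the (unchanged) 2D vision-language backbone.
    return [[1.0, 0.0], [0.0, 1.0]]  # two VL tokens of dim 2

def action_head(tokens):
    # Toy action head: mean-pool all incoming tokens into one action vector.
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def falcon_style_policy(image, instruction, observation):
    vl_tokens = vl_backbone(image, instruction)    # VL path is unmodified
    spatial_tokens = spatial_encoder(observation)  # 3D spatial priors
    # Injection happens only at the action head, preserving VL alignment.
    return action_head(vl_tokens + spatial_tokens)

action = falcon_style_policy(None, "pick up the cube", None)
print(action)
```

The key design point the sketch mirrors: the spatial tokens enter as extra inputs to the action head rather than being fused inside the VL backbone.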
For more sim/real-world benchmark results, please refer to our paper.
- Release the code and models of FALCON.
- Release the CALVIN & SimplerEnv evaluation code and model weights for FALCON series.
- Release pre-training / fine-tuning code for FALCON series.
- Release the code for real-world deployment of FALCON via ManiUniCon.
If you find this project useful in your research, please consider citing:
@article{zhang2025spatial,
  title={From spatial to actions: Grounding vision-language-action model in spatial foundation priors},
  author={Zhang, Zhengshen and Li, Hao and Dai, Yalun and Zhu, Zhengbang and Zhou, Lei and Liu, Chenchen and Wang, Dong and Tay, Francis EH and Chen, Sijin and Liu, Ziwei and others},
  journal={arXiv preprint arXiv:2510.17439},
  year={2025}
}



