FALCON_logo

FALCON | From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors (ICLR 2026)

[arXiv] [Website] [HF Paper: FALCON] [HF Model: FALCON]
[Python 3.8] [PyTorch]

Zhengshen Zhang   Hao Li   Yalun Dai   Zhengbang Zhu   Lei Zhou  
Chenchen Liu   Dong Wang   Francis E. H. Tay   Sijin Chen  
Ziwei Liu   Yuxiao Liu*†   Xinghang Li*   Pan Zhou*  

*Corresponding Author  †Project Lead


ByteDance Seed
National University of Singapore   Nanyang Technological University
Tsinghua University   Singapore Management University


FALCON_teaser

Updates 🚀🚀🚀

  • [26/01/2026] 🎊 Thrilled to share that our paper has been accepted to ICLR 2026! Code will be open-sourced soon. Stay tuned!

  • [20/10/2025] Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. In this work, we introduce FALCON (From Spatial to Actions), a novel paradigm that injects rich 3D spatial tokens into the action head of a VLA model, enabling robust spatial understanding and SOTA performance across diverse manipulation tasks without disrupting vision-language alignment (a minimal illustration of this injection idea is sketched below). See our paper here.
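
To make the injection idea concrete, here is a minimal PyTorch sketch of the general mechanism: the action head's queries cross-attend over 3D spatial tokens from a spatial foundation encoder, fused through a residual connection so the vision-language features pass through unchanged. Every name and dimension here (SpatialTokenInjector, d_model, action_dim, the toy tensor shapes) is an illustrative placeholder, not FALCON's actual architecture; the official implementation will ship with the code release.

import torch
import torch.nn as nn

class SpatialTokenInjector(nn.Module):
    """Toy action head: action queries cross-attend over 3D spatial tokens."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, action_dim: int = 7):
        super().__init__()
        # Queries come from the action head; keys/values are the spatial tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.to_action = nn.Linear(d_model, action_dim)  # e.g. 6-DoF pose + gripper

    def forward(self, action_queries, spatial_tokens):
        # action_queries: (B, T, d_model) decoded from the VLM backbone
        # spatial_tokens: (B, N, d_model) from a spatial foundation encoder
        attended, _ = self.cross_attn(action_queries, spatial_tokens, spatial_tokens)
        # Residual + norm: the vision-language pathway is preserved, spatial
        # evidence is only added on top, so VL alignment is not disrupted.
        fused = self.norm(action_queries + attended)
        return self.to_action(fused)

# Toy usage with random tensors standing in for real encoder outputs.
head = SpatialTokenInjector()
actions = head(torch.randn(2, 4, 512), torch.randn(2, 64, 512))  # -> (2, 4, 7)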

Contents

  • 💪 Benchmark Performance Comparison
  • 🗒️ TODO List
  • 🖊️ Citation

💪 Benchmark Performance Comparison

CALVIN Benchmark

calvin

SimplerEnv WidowX Robot Experiments

simpler

SimplerEnv Google Robot Experiments

simpler

Real-World Experiments

real-world

💡 For more sim/real-world benchmark results, please refer to our paper.

πŸ—’οΈ TODO List

  • Release the code and model of FALCON.
  • Release the CALVIN & SimplerEnv evaluation code and model weights for FALCON series.
  • Release pre-training / fine-tuning code for FALCON series.
  • Release the code for real-world deployment of FALCON via ManiUniCon.

πŸ–ŠοΈ Citation

If you find this project useful in your research, please consider citing:

@article{zhang2025spatial,
  title={From spatial to actions: Grounding vision-language-action model in spatial foundation priors},
  author={Zhang, Zhengshen and Li, Hao and Dai, Yalun and Zhu, Zhengbang and Zhou, Lei and Liu, Chenchen and Wang, Dong and Tay, Francis EH and Chen, Sijin and Liu, Ziwei and others},
  journal={arXiv preprint arXiv:2510.17439},
  year={2025}
}
