Zixin Zhang1*, Chenfei Liao1*, Hongfei Zhang1, Harold H. Chen1, Kanghao Chen1, Zichen Wen3, Litao Guo1, Bin Ren4, Xu Zheng1, Yinchuan Li6, Xuming Hu1, Nicu Sebe5, Ying-Cong Chen1,2†
1HKUST(GZ), 2HKUST, 3SJTU, 4MBZUAI, 5UniTrento, 6Knowin
*Equal contribution    †Corresponding author
Official repository for the paper: Panoramic Affordance Prediction.
Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding.
The codebase is currently undergoing internal review and clean-up.
We plan to release the following components soon (within two weeks):
- PAP-12K Dataset (all full-resolution images, QA annotations, and segmentation masks)
- Evaluation Scripts for the benchmark
- Source Code for the PAP inference pipeline
Please stay tuned for updates!
- New Task: We introduce the First Exploration into Panoramic Affordance Prediction, overcoming the "tunnel vision" of traditional pinhole-camera-based affordance methods.
- PAP-12K Dataset (100% Real-World): A large-scale benchmark featuring 1,003 natively captured ultra-high-resolution (12K) panoramic images from diverse indoor environments, coupled with over 13,000 carefully annotated reasoning-based QA pairs with pixel-level affordance masks.
- PAP Framework: A training-free, coarse-to-fine pipeline mimicking human foveal vision to handle panoramic challenges like geometric distortion, scale variations, and boundary discontinuity.
PAP-12K is explicitly designed to encapsulate the unique challenges of 360° Equirectangular Projection (ERP) imagery. Unlike synthetic or web-crawled datasets, all 1,003 ultra-high-resolution (11904×5952) panoramic images in PAP-12K were natively captured in real-world environments using professional 360° cameras. This ensures authentic geometric distortions, lighting conditions, and natural object scales, bridging the gap between static dataset evaluation and practical robotic applications.
Key challenges captured include:
- Geometric Distortion: Objects suffer from severe stretching near the poles.
- Extreme Scale Variations: Unconstrained environments yield interactive targets that occupy only a minute fraction of the full panorama.
- Boundary Discontinuity: Continuous objects are split at image edges.
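The pole-ward stretching above follows directly from the ERP mapping: every image row spans the full 360° of longitude, so the effective horizontal magnification grows as 1/cos(latitude) and diverges at the poles. A quick NumPy illustration (the function name is ours, not part of the released code):

```python
import numpy as np

def erp_horizontal_stretch(lat_deg):
    """Horizontal stretch factor of an equirectangular image at a latitude.

    In ERP, each pixel row covers the full 360 degrees of longitude, but the
    corresponding circle on the sphere shrinks by cos(latitude), so objects
    are stretched horizontally by 1 / cos(latitude).
    """
    return 1.0 / np.cos(np.radians(lat_deg))

# Near the equator objects keep roughly true proportions (factor ~1);
# at 60 degrees latitude they appear about twice as wide (factor ~2);
# at 85 degrees the stretch already exceeds a factor of 11.
for lat in (0.0, 60.0, 85.0):
    print(lat, erp_horizontal_stretch(lat))
```

This is why the dataset's near-pole objects are so heavily deformed, and why naive 2D detectors trained on pinhole imagery struggle there.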
(Dataset download links and formatting instructions will be provided here soon.)
Our proposed PAP framework operates in three primary stages to tackle 360-degree scenes:
- Recursive Visual Routing: Uses numerical grid prompting to guide Vision-Language Models (VLMs) to dynamically "zoom in" and coarsely locate target tools.
- Adaptive Gaze: Projects the spherical region onto a tailored perspective plane to act as a domain adapter, eliminating geometric distortions and boundary discontinuities.
- Cascaded Affordance Grounding: Deploys robust 2D vision models (Open-Vocabulary Detector + SAM) within the rectified patch to extract precise, instance-level masks.
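The "Adaptive Gaze" stage is, at its core, a tangent-plane (gnomonic-style) reprojection: cast rays through a virtual pinhole camera aimed at the region of interest and sample the ERP image along them, which removes both the pole stretching and the seam discontinuity inside the patch. A minimal NumPy sketch of this idea (a simplified illustration under our own conventions, with nearest-neighbour sampling; not the paper's actual implementation):

```python
import numpy as np

def erp_to_perspective(erp, center_lon, center_lat, fov_deg, out_hw):
    """Sample a perspective patch from an equirectangular (ERP) image.

    center_lon / center_lat: viewing direction in degrees.
    fov_deg: horizontal field of view of the virtual pinhole camera.
    out_hw: (height, width) of the output patch.
    """
    H, W = erp.shape[:2]
    h, w = out_hw
    # Focal length of the virtual camera, derived from the requested FoV.
    f = (w / 2) / np.tan(np.radians(fov_deg) / 2)
    uu, vv = np.meshgrid(np.arange(w) - (w - 1) / 2,
                         np.arange(h) - (h - 1) / 2)
    # Unit rays in the camera frame (x right, y up, z forward).
    d = np.stack([uu, -vv, np.full_like(uu, f, dtype=float)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    lat0, lon0 = np.radians(center_lat), np.radians(center_lon)
    # Pitch about x so the optical axis reaches latitude lat0 ...
    ca, sa = np.cos(lat0), np.sin(lat0)
    X = d[..., 0]
    Y = d[..., 1] * ca + d[..., 2] * sa
    Z = -d[..., 1] * sa + d[..., 2] * ca
    # ... then yaw about y to longitude lon0.
    cb, sb = np.cos(lon0), np.sin(lon0)
    Xw, Zw = X * cb + Z * sb, -X * sb + Z * cb
    lon = np.arctan2(Xw, Zw)                 # in [-pi, pi)
    lat = np.arcsin(np.clip(Y, -1.0, 1.0))   # in [-pi/2, pi/2]
    # Spherical coordinates -> ERP pixel grid; longitude wraps around,
    # which is what heals boundary discontinuities inside the patch.
    px = np.rint((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    py = np.clip(np.rint((0.5 - lat / np.pi) * H).astype(int), 0, H - 1)
    return erp[py, px]
```

The rectified patch can then be handed to standard 2D models (e.g., an open-vocabulary detector followed by SAM, as in the cascaded grounding stage), whose predicted masks are mapped back to ERP coordinates by inverting the same ray correspondence.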
If you have any questions or suggestions, please feel free to contact us at zzhang300@connect.hkust-gz.edu.cn, cliao127@connect.hkust-gz.edu.cn.
