
Panoramic Affordance Prediction

Zixin Zhang1*, Chenfei Liao1*, Hongfei Zhang1, Harold H. Chen1, Kanghao Chen1, Zichen Wen3, Litao Guo1, Bin Ren4, Xu Zheng1, Yinchuan Li6, Xuming Hu1, Nicu Sebe5, Ying-Cong Chen1,2†

1HKUST(GZ), 2HKUST, 3SJTU, 4MBZUAI, 5UniTrento, 6Knowin

*Equal contribution    †Corresponding author

Project Page Paper Dataset

Official repository for the paper: Panoramic Affordance Prediction.

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding.


*(Figure: teaser)*

🚀 Right Around the Corner!

The codebase is currently undergoing internal review and clean-up.

We plan to release the following components within the next two weeks:

  • PAP-12K Dataset (all full-resolution images, QA annotations, and segmentation masks)
  • Evaluation Scripts for the benchmark
  • Source Code for the PAP inference pipeline

Please stay tuned for updates!


🌟 Highlights

  • New Task: We introduce the first exploration into Panoramic Affordance Prediction, overcoming the "tunnel vision" of traditional pinhole-camera-based affordance methods.
  • PAP-12K Dataset (100% Real-World): A large-scale benchmark featuring 1,003 natively captured ultra-high-resolution (12K) panoramic images from diverse indoor environments, coupled with over 13,000 carefully annotated reasoning-based QA pairs with pixel-level affordance masks.
  • PAP Framework: A training-free, coarse-to-fine pipeline mimicking human foveal vision to handle panoramic challenges like geometric distortion, scale variations, and boundary discontinuity.

📊 Dataset (PAP-12K)

PAP-12K is explicitly designed to encapsulate the unique challenges of 360° Equirectangular Projection (ERP) imagery. Unlike synthetic or web-crawled datasets, all 1,003 ultra-high-resolution (11904×5952) panoramic images in PAP-12K were natively captured in real-world environments using professional 360° cameras. This ensures authentic geometric distortions, lighting conditions, and natural object scales, bridging the gap between static dataset evaluation and practical robotic applications.

Key challenges captured include:

  • Geometric Distortion: Objects suffer from severe stretching near the poles.
  • Extreme Scale Variations: Unconstrained environments yield interactive targets at vastly different scales, down to extremely small objects.
  • Boundary Discontinuity: Continuous objects are split at image edges.
*(Figure: dataset challenges)*
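The pole distortion and boundary discontinuity above follow directly from ERP geometry. As a minimal sketch (assuming the standard equirectangular layout, with longitude spanning the image width and latitude the height), the horizontal stretch of an object grows as 1/cos(latitude), which is why objects near the poles appear severely elongated:

```python
import numpy as np

def erp_pixel_to_angles(u, v, width, height):
    """Map an ERP pixel (u, v) to (longitude, latitude) in radians.

    Assumes the standard equirectangular layout: longitude spans
    [-pi, pi] across the width, latitude spans [pi/2, -pi/2] down
    the height.
    """
    lon = (u / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - v / height) * np.pi
    return lon, lat

def horizontal_stretch(lat):
    """Horizontal stretch factor of ERP at a given latitude.

    Relative to the equator, an object at latitude `lat` appears
    stretched horizontally by 1 / cos(lat) -- the source of the
    severe distortion near the poles.
    """
    return 1.0 / np.cos(lat)

# No stretch at the equator; at 60 degrees latitude objects appear
# twice as wide.
print(horizontal_stretch(0.0))                   # -> 1.0
print(round(horizontal_stretch(np.pi / 3), 3))   # -> 2.0
```

The same layout also explains the boundary discontinuity: longitude ±π map to the left and right image edges, so an object straddling that meridian is split in two in pixel space even though it is continuous on the sphere.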

(Dataset download links and formatting instructions will be provided here soon.)


๐Ÿ› ๏ธ Method Overview

Our proposed PAP framework operates in three primary stages to tackle 360-degree scenes:

  1. Recursive Visual Routing: Uses numerical grid prompting to guide Vision-Language Models (VLMs) to dynamically "zoom in" and coarsely locate target tools.
  2. Adaptive Gaze: Projects the spherical region onto a tailored perspective plane to act as a domain adapter, eliminating geometric distortions and boundary discontinuities.
  3. Cascaded Affordance Grounding: Deploys robust 2D vision models (Open-Vocabulary Detector + SAM) within the rectified patch to extract precise, instance-level masks.
*(Figure: PAP pipeline)*
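The rectification at the heart of the Adaptive Gaze stage can be sketched as a standard ERP-to-perspective (gnomonic) projection. The function below is a minimal NumPy illustration under assumed conventions, not the paper's implementation: the `yaw`/`pitch`/`fov_deg`/`out_size` parameters, the camera frame (x right, y down, z forward), and nearest-neighbor sampling are all choices made here for brevity.

```python
import numpy as np

def erp_to_perspective(erp, yaw, pitch, fov_deg, out_size):
    """Sample a rectified pinhole view from an ERP panorama.

    `erp` is an HxWx3 array; `yaw`/`pitch` give the viewing
    direction in radians; `fov_deg` is the horizontal field of
    view; `out_size` is the square output resolution. Inverse
    mapping: for each output pixel, cast a ray, convert it to
    spherical angles, and look up the corresponding ERP pixel.
    """
    h, w = erp.shape[:2]
    n = out_size
    f = (n / 2) / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels

    # Pixel grid on the image plane, centered at the principal point.
    xs, ys = np.meshgrid(np.arange(n) - n / 2, np.arange(n) - n / 2)
    # Camera-frame rays: x right, y down, z forward.
    rays = np.stack([xs, ys, np.full_like(xs, f, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Rotate rays by pitch (about x), then yaw (about y).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rot_x = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rays = rays @ (rot_y @ rot_x).T

    # Ray direction -> spherical angles -> ERP pixel coordinates.
    lon = np.arctan2(rays[..., 0], rays[..., 2])
    lat = np.arcsin(np.clip(-rays[..., 1], -1.0, 1.0))
    u = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w  # wrap the seam
    v = np.clip(((0.5 - lat / np.pi) * h).astype(int), 0, h - 1)
    return erp[v, u]
```

Because the longitude lookup wraps modulo the panorama width, a view aimed at the ±π meridian comes out seamless, which is exactly why this rectification removes the boundary discontinuity before the 2D detector and SAM are applied.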

📧 Contact

If you have any questions or suggestions, please feel free to contact us at zzhang300@connect.hkust-gz.edu.cn, cliao127@connect.hkust-gz.edu.cn.
