Description
Hi,
First of all, thank you very much for your excellent work and for open-sourcing the code!
According to the paper and the codebase, during the joint training stage you use ReconthenUndIterableDataset to load the SFT training data for spatial understanding datasets such as SPAR-7M, Omnispatial, Mindcube, and OST-Bench. In addition, SftJSONLIterableDataset is used for general VQA datasets like LLaVA-One-Vision.
I have a couple of questions regarding the data pipeline:
- It seems that ReconthenUndIterableDataset requires additional 3D annotations such as depth maps and camera poses. However, as far as I know, the training sets of Omnispatial and OST-Bench do not provide such annotations. In this case, should these two datasets be loaded through SftJSONLIterableDataset and trained in a purely 2D manner instead?
- Would it be possible to share a complete example of the SFT training data configuration (e.g., dataset config, mixing strategy, or YAML example) for the joint training stage?
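To make the first question concrete, here is a rough sketch of the kind of purely 2D JSONL record I have in mind for SftJSONLIterableDataset; the field names and conversation format are my assumption (borrowed from common LLaVA-style SFT data), not taken from your codebase, so please correct me if the expected schema differs.

```python
import json

# Hypothetical purely 2D SFT sample (no depth maps or camera poses),
# e.g. for Omnispatial or OST-Bench training data.
# Field names are assumed, following common LLaVA-style conventions.
sample = {
    "image": "omnispatial/images/000123.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nHow many chairs are to the left of the table?"},
        {"from": "gpt",
         "value": "Two chairs are to the left of the table."},
    ],
}

# One record per line in the JSONL file.
line = json.dumps(sample)
print(line)
```

If this is roughly the right shape, then question 1 reduces to whether dropping the 3D annotations for these two datasets is what you did in practice, or whether you reconstructed depth/poses for them some other way.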
Thank you very much for your time and help!