Hi,
Thanks for your great work.
I have several questions about the data and method:
- I am curious about the pipeline for generating the text lists from the HD map. Could you share more details about how the text for the multi-view and BEV images is obtained? Is that information produced by a pretrained multi-modal model, or by rules based on the HD map?
- Is the visual encoder the same for the multi-view images and the BEV point cloud images? The encoders in the paper appear to be different, but in the inference code https://github.com/LLVM-AD/MAPLM/blob/main/baseline/evaluation/inference.py#L72C29-L72C44 the image processors are the same.
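To clarify what I mean by the second question, here is a minimal, hypothetical sketch (not the actual MAPLM code): if a single shared processor object is applied to both inputs, then the two modalities necessarily receive identical preprocessing, which seems to contradict using two different encoders.

```python
# Hypothetical illustration, not the MAPLM implementation:
# a toy "image processor" that just normalizes a pixel value.

def make_processor(mean=0.5, std=0.5):
    """Return a toy image processor closure (normalization only)."""
    def process(pixel):
        # pixel: a single float in [0, 1], standing in for an image tensor
        return (pixel - mean) / std
    return process

# One shared processor applied to both modalities,
# analogous to what I see in inference.py.
shared = make_processor()
multi_view_out = shared(0.75)
bev_out = shared(0.75)

# Both modalities get the exact same preprocessing.
assert multi_view_out == bev_out
```

Is this sharing intentional, or should the BEV branch use its own processor?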