This is an official implementation of the visual-language navigation task in InternVideo.
We currently provide evaluation of our pretrained model.
- Please follow https://github.com/jacobkrantz/VLN-CE to install Habitat Simulator and Habitat-lab. We use Python 3.6 in our experiments (a setup sketch covering this and the CLIP install follows this list).
- Follow https://github.com/openai/CLIP to install CLIP.
- Follow https://github.com/jacobkrantz/VLN-CE to download the Matterport3D environment to `data/scene_datasets`. Data should have the form `data/scene_datasets/mp3d/{scene}/{scene}.glb`.
- Download the preprocessed VLN-CE dataset from https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/vln/dataset.zip to `data/datasets`. Data should have the form `data/datasets/R2R_VLNCE_v1-2_preprocessed_BERTidx/{split}` and `data/datasets/R2R_VLNCE_v1-2_preprocessed/{split}`.
- Download the pretrained models from https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/vln/pretrained.zip to `pretrained`. It should contain the folders `pretrained/pretrained_models`, `pretrained/VideoMAE`, `pretrained/wp_pred`, `pretrained/ddppo-models`, and `pretrained/Prevalent` (a download sketch follows this list).
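A minimal environment-setup sketch, assuming a conda environment named `vlnce` and the habitat-sim/habitat-lab 0.1.7 pins used by VLN-CE; the environment name and version pins are assumptions, so defer to the VLN-CE and CLIP instructions if they differ:

```bash
# Python 3.6 environment (the version used in our experiments); the name "vlnce" is arbitrary.
conda create -n vlnce python=3.6 -y
conda activate vlnce

# Habitat Simulator and Habitat-lab; version 0.1.7 is an assumption based on VLN-CE,
# check the VLN-CE instructions for the exact pins.
conda install -y habitat-sim=0.1.7 headless -c conda-forge -c aihabitat
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -r requirements.txt
python setup.py develop --all   # installs habitat and habitat_baselines
cd ..

# CLIP, following https://github.com/openai/CLIP
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
```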
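And a sketch of fetching the preprocessed dataset and pretrained checkpoints into the expected layout; whether `unzip` needs the `-d` targets below depends on how the archives are structured, so verify the resulting paths against the bullets above:

```bash
# Matterport3D scenes must be obtained separately (see the VLN-CE instructions);
# they should end up at data/scene_datasets/mp3d/{scene}/{scene}.glb.

# Preprocessed VLN-CE episodes.
mkdir -p data/datasets
wget https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/vln/dataset.zip
unzip dataset.zip -d data/datasets
# expected: data/datasets/R2R_VLNCE_v1-2_preprocessed_BERTidx/{split}
#           data/datasets/R2R_VLNCE_v1-2_preprocessed/{split}

# Pretrained models.
wget https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/vln/pretrained.zip
unzip pretrained.zip -d pretrained
# expected: pretrained/pretrained_models, pretrained/VideoMAE, pretrained/wp_pred,
#           pretrained/ddppo-models, pretrained/Prevalent
```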
To evaluate the agent, run `bash eval_**.sh`. To start training (6 GPUs), run `bash train.bash`.