Towards Generalizable Partially Relevant Video Retrieval with Explicit and Implicit Knowledge Distillation
1. Create a conda environment and install the dependencies:
conda create -n prvr python=3.9
conda activate prvr
conda install pytorch==1.9.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
2. Download the datasets. Features for ActivityNet Captions and QVHighlights will be available for download later.
3. Set root and data_root in config files (e.g., ./Configs/act.py).
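As a quick sanity check for step 3, the sketch below verifies that the two configured directories exist before training starts. It is only an illustration: the dictionary layout, the placeholder paths, and the helper name `missing_paths` are assumptions and do not reflect the actual contents of `./Configs/act.py`.

```python
import os

# Hypothetical values: replace with the paths on your machine.
cfg = {
    "root": "/path/to/project",        # assumed: project directory (code, outputs)
    "data_root": "/path/to/features",  # assumed: directory holding the downloaded features
}


def missing_paths(cfg, keys=("root", "data_root")):
    """Return the config keys whose value is not an existing directory."""
    return [k for k in keys if not os.path.isdir(str(cfg.get(k, "")))]


if __name__ == "__main__":
    missing = missing_paths(cfg)
    if missing:
        print("Please fix these config entries:", ", ".join(missing))
    else:
        print("Config paths look fine.")
```

Running a check like this before `python main.py` can surface a wrong path early instead of partway through data loading.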
To train our method on ActivityNet Captions:
cd src
python main.py -d act --gpu 0,1
To train our method on QVHighlights:
cd src
python main.py -d qvhighlight --gpu 0,1
We provide trained checkpoints, which will be available for download from Baiduyun disk later.
The expected generalization performance of this repository on unseen data is shown below. Generalization is evaluated by training on a source dataset and testing directly on an unseen target dataset (Source → Target):
| Source → Target | R@1 | R@5 | R@10 | R@100 | SumR |
|---|---|---|---|---|---|
| QVHighlights → ActivityNet | 7.5 | 20.8 | 29.5 | 66.1 | 123.9 |
| ActivityNet → QVHighlights | 13.7 | 32.8 | 42.7 | 79.6 | 168.7 |
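For readers unfamiliar with the columns, the sketch below shows how R@K and SumR are conventionally computed in text-to-video retrieval: R@K is the percentage of queries whose ground-truth video ranks in the top K, and SumR is the sum of R@1, R@5, R@10, and R@100. This is a minimal illustration with hypothetical function names, not the evaluation code of this repository.

```python
def rank_of_target(scores, target_idx):
    """1-based rank of the ground-truth video for one query.

    `scores` holds the query's similarity to every candidate video;
    the rank is one plus the number of videos scoring strictly higher.
    """
    target_score = scores[target_idx]
    return 1 + sum(1 for s in scores if s > target_score)


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth video is ranked within the top k."""
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def sum_r(ranks, ks=(1, 5, 10, 100)):
    """SumR: the sum of recall values over the given cutoffs."""
    return sum(recall_at_k(ranks, k) for k in ks)


if __name__ == "__main__":
    # Toy example: ground-truth ranks for five queries.
    ranks = [1, 3, 7, 50, 200]
    for k in (1, 5, 10, 100):
        print(f"R@{k}: {recall_at_k(ranks, k):.1f}")
    print(f"SumR: {sum_r(ranks):.1f}")
```

Under this definition, the SumR column in each row is the sum of the four recall columns, up to rounding.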
The expected in-domain performance of this repository, with training and testing on the same dataset, is:
CNN-based:
| Dataset | R@1 | R@5 | R@10 | R@100 | SumR |
|---|---|---|---|---|---|
| ActivityNet Captions | 8.9 | 27.8 | 40.5 | 79.6 | 156.8 |
| QVHighlights | 10.3 | 28.2 | 40.5 | 81.3 | 160.3 |
CLIP-based:
| Dataset | R@1 | R@5 | R@10 | R@100 | SumR |
|---|---|---|---|---|---|
| ActivityNet Captions | 15.0 | 36.4 | 49.7 | 84.0 | 185.1 |
| QVHighlights | 22.3 | 47.8 | 59.2 | 91.6 | 220.9 |
