🎉 This repository is the official implementation of the paper "Unified Hierarchical Contrastive Learning for Video Caption". The paper has been accepted by Information Fusion, a JCR Q1 and SCI Q1 journal.
✨ The proposed method enhances the quality and distinctiveness of video caption generation through a unified hierarchical contrastive learning framework, without introducing additional inference overhead.
Execute the scripts below in the main folder to avoid a download conflict during distributed pretraining.
mkdir modules/bert-base-uncased
cd modules/bert-base-uncased/
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
mv bert-base-uncased-vocab.txt vocab.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar -xvf bert-base-uncased.tar.gz
rm bert-base-uncased.tar.gz
cd ../../
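Optionally, verify that the vocabulary and the extracted BERT weights are in place (a minimal sanity check; the archive typically contains bert_config.json and pytorch_model.bin):
ls modules/bert-base-uncased/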
Prepare the conda environment:
conda create -n clip4caption python=3.6.9 tqdm boto3 requests pandas
conda activate clip4caption
pip install torch==1.10.2 torchvision --extra-index-url https://download.pytorch.org/whl/cu113
pip install git+https://github.com/Maluuba/nlg-eval.git@master
pip install pycocoevalcap
pip install pickle5
pip install opencv-python==4.5.5.62
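Optionally, confirm that PyTorch was installed with CUDA support before launching distributed training (a quick sanity check):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"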
Download the pretrained weight of UniVL:
mkdir -p ./weight
wget -P ./weight https://github.com/microsoft/UniVL/releases/download/v0/univl.pretrained.bin
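Optionally, check that the weight file landed in ./weight before continuing:
ls -lh ./weight/univl.pretrained.bin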
Download the pretrained weights of our UHCL and the extracted data features following the CLIP4Caption reproduction. Place the files into the corresponding folders, and then you can start training and inference:
Link: https://pan.baidu.com/s/13MliRc4gIlOSpyPlZcOo4w?pwd=UHCL
Code: UHCL
or
https://www.alipan.com/s/9XXq8RwiZ9S
Code: 98ub
You may need to modify the corresponding scripts as per your needs.
cd scripts
# MSVD Dataset
bash train_msvd.sh
bash eval_msvd.sh
# MSRVTT Dataset
bash train_msrvtt.sh
bash eval_msrvtt.sh
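If your machine has a different number of GPUs than the scripts assume, a common adjustment is to restrict the visible devices when launching; a minimal sketch (the per-node process count set inside the scripts may also need to be edited to match):
# Example: run MSR-VTT training on GPUs 0 and 1 only
CUDA_VISIBLE_DEVICES=0,1 bash train_msrvtt.sh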
Thanks for visiting ✨ UHCL!
UHCL is for educational, research, and technical exchange purposes only.
This repository is implemented based on the CLIP4Caption reproduction, Chat-UniVi, UniVL, and CLIP4Clip.
# Reproducing CLIP4Caption
Note: This implementation does not use the TSN sampling described in the CLIP4Caption paper. However, even without TSN sampling, i.e., using only the original sampling method from CLIP4Clip, it was found that similar (or even slightly better) results can be achieved compared to those reported in the CLIP4Caption paper. While reproducing the results, it was also observed that using TSN sampling could not match the performance reported in the paper.
Paper: Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, and Xiu Li. 2021. CLIP4Caption: CLIP for Video Caption. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21). Association for Computing Machinery, New York, NY, USA, 4858–4862. https://dl.acm.org/doi/10.1145/3474085.3479207
The setup steps are the same as described above. Follow the instructions written here.
The shell scripts to train and to evaluate the model are provided here. You may need to modify the scripts as per your needs.
