From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta¹ | Jay Parmar¹ | Ishan Rajendrakumar Dave² | Mubarak Shah¹

¹University of Central Florida ²Adobe

Accepted to the NeurIPS 2025 Datasets and Benchmarks Track

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on temporal aspects link each query to a single target segment taken from the same video, which limits their practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state of the art from 19.83 to 27.22.
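Because each query in TF-CoVR can have several valid targets, results are reported as mAP@50 rather than as a single-target recall. The following is a minimal sketch of how mAP@K could be computed for multi-target retrieval. The function names, the video IDs, and the normalization by `min(number of targets, k)` are assumptions made for illustration; the evaluation code in this repository may use a different convention.

```python
from typing import Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 50) -> float:
    """AP@k for one query that may have several valid target videos."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Walk the top-k ranked candidates and accumulate precision at each hit.
    for rank, video_id in enumerate(ranked_ids[:k], start=1):
        if video_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Assumed convention: normalize by the number of targets that could fit in the top k.
    return precision_sum / min(len(relevant_ids), k)


def mean_average_precision_at_k(rankings, relevants, k: int = 50) -> float:
    """Mean of AP@k over all queries; returns 0.0 when there are no queries."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical rankings for two queries (IDs are made up for the example).
rankings = [["v3", "v1", "v7", "v2"], ["a", "b"]]
relevants = [{"v1", "v2"}, {"a"}]
score = mean_average_precision_at_k(rankings, relevants, k=50)
```

In this toy example the first query finds its two targets at ranks 2 and 4, giving an AP of 0.5, and the second query finds its only target at rank 1, giving an AP of 1.0.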

Environment Setup

cd TF-CoVR/
conda create -n tfcovr python=3.10 -y
conda activate tfcovr
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
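After installation, a quick check can confirm that the key packages are importable in the `tfcovr` environment. This is a small sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module list (`torch`, `clip`) is an assumption based on the CLIP install step above, since the contents of requirements.txt are not shown here.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module list: "clip" is the import name of the openai/CLIP package,
    # which depends on "torch". Adjust to match requirements.txt.
    print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
    print("Missing modules:", missing_modules(["torch", "clip"]))
```

An empty list for the missing modules suggests the environment is ready for the steps below.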

Pretrained weights

Please download our stage 1 pretrained weights from Google Drive here.
Please download our stage 2 pretrained weights from Google Drive here.

Dataset

Please follow the instructions in DATASET.md to access the dataset.

AIM Embeddings

Please follow DATASET.md to get access to the original videos and convert them to mp4 format.
Then update the videos path and the path for saving embeddings in aim_emb.py.

Please run the following command to generate the embeddings:

cd AIM_Embeddings
python aim_emb.py model.ckpt.path="stage-1-checkpoint-path"
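Since the embedding step expects the videos in mp4 format, it may help to check the videos directory for leftover files in other formats before running aim_emb.py. The sketch below is an assumption-laden helper, not part of the repository: the function names and the file names in the demo are hypothetical, and it only inspects file extensions, not the actual video encoding.

```python
import tempfile
from pathlib import Path


def non_mp4_files(video_dir):
    """Return sorted paths of regular files under video_dir whose suffix is not .mp4."""
    root = Path(video_dir)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix.lower() != ".mp4")


def demo():
    """Build a throwaway directory with hypothetical file names and list the offenders."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "clip_a.mp4").touch()
        (Path(tmp) / "clip_b.webm").touch()
        return [p.name for p in non_mp4_files(tmp)]
```

Calling `non_mp4_files` on the same videos path that is set in aim_emb.py should return an empty list once the conversion is complete.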

Training

For reproducing results on TF-CoVR using TF-CoVR-Base

Run the following command:
python train.py data=finegd-covr-aim trainer=gpu model=aim model/ckpt=aim test=finegd-test-aim

Testing

python test.py data=finegd-covr-aim trainer=gpu model=aim_clip model/ckpt=aim test=finegd-test-aim-clip machine.num_workers=8 trainer.max_epochs=100 model.ckpt.path=/checkpoint/path/
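The arguments in the training and testing commands look like Hydra-style overrides, although this README does not state which configuration library is used. Under that assumption, the sketch below illustrates how such arguments could be read: keys without a dot (such as `model=aim_clip` or `model/ckpt=aim`) select a config file from a group, while dotted keys (such as `trainer.max_epochs=100`) set a single nested value. The function `split_overrides` and its dot-based heuristic are hypothetical and simplified; a real config library decides this from the config directory layout.

```python
def split_overrides(args):
    """Split key=value arguments into group selections and nested value overrides.

    Assumed heuristic: a key containing "." is a value override; any other key
    is treated as a config-group selection. Values are kept as strings.
    """
    groups = {}
    values = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        if "." not in key:
            groups[key] = value
            continue
        # Walk or create nested dictionaries for each dotted component.
        node = values
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return groups, values


# The arguments from the testing command above.
test_args = [
    "data=finegd-covr-aim", "trainer=gpu", "model=aim_clip", "model/ckpt=aim",
    "test=finegd-test-aim-clip", "machine.num_workers=8",
    "trainer.max_epochs=100", "model.ckpt.path=/checkpoint/path/",
]
groups, values = split_overrides(test_args)
```

Read this way, the testing command selects five config groups and overrides three individual values, including the checkpoint path that you need to replace with your own.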

Citation

If you use this dataset and/or this code in your work, please cite our paper:

@misc{gupta2025playreplaycomposedvideo,
      title={From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos}, 
      author={Animesh Gupta and Jay Parmar and Ishan Rajendrakumar Dave and Mubarak Shah},
      year={2025},
      eprint={2506.05274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05274}, 
}

🙏 Acknowledgements

This repository has borrowed code from CoVR. We thank the authors for releasing their code.

