📖 EnvisionBox Module — Full Documentation
Authors
Davide Ahmar — ahmar.davide@gmail.com
Wim Pouw — wim.pouw@donders.ru.nl
Babajide Owoyele — babajide.owoyele@hpi.de
This repository provides a user-friendly Python application built on Meta AI’s SAM2 model for object tracking and overlap (“looking at”) detection in videos.
The tool was developed as part of the EnvisionBOXBABY project, with a focus on analyzing infant–adult interactions using videos recorded from an infant’s head-mounted camera. However, it can be used for any scenario where you want to annotate objects and detect when one target object overlaps with others.
- 🖼️ Interactive annotation: select a reference frame, click to add positive/negative points, and name each object.
- 🎯 Target detection: any object named with "target" (case-insensitive) is treated as the gaze/marker object.
- 🔍 Event detection: logs “looking at” events whenever the target overlaps another object:
  - By pixel overlap above a threshold
  - Or by centroid inclusion
- 📂 Outputs:
  - Annotated video with masks and status overlays
  - Frame-by-frame CSV with bounding boxes, centroids, overlap info
  - Time-aligned ELAN (.eaf) file for qualitative coding
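The two event criteria listed above (pixel overlap and centroid inclusion) can be illustrated with a minimal sketch. This is not the application's actual code: masks are plain nested lists of 0/1 rather than SAM2 output, all function names are hypothetical, and measuring overlap as a fraction of the target's own area is an assumption, since the description only says "pixel overlap above a threshold".

```python
def mask_area(mask):
    """Count foreground pixels in a binary mask (list of rows of 0/1)."""
    return sum(1 for row in mask for value in row if value)


def overlap_fraction(target, other):
    """Share of target pixels that also fall inside `other`.

    Assumption: the denominator is the target's area; the real tool may
    normalise differently.
    """
    area = mask_area(target)
    if area == 0:
        return 0.0
    shared = sum(
        1
        for t_row, o_row in zip(target, other)
        for t, o in zip(t_row, o_row)
        if t and o
    )
    return shared / area


def centroid(mask):
    """Return the (row, col) centroid of a mask, or None if it is empty."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, value in enumerate(row) if value]
    if not pixels:
        return None
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def is_looking_at(target, other, threshold=0.10, use_centroid=False):
    """Apply one of the two criteria: centroid inclusion or pixel overlap."""
    if use_centroid:
        point = centroid(target)
        if point is None:
            return False
        row, col = round(point[0]), round(point[1])
        return bool(other[row][col])
    return overlap_fraction(target, other) > threshold


# Toy 5x5 frame: a 3x3 target in the middle, another object on the right edge.
target = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
other = [[1 if c >= 3 else 0 for c in range(5)] for r in range(5)]

by_overlap = is_looking_at(target, other, threshold=0.10)
by_centroid = is_looking_at(target, other, use_centroid=True)
```

In this toy frame a third of the target's pixels fall on the other object, so the overlap criterion fires, while the target's centroid lies outside the object, so the centroid criterion does not. The centroid rule is therefore the stricter of the two for a target that only grazes an object.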
Click the green Code button (top right) → Download ZIP → extract it to a folder (e.g., C:\EnvisionObjectAnnotator).
Or use git:
git clone https://github.com/DavAhm/EnvisionObjectAnnotator.git
cd EnvisionObjectAnnotator
Follow the installation guide for SAM2: SAM 2 Installation Instructions →
Follow the installation guide for Tools and Packages: Tools and Packages Installation Instructions →
- Load your video → supports .mp4, .mov, .avi, etc.
- Pick a reference frame → usually frame 0.
- Annotate objects:
  - Left-click = positive point
  - Right-click = negative point
  - Press C to name the object (must contain "target" for gaze markers)
  - Press T to test masks
  - Press Enter when done
- Set detection threshold → default is 10% overlap.
- Process video → masks are propagated, overlaps are detected, and outputs are generated.
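The naming rule in the annotation step decides which objects act as gaze markers. A minimal sketch of that rule follows, assuming a simple case-insensitive substring check; the function name `split_by_role` and the example object names are hypothetical and not taken from the application.

```python
def split_by_role(names):
    """Separate gaze/marker objects from the other annotated objects.

    Assumption: a name counts as a target if it contains "target"
    anywhere, in any letter case.
    """
    targets = [name for name in names if "target" in name.lower()]
    others = [name for name in names if "target" not in name.lower()]
    return targets, others


targets, others = split_by_role(["Target_gaze", "cup", "adult face", "TARGET2"])
```

Under this reading, names such as "Target_gaze" and "TARGET2" are both treated as targets, so it is worth avoiding the word "target" in the names of ordinary objects.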
- Annotated video: shows objects with color-coded masks and on-screen event labels
- CSV file: frame-by-frame details with bounding boxes, centroids, areas, and overlaps
- ELAN file: time-aligned tiers with “Looking at: [object]” events for qualitative coding
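To show how frame-by-frame detections can relate to time-aligned events, the sketch below collapses a per-frame sequence of object names into intervals in milliseconds, labelled in the “Looking at: [object]” style. It is an illustration under assumptions, not the tool's export code: the function name, the input layout (one object name or None per frame), and the constant frame rate are all hypothetical, and writing the actual .eaf file is not covered.

```python
def frames_to_events(labels, fps):
    """Collapse per-frame labels into (start_ms, end_ms, text) intervals.

    `labels` holds one entry per frame: an object name while the target
    overlaps that object, or None when there is no overlap.
    """
    events = []
    start = None
    current = None
    # A trailing None acts as a sentinel that closes a run ending on the last frame.
    for index, label in enumerate(list(labels) + [None]):
        if label != current:
            if current is not None:
                start_ms = round(start * 1000 / fps)
                end_ms = round(index * 1000 / fps)
                events.append((start_ms, end_ms, f"Looking at: {current}"))
            start, current = index, label
    return events


per_frame = [None, "cup", "cup", "cup", None, "ball", "ball"]
events = frames_to_events(per_frame, fps=25)
```

At 25 frames per second each frame spans 40 ms, so the three "cup" frames become one interval from 40 ms to 160 ms and the two "ball" frames one interval from 200 ms to 280 ms.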
An example of the raw (left) and annotated (right) video output
Ahmar, D., Owoyele, B., & Pouw, W. (2025). EnvisionBoxAnnotator: An Automatic Object-to-Object Overlap Detector with SAM2 (Version 1.0.0). Zenodo. https://doi.org/10.5281/zenodo.18840160
