A powerful computer vision tool that uses CLIP (Contrastive Language-Image Pre-Training) to locate and identify objects or scenes in images based on natural language queries.
- Smart Query Expansion: Automatically generates variations of your search query for better detection (a sketch follows this list)
- Adaptive Window Sizing: Dynamically adjusts detection windows based on image dimensions
- Confidence-Based Detection: Provides confidence scores and quality assessments for each detection
- Professional Visualization: Creates high-quality result images with bounding boxes and crops
- Multiple Detection Support: Finds up to 3 best matches per query
- Metadata Export: Saves detailed information about each detection
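To give a feel for what query expansion might look like, here is a small hypothetical sketch. The function name and prompt templates below are illustrative only and are not taken from scene_localizer.py, which may expand queries differently:

```python
# Hypothetical sketch of query expansion: generate prompt variations of a
# user query before scoring. The templates here are illustrative, not the
# ones actually used by this project.
def expand_query(query):
    templates = [
        "{}",
        "a photo of {}",
        "a close-up photo of {}",
        "a scene showing {}",
        "an image containing {}",
    ]
    return [t.format(query) for t in templates]

print(expand_query("person talking"))
# ['person talking', 'a photo of person talking', ...]
```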
- Python 3.7 or higher
- CUDA-compatible GPU (optional but recommended for faster processing; a quick check is shown after this list)
- At least 4GB RAM
- 2GB free disk space for model downloads
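If you are unsure whether PyTorch can see your GPU, the snippet below (run after installing the dependencies) reports the device that will be used. This is a generic PyTorch check, not part of the project code:

```python
# Report whether PyTorch will run on a CUDA GPU or fall back to the CPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")
if device == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))
```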
First, create a new folder on your computer for this project.

Windows:
```bash
mkdir scene-localization
cd scene-localization
```

macOS/Linux:
```bash
mkdir scene-localization
cd scene-localization
```

Clone the repository:
```bash
git clone https://github.com/Rishabh1925/scene_localization_system.git
cd scene_localization_system
```

Create and activate a virtual environment.

macOS/Linux:
```bash
python3 -m venv scene_env
source scene_env/bin/activate
```

Windows:
```bash
python -m venv scene_env
scene_env\Scripts\activate
```

Create a requirements.txt file with the following content:
```
torch
torchvision
transformers
opencv-python
numpy
Pillow
matplotlib
scipy
flask
flask-cors
```

Then install:
```bash
pip install -r requirements.txt
```

Verify the installation:
```bash
python -c "import torch, transformers, cv2, PIL, flask; print('All dependencies installed successfully!')"
```

- Start the web application:
```bash
python app.py
```
- Open your web browser and navigate to http://127.0.0.1:5000 (or the address displayed in your terminal)
- Upload your image files (JPG, JPEG, PNG, BMP, TIFF, WebP, GIF) through the web interface
- Enter your search queries and click "Analyze Image"
- Please be patient! The analysis takes time to process (usually 1-5 minutes depending on your hardware and image complexity). The system is running complex AI computations in the background.
Think About This Project:
- How do you think this system combines computer vision and natural language processing?
- What makes CLIP special compared to traditional image recognition systems?
- Why might the sliding window approach be effective for object localization?
Fun CLIP & Vision Transformer Trivia:
- CLIP stands for: Contrastive Language-Image Pre-Training - it learned from 400 million image-text pairs!
- ViT Revolution: Vision Transformers (ViT) proved that the transformer architecture (originally for text) could beat CNNs at image tasks
- Zero-shot Magic: CLIP can recognize objects it was never explicitly trained to identify - it just needs a text description
- Multimodal Learning: CLIP understands both images AND text in the same mathematical space - that's why you can search images with natural language (see the snippet below)
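To make the zero-shot and multimodal points concrete, here is a minimal sketch of scoring a few text queries against one image with the Hugging Face transformers CLIP API. The image path and queries are placeholders, and this is not the project's own detection pipeline:

```python
# Minimal CLIP zero-shot matching: embed an image and several text queries
# in the same space and compare them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("static/images/test1.jpg")  # placeholder image path
queries = ["two people conversing", "a red car", "a dog running"]

inputs = processor(text=queries, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = the text describes the image better.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for query, p in zip(queries, probs):
    print(f"{query}: {p:.3f}")
```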
The system generates:
- improved_result.jpg: Main visualization with bounding boxes, confidence scores, and quality ratings
- improved_detections/ folder: Individual cropped images and metadata files for each detection
Example metadata file:
```
Query: person talking
Matched Query: two people conversing
Confidence Score: 0.6470
Bounding Box: (234, 156, 456, 389)
Window Size: (200, 200)
Crop Size: 222x233 pixels
```

Example queries:
- Specific objects: "red car", "brown dog", "person wearing hat"
- Actions: "person walking", "dog running"
- Scenes: "street vendor", "outdoor café"
- Relationships: "two people conversing", "person with bicycle"
Tips for better results:
- Be specific: Use descriptive terms like "red sports car" instead of just "car"
- Try alternatives: Use synonyms if initial queries don't work
- Use high-quality images: Clear, well-lit images with visible objects work best
- Optimize image size: Resize large images (>2000px) for better performance (a resize sketch follows these tips)
- Use GPU acceleration: CUDA-compatible GPU recommended for faster processing
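As a convenience for the image-size tip above, here is one way to shrink oversized images with Pillow before uploading. The 2000 px threshold comes from this README, while the helper and file names are just an example:

```python
# Downscale an image if its longest side exceeds max_side, keeping aspect ratio.
from PIL import Image

def shrink_if_large(path, out_path, max_side=2000):
    img = Image.open(path)
    if max(img.size) > max_side:
        img.thumbnail((max_side, max_side))  # in-place, aspect-ratio preserving
    img.save(out_path)

shrink_if_large("huge_photo.jpg", "huge_photo_small.jpg")  # example file names
```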
- Base Model: OpenAI CLIP-ViT-B/32
- Input Resolution: Images processed at 224x224 pixels
- Algorithm: Sliding window with cosine similarity between CLIP embeddings, non-maximum suppression for overlapping detections
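For readers curious how the pieces above fit together, the following is a simplified sketch of sliding-window search with CLIP embeddings, cosine similarity, and non-maximum suppression. The window size, stride, and IoU threshold are illustrative defaults, not the values used by scene_localizer.py, which also adapts window sizes to the image:

```python
# Simplified sliding-window localization: crop windows, embed crops and the
# query with CLIP, rank windows by cosine similarity, then drop overlapping
# boxes with non-maximum suppression (NMS).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_windows(image, query, window=200, stride=100):
    with torch.no_grad():
        text_inputs = processor(text=[query], return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

        boxes, scores = [], []
        width, height = image.size
        for top in range(0, max(height - window, 1), stride):
            for left in range(0, max(width - window, 1), stride):
                crop = image.crop((left, top, left + window, top + window))
                img_inputs = processor(images=crop, return_tensors="pt")
                img_emb = model.get_image_features(**img_inputs)
                img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
                boxes.append((left, top, left + window, top + window))
                scores.append((img_emb @ text_emb.T).item())  # cosine similarity
    return boxes, scores

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, max_keep=3, iou_thresh=0.5):
    # Keep the highest-scoring boxes, skipping ones that overlap a kept box.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
        if len(kept) == max_keep:
            break
    return [(boxes[i], scores[i]) for i in kept]

image = Image.open("static/images/test1.jpg")  # placeholder image path
boxes, scores = score_windows(image, "two people conversing")
for box, score in nms(boxes, scores):
    print(box, round(score, 3))
```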
```
scene-localization-system/
│
├── app.py                  # Main Flask web application
├── index.html              # HTML templates
├── README.md               # Documentation
├── requirements.txt        # Python dependencies
├── scene_localizer.py
│
├── scene_env/              # Virtual environment (after you create it)
│
├── static/
│   └── images/
│       ├── test1.jpg
│       ├── test2.png
│       └── test3.png
│
└── improved_detections/    # Output folder
    ├── detection_1_score_0.647_confidence_high.jpg
    ├── detection_1_score_0.304_confidence_low.jpg
    └── ...
```
This project uses components under the following licenses:
- CLIP Model: MIT License
- PyTorch: BSD License
- Transformers: Apache 2.0 License
If you encounter issues, check that:
- All dependencies are properly installed
- Image files are in supported formats
- Objects you're searching for are clearly visible in the image