This project generates captions for images using a deep learning model. It includes a Jupyter notebook for model training, a FastAPI backend to serve the model, and a Streamlit web application for an interactive user interface.
- Image Feature Extraction: Uses a pre-trained VGG16 model to extract features from images.
- Caption Generation: Employs an LSTM-based model to generate descriptive captions for images.
- Training Pipeline: A Jupyter notebook (`notebooks/main.ipynb`) details the process of downloading the Flickr8k dataset, preprocessing the data, training the captioning model, and evaluating it.
- API Endpoint: A FastAPI application (`api.py`) provides an endpoint (`/model`) that returns captions for uploaded images.
- Interactive Web App: A Streamlit application (`app.py`) lets users upload an image and view the generated caption.
The image captioning model consists of an encoder-decoder architecture:
- Encoder (Feature Extractor):
  - A pre-trained VGG16 model, with its final classification layer removed, serves as the image feature extractor.
  - It processes an input image (resized to 224x224 pixels) and outputs a 4096-dimensional feature vector representing the image's salient content.
  - The Keras model for the feature extractor is typically saved as `models/feature_extractor.keras`.
- Decoder (Captioning Model):
  - An LSTM (Long Short-Term Memory) network serves as the decoder.
  - It takes the 4096-dimensional image feature vector (from the encoder) and the sequence of previously generated words (as token IDs) as input.
  - The LSTM then predicts the next word in the caption sequence.
  - This process repeats, with each newly predicted word becoming part of the input for the next time step, until an 'endseq' token is generated or the maximum caption length is reached.
  - The Keras model for the caption generator is typically saved as `models/best_model.keras`.
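The decoding loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual inference code: `predict_next` stands in for the trained decoder (any function mapping the image features and the token IDs so far to the next token ID), and the tokenizer is reduced to two plain dicts.

```python
def greedy_decode(predict_next, image_features, word_to_id, id_to_word, max_length):
    """Generate a caption one word at a time until 'endseq' or max_length."""
    tokens = [word_to_id["startseq"]]  # decoding always starts from the start token
    words = []
    for _ in range(max_length):
        next_id = predict_next(image_features, tokens)
        word = id_to_word[next_id]
        if word == "endseq":           # stop token reached
            break
        words.append(word)
        tokens.append(next_id)         # the prediction is fed back as input
    return " ".join(words)
```

In the real model, `predict_next` would pad `tokens` to the training sequence length, call the Keras model, and take the argmax over the vocabulary.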
- Text Preprocessing:
  - Captions are cleaned (converted to lowercase; punctuation and short words removed).
  - Special tokens, 'startseq' and 'endseq', are added to mark the beginning and end of each caption, respectively.
  - A Keras Tokenizer (`models/tokenizer (1).pkl`) converts words into integer sequences and vice versa. The vocabulary is built from the training captions.
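The cleaning step can be sketched as follows. The exact rules (which punctuation is stripped, the minimum word length) are assumptions based on common Flickr8k pipelines; check the notebook for the project's precise implementation.

```python
import string

def clean_caption(caption: str) -> str:
    """Lowercase, strip punctuation, drop one-letter and non-alphabetic words,
    then wrap the result in the 'startseq' ... 'endseq' markers."""
    text = caption.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if len(w) > 1 and w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"
```

For example, `clean_caption("A dog runs in the park!")` yields `"startseq dog runs in the park endseq"`.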
- Embeddings:
  - Word embeddings represent each word in the vocabulary as a dense vector. These embeddings are learned during training and are part of the decoder model; they transform the tokenized words into a format suitable for the LSTM layer.
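At its core, the embedding layer is a lookup table: row i of a learned matrix is the dense vector for token ID i. A toy illustration with a plain list-of-lists matrix (in the real model this is a `keras.layers.Embedding` layer whose weights are learned, not hand-written):

```python
def embed(token_ids, embedding_matrix):
    """Map each token ID to its dense vector (a row of the matrix)."""
    return [embedding_matrix[t] for t in token_ids]
```

For instance, `embed([1, 0], [[0.0, 0.1], [1.0, 1.1]])` returns row 1 followed by row 0.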
- Training:
  - The model is trained to minimize the `categorical_crossentropy` loss function, using the `adam` optimizer.
  - During training, the decoder learns to predict the next word in a caption given the image features and the preceding words of the ground-truth caption.
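This teacher-forcing scheme expands each caption into several training examples, one per word position. A minimal sketch (the token IDs are illustrative; the notebook additionally pads the input sequences and one-hot encodes the target word):

```python
def make_training_pairs(token_ids):
    """Expand one tokenized caption into (input_sequence, next_word) pairs.

    For 'startseq dog runs endseq' the decoder sees three examples:
    [startseq] -> dog, [startseq, dog] -> runs, [startseq, dog, runs] -> endseq.
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]
```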
- Python
- TensorFlow / Keras
- FastAPI
- Streamlit
- NumPy
- Pillow (PIL)
- scikit-learn
- NLTK (for BLEU score calculation)
- Kaggle API (for dataset download)
- Jupyter Notebook
```
├── Datasets/
│   ├── captions.txt
│   └── Images/
├── models/
│   ├── best_model.keras        # Main captioning model (decoder)
│   ├── best_model.h5           # Alternative format for the main model
│   ├── feature_extractor.keras # VGG16-based feature extractor (encoder)
│   ├── feature_extractor.h5    # Alternative format for the feature extractor
│   ├── tokenizer (1).pkl       # Tokenizer for caption preprocessing
│   ├── features (1).pkl        # Pre-computed image features (optional, from notebook)
│   ├── model.png               # Diagram of the model architecture (if available)
│   └── ...                     # Other model-related files
├── notebooks/
│   └── main.ipynb              # Jupyter notebook for training and experimentation
├── api.py                      # FastAPI backend for serving the model
├── app.py                      # Streamlit web application for UI
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
- Clone the repository (if applicable):

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```
- Create a virtual environment (recommended):

  ```bash
  python -m venv imgcap
  source imgcap/bin/activate  # On Windows use `imgcap\Scripts\activate`
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download the dataset:
  - The Jupyter notebook (`notebooks/main.ipynb`) contains commands to download the Flickr8k dataset using the Kaggle API. You will need to have your `kaggle.json` API key set up.
  - Alternatively, manually download the Flickr8k dataset and place the `Images` folder and `captions.txt` file into a `Datasets` directory at the root of the project.
- Train the model (or use pre-trained models):
  - Run through `notebooks/main.ipynb` to train the model. This will generate `best_model.keras`, `feature_extractor.keras`, and `tokenizer.pkl` in the `models` directory (or ensure your pre-trained models are in the `models` directory as specified in `api.py` and `app.py`).
  - The notebook saves models to `/kaggle/working/` by default, so you may need to adjust paths or move the saved models into the project's `models/` directory. The current `api.py` and `app.py` look for models in a local `models/` directory.
- Navigate to the `notebooks/` directory.
- Open and run `main.ipynb` using Jupyter Notebook or JupyterLab:

  ```bash
  jupyter notebook notebooks/main.ipynb
  # or
  jupyter lab notebooks/main.ipynb
  ```

- This notebook covers:
  - Downloading and unzipping the Flickr8k dataset.
  - Extracting image features using VGG16.
  - Preprocessing captions.
  - Tokenizing text data.
  - Building and training the image captioning model (LSTM-based).
  - Evaluating the model using BLEU scores.
  - Generating captions for sample images.
- To run the API that serves the image captioning model:

  ```bash
  uvicorn api:api --reload
  ```

- The API will be available at `http://127.0.0.1:8000`.
- You can access the interactive API documentation (Swagger UI) at `http://127.0.0.1:8000/docs`.
- The main endpoint for captioning is `POST /model`, which expects an image file.
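From Python, the endpoint can be called with a small multipart upload; the standard library is enough. Note that the form field name (`"file"`) and the raw-bytes return value are assumptions made for this sketch; check `api.py` for the actual parameter name and response schema.

```python
import mimetypes
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes, boundary: str) -> bytes:
    """Assemble a single-file multipart/form-data request body."""
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail

def caption_image(image_path: str, url: str = "http://127.0.0.1:8000/model") -> bytes:
    """POST an image file to the captioning endpoint and return the raw response."""
    boundary = uuid.uuid4().hex
    with open(image_path, "rb") as f:
        body = build_multipart("file", image_path, f.read(), boundary)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```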
- To run the interactive web application:

  ```bash
  streamlit run app.py
  ```

- The application will typically open in your web browser at `http://localhost:8501`.
- You can upload an image, and the app will display the image and the generated caption.
This README provides a comprehensive overview of the Image Captioning project.

App: https://image-captioning-17.streamlit.app/