This project generates captions for images using a deep learning model. It includes a Jupyter notebook for model training, a FastAPI backend to serve the model, and a Streamlit web application for an interactive user interface.
- Image Feature Extraction: Uses a pre-trained VGG16 model to extract features from images.
- Caption Generation: Employs an LSTM-based model to generate descriptive captions for images.
- Training Pipeline: A Jupyter notebook (`notebooks/main.ipynb`) details the process of downloading the Flickr8k dataset, preprocessing the data, training the captioning model, and evaluating it.
- API Endpoint: A FastAPI application (`api.py`) provides an endpoint (`/model`) that returns captions for uploaded images.
- Interactive Web App: A Streamlit application (`app.py`) lets users upload an image and view the generated caption.
The image captioning model consists of an encoder-decoder architecture:
- Encoder (Feature Extractor):
  - A pre-trained VGG16 model, with its final classification layer removed, serves as the image feature extractor.
  - It processes an input image (resized to 224x224 pixels) and outputs a 4096-dimensional feature vector representing the image's salient content.
  - The Keras model for the feature extractor is typically saved as `models/feature_extractor.keras`.
- Decoder (Captioning Model):
  - An LSTM (Long Short-Term Memory) network serves as the decoder.
  - It takes the 4096-dimensional image feature vector (from the encoder) and the sequence of previously generated words (as token IDs) as input.
  - The LSTM then predicts the next word in the caption sequence.
  - This process repeats, with each newly predicted word becoming part of the input for the next time step, until an 'endseq' token is generated or the maximum caption length is reached.
  - The Keras model for the caption generator is typically saved as `models/best_model.keras`.
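The decoding loop described above can be sketched in a few lines. This is a minimal illustration, not the project's actual inference code: `predict_next` stands in for the trained decoder (any function mapping the image features and the token IDs so far to the next token ID), and the tokenizer is reduced to two plain dicts.

```python
def greedy_decode(predict_next, image_features, word_to_id, id_to_word, max_length):
    """Generate a caption one word at a time until 'endseq' or max_length."""
    tokens = [word_to_id["startseq"]]  # decoding always starts from the start token
    words = []
    for _ in range(max_length):
        next_id = predict_next(image_features, tokens)
        word = id_to_word[next_id]
        if word == "endseq":           # stop token reached
            break
        words.append(word)
        tokens.append(next_id)         # the prediction is fed back as input
    return " ".join(words)
```

In the real model, `predict_next` would pad `tokens` to the training sequence length, call the Keras model, and take the argmax over the vocabulary.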
- Text Preprocessing:
  - Captions are cleaned (converted to lowercase; punctuation and short words removed).
  - Special tokens, 'startseq' and 'endseq', are added to mark the beginning and end of each caption, respectively.
  - A Keras Tokenizer (`models/tokenizer (1).pkl`) converts words into integer sequences and vice versa. The vocabulary is built from the training captions.
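The cleaning step can be sketched as follows. The exact rules (which punctuation is stripped, the minimum word length) are assumptions based on common Flickr8k pipelines; check the notebook for the project's precise implementation.

```python
import string

def clean_caption(caption: str) -> str:
    """Lowercase, strip punctuation, drop one-letter and non-alphabetic words,
    then wrap the result in the 'startseq' ... 'endseq' markers."""
    text = caption.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if len(w) > 1 and w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"
```

For example, `clean_caption("A dog runs in the park!")` yields `"startseq dog runs in the park endseq"`.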
- Embeddings:
  - Word embeddings represent each word in the vocabulary as a dense vector. These embeddings are learned during training and are part of the decoder model; they transform the tokenized words into a format suitable for the LSTM layer.
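At its core, the embedding layer is a lookup table: row i of a learned matrix is the dense vector for token ID i. A toy illustration with a plain list-of-lists matrix (in the real model this is a `keras.layers.Embedding` layer whose weights are learned, not hand-written):

```python
def embed(token_ids, embedding_matrix):
    """Map each token ID to its dense vector (a row of the matrix)."""
    return [embedding_matrix[t] for t in token_ids]
```

For instance, `embed([1, 0], [[0.0, 0.1], [1.0, 1.1]])` returns row 1 followed by row 0.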
- Training:
  - The model is trained to minimize the `categorical_crossentropy` loss function, using the `adam` optimizer.
  - During training, the decoder learns to predict the next word in a caption given the image features and the preceding words of the ground-truth caption.
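This teacher-forcing scheme expands each caption into several training examples, one per word position. A minimal sketch (the token IDs are illustrative; the notebook additionally pads the input sequences and one-hot encodes the target word):

```python
def make_training_pairs(token_ids):
    """Expand one tokenized caption into (input_sequence, next_word) pairs.

    For 'startseq dog runs endseq' the decoder sees three examples:
    [startseq] -> dog, [startseq, dog] -> runs, [startseq, dog, runs] -> endseq.
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]
```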
- Python
- TensorFlow / Keras
- FastAPI
- Streamlit
- NumPy
- Pillow (PIL)
- scikit-learn
- NLTK (for BLEU score calculation)
- Kaggle API (for dataset download)
- Jupyter Notebook
```
├── Datasets/
│   ├── captions.txt
│   └── Images/
├── models/
│   ├── best_model.keras        # Main captioning model (decoder)
│   ├── best_model.h5           # Alternative format for the main model
│   ├── feature_extractor.keras # VGG16-based feature extractor (encoder)
│   ├── feature_extractor.h5    # Alternative format for the feature extractor
│   ├── tokenizer (1).pkl       # Tokenizer for caption preprocessing
│   ├── features (1).pkl        # Pre-computed image features (optional, from notebook)
│   ├── model.png               # Diagram of the model architecture (if available)
│   └── ...                     # Other model-related files
├── notebooks/
│   └── main.ipynb              # Jupyter notebook for training and experimentation
├── api.py                      # FastAPI backend for serving the model
├── app.py                      # Streamlit web application for UI
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
- Clone the repository (if applicable):

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```
- Create a virtual environment (recommended):

  ```bash
  python -m venv imgcap
  source imgcap/bin/activate  # On Windows use `imgcap\Scripts\activate`
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download the dataset:
  - The Jupyter notebook (`notebooks/main.ipynb`) contains commands to download the Flickr8k dataset using the Kaggle API. You will need to have your `kaggle.json` API key set up.
  - Alternatively, manually download the Flickr8k dataset and place the `Images` folder and `captions.txt` file into a `Datasets` directory at the root of the project.
- Train the model (or use pre-trained models):
  - Run through `notebooks/main.ipynb` to train the model. This will generate `best_model.keras`, `feature_extractor.keras`, and `tokenizer.pkl` in the `models` directory (or ensure your pre-trained models are in the `models` directory as specified in `api.py` and `app.py`).
  - The notebook saves models to `/kaggle/working/` by default, so you may need to adjust paths or move the saved models into the project's `models/` directory. The current `api.py` and `app.py` look for models in a local `models/` directory.
- Navigate to the `notebooks/` directory.
- Open and run `main.ipynb` using Jupyter Notebook or JupyterLab:

  ```bash
  jupyter notebook notebooks/main.ipynb
  # or
  jupyter lab notebooks/main.ipynb
  ```

- This notebook covers:
  - Downloading and unzipping the Flickr8k dataset.
  - Extracting image features using VGG16.
  - Preprocessing captions.
  - Tokenizing text data.
  - Building and training the image captioning model (LSTM-based).
  - Evaluating the model using BLEU scores.
  - Generating captions for sample images.
- To run the API that serves the image captioning model:

  ```bash
  uvicorn api:api --reload
  ```

- The API will be available at `http://127.0.0.1:8000`.
- You can access the interactive API documentation (Swagger UI) at `http://127.0.0.1:8000/docs`.
- The main endpoint for captioning is `POST /model`, which expects an image file.
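From Python, the endpoint can be called with a small multipart upload; the standard library is enough. Note that the form field name (`"file"`) and the raw-bytes return value are assumptions made for this sketch; check `api.py` for the actual parameter name and response schema.

```python
import mimetypes
import urllib.request
import uuid

def build_multipart(field: str, filename: str, data: bytes, boundary: str) -> bytes:
    """Assemble a single-file multipart/form-data request body."""
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail

def caption_image(image_path: str, url: str = "http://127.0.0.1:8000/model") -> bytes:
    """POST an image file to the captioning endpoint and return the raw response."""
    boundary = uuid.uuid4().hex
    with open(image_path, "rb") as f:
        body = build_multipart("file", image_path, f.read(), boundary)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```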
- To run the interactive web application:

  ```bash
  streamlit run app.py
  ```

- The application will typically open in your web browser at `http://localhost:8501`.
- You can upload an image, and the app will display the image and the generated caption.
This README provides a comprehensive overview of the Image Captioning project.

App: https://image-captioning-17.streamlit.app/