Dreambooth - Personalized Diffusion Model

This repo uses Dreambooth to teach a diffusion model my likeness, so that it can generate images of me from text prompts. I fine-tune the stable-diffusion-xl (SDXL) model from Hugging Face (over 10GB in size) on a single Turing T4 GPU (16GB) on Google Colab, using LoRA and Hugging Face Accelerate. The repo also explores merging different LoRA adapters in order to combine styles.

Motivation

Can Low-Rank Adapters work well for training Dreambooth?
Dreambooth works well on pictures of objects. Can it learn to represent human faces well?
How many images do I need to teach the model about myself?
What is Prior Preservation?
How will a model recognize me from the text prompt?
Can we merge different Adapters to learn different styles?
What are some difficulties when it comes to training on human faces, and how can we offset them?
On what text prompts does the model do well, and when does it mess up?

Project Structure `↩`

The data directory contains prior.zip, which holds 197 images of human faces (excluding my own). These images are used to train the model with prior preservation. I also used 7 high-resolution images of myself for fine-tuning (not uploaded here).
The personalized_diffusion.ipynb notebook trains the model with and without Prior Preservation.
The train_dreambooth_lora_sdxl.py script loads the model and runs the training loop.
The inference_dreambooth.ipynb notebook contains structured inference results for the models trained with and without Prior Preservation, including all the images generated from text prompts after training.

Dataset `↩`

The data contains 7 high-resolution images of me. For Dreambooth, it is important that these images cover different angles and clearly display the face. According to the experiments, 5-6 images are enough to train stable-diffusion-xl (SDXL) with LoRA. For prior preservation, we also use 197 images of other human faces to increase diversity and reduce language drift. These images are generated by the diffusion model itself.

Prior-Preservation `↩`

Fine-tuning layers that are conditioned on the text embeddings gives rise to the problem of language drift, where a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language. This phenomenon also affects diffusion models, where the model slowly forgets how to generate subjects of the same class as the target subject.

Another problem is the possibility of reduced output diversity. Text-to-image diffusion models naturally possess a high degree of output diversity. When fine-tuning on a small set of images, we would like to be able to generate the subject in novel viewpoints, poses, and articulations, yet there is a risk of reducing the variability in the output poses and views of the subject.

To mitigate these two issues, the Dreambooth paper [2] proposes an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. The method supervises the model with its own generated samples so that it retains the prior once few-shot fine-tuning begins. This allows it to generate diverse images of the class prior, and to retain knowledge about the class prior that it can use in conjunction with knowledge about the subject instance.
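To make the idea concrete, here is a minimal sketch of how the two loss terms combine. It uses plain Python lists in place of tensors, and the names `mse`, `dreambooth_loss`, and `prior_loss_weight` are illustrative assumptions rather than the exact code in the training script.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(
    instance_pred: Sequence[float],
    instance_target: Sequence[float],
    prior_pred: Sequence[float],
    prior_target: Sequence[float],
    prior_loss_weight: float = 1.0,  # assumed default; tune for your setup
) -> float:
    """Subject reconstruction loss plus a weighted prior preservation term.

    The instance term is computed on the subject images; the prior term is
    computed on class images that the model generated itself.
    """
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(prior_pred, prior_target)
    return instance_loss + prior_loss_weight * prior_loss
```

Setting `prior_loss_weight` to 0 recovers plain fine-tuning on the subject images only, which corresponds to the "without prior preservation" runs.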

Training `↩`

To fit such a large model on a 16GB Turing T4 GPU, I use gradient accumulation, gradient checkpointing, and 8-bit fused Adam (instead of the regular Adam). I train on 7 images for 1000 steps, once with and once without the prior preservation loss, to verify that prior preservation actually helps.
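The memory-saving options above map onto command-line flags of the training script. The sketch below builds such a command as a Python list. The flag names follow the diffusers example script train_dreambooth_lora_sdxl.py as I understand it and may differ between versions; the checkpoint id, batch size, accumulation steps, mixed precision mode, class prompt, and prior loss weight are assumptions, not values taken from this repo.

```python
from typing import List


def build_train_command(
    instance_dir: str,
    class_dir: str,
    output_dir: str,
    with_prior_preservation: bool = True,
    max_train_steps: int = 1000,   # from the README
    num_class_images: int = 197,   # from the README
) -> List[str]:
    """Assemble an `accelerate launch` command for the training script."""
    cmd = [
        "accelerate", "launch", "train_dreambooth_lora_sdxl.py",
        # Assumed checkpoint id; the README only says "stable-diffusion-xl".
        "--pretrained_model_name_or_path", "stabilityai/stable-diffusion-xl-base-1.0",
        "--instance_data_dir", instance_dir,
        "--instance_prompt", "A photo of Satvik person",
        "--output_dir", output_dir,
        "--train_batch_size", "1",             # assumed
        "--gradient_accumulation_steps", "4",  # assumed
        "--gradient_checkpointing",            # trade compute for memory
        "--use_8bit_adam",                     # 8-bit optimizer states
        "--mixed_precision", "fp16",           # assumed
        "--max_train_steps", str(max_train_steps),
    ]
    if with_prior_preservation:
        cmd += [
            "--with_prior_preservation",
            "--class_data_dir", class_dir,
            "--class_prompt", "A photo of person",  # assumed
            "--num_class_images", str(num_class_images),
            "--prior_loss_weight", "1.0",           # assumed
        ]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, or joined with spaces and run in a Colab cell.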

Training Prompt

In order to teach the model a mapping between text and a subject, Dreambooth proposes using a rare token from the model's vocabulary and combining it with the subject prior. For instance, to train on my face, I use the prompt

A photo of Satvik person

Here, Satvik is the rare identifier token, and person is the class that the subject belongs to.
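The prompt structure (identifier followed by class) can be sketched with a few small helpers. The function names and the class prompt wording are hypothetical; only the instance prompt itself comes from this README.

```python
def instance_prompt(identifier: str, class_noun: str) -> str:
    """Prompt used for the subject images: identifier followed by class."""
    return f"A photo of {identifier} {class_noun}"


def class_prompt(class_noun: str) -> str:
    """Prompt for the prior (class) images: the same template without the identifier."""
    return f"A photo of {class_noun}"


def contextual_prompt(template: str, identifier: str, class_noun: str) -> str:
    """Fill a template containing `{subject}` with the 'identifier class' pair."""
    return template.format(subject=f"{identifier} {class_noun}")


print(instance_prompt("Satvik", "person"))
print(contextual_prompt("A picture of {subject} with blonde hair", "Satvik", "person"))
```

Keeping the identifier and class together at inference time matters, because the model learned the subject as that exact pair.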

Results `↩`

It turns out that LoRA + Dreambooth with 1000 steps works decently well on human faces too. Prior Preservation clearly improves the model (as seen from the images below). For me, the PNDM scheduler works well with just 50 timesteps, and DDIM with 80 timesteps.
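A minimal inference sketch with the two schedulers mentioned above, assuming the diffusers library, a CUDA GPU, and LoRA weights saved in a local directory. The checkpoint id and the `generate` helper are assumptions; the step counts come from the paragraph above.

```python
# Inference steps reported above for each scheduler.
SCHEDULER_STEPS = {"pndm": 50, "ddim": 80}


def generate(prompt, lora_dir, scheduler="pndm",
             model_id="stabilityai/stable-diffusion-xl-base-1.0"):  # assumed checkpoint id
    """Generate one image with the fine-tuned LoRA weights and the chosen scheduler."""
    if scheduler not in SCHEDULER_STEPS:
        raise ValueError(f"unknown scheduler: {scheduler!r}")

    # Heavy imports stay local so this module can be imported without a GPU.
    import torch
    from diffusers import DDIMScheduler, PNDMScheduler, StableDiffusionXLPipeline

    scheduler_classes = {"pndm": PNDMScheduler, "ddim": DDIMScheduler}

    pipe = StableDiffusionXLPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe.load_lora_weights(lora_dir)
    # Swap the scheduler while reusing the pipeline's existing scheduler config.
    pipe.scheduler = scheduler_classes[scheduler].from_config(pipe.scheduler.config)
    pipe.to("cuda")

    result = pipe(prompt, num_inference_steps=SCHEDULER_STEPS[scheduler])
    return result.images[0]
```

For example, `generate("A picture of Satvik person with blonde hair", "lora_out", scheduler="ddim")` would run 80 DDIM steps, where `lora_out` is a hypothetical output directory.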

Recontextualization

prompt = "A picture of Satvik person in a wedding wearing traditional Indian clothes."

Without Prior Preservation

With Prior Preservation

Art Renditions

prompt = "A painting of Satvik person in the style of Starry Night by Van Gogh."

Without Prior Preservation

With Prior Preservation

Property Modification

prompt = "A picture of Satvik person with blonde hair"

Without Prior Preservation

With Prior Preservation

Novel-View Synthesis

prompt = "A back view photo of Satvik person"

Without Prior Preservation

With Prior Preservation

Accessorization

prompt = "A picture of Satvik person with face mask."

Without Prior Preservation

With Prior Preservation

Merging Adapters `↩`

I experiment with merging two LoRA adapters, my Dreambooth subject adapter and a Pixel Art style adapter, to generate images of me in pixel-art style.

prompt = "pixel, a photo of Satvik person wearing sunglasses."

Without Prior Preservation

With Prior Preservation
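Merging the two adapters can be sketched as follows, assuming a diffusers pipeline with PEFT-backed LoRA support. The `merge_adapters` helper, the adapter names, and the default weights are hypothetical; the README does not state which style adapter or weights were used.

```python
def merge_adapters(pipe, subject_lora, style_lora,
                   subject_weight=1.0, style_weight=1.0):
    """Load a subject adapter and a style adapter, then activate both.

    `pipe` is expected to be a diffusers pipeline exposing
    `load_lora_weights` and `set_adapters` (this requires PEFT).
    """
    # Give each adapter its own name so they can be weighted independently.
    pipe.load_lora_weights(subject_lora, adapter_name="subject")
    pipe.load_lora_weights(style_lora, adapter_name="style")
    # Activate both adapters with their relative strengths.
    pipe.set_adapters(
        ["subject", "style"],
        adapter_weights=[subject_weight, style_weight],
    )
    return pipe
```

Lowering `style_weight` should keep more of the subject's likeness at the cost of a weaker style, so the two weights are worth tuning together.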

Limitations `↩`

Generating faces is hard: eyes and teeth are sometimes rendered poorly or look mismatched.

Compute Limits

The 16GB GPU did not have enough memory for me to fine-tune the text encoders (SDXL has two). Fine-tuning the text encoders would likely improve image generation quality.

References `↩`

[1] Huggingface Blog

[2] "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation"

[3] Dreambooth Diffusers Training Script
