I recently completed a deep dive into Stable Diffusion by implementing it from scratch using PyTorch. This experience gave me a comprehensive understanding of how modern text-to-image AI systems work at a fundamental level.
Stable Diffusion is a latent diffusion model that generates images from text prompts. I learned that it consists of three main components working together:
- VAE (Variational Autoencoder): Compresses images into a latent space and reconstructs them
- CLIP Text Encoder: Converts text prompts into embeddings
- U-Net: The diffusion model that denoises latents iteratively
I implemented the CLIP (Contrastive Language-Image Pre-training) encoder to transform text prompts into meaningful embeddings:
- Token & Position Embeddings: The model uses a vocabulary of 49,408 tokens and supports sequences up to 77 tokens
- Transformer Architecture: 12 layers of self-attention with 12 heads each, operating on 768-dimensional embeddings
- QuickGELU Activation: I learned about this special activation function, `x * sigmoid(1.702 * x)`, used instead of standard GELU
- Pre-LayerNorm: The model uses pre-normalization (LayerNorm before attention) rather than post-normalization
The CLIP encoder outputs shape (batch, 77, 768) - a sequence of 77 token embeddings, each 768-dimensional.
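To make this concrete, here is a minimal sketch of the embedding stage and the QuickGELU activation. The class names and the learned position-embedding parameter are my simplification; the real encoder stacks 12 pre-norm transformer layers on top of this.

```python
import torch
from torch import nn

class QuickGELU(nn.Module):
    # CLIP's activation: x * sigmoid(1.702 * x), a cheap approximation of GELU
    def forward(self, x):
        return x * torch.sigmoid(1.702 * x)

class CLIPEmbedding(nn.Module):
    # Vocabulary of 49,408 tokens, sequences of 77 tokens, width 768
    def __init__(self, vocab_size=49408, seq_len=77, dim=768):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, dim)
        # Learned position embedding, one vector per position in the sequence
        self.position_embedding = nn.Parameter(torch.zeros(seq_len, dim))

    def forward(self, tokens):            # tokens: (batch, 77) integer IDs
        x = self.token_embedding(tokens)  # (batch, 77, 768)
        return x + self.position_embedding

emb = CLIPEmbedding()
tokens = torch.randint(0, 49408, (1, 77))
print(emb(tokens).shape)  # torch.Size([1, 77, 768])
```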
The VAE was fascinating to implement because it works in latent space rather than pixel space:
Encoder:
- Takes RGB images `(B, 3, 512, 512)`
- Compresses them 8x in each spatial dimension
- Outputs latents `(B, 4, 64, 64)`, reducing the data by 48x!
- This compression makes diffusion computationally feasible
Decoder:
- Takes the denoised latents `(B, 4, 64, 64)`
- Reconstructs back to full resolution `(B, 3, 512, 512)`
- Uses transposed convolutions and upsampling
I learned that working in latent space is the key innovation that makes Stable Diffusion efficient compared to pixel-space diffusion models.
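A quick sanity check of those shapes makes the compression ratio concrete (tensor values are random; only the shapes matter here):

```python
import torch

image = torch.randn(1, 3, 512, 512)    # pixel space: 786,432 values per image
latent = torch.randn(1, 4, 64, 64)     # latent space: 16,384 values per image

print(image.numel() / latent.numel())  # -> 48.0, the data reduction factor

# In SD 1.x the encoder output is also multiplied by a small constant (~0.18215)
# before the U-Net sees it, and divided back out before decoding.
```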
The U-Net is the heart of the system. I implemented it with these key insights:
Architecture Pattern:
- Encoder path: Progressively downsamples from 320 → 640 → 1280 channels while reducing spatial dimensions
- Bottleneck: Processes features at the lowest resolution (H/64, W/64)
- Decoder path: Progressively upsamples back with skip connections from encoder
Two Types of Blocks:
- Residual Blocks (for temporal conditioning):
  - Use GroupNorm (groups of 32)
  - Integrate time embeddings: I learned how timestep information gets injected via linear projection and addition
  - Apply SiLU activation
- Attention Blocks (for spatial and text conditioning):
  - Self-Attention: Relates different parts of the image to each other
  - Cross-Attention: This is where the magic happens - text embeddings guide image generation!
  - GeGLU FFN: Learned about Gated Linear Units for the feed-forward network
Skip Connections: The decoder concatenates encoder features, doubling channel counts (e.g., 1280 → 2560), which helps preserve spatial details.
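A stripped-down version of the residual block shows the GroupNorm/SiLU pattern and how the time embedding is injected; the layer names and exact composition are my simplification of the real block:

```python
import torch
from torch import nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Simplified U-Net residual block: GroupNorm(32) + SiLU + conv, with the time
    # embedding injected by a linear projection and addition.
    def __init__(self, in_ch, out_ch, time_dim=1280):
        super().__init__()
        self.norm1 = nn.GroupNorm(32, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, out_ch)
        self.norm2 = nn.GroupNorm(32, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, t_emb):  # x: (B, C, H, W), t_emb: (B, 1280)
        h = self.conv1(F.silu(self.norm1(x)))
        # Broadcast the projected time embedding over the spatial dimensions
        h = h + self.time_proj(F.silu(t_emb)).unsqueeze(-1).unsqueeze(-1)
        h = self.conv2(F.silu(self.norm2(h)))
        return h + self.skip(x)
```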
I implemented sinusoidal time embeddings following the "Attention is All You Need" positional encoding:
freqs = 10000^(-i/160) for i in [0, 160)
embedding = [cos(t * freqs), sin(t * freqs)]
This creates a 320-dimensional representation of the timestep, which gets expanded to 1280 dimensions via MLPs.
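In code (the function name is mine), that looks like:

```python
import torch

def time_embedding(timestep, dim=320):
    # Sinusoidal timestep embedding in the style of the Transformer positional
    # encoding: 160 frequencies, cos and sin of each -> 320 values.
    half = dim // 2
    freqs = 10000 ** (-torch.arange(half, dtype=torch.float32) / half)
    args = timestep * freqs                                # (160,)
    return torch.cat([torch.cos(args), torch.sin(args)])   # (320,)

emb = time_embedding(torch.tensor(999.0))
print(emb.shape)  # torch.Size([320])
# The U-Net then expands this 320-dim vector to 1280 dims with two Linear layers.
```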
I learned the denoising diffusion probabilistic model (DDPM) sampling process:
Forward Diffusion (Training):
- Gradually adds Gaussian noise to images over T timesteps
- Follows a variance schedule
Reverse Diffusion (Inference):
- Starts from pure noise `(B, 4, 64, 64)`
- Iteratively predicts and removes noise for 50 steps
- Each step, schematically: `latent_{t-1} = (latent_t - predicted_noise) / scaling_factor + noise`
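For reference, the textbook DDPM update that the line above simplifies looks like this; it's a sketch, with the schedule tensors (`alphas`, `alphas_cumprod`, `betas`) assumed to come from the sampler:

```python
import torch

def ddpm_step(latent_t, eps_pred, t, alphas, alphas_cumprod, betas):
    # x_{t-1} = 1/sqrt(alpha_t) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) + sigma_t * z
    alpha_t = alphas[t]
    alpha_bar_t = alphas_cumprod[t]
    beta_t = betas[t]

    mean = (latent_t - beta_t / torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
    if t > 0:
        # sigma_t^2 = beta_t is one common choice for the sampling variance
        return mean + torch.sqrt(beta_t) * torch.randn_like(latent_t)
    return mean
```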
Strength Parameter: For image-to-image, I learned that "strength" controls how much to modify the input - higher strength means starting from a noisier latent.
One of the most important concepts I learned was CFG for controllable generation:
How it works:
- Run the model twice in parallel:
  - Once with the text prompt (conditional)
  - Once with an empty prompt (unconditional)
- Combine the two predictions: `output = uncond + cfg_scale * (cond - uncond)`
- Higher cfg_scale (e.g., 7.5) makes the model follow the prompt more closely
This is why the U-Net processes batches of 2 when CFG is enabled - it's computing both versions simultaneously!
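The guidance arithmetic itself is only a couple of lines; here a random tensor stands in for the U-Net's batched output:

```python
import torch

cfg_scale = 7.5

# Stand-in for the U-Net output on a batch of 2:
# row 0 conditioned on the prompt, row 1 on the empty prompt.
noise_pred = torch.randn(2, 4, 64, 64)
cond, uncond = noise_pred.chunk(2, dim=0)

# Start from the unconditional prediction and push it toward
# (and, for cfg_scale > 1, past) the conditional one.
guided = uncond + cfg_scale * (cond - uncond)
print(guided.shape)  # torch.Size([1, 4, 64, 64])
```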
I implemented the full generation pipeline:
Text-to-Image:
- Tokenize prompt → CLIP encoding `(1, 77, 768)`
- Initialize random latent `(1, 4, 64, 64)`
- For each timestep (50 iterations):
  - Get time embedding
  - U-Net predicts noise
  - Apply CFG if enabled
  - Denoise one step
- VAE decode latents to image
- Rescale from `[-1, 1]` to `[0, 255]`
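Put together, the loop looks roughly like this. `tokenize`, `clip`, `unet`, `sampler`, and `vae_decoder` are hypothetical stand-ins for the actual modules, so treat this as pseudocode:

```python
import torch

def generate(prompt, steps=50):
    context = clip(tokenize(prompt))                  # (1, 77, 768) text embeddings
    latent = torch.randn(1, 4, 64, 64)                # start from pure noise

    for t in sampler.timesteps(steps):
        t_emb = time_embedding(t)                     # (320,), expanded to 1280 inside the U-Net
        noise_pred = unet(latent, context, t_emb)     # predict the noise in the latent
        # (with CFG enabled, an unconditional pass runs too and the two are combined)
        latent = sampler.step(latent, noise_pred, t)  # remove one step of noise

    image = vae_decoder(latent)                                # (1, 3, 512, 512) in [-1, 1]
    return ((image.clamp(-1, 1) + 1) * 127.5).to(torch.uint8)  # rescale to [0, 255]
```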
Image-to-Image:
- Encode input image to latent space
- Add noise based on strength
- Start denoising from this noisy latent, then run the same denoising loop and decoding as text-to-image (see the sketch below)
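A sketch of how strength can translate into a starting point, using the usual forward-noising formula; the step count, alpha value, and variable names here are illustrative, not lifted from my code:

```python
import torch

num_steps = 50
strength = 0.8

steps_to_run = int(num_steps * strength)   # higher strength -> noisier start, more denoising
encoded = torch.randn(1, 4, 64, 64)        # stand-in for the VAE-encoded input image

# Forward-noise the latent to the chosen starting timestep:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
alpha_bar_t = torch.tensor(0.1)            # illustrative value for a fairly noisy timestep
noisy_latent = torch.sqrt(alpha_bar_t) * encoded \
             + torch.sqrt(1 - alpha_bar_t) * torch.randn_like(encoded)

# Denoising then starts from noisy_latent and runs only the remaining steps_to_run steps.
```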
GroupNorm: I learned why GroupNorm (32 groups) is preferred over BatchNorm for diffusion models - it's more stable with small batch sizes.
Attention Mechanics:
- Self-attention: `Attention(Q, K, V)` where Q, K, and V all come from the same sequence
- Cross-attention: Q from image features, K and V from text embeddings
- Masking: CLIP uses causal masking for autoregressive text processing
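A single-head sketch of cross-attention (the real blocks are multi-head and the module/parameter names are mine) shows where the text enters:

```python
import torch
from torch import nn

class CrossAttention(nn.Module):
    # Minimal single-head cross-attention: queries from image features,
    # keys and values from text embeddings.
    def __init__(self, dim=320, ctx_dim=768):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, context):  # x: (B, H*W, 320), context: (B, 77, 768)
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5), dim=-1)
        return self.out(attn @ v)   # (B, H*W, 320): text-guided image features

attn = CrossAttention()
out = attn(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])
```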
Shape Transformations: I became comfortable with complex shape manipulations:
- `(B, C, H, W) → (B, H*W, C)` for attention
- Broadcasting time embeddings with `unsqueeze(-1).unsqueeze(-1)`
- Chunking batches for CFG
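The same manipulations in a few runnable lines:

```python
import torch

x = torch.randn(2, 320, 64, 64)              # (B, C, H, W) feature map
b, c, h, w = x.shape

# Flatten the spatial grid into a sequence of H*W tokens for attention
seq = x.view(b, c, h * w).transpose(-1, -2)  # (B, H*W, C) = (2, 4096, 320)

# Broadcast a per-sample time embedding over the spatial dimensions
t_emb = torch.randn(2, 320)
x = x + t_emb.unsqueeze(-1).unsqueeze(-1)    # (2, 320, 1, 1) broadcasts to x's shape

# Split a CFG batch back into its conditional / unconditional halves
cond, uncond = torch.randn(2, 4, 64, 64).chunk(2, dim=0)
```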
Weight Loading: I learned about mapping pretrained Stable Diffusion weights to my custom architecture - this taught me about model parameter structure and naming conventions.
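My approach boiled down to renaming checkpoint keys before calling `load_state_dict`. The checkpoint path and the prefix strings below are illustrative placeholders, not a guaranteed match for any particular file:

```python
import torch

# Load the pretrained checkpoint (path is a placeholder)
state_dict = torch.load("sd-v1-5.ckpt", map_location="cpu")["state_dict"]

# Map checkpoint prefixes to the module names in my implementation (illustrative)
rename = {
    "model.diffusion_model.": "unet.",
    "first_stage_model.": "vae.",
    "cond_stage_model.transformer.": "clip.",
}

remapped = {}
for key, tensor in state_dict.items():
    for old, new in rename.items():
        if key.startswith(old):
            remapped[new + key[len(old):]] = tensor
            break

# Mismatched keys surface here when the architectures don't line up exactly:
# missing, unexpected = my_model.load_state_dict(remapped, strict=False)
```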
My biggest takeaways:
- Latent Diffusion is Efficient: Operating in compressed latent space (64×64×4) instead of pixel space (512×512×3) makes generation ~48x more efficient
- Cross-Attention Connects Text and Image: This is the mechanism that allows text prompts to guide image generation - Q from images attends to K, V from text
- Iterative Refinement: 50 denoising steps gradually transform noise into a coherent image - each step makes small improvements
- Conditioning is Everything: Time embeddings tell the model "how noisy is this?", text embeddings tell it "what to generate", and CFG tells it "how much to follow the prompt"
- Architecture Symmetry: The U-Net's encoder-decoder symmetry with skip connections preserves information while allowing the model to process at multiple scales
Along the way, I built practical skills in:
- Implementing complex PyTorch architectures from scratch
- Understanding attention mechanisms deeply (self, cross, causal)
- Working with pretrained model weights
- Designing efficient inference pipelines
- Managing GPU memory (offloading models to CPU when idle)
- Using noise schedules and sampling algorithms
Now that I understand Stable Diffusion's internals, I want to explore:
- DDIM and other faster samplers
- ControlNet for structural conditioning
- LoRA for efficient fine-tuning
- Stable Diffusion XL architecture improvements
- Latent consistency models
This project transformed my understanding of generative AI from a black box to a system I can build, modify, and reason about deeply.