- Understanding Image Representation
- CNN Architecture Overview
- Convolution Layer - The Feature Extractor
- Activation Function - ReLU
- Pooling Layer - The Downsampler
- Flattening Layer - Bridge to Classification
- Fully Connected Layer - The Decision Maker
- Regularization Techniques
- Output Layer - Softmax
- Mathematical Formulas - Quick Reference
- Solved Examples
Analogy: Think of a digital image like a mosaic artwork. Each tiny colored tile is a pixel, and together they create the complete picture.
Key Concepts:
- A digital image = Matrix of tiny units called pixels
- Each pixel stores intensity or color information
- Computers don't see objects — they see numbers!
Grayscale vs. RGB Images:
| Image Type | Pixel Storage | Example Value |
|---|---|---|
| Grayscale | 1 number per pixel (brightness) | [128] |
| RGB Color | 3 numbers per pixel (R, G, B) | [128, 64, 32] |
Example Calculation:
- A tree image: 148 × 148 pixels = 21,904 individual values
- Grayscale: 21,904 values
- RGB Color: 21,904 × 3 = 65,712 values
Exam Tip: For RGB image of size M × N:
- Total values = M × N × 3
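As a quick sanity check, the pixel-count rule can be verified in a few lines of Python (a minimal sketch using the image sizes from the example above):

```python
def total_values(height, width, channels=1):
    """Total number of stored values for an image of the given size."""
    return height * width * channels

print(total_values(148, 148))      # grayscale tree image -> 21904
print(total_values(148, 148, 3))   # same image in RGB    -> 65712
```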
Analogy: CNN is like an assembly line in a factory:
- Raw material (Input image) enters
- Quality inspection stations (Convolution + ReLU) extract features
- Compression units (Pooling) reduce size
- Assembly workers (Fully connected layers) make final decisions
- Quality control (Softmax) provides confidence scores
Input Image → [Convolution → Batch Norm → ReLU → Pooling] × n
→ Flatten → Fully Connected → Dropout → Softmax → Output
Key Properties:
- Input layer: Original image (pixel values)
- Hidden layers: Convolution, ReLU, Pooling (partially connected)
- Output layer: Fully connected + Softmax (classification)
Analogy: Imagine using a magnifying glass to scan a document. You move it systematically across the page, examining small sections at a time. That's how convolution filters work!
Definition: A mathematical operation combining two functions to produce a third function.
In CNN:
Feature Map = Convolution(Input Image, Kernel/Filter)
Key Points:
- Filters are learnable matrices with weights and bias
- Weights are randomly initialized, updated during training
- Multiple filters learn different features (edges, textures, shapes)
- Same filter is shared across the entire image (parameter sharing)
Analogy: Each filter is like a detective with a specific specialty:
- Filter 1: Edge detective (finds boundaries)
- Filter 2: Texture detective (finds patterns)
- Filter 3: Shape detective (finds curves)
Traditional Neural Network: Every neuron connected to ALL inputs (fully connected)
- Problem: Too many parameters for images!
CNN Approach: Each neuron connected to small local region only
- Advantage: Fewer parameters, learns spatial hierarchies
Concept: The same kernel weights are used across all spatial locations.
Benefit:
- Drastically reduces parameters
- If image has 1000×1000 pixels and filter is 3×3:
- Without sharing: 1,000,000 × 9 = 9 million parameters
- With sharing: Only 9 parameters!
Sliding Window Protocol:
- Kernel starts at top-left corner
- Moves left to right, computing dot product
- Reaches last column, resets to first column
- Moves one row down
- Repeats until entire image processed
Mathematical Operation (Element-wise multiplication + sum):
Example:
Input Patch: Kernel: Calculation:
[1 2 0] [1 2 2] (1×1)+(2×2)+(0×2)+
[2 1 1] × [0 0 0] = (2×0)+(1×0)+(1×0)+
[0 5 0] [-1 -2 -1] (0×-1)+(5×-2)+(0×-1)
Result = 1 + 4 + 0 + 0 + 0 + 0 + 0 - 10 + 0 = -5
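The element-wise multiply-and-sum above can be reproduced with a short plain-Python sketch (no libraries assumed):

```python
def conv_step(patch, kernel):
    """Dot product of one image patch with the kernel: element-wise multiply, then sum."""
    return sum(p * k for prow, krow in zip(patch, kernel)
                     for p, k in zip(prow, krow))

patch  = [[1, 2, 0], [2, 1, 1], [0, 5, 0]]
kernel = [[1, 2, 2], [0, 0, 0], [-1, -2, -1]]
print(conv_step(patch, kernel))  # -5, matching the worked example
```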
When: Applied between convolution and activation (ReLU)
Purpose:
- Normalizes inputs of each layer
- Reduces internal covariate shift (changes in activation distributions)
- Acts as regularization
Benefits:
- Enables higher learning rates → faster training
- Stabilizes learning
- Improves overall performance
Exam Tip: Batch Normalization is MORE effective in convolutional layers than Dropout!
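The normalization at the heart of batch norm — subtract the batch mean, divide by the batch standard deviation, then scale and shift — can be sketched in plain Python (training-time statistics only; the learnable scale `gamma` and shift `beta` are shown at their usual defaults, an assumption for illustration):

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean / unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

normed = batch_norm([2.0, 4.0, 6.0, 8.0])
print(normed)  # roughly [-1.34, -0.45, 0.45, 1.34]
```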
Analogy: Like adding a picture frame around your photo to preserve its size.
Purpose:
- Preserve spatial dimensions
- Treat edge pixels similar to center pixels
- Control output size
Visual:
Original 5×5: With Padding (P=1):
[a b c d e] [0 0 0 0 0 0 0]
[f g h i j] [0 a b c d e 0]
[k l m n o] → [0 f g h i j 0]
[p q r s t] [0 k l m n o 0]
[u v w x y] [0 p q r s t 0]
[0 u v w x y 0]
[0 0 0 0 0 0 0]
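Zero padding can be sketched as a small helper that wraps a matrix in a border of zeros (plain Python, P=1 by default):

```python
def zero_pad(matrix, p=1):
    """Surround a 2D matrix with a border of p rows/columns of zeros."""
    width = len(matrix[0]) + 2 * p
    zero_row = [0] * width
    padded = [zero_row[:] for _ in range(p)]      # top border
    for row in matrix:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend(zero_row[:] for _ in range(p))  # bottom border
    return padded

for row in zero_pad([[1, 2], [3, 4]]):
    print(row)  # a 2×2 matrix becomes 4×4 with a ring of zeros
```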
Definition: Number of pixels the filter shifts at each step
Effect:
- Stride = 1: Filter moves 1 pixel at a time → Larger output
- Stride = 2: Filter moves 2 pixels at a time → Smaller output
Analogy: ReLU is like a security gate that only lets positive values through and blocks negative ones.
Mathematical Formula: $$ y = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$
Input Matrix:
M = [-3 19 5]
[ 7 -6 12]
[ 4 -8 17]
After ReLU:
ReLU(M) = [0 19 5]
[7 0 12]
[4 0 17]
Purpose:
- Introduces non-linearity (enables learning complex patterns)
- Suppresses negative activations
- Improves learning efficiency
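Applying ReLU element-wise to the matrix above takes one expression per element (plain Python sketch):

```python
def relu(matrix):
    """Element-wise max(0, x) over a 2D matrix."""
    return [[max(0, x) for x in row] for row in matrix]

M = [[-3, 19, 5], [7, -6, 12], [4, -8, 17]]
print(relu(M))  # [[0, 19, 5], [7, 0, 12], [4, 0, 17]]
```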
Analogy: Like creating a thumbnail image — you keep the important parts but reduce the size.
- Selects maximum element from each region
- Keeps the strongest features
- Better results in practice
- Calculates average of each region
- Smooths the feature map
2×2 Max Pooling (stride = 2):
Input (4×4):        Output (2×2):
[1 3 | 2 4]         [6  8]
[5 6 | 7 8]    →    [9 11]
-----------
[9 2 | 1 3]
[4 5 | 11 7]
Region maxima: max(1,3,5,6) = 6, max(2,4,7,8) = 8, max(9,2,4,5) = 9, max(1,3,11,7) = 11
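A 2×2 max pooling pass with stride 2 can be sketched as (plain Python, even input dimensions assumed):

```python
def max_pool_2x2(matrix):
    """2×2 max pooling with stride 2 over a 2D matrix (even dimensions assumed)."""
    out = []
    for i in range(0, len(matrix), 2):
        row = []
        for j in range(0, len(matrix[0]), 2):
            row.append(max(matrix[i][j], matrix[i][j + 1],
                           matrix[i + 1][j], matrix[i + 1][j + 1]))
        out.append(row)
    return out

fmap = [[1, 3, 2, 4], [5, 6, 7, 8], [9, 2, 1, 3], [4, 5, 11, 7]]
print(max_pool_2x2(fmap))  # [[6, 8], [9, 11]]
```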
Benefits:
- Reduces spatial dimensions (height × width)
- Decreases parameters → Less computation
- Controls overfitting
- Translation invariance (small shifts don't affect output)
- Retains important features
What if NO pooling?
- Feature maps retain same resolution
- Increased computational complexity
- Higher risk of overfitting
Analogy: Like converting a 2D chessboard into a single line of pieces.
Input: 2D Feature Map (Matrix) Output: 1D Vector
Example:
Feature Map (2×3): Flattened Vector:
[2 5 1] [2, 5, 1, 4, 0, 3]
[4 0 3] →
Purpose: Prepare data for fully connected layers (which require 1D input)
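Flattening is just row-major concatenation of the feature map, as a one-line sketch shows:

```python
def flatten(feature_map):
    """Row-major flatten of a 2D feature map into a 1D list."""
    return [x for row in feature_map for x in row]

print(flatten([[2, 5, 1], [4, 0, 3]]))  # [2, 5, 1, 4, 0, 3]
```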
Analogy: Like a committee where every member (neuron) considers ALL information before voting.
- Every neuron connected to all neurons in previous layer
- Similar to traditional neural networks
- Usually forms the final layers of CNN
- Contains large number of parameters
- Takes feature vector from flattening layer
- Learns complex decision boundaries
- Classifies into different categories
Problem: Prone to overfitting (too many parameters)
Solution: Use Dropout regularization
Analogy: Like training a sports team where random players sit out during practice. This forces all players to be versatile, not relying on specific teammates.
How it Works:
- Randomly disable neurons during training
- Typical dropout rate: 0.5 (50% neurons dropped)
- At test time: All neurons active
Benefits:
- Reduces node-to-node dependencies
- Forces network to learn robust features
- Better generalization to new data
- Improves training speed
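This can be sketched as inverted dropout (the common modern variant — an assumption here, since the text does not name one): during training each activation is zeroed with probability p and survivors are scaled by 1/(1−p), so no change is needed at test time.

```python
import random

def dropout(activations, p=0.5, training=True, seed=None):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training:
        return list(activations)  # all neurons active at test time
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

acts = [0.3, 1.2, 0.8, 2.0]
print(dropout(acts, p=0.5, seed=0))   # roughly half zeroed, survivors doubled
print(dropout(acts, training=False))  # unchanged at test time
```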
Exam Tip:
- Dropout is more effective in fully connected layers
- Batch Normalization is more effective in convolutional layers
Analogy: Like a weather forecaster giving probabilities: 70% rain, 20% cloudy, 10% sunny. All probabilities sum to 100%.
Purpose: Convert raw scores (logits) into probability distribution
Properties:
- All outputs between 0 and 1
- Sum of all outputs = 1.0
- Used for multi-class classification
Example:
Logits (Raw Outputs):    Softmax Probabilities:
[1.3]                    [0.08] → 8% Class 1
[3.1]        →           [0.51] → 51% Class 2 ✓ (Predicted)
[2.2]                    [0.21] → 21% Class 3
[0.7]                    [0.05] → 5% Class 4
[1.9]                    [0.15] → 15% Class 5
                         -----
                         Sum = 1.00
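Softmax itself is just exponentiation followed by normalization. A sketch (subtracting the max logit first is a standard numerical-stability trick, not something the text specifies):

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.3, 3.1, 2.2, 0.7, 1.9])
print([round(p, 2) for p in probs])  # class 2 dominates; probabilities sum to 1
```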
Exam Tip:
- Binary classification: Use Sigmoid/Logistic function
- Multi-class classification: Use Softmax
Most Important Formula for Exams: $$ \text{Output Size} = \left\lfloor \frac{N - F + 2P}{S} \right\rfloor + 1 $$
Where:
- N = Input dimension (height or width)
- F = Kernel/Filter size
- P = Padding
- S = Stride
- ⌊ ⌋ = Floor function (round down)
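The formula translates directly into a small helper function (sketch; Python's `//` already floors the division):

```python
def output_size(n, f, p=0, s=1):
    """Spatial output dimension: floor((N - F + 2P) / S) + 1."""
    return (n - f + 2 * p) // s + 1

print(output_size(10, 3, p=1, s=1))  # 10: dimensions preserved ("same" padding)
print(output_size(5, 3, p=0, s=1))   # 3
print(output_size(32, 5, p=2, s=2))  # 16
```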
For Convolutional Layers: $$ \text{Parameters} = (\text{Kernel Height} \times \text{Kernel Width} \times \text{Input Channels} + 1) \times \text{Output Channels} $$
The +1 accounts for the bias term.
| Operation | Changes Dimensions? | Adds Parameters? |
|---|---|---|
| Convolution | Yes (depends on P, S) | Yes (weights + bias) |
| Batch Normalization | No | Yes (scale + shift) |
| ReLU | No | No |
| Pooling | Yes (reduces size) | No |
| Flatten | Yes (2D → 1D) | No |
| Dropout | No | No |
| Softmax | No | No |
Given:
- Input: 10 × 10 × 10 (Width × Height × Channels)
- Operations:
- 3×3 Conv (40 channels), stride=1, padding=1
- ReLU
- 3×3 Max Pooling, stride=1, padding=1
- 3×3 Conv (20 channels), stride=1, padding=1
- ReLU
- 2×2 Max Pooling, stride=2, padding=1
Solution:
Step 1: 3×3 Convolution (40 channels) $$ \text{Width} = \left\lfloor \frac{10 - 3 + 2(1)}{1} \right\rfloor + 1 = \left\lfloor \frac{9}{1} \right\rfloor + 1 = 10 $$ $$ \text{Height} = \left\lfloor \frac{10 - 3 + 2(1)}{1} \right\rfloor + 1 = 10 $$ Output: 10 × 10 × 40 ✓
Step 2: ReLU
- No dimension change
Output: 10 × 10 × 40 ✓
Step 3: 3×3 Max Pooling $$ \text{Width} = \left\lfloor \frac{10 - 3 + 2(1)}{1} \right\rfloor + 1 = 10 $$ $$ \text{Height} = \left\lfloor \frac{10 - 3 + 2(1)}{1} \right\rfloor + 1 = 10 $$ Output: 10 × 10 × 40 ✓
Step 4: 3×3 Convolution (20 channels) $$ \text{Width} = \left\lfloor \frac{10 - 3 + 2(1)}{1} \right\rfloor + 1 = 10 $$ $$ \text{Height} = \left\lfloor \frac{10 - 3 + 2(1)}{1} \right\rfloor + 1 = 10 $$ Output: 10 × 10 × 20 ✓
Step 5: ReLU
- No dimension change
Output: 10 × 10 × 20 ✓
Step 6: 2×2 Max Pooling (stride=2) $$ \text{Width} = \left\lfloor \frac{10 - 2 + 2(1)}{2} \right\rfloor + 1 = \left\lfloor \frac{10}{2} \right\rfloor + 1 = 6 $$ $$ \text{Height} = \left\lfloor \frac{10 - 2 + 2(1)}{2} \right\rfloor + 1 = 6 $$ Output: 6 × 6 × 20 ✓
Final Answer: 6 × 6 × 20
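The six steps can be chained programmatically (a sketch reusing the dimension formula; ReLU is skipped since it never changes dimensions, and the channel count changes only at convolutions):

```python
def output_size(n, f, p=0, s=1):
    """floor((N - F + 2P) / S) + 1"""
    return (n - f + 2 * p) // s + 1

w = h = 10; c = 10                                               # input 10 × 10 × 10
w, h, c = output_size(w, 3, 1, 1), output_size(h, 3, 1, 1), 40   # conv 3×3, 40 ch
w, h = output_size(w, 3, 1, 1), output_size(h, 3, 1, 1)          # 3×3 max pool, s=1, p=1
w, h, c = output_size(w, 3, 1, 1), output_size(h, 3, 1, 1), 20   # conv 3×3, 20 ch
w, h = output_size(w, 2, 1, 2), output_size(h, 2, 1, 2)          # 2×2 max pool, s=2, p=1
print(w, h, c)  # 6 6 20
```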
Same network as Example 1. Calculate total parameters.
Solution:
First Convolution (3×3, 40 channels):
- Kernel: 3 × 3
- Input Channels: 10
- Output Channels: 40
- Parameters: (3 × 3 × 10 + 1) × 40 = 91 × 40 = 3,640
Second Convolution (3×3, 20 channels):
- Kernel: 3 × 3
- Input Channels: 40 (from previous layer)
- Output Channels: 20
- Parameters: (3 × 3 × 40 + 1) × 20 = 361 × 20 = 7,220
ReLU and Pooling: 0 parameters (no learnable weights)
Total Parameters: $$ 3,640 + 7,220 = \boxed{10,860} $$
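The parameter formula as a helper function, checked against this example (sketch):

```python
def conv_params(kh, kw, c_in, c_out):
    """(kernel_h × kernel_w × in_channels + 1) × out_channels; the +1 is the bias."""
    return (kh * kw * c_in + 1) * c_out

first  = conv_params(3, 3, 10, 40)   # 3,640
second = conv_params(3, 3, 40, 20)   # 7,220
print(first + second)                # 10,860
```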
Given:
Input (5×5): Kernel (3×3):
[1 2 0 3 2] [1 2 2]
[2 1 1 1 1] [0 0 0]
[0 5 0 0 0] [-1 -2 -1]
[3 7 0 6 0]
[1 1 3 2 0]
Stride = 1, Padding = 0
Calculate output dimensions: $$ \text{Output} = \left\lfloor \frac{5 - 3 + 0}{1} \right\rfloor + 1 = 3 $$ Output will be 3 × 3
Calculate O₁,₁ (top-left output):
Patch: Kernel:
[1 2 0] [1 2 2]
[2 1 1] × [0 0 0]
[0 5 0] [-1 -2 -1]
Calculation:
(1×1) + (2×2) + (0×2) +
(2×0) + (1×0) + (1×0) +
(0×-1) + (5×-2) + (0×-1)
= 1 + 4 + 0 + 0 + 0 + 0 + 0 - 10 + 0 = -5
Complete Feature Map:
[-5 3 10]
[-11 -8 -7]
[4 -4 -7]
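The full 3×3 convolution over the 5×5 input (stride 1, no padding) can be reproduced with a plain-Python sketch:

```python
def conv2d(image, kernel):
    """Valid 2D convolution (stride 1, no padding), as in the worked example."""
    k = len(kernel)
    out_dim = len(image) - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_dim)]
            for i in range(out_dim)]

image = [[1, 2, 0, 3, 2],
         [2, 1, 1, 1, 1],
         [0, 5, 0, 0, 0],
         [3, 7, 0, 6, 0],
         [1, 1, 3, 2, 0]]
kernel = [[1, 2, 2], [0, 0, 0], [-1, -2, -1]]
print(conv2d(image, kernel))  # [[-5, 3, 10], [-11, -8, -7], [4, -4, -7]]
```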
Step-by-Step Process:
1. Input: Image (pixel matrix) enters the network
2. Convolution: Filters slide over image, extract features
   - Creates feature maps
   - Parameter sharing reduces complexity
3. Batch Normalization: Normalize activations
   - Stabilizes learning
   - Reduces covariate shift
4. ReLU: Non-linear activation
   - Keeps positive values
   - Sets negatives to zero
5. Pooling: Downsample feature maps
   - Reduces dimensions
   - Retains important features
   - Provides translation invariance
6. Repeat 2-5: Multiple times for deeper features
   - Early layers: Low-level features (edges, textures)
   - Deep layers: High-level features (shapes, objects)
7. Flatten: Convert 2D feature maps to 1D vector
8. Fully Connected: Learn decision boundaries
   - Apply dropout to prevent overfitting
9. Softmax: Convert to probabilities
   - Final classification output
| Aspect | Traditional NN | CNN |
|---|---|---|
| Connectivity | Fully connected | Locally connected |
| Parameters | Very high | Reduced (sharing) |
| Input Type | 1D vectors | 2D/3D images |
| Spatial Info | Lost | Preserved |
| Best For | Tabular data | Images, spatial data |
| Aspect | Batch Normalization | Dropout |
|---|---|---|
| Best in | Convolutional layers | Fully connected layers |
| Purpose | Reduce covariate shift | Prevent overfitting |
| During Test | Active (with learned stats) | Inactive |
| Effect | Normalizes activations | Randomly drops neurons |
- Always use floor function in dimension calculations
- Remember the +1 in bias for parameter counting
- Check if padding is specified (default is usually 0)
- ReLU and Pooling don't add parameters
- Channels dimension doesn't change in pooling
- Don't forget the floor operation ⌊ ⌋
- Don't confuse stride with kernel size
- Don't count activation functions as having parameters
- Don't mix up input channels vs output channels
- Don't forget to add bias (+1) in parameter formula
Input: 32 × 32 × 3
Conv: 5×5 kernel, 64 filters, stride=2, padding=2
What is the output dimension?
Answer: $$ \left\lfloor \frac{32 - 5 + 2(2)}{2} \right\rfloor + 1 = \left\lfloor \frac{31}{2} \right\rfloor + 1 = 15 + 1 = 16 $$ Output: 16 × 16 × 64
How many parameters in above convolution? $$ (5 \times 5 \times 3 + 1) \times 64 = 76 \times 64 = 4,864 $$
If RGB image is 256 × 256, how many pixel values total? $$ 256 \times 256 \times 3 = 196,608 $$
- Output Dimension: $\left\lfloor \frac{N - F + 2P}{S} \right\rfloor + 1$
- Parameter Count: $(F_h \times F_w \times C_{in} + 1) \times C_{out}$
- ReLU: $\max(0, x)$
- Convolution (extract features)
- Batch Normalization (stabilize)
- ReLU (activate)
- Pooling (downsample)
- Can calculate output dimensions for any P, S, F, N?
- Can count parameters for convolutional layers?
- Understand difference between local vs full connectivity?
- Know when to use Batch Norm vs Dropout?
- Can perform manual convolution calculation?
- Understand purpose of each layer type?
- Know which operations add parameters?
- Can explain parameter sharing benefit?
- Understand flattening process?
- Know Softmax vs Sigmoid usage?
Remember: CNNs are just systematic pattern extractors. Each layer has a specific job, and together they transform raw pixels into meaningful predictions!