📉 Linear Regression from Scratch using Gradient Descent

A vectorized, modular implementation of Linear Regression using Gradient Descent, benchmarked against Scikit-Learn in high-dimensional biological scenarios.

Project Overview

While sklearn.linear_model.LinearRegression is the industry standard, its reliance on the Singular Value Decomposition (SVD) analytical solution creates significant memory bottlenecks when handling high-dimensional datasets.

To address this, this project explores the use of Gradient Descent as an iterative alternative. Instead of attempting to solve the entire system at once through computationally expensive matrix decompositions, Gradient Descent approaches the optimal solution in steps. This study aims to demonstrate how this iterative process can be more resource-efficient in high-dimensional contexts, providing a practical way to manage memory usage without sacrificing the model's ability to learn from the data.

Key Features

Vectorized Implementation: Fully optimized matrix operations using NumPy (no slow for loops).
Automatic Feature Engineering: Built-in Z-Score Normalization and One-Hot Encoding handling.
Dual-Scenario Validation: Benchmarked using generated Synthetic Data across two distinct domains:
- Agro-Ecological Case: Verified convergence and prediction accuracy against Scikit-Learn.
- Single-Cell Genomics Case: Stress-tested memory efficiency on high-dimensional data ($N=5,000, P=20,000$).
Scikit-Learn Compatible: Follows the standard .fit() and .predict() API design.

Theoretical Foundation & Algorithm

Scikit-Learn Approach

While the textbook definition of Linear Regression uses the Normal Equation $\theta = (X^T X)^{-1} X^T y$, production-grade libraries like scikit-learn avoid this method due to numerical instability when features are correlated.

Instead, sklearn.linear_model.LinearRegression relies on Singular Value Decomposition (SVD) of the data matrix $X$ (specifically using LAPACK's gelsd driver). It decomposes the matrix such that: $$X = U \Sigma V^T$$ The solver then computes the pseudo-inverse to find the weights. While numerically stable, this approach requires decomposing the entire matrix in memory, leading to a computational cost of roughly $O(N \cdot P^2)$ or $O(P^3)$. In our Genomics scenario ($P=20,000$), this matrix decomposition becomes the primary memory bottleneck.

Iterative Approach

To bypass the memory cost of matrix decomposition, I have implemented Batch Gradient Descent. Instead of solving the system in a single step, we optimize the weights iteratively by following the slope of the error surface.

Algorithm Steps:

Prediction: $\hat{y} = Xw + b$
Cost Function (MSE): $J(w,b) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$
Gradient Calculation: $\nabla J = \frac{1}{n} X^T (\hat{y} - y)$
Weight Update: $w = w - \alpha \cdot \nabla J$

Complexity & Notation Analysis

The key advantage of our approach is visible when analyzing the algorithmic complexity per step.

Symbol	Definition	Context in this Project
$N$	Number of Samples	5,000 (Cells)
$P$	Number of Features	20,000 (Genes)
$k$	Number of Iterations	e.g., 1,000 steps
$\alpha$	Learning Rate	Step size (Hyperparameter)

Comparative Complexity:

Analytical (SVD/OLS): $O(N \cdot P^2)$ — Quadratic/Cubic relative to Features.
Gradient Descent: $O(k \cdot N \cdot P)$ — Linear relative to Features.

By keeping the complexity linear with respect to $P$, Gradient Descent allows us to handle High-Dimensional Data ($P \gg N$) with significantly lower peak memory usage, as demonstrated in our benchmarks.

Methodology & Experimental Design

To ensure a fair evaluation, I developed two testing environments using Synthetic Data. This allowed us to control the "Ground Truth" and measure how well the model recovers known biological or ecological signals.

Scenario 1: Agro-Ecological Accuracy

Context: Predicting crop yield based on variables like Rainfall, Soil pH, and Fertilizer type.
Challenge: Handling categorical data and features with vastly different scales.
Validation: Comparison of RMSE and weight recovery.

Scenario 2: High-Dimensional Genomics (Stress Test)

Context: Single-Cell RNA Sequencing (scRNA-seq) where expression is measured across thousands of genes.
Dataset: $N=5,000$ cells and $P=20,000$ genes.
Challenge: Peak memory management and training efficiency in an underdetermined system ($P \gg N$).

Implementation Highlights

Beyond the core algorithm, this project tackles two critical challenges often overlooked in academic implementations but vital for production environments.

Feature Scaling

Gradient Descent is highly sensitive to the scale of input features. If one variable ranges from 0-1 (e.g., Soil pH) and another from 0-1000 (e.g., Rainfall), the cost function contours become elongated ellipses. The optimizer "zigzags" across the valley, taking much longer to reach the global minimum. So, I implemented automatic Z-Score Normalization inside the fit method:

$$x' = \frac{x - \mu}{\sigma}$$ This transforms the error surface into a symmetric bowl (spherical contours), allowing the gradient to point directly towards the minimum for faster and stable convergence.

Schema Consistency

A common failure mode in deployment occurs when the test set (or live data) lacks specific categorical levels present during training (e.g., the training set has "Fertilizer A, B, C", but a specific test batch only contains "A and B"). Instead of simple One-Hot Encoding which would result in mismatched matrix dimensions, I implemented a robust schema enforcement using pandas.reindex inside the predict method. This guarantees that the model always receives the exact feature structure it expects, filling missing dummy columns with zeros automatically to prevent dimensionality errors.

Experimental Results & Benchmarking

Scenario 1: Convergence and Precision

The model demonstrated a perfect convergence path. As shown in the dashboard, our custom model recovered the "Ground Truth" signal with high correlation, matching the precision of the analytical solution.

Left: Cost function decreasing smoothly. Right: Predicted vs. Actual values showing high accuracy.

Scenario 2: Computational Efficiency

In the high-dimensional Genomics test, the iterative nature of Gradient Descent proved to be more resource-efficient.

Speed: Even in pure Python, our vectorized implementation achieved a ~1.13x speedup over Scikit-Learn in this specific high-dimensional layout ($P \gg N$).
Memory: Our model achieved ~11.4% less peak RAM usage by avoiding expensive matrix decompositions.

Project Structure

LINEAR-REGRESSION-FROM-SCRATCH/
│
├── src/
│   ├── __init__.py
│   └── linear_regression_gd.py                      # Core Class
│
├── notebooks/
│   ├── agro_case_study.ipynb                      # Scenario 1: Accuracy Check
│   └── single_cell_genomics_stress_test.ipynb     # Scenario 2: Scalability
│
├── img/                                             # Benchmark Visualizations
├── requirements.txt                                 # Dependencies
└── README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

📉 Linear Regression from Scratch using Gradient Descent

Project Overview

Key Features

Theoretical Foundation & Algorithm

Scikit-Learn Approach

Iterative Approach

Complexity & Notation Analysis

Methodology & Experimental Design

Scenario 1: Agro-Ecological Accuracy

Scenario 2: High-Dimensional Genomics (Stress Test)

Implementation Highlights

Feature Scaling

Schema Consistency

Experimental Results & Benchmarking

Scenario 1: Convergence and Precision

Scenario 2: Computational Efficiency

Project Structure

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
img		img
notebooks		notebooks
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

📉 Linear Regression from Scratch using Gradient Descent

Project Overview

Key Features

Theoretical Foundation & Algorithm

Scikit-Learn Approach

Iterative Approach

Complexity & Notation Analysis

Methodology & Experimental Design

Scenario 1: Agro-Ecological Accuracy

Scenario 2: High-Dimensional Genomics (Stress Test)

Implementation Highlights

Feature Scaling

Schema Consistency

Experimental Results & Benchmarking

Scenario 1: Convergence and Precision

Scenario 2: Computational Efficiency

Project Structure

Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages