A vectorized, modular implementation of Linear Regression using Gradient Descent, benchmarked against Scikit-Learn in high-dimensional biological scenarios.
While sklearn.linear_model.LinearRegression is the industry standard, its reliance on the Singular Value Decomposition (SVD) analytical solution creates significant memory bottlenecks when handling high-dimensional datasets.
To address this, this project explores the use of Gradient Descent as an iterative alternative. Instead of attempting to solve the entire system at once through computationally expensive matrix decompositions, Gradient Descent approaches the optimal solution in steps. This study aims to demonstrate how this iterative process can be more resource-efficient in high-dimensional contexts, providing a practical way to manage memory usage without sacrificing the model's ability to learn from the data.
-
Vectorized Implementation: Fully optimized matrix operations using NumPy (no slow
forloops). - Automatic Feature Engineering: Built-in Z-Score Normalization and One-Hot Encoding handling.
-
Dual-Scenario Validation: Benchmarked using generated Synthetic Data across two distinct domains:
- Agro-Ecological Case: Verified convergence and prediction accuracy against Scikit-Learn.
-
Single-Cell Genomics Case: Stress-tested memory efficiency on high-dimensional data (
$N=5,000, P=20,000$ ).
-
Scikit-Learn Compatible: Follows the standard
.fit()and.predict()API design.
While the textbook definition of Linear Regression uses the Normal Equation scikit-learn avoid this method due to numerical instability when features are correlated.
Instead, sklearn.linear_model.LinearRegression relies on Singular Value Decomposition (SVD) of the data matrix gelsd driver). It decomposes the matrix such that:
To bypass the memory cost of matrix decomposition, I have implemented Batch Gradient Descent. Instead of solving the system in a single step, we optimize the weights iteratively by following the slope of the error surface.
Algorithm Steps:
-
Prediction:
$\hat{y} = Xw + b$ -
Cost Function (MSE):
$J(w,b) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$ -
Gradient Calculation:
$\nabla J = \frac{1}{n} X^T (\hat{y} - y)$ -
Weight Update:
$w = w - \alpha \cdot \nabla J$
The key advantage of our approach is visible when analyzing the algorithmic complexity per step.
| Symbol | Definition | Context in this Project |
|---|---|---|
| Number of Samples | 5,000 (Cells) | |
| Number of Features | 20,000 (Genes) | |
| Number of Iterations | e.g., 1,000 steps | |
| Learning Rate | Step size (Hyperparameter) |
Comparative Complexity:
-
Analytical (SVD/OLS):
$O(N \cdot P^2)$ — Quadratic/Cubic relative to Features. -
Gradient Descent:
$O(k \cdot N \cdot P)$ — Linear relative to Features.
By keeping the complexity linear with respect to
To ensure a fair evaluation, I developed two testing environments using Synthetic Data. This allowed us to control the "Ground Truth" and measure how well the model recovers known biological or ecological signals.
- Context: Predicting crop yield based on variables like Rainfall, Soil pH, and Fertilizer type.
- Challenge: Handling categorical data and features with vastly different scales.
- Validation: Comparison of RMSE and weight recovery.
- Context: Single-Cell RNA Sequencing (scRNA-seq) where expression is measured across thousands of genes.
-
Dataset:
$N=5,000$ cells and$P=20,000$ genes. -
Challenge: Peak memory management and training efficiency in an underdetermined system (
$P \gg N$ ).
Beyond the core algorithm, this project tackles two critical challenges often overlooked in academic implementations but vital for production environments.
Gradient Descent is highly sensitive to the scale of input features. If one variable ranges from 0-1 (e.g., Soil pH) and another from 0-1000 (e.g., Rainfall), the cost function contours become elongated ellipses. The optimizer "zigzags" across the valley, taking much longer to reach the global minimum. So, I implemented automatic Z-Score Normalization inside the fit method:
A common failure mode in deployment occurs when the test set (or live data) lacks specific categorical levels present during training (e.g., the training set has "Fertilizer A, B, C", but a specific test batch only contains "A and B"). Instead of simple One-Hot Encoding which would result in mismatched matrix dimensions, I implemented a robust schema enforcement using pandas.reindex inside the predict method. This guarantees that the model always receives the exact feature structure it expects, filling missing dummy columns with zeros automatically to prevent dimensionality errors.
The model demonstrated a perfect convergence path. As shown in the dashboard, our custom model recovered the "Ground Truth" signal with high correlation, matching the precision of the analytical solution.
Left: Cost function decreasing smoothly. Right: Predicted vs. Actual values showing high accuracy.
In the high-dimensional Genomics test, the iterative nature of Gradient Descent proved to be more resource-efficient.
-
Speed: Even in pure Python, our vectorized implementation achieved a ~1.13x speedup over Scikit-Learn in this specific high-dimensional layout (
$P \gg N$ ). - Memory: Our model achieved ~11.4% less peak RAM usage by avoiding expensive matrix decompositions.
LINEAR-REGRESSION-FROM-SCRATCH/
│
├── src/
│ ├── __init__.py
│ └── linear_regression_gd.py # Core Class
│
├── notebooks/
│ ├── agro_case_study.ipynb # Scenario 1: Accuracy Check
│ └── single_cell_genomics_stress_test.ipynb # Scenario 2: Scalability
│
├── img/ # Benchmark Visualizations
├── requirements.txt # Dependencies
└── README.md
