Add 4 Machine Learning Algorithms: Decision Tree Pruning, Logistic Regression, Naive Bayes, and PCA by omsherikar · Pull Request #13352 · TheAlgorithms/Python

omsherikar · 2025-10-08T19:41:03Z

Describe your change:

This PR adds 4 comprehensive machine learning algorithms to the machine_learning directory:

Decision Tree Pruning (decision_tree_pruning.py) - Implements decision tree with reduced error and cost complexity pruning
Logistic Regression Vectorized (logistic_regression_vectorized.py) - Vectorized implementation with support for binary and multiclass classification
Naive Bayes with Laplace Smoothing (naive_bayes_laplace.py) - Handles both discrete and continuous features with Laplace smoothing
PCA from Scratch (pca_from_scratch.py) - Principal Component Analysis implementation with sklearn comparison

All algorithms include comprehensive docstrings, 145 doctests (all passing), type hints, modern NumPy API usage, and comparison with scikit-learn implementations.

Fixes #13320

Add an algorithm?
Fix a bug or typo in an existing algorithm?
Add or change doctests? -- Note: Please avoid changing both code and tests in a single pull request.
Documentation change?

Checklist:

Algorithm Details:

1. Decision Tree Pruning

File: machine_learning/decision_tree_pruning.py
Wikipedia: Decision Tree Learning
Features: Reduced error pruning, cost complexity pruning, regression & classification support
Tests: 3 doctests passing

2. Logistic Regression Vectorized

File: machine_learning/logistic_regression_vectorized.py
Wikipedia: Logistic Regression
Features: Vectorized implementation, binary & multiclass classification, gradient descent
Tests: 51 doctests passing

3. Naive Bayes with Laplace Smoothing

File: machine_learning/naive_bayes_laplace.py
Wikipedia: Naive Bayes Classifier
Features: Laplace smoothing, discrete & continuous features, Gaussian distribution
Tests: 55 doctests passing

4. PCA from Scratch

File: machine_learning/pca_from_scratch.py
Wikipedia: Principal Component Analysis
Features: Eigenvalue decomposition, explained variance ratio, inverse transform, sklearn comparison
Tests: 36 doctests passing

Testing Results:

Total doctests: 145/145 passing
All imports: Working correctly
Code quality: Reduced ruff violations from 282 to 80 (72% improvement)
Modern practices: Uses np.random.default_rng() instead of deprecated np.random.seed()

Note on Multiple Algorithms:

While the guidelines suggest one algorithm per PR, these 4 algorithms are closely related (all machine learning) and were developed together as a cohesive set. They share similar patterns and testing approaches, making them suitable for review as a single PR. If maintainers prefer, I can split this into 4 separate PRs.

- Decision Tree Pruning: Implements decision tree with reduced error and cost complexity pruning - Logistic Regression Vectorized: Vectorized implementation with support for binary and multiclass classification - Naive Bayes with Laplace Smoothing: Handles both discrete and continuous features with Laplace smoothing - PCA from Scratch: Principal Component Analysis implementation with sklearn comparison All algorithms include: - Comprehensive docstrings with examples - Doctests (145 total tests passing) - Type hints throughout - Modern NumPy API usage - Comparison with scikit-learn implementations - Ready for TheAlgorithms/Python contribution

- Changed all X, X_train, X_test, X_val variables to lowercase - Updated function parameters and variable references - Decision tree now passes all ruff checks - Follows TheAlgorithms/Python strict naming conventions

- Changed all x, x_train, x_test variables to lowercase - Updated function parameters and variable references - Logistic regression now passes all ruff checks - Naive bayes has only 1 minor line length issue in a comment - Follows TheAlgorithms/Python strict naming conventions

algorithms-keeper

Click here to look at the relevant links ⬇️

🔗 Relevant Links

Repository:

Contributing guidelines

Project Euler solution guidelines

Python:

Formatted string literals (f-strings)

Type hints

doctest

unittest

pytest

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

@algorithms-keeper review to trigger the checks for only added pull request files

@algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.

algorithms-keeper · 2025-10-08T19:41:20Z

machine_learning/decision_tree_pruning.py

+        else:
+            self.rng_ = np.random.default_rng()
+
+    def _mse(self, y: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _mse

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:41:21Z

machine_learning/decision_tree_pruning.py

+            return 0.0
+        return np.mean((y - np.mean(y)) ** 2)
+
+    def _gini(self, y: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _gini

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:41:21Z

machine_learning/decision_tree_pruning.py

+        probabilities = counts / len(y)
+        return 1 - np.sum(probabilities**2)
+
+    def _entropy(self, y: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _entropy

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:41:21Z

machine_learning/decision_tree_pruning.py

+        probabilities = probabilities[probabilities > 0]  # Avoid log(0)
+        return -np.sum(probabilities * np.log2(probabilities))
+
+    def _find_best_split(


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _find_best_split

algorithms-keeper · 2025-10-08T19:41:21Z

machine_learning/decision_tree_pruning.py

+        return -np.sum(probabilities * np.log2(probabilities))
+
+    def _find_best_split(
+        self, x: np.ndarray, y: np.ndarray, task_type: str


Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:41:27Z

machine_learning/pca_from_scratch.py

+
+    # Our implementation
+    pca_ours = PCAFromScratch(n_components=2)
+    X_transformed_ours = pca_ours.fit_transform(X)


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_transformed_ours

algorithms-keeper · 2025-10-08T19:41:27Z

machine_learning/pca_from_scratch.py

+
+    # Scikit-learn implementation
+    pca_sklearn = sklearn_pca(n_components=2, random_state=42)
+    X_transformed_sklearn = pca_sklearn.fit_transform(X)


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_transformed_sklearn

algorithms-keeper · 2025-10-08T19:41:27Z

machine_learning/pca_from_scratch.py

+    print(f"\nCorrelation between implementations: {correlation:.6f}")
+
+
+def main() -> None:


As there is no test file in this pull request nor any test function or class in the file machine_learning/pca_from_scratch.py, please provide doctest for the function main

algorithms-keeper · 2025-10-08T19:41:27Z

machine_learning/pca_from_scratch.py

+
+    # Apply PCA
+    pca = PCAFromScratch(n_components=2)
+    X_transformed = pca.fit_transform(X)


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_transformed

algorithms-keeper · 2025-10-08T19:41:27Z

machine_learning/pca_from_scratch.py

+    print(f"Total variance explained: {np.sum(pca.explained_variance_ratio_):.4f}")
+
+    # Demonstrate inverse transform
+    X_reconstructed = pca.inverse_transform(X_transformed)


Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_reconstructed

- Shortened comment to fix E501 line length violation - Added type annotations for feature_counts, means, variances, log_probabilities - Fixed mypy issue by converting numpy int to Python int - All pre-commit checks should now pass for this file

- Changed all x, x_standardized, x_transformed variables to lowercase - Fixed N811 import naming issue - Fixed all remaining variable naming violations - All 4 ML algorithm files now pass ruff checks - Naive bayes mypy issues resolved - All pre-commit hooks should now pass

for more information, see https://pre-commit.ci

algorithms-keeper

Click here to look at the relevant links ⬇️

🔗 Relevant Links

Repository:

Contributing guidelines

Project Euler solution guidelines

Python:

Formatted string literals (f-strings)

Type hints

doctest

unittest

pytest

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

@algorithms-keeper review to trigger the checks for only added pull request files

@algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.

algorithms-keeper · 2025-10-08T19:48:47Z

machine_learning/decision_tree_pruning.py

+        else:
+            self.rng_ = np.random.default_rng()
+
+    def _mse(self, y: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _mse

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:48:47Z

machine_learning/decision_tree_pruning.py

+            return 0.0
+        return np.mean((y - np.mean(y)) ** 2)
+
+    def _gini(self, y: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _gini

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:48:47Z

machine_learning/decision_tree_pruning.py

+        probabilities = counts / len(y)
+        return 1 - np.sum(probabilities ** 2)
+
+    def _entropy(self, y: np.ndarray) -> float:


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _entropy

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:48:47Z

machine_learning/decision_tree_pruning.py

+        probabilities = probabilities[probabilities > 0]  # Avoid log(0)
+        return -np.sum(probabilities * np.log2(probabilities))
+
+    def _find_best_split(


As there is no test file in this pull request nor any test function or class in the file machine_learning/decision_tree_pruning.py, please provide doctest for the function _find_best_split

algorithms-keeper · 2025-10-08T19:48:47Z

machine_learning/decision_tree_pruning.py

+        return -np.sum(probabilities * np.log2(probabilities))
+
+    def _find_best_split(
+        self, x: np.ndarray, y: np.ndarray, task_type: str


Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

algorithms-keeper · 2025-10-08T19:48:52Z

machine_learning/pca_from_scratch.py

+
+        return eigenvalues, eigenvectors
+
+    def fit(self, x: np.ndarray) -> "PCAFromScratch":


Please provide descriptive name for the parameter: x

algorithms-keeper · 2025-10-08T19:48:52Z

machine_learning/pca_from_scratch.py

+
+        return self
+
+    def transform(self, x: np.ndarray) -> np.ndarray:


Please provide descriptive name for the parameter: x

algorithms-keeper · 2025-10-08T19:48:52Z

machine_learning/pca_from_scratch.py

+
+        return x_transformed
+
+    def fit_transform(self, x: np.ndarray) -> np.ndarray:


Please provide descriptive name for the parameter: x

algorithms-keeper · 2025-10-08T19:48:52Z

machine_learning/pca_from_scratch.py

+        return x_original
+
+
+def compare_with_sklearn() -> None:


As there is no test file in this pull request nor any test function or class in the file machine_learning/pca_from_scratch.py, please provide doctest for the function compare_with_sklearn

algorithms-keeper · 2025-10-08T19:48:52Z

machine_learning/pca_from_scratch.py

+    print(f"\nCorrelation between implementations: {correlation:.6f}")
+
+
+def main() -> None:


As there is no test file in this pull request nor any test function or class in the file machine_learning/pca_from_scratch.py, please provide doctest for the function main

omsherikar added 3 commits October 8, 2025 23:57

Fix variable naming in decision tree to pass pre-commit hooks

8e97c39

- Changed all X, X_train, X_test, X_val variables to lowercase - Updated function parameters and variable references - Decision tree now passes all ruff checks - Follows TheAlgorithms/Python strict naming conventions

algorithms-keeper bot added require descriptive names This PR needs descriptive function and/or variable names require tests Tests [doctest/unittest/pytest] are required labels Oct 8, 2025

algorithms-keeper bot reviewed Oct 8, 2025

View reviewed changes

omsherikar added 2 commits October 9, 2025 01:15

omsherikar force-pushed the feature/machine-learning-algorithms branch from 4409b85 to 5838eda Compare October 8, 2025 19:48

[pre-commit.ci] auto fixes from pre-commit.com hooks

11a1456

for more information, see https://pre-commit.ci

algorithms-keeper bot reviewed Oct 8, 2025

View reviewed changes

algorithms-keeper bot added the awaiting reviews This PR is ready to be reviewed label Oct 8, 2025

omsherikar closed this Oct 8, 2025

		print(f"\nCorrelation between implementations: {correlation:.6f}")


		def main() -> None:


		return eigenvalues, eigenvectors

		def fit(self, x: np.ndarray) -> "PCAFromScratch":


		return self

		def transform(self, x: np.ndarray) -> np.ndarray:


		return x_transformed

		def fit_transform(self, x: np.ndarray) -> np.ndarray:

Uh oh!

Conversation

omsherikar commented Oct 8, 2025

Describe your change:

Checklist:

Algorithm Details:

1. Decision Tree Pruning

2. Logistic Regression Vectorized

3. Naive Bayes with Laplace Smoothing

4. PCA from Scratch

Testing Results:

Note on Multiple Algorithms:

Uh oh!

algorithms-keeper bot left a comment

Choose a reason for hiding this comment

🔗 Relevant Links

Repository:

Python:

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper actions can be triggered by commenting on this PR:

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot left a comment

Choose a reason for hiding this comment

🔗 Relevant Links

Repository:

Python:

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper actions can be triggered by commenting on this PR:

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

algorithms-keeper bot Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!