Welcome to the Data Science Projects repository. It contains a collection of data science projects implemented with a range of technologies and models.
- Description: This analysis builds a linear regression model to predict vehicle fuel efficiency (measured in miles per gallon, mpg) from features such as the number of cylinders, displacement, horsepower, weight, acceleration, model year, and origin.
- Learning Type: Supervised Learning
- Technologies Used: python, pandas, seaborn, matplotlib, scikit-learn, statsmodels, scipy
- Algorithms/Models Used: Linear Regression, Ridge Regression
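
The comparison of linear and ridge regression described above could look roughly like the sketch below. It is only an illustration: the data is synthetic, the two features (`weight`, `horsepower`) and the coefficients used to generate `y` are invented, and the project's own notebook may prepare the data differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the vehicle data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 300
weight = rng.uniform(1600, 5000, n)
horsepower = rng.uniform(50, 230, n)
X = np.column_stack([weight, horsepower])
# Invented relationship: heavier, more powerful cars get lower mpg.
y = 46.0 - 0.006 * weight - 0.04 * horsepower + rng.normal(0, 1.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale features first so the ridge penalty treats them comparably.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

# Fit each model and record R^2 on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

With only two weakly correlated features, the two models should score almost identically; ridge tends to matter more when predictors such as displacement, horsepower, and weight are strongly collinear.
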
- Description: This project builds classification models to predict whether a retail bank customer is likely to churn. It explores demographic, behavioural, and financial features using logistic regression and random forest models. Feature engineering, threshold tuning, and model evaluation are used to identify high-risk customers and support retention strategies.
- Learning Type: Supervised Learning
- Technologies Used: python, pandas, numpy, seaborn, matplotlib, scikit-learn
- Algorithms/Models Used: Logistic Regression, Random Forest, GridSearchCV, ROC Curve Analysis, Precision-Recall Threshold Tuning
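
As a rough illustration of the threshold tuning mentioned above, the sketch below fits a logistic regression on synthetic, imbalanced data and picks the probability cutoff that maximises F1 on a validation split. The dataset, the choice of F1 as the criterion, and all variable names are assumptions made for the example, not the project's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for customer data: about 20% positives ("churners").
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]  # predicted churn probability
auc = roc_auc_score(y_val, proba)

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Customers at or above the tuned cutoff are flagged as high risk.
churn_flags = proba >= best_threshold
print(f"AUC={auc:.3f}, threshold={best_threshold:.3f}, flagged={churn_flags.sum()}")
```

A cutoff chosen this way should be confirmed on a separate test set, since tuning and evaluating on the same split gives an optimistic estimate.
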
- Description: This project applies clustering techniques to bank customer data to uncover distinct behavioural segments and assess their relationship with churn. Using features such as credit score, account balance, number of products, and activity status, the analysis identifies four meaningful customer profiles (e.g., “Wealthy Light Users” and “Rapid Multi-Product Adopters”). The churn rate is then evaluated across these segments to support targeted retention and marketing strategies.
- Learning Type: Unsupervised Learning (with supervised follow-up analysis on churn labels)
- Technologies Used: python, pandas, numpy, seaborn, matplotlib, scikit-learn
- Algorithms/Models Used: K-Means Clustering, StandardScaler, Silhouette Analysis, Data Visualization (heatmaps, elbow plots)
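
The segmentation workflow above (scale, choose `k` by silhouette score, then compare churn across segments) might be sketched as follows. The blobs and the random `churned` labels are placeholders, so the churn rates printed here carry no meaning; the project itself reports four segments on real customer data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Placeholder features and churn labels (assumption: not the bank dataset).
X, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=0)
rng = np.random.default_rng(0)
churned = rng.integers(0, 2, size=len(X))

# K-Means is distance-based, so standardise the features first.
X_scaled = StandardScaler().fit_transform(X)

# Score each candidate k with the mean silhouette coefficient.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    silhouettes[k] = silhouette_score(X_scaled, labels)
best_k = max(silhouettes, key=silhouettes.get)

# Refit with the chosen k and compute the churn rate inside each segment.
segments = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)
churn_by_segment = {
    int(s): float(churned[segments == s].mean()) for s in np.unique(segments)
}
print(best_k, churn_by_segment)
```
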
- Description: This project develops a k-Nearest Neighbors (k-NN) classifier to predict whether a breast tumor is benign or malignant. Preprocessing includes feature scaling, dimensionality reduction through PCA, and SMOTE to address class imbalance.
- Learning Type: Supervised Learning
- Technologies Used: python, pandas, numpy, seaborn, matplotlib, scikit-learn, imbalanced-learn
- Algorithms/Models Used: k-Nearest Neighbors (k-NN) Classification, Principal Component Analysis (PCA), Synthetic Minority Over-sampling Technique (SMOTE)
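
A minimal sketch of the scaling, PCA, and k-NN steps is shown below. It assumes scikit-learn's bundled `load_breast_cancer` dataset, which may not be the dataset the project uses, and the values `n_components=0.95` and `n_neighbors=5` are illustrative. The SMOTE step is deliberately left out to keep the example to scikit-learn alone; with imbalanced-learn, `SMOTE` would typically be placed inside an `imblearn.pipeline.Pipeline` so that oversampling touches only the training data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Wisconsin breast cancer data stands in for the project's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Scaling and PCA are fitted on the training split only, then reused on the test split.
knn = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components for 95% of the variance
    KNeighborsClassifier(n_neighbors=5),
)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```
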
- Description: This project aims to perform clustering on employee data to identify distinct groups within the workforce. The analysis includes exploring the relationships between these clusters and employee attrition rates, providing insights into factors contributing to employee turnover. Key features analyzed include gender, job satisfaction, commute distance, performance, and department affiliation. The project also employs Principal Component Analysis (PCA) for dimensionality reduction and visualization of the clusters.
- Learning Type: Unsupervised Learning
- Technologies Used: python, pandas, numpy, seaborn, matplotlib, scikit-learn
- Algorithms/Models Used: K-Means Clustering, Principal Component Analysis (PCA)
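
The use of PCA to visualise clusters, as described above, could be sketched like this. The data is synthetic and purely numeric; categorical fields such as gender or department would need encoding first, and the choice of three clusters is an assumption made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric stand-in for employee features (assumption).
X, _ = make_blobs(n_samples=400, centers=3, n_features=6, random_state=1)
X_scaled = StandardScaler().fit_transform(X)

# Cluster in the full feature space.
clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_scaled)

# Project to two components purely for plotting; clustering does not use them.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print(coords.shape, f"variance retained in 2D: {explained:.2%}")
```

The `coords` array can be passed to a scatter plot coloured by `clusters`. If `explained` is low, the 2D picture may hide separation that exists in the full feature space.
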
- Description: This project explores the factors influencing car purchase decisions. It uses decision tree models to predict whether a client will purchase a car based on demographic and financial information such as age, gender, and annual salary.
- Learning Type: Supervised Learning
- Technologies Used: python, pandas, numpy, seaborn, matplotlib, scikit-learn
- Algorithms/Models Used: Decision Tree, GridSearchCV
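
A decision tree tuned with `GridSearchCV`, as listed above, might be set up along the following lines. The data and the purchase rule are invented for illustration (gender is omitted because it would need encoding), and the parameter grid is an assumption rather than the grid the project searches.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented client data and purchase rule (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, n)
salary = rng.uniform(20_000, 150_000, n)
X = np.column_stack([age, salary])
y = ((salary > 70_000) & (age > 30)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Search over tree depth and leaf size with 5-fold cross-validation.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 20]}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
grid.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training split.
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_, f"test accuracy: {test_accuracy:.3f}")
```
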
- Description: This project focuses on identifying countries with unique or anomalous socio-economic conditions using various unsupervised anomaly detection algorithms. Through the use of K-Means clustering, DBSCAN, Isolation Forest, and One-Class SVM, the project seeks to uncover global outliers that may indicate countries facing extreme challenges or irregular patterns.
- Learning Type: Unsupervised Learning
- Technologies Used: python, pandas, numpy, seaborn, matplotlib, scikit-learn
- Algorithms/Models Used: K-Means, DBSCAN, Isolation Forest, One-Class SVM
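
One way to combine several of the detectors named above is a simple vote, sketched below on synthetic data with five hand-placed outliers. The voting rule, the hyperparameters (`contamination`, `nu`, `eps`, `min_samples`), and the data are all assumptions for illustration; the K-Means-based detector is left out, and the project may combine or interpret the results differently.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# 150 "typical" rows plus 5 hand-placed extreme rows (assumption: synthetic data).
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(150, 3))
outliers = np.array(
    [[8, 8, 8], [-8, 8, -8], [8, -8, -8], [-8, -8, 8], [9, -9, 9]], dtype=float
)
X = StandardScaler().fit_transform(np.vstack([inliers, outliers]))

# Each detector marks anomalies with the label -1.
iso_flags = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1
svm_flags = OneClassSVM(nu=0.05, gamma="scale").fit_predict(X) == -1
db_flags = DBSCAN(eps=1.0, min_samples=5).fit_predict(X) == -1  # DBSCAN noise points

# Keep rows flagged by at least two of the three detectors.
votes = iso_flags.astype(int) + svm_flags.astype(int) + db_flags.astype(int)
consensus = np.where(votes >= 2)[0]
print("rows flagged by at least two detectors:", consensus)
```

Requiring agreement between detectors reduces the number of rows flagged by a single method's quirks, at the cost of possibly missing anomalies that only one method is able to see.
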
Each project folder contains specific instructions on how to run the scripts and notebooks. Refer to the README file within each project folder for detailed usage guidelines.
This repository is licensed under the MIT License. See the LICENSE file for more details.