This repo demonstrates the application of supervised and unsupervised machine learning to income prediction and customer segmentation, using the Current Population Surveys (1994 and 1995) from the U.S. Census Bureau. The dependent variable has two income levels: below $50,000 and above $50,000. The $50,000 threshold sits at roughly the 75th percentile of the total U.S. population at that time, so the upper level represents the high-income group. There are 40 independent variables in total; datacleaning.csv lists them in detail, with metadata and data inspection information.
The workflow includes three models: LASSO logistic regression, XGBoost, and K-means clustering. LASSO logistic regression filters the independent variables by the strength of their association with the income level, XGBoost predicts the two income levels, and K-means clustering performs the customer segmentation.
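The three-stage workflow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in run.py: the data is synthetic (generated with scikit-learn, not the survey data), the hyperparameters and the number of clusters are arbitrary, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example needs only scikit-learn. `xgboost.XGBClassifier` offers the same `fit`/`predict_proba` interface and could be swapped in.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned survey data (assumption: 40 numeric
# features and a binary label with roughly 25% positives).
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=8,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize using training statistics only, so the L1 penalty treats
# all features on the same scale.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Stage 1: L1-penalized (LASSO) logistic regression. Features whose
# coefficients shrink to exactly zero are filtered out.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X_train_s, y_train)
selected = np.flatnonzero(lasso.coef_[0] != 0)

# Stage 2: gradient-boosted trees on the retained features
# (stand-in for XGBoost; hyperparameters are illustrative).
booster = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, random_state=0
)
booster.fit(X_train_s[:, selected], y_train)
auc = roc_auc_score(
    y_test, booster.predict_proba(X_test_s[:, selected])[:, 1]
)

# Stage 3: K-means segmentation on the same retained features
# (4 clusters is an arbitrary choice for this sketch).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
segments = kmeans.fit_predict(X_train_s[:, selected])

print(f"features kept: {selected.size} of {X.shape[1]}")
print(f"test ROC AUC: {auc:.3f}")
print(f"segment sizes: {np.bincount(segments)}")
```

The strength of the L1 penalty (`C` here) controls how many variables survive the filter; smaller values keep fewer.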
- Clone the repository and navigate to its directory
git clone https://github.com/codefortheplanet/Supervised-and-Unsupervised-machine-learning-for-Current-Population-Surveys.git
cd Supervised-and-Unsupervised-machine-learning-for-Current-Population-Surveys
- Create a new conda environment with the required dependencies, then activate it
conda env create -f environment.yml
conda activate cps
- Run the main Python script
python run.py
The script prints the major evaluation metrics and saves the resulting coefficients and graphs.