
Supervised and Unsupervised machine learning for Current Population Surveys

This repository demonstrates the application of supervised and unsupervised machine learning to income prediction and customer segmentation, using the Current Population Surveys (1994 and 1995) from the U.S. Census Bureau. The dependent variable has two income levels: <$50,000 and >$50,000. The $50,000 threshold sits at roughly the 75th percentile of the total U.S. population at that time, so the upper level corresponds to the high-income group. There are 40 independent variables in total. A detailed list, with metadata and data inspection information, is provided in datacleaning.csv.

The workflow includes three models: LASSO logistic regression, XGBoost, and K-means clustering. LASSO logistic regression is used for variable filtering, keeping the variables by the strength of their association with the income level. XGBoost is used to predict the two income levels, and K-means clustering is used for customer segmentation.
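The three stages above can be sketched roughly as follows. This is a minimal illustration and not the code in run.py: it uses synthetic data from scikit-learn in place of the CPS table, scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost so that the sketch needs only one library, and assumed values for the regularization strength (C=0.05) and the number of clusters (4).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CPS data (assumption): 40 features and an
# imbalanced binary target, with about 25% of rows in the positive class.
X, y = make_classification(
    n_samples=2000, n_features=40, n_informative=8, n_redundant=4,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize so the L1 penalty treats all features on the same scale.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Stage 1: LASSO (L1-penalized) logistic regression for variable filtering.
# Features whose coefficient is shrunk to exactly zero are dropped.
# C=0.05 is an assumed value; smaller C means stronger shrinkage.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(X_train_s, y_train)
selected = np.flatnonzero(lasso.coef_.ravel() != 0)

# Stage 2: gradient-boosted trees on the retained features.
# GradientBoostingClassifier stands in for XGBoost here (assumption).
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train_s[:, selected], y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test_s[:, selected])[:, 1])

# Stage 3: K-means segmentation on the same features.
# n_clusters=4 is an assumed value, not taken from the repository.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
segments = kmeans.fit_predict(X_train_s[:, selected])

print(f"features kept by LASSO: {selected.size} of {X.shape[1]}")
print(f"test ROC AUC: {auc:.3f}")
print(f"segment sizes: {np.bincount(segments, minlength=4)}")
```

With XGBoost installed, `xgboost.XGBClassifier` follows the same `fit` / `predict_proba` interface and could replace the stand-in classifier in stage 2.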

To run the analysis:

  1. Clone the repository and navigate to its directory:
git clone https://github.com/codefortheplanet/Supervised-and-Unsupervised-machine-learning-for-Current-Population-Surveys.git

cd Supervised-and-Unsupervised-machine-learning-for-Current-Population-Surveys

  2. Create a new conda environment with the required dependencies, then activate it:
conda env create -f environment.yml

conda activate cps

  3. Run the main Python script:
python run.py

The script prints the major evaluation metrics and saves the resulting coefficients and graphs.
