End-to-end machine learning notebooks for unsupervised clustering and supervised classification. The project includes data loading, preprocessing, PCA + KMeans clustering, Decision Tree and Random Forest classifiers, evaluation metrics/plots, hyperparameter tuning, and persisted artifacts.
- Data loading from CSV datasets (`data_clustering.csv`, `data_clustering_inverse.csv`).
- Preprocessing: `LabelEncoder` for categorical features and `StandardScaler` for numeric scaling.
- Clustering: PCA dimensionality reduction + KMeans; includes silhouette score and cluster analysis.
- Classification: `DecisionTreeClassifier` and `RandomForestClassifier` with train/test split.
- Evaluation: Accuracy, Precision, Recall, F1-score, and confusion matrix visualizations (seaborn/matplotlib).
- Hyperparameter tuning: `RandomizedSearchCV` for Random Forest; best estimator saved.
- Artifacts (saved via joblib using `.h5` filenames): `decision_tree_model.h5`, `explore_random_forest_classification.h5`, `tuning_classification.h5`, `model_clustering.h5`, `PCA_model_clustering.h5`.
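The preprocessing and clustering steps above can be sketched as a small end-to-end pipeline. Note this is a minimal illustration on synthetic data: the column names and cluster count are assumptions, not the actual schema of `data_clustering.csv`.

```python
# Sketch of the clustering pipeline: synthetic data stands in for
# data_clustering.csv; column names and n_clusters are illustrative.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.normal(100, 25, 300),
    "age": rng.integers(18, 70, 300),
    "category": rng.choice(["A", "B", "C"], 300),  # categorical feature
})

# Encode the categorical column, then scale all features
df["category"] = LabelEncoder().fit_transform(df["category"])
X = StandardScaler().fit_transform(df)

# Reduce dimensionality before clustering
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# Silhouette score: closer to 1 means tighter, better-separated clusters
print("silhouette:", round(silhouette_score(X_pca, labels), 3))
```

In the notebooks, the fitted PCA and KMeans objects are what get persisted (`PCA_model_clustering.h5`, `model_clustering.h5`), so new data can be transformed and assigned to clusters without refitting.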
- Python 3 (tested with 3.9–3.11)
- Jupyter Notebook / JupyterLab
- NumPy, pandas
- scikit-learn
- seaborn, matplotlib
- joblib
- Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
```

- Install dependencies:

```bash
pip install --upgrade pip
pip install jupyter numpy pandas scikit-learn seaborn matplotlib joblib
```

Option A - VS Code

- Open this folder in VS Code.
- Open `[Klasifikasi]_Submission_Akhir_BMLP_Nelson_Ahli.ipynb` or `[Clustering]_Submission_Akhir_BMLP_Nelson_Ahli.ipynb`.
- Select a Python 3 kernel, then "Run All".
Option B - Jupyter (CLI)

```bash
jupyter lab
# or
jupyter notebook
```

Then open:

- Classification: `[Klasifikasi]_Submission_Akhir_BMLP_Nelson_Ahli.ipynb`
- Clustering: `[Clustering]_Submission_Akhir_BMLP_Nelson_Ahli.ipynb`
Notes for running:
- Keep the working directory at the repository root so relative data paths resolve.
- The notebooks save models automatically to the repo root (files listed above).
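A small guard cell at the top of a notebook can catch a wrong working directory early. This is a hypothetical helper (`missing_data_files` is not part of the repo), shown as one way to fail fast:

```python
from pathlib import Path

# Hypothetical guard: verify the expected CSVs are visible from the
# current working directory before any cell tries to read them.
EXPECTED = ("data_clustering.csv", "data_clustering_inverse.csv")

def missing_data_files(root: str = ".") -> list[str]:
    """Return the names of expected data files not found under `root`."""
    return [name for name in EXPECTED if not (Path(root) / name).exists()]

if missing_data_files():
    print("Start Jupyter from the repository root; missing:", missing_data_files())
```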
```text
.
├─ [Klasifikasi]_Submission_Akhir_BMLP_Nelson_Ahli.ipynb
├─ [Clustering]_Submission_Akhir_BMLP_Nelson_Ahli.ipynb
├─ data_clustering.csv
├─ data_clustering_inverse.csv
├─ decision_tree_model.h5
├─ explore_random_forest_classification.h5
├─ tuning_classification.h5
├─ model_clustering.h5
├─ PCA_model_clustering.h5
└─ README.md
```
- The model artifacts use the `.h5` extension for consistency with prior naming conventions, but are actually saved via `joblib.dump(...)` (not HDF5 format), so load them with `joblib.load(...)` rather than an HDF5/Keras reader. This is a legacy naming choice from the notebooks.
- Exact metrics and plots are produced inside the notebooks.
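The tuning-and-persistence step can be sketched as below. The search grid and dataset here are placeholders, not the notebooks' actual values; only the pattern (RandomizedSearchCV, then `joblib.dump` of the best estimator to a `.h5`-named file) mirrors the repo.

```python
# Sketch: tune a Random Forest, persist the best estimator with joblib.
# The param grid and synthetic data are illustrative assumptions.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100], "max_depth": [None, 5, 10]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X_train, y_train)

# Despite the extension, this file is a joblib pickle, not HDF5
joblib.dump(search.best_estimator_, "tuning_classification.h5")
model = joblib.load("tuning_classification.h5")
print("test accuracy:", round(model.score(X_test, y_test), 3))
```

The round trip through `joblib.dump`/`joblib.load` is exactly how the notebooks' artifacts should be reloaded later.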