This project performs binary classification on a provided dataset. The workflow covers loading the data, initial exploration and preprocessing, visualizing key features, and training and evaluating a Decision Tree Classifier.
The dataset Lab_Exam_binary_classification_dataset.csv contains three columns:
- Feature1: A numerical feature.
- Feature2: Another numerical feature.
- Target: The binary target variable ('Yes' or 'No').
The dataset was loaded into a Pandas DataFrame. The first 5 rows, basic information (df.info()), and descriptive statistics (df.describe()) were displayed to understand the data's structure, types, and summary statistics.
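The loading and exploration step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the rows in `csv_text` are invented stand-ins for the real file, and only the column names come from the description above.

```python
import io

import pandas as pd

# Hypothetical stand-in for Lab_Exam_binary_classification_dataset.csv.
# The values are invented for illustration; only the column names are real.
csv_text = """Feature1,Feature2,Target
1.2,3.4,Yes
2.5,1.1,No
0.7,2.2,No
3.3,4.8,Yes
1.9,0.5,No
2.8,3.9,
"""

# With the real file, this would presumably be:
# df = pd.read_csv("Lab_Exam_binary_classification_dataset.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first 5 rows
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for the numeric columns
```

In this toy sample the last row has an empty Target field, which pandas reads as a missing value; `df.info()` reveals it through the non-null count.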
- Missing Values: Missing values in the 'Target' column were identified and handled by dropping the corresponding rows, ensuring data integrity for the target variable.
- Target Encoding: The 'Target' column was encoded from categorical ('Yes', 'No') to numerical (1, 0) for model compatibility, as most machine learning algorithms require numerical input.
- Outlier Handling: An outlier in 'Feature1' with an exceptionally large value (identified as 10000.0 at index 132) was detected and removed from the dataset. This step is crucial to prevent the model from being unduly influenced by extreme values, which could skew results and impair performance.
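The three preprocessing steps above might look like the sketch below. The toy DataFrame is hypothetical, and locating the outlier with `idxmax()` is an assumption made for illustration; the source only states that the value 10000.0 at index 132 was removed, not how it was found.

```python
import pandas as pd

# Hypothetical toy data with one missing target and one extreme Feature1 value.
df = pd.DataFrame({
    "Feature1": [1.2, 2.5, 10000.0, 3.3, 1.9, 2.8],
    "Feature2": [3.4, 1.1, 2.2, 4.8, 0.5, 3.9],
    "Target": ["Yes", "No", "No", "Yes", None, "No"],
})

# 1. Drop rows whose target is missing.
df = df.dropna(subset=["Target"]).copy()

# 2. Encode the target: 'Yes' -> 1, 'No' -> 0.
df["Target"] = df["Target"].map({"Yes": 1, "No": 0})

# 3. Remove the single extreme value in Feature1.
#    Assumption: the outlier is the maximum of the column. With the real
#    data, the equivalent would be df.drop(index=132).
outlier_idx = df["Feature1"].idxmax()
df = df.drop(index=outlier_idx)

print(df)
```

Dropping by index label rather than by position keeps the step valid even after earlier rows have been removed.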
A scatter plot was generated to visualize the relationship between 'Feature1' and 'Feature2', with points colored according to the 'Target' variable. This visualization helped in understanding the separability of the classes and identifying any obvious patterns or challenges for classification.
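A plot of this kind can be produced roughly as shown below. The sketch uses plain matplotlib and two synthetic Gaussian clusters as hypothetical data; the notebook may well have used seaborn and will show different point clouds.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: one Gaussian cluster per class.
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
class1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))

fig, ax = plt.subplots(figsize=(6, 4))
# One scatter call per class gives each class its own color and legend entry.
ax.scatter(class0[:, 0], class0[:, 1], label="Target = 0 (No)", alpha=0.7)
ax.scatter(class1[:, 0], class1[:, 1], label="Target = 1 (Yes)", alpha=0.7)
ax.set_xlabel("Feature1")
ax.set_ylabel("Feature2")
ax.set_title("Feature1 vs. Feature2 by Target")
ax.legend()
```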
The cleaned and prepared dataset was split into training and testing sets using train_test_split with a 70% to 30% ratio, respectively, and random_state=42 for reproducibility. This ensures that the model is trained on one part of the data and evaluated on unseen data to assess its generalization capabilities.
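As a small illustration of the split, the sketch below applies the same `test_size=0.3` and `random_state=42` settings to 100 synthetic rows. The feature matrix and labels are hypothetical; only the split parameters come from the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features and binary labels standing in for the cleaned dataset.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 70% of the rows go to training, 30% to testing; the fixed seed makes
# the split identical on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 100 rows this yields 70 training rows and 30 test rows.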
- Model Choice: A Decision Tree Classifier (DecisionTreeClassifier) was chosen for this binary classification task. Decision Trees are intuitive, easy to interpret, and capable of capturing non-linear relationships within the data. Their interpretability makes it easier to understand how decisions are made based on feature values. They are also relatively robust to outliers in the features and need no feature scaling, because each split depends only on the ordering of a feature's values. Decision Trees can in principle handle categorical data as well, although scikit-learn's implementation expects numerical input.
- Training: The DecisionTreeClassifier was initialized with random_state=42 for reproducibility and trained using the X_train (features) and y_train (target) data.
- Prediction: The trained model was used to make predictions (dt_y_pred) on the X_test dataset.
- Performance Metrics: The model's performance was evaluated using:
  - Accuracy Score: The overall proportion of correctly classified instances. The Decision Tree model achieved an accuracy of 0.9433 on the test set.
  - Classification Report: This report provides precision, recall, F1-score, and support for each class. The report indicated:
    - Class 0: Precision of 1.00, Recall of 0.93, F1-score of 0.96 (for 240 samples).
    - Class 1: Precision of 0.79, Recall of 0.98, F1-score of 0.87 (for 60 samples).
These results show strong performance for class 0 and high recall for class 1: the model finds nearly all positive instances. The lower precision for class 1 (0.79) means that roughly one in five positive predictions is a false positive.
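The training, prediction, and evaluation steps can be sketched end to end as follows. The data here is synthetic and hypothetical, so the printed scores will not match the figures reported above; the name `dt_model` is also an assumption, while `dt_y_pred`, `X_train`, `X_test`, and `random_state=42` follow the description.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical two-feature dataset with a binary target.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train the tree; random_state fixes tie-breaking between equally good splits.
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on the held-out test set.
dt_y_pred = dt_model.predict(X_test)

# Overall accuracy plus per-class precision, recall, F1-score, and support.
accuracy = accuracy_score(y_test, dt_y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, dt_y_pred))
```

Reading the per-class rows of the report alongside the accuracy matters when classes are imbalanced, as in the 240-to-60 test split described above.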
A custom plot_decision_boundary function was utilized to visually represent the decision regions created by the trained Decision Tree model on the training data. This plot provides a clear graphical insight into how the model segments the feature space to classify instances, making the model's logic more transparent.
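The notebook's plot_decision_boundary implementation is not shown here, so the version below is a hypothetical reconstruction of the common approach: predict the class at every point of a dense grid and shade the result. The signature, the `step` parameter, and the synthetic data are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def plot_decision_boundary(model, X, y, ax, step=0.05):
    """Shade the model's predicted class over a grid spanning X (illustrative version)."""
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, step), np.arange(y_min, y_max, step)
    )
    # Predict a class for every grid point, then restore the grid shape.
    grid = np.c_[xx.ravel(), yy.ravel()]
    zz = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k", s=20)
    ax.set_xlabel("Feature1")
    ax.set_ylabel("Feature2")
    return zz


# Hypothetical training data standing in for X_train and y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

dt_model = DecisionTreeClassifier(random_state=42).fit(X, y)

fig, ax = plt.subplots(figsize=(6, 4))
zz = plot_decision_boundary(dt_model, X, y, ax)
```

Because a decision tree splits on one feature at a time, the shaded regions come out as axis-aligned rectangles, which is what makes the plot a direct picture of the tree's rules.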
To reproduce this analysis, execute the cells in the provided Jupyter notebook sequentially. Ensure all required Python libraries (pandas, scikit-learn, matplotlib, seaborn, numpy) are installed in your environment.