---
title: "TP 1_Classification and Regression Trees"
author: "Group 14"
date: "2/24/2021"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Exercise 9: Using Tree Classification to Predict Orange Juice Sales
Exercise 9 uses tree classification to predict whether a customer will buy Citrus Hill (CH) or Minute Maid (MM) orange juice. The exercise involves building the model, running cross validation, and then using a pruned model.
## Load the Data
The OJ data contains 1070 purchases where the customer bought either Citrus Hill or Minute Maid orange juice. There are 17 characteristics of the customer and product recorded in the data. As the output of the `str` function shows, the data set is already clean, so no further data cleaning is required.
```{r}
library(ISLR)
data(OJ)
str(OJ)
```
## Split the Data
Instead of using the book's method, the data is split using the caret package. The training data contains 80% of the data while the remaining 20% is saved for testing.
```{r}
library(caret)
set.seed(420)
d_divide<-createDataPartition(OJ$Purchase,p=.8,list = FALSE)
train<-OJ[d_divide,]
test<-OJ[-d_divide,]
```
## Fit the Model
To fit the tree classification model, use the `tree` library. As seen in the model summary, there are 10 terminal nodes and a training error rate of 15.87%.
```{r}
library(tree)
model<-tree(Purchase~.,data=train)
summary(model)
```
## Display Detailed Results
Printing the model shows the breakdown of each node in the tree. For each node it displays, in order, the split criterion, the number of observations in that branch, the deviance, the overall prediction for the branch, and the fraction of observations that take on that prediction. An asterisk denotes whether a node is a terminal node.
```{r}
model
```
## Plot the Tree
Plotting the tree shows a visualization of the different branches in the tree. It does not go as in-depth as printing the model, but it does show the split criteria.
```{r}
plot(model)
text(model,pretty = 0)
```
## Use the Test Data and Create a Confusion Matrix
Using the test data to make a prediction yields a test error rate of 16.9%. This is slightly higher than the training error rate, which is generally expected. There were 36 misclassifications and 177 correct classifications.
```{r}
prediction<-predict(model,test,type = 'class')
table(prediction, test$Purchase)
cat("Test Error Rate:", mean(prediction != test$Purchase))
```
## Use Cross Validation to Find Optimal Tree Size
To cross validate the tree model, the cv.tree function is used. Setting FUN to prune.misclass ensures that the misclassification rate, rather than the deviance, guides the pruning. Displaying the results shows the number of cross-validation errors (reported in `$dev`) at each candidate number of terminal nodes.
```{r}
cvModel<-cv.tree(model, FUN = prune.misclass)
cvModel
```
## Plot the CV Results
Plotting the cross validation results shows that using 4 terminal nodes yields the smallest cross-validation error for the model. This differs from our original model, which used 10 terminal nodes. Not only is the error rate lower, but fewer nodes and branches mean there is a lower chance of overfitting.
```{r}
plot(cvModel$size,cvModel$dev, type = 'b')
```
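The optimal size can also be read off programmatically instead of eyeballing the plot: `cvModel$size[which.min(cvModel$dev)]` returns the tree size with the fewest CV errors. The snippet below is a minimal, self-contained sketch of that idiom using illustrative `size` and `dev` vectors of the same shape cv.tree returns; the actual values in `cvModel` will differ.

```{r}
# Illustrative stand-ins for cvModel$size and cvModel$dev (hypothetical values)
size <- c(10, 8, 4, 2, 1)     # candidate tree sizes (terminal nodes)
dev  <- c(40, 40, 32, 45, 70) # CV misclassification counts at each size

# Pick the size with the fewest cross-validation errors
best_size <- size[which.min(dev)]
best_size
```

With these illustrative values, `best_size` is 4, matching the size chosen from the plot above.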
## Make a Model Using the Best Pruned Tree
Using the number of terminal nodes found above, a pruned model is created using the prune.misclass function. The plot shows the branches for a 4 terminal node model.
```{r}
pruneModel<-prune.misclass(model,best=4)
plot(pruneModel)
text(pruneModel,pretty=0)
```
## Find the New Test Error Rate
Using the pruned model yields different results from the original model. The training error rate was slightly higher at 18.09%, as opposed to the original 15.87%. The test error rate fell by 2.82 percentage points, from the original 16.9% to 14.08% for the pruned model.
```{r}
summary(pruneModel)
prunePred<-predict(pruneModel,test,type = 'class')
table(prunePred, test$Purchase)
cat("Test Error Rate:", mean(prunePred != test$Purchase))
```