PredictionWithRandomForest/WineQualityWithRandomForest.Rmd at master · padmesky/PredictionWithRandomForest · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
title: "Predicting Wine Quality with Random Forest Algorithm"
subtitle: "Prediction with Random Forest Algorithm"
date: "`r format(Sys.time(), '%d %B %Y')`"
author: "Havva Nur Elveren"
output:
  html_document:
      theme: journal
      toc: yes
      toc_depth: 4
      #toc_float: true
  word_document:
      toc: yes
      toc_depth: 4
      #toc_float: true
  pdf_document:
      toc: yes
      theme: journal
      toc_depth: 4
      #toc_float: true
---
---
# Objective: Predicting Wine Quality
Can we predict wine quality based on its features such as acidity, alcohol, sugar or sulfate level? In this project, we'll predict Wine Quality with looking at the value of different features of a wine. We'll use a data set that has been collected from red wine variants of the Portuguese "Vinho Verde" wine. If quality is greater than 6.5 it is considered as good wine, otherwise it is considered as bad wine.

# Data Description:
* 1.6K Row with 12 Column. You can download the data from the link https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
```{r}
library(kableExtra)

dt <- data.frame(Name = c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
"density", "pH", "sulphates", "alcohol", "quality"),
Description = c("most acids involved with wine or fixed or nonvolatile (do not evaporate readily)", "the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste", "found in small quantities, citric acid can add 'freshness' and flavor to wines", "the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than", "the amount of salt in the wine", "the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents", "amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2", "the density of water is close to that of water depending on the percent alcohol and sugar content", "describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the", "a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and", "the percent alcohol content of the wine","Score between 0 and 10, if quality > 6.5 it's Good, otherwise it is Bad "))

dt %>%
  kbl() %>%
  kable_styling()
```

## Step 1: Load the Libraries
```{r}
library(caret)
library(glmnet)
library(xgboost)
library(randomForest)
library(ISLR)
library(readxl)
library(pROC)
library(lattice)
library(e1071)
library(kknn)
library(ggplot2)
library(multiROC)
library(MLeval)
library(AppliedPredictiveModeling)
library(corrplot)
library(Hmisc)
library(dplyr)
library(quantmod)

library(nnet)
library(NeuralNetTools)

library(mlbench)
library(caretEnsemble)
library(ranger)
library(C50)
```

## Step 2: Load the Data Set
```{r}
winedata <- read.csv("winequality-red.csv")
winedata <- data.frame(winedata, stringsAsFactors = FALSE)
str(winedata)

# If quality score is greater than 6.5 set quality as Good, otherwise set as Bad
winedata$quality[winedata$quality>6.5] <- 'Good'
winedata$quality[winedata$quality<= 6.5] <- 'Bad'
winedata$quality <- as.factor(winedata$quality)
table(winedata$quality)
```
## Step 3: Prepare the Data
```{r}
## Create random training and test data set with splitting the data with the proportion of %80 to %20.
trainIndex <- createDataPartition(winedata$quality, p = 0.8, list=FALSE)
trainData <- winedata[trainIndex,]
testData <- winedata[-trainIndex,]
# Train labels
TrainLabels<-winedata[trainIndex,]$quality
# Test labels
TestLabels<-winedata[-trainIndex,]$quality
prop.table(table(trainData$quality))
prop.table((table(testData$quality)))
```
## Step 4: Make Prediction with Random Forest
```{r}
control <- trainControl(method = "repeatedcv", number = 10, repeats = 1)
metric <- "Accuracy"

# Number of features for splitting each tree node will be defined by getting the root square of the total number of features
mtry <- sqrt(ncol(trainData))

# Setting the mtry value
tunegrid <- expand.grid(mtry=mtry)

# Train the data with random forest, with setting the ntree to 50. ntree defines the number of trees to grow. Since our data set is not large 50 would be sufficient.
rfModel <- train(quality~., data=trainData, method="rf",ntree=50, metric=metric, tuneGrid=tunegrid, trControl=control)
print(rfModel)

#Predict
predictions<-predict(rfModel, testData, type = "prob")
predictions<-predict(rfModel, testData)
confusionMatrix(as.factor(predictions),as.factor(TestLabels))
```
## Step 5: Improving Model Performance With Boosting
```{r}
#trail defines the number of iterations
c5BoostModel <- C5.0(x= trainData, y= trainData$quality, trail=5)
summary(c5BoostModel)
plot(c5BoostModel)

c5prediction <- predict(c5BoostModel, testData)
confusionMatrix(as.factor(c5prediction),as.factor(TestLabels))
```
## Step 6: Using Gradient Boosting
```{r}
gbmModel <- train(quality~., data=trainData, method="gbm", metric=metric, trControl=control)
summary(gbmModel)
plot(gbmModel)

gbmPrediction <- predict(gbmModel, testData)
confusionMatrix(as.factor(gbmPrediction),as.factor(TestLabels))
```
# Conclusion
In this project we predicted wine quality with random forest algorithm and then tried to improve the model performance with using C5 and gradient boosting  methods.
Random Forest implementation predicts the wine quality with %93 accuracy. Since the data is imbalanced and there are more bad quality wines in the data set, the prediction for bad wines is better comparing to good wines.
Using C5 method with boosting results with a drastic improvement in the model performance and predicts the wine quality with %100 accuracy even though the data is imbalanced. However, gradient boosting method doesn't provide any improvement on the model comparing to results of C5 and random forest.