CSSR/Assignment1.Rmd at master · BenjaminGaiser/CSSR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
---
title: "Assignment 1"
author: "Ben Gaiser & Jeremy Russell"
date: "10/7/2016"
output: html_document
bibliography: RpackageCitations.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This HTML is created using @R-knitr and @R-rmarkdown.

### Data Gathering process

Before going into our analysis, we make sure that all the required packages are installed. The following loop is a simple if-else loop that does check whether the packages are installed and if not, then installs them.

```{r message=FALSE}
packages <- c('repmis', 'knitr', 'rmarkdown', 'Hmisc', 'ggplot2')
for (p in packages) {
  if (p %in% installed.packages()) require(p, character.only=TRUE)
  else {
    install.packages(p)
    require(p, character.only=TRUE)
  }
}
```

Then we are using @R-repmis for loading our R packages, while simultaneously preparing a file from which we can cite our packages that we use.

```{r message=FALSE}
repmis::LoadandCite(packages, file = 'RpackageCitations.bib')
```

Setting our working directory.

```{r message=FALSE}
wrkdir <- c('C:/Users/Benji/Desktop/Statistics/Git/Repositories/CSSR',
            '~/Hertie School/Fall 2016/CollaborativeSocialScienceDataAnalysis/CSSR')
repmis::set_valid_wd(wrkdir)
```


```{r include=FALSE}
rm(packages, p, wrkdir)
```

The alcohol consumption data set is from fivethirtyeight (https://github.com/fivethirtyeight/data). Swiss is a core R data frame. We are using the specific SHA1 hash for the fivethirtyeight data set for better reproducibility.

```{r message=FALSE}
URL <- paste0('https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv')

AlcoholConsumption <- repmis::source_data(URL, sha1 = 'af869c65cccfc989a3b8c0a21752e4c0ba25bd3d')

swiss <- as.data.frame(swiss)
```


```{r include=FALSE}
#source('https://raw.githubusercontent.com/BenjaminGaiser/CSSR/master/DataGathering.R')
# We have not included this in the HTML because we added the Data Gathering Process at the beginning. This does exactly the same and is used in our R-file.
```


### Data Frame 1: Alcohol Consumption
##### Units: Average serving sizes per person
##### Source: 538 from World Health Organisation, Global Information System on Alcohol and Health (GISAH), 2010
##### Variables of interest:
#####  Independent Variable X: beer servings
#####  Dependent Variable Y: total litres of pure alcohol


#### What the data look like - Initial Descriptive Statistics

We use R's base command summary as well as the more advanced command describe from @R-Hmisc for initial descriptive statistics to get a feel for the data.

```{r message=FALSE}
summary(AlcoholConsumption)
describe(AlcoholConsumption)
```

Similarly, we could look at a specific variable and look up for example who drinks the most in total litres of pure alcohol. The answer is row 16, i.e. Belarus.

```{r message=FALSE}
which.max(AlcoholConsumption$total_litres_of_pure_alcohol)
head(AlcoholConsumption[16,])
```

The least is drunk by row 1, i.e. Afghanistan.

```{r message=FALSE}
which.min(AlcoholConsumption$total_litres_of_pure_alcohol)
head(AlcoholConsumption[1,])
```

### Hypothesis: Beer is the main driver of total litres of pure alcohol consumed per country

To see if our hypothesis is accurate, we can look at the correlations between the three variables of servings (beer, wine and spirits) and the total litres of pure alochol that is drunk in each country. We then find that indeed beer is the strongest correlated (0.84), while wine and spirits are similarly correlated (0.67 vs. 0.65, respectively.

```{r message=FALSE}
cor(AlcoholConsumption$beer_servings, AlcoholConsumption$total_litres_of_pure_alcohol)    # 0.84
cor(AlcoholConsumption$wine_servings, AlcoholConsumption$total_litres_of_pure_alcohol)    # 0.67
cor(AlcoholConsumption$spirit_servings, AlcoholConsumption$total_litres_of_pure_alcohol)  # 0.65
```

#### We can plot these findings in a scatterplot in order to get a visualised understanding of the correlation.

Step 1: Creating a function for plotting a ggplot using @R-ggplot2.

```{r message=FALSE}
ggplotRegAlcCons <- function(fit){
  ggplot(AlcoholConsumption, aes(beer_servings, total_litres_of_pure_alcohol)) +
    geom_point(colour = 'blue') +
    stat_smooth(method = 'lm', col = 'red', size=0.75) +
    labs(title = paste('Adj R2 = ',signif(summary(fit)$adj.r.squared, 3),
                       'Intercept =',signif(fit$coef[[1]],3 ),
                       ' Slope =',signif(fit$coef[[2]], 1),
                       ' P =',signif(summary(fit)$coef[2,4], 2)))
}
```

Step 2: Running the linear regression for the line of best fit.

```{r message=FALSE}
FitOfData <- lm(total_litres_of_pure_alcohol ~ beer_servings, data=AlcoholConsumption)
```

Step 3: Plotting the graph.

```{r message=FALSE}
ggplotRegAlcCons(FitOfData)
```

We can clearly see that this is a linear relationship with a high significance (p-value < 0.01) and a high Adj.R²-value of around 70 %.

#### Analyzing five countries of interest to see how they differ in their values of beer servings and total litres of pure alcohol.

Finding Germany, USA, South Africa, China and Australia.

```{r message=FALSE}
which(grepl('Germany', AlcoholConsumption$country)) # row 66
which(grepl('USA', AlcoholConsumption$country)) # row 185
which(grepl('South Africa', AlcoholConsumption$country)) # row 160
which(grepl('China', AlcoholConsumption$country)) # row 37
which(grepl('Australia', AlcoholConsumption$country)) # row 9
```

Subsetting the Data 'AlcoholConsumption' for ease of the next commands.

```{r message=FALSE}
SubsetOfFiveCountries <- AlcoholConsumption[c(9, 37, 66, 160, 185),]
```

Plotting our findings in a Scatterplot, we see that Germany drinks the most out of the five countries on both beer and total litres of pure alcohol.

```{r message=FALSE}
ggplot(SubsetOfFiveCountries,
       aes(beer_servings, total_litres_of_pure_alcohol)) +
  geom_point(aes(colour = factor(country))) +
  scale_colour_discrete(name='Countries')
```


### Data Frame 2: Swiss Data Set
#### Source: Swiss Fertility and Socioeconomic Indicators (1888), R Data Set
#### Variables of interest:
#### Independent Variable X: Catholic - % as opposed to Protestant
#### Dependent Variable Y: Fertility - lg, common standardized fertility measure


#### What the data look like - Initial Descriptive Statistics
```{r message=FALSE}
summary(swiss)
```

### Hypothesis: Catholics have a higher fertility rate than Protestants.

First, we have a closer look at the initial descriptive statistics of our variables of interest.

```{r message=FALSE}
describe(swiss$Fertility)
describe(swiss$Catholic)
var(swiss$Fertility)
var(swiss$Catholic)
sd(swiss$Fertility)
sd(swiss$Catholic)
```

The variable 'Catholic' shows high variance and standard deviation for a continuous variable of between 0 and 100. Let's plot the relationship in order to see how the relationship looks.

```{r message=FALSE}
ggplot(swiss, aes(Catholic, Fertility)) + geom_point()
```

Adding a line of best fit, we can see that this relationship is significant, explaining 20 % of the variance (R-value). Yet, the striking thing is the clear divide between mostly Protestant and mostly Catholic cantons, except of four "outliers in the middle"

```{r message=FALSE}
ggplotRegSwiss <- function(fit){
  ggplot(swiss, aes(Catholic, Fertility)) +
    geom_point(colour = 'blue') +
    stat_smooth(method = 'lm', col = 'red', size=0.75) +
    labs(title = paste('Adj R2 = ',signif(summary(fit)$adj.r.squared, 3),
                       'Intercept =',signif(fit$coef[[1]],3 ),
                       ' Slope =',signif(fit$coef[[2]], 1),
                       ' P =',signif(summary(fit)$coef[2,4], 2)))
}
FitOfDataSwiss <- lm(Fertility ~ Catholic, data = swiss)
ggplotRegSwiss(FitOfDataSwiss)
```

#### Which cantons are neither mostly Protestant nor mostly Catholic?

Step 1: Creating a factor variable with four different groups.

```{r message=FALSE}
swiss$CatholicCat <- cut(swiss$Catholic, seq(0, 100, 25))
```

Step 2: Changing the factor variable into a character variable in order to rename the rownames.

```{r message=FALSE}
swiss$CatholicCat <- as.character(swiss$CatholicCat)
```

Step 3: Renaming the rownames.

```{r message=FALSE}
swiss$CatholicCat[swiss$CatholicCat=='(0,25]'] <- 'Protestant'
swiss$CatholicCat[swiss$CatholicCat=='(25,50]'] <- 'Protestant to Catholic'
swiss$CatholicCat[swiss$CatholicCat=='(50,75]'] <- 'Catholic to Protestant'
swiss$CatholicCat[swiss$CatholicCat=='(75,100]'] <- 'Catholic'
```

Step 4: Finding the cantons which are 'Protestant to Catholic' and 'Catholic to Protestant'.

```{r message=FALSE}
which(grepl('Protestant to Catholic', swiss$CatholicCat)) # 4 and 45
which(grepl('Catholic to Protestant', swiss$CatholicCat)) # 46 and 47
```

Step 5: Searching the names.

```{r message=FALSE}
swiss[c(4,45:47),]
```

##### Moutier, V. De Geneve, Rive Droite and Rive Gauce are the only cantons where there is at least a third of the population that does not belong to the majority religion.

# References

---