---
title: "Supervised machine learning"
author: Pablo Barbera
date: March 6, 2017
output: html_document
---
### Supervised machine learning
We'll be working again with tweets about the 2014 EP elections in the UK. Now we will be using the `polite` variable, which indicates whether each tweet was hand-coded as being __polite__ (a tweet that adheres to politeness standards, i.e. it is written in a well-mannered and
non-offensive way) or __impolite__ (an ill-mannered, disrespectful tweet that may contain offensive language).
The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. The link is [here](http://onlinelibrary.wiley.com/doi/10.1111/jcom.12259/abstract). Our goal was to understand to what extent candidates disengage from voters on Twitter because the messages they receive are mostly impolite.
Let's start by reading the dataset and creating a dummy variable indicating whether each tweet is impolite.
```{r}
library(quanteda)
tweets <- read.csv("data/EP-elections-tweets.csv", stringsAsFactors=F)
tweets$impolite <- ifelse(tweets$polite=="polite", 0, 1)
```
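Before going further, it's worth checking the class balance of the hand-coded labels (a quick inspection added here for illustration; not part of the original analysis):
```{r}
# distribution of the hand-coded politeness labels
table(tweets$polite)
# proportion of impolite tweets
prop.table(table(tweets$impolite))
```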
We'll do some cleaning as well, substituting all user handles with a generic @ symbol. Why? Specific usernames are strong but idiosyncratic predictors, so collapsing them helps prevent overfitting.
```{r}
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
```
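To see what the substitution does, here it is applied to a made-up tweet (the handle is hypothetical):
```{r}
# every handle collapses into a single generic "@" token
gsub('@[0-9_A-Za-z]+', '@', "Thanks @some_candidate for the debate!")
```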
Create the dfm and trim it so that only tokens that appear in 2 or more tweets are included.
```{r}
twcorpus <- corpus(tweets$text)
twdfm <- dfm(twcorpus, removePunct=TRUE, remove=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"))
twdfm <- dfm_trim(twdfm, min_docfreq = 2)
```
And split the dataset into training and test sets. We'll go with 80% for training and 20% for testing. Note the use of a random seed to make sure our results are replicable.
```{r}
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
```
Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, `glmnet` is in my opinion the best. It's much faster than `caret` or `mlr` (in my experience at least), and it has cross-validation already built in, so we don't need to code it from scratch.
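For context, this is the objective that `glmnet` documents for `family="binomial"`: a penalized negative log-likelihood, where setting `alpha=0` reduces the elastic-net penalty to a pure ridge (L2) penalty, and `cv.glmnet` chooses the penalty weight $\lambda$ by cross-validation:

$$\min_{\beta_0,\,\beta} \; -\frac{1}{N}\sum_{i=1}^{N} \Big[ y_i\,(\beta_0 + x_i^\top \beta) - \log\big(1 + e^{\beta_0 + x_i^\top \beta}\big) \Big] \;+\; \lambda \Big[ \tfrac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1 \Big]$$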
```{r}
library(glmnet)
require(doMC)
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$impolite[training],
                   family="binomial", alpha=0, nfolds=5, parallel=TRUE,
                   type.measure="deviance")
plot(ridge)
```
We can now compute the performance metrics on the test set.
```{r}
## function to compute accuracy
accuracy <- function(ypred, y){
  tab <- table(ypred, y)
  return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
  tab <- table(ypred, y)
  return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
  tab <- table(ypred, y)
  return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# predicted values, using the test-set base rate of impolite tweets
# as the classification threshold instead of 0.50
preds <- predict(ridge, twdfm[test,], type="response") > mean(tweets$impolite[test])
# confusion matrix
table(preds, tweets$impolite[test])
# performance metrics
accuracy(preds, tweets$impolite[test])
precision(preds, tweets$impolite[test])
recall(preds, tweets$impolite[test])
```
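If you want a single summary number, precision and recall can be combined into an F1 score. Here is a minimal helper, added for illustration (not part of the original script), reusing the functions defined above:
```{r}
# F1: harmonic mean of precision and recall
f1 <- function(ypred, y){
  p <- precision(ypred, y)
  r <- recall(ypred, y)
  return(2 * p * r / (p + r))
}
f1(preds, tweets$impolite[test])
```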
Something that is often very useful is to look at the estimated coefficients and see which features have the largest positive or negative values:
```{r}
# from the different values of lambda, let's pick the best one
best.lambda <- which(ridge$lambda==ridge$lambda.min)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                 word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
```
### A special case: wordscores
Let's check an example of wordscores. Here we have tweets from a random sample of 100 Members of the U.S. Congress, as well as their ideal points based on roll-call votes. Can we replicate the ideal points only using the text of their tweets?
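As a reminder of how wordscores works (in the Laver, Benoit and Garry formulation): each word $w$ gets a score that is a weighted average of the reference scores $A_r$ of the documents it appears in, and a new document is then scored by averaging the scores of its words:

$$S_w = \sum_{r} P(r \mid w)\, A_r, \qquad \hat{S}_d = \sum_{w} F_{wd}\, S_w$$

where $P(r \mid w) = F_{wr} / \sum_r F_{wr}$ is the probability of reading reference document $r$ conditional on observing word $w$, and $F_{wd}$ is the relative frequency of word $w$ in document $d$.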
```{r}
cong <- read.csv("data/congress-tweets.csv", stringsAsFactors=F)
# creating the corpus and dfm objects
ccorpus <- corpus(cong$text)
docnames(ccorpus) <- cong$screen_name
cdfm <- dfm(ccorpus, removePunct=TRUE, remove=c(stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"))
cdfm <- dfm_trim(cdfm, min_docfreq = 2)
# running wordscores
ws <- textmodel(cdfm, cong$idealPoint, model="wordscores", smooth=.5)
ws
# let's look at the most discriminant words
sw <- sort(ws@Sw)
head(sw, n=20)
tail(sw, n=20)
```
Now let's split the data into training and test set and see what we can learn...
```{r}
set.seed(123)
test <- sample(1:nrow(cong), floor(.20 * nrow(cong)))
# extracting ideal points and replacing them with missing values
refpoints <- cong$idealPoint
refpoints[test] <- NA
# running wordscores
ws <- textmodel(cdfm, refpoints, model="wordscores", smooth=.5)
# predicted values (this will take a while...)
preds <- predict(ws, rescaling="lbg")
scores <- preds@textscores
# and let's compare
plot(scores$textscore_lbg[test], cong$idealPoint[test])
cor(scores$textscore_lbg[test], cong$idealPoint[test])
```
### Another example: predicting authorship of tweets
Now let's work with a dataset that contains all the tweets sent by Donald Trump, Ted Cruz, Hillary Clinton, and Bernie Sanders during the 2016 primary election campaign. Let's pick Donald Trump and try to build a classifier to predict whether a tweet was published by him (from an Android device) or his campaign team (from an iPhone).
```{r}
tweets <- read.csv('data/candidate-tweets.csv', stringsAsFactors=F)
tweets <- tweets[tweets$screen_name=="realDonaldTrump",]
# subsetting tweets in 2016
tweets$datetime <- as.POSIXct(tweets$datetime)
tweets <- tweets[tweets$datetime > as.POSIXct("2016-01-01 00:00:00"),]
# variable measuring if tweet is coming from an Android device
tweets$android <- ifelse(grepl("Android", tweets$source), 1, 0)
prop.table(table(tweets$android))
# removing URLs and handles
tweets$text <- gsub('https?://t.co/[A-Za-z0-9]+', '', tweets$text)
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
```
Create a training and test set, with 80% and 20%, respectively.
```{r}
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
```
Construct the DFM. You may want to experiment with different preprocessing techniques until you achieve better performance; one possible variation (stemming) is sketched right after this block.
```{r}
library(quanteda)
twcorpus <- corpus(tweets$text)
twdfm <- dfm(twcorpus, removePunct=TRUE, removeNumbers=TRUE, remove=c(
  stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can"))
#twdfm <- dfm_trim(twdfm, min_docfreq = 2)
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
```
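As one example of the preprocessing variations mentioned above, we could stem the features so that inflected forms collapse into a single token. This is just a sketch: `dfm_wordstem` assumes a recent quanteda version, and whether stemming actually helps here is an empirical question.
```{r}
# hypothetical variation: stemmed copy of the dfm, kept under a new name
# so the rest of the script is unaffected
twdfm.stemmed <- dfm_wordstem(twdfm)
topfeatures(twdfm.stemmed, 20)
```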
Now run the classifier. Then, compute accuracy, precision, and recall on the test set.
```{r}
library(glmnet)
require(doMC)
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$android[training],
                   family="binomial", alpha=0, nfolds=5, parallel=TRUE,
                   type.measure="deviance")
plot(ridge)
## function to compute accuracy
accuracy <- function(ypred, y){
  tab <- table(ypred, y)
  return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
  tab <- table(ypred, y)
  return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
  tab <- table(ypred, y)
  return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# predicted values, again using the test-set base rate as the threshold
preds <- predict(ridge, twdfm[test,], type="response") > mean(tweets$android[test])
# confusion matrix
table(preds, tweets$android[test])
# performance metrics
accuracy(preds, tweets$android[test])
precision(preds, tweets$android[test])
recall(preds, tweets$android[test])
```
Let's identify the features that best predict whether a tweet was sent by the candidate himself. What do you learn?
```{r}
# from the different values of lambda, let's pick the best one
best.lambda <- which(ridge$lambda==ridge$lambda.min)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                 word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
```
Let's try a different classifier: gradient boosted decision trees, via `xgboost`. We'll run a small grid search over the learning rate (`eta`) and tree depth, keeping the combination with the best cross-validated accuracy.
```{r}
library(xgboost)
# converting matrix object
X <- as(twdfm, "dgCMatrix")
# running xgboost model
tryEta <- c(1, 2)
tryDepths <- c(1, 2, 4)
bestEta <- NA
bestDepth <- NA
bestAcc <- 0
for (eta in tryEta){
  for (dp in tryDepths){
    bst <- xgb.cv(data = X[training,],
                  label = tweets$android[training],
                  max.depth = dp,
                  eta = eta,
                  nthread = 4,
                  nround = 500,
                  nfold = 5,
                  print.every.n = 100L,
                  objective = "binary:logistic")
    # cross-validated accuracy
    acc <- 1 - tail(bst$test.error.mean, 1)
    cat("Results for eta=", eta, " and depth=", dp, " : ",
        acc, " accuracy.\n", sep="")
    if (acc > bestAcc){
      bestEta <- eta
      bestAcc <- acc
      bestDepth <- dp
    }
  }
}
cat("Best model has eta=", bestEta, " and depth=", bestDepth, " : ",
    bestAcc, " accuracy.\n", sep="")
# running best model
rf <- xgboost(data = X[training,],
              label = tweets$android[training],
              max.depth = bestDepth,
              eta = bestEta,
              nthread = 4,
              nround = 1000,
              print.every.n = 100L,
              objective = "binary:logistic")
# out-of-sample accuracy
preds <- predict(rf, X[test,])
cat("\nAccuracy on test set=", round(accuracy(preds>.50, tweets$android[test]),3))
cat("\nPrecision on test set=", round(precision(preds>.50, tweets$android[test]),3))
cat("\nRecall on test set=", round(recall(preds>.50, tweets$android[test]),3))
# feature importance
labels <- dimnames(X)[[2]]
importance <- xgb.importance(labels, model = rf, data=X, label=tweets$android)
importance <- importance[order(importance$Gain, decreasing=TRUE),]
head(importance, n=20)
# adding sign: for each feature, which class (0=iPhone, 1=Android)
# uses it more often
sums <- list()
for (v in 0:1){
  sums[[v+1]] <- colSums(X[tweets[,"android"]==v,])
}
sums <- do.call(cbind, sums)
sign <- apply(sums, 1, which.max)
df <- data.frame(
  Feature = labels,
  sign = sign - 1,
  stringsAsFactors=F)
importance <- merge(importance, df, by="Feature")
## best predictors
for (v in 0:1){
  cat("\n\n")
  cat("value==", v)
  importance <- importance[order(importance$Gain, decreasing=TRUE),]
  print(head(importance[importance$sign==v,], n=50))
  cat("\n")
  cat(paste(unique(head(importance$Feature[importance$sign==v], n=50)), collapse=", "))
}
```
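As a quick sanity check, we can put the two classifiers side by side on the same test set, reusing the objects already in memory (a small follow-up added for illustration; not part of the original script):
```{r}
# test-set accuracy: ridge vs. gradient boosting
ridge.preds <- predict(ridge, twdfm[test,], type="response") > mean(tweets$android[test])
xgb.preds <- predict(rf, X[test,]) > .50
cat("ridge accuracy:", round(accuracy(ridge.preds, tweets$android[test]), 3), "\n")
cat("xgboost accuracy:", round(accuracy(xgb.preds, tweets$android[test]), 3), "\n")
```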