---
title: "ML_2_Summary"
author: "Study Crew"
date: "3/9/2021"
output:
  html_document:
    toc: true
    toc_depth: 1
    toc_float: true
    df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Machine Learning 2 Summary Document {.tabset}
This document is a comprehensive summary for Machine Learning 2. It is intended to help us study for the exams and for future reference in our professional lives (if any companies actually use R).
## Basic Statistics {.tabset}
Authors: Carrington Metts & Thomas Trankle
### Bias/Variance Tradeoff
**Variance** is defined as the amount by which a prediction would vary if it were estimated with a different training set. A model with high variance is *overfit*: small changes in the training set may result in large changes to the model.
**Bias** is defined as the amount of error introduced by reducing a complex real-world dataset to a simple model. A model with high bias is *underfit*: changes in the training set are unlikely to significantly change the model, but the model still does not make good predictions for the data.
If the training error is low and the test error is high, the model has *high variance*. If both the training and test errors are high, the model has *high bias*. If both are low, the model has *low variance and low bias*.
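To see the diagnosis in action, here is a minimal sketch of our own (simulated data, not from the course material) comparing training and test MSE for an underfit, a reasonable, and an overfit polynomial:
```{r, eval=FALSE}
# Sketch: diagnosing bias vs. variance on simulated data (our own illustration).
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
train <- sample(200, 100)
for (d in c(1, 3, 15)) {  # underfit, reasonable, overfit
  fit <- lm(y ~ poly(x, d), subset = train)
  mse <- function(idx) mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)
  cat("degree", d, "| train MSE:", round(mse(train), 3),
      "| test MSE:", round(mse(-train), 3), "\n")
}
```
The degree-15 fit should show the high-variance signature described above: low training error, high test error.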
### Law of Parsimony
* Occam's razor
* Prefer a simpler model over a complex one when both explain the data about equally well
* Only keep features that truly add substance to the model
* Avoid **overfitting** the model
    + Unimportant/irrelevant features overcomplicate the model
    + Highly correlated features overcomplicate the model (multicollinearity)
    + Interaction terms, polynomial terms, interactions with polynomials, etc.: models get complicated very quickly!
### Stats Formulas
Training set MSE is an underestimate of test MSE.
$MSE = RSS/n$
The least squares method minimizes training MSE, not test MSE.
We can estimate test error indirectly by adjusting training error:
* $C_{P} = \frac{1}{n}(RSS + 2d\hat{\sigma}^{2})$
    + Smaller $C_{p}$ value = smaller estimated test error
    + Select the model with the smaller $C_{P}$
    + The second term is the penalty for having more features
* $\textbf{AIC} = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^{2})$
* $\textbf{BIC} = \frac{1}{n}(RSS + \log(n)\,d\,\hat{\sigma}^{2})$
* $\textbf{Adjusted R}^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}$
    + Smaller adjusted $R^{2}$ = larger estimated test error
    + Regular $R^2$ is $1 - RSS/TSS$
where
$\hat{\sigma}^2$ = residual variance estimated using full model
d = # of features used in the subset
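To make these formulas concrete, here is a minimal sketch of our own that computes $C_p$, BIC, and adjusted $R^2$ by hand for a hypothetical two-feature subset of the Boston data (which appears again later in this document):
```{r, eval=FALSE}
# Sketch: computing C_p, BIC, and adjusted R^2 for a candidate subset (our own example).
library(MASS)
data(Boston)
n <- nrow(Boston)
full <- lm(medv ~ ., data = Boston)           # full model
sigma2.hat <- summary(full)$sigma^2           # residual variance from full model
sub <- lm(medv ~ lstat + rm, data = Boston)   # hypothetical subset with d = 2
d <- 2
RSS <- sum(resid(sub)^2)
TSS <- sum((Boston$medv - mean(Boston$medv))^2)
Cp    <- (RSS + 2 * d * sigma2.hat) / n
BIC   <- (RSS + log(n) * d * sigma2.hat) / n
adjR2 <- 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
c(Cp = Cp, BIC = BIC, adjR2 = adjR2)
```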
### Shrinkage/Regularization
We generally have a problem when $n < p$.
This creates a low-bias, high-variance issue.
The solution to this is Regularization (shrinkage; desensitization)
* Instead of fitting a model with all predictors (Xs)
* Shrink (regularize; constrain) β coefficients towards zero
* **NB**: This reduces **variance** but increases **bias**
Generally, we do this with either feature selection, ridge or lasso.
With regularization, we are adding a penalty to our objective function:
* Linear = minimize RSS
* Ridge = minimize RSS + $\lambda \times \text{slope}^2$ [penalty]
* Lasso = minimize RSS + $\lambda \times |\text{slope}|$ [penalty]
Overall, when we increase $\lambda$, we decrease the size of the slopes/coefficients of certain betas.
When $\lambda$ $\uparrow$
* Beta slope $\downarrow$
* the line gets flatter which means the label is less sensitive to the features
* weight decay in gradient descent is larger
The $\ell_2$-norm ratio is a value between 0 and 1 (the ratio of ridge vs. LS estimates):
$$\frac{||\hat{\beta}^{R}_{\lambda}||_{2}}{||\hat{\beta}||_{2}} =
\frac{\text{Betas estimated using ridge}}{\text{Betas estimated using least squares}}$$
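As a minimal sketch of this ratio (our own illustration on the Hitters data used later in this document; we treat the glmnet fit at $\lambda = 0$ as an approximation to least squares, and the value $\lambda = 50$ is an arbitrary choice):
```{r, eval=FALSE}
# Sketch: l2-norm ratio of ridge vs. (approximate) least squares estimates.
library(ISLR)
library(glmnet)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
beta.ls    <- coef(glmnet(x, y, alpha = 0, lambda = 0))[-1]   # ~least squares
beta.ridge <- coef(glmnet(x, y, alpha = 0, lambda = 50))[-1]  # ridge at lambda = 50
sqrt(sum(beta.ridge^2)) / sqrt(sum(beta.ls^2))  # ratio between 0 and 1
```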
Ridge vs. Lasso
* Use Ridge when most features in the model are useful (according to literature, theory, expertise, or common sense)
* Use Lasso when we suspect that the model includes a lot of useless features
+ Lasso helps us determine which features are useless (with zero slope)
+ Lasso selects variables (similar to subset selection)
+ Lasso model is sparse = easier to interpret than Ridge model
Additional Considerations:
* Coefficient estimates are sensitive to scaling
+ Standardize predictors before using regularization
* Estimates a line when n < p
* Works best when LS regression produces high variance
* Computationally feasible alternatives to subset selection
--------------------------------------------------------------------------------
## Forward and Backward Selection {.tabset}
Author: Matt Sadosuk
### Summary/description:
For forward stepwise selection, the model begins with no variables, then adds the most significant variable at each step until a pre-specified stopping rule is reached.
Backward stepwise selection begins with a model that contains all variables, then removes the least significant variable at each step until a pre-specified stopping rule is reached or until no variables are left in the model.
### When to employ this method/model:
Forward and backward selection are used when you're trying to identify which combination of features will produce the best model.
This R script covers how to use forward and backward selection to identify the best model, and it includes a validation process to estimate each model's test error on held-out data before settling on a final model.
### Model Code and Inputs:
Using the MASS library, we use the Boston dataset, which describes housing in the suburbs of Boston. The Boston dataset has 13 features plus the response, medv.
```{r}
library(MASS)
data(Boston)
ncol(Boston)  # 14 columns: 13 predictors plus the response medv
```
The next step involves subsetting the data with the regsubsets() function from the leaps library.
First we perform best subset selection (without forward or backward selection) to see how the model looks.
Note that we will use medv (median value of owner-occupied homes in $1000s) as our Y variable.
The argument nvmax specifies the maximum subset size (number of variables) that will be considered. We can use method= to
specify forward or backward selection; otherwise, the default is exhaustive (full) subset selection.
```{r}
library(leaps)
regfit.full=regsubsets(medv~.,data=Boston ,nvmax=13)
reg.summary =summary(regfit.full)
regfit.fwd=regsubsets(medv~.,data=Boston , nvmax=13, method ="forward")
summary(regfit.fwd)
regfit.bwd=regsubsets(medv~.,data=Boston , nvmax=13, method ="backward")
summary(regfit.bwd)
```
The summary() function prints a * for each variable included in a given model. The output is sorted by model size; the first
row is the best 1-feature model, the second row is the best 2-feature model, etc.
The summary object stores the R squared values for each model:
``` {r}
reg.summary$rsq
```
To see the coefficients of a model, use the coef() function. Let's try it with the 7-variable model:
```{r}
coef(regfit.full ,7)
coef(regfit.fwd ,7)
coef(regfit.bwd ,7)
```
### Model Tests:
In order for these approaches to yield accurate estimates of the test
error, we must use only the training observations to perform all aspects of
model-fitting—including variable selection. Therefore, the determination of
which model of a given size is best must be made using only the training
observations.
Now we split the data into training and test sets using a sample vector of TRUE and FALSE values.
A TRUE value means the observation is in the training set; otherwise it is FALSE.
In the test line, we use !train to flip the TRUE and FALSE values for the test dataset.
```{r}
set.seed(1)
train=sample(c(TRUE ,FALSE), nrow(Boston),rep=TRUE)
test=(!train)
```
Following the split of the data into test and training sets, we now want to select the best subset. We do this by using Boston[train,], which picks out the training rows with the logical vector. We are still using nvmax=13 (so subsets of up to 13 variables will be examined).
```{r}
regfit.best=regsubsets(medv~.,data=Boston[train ,],nvmax=13)
```
Next we will compute the validation error for each model size by creating a model matrix for the test data. The model.matrix() function builds an X matrix from the data.
```{r}
test.mat=model.matrix(medv~.,data=Boston[test ,])
```
Next we create a for loop that, for each size i, extracts the coefficients from regfit.best, multiplies them into the appropriate columns of the test model matrix to make predictions, and computes the test MSE.
First we create a vector of 13 zeros to hold the results. Inside the for loop, we extract the coefficients for each size i, then make predictions from those coefficients.
Finally, we calculate the test MSE and store it in the vector.
```{r}
val.errors = rep(0,13)
for(i in 1:13){
coefi=coef(regfit.best ,id=i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((Boston$medv[test]-pred)^2)
}
val.errors
```
We now use which.min() to see which number of variables is best:
```{r}
which.min(val.errors)
```
Here, we see that the 11-feature model works best.
We can also perform cross-validation to directly estimate the test error rate.
Unfortunately, there is no predict() method for regsubsets objects. Instead, we'll write our own:
```{r}
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id=id)
  xvars = names(coefi)
  mat[,xvars] %*% coefi
}
```
Now we want to choose among models of different sizes using 10-fold cross-validation.
We want to perform best subset selection within each of the 10 training sets.
Finally, we create a matrix to store the results.
```{r}
k=10
set.seed(1)
folds=sample(1:k,nrow(Boston),replace=TRUE)
cv.errors= matrix(NA,k,13, dimnames = list(NULL, paste(1:13)))
```
Now we can perform the cross-validation through nested for loops. The outer loop (over j) treats each of the 10 folds in turn as the test set. Inside it, a second loop makes model
predictions (using our new predict.regsubsets() function) and calculates the test error rates for the held-out data.
We then add the CV error to the matrix. The result is a 10x13 matrix, where the (j, i)th element
corresponds to the MSE for the *j*th fold and the model with *i* predictors.
```{r}
k <- 10
for(j in 1:k){
best.fit <- regsubsets(medv~.,data=Boston[folds!=j,],nvmax=13)
for(i in 1:13){
pred=predict.regsubsets(best.fit ,Boston[folds ==j,],id=i)
cv.errors[j,i]= mean((Boston$medv[ folds==j]-pred)^2)
}
}
```
Now we determine the average error for each model. We use apply() to average over each column.
```{r}
mean.cv.errors=apply(cv.errors ,2, mean)
mean.cv.errors
par(mfrow=c(1,1))
plot(mean.cv.errors ,type="b")
```
After doing the CV on the Boston data set we can conclude that the 11 variable model
is the best fit.
To get the coefficients for the final model:
```{r}
bestModel <- regsubsets(medv~., data=Boston, nvmax=13)
coef(bestModel, 11)
```
--------------------------------------------------------------------------------
## Ridge Regression {.tabset}
Author: Sarah Brown
### Summary/description:
With this method, you shrink the Beta coefficients of your features towards 0.
It reduces variance but increases bias in the training data.
It doesn't let you make a coefficient 0 (it shrinks asymptotically) and so no features are fully eliminated.
So, we use this when most features are useful and none seem irrelevant (if they did, you'd look at using lasso perhaps).
As λ goes up, beta slope goes down and the line gets flatter which means the model is less sensitive to the features.
This means that the weight decay in gradient descent is larger.
### When to employ this method/model:
Use ridge when the training model is very accurate but the test model is not, signaling high variance: the line is overfit to the training data. You want to
minimize the cost function RSS + λ * (slope)^2 [penalty], where λ is a tuning parameter that controls the severity of the penalty.
When λ is very high, all coefficients are driven towards 0. When λ is equal to 0, the penalty has no effect, so the function is fit with least squares.
Ridge regression is often discussed in terms of the L2 norm, the square root of the sum of the squared coefficients. In particular, it is common to make
plots where one axis is the L2 norm of the ridge estimates divided by the L2 norm of the least squares estimates. When this ratio is close to 0,
the ridge coefficients take very small values (i.e. λ is very large). When this ratio is close to 1, the ridge and least squares results
are nearly the same (i.e. λ is near 0).
According to Dr. Chung's PowerPoint, you can use ridge regression with:
* Multiple regression: penalty $\lambda(\text{slope}_1^2 + \text{slope}_2^2 + \dots)$ (the intercept is not penalized)
* Categorical X (binary): the slope is the difference between the two categories
* Logistic regression (categorical Y): optimizes the sum of likelihoods (rather than RSS)
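As a minimal sketch of the logistic case (our own illustration; x is a predictor matrix and ybin a hypothetical binary response, neither defined in this document):
```{r, eval=FALSE}
# Sketch: ridge-penalized logistic regression; ybin is a hypothetical 0/1 response.
library(glmnet)
cv.fit <- cv.glmnet(x, ybin, alpha = 0, family = "binomial")
predict(cv.fit, newx = x, s = "lambda.min", type = "response")  # fitted probabilities
```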
### Model Code and Inputs:
```{r}
library(ISLR)
library(glmnet)
data(Hitters)
Hitters <- na.omit(Hitters)
x=model.matrix(Salary~.,Hitters)[,-1]
y=Hitters$Salary
#Divide the data into train (75%) and test (25%) sets
train = sample(1:nrow(x), nrow(x)*.75)
test=(-train)
y.test=y[test]
#Create a grid of lambda values. Then use the glmnet() function to create a model to predict the training y's using the training x's. Use alpha = 0 for ridge regression.
grid=10^seq(10,-2,length=100)
mod.ridge <- glmnet(x[train,], y[train], alpha=0, lambda =grid)
```
### Model Tests:
``` {r}
#Evaluate training model performance (how well it predicted the y-values) by using cross-validation, which is the cv.glmnet() function. (In this example code, we are creating a 12-fold cross-validation model, but by default it is a 10-fold cv).
cv.out.ridge <- cv.glmnet(x[train,], y[train], alpha=0, lambda = grid, nfolds = 12)
plot(cv.out.ridge)
```
### Model Improvements:
``` {r}
#Make predictions using the best model (best lambda). Create a vector of test set predictions.
bestlam_r <- cv.out.ridge$lambda.min
ridge.pred <- predict(mod.ridge, s=bestlam_r, newx=x[test,])
#Compute and display the test error rate.
MSE_best_ridge <- mean((ridge.pred - y.test)^2)
MSE_best_ridge
```
### Comprehensive Example:
```{r}
#Part 1: Exploring ridge regression without using a test and train set
#Using the hitters data set, we set up the "x" and "y" matrix values so that we can use the glmnet function.
x=model.matrix(Salary~.,Hitters)[,-1]
y=Hitters$Salary
#Make a grid of lambda values, and see how many coefficients you have and how many potential lambda values there are.
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha = 0,lambda = grid)
COEF<-coef(ridge.mod)
dim(COEF)
```
You can look at various lambda values and how they perform. Here is an example of looking at the 50th lambda value from the grid.
```{r}
ridge.mod$lambda[50]
coef(ridge.mod)[,50]
sqrt(sum(coef(ridge.mod)[-1,50]^2))# Find the L2 value
```
```{r}
#Part 2: Predicting with a test and train set also using the hitters data
#Create test and train
set.seed(1)
train<-sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
#Refit the ridge model on the training data only (ridge.mod above was fit on the full data set)
ridge.mod=glmnet(x[train,],y[train],alpha = 0,lambda = grid)
#To choose the best lambda value, we use cv.glmnet. We can find the best lambda and associated MSE.
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)
bestlam<-cv.out$lambda.min
bestlam
ridge.pred<-predict(ridge.mod,s=bestlam,newx = x[test,])
mean((ridge.pred-y.test)^2)
#We can also explore what the coefficients look like using the whole data set and our best lambda value
out<-glmnet(x,y,alpha = 0)
predict(out,type = "coefficients",s=bestlam)[1:20,]
# no coefficients are zero because in ridge regression they aren't allowed to be
```
--------------------------------------------------------------------------------
## Lasso Regression {.tabset}
Author: Jill Van den Dungen
### Summary/description:
Lasso is an alternative to ridge regression. The cost function for lasso regression is RSS + λ * abs(slope). Here, the penalty
term is determined by the absolute value of the coefficients instead of the square. This has the effect of forcing some of the coefficient
estimates to be exactly equal to zero when the tuning parameter lambda is sufficiently large. Therefore, some variables
are eliminated. Models generated from lasso are sparse and are therefore easier to interpret than models from ridge regression.
### When to employ this method/model:
Use lasso when we suspect that the model includes a lot of useless features.
### Model Code and Inputs:
```{r, eval =F}
#Create the train (75%) and test (25%) data sets
train<-sample(1:nrow(x), nrow(x)*0.75)
test<- (-train)
y_test=y[test]
#Create a lambda grid vector of 120 elements ranging from 10^10 to 10^-3
grid=10^seq(10,-3,length=120)
#Using the glmnet() function, create a lasso model named mod.lasso that predicts the training y's using the training x's and the grid of lambda values created above.
mod.lasso<-glmnet(x[train,],y[train],alpha = 1,lambda = grid)
```
### Model Tests:
```{r,eval =F}
#Evaluate training model performance using cross-validation
#Using the cv.glmnet() function and the same parameters used above in the creation of mod.lasso (i.e. including the lambda grid vector),
# create a 12-fold cross-validation model named cv.out.lasso
cv.out.lasso<-cv.glmnet(x[train,],y[train],alpha = 1,lambda = grid, nfolds=12)
plot(cv.out.lasso)
```
### Model Improvements:
```{r,eval =F}
#Display the best cross-validated lambda value (the one that produces the lowest deviance - do not use the 1-standard error rule here).
bestlam<-cv.out.lasso$lambda.min
#Using the best lambda and the model named mod.lasso, create a vector of test set predictions.
lasso.pred<- predict(mod.lasso, s=bestlam, newx= x[test,])
#Compute and display the test error rate.
mean((lasso.pred-y_test)^2)
```
### Comprehensive Example:
```{r,eval =F}
library(ISLR)
library(glmnet)
data(Hitters)
Hitters <- na.omit(Hitters)
#Using the hitters data set, we set up the "x" and "y" matrix values so that we can use the glmnet function.
x=model.matrix(Salary~., data=Hitters)[,-1]
y=Hitters$Salary
#Make a grid of lambda values
grid=10^seq(10,-2,length=100)
#Create test and train
set.seed(1)
train<-sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
#Create lasso model using train set
lasso.mod<-glmnet(x[train,],y[train], alpha=1, lambda= grid)
plot(lasso.mod)
#Create cross validation model
set.seed(1)
cv.out<-cv.glmnet(x[train, ], y[train], alpha=1)
plot(cv.out)
#Create best lambda and make predictions
bestlam<- cv.out$lambda.min
bestlam
lasso.pred<-predict(lasso.mod, s=bestlam, newx= x[test,])
mean((lasso.pred-y.test)^2)
#explore whole data set to see what best lambda is
out=glmnet(x,y,alpha= 1, lambda= grid)
lasso.coef<- predict(out, type= 'coefficients', s=bestlam)[1:20,]
lasso.coef
```
--------------------------------------------------------------------------------
## Principal Component Regression {.tabset}
Author: Alex King
### Summary/description:
Principal Components Analysis (PCA): a popular algorithm for deriving a low-dimensional set of features from a large set of variables. This is done by transforming the X features through linear combinations.
There are two methods:
* Principal Components Regression (PCR)
* Partial Least Squares (PLS)
Both involve a supervised regression of Y on the components. However, PCR identifies the components in the X feature space in an unsupervised way, while PLS identifies them in a supervised way.
PLS and PCR have very similar performance.
### When to employ this method/model:
Uses:
1) When there are more variables than observations (wide data), e.g. shopping behavior (Y) predicted from search keywords (X)
2) When the explanatory variables are highly correlated
3) Use PLS when you want to reduce bias
Potential drawbacks: there is no guarantee that the selected principal components are associated with the outcome. If the components are not correlated with the response, you will run into trouble. PLS can also increase variance.
### Model Code and Inputs:
Library: pls
PCR: `pcr(Y ~ X, data = df, scale = TRUE, validation = "CV")`
PLS: `plsr(Y ~ X, data = df, scale = TRUE, validation = "CV")`
### Model Tests:
For the model tests, we use the validationplot() function, which plots the Mean Squared Error of Prediction (MSEP) against the number of components. Choose the number of components with the lowest MSEP:
`validationplot(pcr.fit, val.type = "MSEP")`
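Putting the pieces together, here is a minimal self-contained sketch (df and Y are placeholders for a data frame and its response, not objects defined in this document):
```{r, eval=FALSE}
# Sketch: PCR and PLS fits on a generic data frame df with response Y (placeholders).
library(pls)
pcr.fit <- pcr(Y ~ ., data = df, scale = TRUE, validation = "CV")
pls.fit <- plsr(Y ~ ., data = df, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")  # choose ncomp with the lowest MSEP
validationplot(pls.fit, val.type = "MSEP")
```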
### Model Improvements:
Adjust ncomp, the number of principal components used to build the model, and set scale = TRUE to standardize the predictors.
### Comprehensive Example:
```{r}
#### PCR
### Step 1: Load Required Packages
library(tidyverse)
library(caret)
library(pls)
### Step 2: Load the Data Set
library(MASS)
data("Boston")
str(Boston)
### Step 3: Split Dataset
set.seed(123)
training.indices <- Boston$medv %>% createDataPartition(p=0.8,list=FALSE)
train <- Boston[training.indices,]
test <- Boston[-training.indices,]
y.test <- test$medv
### Step 4: Compute PCR
# Build Model
model <- pcr(medv~.,data=Boston,subset=training.indices,scale=TRUE,validation="CV")
summary(model)
validationplot(model,val.type = "MSEP")
### Calculate Lowest MSE
pcr.pred = predict(model,test,ncomp=13)
mean((pcr.pred-y.test)^2)
# MSE = 21.05844
### Test on whole Dataset
model.whole <- pcr(medv~.,data=Boston,scale=TRUE,validation="CV")
pcr.pred = predict(model.whole,test,ncomp=13)
mean((pcr.pred-y.test)^2)
# MSE = 19.934
##### PLS
### Steps 1-3 Same as PCR
pls.fit <- plsr(medv~.,data=Boston,subset=training.indices, scale=TRUE,validation="CV")
summary(pls.fit)
validationplot(pls.fit,val.type="MSEP")
pls.pred <- predict(pls.fit,test,ncomp=8)
mean((pls.pred-y.test)^2)
# MSE = 20.95983
### Whole Dataset
pls.whole <- plsr(medv~.,data=Boston, scale=TRUE,validation="CV")
pls.pred <- predict(pls.whole,test,ncomp=8)
mean((pls.pred-y.test)^2)
# MSE = 19.90098
```
--------------------------------------------------------------------------------
## Polynomial Regression {.tabset}
Author: Carrington Metts
### Summary/description:
Polynomial regression fits a polynomial curve to a set of data. It is analogous to linear regression, but allows greater flexibility
to fit data that does not have a linear relationship.
### When to employ this method/model:
Polynomial regression should be used when there is a continuous response variable and one or more continuous features,
and there is a nonlinear relationship between the X and Y variables.
Polynomial regression may also be used to estimate a binary response variable, given a set of continuous features.
### Model Code and Inputs:
A polynomial fit can be produced with either orthogonal or raw (non-orthogonal) terms. With orthogonal polynomials, each
term in the fit is a linear combination of the raw powers (age, age^2, age^3, age^4, etc.), constructed so the terms are uncorrelated.
With raw polynomials, the coefficients apply directly to the raw powers. Both versions result in the same fitted values.
When creating models, **it is best to use orthogonal terms**, since raw terms are correlated with
each other and therefore cause a multicollinearity problem.
As with linear regression, we use lm() or glm() to create the fit. Within lm() or glm(), the poly() function automatically creates orthogonal
terms. (To create raw terms, you can use the cbind() function, or set raw=TRUE inside poly().) To create a model, we will use the Wage data (from ISLR) to predict
wage as a function of age. We can then use coef() to get the coefficients of the model.
```{r poly_model, warning=FALSE}
library(ISLR)
#fit a degree-4 polynomial with orthogonal terms
orthogonal_fit <- lm(wage~poly(age,4), data=Wage)
summary(orthogonal_fit)
```
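To verify the claim that both parameterizations give the same fitted values, here is a short sketch of our own (raw_fit is a name introduced for illustration):
```{r, eval=FALSE}
# Sketch: raw (non-orthogonal) fit; coefficients differ but fitted values match.
raw_fit <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
max(abs(fitted(orthogonal_fit) - fitted(raw_fit)))  # effectively zero
```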
We can also create a polynomial logistic regression model to predict whether a person will earn more than $250,000. To do so,
we use I() to coerce the continuous wage variable to binary categorical. We then run a glm fit on the model.
```{r poly_binary_model}
logistic_fit <- glm(I(wage>250)~poly(age, 4), data=Wage, family=binomial)
```
### Model Tests:
It is important to choose the lowest-degree model that adequately models the data. To do so, we use the ANOVA (analysis of variance)
test. ANOVA computes an F-test of two hypotheses: the null hypothesis, which states that the simpler model is sufficient
to explain the data, against the alternative hypothesis, which states that the higher-order model is required.
To determine the best model for the Wage data, we will create 5 different polynomial models and run ANOVA.
```{r poly_tests}
library(ISLR)
fit.1 <- lm(wage~age, data=Wage)
fit.2 <- lm(wage~poly(age,2), data=Wage)
fit.3 <- lm(wage~poly(age,3), data=Wage)
fit.4 <- lm(wage~poly(age,4), data=Wage)
fit.5 <- lm(wage~poly(age,5), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
```
Here, we see that the p-values comparing fits 2 and 3 to the previous model are less than 0.05, meaning each is a significant improvement. However, the p-value for
fit 4 is 0.051046, so the fourth model is not significantly better than the third. Therefore, we should use the third-order model.
The effectiveness of a polynomial regression model may be determined by R^2, MSE, or any other regression error metric. For polynomial logistic regression,
we typically determine the output classes based on the predicted probabilities and compute the accuracy rate:
```{r eval=FALSE}
predictedClasses = ifelse(outputProb > .5, "Yes", "No")  # outputProb: predicted probabilities
table(predictedClasses, actualClasses)                   # actualClasses: true class labels
```
### Model Predictions
#### Polynomial Regression
Now that we have chosen model 3, let's use it to predict wage for some new list of ages. We'll first create a new set of ages to feed to the model.
Then, we'll use the predict() function to calculate the predictions.
```{r poly_predict}
agelims = range(Wage$age) #pull out minimum and maximum ages
age.grid = seq(from=agelims[1], to=agelims[2]) #create a list of ages from the smallest to largest age
preds = predict(fit.3, newdata=list(age=age.grid), se=TRUE) #make predictions on the new list of ages
```
Now, we want to determine the 95% confidence interval for our predictions, which corresponds to approximately 2 standard errors in each direction.
```{r poly_conf}
se.bands = cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
```
We can now plot the data, prediction, and confidence interval:
```{r poly_plot}
par(mfrow=c(1,2), mar=c(4.5,4.5,1,1), oma=c(0,0,4,0))
plot(Wage$age, Wage$wage, xlim=agelims, cex=.5, col="darkgrey") #data
title("Degree-3 Polynomial", outer=T)
lines(age.grid, preds$fit, lwd=2, col="blue") #predictions
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3) #confidence interval
```
#### Polynomial Logistic Regression
The glm() function automatically returns predictions in logit form, so we need to transform them to actual probabilities (i.e. values that range from 0 to 1, as we expect).
Other than that, the predictions and confidence intervals are computed and plotted in the same way.
```{r poly_logistic_predict}
logist_preds = predict(logistic_fit,newdata=list(age=age.grid),se=T) #make predictions
#transform predictions from logit form
pfit = exp(logist_preds$fit)/(1+exp(logist_preds$fit))
#compute confidence intervals and transform from logit form
se.bands.logit = cbind(logist_preds$fit +2* logist_preds$se.fit , logist_preds$fit -2*logist_preds$se.fit)
se.bands = exp(se.bands.logit)/(1+exp(se.bands.logit))
#plot data, fit, confidence band
plot(Wage$age ,I(Wage$wage >250),xlim=agelims ,type="n",ylim=c(0,.2)) #data
#add a jitter to the points so they don't overlap each other (just for aesthetics)
points(jitter(Wage$age), I((Wage$wage >250)/5),cex=.5,pch ="|", col="darkgrey")
#prediction
lines(age.grid ,pfit ,lwd=2, col ="blue")
#confidence interval
matlines (age.grid ,se.bands ,lwd=1, col="blue",lty=3)
```
### Comprehensive Example:
#### Polynomial Regression
``` {r poly_full}
#run anova to pick best model (degree 1-5)
fit.1 <- lm(wage~age, data=Wage)
fit.2 <- lm(wage~poly(age,2), data=Wage)
fit.3 <- lm(wage~poly(age,3), data=Wage)
fit.4 <- lm(wage~poly(age,4), data=Wage)
fit.5 <- lm(wage~poly(age,5), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
#fit.3 is the best, according to anova
#create vector of new ages to make predictions on
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
#make predictions
preds = predict(fit.3, newdata=list(age=age.grid), se=TRUE)
#compute confidence interval
se.bands = cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
#plot data, fit, 95% confidence interval
par(mfrow=c(1,1), mar=c(4.5,4.5,1,1), oma=c(0,0,4,0))
plot(Wage$age, Wage$wage, xlim=agelims, cex=.5, col="darkgrey") #data
title("Degree-3 Polynomial", outer=T)
lines(age.grid, preds$fit, lwd=2, col="blue") #fit
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3) #confidence interval
```
#### Polynomial Logistic Regression
```{r poly_logistic_full}
#create the model
logistic_fit <- glm(I(wage>250)~poly(age, 4), data=Wage, family=binomial)
#make predictions for new data (returns probabilities in logit form)
logist_preds = predict(logistic_fit,newdata=list(age=age.grid),se=T)
#transform probabilities from logit form to normal (ie probabilities that go from 0 to 1, as we expect)
pfit = exp(logist_preds$fit)/(1+exp(logist_preds$fit))
#find the 2-standard-deviation confidence intervals
se.bands.logit = cbind(logist_preds$fit +2* logist_preds$se.fit , logist_preds$fit -2*logist_preds$se.fit)
#transform confidence intervals from logit to normal form
se.bands = exp(se.bands.logit)/(1+exp(se.bands.logit))
#plot the data points, apply a jitter so they aren't on top of each other
plot(Wage$age ,I(Wage$wage >250),xlim=agelims ,type="n",ylim=c(0,.2))
points(jitter(Wage$age), I((Wage$wage >250)/5),cex=.5,pch ="|", col="darkgrey")
#plot the predicted values (age.grid and the Y predictions)
lines(age.grid ,pfit ,lwd=2, col ="blue")
#plot the confidence intervals
matlines (age.grid ,se.bands ,lwd=1, col="blue",lty=3)
```
--------------------------------------------------------------------------------
## Step Functions{.tabset}
Author: Alex Russett
### Summary and Description
In general, **step functions** allow you to fit different regions of your data with different constant functions so that the model
fits the data better overall. The book describes this as avoiding the imposition of a
**global structure** on the data: instead, we break the data into **bins** and fit a different constant to each bin.
To do this, we create cut points $C_1, C_2, \dots, C_K$ in the range of X, then construct $K+1$ new variables.
You can see example pictures on page 269 in chapter 7.2 of ISLR
### When do we use step functions?
Step functions should ONLY be used when there are very natural, easy-to-identify break points
in the X range of the data. They are commonly used with categorical features.
Step functions are also commonly used in neural networks (for example, as activation functions).
### Model Code and Inputs:
#### Basics - Sample code not designed to be run
```{r, eval = F}
# Fitted step function (linear regression)
lm(Y ~ cut(X, 4))
# Fitted logistic regression (binary Y)
glm(Y ~ cut(X, 4), family = binomial)
```
The above code chunk chooses the data and the number of bins to employ. This is the entire application of the step function.
Here it is in context!
### Full Comprehensive Example
This example uses the lm() function, but I will highlight where it fits a step function for analysis!
```{r}
# Load library and attach data set
rm(list=ls())
library(ISLR)
attach(Wage)
```
```{r}
#using the lm() function, in order to predict
#wage using a fourth-degree polynomial in age: poly(age,4). The poly() command allows us to avoid having to write out a long formula with powers of age.
fit=lm(wage~poly(age ,4) ,data=Wage)
coef(summary(fit))
```
```{r}
#We can use poly() to obtain age, age^2, age^3, and age^4, if we prefer. We do this by adding the raw=TRUE argument to the poly() function
fit2 = lm(wage~poly(age,4,raw=T),data=Wage)
coef(summary(fit2))
```
```{r}
# There are a few other ways to fit the model, showing the flexibility of the formula language in R:
fit2a = lm(wage~age+I(age^2)+I(age^3)+I(age^4),data=Wage)
coef(fit2a)
fit2b = lm(wage~cbind(age,age^2,age^3,age^4),data=Wage)
coef(fit2b)
```
```{r}
#We now create a grid of values for age at which we want predictions, and then call the generic predict() function, specifying that we want standard errors as well
agelims = range(age)
age.grid = seq(from=agelims[1], to=agelims[2])
preds = predict(fit, newdata=list(age=age.grid), se=TRUE)
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
```
```{r}
#Finally, we plot the data and add the fit from the degree-4 polynomial:
plot(age, wage, xlim=agelims, cex=.5, col="darkgrey")
lines(age.grid, preds$fit, lwd=2, col="blue")
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3)
```
```{r}
#To determine the simplest model which is sufficient to explain the relationship between wage and age we fit 5 models of increasing polynomial degrees:
fit.1 =lm(wage~age,data=Wage)
fit.2 =lm(wage~poly(age,2),data=Wage)
fit.3 =lm(wage~poly(age,3),data=Wage)
fit.4 =lm(wage~poly(age,4),data=Wage)
fit.5 =lm(wage~poly(age,5),data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
```
We use the anova() function, which performs an analysis of variance (ANOVA, using an F-test) in order to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex model M2 is required.
Result: either a cubic or a quartic model (fit.3, fit.4) would be best for the data
```{r}
#Now we want to try to predict if an individual is making more than $250,000 per year:
fit = glm(I(wage>250)~poly(age,4), data=Wage, family=binomial)
#The expression wage > 250 evaluates to a logical variable of TRUE's and FALSE's, which are turned into 1's and 0's
preds = predict(fit,newdata=list(age=age.grid),se=T)
#The default prediction type for a glm() model is type="link", which is what we use here. This means we get predictions
#of the logit, so some transformations need to be done to obtain the correct confidence interval
pfit = exp(preds$fit)/(1+exp(preds$fit))
se.bands.logit=cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
se.bands = exp(se.bands.logit)/(1+exp(se.bands.logit))
```
```{r}
#To complete the right-hand side plot of figure 7.1:
plot(age,I(wage>250), xlim=agelims, type="n", ylim=c(0,.2))
points(jitter(age), I((wage>250)/5), cex=.5, pch="|", col="darkgrey")
lines(age.grid, pfit, lwd=2, col="blue")
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3)
```
### Step Function Implementation
```{r}
table(cut(age,4))
fit=lm(wage~cut(age,4),data=Wage)
coef(summary(fit))
```
### Results and Final Thoughts
The cut() function automatically chose breakpoints of 33.5, 49, and 64.5 years, but we could have chosen our own
bins with the "breaks" option when splitting the data into bins, as in the sketch below.
The cut() function returns an ordered categorical variable, which the lm() function then uses to create dummy variables
to regress upon.
The age<33.5 category is left out, so the intercept is the average wage for those under 33.5 years old.
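A minimal sketch with breakpoints of our own choosing (17, 35, 50, 65, 81 are arbitrary values that cover the age range of the Wage data):
```{r, eval=FALSE}
# Sketch: supplying our own bin boundaries via the breaks argument.
fit.custom <- lm(wage ~ cut(age, breaks = c(17, 35, 50, 65, 81)), data = Wage)
coef(summary(fit.custom))
```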
--------------------------------------------------------------------------------
## Regression Splines {.tabset}
Author: Andrew Tremblay & Thomas Trankle
### Summary/description:
**Regression splines** are more flexible than polynomials and step functions.
This is because they are actually an extension of the two.
They involve dividing the range of *X* into *K* distinct regions. Within each region,
a polynomial function is fit to the data. However, these polynomials are constrained so that
they **join smoothly at the region boundaries, or knots**. Provided that the interval is
divided into enough regions, this can produce an extremely flexible fit.
The points where the coefficients change are called *knots*.
In general, if we place *K* different knots throughout the range of *X*, then we
will end up fitting *K + 1* different polynomials.
```{r, echo = F, cache=T}
knitr::include_graphics("regKnots.png")
```
To make it a continuous function we add constraints to the model.
Each constraint that we impose on the piecewise polynomials effectively frees up one
degree of freedom.
Remember, there are degrees of freedom for the entire model(Total DF) and the
specified degrees of freedom for the polynomial (df).
$$df = degree + knots$$
```{r, eval = F}
# The bs() function implements the relationship above
bs(x, df = NULL, knots = NULL, degree = 3,  # cubic by default
   intercept = FALSE)
```
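As a quick worked sketch (our own example on the Wage data from ISLR; knots at ages 25, 40, and 60 are an illustrative choice, giving $df = 3 + 3 = 6$):
```{r, eval=FALSE}
# Sketch: a cubic regression spline for wage as a function of age.
library(ISLR)
library(splines)
spline.fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
# Equivalent: specify df and let bs() place the knots at quantiles of age.
spline.fit2 <- lm(wage ~ bs(age, df = 6), data = Wage)
summary(spline.fit)
```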