# SMHS LinearModeling MachineLearning

## SMHS Linear Modeling - Machine Learning Algorithms

Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.

Questions:

• How can we tie human intuition and computer-generated results to obtain reliable, effective, and efficient decision-support system (that facilitates, forecasting)?
• Niels Born – “It is difficult to make predictions, especially about the future”
• Can we unsupervisely classify the data?

Prediction

For most of the machine learning algorithms (including first-order linear regression), we:

• ﬁrst generate the model using training data, and then
• predict values for test/new data.

Predictions are made using the R predict function. (type ?predict.name), where name is the function-name corresponding to the algorithm. The ﬁrst argument of predict often represents the variable storing the model, and the second argument is a matrix or data frame of test data that the model needs to be applied to. Calling predict can be done in 2 ways: type predict or type of predict.name.

Example:

```#mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T)  # 01a_data.txt
```
```# (1) First, there are different approaches to split the data (partition the data) into
# training and testing sets.
## TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproductible
set.seed(1234)
train_ind <- sample(seq_len(nrow(mydata)), size = sample_size)
train <- mydata[train_ind, ]
```
```# TESTING DATA
test <- mydata[-train_ind, ]
```
```lin.mod <- lm(Weight ~ Height*Team, data=train)
predicted.values <-  predict(lin.mod, newdata=test
```

Data Modeling/Training

Logistic Regression:

```glm_model <-glm(ifelse(Weight > 200,1,0) ~ Height*Team, family=binomial(link="logit"), data=train)
```

K-Means Clustering

```train.1 <- cbind(train\$\\$\$Height, train\$\\$\$Weight, train\$\\$\$Age)
test.1 <- cbind(test\$\\$\$Height, test\$\\$\$Weight, test\$\\$\$Age)
Weight.1 <- ifelse(train\$\\$\$Weight > 200,1,0)

kmeans_model <- kmeans(<u><b>train.1</b></u>, 3)
plot(train.1, col = kmeans_model\$\\$\$cluster)
points(kmeans_model\$\\$\$centers, col = 1:2, pch = 8, cex = 2)

<b>K-Nearest Neighbor Classiﬁcation</b>
# install.packages("class")
library("class")
knn_model  <-  knn(train=train.1,  test=test.1,  cl=as.factor(Weight.1),  k=5)
plot(knn_model)
summary(knn_model)

<b>Naïve Bayes Classifier</b>
install.packages("e1071")
library("e1071")
nbc_model <-  naiveBayes(Weight ~ Height*Age,  data=train.1)

<b>Decision Trees (CART)</b>
#install.packages("e1071")
library("rpart")
cart_model <- rpart(Weight ~ Height+Age, data= as.data.frame(train.1), method="class")
plot(cart_model)
text(cart_model)

# X be the matrix of features, and labels be a vector of 0-1 class labels.
boost_model <- ada(x= cbind(train\$\\$\$Height, train\$\\$\$Weight, train\$\\$\$Age), y= Weight.1)
plot(boost_model)
boost_model
```

Support Vector Machines (SVM)

```#install.packages("e1071")
library("rpart")
svm_model <- svm(x= cbind(train\$\\$\$Height, train\$\\$\$Weight, train\$\\$\$Age), y=as.factor(Weight.1),
summary(svm_model)

=='"`UNIQ--h-1--QINU`"'Appendix==

==='"`UNIQ--h-2--QINU`"'Example 1: Simulation (subject, day, treatment, observation)===

<i>Obs ~ Treatment + Day + Subject(Treatment)+ Day*Subject(Treatment)+ ε.</i>

This model is accounts for:<br>
Response = Obs

Fixed effects:<br>
Treatment (fixed)<br>
Day (fixed)<br>
Treatment*Day interaction

Random Effects:<br>
Subject nested within Treatment (random)<br>
Day crossed with "Subject within Treatment" (random)

mydata <- data.frame(
Subject  = c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65),
Day       = c(rep(c("Day1", "Day3", "Day6"), each=28)),
Treatment = c(rep(c("B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C",
"A", "A", "B", "A", "C", "B", "C", "A", "A"), each = 4)),
Obs       = c(6.472687, 7.017110, 6.200715, 6.613928, 6.829968, 7.387583, 7.367293,
8.018853, 7.527408, 6.746739, 7.296910, 6.983360, 6.816621, 6.571689,
5.911261, 6.954988, 7.624122, 7.669865, 7.676225, 7.263593, 7.704737,
7.328716, 7.295610, 5.964180, 6.880814, 6.926342, 6.926342, 7.562293,
6.677607, 7.023526, 6.441864, 7.020875, 7.478931, 7.495336, 7.427709,
7.633020, 7.382091, 7.359731, 7.285889, 7.496863, 6.632403, 6.171196,
6.306012, 7.253833, 7.594852, 6.915225, 7.220147, 7.298227, 7.573612,
7.366550, 7.560513, 7.289078, 7.287802, 7.155336, 7.394452, 7.465383,
6.976048, 7.222966, 6.584153, 7.013223, 7.569905, 7.459185, 7.504068,
7.801867, 7.598728, 7.475841, 7.511873, 7.518384, 6.618589, 5.854754,
6.125749, 6.962720, 7.540600, 7.379861, 7.344189, 7.362815, 7.805802,
7.764172, 7.789844, 7.616437, NA, NA, NA, NA))

install.packages("lme4")
library("lme4", lib.loc="~/R/win-library/3.1")
m1 <- lmer(Obs ~ Treatment * Day + (1 | Subject), mydata)
m1

Linear mixed model fit by REML ['lmerMod']

Formula: Obs ~ Treatment * Day + (1 | Subject)

Data: mydata

REML criterion at convergence: 56.8669

<center> Random Effects
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Groups||Name||Std. Dev.
|-
|Subject||(Intercept)||0.2163
|-
|Residual|| ||0.2602

|}
Number of obs: 80, groups:  Subject, 28 </center>

<center> Fixed Effects
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|(Intercept)||TreatmentB||TreatmentC
|-
|7.1827||-0.6129||0.1658
|-
|DayDay3||DayDay6 ||TreatmentB: DayDay3
|-
|0.2446||0.4507||-0.1235
|-
|TreatmentC: DayDay3||TreatmentB: DayDay6||TreatmentC: DayDay6
|-
|-0.2740||-0.3508||-0.3327

|}
</center>

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Index||Subject||Day||Treatment||Obs
|-
|1||13||Day1||B||6.472687
|-
|2||14||Day1||B||7.01711
|-
|3||15||Day1||B||6.200715
|-
|4||16||Day1||B||6.613928
|-
|5||17||Day1||A||6.829968
|-
|6||18||Day1||A||7.387583
|-
|7||19||Day1||A||7.367293
|-
|8||20||Day1||A||8.018853
|-
|9||21||Day1||C||7.527408
|-
|10||22||Day1||C||6.746739
|-
|11||23||Day1||C||7.29691
|-
|12||24||Day1||C||6.98336
|-
|13||29||Day1||B||6.816621
|-
|14||30||Day1||B||6.571689
|-
|15||31||Day1||B||5.911261
|-
|16||32||Day1||B||6.954988
|-
|17||33||Day1||C||7.624122
|-
|18||34||Day1||C||7.669865
|-
|19||35||Day1||C||7.676225
|-
|20||36||Day1||C||7.263593
|-
|21||37||Day1||A||7.704737
|-
|22||38||Day1||A||7.328716
|-
|23||39||Day1||A||7.29561
|-
|24||40||Day1||A||5.96418
|-
|25||62||Day1||A||6.880814
|-
|26||63||Day1||A||6.926342
|-
|27||64||Day1||A||6.926342
|-
|28||65||Day1||A||7.562293
|-
|29||13||Day3||B||6.677607
|-
|30||14||Day3||B||7.023526
|-
|31||15||Day3||B||6.441864
|-
|32||16||Day3||B||7.020875
|-
|33||17||Day3||A||7.478931
|-
|34||18||Day3||A||7.495336
|-
|35||19||Day3||A||7.427709
|-
|36||20||Day3||A||7.63302
|-
|37||21||Day3||C||7.382091
|-
|38||22||Day3||C||7.359731
|-
|39||23||Day3||C||7.285889
|-
|40||24||Day3||C||7.496863
|-
|41||29||Day3||B||6.632403
|-
|42||30||Day3||B||6.171196
|-
|43||31||Day3||B||6.306012
|-
|44||32||Day3||B||7.253833
|-
|45||33||Day3||C||7.594852
|-
|46||34||Day3||C||6.915225
|-
|47||35||Day3||C||7.220147
|-
|48||36||Day3||C||7.298227
|-
|49||37||Day3||A||7.573612
|-
|50||38||Day3||A||7.36655
|-
|51||39||Day3||A||7.560513
|-
|52||40||Day3||A||7.289078
|-
|53||62||Day3||A||7.287802
|-
|54||63||Day3||A||7.155336
|-
|55||64||Day3||A||7.394452
|-
|56||65||Day3||A||7.465383
|-
|57||13||Day6||B||6.976048
|-
|58||14||Day6||B||7.222966
|-
|59||15||Day6||B||6.584153
|-
|60||16||Day6||B||7.013223
|-
|61||17||Day6||A||7.569905
|-
|62||18||Day6||A||7.459185
|-
|63||19||Day6||A||7.504068
|-
|64||20||Day6||A||7.801867
|-
|65||21||Day6||C||7.598728
|-
|66||22||Day6||C||7.475841
|-
|67||23||Day6||C||7.511873
|-
|68||24||Day6||C||7.518384
|-
|69||29||Day6||B||6.618589
|-
|70||30||Day6||B||5.854754
|-
|71||31||Day6||B||6.125749
|-
|72||32||Day6||B||6.96272
|-
|73||33||Day6||C||7.5406
|-
|74||34||Day6||C||7.379861
|-
|75||35||Day6||C||7.344189
|-
|76||36||Day6||C||7.362815
|-
|77||37||Day6||A||7.805802
|-
|78||38||Day6||A||7.764172
|-
|79||39||Day6||A||7.789844
|-
|80||40||Day6||A||7.616437
|-
|81||62||Day6||A||NA
|-
|82||63||Day6||A||NA
|-
|83||64||Day6||A||NA
|-
|84||65||Day6||A||NA

|}
</center>

==='"`UNIQ--h-3--QINU`"'Example 2: Genotype-phenotype===

Save this data file as a tab-separated TXT file:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Genotype||Race||Subject||Weight
|-
|A||1||1||8
|-
|A||1||2||9
|-
|A||1||3||11
|-
|A||1||4||12
|-
|A||1||5||10
|-
|A||2||1||17
|-
|A||2||2||17
|-
|A||2||3||16
|-
|A||2||4||15
|-
|A||2||5||19
|-
|A||2||6||18
|-
|A||2||7||18
|-
|A||2||8||18
|-
|A||2||9||24
|-
|A||3||1||12
|-
|A||3||2||12
|-
|A||3||3||16
|-
|A||3||4||15
|-
|A||3||5||15
|-
|A||3||6||14
|-
|A||4||1||17
|-
|A||4||2||20
|-
|A||4||3||20
|-
|A||4||4||19
|-
|A||4||5||19
|-
|A||4||6||18
|-
|A||4||7||20
|-
|A||4||8||19
|-
|A||4||9||19
|-
|B||5||1||9
|-
|B||5||2||12
|-
|B||5||3||13
|-
|B||5||4||16
|-
|B||5||5||14
|-
|B||5||6||14
|-
|B||6||1||10
|-
|B||6||2||10
|-
|B||6||3||9
|-
|B||6||4||8
|-
|B||6||5||13
|-
|B||6||6||9
|-
|B||6||7||11
|-
|B||7||1||12
|-
|B||7||2||16
|-
|B||7||3||17
|-
|B||7||4||15
|-
|B||7||5||15
|-
|B||7||6||15
|-
|B||8||1||9
|-
|B||8||2||6
|-
|B||8||3||8
|-
|B||8||4||8
|-
|B||8||5||13
|-
|B||8||6||9
|-
|B||8||7||9
|-
|B||8||8||10

|}
</center>

names(data)
attach(data)
table(Genotype, Race)
table(Race, Subject)

# for demonstration, construct a balanced data set
#  5 subjects for each race
data_balance <- data[data\$\\$\$Subject <=5,]
```
```# create factors
data_balance\$\\$\$g <- as.factor(data_balance\$\\$\$Genotype)
data_balance\$\\$\$t <- as.factor(data_balance\$\\$\$Race)
data_balance\$\\$\$s <- as.factor(data_balance\$\\$\$Subject)
```
```# fit the ANOVA
anova.model <- lm(Weight ~ g+t, data= data_balance)
# get the ANOVA table
anova(anova.model)
```
```# note that all F tests use MSE, subjects within race as denominator
# will need to hand-calculate test for genotypes
```
```# Random effects modeling estimation using REML
library(lme4)
```
```lme.model <- lmer(Weight~g+(1|t),data= data_balance)
summary(lme.model)
anova(lme.model)
```
```# various extractor functions:
fixef(lme.model)    	# estimates of fixed effects
vcov(lme.model)     	# VC matrix for the fixed effects
VarCorr(lme.model)  	# estimated variance(-covariance) for random effects
ranef(lme.model)    	# predictions of random effects
coef(lme.model)     	# fixed effects + pred's of random effects
fitted(lme.model)   	# conditional means for each obs (X bhat + Z uhat)
resid(lme.model)    	# conditional residuals (Y - fitted)
```
```# REML is the default method for estimating variance components.
# If want to use ML, can specify that
lmer(Weight~g+(1|t),REML=F, data= data_balance)
```
```# Now, to get p-values for the effects, or construct a ci
#  inference on Subject weight focus is the difference between the two genotypes
#  which is Cb for C = [0, 1] using the default R parameterization
# following assumes lme4 library loaded, data frame is d
#  uses full data set (unbalanced)
# also assumes g and t are factors identifying genotypes and trays
```
```# fit the model
data\$\\$\$g <- as.factor(data\$\\$\$Genotype)
data\$\\$\$t <- as.factor(data\$\\$\$Race)
data\$\\$\$s <- as.factor(data\$\\$\$Subject)
```
```# See R DATA TYPES: http://www.statmethods.net/input/datatypes.html
```
```model.lmer <- lmer(Weight ~ g + (1|t), data= data)
# get a confidence interval for g
# slower, obvious programing
nsim <- 10
gdiff <- rep(NA, nsim)
for (i in 1:nsim) {
data\$\\$\$y <- simulate(model.lmer)    # param bootstrap data set
model.lmer.1 <- lmer(unlist(data\$\\$\$y) ~ g + (1|t), data= data)
# we need to turn the list of simulated values (y) into an atomic vector with unlist()
# both data frames and models objects (e.g.,  produced by lm()) are lists
gdiff[i] <- fixef(model.lmer.1)[2]   # keep only the est diff
}
quantile(gdiff, c(0.025, 0.975))
# print 95% CI of the coefficient/effect-size of genotype(g)
```
```# print model summary
summary(model.lmer)
```
```# two ways to speed up CI construction and increase simulations to 1K
# use apply to avoid the for loop
# use refit() to avoid the setup time before fitting the LME
```
```yall <- simulate(model.lmer, nsim=1000)
gdiff <- apply(yall, 2, function(y) {fixef(refit(model.lmer, y))[2]})
quantile(gdiff, c(0.025, 0.975))
```
```# To get both the estimate and the SE
#  which are in the fixed effect table returned by summary()
#  use the extractor functions fixef() and vcov()
```
```yall <- simulate(model.lmer, nsim=1000)
gdt <- apply(yall, 2, function(y) {
model.lmer.2 <- refit(model.lmer, y);
c(fixef(model.lmer.2)[2], sqrt(vcov(model.lmer.2)[2,2]) )
})
gdt <- t(gdt)    # because result of apply is 2 rows, nsim cols
```
```# Using gdt, we can obtain a parametric bootstrap-t interval
```
```# Hypothesis testing: here we need to simulate under H0: g = 0
# To find the t statistic and the F statistic
#  t statistic is in the @coefs table
#  F statistic is a value in the anova() output
#    if more than one test, need to subscript (unless you want all F's)
#    need to reference as a list to avoid R issues
```
```# Complete model with genotypes
model.lmer <- lmer(Weight ~ g + (1|t), data= data)
```
```model.lmer.0 <- lmer(Weight ~ (1|t), data=data)
yall <- simulate(model.lmer.0, nsim=1000)
#  NB: simulate data under model.lmer.0
gdt0 <- apply(yall, 2, function(y) {
model.lmer.2 <- refit(model.lmer, y); # but analyze under alt. model (model.lmer)
c(summary(model.lmer.2), anova(model.lmer.2)\$\\$\$'F value' )
})
gdt0 <- t(gdt0)

# obs t = 1.438179, p-value using t statistic:
coef(summary(model.lmer))
mean(abs(gdt0[,1]\$\\$\$sim_1\$\\$\$residuals) >= 1.438179)
# p-value = 0.125

# obs F = 2.0684
anova(model.lmer)
mean(gdt0[,2]\$\\$\$sim_2\$\\$\$residuals >= 2.0684)
# p-value = 0.01785714
```

Baseball data

```data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\01a_data.txt',as.is=T, header=T)
boxplot(Weight ~ Position, data = data, xlab = "Position", ylab = "Weight",
main = "MLB Weight Distribution by Position")
boxplot(Weight ~ Team, data = data, xlab = "Team", ylab = "Weight",
main = "MLB Team Weight Distributions")
```

### References

Bates, D.M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0.

Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.

Barr, D.J., Levy, R., Scheepers, C., & Tilly, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.

Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.

Wike, E.L., & Church, J.D. (1976). Comments on Clark’s “The language-as-fixed-effect fallacy”. Journal of Verbal Learning & Verbal Behavior, 15, 249-255.

Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. [1]

....