Difference between revisions of "SMHS LinearModeling MachineLearning"

From SOCR
Jump to: navigation, search
(References)
 
(15 intermediate revisions by 2 users not shown)
Line 5: Line 5:
 
<b>Questions:</b>
 
<b>Questions:</b>
 
*How can we tie human intuition and computer-generated results to obtain reliable, effective, and efficient decision-support system (that facilitates, forecasting)?
 
*How can we tie human intuition and computer-generated results to obtain reliable, effective, and efficient decision-support system (that facilitates, forecasting)?
*Niels Born – “It is difficult to make predictions, especially about the future”
+
*Niels Born – “It is difficult to make predictions, especially about the future”
 
*Can we unsupervisely classify the data?
 
*Can we unsupervisely classify the data?
 
 
<b>Prediction</b>
 
<b>Prediction</b>
  
 
For most of the machine learning algorithms (including first-order linear regression), we:
 
For most of the machine learning algorithms (including first-order linear regression), we:
*first generate the model using training data, and then  
+
*first generate the model using training data, and then
*predict values for test/new data.  
+
*predict values for test/new data.
  
Predictions are made using the R <b>predict</b> function. (type <b>?predict.name</b>), where <b>name</b> is the function-name corresponding to the algorithm. The first argument of predict often represents the variable storing the model and the second argument is a matrix or data frame of test data that the model needs to be applied to. Calling predict can be done in 2 ways: type <b>predict</b> or type of <b>predict.name.</b>
+
Predictions are made using the R <b>predict</b> function. (type <b>?predict.name</b>), where <b>name</b> is the function-name corresponding to the algorithm. The first argument of predict often represents the variable storing the model, and the second argument is a matrix or data frame of test data that the model needs to be applied to. Calling predict can be done in 2 ways: type <b>predict</b> or type of <b>predict.name.</b>
  
 
<b>Example:</b>
 
<b>Example:</b>
Line 34: Line 33:
  
 
  lin.mod <- lm(Weight ~ Height*Team, <b><u>data=train</u></b>)
 
  lin.mod <- lm(Weight ~ Height*Team, <b><u>data=train</u></b>)
  predicted.values <-  predict(lin.mod, <b><u>newdata=test</b></u>)
+
  predicted.values <-  predict(lin.mod, <b><u>newdata=test</u></b>
 +
 
 +
<b>Data Modeling/Training</b>
 +
 
 +
<b>Logistic Regression:</b>
 +
glm_model <-glm(ifelse(Weight > 200,1,0) ~ Height*Team, family=binomial(link="logit"), <u><b>data=train</b></u>)
 +
 
 +
<b>K-Means Clustering</b>
 +
train.1 <- cbind(train$\$$Height, train$\$$Weight, train$\$$Age)
 +
test.1 <- cbind(test$\$$Height, test$\$$Weight, test$\$$Age)
 +
Weight.1 <- ifelse(train$\$$Weight > 200,1,0)
 +
 
 +
head(train.1)
 +
kmeans_model <- kmeans(<u><b>train.1</b></u>, 3)
 +
plot(train.1, col = kmeans_model$\$$cluster)
 +
points(kmeans_model$\$$centers, col = 1:2, pch = 8, cex = 2)
 +
 
 +
<b>K-Nearest Neighbor Classification</b>
 +
# install.packages("class")
 +
library("class")
 +
knn_model  <-  knn(train=train.1,  test=test.1,  cl=as.factor(Weight.1),  k=5)
 +
plot(knn_model)
 +
summary(knn_model)
 +
 
 +
<b>Naïve Bayes Classifier</b>
 +
install.packages("e1071")
 +
library("e1071")
 +
nbc_model <-  naiveBayes(Weight ~ Height*Age,  data=train.1)
 +
 
 +
<b>Decision Trees (CART)</b>
 +
#install.packages("e1071")
 +
library("rpart")
 +
cart_model <- rpart(Weight ~ Height+Age, data= as.data.frame(train.1), method="class")
 +
plot(cart_model)
 +
text(cart_model)
 +
 
 +
<b>AdaBoost</b>
 +
install.packages("ada")
 +
# X be the matrix of features, and labels be a vector of 0-1 class labels.
 +
library("ada")
 +
boost_model <- ada(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y= Weight.1)
 +
plot(boost_model)
 +
boost_model
 +
 
 +
<b>Support Vector Machines (SVM)</b>
 +
#install.packages("e1071")
 +
library("rpart")
 +
svm_model <- svm(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y=as.factor(Weight.1),
 +
kernel ="radial")
 +
summary(svm_model)
 +
 
 +
==Appendix==
 +
 
 +
===Example 1: Simulation (subject, day, treatment, observation)===
 +
 
 +
<i>Obs ~ Treatment + Day + Subject(Treatment)+ Day*Subject(Treatment)+ ε.</i>
 +
 
 +
This model is accounts for:<BR>
 +
Response = Obs
 +
 
 +
Fixed effects:<BR>
 +
Treatment (fixed)<BR>
 +
Day (fixed)<BR>
 +
Treatment*Day interaction
 +
 
 +
Random Effects:<BR>
 +
Subject nested within Treatment (random)<BR>
 +
Day crossed with "Subject within Treatment" (random)
 +
 
 +
mydata <- data.frame(
 +
Subject  = c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33,
 +
34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18,
 +
19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
 +
40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
 +
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65),
 +
Day      = c(rep(c("Day1", "Day3", "Day6"), each=28)),
 +
Treatment = c(rep(c("B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C",
 +
"A", "A", "B", "A", "C", "B", "C", "A", "A"), each = 4)),
 +
Obs      = c(6.472687, 7.017110, 6.200715, 6.613928, 6.829968, 7.387583, 7.367293,
 +
8.018853, 7.527408, 6.746739, 7.296910, 6.983360, 6.816621, 6.571689,
 +
5.911261, 6.954988, 7.624122, 7.669865, 7.676225, 7.263593, 7.704737,
 +
7.328716, 7.295610, 5.964180, 6.880814, 6.926342, 6.926342, 7.562293,
 +
6.677607, 7.023526, 6.441864, 7.020875, 7.478931, 7.495336, 7.427709,
 +
7.633020, 7.382091, 7.359731, 7.285889, 7.496863, 6.632403, 6.171196,
 +
6.306012, 7.253833, 7.594852, 6.915225, 7.220147, 7.298227, 7.573612,
 +
7.366550, 7.560513, 7.289078, 7.287802, 7.155336, 7.394452, 7.465383,
 +
6.976048, 7.222966, 6.584153, 7.013223, 7.569905, 7.459185, 7.504068,
 +
7.801867, 7.598728, 7.475841, 7.511873, 7.518384, 6.618589, 5.854754,
 +
6.125749, 6.962720, 7.540600, 7.379861, 7.344189, 7.362815, 7.805802,
 +
7.764172, 7.789844, 7.616437, NA, NA, NA, NA))
 +
 
 +
install.packages("lme4")
 +
library("lme4", lib.loc="~/R/win-library/3.1")
 +
m1 <- lmer(Obs ~ Treatment * Day + (1 | Subject), mydata)
 +
m1
 +
 
 +
Linear mixed model fit by REML ['lmerMod']
 +
 
 +
Formula: Obs ~ Treatment * Day + (1 | Subject)
 +
 
 +
Data: mydata
 +
 
 +
REML criterion at convergence: 56.8669
 +
 
 +
<center> Random Effects
 +
{| class="wikitable" style="text-align:center; width:35%" border="1"
 +
|-
 +
|Groups||Name||Std. Dev.
 +
|-
 +
|Subject||(Intercept)||0.2163
 +
|-
 +
|Residual|| ||0.2602
 +
 
 +
|}
 +
Number of obs: 80, groups:  Subject, 28 </center>
 +
 
 +
 
 +
 
 +
 
 +
 
 +
<center> Fixed Effects
 +
{| class="wikitable" style="text-align:center; width:35%" border="1"
 +
|-
 +
|(Intercept)||TreatmentB||TreatmentC
 +
|-
 +
|7.1827||-0.6129||0.1658
 +
|-
 +
|DayDay3||DayDay6 ||TreatmentB: DayDay3
 +
|-
 +
|0.2446||0.4507||-0.1235
 +
|-
 +
|TreatmentC: DayDay3||TreatmentB: DayDay6||TreatmentC: DayDay6
 +
|-
 +
|-0.2740||-0.3508||-0.3327
 +
 
 +
|}
 +
</center>
 +
 
 +
 
 +
 
 +
 
 +
 
 +
<center>
 +
{| class="wikitable" style="text-align:center; width:35%" border="1"
 +
|-
 +
|Index||Subject||Day||Treatment||Obs
 +
|-
 +
|1||13||Day1||B||6.472687
 +
|-
 +
|2||14||Day1||B||7.01711
 +
|-
 +
|3||15||Day1||B||6.200715
 +
|-
 +
|4||16||Day1||B||6.613928
 +
|-
 +
|5||17||Day1||A||6.829968
 +
|-
 +
|6||18||Day1||A||7.387583
 +
|-
 +
|7||19||Day1||A||7.367293
 +
|-
 +
|8||20||Day1||A||8.018853
 +
|-
 +
|9||21||Day1||C||7.527408
 +
|-
 +
|10||22||Day1||C||6.746739
 +
|-
 +
|11||23||Day1||C||7.29691
 +
|-
 +
|12||24||Day1||C||6.98336
 +
|-
 +
|13||29||Day1||B||6.816621
 +
|-
 +
|14||30||Day1||B||6.571689
 +
|-
 +
|15||31||Day1||B||5.911261
 +
|-
 +
|16||32||Day1||B||6.954988
 +
|-
 +
|17||33||Day1||C||7.624122
 +
|-
 +
|18||34||Day1||C||7.669865
 +
|-
 +
|19||35||Day1||C||7.676225
 +
|-
 +
|20||36||Day1||C||7.263593
 +
|-
 +
|21||37||Day1||A||7.704737
 +
|-
 +
|22||38||Day1||A||7.328716
 +
|-
 +
|23||39||Day1||A||7.29561
 +
|-
 +
|24||40||Day1||A||5.96418
 +
|-
 +
|25||62||Day1||A||6.880814
 +
|-
 +
|26||63||Day1||A||6.926342
 +
|-
 +
|27||64||Day1||A||6.926342
 +
|-
 +
|28||65||Day1||A||7.562293
 +
|-
 +
|29||13||Day3||B||6.677607
 +
|-
 +
|30||14||Day3||B||7.023526
 +
|-
 +
|31||15||Day3||B||6.441864
 +
|-
 +
|32||16||Day3||B||7.020875
 +
|-
 +
|33||17||Day3||A||7.478931
 +
|-
 +
|34||18||Day3||A||7.495336
 +
|-
 +
|35||19||Day3||A||7.427709
 +
|-
 +
|36||20||Day3||A||7.63302
 +
|-
 +
|37||21||Day3||C||7.382091
 +
|-
 +
|38||22||Day3||C||7.359731
 +
|-
 +
|39||23||Day3||C||7.285889
 +
|-
 +
|40||24||Day3||C||7.496863
 +
|-
 +
|41||29||Day3||B||6.632403
 +
|-
 +
|42||30||Day3||B||6.171196
 +
|-
 +
|43||31||Day3||B||6.306012
 +
|-
 +
|44||32||Day3||B||7.253833
 +
|-
 +
|45||33||Day3||C||7.594852
 +
|-
 +
|46||34||Day3||C||6.915225
 +
|-
 +
|47||35||Day3||C||7.220147
 +
|-
 +
|48||36||Day3||C||7.298227
 +
|-
 +
|49||37||Day3||A||7.573612
 +
|-
 +
|50||38||Day3||A||7.36655
 +
|-
 +
|51||39||Day3||A||7.560513
 +
|-
 +
|52||40||Day3||A||7.289078
 +
|-
 +
|53||62||Day3||A||7.287802
 +
|-
 +
|54||63||Day3||A||7.155336
 +
|-
 +
|55||64||Day3||A||7.394452
 +
|-
 +
|56||65||Day3||A||7.465383
 +
|-
 +
|57||13||Day6||B||6.976048
 +
|-
 +
|58||14||Day6||B||7.222966
 +
|-
 +
|59||15||Day6||B||6.584153
 +
|-
 +
|60||16||Day6||B||7.013223
 +
|-
 +
|61||17||Day6||A||7.569905
 +
|-
 +
|62||18||Day6||A||7.459185
 +
|-
 +
|63||19||Day6||A||7.504068
 +
|-
 +
|64||20||Day6||A||7.801867
 +
|-
 +
|65||21||Day6||C||7.598728
 +
|-
 +
|66||22||Day6||C||7.475841
 +
|-
 +
|67||23||Day6||C||7.511873
 +
|-
 +
|68||24||Day6||C||7.518384
 +
|-
 +
|69||29||Day6||B||6.618589
 +
|-
 +
|70||30||Day6||B||5.854754
 +
|-
 +
|71||31||Day6||B||6.125749
 +
|-
 +
|72||32||Day6||B||6.96272
 +
|-
 +
|73||33||Day6||C||7.5406
 +
|-
 +
|74||34||Day6||C||7.379861
 +
|-
 +
|75||35||Day6||C||7.344189
 +
|-
 +
|76||36||Day6||C||7.362815
 +
|-
 +
|77||37||Day6||A||7.805802
 +
|-
 +
|78||38||Day6||A||7.764172
 +
|-
 +
|79||39||Day6||A||7.789844
 +
|-
 +
|80||40||Day6||A||7.616437
 +
|-
 +
|81||62||Day6||A||NA
 +
|-
 +
|82||63||Day6||A||NA
 +
|-
 +
|83||64||Day6||A||NA
 +
|-
 +
|84||65||Day6||A||NA
 +
 
 +
|}
 +
</center>
 +
 
 +
===Example 2: Genotype-phenotype===
 +
 
 +
Save this data file as a tab-separated TXT file:
 +
 
 +
<center>
 +
{| class="wikitable" style="text-align:center; width:35%" border="1"
 +
|-
 +
|Genotype||Race||Subject||Weight
 +
|-
 +
|A||1||1||8
 +
|-
 +
|A||1||2||9
 +
|-
 +
|A||1||3||11
 +
|-
 +
|A||1||4||12
 +
|-
 +
|A||1||5||10
 +
|-
 +
|A||2||1||17
 +
|-
 +
|A||2||2||17
 +
|-
 +
|A||2||3||16
 +
|-
 +
|A||2||4||15
 +
|-
 +
|A||2||5||19
 +
|-
 +
|A||2||6||18
 +
|-
 +
|A||2||7||18
 +
|-
 +
|A||2||8||18
 +
|-
 +
|A||2||9||24
 +
|-
 +
|A||3||1||12
 +
|-
 +
|A||3||2||12
 +
|-
 +
|A||3||3||16
 +
|-
 +
|A||3||4||15
 +
|-
 +
|A||3||5||15
 +
|-
 +
|A||3||6||14
 +
|-
 +
|A||4||1||17
 +
|-
 +
|A||4||2||20
 +
|-
 +
|A||4||3||20
 +
|-
 +
|A||4||4||19
 +
|-
 +
|A||4||5||19
 +
|-
 +
|A||4||6||18
 +
|-
 +
|A||4||7||20
 +
|-
 +
|A||4||8||19
 +
|-
 +
|A||4||9||19
 +
|-
 +
|B||5||1||9
 +
|-
 +
|B||5||2||12
 +
|-
 +
|B||5||3||13
 +
|-
 +
|B||5||4||16
 +
|-
 +
|B||5||5||14
 +
|-
 +
|B||5||6||14
 +
|-
 +
|B||6||1||10
 +
|-
 +
|B||6||2||10
 +
|-
 +
|B||6||3||9
 +
|-
 +
|B||6||4||8
 +
|-
 +
|B||6||5||13
 +
|-
 +
|B||6||6||9
 +
|-
 +
|B||6||7||11
 +
|-
 +
|B||7||1||12
 +
|-
 +
|B||7||2||16
 +
|-
 +
|B||7||3||17
 +
|-
 +
|B||7||4||15
 +
|-
 +
|B||7||5||15
 +
|-
 +
|B||7||6||15
 +
|-
 +
|B||8||1||9
 +
|-
 +
|B||8||2||6
 +
|-
 +
|B||8||3||8
 +
|-
 +
|B||8||4||8
 +
|-
 +
|B||8||5||13
 +
|-
 +
|B||8||6||9
 +
|-
 +
|B||8||7||9
 +
|-
 +
|B||8||8||10
 +
 
 +
|}
 +
</center>
 +
 
 +
# data <- read.table('C:\\Users\\Dinov\\Desktop\\data.txt',as.is=T, header=T)
 +
data <- read.table('data.txt',as.is=T, header=T)
 +
names(data)
 +
attach(data)
 +
table(Genotype, Race)
 +
table(Race, Subject)
 +
 
 +
# for demonstration, construct a balanced data set
 +
#  5 subjects for each race
 +
data_balance <- data[data$\$$Subject <=5,]
 +
 
 +
# create factors
 +
data_balance$\$$g <- as.factor(data_balance$\$$Genotype)
 +
data_balance$\$$t <- as.factor(data_balance$\$$Race)
 +
data_balance$\$$s <- as.factor(data_balance$\$$Subject)
 +
 
 +
# fit the ANOVA
 +
anova.model <- lm(Weight ~ g+t, data= data_balance)
 +
# get the ANOVA table
 +
anova(anova.model)
 +
 
 +
# note that all F tests use MSE, subjects within race as denominator
 +
# will need to hand-calculate test for genotypes
 +
 
 +
# Random effects modeling estimation using REML
 +
library(lme4)
 +
 
 +
lme.model <- lmer(Weight~g+(1|t),data= data_balance)
 +
summary(lme.model)
 +
anova(lme.model)
 +
 
 +
# various extractor functions:
 +
fixef(lme.model)    # estimates of fixed effects
 +
vcov(lme.model)    # VC matrix for the fixed effects
 +
VarCorr(lme.model)  # estimated variance(-covariance) for random effects
 +
ranef(lme.model)    # predictions of random effects
 +
coef(lme.model)    # fixed effects + pred's of random effects
 +
fitted(lme.model)  # conditional means for each obs (X bhat + Z uhat)
 +
resid(lme.model)    # conditional residuals (Y - fitted)
 +
 
 +
# REML is the default method for estimating variance components.
 +
# If want to use ML, can specify that
 +
lmer(Weight~g+(1|t),REML=F, data= data_balance)
 +
 
 +
# Now, to get p-values for the effects, or construct a ci
 +
#  inference on Subject weight focus is the difference between the two genotypes
 +
#  which is Cb for C = [0, 1] using the default R parameterization
 +
# following assumes lme4 library loaded, data frame is d
 +
#  uses full data set (unbalanced)
 +
# also assumes g and t are factors identifying genotypes and trays
 +
 
 +
# fit the model
 +
data$\$$g <- as.factor(data$\$$Genotype)
 +
data$\$$t <- as.factor(data$\$$Race)
 +
data$\$$s <- as.factor(data$\$$Subject)
 +
 
 +
# See R DATA TYPES: http://www.statmethods.net/input/datatypes.html
 +
 
 +
model.lmer <- lmer(Weight ~ g + (1|t), data= data)
 +
# get a confidence interval for g
 +
# slower, obvious programing
 +
nsim <- 10
 +
gdiff <- rep(NA, nsim)
 +
for (i in 1:nsim) {
 +
data$\$$y <- simulate(model.lmer)    # param bootstrap data set
 +
model.lmer.1 <- lmer(unlist(data$\$$y) ~ g + (1|t), data= data)
 +
  # we need to turn the list of simulated values (y) into an atomic vector with unlist()
 +
  # both data frames and models objects (e.g.,  produced by lm()) are lists
 +
  #  http://adv-r.had.co.nz/Data-structures.html
 +
gdiff[i] <- fixef(model.lmer.1)[2]  # keep only the est diff
 +
}
 +
quantile(gdiff, c(0.025, 0.975))
 +
  # print 95% CI of the coefficient/effect-size of genotype(g)
 +
 
 +
# print model summary
 +
summary(model.lmer)
 +
 
 +
# two ways to speed up CI construction and increase simulations to 1K
 +
# use apply to avoid the for loop
 +
# use refit() to avoid the setup time before fitting the LME
 +
 
 +
yall <- simulate(model.lmer, nsim=1000)
 +
gdiff <- apply(yall, 2, function(y) {fixef(refit(model.lmer, y))[2]})
 +
quantile(gdiff, c(0.025, 0.975))
 +
 
 +
# To get both the estimate and the SE
 +
#  which are in the fixed effect table returned by summary()
 +
#  use the extractor functions fixef() and vcov()
 +
 
 +
yall <- simulate(model.lmer, nsim=1000)
 +
gdt <- apply(yall, 2, function(y) {
 +
model.lmer.2 <- refit(model.lmer, y);
 +
c(fixef(model.lmer.2)[2], sqrt(vcov(model.lmer.2)[2,2]) )
 +
})
 +
gdt <- t(gdt)    # because result of apply is 2 rows, nsim cols
 +
 
 +
# Using gdt, we can obtain a parametric bootstrap-t interval
 +
 
 +
# Hypothesis testing: here we need to simulate under <b><u>H0: g = 0</u></b>
 +
# To find the t statistic and the F statistic
 +
#  t statistic is in the @coefs table
 +
#  F statistic is a value in the anova() output
 +
#    if more than one test, need to subscript (unless you want all F's)
 +
#    need to reference as a list to avoid R issues
  
 +
# Complete model with genotypes
 +
model.lmer <- lmer(Weight ~ g + (1|t), data= data)
  
 +
model.lmer.0 <- lmer(Weight ~ (1|t), data=data)
 +
yall <- simulate(model.lmer.0, nsim=1000)
 +
#  NB: simulate data under model.lmer.0
 +
gdt0 <- apply(yall, 2, function(y) {
 +
model.lmer.2 <- refit(model.lmer, y); # but analyze under alt. model (model.lmer)
 +
c(summary(model.lmer.2), anova(model.lmer.2)$\$$'F value' )
 +
})
 +
gdt0 <- t(gdt0)
  
 +
# obs t = 1.438179, p-value using t statistic:
 +
coef(summary(model.lmer))
 +
mean(abs(gdt0[,1]$\$$sim_1$\$$residuals) >= 1.438179)
 +
# p-value = 0.125
  
 +
# obs F = 2.0684
 +
anova(model.lmer)
 +
  mean(gdt0[,2]$\$$sim_2$\$$residuals >= 2.0684)
 +
# p-value = 0.01785714
  
 +
<b>Baseball data</b>
  
 +
http://wiki.socr.umich.edu/index.php/SOCR_Data_MLB_HeightsWeights
  
 +
data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\01a_data.txt',as.is=T, header=T)
 +
boxplot(Weight ~ Position, data = data, xlab = "Position", ylab = "Weight",
 +
main = "MLB Weight Distribution by Position")
 +
boxplot(Weight ~ Team, data = data, xlab = "Team", ylab = "Weight",
 +
main = "MLB Team Weight Distributions")
  
 +
===References===
  
 +
Bates, D.M., Maechler, M., & Bolker, B.  (2012). lme4:  Linear mixed-effects models using S4 classes. R package version 0.999999-0.
  
 +
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
  
 +
Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
  
 +
Barr, D.J., Levy, R., Scheepers, C., & Tilly, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
  
 +
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.
  
....
+
Wike, E.L., & Church, J.D. (1976). Comments on Clark’s “The language-as-fixed-effect fallacy”. Journal of Verbal Learning & Verbal Behavior, 15, 249-255.
  
 +
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. [http://arxiv.org/pdf/1308.5499.pdf]
  
  

Latest revision as of 08:58, 23 May 2016

SMHS Linear Modeling - Machine Learning Algorithms

Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.

Questions:

  • How can we tie human intuition and computer-generated results to obtain reliable, effective, and efficient decision-support system (that facilitates, forecasting)?
  • Niels Born – “It is difficult to make predictions, especially about the future”
  • Can we unsupervisely classify the data?

Prediction

For most of the machine learning algorithms (including first-order linear regression), we:

  • first generate the model using training data, and then
  • predict values for test/new data.

Predictions are made using the R predict function. (type ?predict.name), where name is the function-name corresponding to the algorithm. The first argument of predict often represents the variable storing the model, and the second argument is a matrix or data frame of test data that the model needs to be applied to. Calling predict can be done in 2 ways: type predict or type of predict.name.

Example:

#mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T)  # 01a_data.txt
# mydata <- read.table('data.txt',as.is=T, header=T)
# (1) First, there are different approaches to split the data (partition the data) into 
# training and testing sets.
## TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproductible
set.seed(1234)
train_ind <- sample(seq_len(nrow(mydata)), size = sample_size)
train <- mydata[train_ind, ]
# TESTING DATA
test <- mydata[-train_ind, ]
lin.mod <- lm(Weight ~ Height*Team, data=train)
predicted.values <-  predict(lin.mod, newdata=test

Data Modeling/Training

Logistic Regression:

glm_model <-glm(ifelse(Weight > 200,1,0) ~ Height*Team, family=binomial(link="logit"), data=train)

K-Means Clustering

train.1 <- cbind(train$\$$Height, train$\$$Weight, train$\$$Age)
 test.1 <- cbind(test$\$$Height, test$\$$Weight, test$\$$Age)
Weight.1 <- ifelse(train$\$$Weight > 200,1,0)

 head(train.1)
 kmeans_model <- kmeans(<u><b>train.1</b></u>, 3)
 plot(train.1, col = kmeans_model$\$$cluster)
points(kmeans_model$\$$centers, col = 1:2, pch = 8, cex = 2)

<b>K-Nearest Neighbor Classification</b>
 # install.packages("class")
 library("class")
 knn_model  <-  knn(train=train.1,  test=test.1,  cl=as.factor(Weight.1),  k=5)
 plot(knn_model)
 summary(knn_model)

<b>Naïve Bayes Classifier</b>
 install.packages("e1071")
 library("e1071")
 nbc_model <-  naiveBayes(Weight ~ Height*Age,  data=train.1)

<b>Decision Trees (CART)</b>
 #install.packages("e1071")
 library("rpart")
 cart_model <- rpart(Weight ~ Height+Age, data= as.data.frame(train.1), method="class")
 plot(cart_model)
 text(cart_model) 

<b>AdaBoost</b>
 install.packages("ada")
 # X be the matrix of features, and labels be a vector of 0-1 class labels.
 library("ada")
 boost_model <- ada(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y= Weight.1)
plot(boost_model)
boost_model

Support Vector Machines (SVM)

#install.packages("e1071")
library("rpart")
svm_model <- svm(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y=as.factor(Weight.1), 
 kernel ="radial")
 summary(svm_model)

=='"`UNIQ--h-1--QINU`"'Appendix==

==='"`UNIQ--h-2--QINU`"'Example 1: Simulation (subject, day, treatment, observation)===

<i>Obs ~ Treatment + Day + Subject(Treatment)+ Day*Subject(Treatment)+ ε.</i>

This model is accounts for:<br>
Response = Obs

Fixed effects:<br>
Treatment (fixed)<br>
Day (fixed)<br>
Treatment*Day interaction

Random Effects:<br>
Subject nested within Treatment (random)<br>
Day crossed with "Subject within Treatment" (random)

 mydata <- data.frame(
 Subject  = c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 
 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 
 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 
 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65), 
 Day       = c(rep(c("Day1", "Day3", "Day6"), each=28)), 
 Treatment = c(rep(c("B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C", 
 "A", "A", "B", "A", "C", "B", "C", "A", "A"), each = 4)), 
 Obs       = c(6.472687, 7.017110, 6.200715, 6.613928, 6.829968, 7.387583, 7.367293, 
 8.018853, 7.527408, 6.746739, 7.296910, 6.983360, 6.816621, 6.571689, 
 5.911261, 6.954988, 7.624122, 7.669865, 7.676225, 7.263593, 7.704737, 
 7.328716, 7.295610, 5.964180, 6.880814, 6.926342, 6.926342, 7.562293, 
 6.677607, 7.023526, 6.441864, 7.020875, 7.478931, 7.495336, 7.427709, 
 7.633020, 7.382091, 7.359731, 7.285889, 7.496863, 6.632403, 6.171196, 
 6.306012, 7.253833, 7.594852, 6.915225, 7.220147, 7.298227, 7.573612, 
 7.366550, 7.560513, 7.289078, 7.287802, 7.155336, 7.394452, 7.465383, 
 6.976048, 7.222966, 6.584153, 7.013223, 7.569905, 7.459185, 7.504068, 
 7.801867, 7.598728, 7.475841, 7.511873, 7.518384, 6.618589, 5.854754, 
 6.125749, 6.962720, 7.540600, 7.379861, 7.344189, 7.362815, 7.805802, 
 7.764172, 7.789844, 7.616437, NA, NA, NA, NA))

 install.packages("lme4")
 library("lme4", lib.loc="~/R/win-library/3.1")
 m1 <- lmer(Obs ~ Treatment * Day + (1 | Subject), mydata)
 m1

Linear mixed model fit by REML ['lmerMod']

Formula: Obs ~ Treatment * Day + (1 | Subject)

Data: mydata

REML criterion at convergence: 56.8669

<center> Random Effects
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Groups||Name||Std. Dev.
|-
|Subject||(Intercept)||0.2163
|-
|Residual|| ||0.2602

|}
Number of obs: 80, groups:  Subject, 28 </center>





<center> Fixed Effects
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|(Intercept)||TreatmentB||TreatmentC
|-
|7.1827||-0.6129||0.1658
|-
|DayDay3||DayDay6 ||TreatmentB: DayDay3
|-
|0.2446||0.4507||-0.1235
|-
|TreatmentC: DayDay3||TreatmentB: DayDay6||TreatmentC: DayDay6
|-
|-0.2740||-0.3508||-0.3327

|}
</center>





<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Index||Subject||Day||Treatment||Obs
|-
|1||13||Day1||B||6.472687
|-
|2||14||Day1||B||7.01711
|-
|3||15||Day1||B||6.200715
|-
|4||16||Day1||B||6.613928
|-
|5||17||Day1||A||6.829968
|-
|6||18||Day1||A||7.387583
|-
|7||19||Day1||A||7.367293
|-
|8||20||Day1||A||8.018853
|-
|9||21||Day1||C||7.527408
|-
|10||22||Day1||C||6.746739
|-
|11||23||Day1||C||7.29691
|-
|12||24||Day1||C||6.98336
|-
|13||29||Day1||B||6.816621
|-
|14||30||Day1||B||6.571689
|-
|15||31||Day1||B||5.911261
|-
|16||32||Day1||B||6.954988
|-
|17||33||Day1||C||7.624122
|-
|18||34||Day1||C||7.669865
|-
|19||35||Day1||C||7.676225
|-
|20||36||Day1||C||7.263593
|-
|21||37||Day1||A||7.704737
|-
|22||38||Day1||A||7.328716
|-
|23||39||Day1||A||7.29561
|-
|24||40||Day1||A||5.96418
|-
|25||62||Day1||A||6.880814
|-
|26||63||Day1||A||6.926342
|-
|27||64||Day1||A||6.926342
|-
|28||65||Day1||A||7.562293
|-
|29||13||Day3||B||6.677607
|-
|30||14||Day3||B||7.023526
|-
|31||15||Day3||B||6.441864
|-
|32||16||Day3||B||7.020875
|-
|33||17||Day3||A||7.478931
|-
|34||18||Day3||A||7.495336
|-
|35||19||Day3||A||7.427709
|-
|36||20||Day3||A||7.63302
|-
|37||21||Day3||C||7.382091
|-
|38||22||Day3||C||7.359731
|-
|39||23||Day3||C||7.285889
|-
|40||24||Day3||C||7.496863
|-
|41||29||Day3||B||6.632403
|-
|42||30||Day3||B||6.171196
|-
|43||31||Day3||B||6.306012
|-
|44||32||Day3||B||7.253833
|-
|45||33||Day3||C||7.594852
|-
|46||34||Day3||C||6.915225
|-
|47||35||Day3||C||7.220147
|-
|48||36||Day3||C||7.298227
|-
|49||37||Day3||A||7.573612
|-
|50||38||Day3||A||7.36655
|-
|51||39||Day3||A||7.560513
|-
|52||40||Day3||A||7.289078
|-
|53||62||Day3||A||7.287802
|-
|54||63||Day3||A||7.155336
|-
|55||64||Day3||A||7.394452
|-
|56||65||Day3||A||7.465383
|-
|57||13||Day6||B||6.976048
|-
|58||14||Day6||B||7.222966
|-
|59||15||Day6||B||6.584153
|-
|60||16||Day6||B||7.013223
|-
|61||17||Day6||A||7.569905
|-
|62||18||Day6||A||7.459185
|-
|63||19||Day6||A||7.504068
|-
|64||20||Day6||A||7.801867
|-
|65||21||Day6||C||7.598728
|-
|66||22||Day6||C||7.475841
|-
|67||23||Day6||C||7.511873
|-
|68||24||Day6||C||7.518384
|-
|69||29||Day6||B||6.618589
|-
|70||30||Day6||B||5.854754
|-
|71||31||Day6||B||6.125749
|-
|72||32||Day6||B||6.96272
|-
|73||33||Day6||C||7.5406
|-
|74||34||Day6||C||7.379861
|-
|75||35||Day6||C||7.344189
|-
|76||36||Day6||C||7.362815
|-
|77||37||Day6||A||7.805802
|-
|78||38||Day6||A||7.764172
|-
|79||39||Day6||A||7.789844
|-
|80||40||Day6||A||7.616437
|-
|81||62||Day6||A||NA
|-
|82||63||Day6||A||NA
|-
|83||64||Day6||A||NA
|-
|84||65||Day6||A||NA

|}
</center>

==='"`UNIQ--h-3--QINU`"'Example 2: Genotype-phenotype===

Save this data file as a tab-separated TXT file:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Genotype||Race||Subject||Weight
|-
|A||1||1||8
|-
|A||1||2||9
|-
|A||1||3||11
|-
|A||1||4||12
|-
|A||1||5||10
|-
|A||2||1||17
|-
|A||2||2||17
|-
|A||2||3||16
|-
|A||2||4||15
|-
|A||2||5||19
|-
|A||2||6||18
|-
|A||2||7||18
|-
|A||2||8||18
|-
|A||2||9||24
|-
|A||3||1||12
|-
|A||3||2||12
|-
|A||3||3||16
|-
|A||3||4||15
|-
|A||3||5||15
|-
|A||3||6||14
|-
|A||4||1||17
|-
|A||4||2||20
|-
|A||4||3||20
|-
|A||4||4||19
|-
|A||4||5||19
|-
|A||4||6||18
|-
|A||4||7||20
|-
|A||4||8||19
|-
|A||4||9||19
|-
|B||5||1||9
|-
|B||5||2||12
|-
|B||5||3||13
|-
|B||5||4||16
|-
|B||5||5||14
|-
|B||5||6||14
|-
|B||6||1||10
|-
|B||6||2||10
|-
|B||6||3||9
|-
|B||6||4||8
|-
|B||6||5||13
|-
|B||6||6||9
|-
|B||6||7||11
|-
|B||7||1||12
|-
|B||7||2||16
|-
|B||7||3||17
|-
|B||7||4||15
|-
|B||7||5||15
|-
|B||7||6||15
|-
|B||8||1||9
|-
|B||8||2||6
|-
|B||8||3||8
|-
|B||8||4||8
|-
|B||8||5||13
|-
|B||8||6||9
|-
|B||8||7||9
|-
|B||8||8||10

|}
</center>

 # data <- read.table('C:\\Users\\Dinov\\Desktop\\data.txt',as.is=T, header=T)
 data <- read.table('data.txt',as.is=T, header=T)
 names(data)
 attach(data)
 table(Genotype, Race)
 table(Race, Subject)

 # for demonstration, construct a balanced data set 
 #  5 subjects for each race 
 data_balance <- data[data$\$$Subject <=5,]
# create factors 
data_balance$\$$g <- as.factor(data_balance$\$$Genotype)
data_balance$\$$t <- as.factor(data_balance$\$$Race)
data_balance$\$$s <- as.factor(data_balance$\$$Subject)
# fit the ANOVA 
anova.model <- lm(Weight ~ g+t, data= data_balance)
# get the ANOVA table
anova(anova.model) 
# note that all F tests use MSE, subjects within race as denominator
# will need to hand-calculate test for genotypes
# Random effects modeling estimation using REML
library(lme4)
lme.model <- lmer(Weight~g+(1|t),data= data_balance)
summary(lme.model)
anova(lme.model)
# various extractor functions:
fixef(lme.model)    	# estimates of fixed effects
vcov(lme.model)     	# VC matrix for the fixed effects
VarCorr(lme.model)  	# estimated variance(-covariance) for random effects
ranef(lme.model)    	# predictions of random effects
coef(lme.model)     	# fixed effects + pred's of random effects
fitted(lme.model)   	# conditional means for each obs (X bhat + Z uhat)
resid(lme.model)    	# conditional residuals (Y - fitted)
# REML is the default method for estimating variance components.
# If want to use ML, can specify that
lmer(Weight~g+(1|t),REML=F, data= data_balance)
# Now, to get p-values for the effects, or construct a ci
#  inference on Subject weight focus is the difference between the two genotypes
#  which is Cb for C = [0, 1] using the default R parameterization
# following assumes lme4 library loaded, data frame is d
#  uses full data set (unbalanced)
# also assumes g and t are factors identifying genotypes and trays
# fit the model
data$\$$g <- as.factor(data$\$$Genotype)
data$\$$t <- as.factor(data$\$$Race)
data$\$$s <- as.factor(data$\$$Subject)
# See R DATA TYPES: http://www.statmethods.net/input/datatypes.html
model.lmer <- lmer(Weight ~ g + (1|t), data= data)
# get a confidence interval for g
# slower, obvious programing
nsim <- 10
gdiff <- rep(NA, nsim)
for (i in 1:nsim) {
data$\$$y <- simulate(model.lmer)    # param bootstrap data set
 model.lmer.1 <- lmer(unlist(data$\$$y) ~ g + (1|t), data= data)
  # we need to turn the list of simulated values (y) into an atomic vector with unlist()
  # both data frames and models objects (e.g.,  produced by lm()) are lists
  #  http://adv-r.had.co.nz/Data-structures.html 
gdiff[i] <- fixef(model.lmer.1)[2]   # keep only the est diff
}
quantile(gdiff, c(0.025, 0.975)) 
  # print 95% CI of the coefficient/effect-size of genotype(g)
# print model summary
summary(model.lmer)
# two ways to speed up CI construction and increase simulations to 1K
# use apply to avoid the for loop
# use refit() to avoid the setup time before fitting the LME
yall <- simulate(model.lmer, nsim=1000)
gdiff <- apply(yall, 2, function(y) {fixef(refit(model.lmer, y))[2]})
quantile(gdiff, c(0.025, 0.975))
# To get both the estimate and the SE
#  which are in the fixed effect table returned by summary()
#  use the extractor functions fixef() and vcov()
yall <- simulate(model.lmer, nsim=1000)
gdt <- apply(yall, 2, function(y) { 
model.lmer.2 <- refit(model.lmer, y); 
c(fixef(model.lmer.2)[2], sqrt(vcov(model.lmer.2)[2,2]) ) 
})
gdt <- t(gdt)    # because result of apply is 2 rows, nsim cols
# Using gdt, we can obtain a parametric bootstrap-t interval
# Hypothesis testing: here we need to simulate under H0: g = 0
# To find the t statistic and the F statistic
#  t statistic is in the @coefs table
#  F statistic is a value in the anova() output
#    if more than one test, need to subscript (unless you want all F's)
#    need to reference as a list to avoid R issues
# Complete model with genotypes
model.lmer <- lmer(Weight ~ g + (1|t), data= data)
model.lmer.0 <- lmer(Weight ~ (1|t), data=data)
yall <- simulate(model.lmer.0, nsim=1000)
#  NB: simulate data under model.lmer.0
gdt0 <- apply(yall, 2, function(y) {
model.lmer.2 <- refit(model.lmer, y); # but analyze under alt. model (model.lmer)
c(summary(model.lmer.2), anova(model.lmer.2)$\$$'F value' )
 })
 gdt0 <- t(gdt0)

 # obs t = 1.438179, p-value using t statistic:
 coef(summary(model.lmer))
 mean(abs(gdt0[,1]$\$$sim_1$\$$residuals) >= 1.438179)
 # p-value = 0.125

 # obs F = 2.0684
 anova(model.lmer)
   mean(gdt0[,2]$\$$sim_2$\$$residuals >= 2.0684)
# p-value = 0.01785714

Baseball data

http://wiki.socr.umich.edu/index.php/SOCR_Data_MLB_HeightsWeights

data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\01a_data.txt',as.is=T, header=T)
boxplot(Weight ~ Position, data = data, xlab = "Position", ylab = "Weight",
main = "MLB Weight Distribution by Position")
boxplot(Weight ~ Team, data = data, xlab = "Team", ylab = "Weight",
main = "MLB Team Weight Distributions")

References

Bates, D.M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0.

Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.

Barr, D.J., Levy, R., Scheepers, C., & Tilly, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.

Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.

Wike, E.L., & Church, J.D. (1976). Comments on Clark’s “The language-as-fixed-effect fallacy”. Journal of Verbal Learning & Verbal Behavior, 15, 249-255.

Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. [1]





Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif