Difference between revisions of "SMHS LinearModeling MachineLearning"
(→SMHS Linear Modeling - Machine Learning Algorithms) |
|||
| Line 35: | Line 35: | ||
predicted.values <- predict(lin.mod, <b><u>newdata=test</u></b> | predicted.values <- predict(lin.mod, <b><u>newdata=test</u></b> | ||
| − | Data Modeling/Training | + | <b>Data Modeling/Training</b> |
<b>Logistic Regression:</b> | <b>Logistic Regression:</b> | ||
Revision as of 09:55, 23 May 2016
SMHS Linear Modeling - Machine Learning Algorithms
Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.
Questions:
- How can we tie human intuition and computer-generated results to obtain reliable, effective, and efficient decision-support system (that facilitates, forecasting)?
- Niels Born – “It is difficult to make predictions, especially about the future”
- Can we unsupervisely classify the data?
Prediction
For most of the machine learning algorithms (including first-order linear regression), we:
- first generate the model using training data, and then
- predict values for test/new data.
Predictions are made using the R predict function. (type ?predict.name), where name is the function-name corresponding to the algorithm. The first argument of predict often represents the variable storing the model, and the second argument is a matrix or data frame of test data that the model needs to be applied to. Calling predict can be done in 2 ways: type predict or type of predict.name.
Example:
#mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T) # 01a_data.txt
# mydata <- read.table('data.txt',as.is=T, header=T)
# (1) First, there are different approaches to split the data (partition the data) into # training and testing sets. ## TRAINING: 75% of the sample size sample_size <- floor(0.75 * nrow(mydata)) ## set the seed to make your partition reproductible set.seed(1234) train_ind <- sample(seq_len(nrow(mydata)), size = sample_size) train <- mydata[train_ind, ]
# TESTING DATA test <- mydata[-train_ind, ]
lin.mod <- lm(Weight ~ Height*Team, data=train) predicted.values <- predict(lin.mod, newdata=test
Data Modeling/Training
Logistic Regression:
glm_model <-glm(ifelse(Weight > 200,1,0) ~ Height*Team, family=binomial(link="logit"), data=train)
K-Means Clustering
train.1 <- cbind(train$\$$Height, train$\$$Weight, train$\$$Age)
test.1 <- cbind(test$\$$Height, test$\$$Weight, test$\$$Age)
Weight.1 <- ifelse(train$\$$Weight > 200,1,0)
head(train.1)
kmeans_model <- kmeans(<u><b>train.1</b></u>, 3)
plot(train.1, col = kmeans_model$\$$cluster)
points(kmeans_model$\$$centers, col = 1:2, pch = 8, cex = 2)
<b>K-Nearest Neighbor Classification</b>
# install.packages("class")
library("class")
knn_model <- knn(train=train.1, test=test.1, cl=as.factor(Weight.1), k=5)
plot(knn_model)
summary(knn_model)
<b>Naïve Bayes Classifier</b>
install.packages("e1071")
library("e1071")
nbc_model <- naiveBayes(Weight ~ Height*Age, data=train.1)
<b>Decision Trees (CART)</b>
#install.packages("e1071")
library("rpart")
cart_model <- rpart(Weight ~ Height+Age, data= as.data.frame(train.1), method="class")
plot(cart_model)
text(cart_model)
<b>AdaBoost</b>
install.packages("ada")
# X be the matrix of features, and labels be a vector of 0-1 class labels.
library("ada")
boost_model <- ada(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y= Weight.1)
plot(boost_model)
boost_model
Support Vector Machines (SVM)
#install.packages("e1071")
library("rpart")
svm_model <- svm(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y=as.factor(Weight.1),
kernel ="radial")
summary(svm_model)
=='"`UNIQ--h-1--QINU`"'Appendix==
==='"`UNIQ--h-2--QINU`"'Example 1: Simulation (subject, day, treatment, observation)===
<i>Obs ~ Treatment + Day + Subject(Treatment)+ Day*Subject(Treatment)+ ε.</i>
This model is accounts for:<br>
Response = Obs
Fixed effects:<br>
Treatment (fixed)<br>
Day (fixed)<br>
Treatment*Day interaction
Random Effects:<br>
Subject nested within Treatment (random)<br>
Day crossed with "Subject within Treatment" (random)
mydata <- data.frame(
Subject = c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65),
Day = c(rep(c("Day1", "Day3", "Day6"), each=28)),
Treatment = c(rep(c("B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C",
"A", "A", "B", "A", "C", "B", "C", "A", "A"), each = 4)),
Obs = c(6.472687, 7.017110, 6.200715, 6.613928, 6.829968, 7.387583, 7.367293,
8.018853, 7.527408, 6.746739, 7.296910, 6.983360, 6.816621, 6.571689,
5.911261, 6.954988, 7.624122, 7.669865, 7.676225, 7.263593, 7.704737,
7.328716, 7.295610, 5.964180, 6.880814, 6.926342, 6.926342, 7.562293,
6.677607, 7.023526, 6.441864, 7.020875, 7.478931, 7.495336, 7.427709,
7.633020, 7.382091, 7.359731, 7.285889, 7.496863, 6.632403, 6.171196,
6.306012, 7.253833, 7.594852, 6.915225, 7.220147, 7.298227, 7.573612,
7.366550, 7.560513, 7.289078, 7.287802, 7.155336, 7.394452, 7.465383,
6.976048, 7.222966, 6.584153, 7.013223, 7.569905, 7.459185, 7.504068,
7.801867, 7.598728, 7.475841, 7.511873, 7.518384, 6.618589, 5.854754,
6.125749, 6.962720, 7.540600, 7.379861, 7.344189, 7.362815, 7.805802,
7.764172, 7.789844, 7.616437, NA, NA, NA, NA))
install.packages("lme4")
library("lme4", lib.loc="~/R/win-library/3.1")
m1 <- lmer(Obs ~ Treatment * Day + (1 | Subject), mydata)
m1
Linear mixed model fit by REML ['lmerMod']
Formula: Obs ~ Treatment * Day + (1 | Subject)
Data: mydata
REML criterion at convergence: 56.8669
<center> Random Effects
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Groups||Name||Std. Dev.
|-
|Subject||(Intercept)||0.2163
|-
|Residual|| ||0.2602
|}
Number of obs: 80, groups: Subject, 28 </center>
<center> Fixed Effects
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|(Intercept)||TreatmentB||TreatmentC
|-
|7.1827||-0.6129||0.1658
|-
|DayDay3||DayDay6 ||TreatmentB: DayDay3
|-
|0.2446||0.4507||-0.1235
|-
|TreatmentC: DayDay3||TreatmentB: DayDay6||TreatmentC: DayDay6
|-
|-0.2740||-0.3508||-0.3327
|}
</center>
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Index||Subject||Day||Treatment||Obs
|-
|1||13||Day1||B||6.472687
|-
|2||14||Day1||B||7.01711
|-
|3||15||Day1||B||6.200715
|-
|4||16||Day1||B||6.613928
|-
|5||17||Day1||A||6.829968
|-
|6||18||Day1||A||7.387583
|-
|7||19||Day1||A||7.367293
|-
|8||20||Day1||A||8.018853
|-
|9||21||Day1||C||7.527408
|-
|10||22||Day1||C||6.746739
|-
|11||23||Day1||C||7.29691
|-
|12||24||Day1||C||6.98336
|-
|13||29||Day1||B||6.816621
|-
|14||30||Day1||B||6.571689
|-
|15||31||Day1||B||5.911261
|-
|16||32||Day1||B||6.954988
|-
|17||33||Day1||C||7.624122
|-
|18||34||Day1||C||7.669865
|-
|19||35||Day1||C||7.676225
|-
|20||36||Day1||C||7.263593
|-
|21||37||Day1||A||7.704737
|-
|22||38||Day1||A||7.328716
|-
|23||39||Day1||A||7.29561
|-
|24||40||Day1||A||5.96418
|-
|25||62||Day1||A||6.880814
|-
|26||63||Day1||A||6.926342
|-
|27||64||Day1||A||6.926342
|-
|28||65||Day1||A||7.562293
|-
|29||13||Day3||B||6.677607
|-
|30||14||Day3||B||7.023526
|-
|31||15||Day3||B||6.441864
|-
|32||16||Day3||B||7.020875
|-
|33||17||Day3||A||7.478931
|-
|34||18||Day3||A||7.495336
|-
|35||19||Day3||A||7.427709
|-
|36||20||Day3||A||7.63302
|-
|37||21||Day3||C||7.382091
|-
|38||22||Day3||C||7.359731
|-
|39||23||Day3||C||7.285889
|-
|40||24||Day3||C||7.496863
|-
|41||29||Day3||B||6.632403
|-
|42||30||Day3||B||6.171196
|-
|43||31||Day3||B||6.306012
|-
|44||32||Day3||B||7.253833
|-
|45||33||Day3||C||7.594852
|-
|46||34||Day3||C||6.915225
|-
|47||35||Day3||C||7.220147
|-
|48||36||Day3||C||7.298227
|-
|49||37||Day3||A||7.573612
|-
|50||38||Day3||A||7.36655
|-
|51||39||Day3||A||7.560513
|-
|52||40||Day3||A||7.289078
|-
|53||62||Day3||A||7.287802
|-
|54||63||Day3||A||7.155336
|-
|55||64||Day3||A||7.394452
|-
|56||65||Day3||A||7.465383
|-
|57||13||Day6||B||6.976048
|-
|58||14||Day6||B||7.222966
|-
|59||15||Day6||B||6.584153
|-
|60||16||Day6||B||7.013223
|-
|61||17||Day6||A||7.569905
|-
|62||18||Day6||A||7.459185
|-
|63||19||Day6||A||7.504068
|-
|64||20||Day6||A||7.801867
|-
|65||21||Day6||C||7.598728
|-
|66||22||Day6||C||7.475841
|-
|67||23||Day6||C||7.511873
|-
|68||24||Day6||C||7.518384
|-
|69||29||Day6||B||6.618589
|-
|70||30||Day6||B||5.854754
|-
|71||31||Day6||B||6.125749
|-
|72||32||Day6||B||6.96272
|-
|73||33||Day6||C||7.5406
|-
|74||34||Day6||C||7.379861
|-
|75||35||Day6||C||7.344189
|-
|76||36||Day6||C||7.362815
|-
|77||37||Day6||A||7.805802
|-
|78||38||Day6||A||7.764172
|-
|79||39||Day6||A||7.789844
|-
|80||40||Day6||A||7.616437
|-
|81||62||Day6||A||NA
|-
|82||63||Day6||A||NA
|-
|83||64||Day6||A||NA
|-
|84||65||Day6||A||NA
|}
</center>
==='"`UNIQ--h-3--QINU`"'Example 2: Genotype-phenotype===
Save this data file as a tab-separated TXT file:
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Genotype||Race||Subject||Weight
|-
|A||1||1||8
|-
|A||1||2||9
|-
|A||1||3||11
|-
|A||1||4||12
|-
|A||1||5||10
|-
|A||2||1||17
|-
|A||2||2||17
|-
|A||2||3||16
|-
|A||2||4||15
|-
|A||2||5||19
|-
|A||2||6||18
|-
|A||2||7||18
|-
|A||2||8||18
|-
|A||2||9||24
|-
|A||3||1||12
|-
|A||3||2||12
|-
|A||3||3||16
|-
|A||3||4||15
|-
|A||3||5||15
|-
|A||3||6||14
|-
|A||4||1||17
|-
|A||4||2||20
|-
|A||4||3||20
|-
|A||4||4||19
|-
|A||4||5||19
|-
|A||4||6||18
|-
|A||4||7||20
|-
|A||4||8||19
|-
|A||4||9||19
|-
|B||5||1||9
|-
|B||5||2||12
|-
|B||5||3||13
|-
|B||5||4||16
|-
|B||5||5||14
|-
|B||5||6||14
|-
|B||6||1||10
|-
|B||6||2||10
|-
|B||6||3||9
|-
|B||6||4||8
|-
|B||6||5||13
|-
|B||6||6||9
|-
|B||6||7||11
|-
|B||7||1||12
|-
|B||7||2||16
|-
|B||7||3||17
|-
|B||7||4||15
|-
|B||7||5||15
|-
|B||7||6||15
|-
|B||8||1||9
|-
|B||8||2||6
|-
|B||8||3||8
|-
|B||8||4||8
|-
|B||8||5||13
|-
|B||8||6||9
|-
|B||8||7||9
|-
|B||8||8||10
|}
</center>
# data <- read.table('C:\\Users\\Dinov\\Desktop\\data.txt',as.is=T, header=T)
data <- read.table('data.txt',as.is=T, header=T)
names(data)
attach(data)
table(Genotype, Race)
table(Race, Subject)
# for demonstration, construct a balanced data set
# 5 subjects for each race
data_balance <- data[data$\$$Subject <=5,]
# create factors data_balance$\$$g <- as.factor(data_balance$\$$Genotype) data_balance$\$$t <- as.factor(data_balance$\$$Race) data_balance$\$$s <- as.factor(data_balance$\$$Subject)
# fit the ANOVA anova.model <- lm(Weight ~ g+t, data= data_balance) # get the ANOVA table anova(anova.model)
# note that all F tests use MSE, subjects within race as denominator # will need to hand-calculate test for genotypes
# Random effects modeling estimation using REML library(lme4)
lme.model <- lmer(Weight~g+(1|t),data= data_balance) summary(lme.model) anova(lme.model)
# various extractor functions: fixef(lme.model) # estimates of fixed effects vcov(lme.model) # VC matrix for the fixed effects VarCorr(lme.model) # estimated variance(-covariance) for random effects ranef(lme.model) # predictions of random effects coef(lme.model) # fixed effects + pred's of random effects fitted(lme.model) # conditional means for each obs (X bhat + Z uhat) resid(lme.model) # conditional residuals (Y - fitted)
# REML is the default method for estimating variance components. # If want to use ML, can specify that lmer(Weight~g+(1|t),REML=F, data= data_balance)
# Now, to get p-values for the effects, or construct a ci # inference on Subject weight focus is the difference between the two genotypes # which is Cb for C = [0, 1] using the default R parameterization # following assumes lme4 library loaded, data frame is d # uses full data set (unbalanced) # also assumes g and t are factors identifying genotypes and trays
# fit the model data$\$$g <- as.factor(data$\$$Genotype) data$\$$t <- as.factor(data$\$$Race) data$\$$s <- as.factor(data$\$$Subject)
# See R DATA TYPES: http://www.statmethods.net/input/datatypes.html
model.lmer <- lmer(Weight ~ g + (1|t), data= data)
# get a confidence interval for g
# slower, obvious programing
nsim <- 10
gdiff <- rep(NA, nsim)
for (i in 1:nsim) {
data$\$$y <- simulate(model.lmer) # param bootstrap data set
model.lmer.1 <- lmer(unlist(data$\$$y) ~ g + (1|t), data= data)
# we need to turn the list of simulated values (y) into an atomic vector with unlist()
# both data frames and models objects (e.g., produced by lm()) are lists
# http://adv-r.had.co.nz/Data-structures.html
gdiff[i] <- fixef(model.lmer.1)[2] # keep only the est diff
}
quantile(gdiff, c(0.025, 0.975))
# print 95% CI of the coefficient/effect-size of genotype(g)
# print model summary summary(model.lmer)
# two ways to speed up CI construction and increase simulations to 1K # use apply to avoid the for loop # use refit() to avoid the setup time before fitting the LME
yall <- simulate(model.lmer, nsim=1000)
gdiff <- apply(yall, 2, function(y) {fixef(refit(model.lmer, y))[2]})
quantile(gdiff, c(0.025, 0.975))
# To get both the estimate and the SE # which are in the fixed effect table returned by summary() # use the extractor functions fixef() and vcov()
yall <- simulate(model.lmer, nsim=1000)
gdt <- apply(yall, 2, function(y) {
model.lmer.2 <- refit(model.lmer, y);
c(fixef(model.lmer.2)[2], sqrt(vcov(model.lmer.2)[2,2]) )
})
gdt <- t(gdt) # because result of apply is 2 rows, nsim cols
# Using gdt, we can obtain a parametric bootstrap-t interval
# Hypothesis testing: here we need to simulate under H0: g = 0 # To find the t statistic and the F statistic # t statistic is in the @coefs table # F statistic is a value in the anova() output # if more than one test, need to subscript (unless you want all F's) # need to reference as a list to avoid R issues
# Complete model with genotypes model.lmer <- lmer(Weight ~ g + (1|t), data= data)
model.lmer.0 <- lmer(Weight ~ (1|t), data=data)
yall <- simulate(model.lmer.0, nsim=1000)
# NB: simulate data under model.lmer.0
gdt0 <- apply(yall, 2, function(y) {
model.lmer.2 <- refit(model.lmer, y); # but analyze under alt. model (model.lmer)
c(summary(model.lmer.2), anova(model.lmer.2)$\$$'F value' )
})
gdt0 <- t(gdt0)
# obs t = 1.438179, p-value using t statistic:
coef(summary(model.lmer))
mean(abs(gdt0[,1]$\$$sim_1$\$$residuals) >= 1.438179)
# p-value = 0.125
# obs F = 2.0684
anova(model.lmer)
mean(gdt0[,2]$\$$sim_2$\$$residuals >= 2.0684)
# p-value = 0.01785714
Baseball data
http://wiki.socr.umich.edu/index.php/SOCR_Data_MLB_HeightsWeights
data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\01a_data.txt',as.is=T, header=T)
boxplot(Weight ~ Position, data = data, xlab = "Position", ylab = "Weight",
main = "MLB Weight Distribution by Position")
boxplot(Weight ~ Team, data = data, xlab = "Team", ylab = "Weight",
main = "MLB Team Weight Distributions")
References
Bates, D.M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0.
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
Barr, D.J., Levy, R., Scheepers, C., & Tilly, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.
Wike, E.L., & Church, J.D. (1976). Comments on Clark’s “The language-as-fixed-effect fallacy”. Journal of Verbal Learning & Verbal Behavior, 15, 249-255.
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. [1]
....
- SOCR Home page: http://www.socr.umich.edu
Translate this page: