SMHS Linear Modeling - Machine Learning Algorithms
Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.
Questions:
- How can we tie human intuition and computer-generated results together to obtain a reliable, effective, and efficient decision-support system that facilitates forecasting?
- Niels Bohr: "It is difficult to make predictions, especially about the future."
- Can we classify the data in an unsupervised manner?
Prediction
For most of the machine learning algorithms (including first-order linear regression), we:
- first generate the model using training data, and then
- predict values for test/new data.
Predictions are made using the R predict function (type ?predict.name, where name is the function name corresponding to the algorithm). The first argument of predict is typically the variable storing the fitted model, and the second argument is a matrix or data frame of test data to which the model is applied. predict can be called in two ways: as predict or as predict.name.
Example:
# mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC', as.is=T, header=T)   # 01a_data.txt
mydata <- read.table('data.txt', as.is=T, header=T)

# (1) First, there are different approaches to split (partition) the data into
# training and testing sets.
## TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(mydata))
## set the seed to make the partition reproducible
set.seed(1234)
train_ind <- sample(seq_len(nrow(mydata)), size = sample_size)
train <- mydata[train_ind, ]

# TESTING DATA
test <- mydata[-train_ind, ]

lin.mod <- lm(Weight ~ Height*Team, data=train)
predicted.values <- predict(lin.mod, newdata=test)
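To gauge out-of-sample accuracy, the predictions can be compared against the observed test-set outcomes. A minimal sketch (assuming the lin.mod, test, and predicted.values objects created above):

rmse <- sqrt(mean((test$Weight - predicted.values)^2, na.rm=TRUE))   # root-mean-squared prediction error
cor(test$Weight, predicted.values, use="complete.obs")               # observed-vs-predicted correlation
rmse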
Data Modeling/Training
Logistic Regression:
glm_model <- glm(ifelse(Weight > 200, 1, 0) ~ Height*Team, family=binomial(link="logit"), data=train)
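The fitted logistic model can be applied to the test set in the same way. A minimal sketch (assuming the glm_model and test objects above); type="response" asks predict.glm for predicted probabilities rather than log-odds:

predicted.prob <- predict(glm_model, newdata=test, type="response")        # estimated P(Weight > 200)
predicted.class <- ifelse(predicted.prob > 0.5, 1, 0)                      # classify by thresholding at 0.5
table(predicted=predicted.class, actual=ifelse(test$Weight > 200, 1, 0))   # confusion matrix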
K-Means Clustering
# name the columns so the matrices convert cleanly to data frames below
train.1 <- cbind(Height=train$Height, Weight=train$Weight, Age=train$Age)
test.1 <- cbind(Height=test$Height, Weight=test$Weight, Age=test$Age)
Weight.1 <- ifelse(train$Weight > 200, 1, 0)
head(train.1)
kmeans_model <- kmeans(train.1, 3)
plot(train.1, col = kmeans_model$cluster)
points(kmeans_model$centers, col = 1:3, pch = 8, cex = 2)

<b>K-Nearest Neighbor Classification</b>

# install.packages("class")
library("class")
knn_model <- knn(train=train.1, test=test.1, cl=as.factor(Weight.1), k=5)
plot(knn_model)
summary(knn_model)

<b>Naïve Bayes Classifier</b>

install.packages("e1071")
library("e1071")
# naiveBayes is a classifier, so the response must be categorical (the 0-1 labels)
nbc_model <- naiveBayes(as.factor(Weight.1) ~ Height + Age, data=as.data.frame(train.1))

<b>Decision Trees (CART)</b>

# install.packages("rpart")
library("rpart")
# method="class" requires a factor response, so classify the 0-1 labels
cart_model <- rpart(as.factor(Weight.1) ~ Height + Age, data=as.data.frame(train.1), method="class")
plot(cart_model)
text(cart_model)

<b>AdaBoost</b>

install.packages("ada")
# x is the matrix of features, and y is a vector of 0-1 class labels
# (note: Weight.1 is derived from Weight, so including Weight as a feature makes this a toy illustration)
library("ada")
boost_model <- ada(x= cbind(train$Height, train$Weight, train$Age), y= Weight.1)
plot(boost_model)
boost_model
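Before moving on, the test-set performance of the classifiers above can be summarized with a confusion matrix. A minimal sketch for the k-nearest-neighbor model (assuming the knn_model and test objects created above; Weight.1.test is an illustrative name for the true test labels):

Weight.1.test <- ifelse(test$Weight > 200, 1, 0)               # true 0-1 labels for the test set
conf.mat <- table(predicted=knn_model, actual=Weight.1.test)
conf.mat
sum(diag(conf.mat)) / sum(conf.mat)                            # overall classification accuracy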
Support Vector Machines (SVM)
#install.packages("e1071") library("rpart") svm_model <- svm(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y=as.factor(Weight.1), kernel ="radial") summary(svm_model) =='"`UNIQ--h-1--QINU`"'Appendix== ==='"`UNIQ--h-2--QINU`"'Example 1: Simulation (subject, day, treatment, observation)=== <i>Obs ~ Treatment + Day + Subject(Treatment)+ Day*Subject(Treatment)+ ε.</i> This model is accounts for:<br> Response = Obs Fixed effects:<br> Treatment (fixed)<br> Day (fixed)<br> Treatment*Day interaction Random Effects:<br> Subject nested within Treatment (random)<br> Day crossed with "Subject within Treatment" (random) mydata <- data.frame( Subject = c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65), Day = c(rep(c("Day1", "Day3", "Day6"), each=28)), Treatment = c(rep(c("B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C", "A", "A"), each = 4)), Obs = c(6.472687, 7.017110, 6.200715, 6.613928, 6.829968, 7.387583, 7.367293, 8.018853, 7.527408, 6.746739, 7.296910, 6.983360, 6.816621, 6.571689, 5.911261, 6.954988, 7.624122, 7.669865, 7.676225, 7.263593, 7.704737, 7.328716, 7.295610, 5.964180, 6.880814, 6.926342, 6.926342, 7.562293, 6.677607, 7.023526, 6.441864, 7.020875, 7.478931, 7.495336, 7.427709, 7.633020, 7.382091, 7.359731, 7.285889, 7.496863, 6.632403, 6.171196, 6.306012, 7.253833, 7.594852, 6.915225, 7.220147, 7.298227, 7.573612, 7.366550, 7.560513, 7.289078, 7.287802, 7.155336, 7.394452, 7.465383, 6.976048, 7.222966, 6.584153, 7.013223, 7.569905, 7.459185, 7.504068, 7.801867, 7.598728, 7.475841, 7.511873, 7.518384, 6.618589, 5.854754, 6.125749, 6.962720, 7.540600, 7.379861, 7.344189, 7.362815, 7.805802, 7.764172, 7.789844, 7.616437, NA, NA, NA, NA)) install.packages("lme4") library("lme4", lib.loc="~/R/win-library/3.1") m1 <- lmer(Obs ~ Treatment * Day + (1 | Subject), mydata) m1 Linear mixed model fit by REML ['lmerMod'] Formula: Obs ~ Treatment * Day + (1 | Subject) Data: mydata REML criterion at convergence: 56.8669 <center> Random Effects {| class="wikitable" style="text-align:center; width:35%" border="1" |- |Groups||Name||Std. Dev. 
|- |Subject||(Intercept)||0.2163 |- |Residual|| ||0.2602 |} Number of obs: 80, groups: Subject, 28 </center> <center> Fixed Effects {| class="wikitable" style="text-align:center; width:35%" border="1" |- |(Intercept)||TreatmentB||TreatmentC |- |7.1827||-0.6129||0.1658 |- |DayDay3||DayDay6 ||TreatmentB: DayDay3 |- |0.2446||0.4507||-0.1235 |- |TreatmentC: DayDay3||TreatmentB: DayDay6||TreatmentC: DayDay6 |- |-0.2740||-0.3508||-0.3327 |} </center> <center> {| class="wikitable" style="text-align:center; width:35%" border="1" |- |Index||Subject||Day||Treatment||Obs |- |1||13||Day1||B||6.472687 |- |2||14||Day1||B||7.01711 |- |3||15||Day1||B||6.200715 |- |4||16||Day1||B||6.613928 |- |5||17||Day1||A||6.829968 |- |6||18||Day1||A||7.387583 |- |7||19||Day1||A||7.367293 |- |8||20||Day1||A||8.018853 |- |9||21||Day1||C||7.527408 |- |10||22||Day1||C||6.746739 |- |11||23||Day1||C||7.29691 |- |12||24||Day1||C||6.98336 |- |13||29||Day1||B||6.816621 |- |14||30||Day1||B||6.571689 |- |15||31||Day1||B||5.911261 |- |16||32||Day1||B||6.954988 |- |17||33||Day1||C||7.624122 |- |18||34||Day1||C||7.669865 |- |19||35||Day1||C||7.676225 |- |20||36||Day1||C||7.263593 |- |21||37||Day1||A||7.704737 |- |22||38||Day1||A||7.328716 |- |23||39||Day1||A||7.29561 |- |24||40||Day1||A||5.96418 |- |25||62||Day1||A||6.880814 |- |26||63||Day1||A||6.926342 |- |27||64||Day1||A||6.926342 |- |28||65||Day1||A||7.562293 |- |29||13||Day3||B||6.677607 |- |30||14||Day3||B||7.023526 |- |31||15||Day3||B||6.441864 |- |32||16||Day3||B||7.020875 |- |33||17||Day3||A||7.478931 |- |34||18||Day3||A||7.495336 |- |35||19||Day3||A||7.427709 |- |36||20||Day3||A||7.63302 |- |37||21||Day3||C||7.382091 |- |38||22||Day3||C||7.359731 |- |39||23||Day3||C||7.285889 |- |40||24||Day3||C||7.496863 |- |41||29||Day3||B||6.632403 |- |42||30||Day3||B||6.171196 |- |43||31||Day3||B||6.306012 |- |44||32||Day3||B||7.253833 |- |45||33||Day3||C||7.594852 |- |46||34||Day3||C||6.915225 |- |47||35||Day3||C||7.220147 |- |48||36||Day3||C||7.298227 |- |49||37||Day3||A||7.573612 |- |50||38||Day3||A||7.36655 |- |51||39||Day3||A||7.560513 |- |52||40||Day3||A||7.289078 |- |53||62||Day3||A||7.287802 |- |54||63||Day3||A||7.155336 |- |55||64||Day3||A||7.394452 |- |56||65||Day3||A||7.465383 |- |57||13||Day6||B||6.976048 |- |58||14||Day6||B||7.222966 |- |59||15||Day6||B||6.584153 |- |60||16||Day6||B||7.013223 |- |61||17||Day6||A||7.569905 |- |62||18||Day6||A||7.459185 |- |63||19||Day6||A||7.504068 |- |64||20||Day6||A||7.801867 |- |65||21||Day6||C||7.598728 |- |66||22||Day6||C||7.475841 |- |67||23||Day6||C||7.511873 |- |68||24||Day6||C||7.518384 |- |69||29||Day6||B||6.618589 |- |70||30||Day6||B||5.854754 |- |71||31||Day6||B||6.125749 |- |72||32||Day6||B||6.96272 |- |73||33||Day6||C||7.5406 |- |74||34||Day6||C||7.379861 |- |75||35||Day6||C||7.344189 |- |76||36||Day6||C||7.362815 |- |77||37||Day6||A||7.805802 |- |78||38||Day6||A||7.764172 |- |79||39||Day6||A||7.789844 |- |80||40||Day6||A||7.616437 |- |81||62||Day6||A||NA |- |82||63||Day6||A||NA |- |83||64||Day6||A||NA |- |84||65||Day6||A||NA |} </center> ==='"`UNIQ--h-3--QINU`"'Example 2: Genotype-phenotype=== Save this data file as a tab-separated TXT file: <center> {| class="wikitable" style="text-align:center; width:35%" border="1" |- |Genotype||Race||Subject||Weight |- |A||1||1||8 |- |A||1||2||9 |- |A||1||3||11 |- |A||1||4||12 |- |A||1||5||10 |- |A||2||1||17 |- |A||2||2||17 |- |A||2||3||16 |- |A||2||4||15 |- |A||2||5||19 |- |A||2||6||18 |- |A||2||7||18 |- |A||2||8||18 |- |A||2||9||24 |- |A||3||1||12 |- |A||3||2||12 |- |A||3||3||16 |- 
|A||3||4||15 |- |A||3||5||15 |- |A||3||6||14 |- |A||4||1||17 |- |A||4||2||20 |- |A||4||3||20 |- |A||4||4||19 |- |A||4||5||19 |- |A||4||6||18 |- |A||4||7||20 |- |A||4||8||19 |- |A||4||9||19 |- |B||5||1||9 |- |B||5||2||12 |- |B||5||3||13 |- |B||5||4||16 |- |B||5||5||14 |- |B||5||6||14 |- |B||6||1||10 |- |B||6||2||10 |- |B||6||3||9 |- |B||6||4||8 |- |B||6||5||13 |- |B||6||6||9 |- |B||6||7||11 |- |B||7||1||12 |- |B||7||2||16 |- |B||7||3||17 |- |B||7||4||15 |- |B||7||5||15 |- |B||7||6||15 |- |B||8||1||9 |- |B||8||2||6 |- |B||8||3||8 |- |B||8||4||8 |- |B||8||5||13 |- |B||8||6||9 |- |B||8||7||9 |- |B||8||8||10 |} </center> # data <- read.table('C:\\Users\\Dinov\\Desktop\\data.txt',as.is=T, header=T) data <- read.table('data.txt',as.is=T, header=T) names(data) attach(data) table(Genotype, Race) table(Race, Subject) # for demonstration, construct a balanced data set # 5 subjects for each race data_balance <- data[data$\$$Subject <=5,]
# create factors
data_balance$g <- as.factor(data_balance$Genotype)
data_balance$t <- as.factor(data_balance$Race)
data_balance$s <- as.factor(data_balance$Subject)
# fit the ANOVA
anova.model <- lm(Weight ~ g + t, data= data_balance)
# get the ANOVA table
anova(anova.model)
# note that all F tests use MSE (subjects within race) as the denominator;
# we will need to hand-calculate the test for genotypes (see the sketch below)
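A minimal sketch of that hand calculation (assuming the anova.model fit above, where the sequential t line captures race-within-genotype variation): the genotype mean square is tested against the race-within-genotype mean square instead of the MSE.

aov.tab <- anova(anova.model)                               # sequential ANOVA table: g, t, Residuals
F.g <- aov.tab["g", "Mean Sq"] / aov.tab["t", "Mean Sq"]    # MS(genotype) / MS(race within genotype)
p.g <- pf(F.g, aov.tab["g", "Df"], aov.tab["t", "Df"], lower.tail=FALSE)
c(F=F.g, p.value=p.g)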
# Random effects modeling estimation using REML
library(lme4)
lme.model <- lmer(Weight ~ g + (1|t), data= data_balance)
summary(lme.model)
anova(lme.model)
# various extractor functions:
fixef(lme.model)    # estimates of fixed effects
vcov(lme.model)     # VC matrix for the fixed effects
VarCorr(lme.model)  # estimated variance(-covariance) for random effects
ranef(lme.model)    # predictions of random effects
coef(lme.model)     # fixed effects + predictions of random effects
fitted(lme.model)   # conditional means for each obs (X bhat + Z uhat)
resid(lme.model)    # conditional residuals (Y - fitted)
# REML is the default method for estimating variance components.
# If you want to use ML instead, specify REML=FALSE:
lmer(Weight ~ g + (1|t), REML=FALSE, data= data_balance)
# Now, to get p-values for the effects, or to construct a CI:
# inference on subject weight focuses on the difference between the two genotypes,
# which is Cb for C = [0, 1] using the default R parameterization.
# The following assumes the lme4 library is loaded and the data frame is data;
# it uses the full (unbalanced) data set,
# and assumes g and t are factors identifying genotypes and races.
# fit the model
data$g <- as.factor(data$Genotype)
data$t <- as.factor(data$Race)
data$s <- as.factor(data$Subject)
# See R DATA TYPES: http://www.statmethods.net/input/datatypes.html
model.lmer <- lmer(Weight ~ g + (1|t), data= data)

# get a confidence interval for g
# slower, obvious programming
nsim <- 10
gdiff <- rep(NA, nsim)
for (i in 1:nsim) {
  data$y <- simulate(model.lmer)   # param bootstrap data set
  model.lmer.1 <- lmer(unlist(data$y) ~ g + (1|t), data= data)
  # we need to turn the list of simulated values (y) into an atomic vector with unlist()
  # both data frames and model objects (e.g., produced by lm()) are lists
  # http://adv-r.had.co.nz/Data-structures.html
  gdiff[i] <- fixef(model.lmer.1)[2]   # keep only the est diff
}
quantile(gdiff, c(0.025, 0.975))   # print 95% CI of the coefficient/effect-size of genotype (g)
# print model summary
summary(model.lmer)
# two ways to speed up CI construction and increase the number of simulations to 1,000:
# use apply() to avoid the for loop
# use refit() to avoid the setup time before fitting the LME
yall <- simulate(model.lmer, nsim=1000)
gdiff <- apply(yall, 2, function(y) { fixef(refit(model.lmer, y))[2] })
quantile(gdiff, c(0.025, 0.975))
# To get both the estimate and the SE
# (which appear in the fixed-effect table returned by summary()),
# use the extractor functions fixef() and vcov():
yall <- simulate(model.lmer, nsim=1000)
gdt <- apply(yall, 2, function(y) {
  model.lmer.2 <- refit(model.lmer, y)
  c(fixef(model.lmer.2)[2], sqrt(vcov(model.lmer.2)[2,2]))
})
gdt <- t(gdt)   # because the result of apply is 2 rows, nsim cols
# Using gdt, we can obtain a parametric bootstrap-t interval
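One way to do this is sketched below, under the assumption that column 1 of gdt holds the bootstrap estimates and column 2 their standard errors:

est <- fixef(model.lmer)[2]                    # observed genotype effect
se <- sqrt(vcov(model.lmer)[2, 2])             # its model-based standard error
t.star <- (gdt[, 1] - est) / gdt[, 2]          # studentized bootstrap statistics
est - quantile(t.star, c(0.975, 0.025)) * se   # bootstrap-t 95% CI (note the reversed quantiles)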
# Hypothesis testing: here we need to simulate under H0: g = 0.
# To find the t statistic and the F statistic:
# the t statistic is in the coefficient table (coef(summary(model)) in current lme4; the old @coefs slot),
# the F statistic is a value in the anova() output.
# If there is more than one test, subscript the result (unless you want all F's);
# reference it as a list to avoid R issues.
# Complete model with genotypes
model.lmer <- lmer(Weight ~ g + (1|t), data= data)
# Null model without the genotype effect
model.lmer.0 <- lmer(Weight ~ (1|t), data=data)
yall <- simulate(model.lmer.0, nsim=1000)   # NB: simulate data under model.lmer.0
gdt0 <- apply(yall, 2, function(y) {
  model.lmer.2 <- refit(model.lmer, y)      # but analyze under the alternative model (model.lmer)
  c(coef(summary(model.lmer.2))[2, "t value"], anova(model.lmer.2)$'F value'[1])
})
gdt0 <- t(gdt0)   # nsim rows: column 1 = t statistics, column 2 = F statistics

# observed t = 1.438179; p-value using the t statistic:
coef(summary(model.lmer))
mean(abs(gdt0[, 1]) >= 1.438179)   # p-value = 0.125

# observed F = 2.0684
anova(model.lmer)
mean(gdt0[, 2] >= 2.0684)   # p-value = 0.01785714
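A common alternative to these simulation-based p-values is a likelihood-ratio test of the two nested models; anova() on a pair of lmer fits refits both with ML and reports a chi-square statistic (assuming model.lmer.0 and model.lmer from above):

anova(model.lmer.0, model.lmer)   # LRT for the genotype effect; both models are refit with ML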
Baseball data
http://wiki.socr.umich.edu/index.php/SOCR_Data_MLB_HeightsWeights
data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\01a_data.txt', as.is=T, header=T)
boxplot(Weight ~ Position, data = data, xlab = "Position", ylab = "Weight", main = "MLB Weight Distribution by Position")
boxplot(Weight ~ Team, data = data, xlab = "Team", ylab = "Weight", main = "MLB Team Weight Distributions")
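The position differences suggested by the first boxplot can also be tested formally. A minimal sketch (assuming the MLB data frame loaded above):

pos.model <- lm(Weight ~ Position, data=data)
anova(pos.model)   # one-way ANOVA F test for mean-weight differences across positions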
References
Bates, D.M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0.
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
Barr, D.J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.
Wike, E.L., & Church, J.D. (1976). Comments on Clark’s “The language-as-fixed-effect fallacy”. Journal of Verbal Learning & Verbal Behavior, 15, 249-255.
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. http://arxiv.org/pdf/1308.5499.pdf
- SOCR Home page: http://www.socr.umich.edu