Difference between revisions of "SMHS LinearModeling MachineLearning"
(→References) |
|||
(16 intermediate revisions by 2 users not shown) | |||
Line 3: | Line 3: | ||
Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression. | Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression. | ||
+ | <b>Questions:</b> | ||
+ | *How can we tie human intuition and computer-generated results to obtain reliable, effective, and efficient decision-support system (that facilitates, forecasting)? | ||
+ | *Niels Born – “It is difficult to make predictions, especially about the future” | ||
+ | *Can we unsupervisely classify the data? | ||
+ | <b>Prediction</b> | ||
− | + | For most of the machine learning algorithms (including first-order linear regression), we: | |
+ | *first generate the model using training data, and then | ||
+ | *predict values for test/new data. | ||
+ | Predictions are made using the R <b>predict</b> function. (type <b>?predict.name</b>), where <b>name</b> is the function-name corresponding to the algorithm. The first argument of predict often represents the variable storing the model, and the second argument is a matrix or data frame of test data that the model needs to be applied to. Calling predict can be done in 2 ways: type <b>predict</b> or type of <b>predict.name.</b> | ||
+ | |||
+ | <b>Example:</b> | ||
+ | |||
+ | #mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T) # 01a_data.txt | ||
+ | # mydata <- read.table('data.txt',as.is=T, header=T) | ||
+ | |||
+ | # (1) First, there are different approaches to split the data (partition the data) into | ||
+ | # training and testing sets. | ||
+ | ## TRAINING: 75% of the sample size | ||
+ | sample_size <- floor(0.75 * nrow(mydata)) | ||
+ | ## set the seed to make your partition reproductible | ||
+ | set.seed(1234) | ||
+ | train_ind <- sample(seq_len(nrow(mydata)), size = sample_size) | ||
+ | train <- mydata[train_ind, ] | ||
+ | |||
+ | # TESTING DATA | ||
+ | test <- mydata[-train_ind, ] | ||
+ | |||
+ | lin.mod <- lm(Weight ~ Height*Team, <b><u>data=train</u></b>) | ||
+ | predicted.values <- predict(lin.mod, <b><u>newdata=test</u></b> | ||
+ | |||
+ | <b>Data Modeling/Training</b> | ||
+ | |||
+ | <b>Logistic Regression:</b> | ||
+ | glm_model <-glm(ifelse(Weight > 200,1,0) ~ Height*Team, family=binomial(link="logit"), <u><b>data=train</b></u>) | ||
+ | |||
+ | <b>K-Means Clustering</b> | ||
+ | train.1 <- cbind(train$\$$Height, train$\$$Weight, train$\$$Age) | ||
+ | test.1 <- cbind(test$\$$Height, test$\$$Weight, test$\$$Age) | ||
+ | Weight.1 <- ifelse(train$\$$Weight > 200,1,0) | ||
+ | |||
+ | head(train.1) | ||
+ | kmeans_model <- kmeans(<u><b>train.1</b></u>, 3) | ||
+ | plot(train.1, col = kmeans_model$\$$cluster) | ||
+ | points(kmeans_model$\$$centers, col = 1:2, pch = 8, cex = 2) | ||
+ | |||
+ | <b>K-Nearest Neighbor Classification</b> | ||
+ | # install.packages("class") | ||
+ | library("class") | ||
+ | knn_model <- knn(train=train.1, test=test.1, cl=as.factor(Weight.1), k=5) | ||
+ | plot(knn_model) | ||
+ | summary(knn_model) | ||
+ | |||
+ | <b>Naïve Bayes Classifier</b> | ||
+ | install.packages("e1071") | ||
+ | library("e1071") | ||
+ | nbc_model <- naiveBayes(Weight ~ Height*Age, data=train.1) | ||
+ | |||
+ | <b>Decision Trees (CART)</b> | ||
+ | #install.packages("e1071") | ||
+ | library("rpart") | ||
+ | cart_model <- rpart(Weight ~ Height+Age, data= as.data.frame(train.1), method="class") | ||
+ | plot(cart_model) | ||
+ | text(cart_model) | ||
+ | |||
+ | <b>AdaBoost</b> | ||
+ | install.packages("ada") | ||
+ | # X be the matrix of features, and labels be a vector of 0-1 class labels. | ||
+ | library("ada") | ||
+ | boost_model <- ada(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y= Weight.1) | ||
+ | plot(boost_model) | ||
+ | boost_model | ||
+ | |||
+ | <b>Support Vector Machines (SVM)</b> | ||
+ | #install.packages("e1071") | ||
+ | library("rpart") | ||
+ | svm_model <- svm(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y=as.factor(Weight.1), | ||
+ | kernel ="radial") | ||
+ | summary(svm_model) | ||
+ | |||
+ | ==Appendix== | ||
+ | |||
+ | ===Example 1: Simulation (subject, day, treatment, observation)=== | ||
+ | |||
+ | <i>Obs ~ Treatment + Day + Subject(Treatment)+ Day*Subject(Treatment)+ ε.</i> | ||
+ | |||
+ | This model is accounts for:<BR> | ||
+ | Response = Obs | ||
+ | |||
+ | Fixed effects:<BR> | ||
+ | Treatment (fixed)<BR> | ||
+ | Day (fixed)<BR> | ||
+ | Treatment*Day interaction | ||
+ | |||
+ | Random Effects:<BR> | ||
+ | Subject nested within Treatment (random)<BR> | ||
+ | Day crossed with "Subject within Treatment" (random) | ||
+ | |||
+ | mydata <- data.frame( | ||
+ | Subject = c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, | ||
+ | 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, | ||
+ | 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, | ||
+ | 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, | ||
+ | 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65), | ||
+ | Day = c(rep(c("Day1", "Day3", "Day6"), each=28)), | ||
+ | Treatment = c(rep(c("B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C", | ||
+ | "A", "A", "B", "A", "C", "B", "C", "A", "A"), each = 4)), | ||
+ | Obs = c(6.472687, 7.017110, 6.200715, 6.613928, 6.829968, 7.387583, 7.367293, | ||
+ | 8.018853, 7.527408, 6.746739, 7.296910, 6.983360, 6.816621, 6.571689, | ||
+ | 5.911261, 6.954988, 7.624122, 7.669865, 7.676225, 7.263593, 7.704737, | ||
+ | 7.328716, 7.295610, 5.964180, 6.880814, 6.926342, 6.926342, 7.562293, | ||
+ | 6.677607, 7.023526, 6.441864, 7.020875, 7.478931, 7.495336, 7.427709, | ||
+ | 7.633020, 7.382091, 7.359731, 7.285889, 7.496863, 6.632403, 6.171196, | ||
+ | 6.306012, 7.253833, 7.594852, 6.915225, 7.220147, 7.298227, 7.573612, | ||
+ | 7.366550, 7.560513, 7.289078, 7.287802, 7.155336, 7.394452, 7.465383, | ||
+ | 6.976048, 7.222966, 6.584153, 7.013223, 7.569905, 7.459185, 7.504068, | ||
+ | 7.801867, 7.598728, 7.475841, 7.511873, 7.518384, 6.618589, 5.854754, | ||
+ | 6.125749, 6.962720, 7.540600, 7.379861, 7.344189, 7.362815, 7.805802, | ||
+ | 7.764172, 7.789844, 7.616437, NA, NA, NA, NA)) | ||
+ | |||
+ | install.packages("lme4") | ||
+ | library("lme4", lib.loc="~/R/win-library/3.1") | ||
+ | m1 <- lmer(Obs ~ Treatment * Day + (1 | Subject), mydata) | ||
+ | m1 | ||
+ | |||
+ | Linear mixed model fit by REML ['lmerMod'] | ||
+ | |||
+ | Formula: Obs ~ Treatment * Day + (1 | Subject) | ||
+ | |||
+ | Data: mydata | ||
+ | |||
+ | REML criterion at convergence: 56.8669 | ||
+ | |||
+ | <center> Random Effects | ||
+ | {| class="wikitable" style="text-align:center; width:35%" border="1" | ||
+ | |- | ||
+ | |Groups||Name||Std. Dev. | ||
+ | |- | ||
+ | |Subject||(Intercept)||0.2163 | ||
+ | |- | ||
+ | |Residual|| ||0.2602 | ||
+ | |||
+ | |} | ||
+ | Number of obs: 80, groups: Subject, 28 </center> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | <center> Fixed Effects | ||
+ | {| class="wikitable" style="text-align:center; width:35%" border="1" | ||
+ | |- | ||
+ | |(Intercept)||TreatmentB||TreatmentC | ||
+ | |- | ||
+ | |7.1827||-0.6129||0.1658 | ||
+ | |- | ||
+ | |DayDay3||DayDay6 ||TreatmentB: DayDay3 | ||
+ | |- | ||
+ | |0.2446||0.4507||-0.1235 | ||
+ | |- | ||
+ | |TreatmentC: DayDay3||TreatmentB: DayDay6||TreatmentC: DayDay6 | ||
+ | |- | ||
+ | |-0.2740||-0.3508||-0.3327 | ||
+ | |||
+ | |} | ||
+ | </center> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | <center> | ||
+ | {| class="wikitable" style="text-align:center; width:35%" border="1" | ||
+ | |- | ||
+ | |Index||Subject||Day||Treatment||Obs | ||
+ | |- | ||
+ | |1||13||Day1||B||6.472687 | ||
+ | |- | ||
+ | |2||14||Day1||B||7.01711 | ||
+ | |- | ||
+ | |3||15||Day1||B||6.200715 | ||
+ | |- | ||
+ | |4||16||Day1||B||6.613928 | ||
+ | |- | ||
+ | |5||17||Day1||A||6.829968 | ||
+ | |- | ||
+ | |6||18||Day1||A||7.387583 | ||
+ | |- | ||
+ | |7||19||Day1||A||7.367293 | ||
+ | |- | ||
+ | |8||20||Day1||A||8.018853 | ||
+ | |- | ||
+ | |9||21||Day1||C||7.527408 | ||
+ | |- | ||
+ | |10||22||Day1||C||6.746739 | ||
+ | |- | ||
+ | |11||23||Day1||C||7.29691 | ||
+ | |- | ||
+ | |12||24||Day1||C||6.98336 | ||
+ | |- | ||
+ | |13||29||Day1||B||6.816621 | ||
+ | |- | ||
+ | |14||30||Day1||B||6.571689 | ||
+ | |- | ||
+ | |15||31||Day1||B||5.911261 | ||
+ | |- | ||
+ | |16||32||Day1||B||6.954988 | ||
+ | |- | ||
+ | |17||33||Day1||C||7.624122 | ||
+ | |- | ||
+ | |18||34||Day1||C||7.669865 | ||
+ | |- | ||
+ | |19||35||Day1||C||7.676225 | ||
+ | |- | ||
+ | |20||36||Day1||C||7.263593 | ||
+ | |- | ||
+ | |21||37||Day1||A||7.704737 | ||
+ | |- | ||
+ | |22||38||Day1||A||7.328716 | ||
+ | |- | ||
+ | |23||39||Day1||A||7.29561 | ||
+ | |- | ||
+ | |24||40||Day1||A||5.96418 | ||
+ | |- | ||
+ | |25||62||Day1||A||6.880814 | ||
+ | |- | ||
+ | |26||63||Day1||A||6.926342 | ||
+ | |- | ||
+ | |27||64||Day1||A||6.926342 | ||
+ | |- | ||
+ | |28||65||Day1||A||7.562293 | ||
+ | |- | ||
+ | |29||13||Day3||B||6.677607 | ||
+ | |- | ||
+ | |30||14||Day3||B||7.023526 | ||
+ | |- | ||
+ | |31||15||Day3||B||6.441864 | ||
+ | |- | ||
+ | |32||16||Day3||B||7.020875 | ||
+ | |- | ||
+ | |33||17||Day3||A||7.478931 | ||
+ | |- | ||
+ | |34||18||Day3||A||7.495336 | ||
+ | |- | ||
+ | |35||19||Day3||A||7.427709 | ||
+ | |- | ||
+ | |36||20||Day3||A||7.63302 | ||
+ | |- | ||
+ | |37||21||Day3||C||7.382091 | ||
+ | |- | ||
+ | |38||22||Day3||C||7.359731 | ||
+ | |- | ||
+ | |39||23||Day3||C||7.285889 | ||
+ | |- | ||
+ | |40||24||Day3||C||7.496863 | ||
+ | |- | ||
+ | |41||29||Day3||B||6.632403 | ||
+ | |- | ||
+ | |42||30||Day3||B||6.171196 | ||
+ | |- | ||
+ | |43||31||Day3||B||6.306012 | ||
+ | |- | ||
+ | |44||32||Day3||B||7.253833 | ||
+ | |- | ||
+ | |45||33||Day3||C||7.594852 | ||
+ | |- | ||
+ | |46||34||Day3||C||6.915225 | ||
+ | |- | ||
+ | |47||35||Day3||C||7.220147 | ||
+ | |- | ||
+ | |48||36||Day3||C||7.298227 | ||
+ | |- | ||
+ | |49||37||Day3||A||7.573612 | ||
+ | |- | ||
+ | |50||38||Day3||A||7.36655 | ||
+ | |- | ||
+ | |51||39||Day3||A||7.560513 | ||
+ | |- | ||
+ | |52||40||Day3||A||7.289078 | ||
+ | |- | ||
+ | |53||62||Day3||A||7.287802 | ||
+ | |- | ||
+ | |54||63||Day3||A||7.155336 | ||
+ | |- | ||
+ | |55||64||Day3||A||7.394452 | ||
+ | |- | ||
+ | |56||65||Day3||A||7.465383 | ||
+ | |- | ||
+ | |57||13||Day6||B||6.976048 | ||
+ | |- | ||
+ | |58||14||Day6||B||7.222966 | ||
+ | |- | ||
+ | |59||15||Day6||B||6.584153 | ||
+ | |- | ||
+ | |60||16||Day6||B||7.013223 | ||
+ | |- | ||
+ | |61||17||Day6||A||7.569905 | ||
+ | |- | ||
+ | |62||18||Day6||A||7.459185 | ||
+ | |- | ||
+ | |63||19||Day6||A||7.504068 | ||
+ | |- | ||
+ | |64||20||Day6||A||7.801867 | ||
+ | |- | ||
+ | |65||21||Day6||C||7.598728 | ||
+ | |- | ||
+ | |66||22||Day6||C||7.475841 | ||
+ | |- | ||
+ | |67||23||Day6||C||7.511873 | ||
+ | |- | ||
+ | |68||24||Day6||C||7.518384 | ||
+ | |- | ||
+ | |69||29||Day6||B||6.618589 | ||
+ | |- | ||
+ | |70||30||Day6||B||5.854754 | ||
+ | |- | ||
+ | |71||31||Day6||B||6.125749 | ||
+ | |- | ||
+ | |72||32||Day6||B||6.96272 | ||
+ | |- | ||
+ | |73||33||Day6||C||7.5406 | ||
+ | |- | ||
+ | |74||34||Day6||C||7.379861 | ||
+ | |- | ||
+ | |75||35||Day6||C||7.344189 | ||
+ | |- | ||
+ | |76||36||Day6||C||7.362815 | ||
+ | |- | ||
+ | |77||37||Day6||A||7.805802 | ||
+ | |- | ||
+ | |78||38||Day6||A||7.764172 | ||
+ | |- | ||
+ | |79||39||Day6||A||7.789844 | ||
+ | |- | ||
+ | |80||40||Day6||A||7.616437 | ||
+ | |- | ||
+ | |81||62||Day6||A||NA | ||
+ | |- | ||
+ | |82||63||Day6||A||NA | ||
+ | |- | ||
+ | |83||64||Day6||A||NA | ||
+ | |- | ||
+ | |84||65||Day6||A||NA | ||
+ | |||
+ | |} | ||
+ | </center> | ||
+ | |||
+ | ===Example 2: Genotype-phenotype=== | ||
+ | |||
+ | Save this data file as a tab-separated TXT file: | ||
+ | |||
+ | <center> | ||
+ | {| class="wikitable" style="text-align:center; width:35%" border="1" | ||
+ | |- | ||
+ | |Genotype||Race||Subject||Weight | ||
+ | |- | ||
+ | |A||1||1||8 | ||
+ | |- | ||
+ | |A||1||2||9 | ||
+ | |- | ||
+ | |A||1||3||11 | ||
+ | |- | ||
+ | |A||1||4||12 | ||
+ | |- | ||
+ | |A||1||5||10 | ||
+ | |- | ||
+ | |A||2||1||17 | ||
+ | |- | ||
+ | |A||2||2||17 | ||
+ | |- | ||
+ | |A||2||3||16 | ||
+ | |- | ||
+ | |A||2||4||15 | ||
+ | |- | ||
+ | |A||2||5||19 | ||
+ | |- | ||
+ | |A||2||6||18 | ||
+ | |- | ||
+ | |A||2||7||18 | ||
+ | |- | ||
+ | |A||2||8||18 | ||
+ | |- | ||
+ | |A||2||9||24 | ||
+ | |- | ||
+ | |A||3||1||12 | ||
+ | |- | ||
+ | |A||3||2||12 | ||
+ | |- | ||
+ | |A||3||3||16 | ||
+ | |- | ||
+ | |A||3||4||15 | ||
+ | |- | ||
+ | |A||3||5||15 | ||
+ | |- | ||
+ | |A||3||6||14 | ||
+ | |- | ||
+ | |A||4||1||17 | ||
+ | |- | ||
+ | |A||4||2||20 | ||
+ | |- | ||
+ | |A||4||3||20 | ||
+ | |- | ||
+ | |A||4||4||19 | ||
+ | |- | ||
+ | |A||4||5||19 | ||
+ | |- | ||
+ | |A||4||6||18 | ||
+ | |- | ||
+ | |A||4||7||20 | ||
+ | |- | ||
+ | |A||4||8||19 | ||
+ | |- | ||
+ | |A||4||9||19 | ||
+ | |- | ||
+ | |B||5||1||9 | ||
+ | |- | ||
+ | |B||5||2||12 | ||
+ | |- | ||
+ | |B||5||3||13 | ||
+ | |- | ||
+ | |B||5||4||16 | ||
+ | |- | ||
+ | |B||5||5||14 | ||
+ | |- | ||
+ | |B||5||6||14 | ||
+ | |- | ||
+ | |B||6||1||10 | ||
+ | |- | ||
+ | |B||6||2||10 | ||
+ | |- | ||
+ | |B||6||3||9 | ||
+ | |- | ||
+ | |B||6||4||8 | ||
+ | |- | ||
+ | |B||6||5||13 | ||
+ | |- | ||
+ | |B||6||6||9 | ||
+ | |- | ||
+ | |B||6||7||11 | ||
+ | |- | ||
+ | |B||7||1||12 | ||
+ | |- | ||
+ | |B||7||2||16 | ||
+ | |- | ||
+ | |B||7||3||17 | ||
+ | |- | ||
+ | |B||7||4||15 | ||
+ | |- | ||
+ | |B||7||5||15 | ||
+ | |- | ||
+ | |B||7||6||15 | ||
+ | |- | ||
+ | |B||8||1||9 | ||
+ | |- | ||
+ | |B||8||2||6 | ||
+ | |- | ||
+ | |B||8||3||8 | ||
+ | |- | ||
+ | |B||8||4||8 | ||
+ | |- | ||
+ | |B||8||5||13 | ||
+ | |- | ||
+ | |B||8||6||9 | ||
+ | |- | ||
+ | |B||8||7||9 | ||
+ | |- | ||
+ | |B||8||8||10 | ||
+ | |||
+ | |} | ||
+ | </center> | ||
+ | |||
+ | # data <- read.table('C:\\Users\\Dinov\\Desktop\\data.txt',as.is=T, header=T) | ||
+ | data <- read.table('data.txt',as.is=T, header=T) | ||
+ | names(data) | ||
+ | attach(data) | ||
+ | table(Genotype, Race) | ||
+ | table(Race, Subject) | ||
+ | |||
+ | # for demonstration, construct a balanced data set | ||
+ | # 5 subjects for each race | ||
+ | data_balance <- data[data$\$$Subject <=5,] | ||
+ | |||
+ | # create factors | ||
+ | data_balance$\$$g <- as.factor(data_balance$\$$Genotype) | ||
+ | data_balance$\$$t <- as.factor(data_balance$\$$Race) | ||
+ | data_balance$\$$s <- as.factor(data_balance$\$$Subject) | ||
+ | |||
+ | # fit the ANOVA | ||
+ | anova.model <- lm(Weight ~ g+t, data= data_balance) | ||
+ | # get the ANOVA table | ||
+ | anova(anova.model) | ||
+ | |||
+ | # note that all F tests use MSE, subjects within race as denominator | ||
+ | # will need to hand-calculate test for genotypes | ||
+ | |||
+ | # Random effects modeling estimation using REML | ||
+ | library(lme4) | ||
+ | |||
+ | lme.model <- lmer(Weight~g+(1|t),data= data_balance) | ||
+ | summary(lme.model) | ||
+ | anova(lme.model) | ||
+ | |||
+ | # various extractor functions: | ||
+ | fixef(lme.model) # estimates of fixed effects | ||
+ | vcov(lme.model) # VC matrix for the fixed effects | ||
+ | VarCorr(lme.model) # estimated variance(-covariance) for random effects | ||
+ | ranef(lme.model) # predictions of random effects | ||
+ | coef(lme.model) # fixed effects + pred's of random effects | ||
+ | fitted(lme.model) # conditional means for each obs (X bhat + Z uhat) | ||
+ | resid(lme.model) # conditional residuals (Y - fitted) | ||
+ | |||
+ | # REML is the default method for estimating variance components. | ||
+ | # If want to use ML, can specify that | ||
+ | lmer(Weight~g+(1|t),REML=F, data= data_balance) | ||
+ | |||
+ | # Now, to get p-values for the effects, or construct a ci | ||
+ | # inference on Subject weight focus is the difference between the two genotypes | ||
+ | # which is Cb for C = [0, 1] using the default R parameterization | ||
+ | # following assumes lme4 library loaded, data frame is d | ||
+ | # uses full data set (unbalanced) | ||
+ | # also assumes g and t are factors identifying genotypes and trays | ||
+ | |||
+ | # fit the model | ||
+ | data$\$$g <- as.factor(data$\$$Genotype) | ||
+ | data$\$$t <- as.factor(data$\$$Race) | ||
+ | data$\$$s <- as.factor(data$\$$Subject) | ||
+ | |||
+ | # See R DATA TYPES: http://www.statmethods.net/input/datatypes.html | ||
+ | |||
+ | model.lmer <- lmer(Weight ~ g + (1|t), data= data) | ||
+ | # get a confidence interval for g | ||
+ | # slower, obvious programing | ||
+ | nsim <- 10 | ||
+ | gdiff <- rep(NA, nsim) | ||
+ | for (i in 1:nsim) { | ||
+ | data$\$$y <- simulate(model.lmer) # param bootstrap data set | ||
+ | model.lmer.1 <- lmer(unlist(data$\$$y) ~ g + (1|t), data= data) | ||
+ | # we need to turn the list of simulated values (y) into an atomic vector with unlist() | ||
+ | # both data frames and models objects (e.g., produced by lm()) are lists | ||
+ | # http://adv-r.had.co.nz/Data-structures.html | ||
+ | gdiff[i] <- fixef(model.lmer.1)[2] # keep only the est diff | ||
+ | } | ||
+ | quantile(gdiff, c(0.025, 0.975)) | ||
+ | # print 95% CI of the coefficient/effect-size of genotype(g) | ||
+ | |||
+ | # print model summary | ||
+ | summary(model.lmer) | ||
+ | |||
+ | # two ways to speed up CI construction and increase simulations to 1K | ||
+ | # use apply to avoid the for loop | ||
+ | # use refit() to avoid the setup time before fitting the LME | ||
+ | |||
+ | yall <- simulate(model.lmer, nsim=1000) | ||
+ | gdiff <- apply(yall, 2, function(y) {fixef(refit(model.lmer, y))[2]}) | ||
+ | quantile(gdiff, c(0.025, 0.975)) | ||
+ | |||
+ | # To get both the estimate and the SE | ||
+ | # which are in the fixed effect table returned by summary() | ||
+ | # use the extractor functions fixef() and vcov() | ||
+ | |||
+ | yall <- simulate(model.lmer, nsim=1000) | ||
+ | gdt <- apply(yall, 2, function(y) { | ||
+ | model.lmer.2 <- refit(model.lmer, y); | ||
+ | c(fixef(model.lmer.2)[2], sqrt(vcov(model.lmer.2)[2,2]) ) | ||
+ | }) | ||
+ | gdt <- t(gdt) # because result of apply is 2 rows, nsim cols | ||
+ | |||
+ | # Using gdt, we can obtain a parametric bootstrap-t interval | ||
+ | |||
+ | # Hypothesis testing: here we need to simulate under <b><u>H0: g = 0</u></b> | ||
+ | # To find the t statistic and the F statistic | ||
+ | # t statistic is in the @coefs table | ||
+ | # F statistic is a value in the anova() output | ||
+ | # if more than one test, need to subscript (unless you want all F's) | ||
+ | # need to reference as a list to avoid R issues | ||
+ | |||
+ | # Complete model with genotypes | ||
+ | model.lmer <- lmer(Weight ~ g + (1|t), data= data) | ||
+ | |||
+ | model.lmer.0 <- lmer(Weight ~ (1|t), data=data) | ||
+ | yall <- simulate(model.lmer.0, nsim=1000) | ||
+ | # NB: simulate data under model.lmer.0 | ||
+ | gdt0 <- apply(yall, 2, function(y) { | ||
+ | model.lmer.2 <- refit(model.lmer, y); # but analyze under alt. model (model.lmer) | ||
+ | c(summary(model.lmer.2), anova(model.lmer.2)$\$$'F value' ) | ||
+ | }) | ||
+ | gdt0 <- t(gdt0) | ||
+ | |||
+ | # obs t = 1.438179, p-value using t statistic: | ||
+ | coef(summary(model.lmer)) | ||
+ | mean(abs(gdt0[,1]$\$$sim_1$\$$residuals) >= 1.438179) | ||
+ | # p-value = 0.125 | ||
+ | |||
+ | # obs F = 2.0684 | ||
+ | anova(model.lmer) | ||
+ | mean(gdt0[,2]$\$$sim_2$\$$residuals >= 2.0684) | ||
+ | # p-value = 0.01785714 | ||
+ | |||
+ | <b>Baseball data</b> | ||
+ | |||
+ | http://wiki.socr.umich.edu/index.php/SOCR_Data_MLB_HeightsWeights | ||
+ | |||
+ | data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\01a_data.txt',as.is=T, header=T) | ||
+ | boxplot(Weight ~ Position, data = data, xlab = "Position", ylab = "Weight", | ||
+ | main = "MLB Weight Distribution by Position") | ||
+ | boxplot(Weight ~ Team, data = data, xlab = "Team", ylab = "Weight", | ||
+ | main = "MLB Team Weight Distributions") | ||
+ | |||
+ | ===References=== | ||
+ | |||
+ | Bates, D.M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0. | ||
+ | |||
+ | Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press. | ||
+ | |||
+ | Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412. | ||
+ | |||
+ | Barr, D.J., Levy, R., Scheepers, C., & Tilly, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278. | ||
+ | |||
+ | Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135. | ||
+ | |||
+ | Wike, E.L., & Church, J.D. (1976). Comments on Clark’s “The language-as-fixed-effect fallacy”. Journal of Verbal Learning & Verbal Behavior, 15, 249-255. | ||
+ | |||
+ | Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. [http://arxiv.org/pdf/1308.5499.pdf] | ||
Latest revision as of 08:58, 23 May 2016
SMHS Linear Modeling - Machine Learning Algorithms
Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.
Questions:
- How can we tie human intuition and computer-generated results to obtain reliable, effective, and efficient decision-support system (that facilitates, forecasting)?
- Niels Born – “It is difficult to make predictions, especially about the future”
- Can we unsupervisely classify the data?
Prediction
For most of the machine learning algorithms (including first-order linear regression), we:
- first generate the model using training data, and then
- predict values for test/new data.
Predictions are made using the R predict function. (type ?predict.name), where name is the function-name corresponding to the algorithm. The first argument of predict often represents the variable storing the model, and the second argument is a matrix or data frame of test data that the model needs to be applied to. Calling predict can be done in 2 ways: type predict or type of predict.name.
Example:
#mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T) # 01a_data.txt # mydata <- read.table('data.txt',as.is=T, header=T)
# (1) First, there are different approaches to split the data (partition the data) into # training and testing sets. ## TRAINING: 75% of the sample size sample_size <- floor(0.75 * nrow(mydata)) ## set the seed to make your partition reproductible set.seed(1234) train_ind <- sample(seq_len(nrow(mydata)), size = sample_size) train <- mydata[train_ind, ]
# TESTING DATA test <- mydata[-train_ind, ]
lin.mod <- lm(Weight ~ Height*Team, data=train) predicted.values <- predict(lin.mod, newdata=test
Data Modeling/Training
Logistic Regression:
glm_model <-glm(ifelse(Weight > 200,1,0) ~ Height*Team, family=binomial(link="logit"), data=train)
K-Means Clustering
train.1 <- cbind(train$\$$Height, train$\$$Weight, train$\$$Age) test.1 <- cbind(test$\$$Height, test$\$$Weight, test$\$$Age) Weight.1 <- ifelse(train$\$$Weight > 200,1,0) head(train.1) kmeans_model <- kmeans(<u><b>train.1</b></u>, 3) plot(train.1, col = kmeans_model$\$$cluster) points(kmeans_model$\$$centers, col = 1:2, pch = 8, cex = 2) <b>K-Nearest Neighbor Classification</b> # install.packages("class") library("class") knn_model <- knn(train=train.1, test=test.1, cl=as.factor(Weight.1), k=5) plot(knn_model) summary(knn_model) <b>Naïve Bayes Classifier</b> install.packages("e1071") library("e1071") nbc_model <- naiveBayes(Weight ~ Height*Age, data=train.1) <b>Decision Trees (CART)</b> #install.packages("e1071") library("rpart") cart_model <- rpart(Weight ~ Height+Age, data= as.data.frame(train.1), method="class") plot(cart_model) text(cart_model) <b>AdaBoost</b> install.packages("ada") # X be the matrix of features, and labels be a vector of 0-1 class labels. library("ada") boost_model <- ada(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y= Weight.1) plot(boost_model) boost_model
Support Vector Machines (SVM)
#install.packages("e1071") library("rpart") svm_model <- svm(x= cbind(train$\$$Height, train$\$$Weight, train$\$$Age), y=as.factor(Weight.1), kernel ="radial") summary(svm_model) =='"`UNIQ--h-1--QINU`"'Appendix== ==='"`UNIQ--h-2--QINU`"'Example 1: Simulation (subject, day, treatment, observation)=== <i>Obs ~ Treatment + Day + Subject(Treatment)+ Day*Subject(Treatment)+ ε.</i> This model is accounts for:<br> Response = Obs Fixed effects:<br> Treatment (fixed)<br> Day (fixed)<br> Treatment*Day interaction Random Effects:<br> Subject nested within Treatment (random)<br> Day crossed with "Subject within Treatment" (random) mydata <- data.frame( Subject = c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 62, 63, 64, 65), Day = c(rep(c("Day1", "Day3", "Day6"), each=28)), Treatment = c(rep(c("B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C", "A", "A", "B", "A", "C", "B", "C", "A", "A"), each = 4)), Obs = c(6.472687, 7.017110, 6.200715, 6.613928, 6.829968, 7.387583, 7.367293, 8.018853, 7.527408, 6.746739, 7.296910, 6.983360, 6.816621, 6.571689, 5.911261, 6.954988, 7.624122, 7.669865, 7.676225, 7.263593, 7.704737, 7.328716, 7.295610, 5.964180, 6.880814, 6.926342, 6.926342, 7.562293, 6.677607, 7.023526, 6.441864, 7.020875, 7.478931, 7.495336, 7.427709, 7.633020, 7.382091, 7.359731, 7.285889, 7.496863, 6.632403, 6.171196, 6.306012, 7.253833, 7.594852, 6.915225, 7.220147, 7.298227, 7.573612, 7.366550, 7.560513, 7.289078, 7.287802, 7.155336, 7.394452, 7.465383, 6.976048, 7.222966, 6.584153, 7.013223, 7.569905, 7.459185, 7.504068, 7.801867, 7.598728, 7.475841, 7.511873, 7.518384, 6.618589, 5.854754, 6.125749, 6.962720, 7.540600, 7.379861, 7.344189, 7.362815, 7.805802, 7.764172, 7.789844, 7.616437, NA, NA, NA, NA)) install.packages("lme4") library("lme4", lib.loc="~/R/win-library/3.1") m1 <- lmer(Obs ~ Treatment * Day + (1 | Subject), mydata) m1 Linear mixed model fit by REML ['lmerMod'] Formula: Obs ~ Treatment * Day + (1 | Subject) Data: mydata REML criterion at convergence: 56.8669 <center> Random Effects {| class="wikitable" style="text-align:center; width:35%" border="1" |- |Groups||Name||Std. Dev. |- |Subject||(Intercept)||0.2163 |- |Residual|| ||0.2602 |} Number of obs: 80, groups: Subject, 28 </center> <center> Fixed Effects {| class="wikitable" style="text-align:center; width:35%" border="1" |- |(Intercept)||TreatmentB||TreatmentC |- |7.1827||-0.6129||0.1658 |- |DayDay3||DayDay6 ||TreatmentB: DayDay3 |- |0.2446||0.4507||-0.1235 |- |TreatmentC: DayDay3||TreatmentB: DayDay6||TreatmentC: DayDay6 |- |-0.2740||-0.3508||-0.3327 |} </center> <center> {| class="wikitable" style="text-align:center; width:35%" border="1" |- |Index||Subject||Day||Treatment||Obs |- |1||13||Day1||B||6.472687 |- |2||14||Day1||B||7.01711 |- |3||15||Day1||B||6.200715 |- |4||16||Day1||B||6.613928 |- |5||17||Day1||A||6.829968 |- |6||18||Day1||A||7.387583 |- |7||19||Day1||A||7.367293 |- |8||20||Day1||A||8.018853 |- |9||21||Day1||C||7.527408 |- |10||22||Day1||C||6.746739 |- |11||23||Day1||C||7.29691 |- |12||24||Day1||C||6.98336 |- |13||29||Day1||B||6.816621 |- |14||30||Day1||B||6.571689 |- |15||31||Day1||B||5.911261 |- |16||32||Day1||B||6.954988 |- |17||33||Day1||C||7.624122 |- |18||34||Day1||C||7.669865 |- |19||35||Day1||C||7.676225 |- |20||36||Day1||C||7.263593 |- |21||37||Day1||A||7.704737 |- |22||38||Day1||A||7.328716 |- |23||39||Day1||A||7.29561 |- |24||40||Day1||A||5.96418 |- |25||62||Day1||A||6.880814 |- |26||63||Day1||A||6.926342 |- |27||64||Day1||A||6.926342 |- |28||65||Day1||A||7.562293 |- |29||13||Day3||B||6.677607 |- |30||14||Day3||B||7.023526 |- |31||15||Day3||B||6.441864 |- |32||16||Day3||B||7.020875 |- |33||17||Day3||A||7.478931 |- |34||18||Day3||A||7.495336 |- |35||19||Day3||A||7.427709 |- |36||20||Day3||A||7.63302 |- |37||21||Day3||C||7.382091 |- |38||22||Day3||C||7.359731 |- |39||23||Day3||C||7.285889 |- |40||24||Day3||C||7.496863 |- |41||29||Day3||B||6.632403 |- |42||30||Day3||B||6.171196 |- |43||31||Day3||B||6.306012 |- |44||32||Day3||B||7.253833 |- |45||33||Day3||C||7.594852 |- |46||34||Day3||C||6.915225 |- |47||35||Day3||C||7.220147 |- |48||36||Day3||C||7.298227 |- |49||37||Day3||A||7.573612 |- |50||38||Day3||A||7.36655 |- |51||39||Day3||A||7.560513 |- |52||40||Day3||A||7.289078 |- |53||62||Day3||A||7.287802 |- |54||63||Day3||A||7.155336 |- |55||64||Day3||A||7.394452 |- |56||65||Day3||A||7.465383 |- |57||13||Day6||B||6.976048 |- |58||14||Day6||B||7.222966 |- |59||15||Day6||B||6.584153 |- |60||16||Day6||B||7.013223 |- |61||17||Day6||A||7.569905 |- |62||18||Day6||A||7.459185 |- |63||19||Day6||A||7.504068 |- |64||20||Day6||A||7.801867 |- |65||21||Day6||C||7.598728 |- |66||22||Day6||C||7.475841 |- |67||23||Day6||C||7.511873 |- |68||24||Day6||C||7.518384 |- |69||29||Day6||B||6.618589 |- |70||30||Day6||B||5.854754 |- |71||31||Day6||B||6.125749 |- |72||32||Day6||B||6.96272 |- |73||33||Day6||C||7.5406 |- |74||34||Day6||C||7.379861 |- |75||35||Day6||C||7.344189 |- |76||36||Day6||C||7.362815 |- |77||37||Day6||A||7.805802 |- |78||38||Day6||A||7.764172 |- |79||39||Day6||A||7.789844 |- |80||40||Day6||A||7.616437 |- |81||62||Day6||A||NA |- |82||63||Day6||A||NA |- |83||64||Day6||A||NA |- |84||65||Day6||A||NA |} </center> ==='"`UNIQ--h-3--QINU`"'Example 2: Genotype-phenotype=== Save this data file as a tab-separated TXT file: <center> {| class="wikitable" style="text-align:center; width:35%" border="1" |- |Genotype||Race||Subject||Weight |- |A||1||1||8 |- |A||1||2||9 |- |A||1||3||11 |- |A||1||4||12 |- |A||1||5||10 |- |A||2||1||17 |- |A||2||2||17 |- |A||2||3||16 |- |A||2||4||15 |- |A||2||5||19 |- |A||2||6||18 |- |A||2||7||18 |- |A||2||8||18 |- |A||2||9||24 |- |A||3||1||12 |- |A||3||2||12 |- |A||3||3||16 |- |A||3||4||15 |- |A||3||5||15 |- |A||3||6||14 |- |A||4||1||17 |- |A||4||2||20 |- |A||4||3||20 |- |A||4||4||19 |- |A||4||5||19 |- |A||4||6||18 |- |A||4||7||20 |- |A||4||8||19 |- |A||4||9||19 |- |B||5||1||9 |- |B||5||2||12 |- |B||5||3||13 |- |B||5||4||16 |- |B||5||5||14 |- |B||5||6||14 |- |B||6||1||10 |- |B||6||2||10 |- |B||6||3||9 |- |B||6||4||8 |- |B||6||5||13 |- |B||6||6||9 |- |B||6||7||11 |- |B||7||1||12 |- |B||7||2||16 |- |B||7||3||17 |- |B||7||4||15 |- |B||7||5||15 |- |B||7||6||15 |- |B||8||1||9 |- |B||8||2||6 |- |B||8||3||8 |- |B||8||4||8 |- |B||8||5||13 |- |B||8||6||9 |- |B||8||7||9 |- |B||8||8||10 |} </center> # data <- read.table('C:\\Users\\Dinov\\Desktop\\data.txt',as.is=T, header=T) data <- read.table('data.txt',as.is=T, header=T) names(data) attach(data) table(Genotype, Race) table(Race, Subject) # for demonstration, construct a balanced data set # 5 subjects for each race data_balance <- data[data$\$$Subject <=5,]
# create factors data_balance$\$$g <- as.factor(data_balance$\$$Genotype) data_balance$\$$t <- as.factor(data_balance$\$$Race) data_balance$\$$s <- as.factor(data_balance$\$$Subject)
# fit the ANOVA anova.model <- lm(Weight ~ g+t, data= data_balance) # get the ANOVA table anova(anova.model)
# note that all F tests use MSE, subjects within race as denominator # will need to hand-calculate test for genotypes
# Random effects modeling estimation using REML library(lme4)
lme.model <- lmer(Weight~g+(1|t),data= data_balance) summary(lme.model) anova(lme.model)
# various extractor functions: fixef(lme.model) # estimates of fixed effects vcov(lme.model) # VC matrix for the fixed effects VarCorr(lme.model) # estimated variance(-covariance) for random effects ranef(lme.model) # predictions of random effects coef(lme.model) # fixed effects + pred's of random effects fitted(lme.model) # conditional means for each obs (X bhat + Z uhat) resid(lme.model) # conditional residuals (Y - fitted)
# REML is the default method for estimating variance components. # If want to use ML, can specify that lmer(Weight~g+(1|t),REML=F, data= data_balance)
# Now, to get p-values for the effects, or construct a ci # inference on Subject weight focus is the difference between the two genotypes # which is Cb for C = [0, 1] using the default R parameterization # following assumes lme4 library loaded, data frame is d # uses full data set (unbalanced) # also assumes g and t are factors identifying genotypes and trays
# fit the model data$\$$g <- as.factor(data$\$$Genotype) data$\$$t <- as.factor(data$\$$Race) data$\$$s <- as.factor(data$\$$Subject)
# See R DATA TYPES: http://www.statmethods.net/input/datatypes.html
model.lmer <- lmer(Weight ~ g + (1|t), data= data) # get a confidence interval for g # slower, obvious programing nsim <- 10 gdiff <- rep(NA, nsim) for (i in 1:nsim) { data$\$$y <- simulate(model.lmer) # param bootstrap data set model.lmer.1 <- lmer(unlist(data$\$$y) ~ g + (1|t), data= data) # we need to turn the list of simulated values (y) into an atomic vector with unlist() # both data frames and models objects (e.g., produced by lm()) are lists # http://adv-r.had.co.nz/Data-structures.html gdiff[i] <- fixef(model.lmer.1)[2] # keep only the est diff } quantile(gdiff, c(0.025, 0.975)) # print 95% CI of the coefficient/effect-size of genotype(g)
# print model summary summary(model.lmer)
# two ways to speed up CI construction and increase simulations to 1K # use apply to avoid the for loop # use refit() to avoid the setup time before fitting the LME
yall <- simulate(model.lmer, nsim=1000) gdiff <- apply(yall, 2, function(y) {fixef(refit(model.lmer, y))[2]}) quantile(gdiff, c(0.025, 0.975))
# To get both the estimate and the SE # which are in the fixed effect table returned by summary() # use the extractor functions fixef() and vcov()
yall <- simulate(model.lmer, nsim=1000) gdt <- apply(yall, 2, function(y) { model.lmer.2 <- refit(model.lmer, y); c(fixef(model.lmer.2)[2], sqrt(vcov(model.lmer.2)[2,2]) ) }) gdt <- t(gdt) # because result of apply is 2 rows, nsim cols
# Using gdt, we can obtain a parametric bootstrap-t interval
# Hypothesis testing: here we need to simulate under H0: g = 0 # To find the t statistic and the F statistic # t statistic is in the @coefs table # F statistic is a value in the anova() output # if more than one test, need to subscript (unless you want all F's) # need to reference as a list to avoid R issues
# Complete model with genotypes model.lmer <- lmer(Weight ~ g + (1|t), data= data)
model.lmer.0 <- lmer(Weight ~ (1|t), data=data) yall <- simulate(model.lmer.0, nsim=1000) # NB: simulate data under model.lmer.0 gdt0 <- apply(yall, 2, function(y) { model.lmer.2 <- refit(model.lmer, y); # but analyze under alt. model (model.lmer) c(summary(model.lmer.2), anova(model.lmer.2)$\$$'F value' ) }) gdt0 <- t(gdt0) # obs t = 1.438179, p-value using t statistic: coef(summary(model.lmer)) mean(abs(gdt0[,1]$\$$sim_1$\$$residuals) >= 1.438179) # p-value = 0.125 # obs F = 2.0684 anova(model.lmer) mean(gdt0[,2]$\$$sim_2$\$$residuals >= 2.0684) # p-value = 0.01785714
Baseball data
http://wiki.socr.umich.edu/index.php/SOCR_Data_MLB_HeightsWeights
data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\01a_data.txt',as.is=T, header=T) boxplot(Weight ~ Position, data = data, xlab = "Position", ylab = "Weight", main = "MLB Weight Distribution by Position") boxplot(Weight ~ Team, data = data, xlab = "Team", ylab = "Weight", main = "MLB Team Weight Distributions")
References
Bates, D.M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models using S4 classes. R package version 0.999999-0.
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390-412.
Barr, D.J., Levy, R., Scheepers, C., & Tilly, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.
Wike, E.L., & Church, J.D. (1976). Comments on Clark’s “The language-as-fixed-effect fallacy”. Journal of Verbal Learning & Verbal Behavior, 15, 249-255.
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv:1308.5499. [1]
- SOCR Home page: http://www.socr.umich.edu
Translate this page: