Difference between revisions of "SMHS LinearModeling LMM"
Line 73: | Line 73: | ||
In mixed modeling, we add one or more terms to the fixed effects to account for random effects. These random effects essentially generalize the model (make it more applicable in situations when the error term does have structure) – that is, the random effect models pull out structure from the error term “ε”. For instance, in the baseball weights example, we add a random effect for “player”, and this characterizes idiosyncratic variation that is due to individual body build differences. The intertwining of fixed and random effects is what makes these models mixed-effect models. | In mixed modeling, we add one or more terms to the fixed effects to account for random effects. These random effects essentially generalize the model (make it more applicable in situations when the error term does have structure) – that is, the random effect models pull out structure from the error term “ε”. For instance, in the baseball weights example, we add a random effect for “player”, and this characterizes idiosyncratic variation that is due to individual body build differences. The intertwining of fixed and random effects is what makes these models mixed-effect models. | ||
+ | Our updated formula looks like this: | ||
+ | <blockquote>M1: Weight ~ Height + position + ε1</blockquote> | ||
+ | <blockquote>M2: Weight ~ Height + position + (1|player) + ε2</blockquote> | ||
+ | “(1|player)” is the R notation for random player effects. It assumes an intercept that’s different for each player”, and “1” stands for the constant-term or intercept. This formula explicitly states that the model should account for (inter-player dependencies) multiple responses per player (say over time) – i.e., account for different baselines for each player. This effectively resolves the conflict with weight-dependences within players that stem from multiple weight observations. | ||
+ | The mixed effect model still contains a general error term “ε”, as even though it accounts for individual by-player variation, there’s still going to be “random” differences between different body-size measurements (e.g., weight) from the same player (noise). | ||
+ | The player position represents an additional source of non-independence that may need to be accounted for. Similarly to the case of by-player variation, we may expect by-position variation. For instance, there might be some special body-size demands for different positions that may lead to overall higher/lower weight. The weight measurements for different players may similarly be affected by this random factor due to game-position-specific idiosyncrasies. Thus, the different responses to one position may not always be independent. There may be some similarities in multiple weight measurements for the same play-position – even if these come from different players. Disregarding these interdependencies, we may violate the linear model independence assumptions. Below is an exemplary visual representation of the by-position variability in weight. | ||
+ | |||
+ | data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T) | ||
+ | library("reshape2") | ||
+ | # melting by "Position". `melt is from the reshape2 package. | ||
+ | # do ?melt for help | ||
+ | data.m <- melt(data[,-c(1:2)], id.var = "Position") | ||
+ | # | ||
+ | require(ggplot2) | ||
+ | ggplot(data = data.m, aes(x=variable, y=value)) + geom_boxplot(aes(fill=Position)) | ||
+ | |||
+ | ggplot(data = data.m, aes(x=Position, y=value)) + | ||
+ | geom_boxplot() + facet_wrap(~variable,ncol = 4) | ||
+ | |||
+ | p <- ggplot(data = data.m, aes(x=variable, y=value)) | ||
+ | p <- p + geom_boxplot(aes(fill = Position)) | ||
+ | # to color the points replace group with color=Position | ||
+ | p <- p + geom_point(aes(y=value, group=Position), position = position_dodge(width=0.75)) | ||
+ | p <- p + facet_wrap( ~ variable, scales="free") | ||
+ | p <- p + xlab("x-axis") + ylab("y-axis") + ggtitle("Title") | ||
+ | p <- p + guides(fill=guide_legend(title="Legend_Title")) | ||
+ | p | ||
Revision as of 14:50, 2 February 2016
SMHS Linear Modeling - Linear mixed effects analyses
Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.
Questions:
- What happens if data are not independent and identically distributed (IIDs)?
- How to model multiple observations for the case/subject (across time, conditions, etc.)?
Fixed and random effects Linear models express relationships between data elements (variables) in terms of a (linear) function. For example, we can model weight as a function of height.
Weight ~ Height + ε
Here, “height” a fixed effect, and ε was our “error term” representing the deviations between the model predictions and observations (of weight) due to “random” factors that we cannot control experimentally. This tem (ε) is the “probabilistic” or “stochastic” part of the model. Let’s try to unpack “ε” and add complexity to it. In mixed (fixed and random effect) models, everything in the “systematic” part of your model works just like with linear models. If we change the random aspect of our model this leaves the systematic part (height) unchanged.
Suppose we’re looking at the Baseball data and try to identify a relationship that looks like this:
Weight ~ position + ε
Position (player field position) is treated as a categorical factor with several levels (e.g., Catcher, First-Baseman, etc.) On top of that, we also have an additional fixed effect, Height, and so our bivariate linear model looks more like this:
Weight ~ Height + position + ε
This model expansion is nice, but it complicates a bit the data analytics and scientific inference. If the study design involved taking multiple measures per player, say across time/age, each player would yield multiple position, height and weight responses. According to the assumptions of the linear model, this would violate the independence assumption as multiple responses from the same subject cannot be regarded as independent from one another. Every player has a slightly different weight, and this is going to be an idiosyncratic factor that affects all responses from the same player, thus rendering these different responses inter-dependent (within player) instead of independent, as required by the model assumptions.
A way to resolve this model assumption violation is to add a random effect for players. This allows us to account for inter-independences by assuming a different “baseline” weight value for each player. For example, player 1 may have a mean weight 200 pounds across different times, and player 2 may have a mean weight of 350 pounds. Here’s a visual depiction of how this looks like:
Team A_Player1 | Team A_Playe2 | Team A_Player3 | Team A_Player4 | Team A_Player5 | Team B_Player1 | Team B_Player2 | Team B_Player 3 |
160 | 340 | 240 | 340 | 180 | 200 | 240 | 180 |
180 | 340 | 240 | 400 | 240 | 200 | 320 | 120 |
220 | 320 | 320 | 400 | 260 | 180 | 340 | 160 |
240 | 300 | 300 | 380 | 320 | 160 | 300 | 160 |
200 | 380 | 300 | 380 | 280 | 260 | 300 | 260 |
220 | 360 | 280 | 360 | 280 | 180 | 300 | 180 |
260 | 360 | 320 | 400 | 260 | 240 | 320 | 180 |
200 | 360 | 280 | 380 | 300 | 220 | 320 | 200 |
220 | 480 | 260 | 380 | 280 | 180 | 300 | 220 |
SOCR Charts generated these plots (http://www.socr.umich.edu/html/cha/). The same data may similarly be plotted using R:
data <- read.table('C:\\Users\\Dinov\\Desktop\\01a_data.txt',as.is=T, header=T) # data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T) attach(data) boxplot(TeamA_Player1, TeamA_Player2, TeamA_Player3, TeamA_Player4, TeamA_Player5, TeamB_Player1, TeamB_Player2, TeamB_Player3, col=c("white","lightgray")) boxplot(data, las = 2) boxplot(data, las = 2, par(mar = c(8, 5, 4, 2)+ 0.1))
We can model these individual differences by assuming different random intercepts for each player. In other words, each player will be assigned a different intercept value, and the mixed model estimates these intercept values.
The (fixed effects) linear models include several fixed effects and a general error term “ε”. Linear modeling segregates the world into things that we understand as systematic, i.e., fixed, effects or the explanatory variables, and things that we cannot control for or poorly understand (error, ε). Statistical inference requires that the unsystematic part (ε) of the model, does not have any interesting structure or pattern – it should represent random white noise with common (iid) across-the-board characteristics.
In mixed modeling, we add one or more terms to the fixed effects to account for random effects. These random effects essentially generalize the model (make it more applicable in situations when the error term does have structure) – that is, the random effect models pull out structure from the error term “ε”. For instance, in the baseball weights example, we add a random effect for “player”, and this characterizes idiosyncratic variation that is due to individual body build differences. The intertwining of fixed and random effects is what makes these models mixed-effect models.
Our updated formula looks like this:
M1: Weight ~ Height + position + ε1
M2: Weight ~ Height + position + (1|player) + ε2
“(1|player)” is the R notation for random player effects. It assumes an intercept that’s different for each player”, and “1” stands for the constant-term or intercept. This formula explicitly states that the model should account for (inter-player dependencies) multiple responses per player (say over time) – i.e., account for different baselines for each player. This effectively resolves the conflict with weight-dependences within players that stem from multiple weight observations.
The mixed effect model still contains a general error term “ε”, as even though it accounts for individual by-player variation, there’s still going to be “random” differences between different body-size measurements (e.g., weight) from the same player (noise).
The player position represents an additional source of non-independence that may need to be accounted for. Similarly to the case of by-player variation, we may expect by-position variation. For instance, there might be some special body-size demands for different positions that may lead to overall higher/lower weight. The weight measurements for different players may similarly be affected by this random factor due to game-position-specific idiosyncrasies. Thus, the different responses to one position may not always be independent. There may be some similarities in multiple weight measurements for the same play-position – even if these come from different players. Disregarding these interdependencies, we may violate the linear model independence assumptions. Below is an exemplary visual representation of the by-position variability in weight.
data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T) library("reshape2") # melting by "Position". `melt is from the reshape2 package. # do ?melt for help data.m <- melt(data[,-c(1:2)], id.var = "Position") # require(ggplot2) ggplot(data = data.m, aes(x=variable, y=value)) + geom_boxplot(aes(fill=Position))
ggplot(data = data.m, aes(x=Position, y=value)) + geom_boxplot() + facet_wrap(~variable,ncol = 4)
p <- ggplot(data = data.m, aes(x=variable, y=value)) p <- p + geom_boxplot(aes(fill = Position)) # to color the points replace group with color=Position p <- p + geom_point(aes(y=value, group=Position), position = position_dodge(width=0.75)) p <- p + facet_wrap( ~ variable, scales="free") p <- p + xlab("x-axis") + ylab("y-axis") + ggtitle("Title") p <- p + guides(fill=guide_legend(title="Legend_Title")) p
....
Next See
Machine Learning Algorithms section for data modeling, training , testing, forecasting, prediction, and simulation.
- SOCR Home page: http://www.socr.umich.edu
Translate this page: