Difference between revisions of "SMHS LinearModeling LMM"

Revision as of 11:49, 10 February 2016

SMHS Linear Modeling - Linear mixed effects analyses

Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.

Questions:

What happens if data are not independent and identically distributed (IIDs)?
How to model multiple observations for the case/subject (across time, conditions, etc.)?

Fixed and random effects Linear models express relationships between data elements (variables) in terms of a (linear) function. For example, we can model weight as a function of height.

Weight ~ Height + ε

Here, “height” a fixed effect, and ε was our “error term” representing the deviations between the model predictions and observations (of weight) due to “random” factors that we cannot control experimentally. This tem (ε) is the “probabilistic” or “stochastic” part of the model. Let’s try to unpack “ε” and add complexity to it. In mixed (fixed and random effect) models, everything in the “systematic” part of your model works just like with linear models. If we change the random aspect of our model this leaves the systematic part (height) unchanged.

Suppose we’re looking at the Baseball data and try to identify a relationship that looks like this:

Weight ~ position + ε

Position (player field position) is treated as a categorical factor with several levels (e.g., Catcher, First-Baseman, etc.) On top of that, we also have an additional fixed effect, Height, and so our bivariate linear model looks more like this:

Weight ~ Height + position + ε

This model expansion is nice, but it complicates a bit the data analytics and scientific inference. If the study design involved taking multiple measures per player, say across time/age, each player would yield multiple position, height and weight responses. According to the assumptions of the linear model, this would violate the independence assumption as multiple responses from the same subject cannot be regarded as independent from one another. Every player has a slightly different weight, and this is going to be an idiosyncratic factor that affects all responses from the same player, thus rendering these different responses inter-dependent (within player) instead of independent, as required by the model assumptions.

A way to resolve this model assumption violation is to add a random effect for players. This allows us to account for inter-independences by assuming a different “baseline” weight value for each player. For example, player 1 may have a mean weight 200 pounds across different times, and player 2 may have a mean weight of 350 pounds. Here’s a visual depiction of how this looks like:

Team A_Player1	Team A_Playe2	Team A_Player3	Team A_Player4	Team A_Player5	Team B_Player1	Team B_Player2	Team B_Player 3
160	340	240	340	180	200	240	180
180	340	240	400	240	200	320	120
220	320	320	400	260	180	340	160
240	300	300	380	320	160	300	160
200	380	300	380	280	260	300	260
220	360	280	360	280	180	300	180
260	360	320	400	260	240	320	180
200	360	280	380	300	220	320	200
220	480	260	380	280	180	300	220

SOCR Charts generated these plots (http://www.socr.umich.edu/html/cha/). The same data may similarly be plotted using R:

data <- read.table('C:\\Users\\Dinov\\Desktop\\01a_data.txt',as.is=T, header=T)
# data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T)	
attach(data)
boxplot(TeamA_Player1, TeamA_Player2, TeamA_Player3, TeamA_Player4, TeamA_Player5, TeamB_Player1, TeamB_Player2, TeamB_Player3, col=c("white","lightgray"))
boxplot(data, las = 2)
boxplot(data, las = 2, par(mar = c(8, 5, 4, 2)+ 0.1))

We can model these individual differences by assuming different random intercepts for each player. In other words, each player will be assigned a different intercept value, and the mixed model estimates these intercept values.

The (fixed effects) linear models include several fixed effects and a general error term “ε”. Linear modeling segregates the world into things that we understand as systematic, i.e., fixed, effects or the explanatory variables, and things that we cannot control for or poorly understand (error, ε). Statistical inference requires that the unsystematic part (ε) of the model, does not have any interesting structure or pattern – it should represent random white noise with common (iid) across-the-board characteristics.

In mixed modeling, we add one or more terms to the fixed effects to account for random effects. These random effects essentially generalize the model (make it more applicable in situations when the error term does have structure) – that is, the random effect models pull out structure from the error term “ε”. For instance, in the baseball weights example, we add a random effect for “player”, and this characterizes idiosyncratic variation that is due to individual body build differences. The intertwining of fixed and random effects is what makes these models mixed-effect models.

Our updated formula looks like this:

M1: Weight ~ Height + position + ε1

M2: Weight ~ Height + position + (1|player) + ε2

“(1|player)” is the R notation for random player effects. It assumes an intercept that’s different for each player”, and “1” stands for the constant-term or intercept. This formula explicitly states that the model should account for (inter-player dependencies) multiple responses per player (say over time) – i.e., account for different baselines for each player. This effectively resolves the conflict with weight-dependences within players that stem from multiple weight observations.

The mixed effect model still contains a general error term “ε”, as even though it accounts for individual by-player variation, there’s still going to be “random” differences between different body-size measurements (e.g., weight) from the same player (noise).

The player position represents an additional source of non-independence that may need to be accounted for. Similarly to the case of by-player variation, we may expect by-position variation. For instance, there might be some special body-size demands for different positions that may lead to overall higher/lower weight. The weight measurements for different players may similarly be affected by this random factor due to game-position-specific idiosyncrasies. Thus, the different responses to one position may not always be independent. There may be some similarities in multiple weight measurements for the same play-position – even if these come from different players. Disregarding these interdependencies, we may violate the linear model independence assumptions. Below is an exemplary visual representation of the by-position variability in weight.

data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1&verifier=HpfmjfMFaMsk7rIpfPx0tmz960oTW7JA8ZonGvVC',as.is=T, header=T)
library("reshape2")
# melting by "Position". `melt is from the reshape2 package. 
# do ?melt for help
data.m <- melt(data[,-c(1:2)], id.var = "Position")
#
require(ggplot2)
ggplot(data = data.m, aes(x=variable, y=value)) + geom_boxplot(aes(fill=Position))

ggplot(data = data.m, aes(x=Position, y=value)) + 
        geom_boxplot() + facet_wrap(~variable,ncol = 4)

p <- ggplot(data = data.m, aes(x=variable, y=value)) 
p <- p + geom_boxplot(aes(fill = Position))
# to color the points replace group with color=Position
p <- p + geom_point(aes(y=value, group=Position), position = position_dodge(width=0.75))
p <- p + facet_wrap( ~ variable, scales="free")
p <- p + xlab("x-axis") + ylab("y-axis") + ggtitle("Title")
p <- p + guides(fill=guide_legend(title="Legend_Title"))
p

The variation between positions may be different from the variation between players!

To add an additional random effect for position we expand the model:

Weight ~ Height + position + (1|player) + (1|position) + ε

Note that a similar argument may be made for “Team”, there could be error-term dependencies or patterns reflecting the between-team random effects on weight – some teams may be bulkier than others. This is really well documented in sports (e.g., top notch Serbian and Spanish water polo teams have a vastly different physiques).

data.m2 <- melt(data[,-c(1,3)], id.var = "Team")
# require(ggplot2)
p <- ggplot(data = data.m2, aes(x=variable, y=value)) + geom_boxplot(aes(fill=Team))
p + facet_wrap( ~ variable, scales="free")

Thus, in addition to modeling different intercepts for different players, we may include, as necessary, different intercepts for different positions or teams. This “resolves” dependencies in the linear model, potentially accounting for multiple responses per player, position or team, and we correctly representing the model by-player and by-position variation in overall weight.

Prior to the advent of mixed linear models, researchers used the averages of all multiple records. For example, in nursing, researchers may average the vital signs (e.g., heart rates, temperatures, etc.) of their patients over time or condition for a patient-based analysis where each data point is assumed to be derived from one subject, assuring independence. Then researchers may also average over subjects for a diagnostic condition type analysis, where each data point comes from one condition. There are advantages and disadvantages of this averaging approach. Whereas traditional analyses based on averaging are in principle correct, mixed models provide more flexibility and take the complete data into account. In a patient-based analysis (averaging over conditions/Dx), we basically disregard the by-condition variability. Conversely, in a condition-based analysis, we may disregard by-patient variation. A mixed-effect model would account for both sources of variation in a single analysis.

Let’s apply this understanding of the linear mixed effects modeling via R.

R commands for mixed-effect modeling (See appendix for complete R script) Open RStudio and install the R package lme4 :

install.packages(“lme4”)
library(lme4)

This makes available the function lmer(), which is the mixed model equivalent of the function lm() in the fixed-effect model. Load some data in RStudio , .

data = read.csv(file.choose( ))
attach(data)

You can also try the R QCC package (for quality control, see the first section of this tutorial):

install.packages("qcc")
library("qcc", lib.loc="~/R/win-library/3.1")
data(orangejuice)
attach(orangejuice)
qcc(D[trial], sizes=size[trial], type="p")

Test the data-import and get a summary by using head(), tail(), summary(), str(), colnames(), or whatever commands you commonly use to get an overview of a dataset. Also, it is always good to check for missing values:

which(is.na(Team)==T)

If there are missing values, there are no problems for the mixed-effect modeling.

Using the MLB Baseball data we can explore relationship between variables (e.g., Height and Weight) by means of boxplots:

# data <- read.table('E:\\Ivo.dir\\Research\\UMichigan\\Education_Teaching_Curricula\\2015_2016\\HS_853_Fall_2015\\Modules_docx\\data\\01a_data.txt',as.is=T, header=T)
data <- read.table('data.txt',as.is=T, header=T)
boxplot(Weight ~ Height, col=c("white","lightgray"))
boxplot(Weight ~ Height*team, col=c("white","lightgray"))

Is the thick median line in the middle of the boxplot lower for the light than the heavier players?

To construct the mixed model try in the command below …

lmer(Weight ~ Height, data=data)

You will get an error that should look like this:

Error: No random effects terms specified in formula

This is because the model needs at least 1 random effect however we only specified a single fixed effect (Height). So, let’s add random intercepts for subjects and items (remember that items are called “scenarios” here):

model.lmer <- lmer(Weight ~ Height + (1|Team) + (1|Position), data= data)

This command creates a model that includes a fixed effect “Height” to predict Weight, controlling for by-team and by-position variability. The model is saved in an object model.lmer. To see the model type in model.lmer – this will print the output in the shell. Note that in the classical fixed effect model, lm(), we need to use summary() to get this output.

This is the full output:

summary(model.lmer)

Linear mixed model fit by REML ['lmerMod']

Formula: Weight ~ Height + (1 | Team) + (1 | Position)

Data: data

REML criterion at convergence: 8841.4

Min	1Q	Median	3Q	Max
-2.9396	-0.6885	-0.0222	0.6010	4.1358

Random Effects

Groups Name	Variance	Std. Dev.
Team (Intercept)	1.285	1.133
Position (Intercept)	56.218	7.498
Residual	295.652	17.195

Number of obs: 1034, groups: Team, 30; Position, 9

Fixed Effects

	Estimate Std.	Error	t Value
(Intercept)	-142.8215	19.0160	-7.511
Height	4.6973	0.2571	18.271

Correlation of Fixed Effects: (Intr) Height -0.991

The output starts with the user-specified model and data. Then, there’s the restricted maximum likelihood (ReML) criterion, summary of the model residuals (ε), followed by random and fixed effects.

The standard deviation column represents a measure of the variability for each random effect included in the model. “Team” has much less variability than Position. The Residual row represents the variability that’s not due to either Team or Position. This is the ε term – random deviations from the predicted values that are not due to Team or Position. This reflects the fact that body-size has some factors that affect Weight that are outside of the observations we have in this experiment. Similarly, to the fixed effect models, the fixed effects output in this mixed model output includes estimates, SE and T-statistic.

The coefficient of “Height” (4.6973) is the slope for this variable. Finally, the Correlation of the Fixed Effects (H vs. Intercept) is reported to be negative! The output under "correlation of fixed effects" has a different interpretation from the intuitive meaning. It is not about the correlation of the variables, but about the “expected correlation of the regression coefficients”. This may, or may not, be related to multicollinearity. In this case, corr(H,Intrc)= -0.991 suggest that if we redo the experiment again a decrease of the Height coefficient will increase the intercept, and vice-versa increase of the Height coefficient is expected to will drive down the intercept term (which is currently =-142.8215).

plot(Height, Weight, main="Scatterplot Height vs. Weight", xlab="Height ", ylab="Weight ", pch=19) 
# Add fit lines
abline(lm(Weight~Height), col="red") # regression line (W~H) 
lines(lowess(Height,Weight), col="blue") # lowess line (H,W), (locally weighted scatterplot smoothing)

Inference on LMER coefficients: The Satterthwaite approximation, implemented in the lmerTest package, overloads the lmerM function, using exactly the same model, but the summary() will include approximate degrees of freedom and p-values for all predictors.

# install.packages("lmerTest") 
library(lmerTest)

# re-fit model
model.lmer <- lmer(Weight ~ Height + (1|Team) + (1|Position), data= data, REML = FALSE)
coeffs <- data.frame(coef(summary(model.lmer)))

# get Satterthwaite-approximated degrees of freedom
coeffs$\$$df.Satt <- coef(summary(model.lmer))[, 3]
 # get approximate p-values
 coeffs$\$$p.Satt <- coef(summary(model.lmer))[, 5]
coeffs

	Estimate Std.	Error	df	t Value	Pr...t	df.Satt
(Intercept)	-143.26916	18.9698712	977.2059	-7.552458	9.792167e-14	977.2059
Height	4.70294	0.2567663	1030.9565	18.316031	0.000000e+00	1030.9565

p.Satt

(Intercept) 9.792167e-14

Height 0.000000e+00

For categorical variables, like in the fixed effect model, lm(), the mixed effect model, lmer(), takes whatever comes first in the alphabet to be the reference level. “A” comes before “C”, so the slope representing the change from “A” to “C” for a categorical variable (e.g., Team) is in that direction. If the interpretation requires the reference category to be “C”, rather than “A”, then we need to change how we interpret the sign of the slope coefficient. However, interpretation of standard errors and significance would remain the unchanged (as there are sign agnostic).

As the earlier model (model.lmer <- lmer(Weight ~ Height + (1|Team) + (1|Position), data= data)) did not account for age, the intercept may be biased between different age groups. If there are age effects – e.g., the distribution of weights may be bimodal (for younger and older players) – than the average weight may not be representative at all. Think of a study of mammals including 10 bipedal (humans) and 10 quadrupedal (dogs) where the mean number of legs of a mammal would be 3 ((10*2+10*4)/20=3), which is not informative and actually incorrect for mammals.

Adding Age as an additional fixed effect:

lmer.model.2 = lmer(Weight ~ Height + Age + (1|Team) + (1|Position), data=data)

This expands the original model object lmer.model with an age predictor. “Age” is added as a fixed effect because the relationship between Age and Weight is systematic and predictable (not random). Older players are expected to have higher weights.

plot(Age, Weight, main="Scatterplot Age vs. Weight", xlab="Age ", ylab="Weight ", pch=19) 
abline(lm(Weight~Age), col="red") # regression line (W~A)

Note that Age is different from the random effects for Team and Position, where the relationship between these and Weight is more unpredictable or stochastic.

Look at the residuals of lmer.model.2 output:

summary(lmer.model.2)

Linear mixed model fit by REML ['lmerMod']

Formula: Weight ~ Height + Age + (1 | Team) + (1 | Position)

Data: data

REML criterion at convergence: 8792.2

Scaled Residuals

Min	1Q	Median	3Q	Max
-2.9434	-0.6454	-0.0414	0.6022	4.4715

Random Effects

Groups Name	Variance	Std. Dev.
Team (Intercept)	1.136	1.066
Position (Intercept)	47.935	6.924
Residual	281.805	16.787

Number of obs: 1034, groups: Team, 30; Position, 9

Fixed Effects

	Estimate Std.	Error	t Value
(Intercept)	-176.4537	19.0913	-9.243
Height	4.8043	0.2512	19.125
Age	0.8888	0.1222	7.273

Correlation of Fixed Effects

	(Intr)	Height
Height	-0.974
Age	-0.240	0.056

Next See

Machine Learning Algorithms section for data modeling, training , testing, forecasting, prediction, and simulation.

SOCR Home page: http://www.socr.umich.edu

Translate this page:

(default)	Deutsch	Español	Français	Italiano	Português	日本語	България	الامارات العربية المتحدة	Suomi	इस भाषा में	Norge
한국어	中文	繁体中文	Русский	Nederlands	Ελληνικά	Hrvatska	Česká republika	Danmark	Polska	România	Sverige

@@ Line 318: / Line 318: @@
 |}
 </center>
+<center>
+Random Effects
+{| class="wikitable" style="text-align:center; width:35%" border="1"
+|-
+|Groups Name||Variance||Std. Dev.
+|-
+|Team (Intercept)||1.136||1.066
+|-
+|Position (Intercept)||47.935||6.924
+|-
+|Residual||281.805||16.787
+|}
+Number of obs: 1034, groups:  Team, 30; Position, 9
+</center>
+<center>
+Fixed Effects
+{| class="wikitable" style="text-align:center; width:35%" border="1"
+|-
+| ||Estimate Std.||Error||t Value
+|-
+|(Intercept)||-176.4537||19.0913||-9.243
+|-
+|Height||4.8043||0.2512||19.125
+|-
+|Age||0.8888||0.1222||7.273
+|}
+</center>
+<center>
+Correlation of Fixed Effects
+{| class="wikitable" style="text-align:center; width:35%" border="1"
+|-
+|||(Intr)||Height
+|-
+|Height||-0.974||
+|-
+|Age||-0.240||0.056
+|}
+</center>

Difference between revisions of "SMHS LinearModeling LMM"

Revision as of 11:49, 10 February 2016

SMHS Linear Modeling - Linear mixed effects analyses

Next See

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools