SMHS GEE
Contents
Scientific Methods for Health Sciences - Generalized Estimating Equations (GEE) Models
Overview
Generalized estimating equation (GEE) is a commonly used method to estimate the parameters of a generalized linear model with a possible unknown correlation between outcomes. It provides a general approach for analyzing discrete and continuous responses with marginal models and works as a popular alternative to maximum likelihood estimation (MLE). In this section, we are going to present a general introduction to the GEE method and illustrate its application with examples. The R package geeglm will be discussed in the attached paper reference.
Motivation
There is no nature specification or convenient ways to deal with situations like the joint multivariate distribution of $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}'$ where the responses are discrete. That’s where the GEE method is proposed based on the concept of estimating equations. It provides a general approach for analyzing discrete and continuous responses with marginal models. So, how does the GEE model work?
GEE is applicable when (1) $\beta$, a generalized linear model regression parameter, characterizes systematic variation across covariate levels, (2) the date represents repeated measurements, clustered data, multivariate response, and (3) the correlation structure is a nuisance feature of the data. Denote:
- Response variables: $\{Y_{i,1}, Y_{i,2}, ..., Y_{i,n_t}\}$, where $i \in [1,N]$ is the index for clusters or subjects, and $j \in [1, n_t]$ is the index of the measurement within cluster/subject.
- Covariate vector: $\{X_{i,1}, X_{i,2}, ..., X_{i,n_t}\}$.
The primary focus of GEE is the estimation of the mean model: $E(Y_{i,j}|X_{i,j})=\mu_{i,j}$, where $g(\mu_{i,j})=\beta_o+\beta_1 X_{i,j}(1) +\beta_2 X_{i,j}(2)+\beta_3 X_{i,j}(3)+...+\beta_p X_{i,j}(p)=X_{i,j}\times \beta$.
This mean model can be any generalized linear model, e.g.,
- $P(Y_{i,j}=1|X_{i,j})=\pi_{i,j}$ (marginal probability, as we don't condition on any other variables)
- $g(\mu_{i,j})= \log ( \frac{\pi_{i,j}}{1-\pi_{i,j}}) = X_{i,j}\times \beta$.
Since the data could be clustered (e.g., within subject, or within unit), we need to choose a correlation model:
- $var(Y_{i,j}|X_{i}) = V_{i,j}$
- $A_{i}= diag(V_{i,j})$
- paired correlations: $\rho_{i,j,k} = corr(Y_{i,j},Y_{i,k}|X_{i})$
- correlation matrix: $R_i= \{\rho_{i,j,k}\}$, for all $j$ and $k$
- paired predictor-response covariances: $V_i=cov(Y_i|X_i)=A_i^{1/2} R_i A_i^{1/2}$.
Different correlation structures assumed in the data lead to alternative models. For example:
- Independence: $R_i = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{bmatrix}$,
- Exchangeable/equi-correlation:: $R_i = \begin{bmatrix} 1 & \alpha & \alpha & \alpha \\ \alpha & 1 & \alpha & \alpha \\ \alpha & \alpha & 1 & \alpha \\ \alpha & \alpha & \alpha & 1 \\ \end{bmatrix}$, for any $\alpha$
- Unstructured: $R_i = \begin{bmatrix} 1 & \alpha_{1,2} & \alpha_{1,3} & \alpha_{1,4} \\ \alpha_{2,1} & 1 & \alpha_{2,3} & \alpha_{2,4} \\ \alpha_{3,1} & \alpha_{3,2} & 1 & \alpha_{3,4} \\ \alpha_{4,1} & \alpha_{4,2} & \alpha_{4,3} & 1 \\ \end{bmatrix}$
- and so on.
Notes:
- GEE is a semi-parametric technique because:
- The specification of a mean model, $\mu_{i,j}(\beta)$, and a correlation model, $R_i(\alpha)$, does not identify a complete probability model for $Y_i$
- The model $\{ \mu_{i,j}(\beta), R_i(\alpha)\}$ is semi-parametric since it only specifies the first two multivariate moments (mean and covariance) of $Y_i$. Higher order moments are not specified.
- Without an explicit likelihood function, to estimate the parameter vector $\beta$ (and perhaps the covariance parameter matrix $R_i(\alpha)$) and perform valid statistical inference that takes the dependence into consideration, we need to construct an unbiased estimating function:
- $D_i(\beta) = \frac{\partial \mu_i}{\partial \beta}$, the partial derivative, w.r.t. $\beta$, of the mean-model for subject $i$
- $D_i(j,k) = \frac{\partial \mu_{i,j}}{\partial \beta_k}$
- Estimating (cost) function: $U(\beta)=\sum_{i=1}^N{D_i^T(\beta)V_i^{-1}(\beta,\alpha)\{Y_i-\mu_i(\beta)\}}$.
- Solving the Estimating Equations leads to parameter estimating solutions: $0=U(\hat{\beta})=\sum_{i=1}^N{D_i^T(\hat{\beta})V_i^{-1}(\hat{\beta},\alpha)\{Y_i-\mu_i(\hat{\beta})\}}$.
- $0=U(\hat{\beta})=\sum_{i=1}^N{\underbrace{D_i^T(\hat{\beta})}_{\text{scale}} \underbrace{V_i^{-1}(\hat{\beta},\alpha)}_{\text{variance weight}} \underbrace{\{Y_i-\mu_i(\hat{\beta})\}}_{\text{model mean}}}$
- Scale: a change of scale term transforming the scale of the mean, $\mu_i$, to the scale of the regression coefficients (covariates)
- Variance weight: the inverse of hte variance-covariance matrix is used to weight in the data for subject $i$, i.e., giving more weight to differences between observed and expected values for subjects that contribute more information
- Model Mean: specifies the mean model, $\mu_i(\beta)$, compared to the observed data, $Y_i$. This fidelity term minimizes the difference between actually-observed and mean-expected (within the $i^{th}$ cluster/subject).
- $0=U(\hat{\beta})=\sum_{i=1}^N{\underbrace{D_i^T(\hat{\beta})}_{\text{scale}} \underbrace{V_i^{-1}(\hat{\beta},\alpha)}_{\text{variance weight}} \underbrace{\{Y_i-\mu_i(\hat{\beta})\}}_{\text{model mean}}}$
Theory
1) GEE
GEE is a commonly used method to estimate the parameters of a generalized linear model with a possible unknown correlation between outcomes. Parameter estimates from the GEE are consistent even when the covariance is mis-specified under mild regularity conditions and the focus of GEE is on estimating the average response over the population rather than the regression parameters that would enable prediction of the effect of changing one or more covariates on a given individual.
- GEEs belong to the class of semi-parametric regression techniques since they rely on specification of only the first two moments and are a popular alternative to the likelihood-based generalized linear mixed model.
- They are commonly used in large epidemiological studies, especially multi-site cohort studies because they can handle many types of unmeasured dependence between outcomes.
2) Marginal models
The vector of observations from subject $I$ are $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}',$ for $(i=1,2, \cdots, N).$ Consider a marginal model that has a mean function, a function of predictors; a known variance functions (discrete data the mean determines the variance); possibly a scale parameter for overdispersion; a working correlation matrix.
- More specifically, the marginal mean of the response is $E[Y_{ij}]=\mu_{ij},$ the mean depends on explanatory variables $X_{ij}$ through a known link function: $g(\mu_{ij})=\eta_{ij} ={X_{ij}}’ \beta.$
- The marginal variance of $Y_{ij}$ depends on the marginal mean according to $Var[Y_{ij}]=v(\mu_{ij}) \phi,$ where the variance function $v(\mu_{ij})$ is known, $\phi =1$ is fixed for Poisson, Bernoulli. Parameter $\phi$ may have to be estimated for normal model or overdispersed data. Correlation between $Y_{ij}$ and $Y_{ik}$ is a function of extra parameter, \alpha. Correlation may depend on means $\mu_{ij}$ and $\mu_{ik}.$
- The mean idea is to generalize the usual univariate likelihood equations by introducing the covariance matrix of the vector of responses,$ Y_{i}$. For linear models, weighted least squares (WLS) or generalized least squares (GLS) can be considered a special case of this ‘estimating equations’ approach; for non-linear models, this approach is called ‘generalized estimating equations’ (GEE).
3) GEE computation:
- Consider normal linear regression $y_{i}={x_{i}}’ \beta + \epsilon_{i}, \epsilon_{i} \sim N(0, \sigma^{2}).$ The normal density for $N$ observations looks like $(2* \pi * \sigma^{2})^{-N/2}$ times the exponential of $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta),$ which is also referred to as the log likelihood function.
- Likelihood equations, normal regression: find the maximum likelihood estimates by finding the value $\hat{\beta}$ of $\beta$ that maximizes the likelihood $l(\beta)= \sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ maximum of a function $l(\beta)$ is found by differentiating the function $l(\beta)$ with respect to $\beta,$ setting this equal to zero and then solving $\frac{dl(\beta)}{d \beta} = 0.$
- Regression likelihood equations: the log likelihood looks like $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ differentiating with respect to $\beta$ and setting equal to zero gives $\sum_{i=1}^{N}{x_{i}}'\sigma^{-2}(y_{i}-{x_{i}}'\beta) = 0.$ This is the likelihood equation. Solving for $\beta$ gives us $\hat{\beta}=(\sum_{i} x_{i}{x_{i}}')^{-1}(\sum_{i} x_{i}y_{i}).$
- Normal longitudinal model: generalize to multivariate regression (longitudinal models); suppose continuous data $Y_{i}$, nx1; predictor matrix $X_{i}$,$n$ rows by $p$ columns; coefficient vector $\beta, p by 1$. The model is $Y_{i}=X_{i}+\epsilon_{i},$ the residual distribution is $\epsilon \sim N(0, V_{i}),$ variance matrix is $V_{i}$ to distinguish from summation $\sum.$ Covariance matrix $V_{i}$ depends on unknown parameter $\theta.$
- Multivariate likelihood: the multivariate normal log likelihood is $\sum_{i=1}^{N} -0.5{(Y_{i}-X_{i}\beta)}'V_{i}^{-1}(Y_{i}-X_{i}\beta);$ looks like a normal regression likelihood as before; vector $Y_{i}$ has replaced scalar $y_{i}$; vector $X_{i}\beta$ has replaced scalar ${x_{i}}’\beta;$ and matrix $V_{i}^{-1}$ has replaced $\sigma^{-2}.$
- Multivariate observation likelihood: differentiate the log likelihood and set to zero; gives the likelihood equation $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0;$ similar to linear regression likelihood equation; the solution (suppose we know $V_{i})$ is $\hat{\beta}=(\sum_{i=1}^{N}{X_{i}'V_{i}^{-1}X_{i}})^{-1}(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}Y_{i}),$ this is the weighted least squares estimator for $\hat{\beta}.$
- Interpretation of the likelihood equations: $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0;$ the residual $(Y_{i}-X_{i}\beta)$ is $(Y_{i}-\mu_{i}),$ the observation vector $Y_{i}$ minus its mean $\mu_{i}=X_{i}\beta;$ the $V_{i}^{-1}$ is the inverse of the inverse of the covariance matrix; the $X_{i}$ can be thought of as the derivative of $\mu_{i}$ with respect to the regression parameter $\beta,$ written as $d\mu_{i}/d\beta.$
- Multivariate generalized linear models equations: generalize to generalized linear models (GLMs) with correlated data. It is no longer a likelihood equation, instead use a weighted least squares equation, where we replace each term in the previous likelihood equation with a generalized linear model equivalent, and include something to adjust for correlations. $Y_{i}$ vector remains the same; mean vector is $\mu_{i}$ rather than $X_{i}\alpha;$ replace $X_{i}$ with the derivative $D_{i} = d\mu_{i}/d\beta;$ for GLMs $D_{i}$ has components that look like $D_{1}^{-1}X_{i}, matrix D_{1}$ is diagonal and has elements $v(\mu_{ij})$ the variance functions of the means; thus $D_{i}$ is a weighted predictor matrix; the overdispersion parameter $\phi$ is not involved at this stage. The $V_{i}$’s are constructed to be similar to the covariance matrix $var[Y_{i}]$, but not actually equal; the GEEs are $\sum_{i=1}^{N}D_{i}V_{i}^{-1}(Y_{i}-\mu_{i}(\beta))=0,$ solve for $\beta$ giving the GEE estimate we write as $\hat{\beta}_{GEE}.$
- Constructing the working covariance matrix: $V_{i} \approx Var[Y_{i}];$ separate $V_{i}$ into two parts of variance matrix and correlation matrix; the variance matrix is a diagonal matrix $A_{i}$ of variances; the diagonal elements of $A_{i}$ are $\phi v(\mu_{ij})$ are the variances of the observation $Y_{ij}.$ The correlation matrix $Corr(Y_{i})$ is a function of unknown parameter $\alpha$; the matrix $V_{i}$ is known as a working covariance matrix; it is not the true underlying correlation matrix $Corr[Y_{i}]$;, the $V_{i}$ is close to $Var[Y_{i}],$ but is not assumed to be exactly correct.
- Variances of the GEE estimates: the working correlation is a postulated covariance matrix $Y_{i}-\mu_{i};$ not the final or true or modeled covariance matrix; rather a working correlation matrix used to create estimates $\hat{\beta}_{GEE}.$ If the model for the data is correct, then the covariance matrix of the estimate is $B^{-1}=Var[\hat{\beta}_{GEE}]=(\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}D_{i})^{-1};$ this is the same as the proc mixed covariance model $(\sum_{i=1}{N}X_{i}'V_{i}^{-1}X_{i})^{-1},$ but with $D_{i}=d\mu_{i}/d\beta replacing X_{i}.$ Assuming working correlation matrix is incorrect, modify covariance matrix $B^{-1}, Var[\hat{\beta}_{GEE}]=B^{-1}MB^{-1},$ increase the variance estimate; inflate $Var[\hat{\beta}_{GEE}]$ to pay for errors in $Corr[Y_{i}],$ the $B^{-1}$ is the bread; $M$ is the meat in the $B^{-1}MB^{-1}$ sandwich, which gives this estimator the nickname ‘sandwich estimator’.
- The middle: $M=\sum_{i=1}^{N}D_{i}'V_{i}^{-1}Cov[Y_{i}]V_{i}^{-1}D_{i},$ estimate $D_{i}=d\mu_{i}/d\beta$ and $V_{i}$ is the working correlation matrix times the variance function; Given that we don’t know $Cov[Y_{i}],$ we replace it with $(Y_{i}-\hat{\mu}_{i})(Y_{i}-\hat{\mu}_{i})’,$ which has the right expectation and $B^{-1}MB^{-1}$ is referred to as aka Huber sandwich estimator or robust standard errors.
- Advantages of GEE: the estimated regression coefficient $\hat{\beta}_{GEE}$ is asymptotically correct if the underlying regression mean model is correct even if assumed correlation model is incorrect. Valid standard errors are attained with the sandwich covariance estimator and are easy to use methodology.
- Disadvantages of GEE: the assumptions can correspond to a mathematically impossible model, that is some combinations of correlations and variances are mathematically impossible, no matter how the data is generated; the sandwich estimator inflates standard errors; inflated SE’s are a serious cost of not modeling the covariance correctly; GEE does not fully specify a statistical model and covers instead with asymptotic statements; full assumptions are hidden.
- GEE does not allow or make individual subject predictions; GEE methodology most suited to be balanced longitudinal design where the number of observations $N$ is large and number $n$ of repeated measure is small; GEE is not good for highly unbalanced data sets.
Applications
1) This article discussed extensions of generalized linear models for the analysis of longitudinal data. Two approaches are considered: subject-specific (SS) models in which heterogeneity in regression parameters is explicitly modeled; and population-average (PA) models in which the aggregate response for the population is the focus. We use a generalized estimating equation approach to fit both classes of models for discrete and continuous outcomes. When the subject-specific parameters are assumed to follow a Gaussian distribution, simple relationships between the PA and SS parameters are available. The methods are illustrated with an analysis of data on mother’s smoking and children’s respiratory disease.
2) This article discussed about the generalized estimating equations for correlated binary data using the odds ratio as a measure of association. Moment methods for analyzing repeated binary responses have been proposed before and estimation of the parameters were associated with the expected value of an individual's vector of binary responses as well as the correlations between pairs of binary responses. Because the odds ratio has many desirable properties, and some investigators may find the odds ratio is easier to interpret, the paper discuss modeling the association between binary responses at pairs of times with the odds ratio and modify the estimating equations of Prentice to estimate the odds ratios. In simulations, the parameter estimates for the logistic regression model for the marginal probabilities appear slightly more efficient when using the odds ratio parameterization.
3) This article talked about the generalized estimating equation (GEE) for longitudinal data analysis. The GEE approach of Zeger and Liang facilitates analysis of data collected in longitudinal, nested, or repeated measures designs. GEEs use the generalized linear model to estimate more efficient and unbiased regression parameters relative to ordinary least squares regression in part because they permit specification of a working correlation matrix that accounts for the form of within-subject correlation of responses on dependent variables of many different distributions, including normal, binomial, and Poisson. The author briefly explains the theory behind GEEs and their beneficial statistical properties and limitations and compares GEEs to suboptimal approaches for analyzing longitudinal data through use of two examples. The first demonstration applies GEEs to the analysis of data from a longitudinal lab study with a counted response variable; the second demonstration applies GEEs to analysis of data with a normally distributed response variable from subjects nested within branch offices of an organization.
Software
The geeglm functiongeeglem(outcome ~ baseline + center + sex + treat + age + I(age^), data = respiratory, id = interaction (center, id), family = binomial, corstr = ‘exchangeable’)
Problems
Example 1: library(geepack) # install this package first data(dietox) summary(dietox)
Weight Feed Time Pig Min. : 15.00 Min. : 3.30 Min. : 1.000 Min. :4601 1st Qu.: 38.30 1st Qu.: 32.80 1st Qu.: 3.000 1st Qu.:4857 Median : 59.20 Median : 74.50 Median : 6.000 Median :5866 Mean : 60.73 Mean : 80.73 Mean : 6.481 Mean :6238 3rd Qu.: 81.20 3rd Qu.:123.00 3rd Qu.: 9.000 3rd Qu.:8050 Max. :117.00 Max. :224.50 Max. :12.000 Max. :8442 NA's :72 Evit Cu Litter Min. :1.000 Min. :1.000 Min. : 1.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 5.00 Median :2.000 Median :2.000 Median :11.00 Mean :2.027 Mean :2.015 Mean :12.14 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:20.00 Max. :3.000 Max. :3.000 Max. :24.00
attach(dietox) Cu <- as.factor(dietox$Cu) mf <- formula(Weight~Cu*(Time+I(Time^2)+I(Time^3))) gee <- geeglm(mf,data=dietox,id=Pig,family=poisson('identity'),corstr='ar1') ## The geeglm function fits generalized estimating equations using the ‘geese.fit’ function of the ‘geepack’ package for doing the actual computations. gee Call: geeglm(formula = mf, family = poisson("identity"), data = dietox,
id = Pig, corstr = "ar1")
Coefficients:
(Intercept) Cu Time I(Time^2) I(Time^3) Cu:Time
22.025614857 0.013857443 2.177350061 0.683063921 -0.027808001 0.424728590 Cu:I(Time^2) Cu:I(Time^3) -0.047363387 0.001313779
Degrees of Freedom: 861 Total (i.e. Null); 853 Residual
Scale Link: identity Estimated Scale Parameters: [1] 0.7877976
Correlation: Structure = ar1 Link = identity Estimated Correlation Parameters:
alpha
0.9576663 Number of clusters: 72 Maximum cluster size: 12
aov(gee)
Call:
aov(formula = gee)
Terms:
Cu Time I(Time^2) I(Time^3) Cu:Time Cu:I(Time^2)
Sum of Squares 521.2 492381.6 1006.6 297.1 0.0 34.1 Deg. of Freedom 1 1 1 1 1 1
Cu:I(Time^3) Residuals
Sum of Squares 1.4 42350.3 Deg. of Freedom 1 853
Residual standard error: 7.046179 Estimated effects may be unbalanced
Example 2 (use dataset of Normal and Schizophrenia Neuroimaging study of Children at http://wiki.socr.umich.edu/index.php/SOCR_Data_Oct2009_ID_NI) ▪ Subject: Subject identifier ▪ Age: Subject Age ▪ DX: Subject diagnosis (Normals=1; Schizophrenia=2) ▪ Sex: Subject gender (Male=1; female=2) ▪ FS_IQ: Subject Intelligence Quotient (IQ) ▪ TBV: Total Brain Volume (mm 3 ) ▪ GMV: Total Gray Matter Volume (mm 3 ) ▪ WMV: Total White Matter Volume (mm 3 ) ▪ CSF: Total Cerebrospinal Fluid (mm 3 ) ▪ Background: Background volume (mm 3 ) ▪ L_superior_frontal_gyrus, R_superior_frontal_gyrus, ..., brainstem: 56 regional cortical and subcortical volumes (region anatomical names are encoded in the column heading).
neuro <- read.csv('M:/neuro.csv',header=T) summary(neuro) mf2 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3+TBV+GMV+WMV+CSF+Background) gee2 <- geeglm(mf2,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1') gee2 Call: geeglm(formula = mf2, family = poisson("identity"), data = neuro,
id = Subject, corstr = "ar1")
Coefficients:
(Intercept) Sex Age TBV GMV WMV CSF Background
-3.093394e+02 2.255625e+01 3.532712e+00 3.392570e+00 -3.392482e+00 -3.392484e+00 -3.392693e+00 3.144914e-05
Sex:Age
-1.526318e+00
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Scale Link: identity Estimated Scale Parameters: [1] 2.538312
Correlation: Structure = ar1 Link = identity Estimated Correlation Parameters: alpha
0
Number of clusters: 63 Maximum cluster size: 1
aov(gee2)
Call:
aov(formula = gee2)
Terms:
Sex Age TBV GMV WMV CSF Background Sex:Age Residuals
Sum of Squares 497.829 740.825 1040.712 25.337 3131.065 205.501 384.453 366.533 15459.175 Deg. of Freedom 1 1 1 1 1 1 1 1 54
Residual standard error: 16.91984 Estimated effects may be unbalanced
mf3 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3) gee3 <- geeglm(mf3,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1') gee3
Call: geeglm(formula = mf3, family = poisson("identity"), data = neuro,
id = Subject, corstr = "ar1")
Coefficients: (Intercept) Sex Age Sex:Age
81.3791778 1.9781743 1.7571481 -0.5086719
Degrees of Freedom: 63 Total (i.e. Null); 59 Residual
Scale Link: identity Estimated Scale Parameters: [1] 3.364596
Correlation: Structure = ar1 Link = identity Estimated Correlation Parameters: alpha
0
Number of clusters: 63 Maximum cluster size: 1
aov(gee3) Call:
aov(formula = gee3)
Terms:
Sex Age Sex:Age Residuals
Sum of Squares 497.829 740.825 29.199 20583.576 Deg. of Freedom 1 1 1 59
Residual standard error: 18.67817 Estimated effects may be unbalanced
References
- SOCR Home page: http://www.socr.umich.edu
Translate this page: