Difference between revisions of "SMHS BigDataBigSci GCM"

From SOCR
Jump to: navigation, search
(Respiratory Illness GEE R example)
(See also)
 
(31 intermediate revisions by 3 users not shown)
Line 76: Line 76:
 
  # the z-values, the standardized parameter values, and returns a data frame
 
  # the z-values, the standardized parameter values, and returns a data frame
 
  fitted(fit4) # return the model-implied (fitted) covariance matrix (and mean vector) of a fitted model
 
  fitted(fit4) # return the model-implied (fitted) covariance matrix (and mean vector) of a fitted model
 
  
 
  # resid() function return (unstandardized) residuals of a fitted model including the difference between  
 
  # resid() function return (unstandardized) residuals of a fitted model including the difference between  
Line 91: Line 90:
  
 
(CFI) is an incremental measure directly based on the non-centrality measure.  If d = χ2(df) where df are the degrees of freedom of the model, the Comparative Fit Index is:
 
(CFI) is an incremental measure directly based on the non-centrality measure.  If d = χ2(df) where df are the degrees of freedom of the model, the Comparative Fit Index is:
\begin{equation}
+
$
 
\frac{(Null Model)-d(Proposed Model)}{d(Null Model)}.
 
\frac{(Null Model)-d(Proposed Model)}{d(Null Model)}.
\end{equation}
+
$
  
 
$0≤CFI≤1$ (by definition). It is interpreted as:
 
$0≤CFI≤1$ (by definition). It is interpreted as:
  
*$CFI<0.9$  - model fitting is poor.  
+
*$CFI<0.9$  - model fitting is poor.
  
 
*$0.9≤CFI≤0.95$  is considered marginal,  
 
*$0.9≤CFI≤0.95$  is considered marginal,  
  
*$CFI>0.95$  is good.
+
*$CFI>0.95$  is good.  
  
 
CFI is a relative index of model fit – it compare the fit of your model to the fit of (the worst) fitting null model.
 
CFI is a relative index of model fit – it compare the fit of your model to the fit of (the worst) fitting null model.
Line 225: Line 224:
 
<b>Note:</b> See discussion of SEM modeling pros/cons <sup>2</sup>.
 
<b>Note:</b> See discussion of SEM modeling pros/cons <sup>2</sup>.
  
===Generalized Estimating Equation (GEE) Modeling===
+
==Generalized Estimating Equation (GEE) Modeling==
  
Generalized Estimating Equations (GEE) modeling   is used for analyzing data with the following characteristics:
+
Generalized Estimating Equations (GEE) modeling<sup>3</sup> is used for analyzing data with the following characteristics:
 
(1) the observations within a group may be correlated, (2) observations in separate clusters are independent, (3) a monotone transformation of the expectation is linearly related to the explanatory variables, and (4) the variance is a function of the expectation. The expectation (#3) and the variance (# 4) are conditional given group-level or individual-level covariates.
 
(1) the observations within a group may be correlated, (2) observations in separate clusters are independent, (3) a monotone transformation of the expectation is linearly related to the explanatory variables, and (4) the variance is a function of the expectation. The expectation (#3) and the variance (# 4) are conditional given group-level or individual-level covariates.
  
 
GEE is applied to handle correlated discrete and continuous outcome variables. For the outcome variables, it only requires specification  of the first 2 moments and  correlation  among  them.    The  goal  is  to estimate fixed parameters    without    specifying    their    joint    distribution.  The correlation is specified by one of these 4 alternatives (which is specified in the R call: geeglm(outcome ~ center + treat + sex + baseline + age, data = respiratory, family = "binomial", id = id, <b>corstr = " exchangeable"</b>, scale.fix = TRUE):
 
GEE is applied to handle correlated discrete and continuous outcome variables. For the outcome variables, it only requires specification  of the first 2 moments and  correlation  among  them.    The  goal  is  to estimate fixed parameters    without    specifying    their    joint    distribution.  The correlation is specified by one of these 4 alternatives (which is specified in the R call: geeglm(outcome ~ center + treat + sex + baseline + age, data = respiratory, family = "binomial", id = id, <b>corstr = " exchangeable"</b>, scale.fix = TRUE):
  
[[Image:SMHS_BigDataBigSci8.png|300px]]
+
<center>[[Image:SMHS_BigDataBigSci8.png|300px]]</center>
  
 
===Respiratory Illness GEE R example===
 
===Respiratory Illness GEE R example===
Line 244: Line 243:
 
<b>Table 1</b>: Distribution of patients for <b>different response patterns</b> classified by <b>baseline-respiratory</b> response and <b>treatment</b>. The patterns are ordered according to increasing numbers of positive responses.
 
<b>Table 1</b>: Distribution of patients for <b>different response patterns</b> classified by <b>baseline-respiratory</b> response and <b>treatment</b>. The patterns are ordered according to increasing numbers of positive responses.
  
<mark>!!!!!Insert tables here</mark>
+
<center>
 +
{| class="wikitable" style="text-align:center; width:75%" border="1"
 +
|-
 +
! ||Visit|| colspan="15"| All Possible Response Patterns (2*2*2*2=16 permutation patterns)||
 +
|-
 +
|||1||0||1||0||0||0||1||1||1||0||0||1||1||1||0||1||
 +
|-
 +
|||2||0||0||1||0||0||1||0||0||1||0||1||1||0||1||1||
 +
|-
 +
|||3||0||0||0||1||0||0||1||0||1||1||1||0||1||1||1||
 +
|-
 +
|||4||0||0||0||0||1||0||0||1||0||1||0||1||1||1||1||
 +
|-
 +
!Baseline||Treatment||||||||||||||||||||||||||||||||Sum
 +
|-
 +
| rowspan="2"|0||A||7||2||2||2||1||0||1||0||1||0||1||2||0||4||7||30
 +
|-
 +
|P||18||1||0||2||1||2||0||0||1||0||0||1||2||0||3||31
 +
|-
 +
|rowspan="2"|1||A||0||0||0||0||0||0||1||1||0||0||4||0||1||0||17||24
 +
|-
 +
|P||1||4||1||0||0||0||0||1||1||3||1||1||2||1||10||26
 +
|-
 +
|Sum||||26||7||3||4||2||2||2||2||3||3||6||4||5||5||37||111
 +
|}
 +
</center>
 +
 
  
 
<b>Table 2</b>: Distribution of patients for the number of positive responses across the 4 visits for <b>Sex</b> and <b>Center</b>.  
 
<b>Table 2</b>: Distribution of patients for the number of positive responses across the 4 visits for <b>Sex</b> and <b>Center</b>.  
 +
 +
<center>
 +
{| class="wikitable" style="text-align:center; width:75%" border="1"
 +
|-
 +
! colspan="2" rowspan="2"| ||colspan="5"|Number of positive responses
 +
|-
 +
| 0||1||2||3||4
 +
|-
 +
|rowspan="2"|Sex || F||7||3||3||3||7
 +
|-
 +
|M||19||13||9||17||30
 +
|-
 +
|rowspan="2"|Center|| 1||18||9||6||11||12
 +
|-
 +
|2||8||7||6||9||25
 +
|}
 +
</center>
  
 
<b>Figure 1</b> shows a plot of age against the proportion of positive responses for each patient. It indicates a quadratic relationship between the proportions and the age. Fitting a logistic model to the data (which would be appropriate if there were <i>no time effects</i> and <i>no spread in the response probabilities</i> for patients with the same covariate values).
 
<b>Figure 1</b> shows a plot of age against the proportion of positive responses for each patient. It indicates a quadratic relationship between the proportions and the age. Fitting a logistic model to the data (which would be appropriate if there were <i>no time effects</i> and <i>no spread in the response probabilities</i> for patients with the same covariate values).
Line 378: Line 420:
 
  # GEE modeling: R function arguments/options
 
  # GEE modeling: R function arguments/options
  
<b>corstr</b>= for defining the correlation structure within groups in a GEE model
+
*<b>corstr</b>= for defining the correlation structure within groups in a GEE model
  
<b>id</b>= is used to identify the grouping variable in a GEE model
+
*<b>id</b>= is used to identify the grouping variable in a GEE model
  
<b>scale.fix</b>= when TRUE causes the scale parameter to be fixed (by default at 1) rather than estimated
+
*<b>scale.fix</b>= when TRUE causes the scale parameter to be fixed (by default at 1) rather than estimated
  
<b>waves</b>= names a positive integer-valued variable that is used to identify the order and spacing of observations within groups in a GEE model. This argument is crucial when there are missing values and gaps in the data
+
*<b>waves</b>= names a positive integer-valued variable that is used to identify the order and spacing of observations within groups in a GEE model. This argument is crucial when there are missing values and gaps in the data
  
 
  gee.model1 <- <b>geeglm</b>(outcome ~ center + treat + sex + baseline + age, data = respiratory, family = "binomial", id = id, corstr = "exchangeable", scale.fix = TRUE)
 
  gee.model1 <- <b>geeglm</b>(outcome ~ center + treat + sex + baseline + age, data = respiratory, family = "binomial", id = id, corstr = "exchangeable", scale.fix = TRUE)
Line 392: Line 434:
 
  summary(gee.model1)
 
  summary(gee.model1)
  
  # To test the effect of treatment using anova()
+
  # To test the effect of ''treatment'' using anova()
 
  gee.model1 <- <b>geeglm</b>(outcome ~ center + <b><u>treat</u></b> + sex + baseline + age, data = respiratory, family=binomial(link="logit"), id = id, corstr = "exchangeable", std.err="san.se")
 
  gee.model1 <- <b>geeglm</b>(outcome ~ center + <b><u>treat</u></b> + sex + baseline + age, data = respiratory, family=binomial(link="logit"), id = id, corstr = "exchangeable", std.err="san.se")
 
  gee.model2 <- geeglm(outcome ~ center + sex + baseline + age, data = respiratory, family=binomial(link="logit"), id=id, corstr = "exchangeable", std.err="san.se")
 
  gee.model2 <- geeglm(outcome ~ center + sex + baseline + age, data = respiratory, family=binomial(link="logit"), id=id, corstr = "exchangeable", std.err="san.se")
Line 403: Line 445:
 
  # individual predictors with the tests carried out in the order the predictors are listed in the model formula.
 
  # individual predictors with the tests carried out in the order the predictors are listed in the model formula.
 
  anova(gee.model1)
 
  anova(gee.model1)
 
<sup>3</sup>http://www.jstatsoft.org/v15/i02/
 
<sup>4</sup>https://books.google.com/books?id=mdEqBgAAQBAJ
 
  
 
===PD GEE example===
 
===PD GEE example===
Line 451: Line 490:
 
  CI2
 
  CI2
  
Appendix
+
==Appendix==
  
 
SEM References
 
SEM References
  
http://socserv.mcmaster.ca/jfox/Misc/sem/SEM-paper.pdf  
+
*http://socserv.mcmaster.ca/jfox/Misc/sem/SEM-paper.pdf  
  
 
GEE References
 
GEE References
  
https://cran.r-project.org/web/packages/geepack/geepack.pdf
+
*https://cran.r-project.org/web/packages/geepack/geepack.pdf
http://www.jstatsoft.org/v15/i02/paper
+
 
 +
*http://www.jstatsoft.org/v15/i02/paper
 +
 
 +
===Footnotes===
 +
 
 +
*<sup>2</sup> http://www.imachordata.com/ecological-sems-and-composite-variables-what-why-and-how/
 +
*<sup>3</sup> http://www.jstatsoft.org/v15/i02/
 +
*<sup>4</sup> https://books.google.com/books?id=mdEqBgAAQBAJ
  
 
==See also==
 
==See also==
 
* [[SMHS_BigDataBigSci| Back to Model-based Analytics]]  
 
* [[SMHS_BigDataBigSci| Back to Model-based Analytics]]  
 
* [[SMHS_BigDataBigSci_SEM| Structural Equation Modeling (SEM)]]  
 
* [[SMHS_BigDataBigSci_SEM| Structural Equation Modeling (SEM)]]  
* [[SMHS_BigDataBigSci_GEE| Next Section: Generalized Estimating Equation (GEE) Modeling]]
+
 
  
 
<hr>
 
<hr>

Latest revision as of 15:00, 23 May 2016

Model-based Analytics - Growth Curve Models

Latent growth curve models may be used to analyze longitudinal or temporal data where the outcome measure is assessed on multiple occasions, and we examine its change over time, e.g., the trajectory over time can be modeled as a linear or quadratic function. Random effects are used to capture individual differences by conveniently representing (continuous) latent variables, aka growth factors. To fit a linear growth model we may specify a model with two latent variables: a random intercept, and a random slope:

#load data   05_PPMI_top_UPDRS_Integrated_LongFormat.csv ( dim(myData) 661  71), wide 
# setwd("/dir/")
myData <- read.csv("https://umich.instructure.com/files/330395/download?download_frd=1&verifier=v6jBvV4x94ka3EYcGKuXXg5BZNaOLBVp0xkJih0H",header=TRUE)
attach(myData)
# dichotomize the "ResearchGroup" variable
table(myData$\$$ResearchGroup)
 myData$\$$ResearchGroup <- ifelse(myData$\$$ResearchGroup == "Control", 1, 0)

 # linear growth model with 4 timepoints
 # intercept (i) and slope (s) with fixed coefficients
 # i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4 (intercept/constant)
 # s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4  (slope/linear term)
 # ??? =~ 0*t1 + 1*t2 + 2*t3 + 3*t4  (quadratic term)

In this model, we have fixed all the coefficients of the linear growth functions:

 model4 <-
 ' 
 i =~ 1*UPDRS_Part_I_Summary_Score_Baseline + 1*UPDRS_Part_I_Summary_Score_Month_03 + 
 1*UPDRS_Part_I_Summary_Score_Month_06 + 1*UPDRS_Part_I_Summary_Score_Month_09 + 
 1*UPDRS_Part_I_Summary_Score_Month_12 + 1*UPDRS_Part_I_Summary_Score_Month_18 + 
 1*UPDRS_Part_I_Summary_Score_Month_24 + 
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline + 
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03 + 
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06 + 
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09 + 
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12 + 
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18 +
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24 + 
 1*UPDRS_Part_III_Summary_Score_Baseline + 1*UPDRS_Part_III_Summary_Score_Month_03 + 
 1*UPDRS_Part_III_Summary_Score_Month_06 + 1*UPDRS_Part_III_Summary_Score_Month_09 + 
 1*UPDRS_Part_III_Summary_Score_Month_12 + 1*UPDRS_Part_III_Summary_Score_Month_18 + 
 1*UPDRS_Part_III_Summary_Score_Month_24 + 
 1*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline + 
 1*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06 + 
 1*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12 + 
 1*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24 + 
 1*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline + 
 1*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06 +
 1*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12 + 
 1*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24 
 s =~ 0*UPDRS_Part_I_Summary_Score_Baseline + 1*UPDRS_Part_I_Summary_Score_Month_03 + 
 2*UPDRS_Part_I_Summary_Score_Month_06 + 3*UPDRS_Part_I_Summary_Score_Month_09 + 
 4*UPDRS_Part_I_Summary_Score_Month_12 + 5*UPDRS_Part_I_Summary_Score_Month_18 + 
 6*UPDRS_Part_I_Summary_Score_Month_24 +
 0*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline + 
 1*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03 + 
 2*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06 + 
 3*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09 + 
 4*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12 + 
 5*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18 +         
 6*UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24 + 
 0*UPDRS_Part_III_Summary_Score_Baseline + 1*UPDRS_Part_III_Summary_Score_Month_03 + 
 2*UPDRS_Part_III_Summary_Score_Month_06 + 3*UPDRS_Part_III_Summary_Score_Month_09 + 
 4*UPDRS_Part_III_Summary_Score_Month_12 + 5*UPDRS_Part_III_Summary_Score_Month_18 + 
 6*UPDRS_Part_III_Summary_Score_Month_24 + 
 0*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline + 
 2*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06 + 
 4*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12 +
 6*X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24 +
 0*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline + 
 2*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06 + 
 4*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12 + 
 6*X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24
 '

 fit4 <- growth(model4, data=myData)
 summary(fit4)
 parameterEstimates(fit4)	# extracts the values of the estimated parameters, the standard errors, 
 # the z-values, the standardized parameter values, and returns a data frame	
 fitted(fit4)	# return the model-implied (fitted) covariance matrix (and mean vector) of a fitted model

 # resid() function return (unstandardized) residuals of a fitted model including the difference between 
 # the observed and implied covariance matrix and mean vector
 resid(fit4)

=='"`UNIQ--h-1--QINU`"'Measures of model quality (Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA))==

 # report the fit measures as a signature vector: Comparative Fit Index (CFI), Root Mean Square Error of 
 # Approximation (RMSEA)
 fitMeasures(fit4, c("cfi", "rmsea", "srmr"))

===='"`UNIQ--h-2--QINU`"'Comparative Fit Index====

(CFI) is an incremental measure directly based on the non-centrality measure.  If d = χ2(df) where df are the degrees of freedom of the model, the Comparative Fit Index is:
$
\frac{(Null Model)-d(Proposed Model)}{d(Null Model)}.
$

$0≤CFI≤1$ (by definition). It is interpreted as:

*$CFI<0.9$  - model fitting is poor.

*$0.9≤CFI≤0.95$  is considered marginal, 

*$CFI>0.95$   is good. 

CFI is a relative index of model fit – it compare the fit of your model to the fit of (the worst) fitting null model.

===='"`UNIQ--h-3--QINU`"'Root Mean Square Error of Approximation====
(RMSEA) - “Ramsey”

An absolute measure of fit based on the non-centrality parameter: 

$\sqrt{\frac{X^2-df}{df×(N - 1)}}$,

where N the sample size and df the degrees of freedom of the model.  If χ<sup>2</sup>  < df, then the RMSEA∶=0.  It has a penalty for complexity via the chi square to df ratio.  The RMSEA is a popular measure of model fit. 

*RMSEA < 0.01, excellent, 

*RMSEA  < 0.05, good 

*RMSEA > 0.10 cutoff for poor fitting models

===='"`UNIQ--h-4--QINU`"'Standardized Root Mean Square Residual==== 
(SRMR) is an absolute measure of fit defined as the standardized difference between the observed correlation and the predicted correlation.  A value of zero indicates perfect fit.  The SRMR has no penalty for model complexity.  SRMR <0.08 is considered a good fit.

 # inspect the model results (report parameter table)
 inspect(fit4)

 #install.packages("semTools")
 # library("semTools")

<b><u>A Simpler Model (fit5)</u></b>

 model5 <- '
  # intercept and slope with fixed coefficients
 i =~ UPDRS_Part_I_Summary_Score_Baseline + UPDRS_Part_I_Summary_Score_Month_03 + UPDRS_Part_I_Summary_Score_Month_24
 s =~ 0*UPDRS_Part_I_Summary_Score_Baseline + 1*UPDRS_Part_I_Summary_Score_Month_03 + 6*UPDRS_Part_I_Summary_Score_Month_24
  # regressions
 i ~  R_fusiform_gyrus_Volume + Weight + ResearchGroup + Age + chr12_rs34637584_GT                                                              
 s ~ R_fusiform_gyrus_Volume + Weight + ResearchGroup + Age + chr12_rs34637584_GT
  # time-varying covariates
    UPDRS_Part_I_Summary_Score_Baseline ~ Weight
    UPDRS_Part_I_Summary_Score_Month_03  ~ ResearchGroup 
     UPDRS_Part_I_Summary_Score_Month_24 ~ Age
 '

 fit5 <- growth(model5, data=myData)
 summary(fit5); fitMeasures(fit5, c("cfi", "rmsea", "srmr"))
 parameterEstimates(fit5)	# extracts the values of the estimated parameters, the standard errors, 
 # the z-values, the standardized parameter values, and returns a data frame

 lavaan (0.5-18) converged normally after  99 iterations
  Number of observations                           661
  Estimator                                         ML
  Minimum Function Test Statistic                3.703
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.054
 Parameter estimates:
  Information                                 Expected
  Standard Errors                             Standard
                    Estimate  Std.err  Z-value  P(>|z|)
 Latent variables:
  i =~
    UPDRS_P_I_S_S     1.000
    UPDRS_P_I_S_S     1.074
    UPDRS_P_I_S_S     1.172
  s =~
    UPDRS_P_I_S_S     0.000
    UPDRS_P_I_S_S     1.000
    UPDRS_P_I_S_S     6.000
 
 Regressions:
  i ~
    R_fsfrm_gyr_V     0.000
    Weight            0.003
    ResearchGroup    -0.880
    Age              -0.009
    c12_34637584_    -0.907
  s ~
    R_fsfrm_gyr_V    -0.000
    Weight           -0.000
    ResearchGroup    -0.084
    Age               0.002
    c12_34637584_    -0.047
  UPDRS_Part_I_Summary_Score_Baseline ~
    Weight           -0.000
  UPDRS_Part_I_Summary_Score_Month_03 ~
    ResearchGroup     0.693
  UPDRS_Part_I_Summary_Score_Month_24 ~
    Age              -0.002
 
 Covariances:
  i ~~
    s                 0.074
 
 Intercepts:
    UPDRS_P_I_S_S     0.000
    UPDRS_P_I_S_S     0.000
    UPDRS_P_I_S_S     0.000
    i                 1.633
    s                -0.023
 
 Variances:
    UPDRS_P_I_S_S     1.017
    UPDRS_P_I_S_S     1.093
    UPDRS_P_I_S_S     2.993
    i                 1.019
    s                -0.025
 
  <b>cfi rmsea  srmr</b>
 <b>0.996 0.064 0.008</b>

 fitted(fit5)	# return the model-implied (fitted) covariance matrix (and mean vector) of a fitted model
 # write.table(fitted(fit5), file="C:\\Users\\Dinov\\Desktop\\test1.txt")

 # resid() function return (unstandardized) residuals of a fitted model including the difference between 
 # the observed and implied covariance matrix and mean vector
 resid(fit5)

 # report the fit measures as a signature vector
 fitMeasures(fit5, c("cfi", "rmsea", "srmr"))   # comparative fit index (CFI)

 # inspect the model results (report parameter table)
 inspect(fit5)

<b>Note:</b> See discussion of SEM modeling pros/cons <sup>2</sup>.

=='"`UNIQ--h-5--QINU`"'Generalized Estimating Equation (GEE) Modeling==

Generalized Estimating Equations (GEE) modeling<sup>3</sup> is used for analyzing data with the following characteristics:
(1) the observations within a group may be correlated, (2) observations in separate clusters are independent, (3) a monotone transformation of the expectation is linearly related to the explanatory variables, and (4) the variance is a function of the expectation. The expectation (#3) and the variance (# 4) are conditional given group-level or individual-level covariates.

GEE is applied to handle correlated discrete and continuous outcome variables. For the outcome variables, it only requires specification   of the first 2 moments and   correlation   among   them.    The   goal   is   to estimate fixed parameters    without    specifying    their    joint    distribution.  The correlation is specified by one of these 4 alternatives (which is specified in the R call: geeglm(outcome ~ center + treat + sex + baseline + age, data = respiratory, family = "binomial", id = id, <b>corstr = " exchangeable"</b>, scale.fix = TRUE):

<center>[[Image:SMHS_BigDataBigSci8.png|300px]]</center>

==='"`UNIQ--h-6--QINU`"'Respiratory Illness GEE R example===

This example is based on a data set on respiratory illness <sup>4</sup> and the <b>geepack</b> package. The data is from a clinical study of the treatment effects on patients with respiratory illness. N=111 patients from 2 clinical centers randomized to receive either placebo or active treatments. 4 temporal examinations assessed the <b>respiratory state</b> of patients as good (=1) or poor (=0). Explanatory variables characterizing a patient were: <b>center</b> (1,2), treatment (A=active, P=placebo), <b>sex</b> (M=male, F=female), <b>age</b> (in years) at baseline. The values of the covariates were constant for the repeated elementary observations on each patient.

<b>Table 1</b> shows the number of patients for the response patterns across the 4 visits split by baseline-status and treatment. Baseline respiratory status = 0 appear to have either low or high number of positive responses. Baseline respiratory status = 1 tend to respond positively. <b>Table 2</b> describes the distribution of the number of positive responses per patient for sex and center.

 # library("geepack")

<b>Table 1</b>: Distribution of patients for <b>different response patterns</b> classified by <b>baseline-respiratory</b> response and <b>treatment</b>. The patterns are ordered according to increasing numbers of positive responses.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! ||Visit|| colspan="15"| All Possible Response Patterns (2*2*2*2=16 permutation patterns)||
|-
|||1||0||1||0||0||0||1||1||1||0||0||1||1||1||0||1||
|-
|||2||0||0||1||0||0||1||0||0||1||0||1||1||0||1||1||
|-
|||3||0||0||0||1||0||0||1||0||1||1||1||0||1||1||1||
|-
|||4||0||0||0||0||1||0||0||1||0||1||0||1||1||1||1||
|-
!Baseline||Treatment||||||||||||||||||||||||||||||||Sum
|-
| rowspan="2"|0||A||7||2||2||2||1||0||1||0||1||0||1||2||0||4||7||30
|-
|P||18||1||0||2||1||2||0||0||1||0||0||1||2||0||3||31
|-
|rowspan="2"|1||A||0||0||0||0||0||0||1||1||0||0||4||0||1||0||17||24
|-
|P||1||4||1||0||0||0||0||1||1||3||1||1||2||1||10||26
|-
|Sum||||26||7||3||4||2||2||2||2||3||3||6||4||5||5||37||111
|}
</center>


<b>Table 2</b>: Distribution of patients for the number of positive responses across the 4 visits for <b>Sex</b> and <b>Center</b>. 

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! colspan="2" rowspan="2"| ||colspan="5"|Number of positive responses
|-
| 0||1||2||3||4
|-
|rowspan="2"|Sex || F||7||3||3||3||7
|-
|M||19||13||9||17||30
|-
|rowspan="2"|Center|| 1||18||9||6||11||12
|-
|2||8||7||6||9||25
|}
</center>

<b>Figure 1</b> shows a plot of age against the proportion of positive responses for each patient. It indicates a quadratic relationship between the proportions and the age. Fitting a logistic model to the data (which would be appropriate if there were <i>no time effects</i> and <i>no spread in the response probabilities</i> for patients with the same covariate values).

 # install.packages("geepack")
 library("geepack")

 # data include a clinical trial of 111 patients with respiratory illness from two different clinics were randomized to receive either 
 # placebo (P) or an active (A) treatment. Patients were examined at baseline and at four visits during treatment. 
 # At each examination, respiratory status (categorized as 1 = good, 0 = poor)
 data("respiratory")
 head(respiratory)
 myData <- respiratory

<center>head(myData)
{| class="wikitable" style="text-align:center; " border="1"
|-
|||Center||ID||Treat||Sex||Age||Baseline||Visit||Outcome
|-
|1 ||1||1||P||M||46||0||1||0
|-
|2 ||1||1||P||M||46||0||2||0
|-
|3 ||1||1||P||M||46||0||3||0
|-
|4 ||1||1||P||M||46||0||4||0
|-
|5||1||2||P||M||28||0||1||0
|-
|6||1||2||P||M||28||0||2||0
|}
</center>

 # Get proportions of positive responses
 responses <- factor(myData$\$$outcome, labels = c("OutcomePositive", "OutcomeNegative"))
data.frame <- data.frame(responses, myData$\$$age)
 head(data.frame)
 tab <- prop.table(table(data.frame), 1); tab	# compute proportions
 sum(tab[1,])				# check proportions (sums to 1.0)?
 prop <- tab[1,]				# save the proportions of positive responses for each patient
 plot(as.numeric(dimnames(tab)$\$$myData.age), tab[1,], xlab = "Age", ylab = "Proportion of Positive Outcomes")
# dimnames(tab)				# to see/inspect positive/negative outcomes

SMHS BigDataBigSci9.png

x <- as.numeric(dimnames(tab)$\$$myData.age)
poly <- loess( prop ~ x)	# fit a Local Polynomial Regression Fitting
plot(x, prop)
lines(predict(poly), col='red', lwd=2)
smoothingSpline <- smooth.spline(x, prop, spar=0.6)
plot(x, prop)
lines(smoothingSpline, col='red', lwd=1.5)
smoothPolySpline <- smooth.spline(x, predict(poly), spar=0.6)
lines(smoothPolySpline, col='blue', lwd=2)
legend("topright", inset=.05, title="Polynomial regression models",  c("Raw Poly","Smooth Poly"), fill=c('red', 'blue'), horiz=TRUE)

SMHS BigDataBigSci10.png

model.glm <- glm(outcome ~ baseline + center + sex + treat + age + I(age^2), data = respiratory, family = binomial)
summary(model.glm)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5951 -0.9108 0.4034 0.8336 2.0951
Coefficients:
Estimate Std. Error z value z|)$
(Intercept) 3.3579727 1.0285292 3.265 0.0011 **
baseline 1.8850421 0.2482959 7.592 3.15e-14 ***
center 0.5099244 0.2453982 2.078 0.0377 *
sexM -0.4510595 0.3166570 -1.424 0.1543
Treatp -1.3231587 0.2431603 -5.442 5.28e-08 ***
age -0.2072815 0.0472538 -4.387 1.15e-05 ***
I(age^2) 0.0025650 0.0006324 4.056 4.99e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 609.41 on 443 degrees of freedom

Residual deviance: 468.62 on 437 degrees of freedom

AIC: 482.62

The correlation matrix of the of the outcome measures across visits is shown in Table 3.

attach(myData)
mat1 <- matrix(c(outcome[visit==1], outcome [visit==2], outcome [visit==3], 

outcome[visit==4]), ncol = 4)

cor(mat1)

Table 3: Correlation matrix for the outcome measurements at different visits.

Coefficients:
[,1] [,2] [,3] [,4]
[,1] 1.0000000 0.5087944 0.4431438 0.5139016
[,2] 0.5087944 1.0000000 0.5821877 0.5301611
[,3] 0.4431438 0.5821877 1.0000000 0.5871276
[,4] 0.5139016 0.5301611 0.5871276 1.0000000
# We can also examine for multicollinearity problem, using the correlation matrix for X
cor(model.matrix(model.glm)[,-1])
# GEE modeling: R function arguments/options
  • corstr= for defining the correlation structure within groups in a GEE model
  • id= is used to identify the grouping variable in a GEE model
  • scale.fix= when TRUE causes the scale parameter to be fixed (by default at 1) rather than estimated
  • waves= names a positive integer-valued variable that is used to identify the order and spacing of observations within groups in a GEE model. This argument is crucial when there are missing values and gaps in the data
gee.model1 <- geeglm(outcome ~ center + treat + sex + baseline + age, data = respiratory, family = "binomial", id = id, corstr = "exchangeable", scale.fix = TRUE)
# The column labeled Wald in the summary table is the square of the z-statistic. The reported p-values are the 
# upper tailed probabilities from a chisq1 distribution and test whether the true parameter value ≠0.
summary(gee.model1)
# To test the effect of treatment using anova()
gee.model1 <- geeglm(outcome ~ center + treat + sex + baseline + age, data = respiratory, family=binomial(link="logit"), id = id, corstr = "exchangeable", std.err="san.se")
gee.model2 <- geeglm(outcome ~ center + sex + baseline + age, data = respiratory, family=binomial(link="logit"), id=id, corstr = "exchangeable", std.err="san.se")
anova(gee.model1, gee.model2)
# To test whether a categorical predictor with more than two levels should be retained in a GEE model we need 
# to test the entire set of dummy variables simultaneously as a single construct. 
# The geepack package provides a method for the anova function for a multivariate Wald test
# When the anova function is applied to a single geeglm object it returns sequential Wald tests for 
# individual predictors with the tests carried out in the order the predictors are listed in the model formula.
anova(gee.model1)

PD GEE example

This example used the PPMI/PD data to show GEE analysis.

# 05_PPMI_top_UPDRS_Integrated_LongFormat1.csv
longData <- read.csv("https://umich.instructure.com/files/330397/download?download_frd=1",header=TRUE)
# library("geepack")
# Data Elements: FID_IID	L_insular_cortex_ComputeArea	L_insular_cortex_Volume	R_insular_cortex_ComputeArea	R_insular_cortex_Volume	L_cingulate_gyrus_ComputeArea	L_cingulate_gyrus_Volume	 R_cingulate_gyrus_ComputeArea	R_cingulate_gyrus_Volume	L_caudate_ComputeArea	L_caudate_Volume	R_caudate_ComputeArea	R_caudate_Volume	L_putamen_ComputeArea	L_putamen_Volume	R_putamen_ComputeArea	 R_putamen_Volume	Sex	Weight	ResearchGroup	Age	chr12_rs34637584_GT	chr17_rs11868035_GT	chr17_rs11012_GT	chr17_rs393152_GT	chr17_rs12185268_GT	chr17_rs199533_GT	UPDRS_part_I	 UPDRS_part_II	UPDRS_part_III	time_visit
dim(longData) 
data1 = na.omit(longData)
attach(data1)
ControlGroup <- ifelse(ResearchGroup == "Control", 1, 0)
# these calculations take a long time!!!
# if you get “Error in geese.fit(xx, yy, id, offset, soffset, w, waves = waves, zsca,  : 
# nrow(zsca) and length(y) not match” – this indicates some of the variables are of different lengths
# if you get “glm.fit: algorithm did not converge” – see this discussion: http://goo.gl/lrjBjB 
gee.model0 <- geeglm(ControlGroup ~ L_insular_cortex_ComputeArea+L_insular_cortex_Volume+ Sex + Weight + Age + chr17_rs11012_GT + chr17_rs199533_GT + UPDRS_part_I + UPDRS_part_II + time_visit, data = data1,  family=binomial(link="logit"), id = FID_IID, corstr = "unstructured", std.err="san.se")
gee.model1 <- geeglm(ControlGroup ~ L_insular_cortex_ComputeArea+L_insular_cortex_Volume+ R_putamen_ComputeArea + R_putamen_Volume + Sex + Weight + Age + chr17_rs11012_GT + chr17_rs199533_GT + UPDRS_part_I + UPDRS_part_II +   time_visit, data = data1, family=binomial(link="logit"), id = FID_IID, corstr = "unstructured", std.err="san.se")
# compare 2 gee models
# anova(gee.model0,gee.model1)
# you can try the “family = poisson(link = "log")” model for the ResearchGroup response, as well
gee.model2 <- geeglm(ControlGroup 
~ L_insular_cortex_ComputeArea+L_insular_cortex_Volume+R_insular_cortex_ComputeArea+ R_insular_cortex_Volume +L_cingulate_gyrus_ComputeArea + L_cingulate_gyrus_Volume + R_cingulate_gyrus_ComputeArea  + R_cingulate_gyrus_Volume +  L_caudate_ComputeArea + L_caudate_Volume + R_caudate_ComputeArea + R_caudate_Volume + L_putamen_ComputeArea + L_putamen_Volume + R_putamen_ComputeArea + R_putamen_Volume + Sex + Weight + Age +  chr12_rs34637584_GT +   chr17_rs11868035_GT + chr17_rs11012_GT + chr17_rs393152_GT + chr17_rs12185268_GT + chr17_rs199533_GT + UPDRS_part_I + UPDRS_part_II + time_visit, data = data1, family=binomial(link="logit"), id = FID_IID, corstr  = "unstructured",  std.err="san.se")

Remember that we do not interpret GEE coefficients as relating to individuals – GEE models are marginal models and the conclusions drawn are interpreted as population-based. Also, the time element in the model (time_visit) is just another controlling factor. The effect-sizes (betas) associated with each variable/predictor represent the slopes associated with the corresponding covariate, while holding time constant. If we need to examine interactions (e.g., Weight change over Time), we need to include an interaction term in model: (i.e. + Weight*time_visit).

summary (gee.model2)
# Individual Wald test and confidence intervals for each covariate
predictors2 <- coef(summary(gee.model2))
CI2 <- with(as.data.frame(predictors2), cbind(lwr=Estimate-1.96*Std.err, est=Estimate, upr=Estimate+1.96*Std.err))
rownames(CI2) <- rownames(predictors2)
CI2

Appendix

SEM References

GEE References

Footnotes

See also





Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif