Difference between revisions of "SMHS BigDataBigSci"

From SOCR
(Generalized Estimating Equation (GEE) Modeling)
 
(40 intermediate revisions by 3 users not shown)
Line 1: Line 1:
==[[SMHS| Scientific Methods for Health Sciences]] - Structural Equation Modeling (SEM) and Generalized Estimating Equation (GEE) Modeling ==
+
==[[SMHS| Scientific Methods for Health Sciences]] - Model-based Analyses ==
  
<b>Questions</b>
+
Structural Equation Modeling (SEM), Growth Curve Models (GCM), and Generalized Estimating Equation (GEE) Modeling
  
• How to represent dependencies in linear models and examine causal effects?
+
==Questions ==
  
Is there a way to study population average effects of a covariate against specific individual effects?
+
*How to represent dependencies in linear models and examine causal effects?
 +
*Is there a way to study population average effects of a covariate against specific individual effects?
  
<b>Overview</b>
+
==Overview==
  
 
SEM allows re-parameterization of random effects to specify latent variables that may affect measures at different time points using structural equations. SEM diagrams show variables having predictive (possibly causal) effects on other variables (denoted by arrows), where coefficients index the strength and direction of predictive relations. SEM does not offer much more than what classical regression methods do, but it does allow simultaneous estimation of multiple equations modeling complementary relations.
 +
 +
Growth Curve (or latent growth) modeling is a statistical technique employed in SEM for estimating growth trajectories for longitudinal data (over time). It represents repeated measures of dependent variables as functions of time and other covariates. When subjects or units are observed repeatedly at known time points, latent growth curve models reveal the trend of an individual as a function of an underlying growth process, where the growth curve parameters can be estimated for each subject/unit.
  
 
GEE is a marginal longitudinal method that directly assesses the mean relations of interest (i.e., how the mean dependent variable changes over time) while accounting for covariances among the observations within subjects, which yields better estimates and valid significance tests of the relations. Thus, GEE estimates two different equations: (1) for the mean relations, and (2) for the covariance structure. An advantage of GEE over random-effect models is that it does not require the dependent variable to be normally distributed. However, a disadvantage of GEE is that it is less flexible and versatile: commonly employed algorithms for it require a small-to-moderate number of time points that are evenly (or approximately evenly) spaced, and similarly spaced across subjects. Nevertheless, it is a little more flexible than repeated-measures ANOVA because it permits some missing values and offers a straightforward way to test for and model the specific form of autocorrelation within subjects.
Line 15: Line 18:
 
GEE is mostly used when the study focuses on the population-average effect of a covariate rather than the individual-specific effect. These two effects are equivalent only for linear models, not for non-linear models.
  
For instance, suppose $Y_{i,j}$, the $j^{th}$ observation of the $i^{th}$ subject, follows a random effects <b>logistic model</b>, then

$
\log\Bigg(\frac{p_{i,j}}{1-p_{i,j}} \Bigg)=μ+ν_i,
$

where $ν_i \sim N(0,σ^2)$ is a random effect for <u>subject i</u> and $p_{i,j}=P(Y_{i,j}=1|ν_i).$
  
(1) When using a random effects model on such data, the estimate of μ accounts for the fact that a mean zero normally distributed perturbation was applied to each individual, making it individual-specific.
+
(1) When using a random effects model on such data, the estimate of μ accounts for the fact that a mean zero normally distributed perturbation was applied to each individual, making it ''individual-specific''.
  
 
(2) When using a GEE model on the same data, we estimate the <i>population average log odds</i>,
  
\begin{equation}
δ=\log\Bigg(\frac{E_ν\Big(\frac{1}{1+e^{-(μ+ν_i)}}\Big)}{1-E_ν\Big(\frac{1}{1+e^{-(μ+ν_i)}}\Big)}\Bigg),
\end{equation}

in general $μ≠δ$.

If $μ=1$ and $σ^2=1$, then $δ≈0.83$.
  
 
empirically:
Line 36: Line 49:
 
<b># theoretically</b>, if it can be computed:
  
The subject-specific log odds have expectation $E_ν\Big[\log\frac{p_{i,j}}{1-p_{i,j}}\Big]=μ=1$ (in this specific case), but the population average log odds
$δ=\log\Bigg[\frac{E_ν(p_{i,j})}{1-E_ν(p_{i,j})}\Bigg]$, where $p_{i,j}=P(Y_{i,j}=1|ν_i)$, would be $< 1$ <SUP>1</SUP>.

Note that this is related to the fact that a grand-total average need not be equal to an average of partial averages.
  
The mean of the $i^{th}$ person in the $j^{th}$ observation (e.g., location, time, etc.) can be expressed by:

$E(Y_{ij} | X_{ij},α_j)= g[μ(X_{ij}|β)+U_{ij}(α_j,X_{ij})]$,

where $μ(X_{ij}|β)$ is the average “response” of a person with the same covariates $X_{ij}$, $β$ is a set of fixed effect coefficients, $U_{ij}(α_j,X_{ij})$ is an error term that is a function of the (time, space) random effects, $α_j$, and of the covariates $X_{ij}$, with $E(U_{ij}(α_j,X_{ij})|X_{ij})=0$, and $g$ is the inverse of the '''link function''', $g^{-1}$, which specifies the regression type, e.g.,

*<u>linear</u>: $g^{-1} (u)=u,$
*<u>log</u>: $g^{-1} (u)= \log(u),$
*<u>logistic</u>: $g^{-1} (u)=\log(\frac{u}{1-u}).$

The link function, $g^{-1}(u)$, provides the relationship between the linear predictor and the mean of the distribution function. For practical applications there are many commonly used link functions. It makes sense to try to match the domain of the link function to the range of the distribution function's mean.

Structural Equation Modeling (SEM)

SEM is a general multivariate statistical analysis technique that can be used for causal modeling/inference, path analysis, confirmatory factor analysis (CFA), covariance structure modeling, and correlation structure modeling.

Advantages
 
 
• It allows testing models with multiple dependent variables
 
 
 
• Provides mechanisms for modeling mediating variables
 
 
 
• Enables modeling of error terms
 
 
 
• Facilitates modeling of challenging data  (longitudinal with auto-correlated errors, multi-level data, non-normal data, incomplete data)
 
 
 
SEM allows separation of observed and latent variables. Other standard statistical procedures may be viewed as special cases of SEM. Statistical significance is less important in SEM than in other techniques, and covariances are at the core of structural equation models.
 
 
 
Definitions
 
 
 
* The <b>disturbance</b>, <i>D</i>, is the variance in Y unexplained by a variable X that is assumed to affect Y. 
 
 
 
X    →    Y  ←    D
 
 
 
*<b>Measurement error</b>, <i>E</i>, is the variance in X unexplained by A, where X is an observed variable that is presumed to measure a latent variable, <i>A</i>.
 
 
 
A    →    X  ←    E
 
 
 
*Categorical variables in a model are <b>exogenous</b> (independent) or <b>endogenous</b> (dependent).
 
 
 
Notation
 
 
 
*In SEM, <b>observed (or manifest) indicators</b> are represented by <b>squares/rectangles</b>, whereas latent variables (or factors) are represented by circles/ovals.
 
 
 
[[Image:SMHS_BigDataBigSci1.png|500px]]
 
 
 
<mark> PLEASE FIX ARROWS *Relations: Direct effects (→), Reciprocal effects (<--> or  ), and Correlation or covariance ( ) all have different appearance in SEM models.</mark>
 
 
 
Model Components
 
 
 
The <b>measurement part</b> of SEM model deals with the latent variables and their indicators. A pure measurement model is a confirmatory factor analysis (CFA) model with unmeasured covariance (bidirectional arrows) between each possible pair of latent variables. There are <u>straight arrows from the latent variables to their respective indicators and straight arrows from the error and disturbance terms to their respective variables, but no direct effects (straight arrows) connecting the latent variables</u>. The <b>measurement model</b> is evaluated using goodness of fit measures (Chi-Square test, BIC, AIC, etc.) <b>Validation of the measurement model is always first.</b>
 
 
 
<b>Then we proceed to the structural model</b> (including a set of exogenous and endogenous variables together with the direct effects (straight arrows) connecting them along with the disturbance and error terms for these variables that reflect the effects of unmeasured variables not in the model).
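The two-step workflow described above (validate the measurement model first, then fit the structural model) can be sketched in R as follows. This is a minimal illustration and not part of the hands-on example in this section: it assumes the lavaan package and its bundled PoliticalDemocracy data, and the latent variable names (ind60, dem60, dem65) come from that dataset's standard example.

```r
library(lavaan)

# Step 1: measurement model only (CFA): latent variables and their indicators
measurement <- '
  ind60 =~ x1 + x2 + x3
  dem60 =~ y1 + y2 + y3 + y4
  dem65 =~ y5 + y6 + y7 + y8
'
fit.cfa <- cfa(measurement, data = PoliticalDemocracy)
# Goodness-of-fit measures used to validate the measurement model
fitMeasures(fit.cfa, c("chisq", "df", "cfi", "rmsea", "aic", "bic"))

# Step 2: add the structural part (direct effects among the latent variables)
structural <- paste(measurement, '
  dem60 ~ ind60
  dem65 ~ ind60 + dem60
')
fit.sem <- sem(structural, data = PoliticalDemocracy)
summary(fit.sem, fit.measures = TRUE)
```

Keeping the measurement syntax in its own string makes it easy to confirm that the indicators behave as intended before any direct effects are added.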
 
 
 
Notes
 
 
 
• Sample-size considerations: mostly same as for regression - more is always better
 
 
 
• Model assessment strategies: Chi-square test, Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), Tucker-Lewis Index (TLI), Goodness of Fit Index (GFI), AIC, and BIC.
 
 
 
• Choice for number of Indicator variables: depends on pilot data analyses, a priori concerns, fewer is better.
 
  
Hands-on Example 1 (School Kids Mental Abilities)
 
 
These data (Holzinger & Swineford 1939) include mental ability test scores of 7th- and 8th-grade children from two schools (Pasteur and Grant-White). This version of the dataset includes only 9 (out of the 26) tests. We can build and test a confirmatory factor analysis (CFA) SEM model for 3 correlated latent variables (or factors), each with three indicators:
 
 
 
o visual factor measured by 3 variables: x1, x2 and x3,
 
 
 
o textual factor measured by 3 variables: x4, x5 and x6,
 
 
 
o speed factor measured by 3 variables: x7, x8 and x9.
 
  
 +
<center>
{| class="wikitable" style="text-align:center; " border="1"
|-
|ID||lhs||op||rhs||user||free||ustart
|-
|1 ||Visual||=~||x1||1||0||1
|-
|2 ||Visual||=~||x2||1||1||NA
|-
|3 ||Visual||=~||x3||1||2||NA
|-
|4 ||Textual||=~||x4||1||0||1
|-
|5||Textual||=~||x5||1||3||NA
|-
|6||Textual||=~||x6||1||4||NA
|-
|7 ||Speed||=~||x7||1||0||1
|-
|8 ||Speed||=~||x8||1||5||NA
|-
|9 ||Speed||=~||x9||1||6||NA
|-
|10 ||x1||~~||x1||0||7||NA
|-
|11||x2||~~||x2||0||8||NA
|-
|12||x3||~~||x3||0||9||NA
|-
|13||x4||~~||x4||0||10||NA
|-
|14||x5||~~||x5||0||11||NA
|-
|15||x6||~~||x6||0||12||NA
|-
|16||x7||~~||x7||0||13||NA
|-
|17||x8||~~||x8||0||14||NA
|-
|18||x9||~~||x9||0||15||47.8
|-
|19||Visual||~~||Visual||0||16||NA
|-
|20||Textual||~~||Textual||0||17||NA
|-
|21||Speed||~~||Speed||boy||18||NA
|-
|22||Visual||~~||Textual||girl||19||NA
|-
|23||Visual||~~||Speed||girl||20||NA
|-
|24||Textual||~~||Speed||boy||21||NA
|}
</center>

<center>Common distributions with typical uses and canonical link functions</center>
<center>
{| class="wikitable" style="text-align:center; " border="1"
|-
|<b>Distribution</b> ||<b>Support of distribution</b>||<b>Typical uses</b>||<b>Link name</b>||<b>Link function</b>||<b>Mean function</b>
|-
|Normal||real: $(-\infty, +\infty)$||Linear-response data||Identity||$X\beta=\mu$||$\mu=X\beta$
|-
|Exponential, Gamma||real: $(0, +\infty)$||Exponential-response data, scale parameters||Inverse||$X\beta=-\mu^{-1}$||$\mu=-(X\beta)^{-1}$
|-
|Inverse Gaussian||real: $(0, +\infty)$|| ||Inverse squared||$X\beta=-\mu^{-2}$||$\mu=(-X\beta)^{-1/2}$
|}
</center>
  
===Footnotes===
*<sup>1</sup> http://www.researchgate.net/publication/41895248

==Model-based Analytics==

There are 3 latent variables (factors) in this model, each with 3 indicators, resulting in 9 factor loadings that need to be estimated. There are also 3 covariances among the latent variables (another three parameters).

These <b>12 parameters</b> are represented in the path diagram as single-headed and double-headed arrows, respectively. We also need to estimate the residual variances of the 9 observed variables and the variances of the 3 latent variables, resulting in <b>12 additional free parameters</b>. In total, we have <b>24 parameters.</b>

[[Image:SMHS_BigDataBigSci2.png|200px]]
 
 
To fully identify the model we need to set the metric of the latent variables. There are 2 ways to do this:
 
 
 
o for each latent variable, fix the factor loading of one of the indicators (typically the first) to a constant (e.g., 1.0), or
 
 
 
o standardize the variances of the 3 latent variables.
 
 
 
Either way, we fix 3 of these 24 parameters, and 21 parameters remain free.
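The parameter count above can be checked directly from a fitted model. The following is a small sketch, assuming the lavaan package is installed; the object names hs.model, hs.fit and pt are chosen here only for illustration.

```r
library(lavaan)

hs.model <- 'visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9'
hs.fit <- cfa(hs.model, data = HolzingerSwineford1939)

pt <- parTable(hs.fit)
n.total <- nrow(pt)             # all parameters listed in the parameter table
n.free  <- sum(pt$free > 0)     # parameters that are actually estimated
n.fixed <- sum(pt$free == 0)    # fixed parameters (the marker loadings)
c(total = n.total, free = n.free, fixed = n.fixed)

# Degrees of freedom: unique variances/covariances of the 9 indicators
# minus the number of free parameters
n.moments <- 9 * (9 + 1) / 2
c(moments = n.moments, df = n.moments - n.free)
```

With the default settings of cfa(), the three counts should reproduce the 24 total, 21 free and 3 fixed parameters described in the text.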
 
 
 
The <b>parTable(fit)</b> method generates this table output.



The `lhs', `op' and `rhs' columns define the parameters of the model.

All parameters with the <b><u>`=~'</u></b> operator are factor loadings, whereas all parameters with the <b><u>`~~'</u></b> operator are variances or covariances. Nonzero elements in the `free' column are the free parameters of the model. Zero elements in the `free' column correspond to fixed parameters, whose value is found in the `ustart' column.
 
 
 
Lavaan’s user-friendly model-specification approach is implemented in the fitting functions: cfa() and sem().
 
 
 
Since these data contain 3 latent variables, and no regressions, the minimalist syntax is:
 
 
 
<b>library(lavaan)

data.model <- 'visual =~ x1 + x2 + x3

textual =~ x4 + x5 + x6

speed =~ x7 + x8 + x9'</b>
 
 
 
Fit the CFA model:
 
 
 
<b>fit.1 <- cfa(data.model, data = HolzingerSwineford1939)</b>
 
 
 
The `user' column of the parameter table (parTable) shows which parameters were explicitly contained in the user-specified model syntax (= 1), and which parameters were added by the cfa() function (= 0).
 
<b>parTable(fit.1)</b>
 
 
 
 
 
If we prefer <b>not to fix the factor loadings</b> of the first indicator, but instead want to fix the variances of the latent variables, the model syntax would be changed to:
 
<b>fit.2 <- 'visual =~ NA*x1 + x2 + x3
 
textual =~ NA*x4 + x5 + x6
 
speed =~ NA*x7 + x8 + x9
 
visual ~~ 1*visual
 
textual ~~ 1*textual
 
speed ~~ 1*speed'</b>
 
 
 
More complex model specifications can be made using the full lavaan model syntax:
 
 
 
<b>fit.full <- ' # latent variables
 
visual =~ 1*x1 + x2 + x3
 
textual =~ 1*x4 + x5 + x6
 
speed =~ 1*x7 + x8 + x9
 
# residual variances observed variables
 
x1 ~~ x1
 
x2 ~~ x2
 
x3 ~~ x3
 
x4 ~~ x4
 
x5 ~~ x5
 
x6 ~~ x6
 
x7 ~~ x7
 
x8 ~~ x8
 
x9 ~~ x9
 
# factor variances
 
visual ~~ visual
 
textual ~~ textual
 
speed ~~ speed
 
# factor covariances
 
visual ~~ textual + speed
 
textual ~~ speed'
 
fit.3 <- lavaan(fit.full, data = HolzingerSwineford1939)</b>
 
 
 
We can specify the model where the first factor loadings are explicitly fixed to one, and the covariances among the factors are added manually.
 
 
 
<b>fit.mixed <- ' # latent variables
 
visual =~ 1*x1 + x2 + x3
 
textual =~ 1*x4 + x5 + x6
 
speed =~ 1*x7 + x8 + x9
 
# factor covariances
 
visual ~~ textual + speed
 
textual ~~ speed'
 
fit <- lavaan(fit.mixed, data = HolzingerSwineford1939, auto.var = TRUE)</b>
 
 
 
The best method to view results from a SEM fitted with lavaan is <b>summary()</b>, which can be called with optional arguments like fit.measures, standardized, and rsquare.
 
 
 
Core Lavaan Methods
 
 
 
<b>summary</b>() print a long summary of the model results
 
 
 
<b>show</b>() print a short summary of the model results
 
 
 
<b>coef</b>() returns the estimates of the free parameters in the model as a named numeric vector
 
 
 
<b>fitted</b>() returns the implied moments (covariance matrix and mean vector) of the model
 
 
 
<b>resid</b>() returns the raw, normalized or standardized residuals (difference between implied and observed moments)
 
 
 
<b>vcov</b>() returns the covariance matrix of the estimated parameters
 
 
 
<b>predict</b>() compute factor scores
 
 
 
<b>logLik</b>() returns the log-likelihood of the fitted model (if maximum likelihood estimation was used)
 
 
 
<b>AIC</b>(), BIC() compute information criteria (if maximum likelihood estimation is used)
 
 
 
<b>update</b>() update a fitted lavaan object
 
 
 
<b>inspect</b>() peek into the internal representation of the model; by default, it returns a list of model matrices counting the free parameters in the model; can also be used to extract starting values, gradient values, and much more
 
 
 
If the optional summary() arguments fit.measures, standardized, and rsquare are set to TRUE, the output includes additional fit measures, standardized estimates, and R<sup>2</sup> values for the dependent variables, respectively.
 
 
 
<b>fit.model <- 'visual =~ x1 + x2 + x3

textual =~ x4 + x5 + x6

speed =~ x7 + x8 + x9'

fit <- cfa(fit.model, data = HolzingerSwineford1939)

summary(fit, fit.measures = TRUE)



# multi-group (by sex) fits with the GLS estimator; stored under separate names so that fit keeps the single-group results

fit.gls <- cfa(fit.model, data=HolzingerSwineford1939, estimator="GLS", group="sex")

fit.4 <- cfa(fit.model, data=HolzingerSwineford1939, estimator="GLS", group="sex", group.equal="regressions")

anova(fit.gls, fit.4)</b>
 
 
 
The output consists of three sections.
 
 
 
* The <b>first section</b> (first 6 lines) contains the package version number, an indication whether the model has converged (and in how many iterations), and the effective number of observations used in the analysis.
 
 
*The <b>second section</b> contains the model χ<sup>2</sup> test statistic, its degrees of freedom, and a p value. If fit.measures = TRUE, it also prints the test statistic of the baseline model (where all observed variables are assumed to be uncorrelated) and several popular fit indices. If maximum likelihood estimation is used, this section will also contain information about the log-likelihood, the AIC, and the BIC.
 
 
*The <b>third section</b> provides an overview of the parameter estimates, including the type of standard errors used and whether the observed or expected information matrix was used to compute standard errors. Then, for each model parameter, the estimate and the standard error are displayed, and if appropriate, a z value based on the Wald test and a corresponding two-sided p value are also shown. To ease the reading of the parameter estimates, they are grouped into three blocks:
 
 
o factor loadings,
 
 
 
o factor covariances, and
 
 
 
o residual variances of both observed variables and factors.
 
 
 
 
 
The <b>summary</b>() method provides a nice summary of the model results for visualization purposes. The <b>parameterEstimates</b>() method returns the actual parameter estimates as a <b>data.frame</b>, which can be processed further.
 
 
 
<center><b>parameterEstimates(fit)</b>
 
{| class="wikitable" style="text-align:center; " border="1"
 
|-
 
|Index||lhs||op||rhs||est||se||z||pvalue||ci.lower||ci.upper
 
|-
 
|1||visual||=~||x1||1||0||NA||NA||1||1
 
|-
 
|2||visual||=~||x2||0.553||0.1||5.554||0||0.358||0.749
 
|-
 
|3||visual||=~||x3||0.729||0.109||6.685||0||0.516||0.943
 
|-
 
|4||textual||=~||x4||1||0||NA||NA||1||1
 
|-
 
|5||textual||=~||x5||1.113||0.065||17.014||0||0.985||1.241
 
|-
 
|6||textual||=~||x6||0.926||0.055||16.703||0||0.817||1.035
 
|-
 
|7||speed||=~||x7||1||0||NA||NA||1||1
 
|-
 
|8||speed||=~||x8||1.18||0.165||7.152||0||0.857||1.503
 
|-
 
|9||speed||=~||x9||1.082||0.151||7.155||0||0.785||1.378
 
|-
 
|10||x1||~~||x1||0.549||0.114||4.833||0||0.326||0.772
 
|-
 
|11||x2||~~||x2||1.134||0.102||11.146||0||0.934||1.333
 
|-
 
|12||x3||~~||x3||0.844||0.091||9.317||0||0.667||1.022
 
|-
 
|13||x4||~~||x4||0.371||0.048||7.779||0||0.278||0.465
 
|-
 
|14||x5||~~||x5||0.446||0.058||7.642||0||0.332||0.561
 
|-
 
|15||x6||~~||x6||0.356||0.043||8.277||0||0.272||0.441
 
|-
 
|16||x7||~~||x7||0.799||0.081||9.823||0||0.64||0.959
 
|-
 
|17||x8||~~||x8||0.488||0.074||6.573||0||0.342||0.633
 
|-
 
|18||x9||~~||x9||0.566||0.071||8.003||0||0.427||0.705
 
|-
 
|19||visual||~~||visual||0.809||0.145||5.564||0||0.524||1.094
 
|-
 
|20||textual||~~||textual||0.979||0.112||8.737||0||0.76||1.199
 
|-
 
|21||speed||~~||speed||0.384||0.086||4.451||0||0.215||0.553
 
|-
 
|22||visual||~~||textual||0.408||0.074||5.552||0||0.264||0.552
 
|-
 
|23||visual||~~||speed||0.262||0.056||4.66||0||0.152||0.373
 
|-
 
|24||textual||~~||speed||0.173||0.049||3.518||0||0.077||0.27
 
 
 
|}
 
</center>
 
 
 
The confidence level can be changed by setting the level argument. To obtain several standardized versions of the estimates, we can use standardized = TRUE:
 
 
 
<center><b>est <- parameterEstimates(fit, ci = FALSE, standardized = TRUE)
 
subset(est, op == "=~")
 
</b>
 
{| class="wikitable" style="text-align:center; " border="1"
 
|-
 
|Index||lhs||op||rhs||est||se||z||pvalue||std.lv||std.all||std.nox
 
|-
 
|1||visual||=~||x1||1||0||NA||NA||0.9||0.772||0.772
 
|-
 
|2||visual||=~||x2||0.553||0.1||5.554||0||0.498||0.424||0.424
 
|-
 
|3||visual||=~||x3||0.729||0.109||6.685||0||0.656||0.581||0.581
 
|-
 
|4||textual||=~||x4||1||0||NA||NA||0.99||0.852||0.852
 
|-
 
|5||textual||=~||x5||1.113||0.065||17.014||0||1.102||0.855||0.855
 
|-
 
|6||textual||=~||x6||0.926||0.055||16.703||0||0.917||0.838||0.838
 
|-
 
|7||speed||=~||x7||1||0||NA||NA||0.619||0.57||0.57
 
|-
 
|8||speed||=~||x8||1.18||0.165||7.152||0||0.731||0.723||0.723
 
|-
 
|9||speed||=~||x9||1.082||0.151||7.155||0||0.67||0.665||0.665
 
 
 
|}
 
</center>
 
  
This output shows only the factor loadings, with 3 additional columns of standardized values:

• In the first column <b>(std.lv)</b>, only the latent variables have been standardized;

• In the second column <b>(std.all)</b>, both the latent and the observed variables have been standardized;

• In the third column <b>(std.nox)</b>, both the latent and the observed variables have been standardized, except for the exogenous observed variables. This option may be useful if the standardization of exogenous observed variables has little meaning (for example, binary covariates). Since there are no exogenous covariates in this model, the last two columns are identical in this output.

<b>library("semPlot")

# semPaths(fit, "std", "show")</b>

semPaths(fit, "std", curvePivot = TRUE, edge.label.cex = 1.0)

# get the margins right:

# semPaths(fit, "std", curvePivot = TRUE, edge.label.cex = 1.0, mar = c(10, 3, 10, 3))

# semPaths(fit, "std", curvePivot = TRUE, edge.label.cex = 1.0, mar = c(10, 3, 10, 3), as.expression = c("nodes",

# "edges"), sizeMan = 3, sizeInt = 1, sizeLat = 4)

[[Image:SMHS_BigDataBigSci3.png|500px]]

===[[SMHS_BigDataBigSci_SEM| Structural Equation Modeling (SEM)]]===

===[[SMHS_BigDataBigSci_GCM| Growth Curve Modeling (GCM)]]===

===[[SMHS_BigDataBigSci_GCM| Generalized Estimating Equation (GEE) Modeling]]===

===[[SMHS_BigDataBigSci_CrossVal|Internal Validation - Statistical n-fold cross-validation]]===

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_BigDataBigSci}}

Latest revision as of 09:57, 24 May 2016

Scientific Methods for Health Sciences - Model-based Analyses

Structural Equation Modeling (SEM), Growth Curve Models (GCM), and Generalized Estimating Equation (GEE) Modeling

Questions

  • How to represent dependencies in linear models and examine causal effects?
  • Is there a way to study population average effects of a covariate against specific individual effects?

Overview

SEM allows re-parameterization of random effects to specify latent variables that may affect measures at different time points using structural equations. SEM diagrams show variables having predictive (possibly causal) effects on other variables (denoted by arrows), where coefficients index the strength and direction of predictive relations. SEM does not offer much more than what classical regression methods do, but it does allow simultaneous estimation of multiple equations modeling complementary relations.

Growth Curve (or latent growth) modeling is a statistical technique employed in SEM for estimating growth trajectories for longitudinal data (over time). It represents repeated measures of dependent variables as functions of time and other covariates. When subjects or units are observed repeatedly at known time points, latent growth curve models reveal the trend of an individual as a function of an underlying growth process, where the growth curve parameters can be estimated for each subject/unit.
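As a minimal sketch of a latent growth curve model, the R code below uses the growth() function of the lavaan package with the Demo.growth example data that ships with it. The dataset, its four time points t1 to t4, and the factor names i (intercept) and s (slope) are assumptions made for illustration only.

```r
library(lavaan)

# Linear latent growth model: a latent intercept (i) and a latent slope (s)
# measured by four repeated observations t1..t4
gcm.model <- '
  i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
  s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4
'
gcm.fit <- growth(gcm.model, data = Demo.growth)

# The estimated means of i and s describe the average trajectory
gcm.est <- parameterEstimates(gcm.fit)
gcm.means <- subset(gcm.est, op == "~1" & lhs %in% c("i", "s"))
gcm.means

# Subject-specific growth parameters (factor scores for i and s)
head(lavPredict(gcm.fit))
```

The fixed slope loadings (0, 1, 2, 3) encode equally spaced time points; other spacings would use different time scores.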

GEE is a marginal longitudinal method that directly assesses the mean relations of interest (i.e., how the mean dependent variable changes over time) while accounting for covariances among the observations within subjects, which yields better estimates and valid significance tests of the relations. Thus, GEE estimates two different equations: (1) for the mean relations, and (2) for the covariance structure. An advantage of GEE over random-effect models is that it does not require the dependent variable to be normally distributed. However, a disadvantage of GEE is that it is less flexible and versatile: commonly employed algorithms for it require a small-to-moderate number of time points that are evenly (or approximately evenly) spaced, and similarly spaced across subjects. Nevertheless, it is a little more flexible than repeated-measures ANOVA because it permits some missing values and offers a straightforward way to test for and model the specific form of autocorrelation within subjects.
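The two ingredients of a GEE fit, a mean model and a working covariance structure, can be illustrated with a short sketch. It assumes the geepack package (one of several R implementations of GEE) and uses simulated binary data; all variable names and simulation settings here are hypothetical.

```r
library(geepack)

set.seed(1)
n.subj <- 200; n.obs <- 4
id   <- rep(seq_len(n.subj), each = n.obs)       # subject identifier (rows sorted by subject)
time <- rep(seq_len(n.obs) - 1, times = n.subj)  # evenly spaced time points 0..3
u    <- rep(rnorm(n.subj), each = n.obs)         # shared subject effect induces within-subject correlation
y    <- rbinom(n.subj * n.obs, 1, plogis(-0.5 + 0.3 * time + u))
dat  <- data.frame(id, time, y)

# (1) mean model: logit link with a time effect
# (2) working covariance structure: exchangeable within subject
gee.fit <- geeglm(y ~ time, id = id, data = dat,
                  family = binomial, corstr = "exchangeable")
summary(gee.fit)   # population-average coefficients with robust standard errors
```

Other working correlation choices in geepack include "independence" and "ar1"; the robust standard errors remain valid even if the working structure is misspecified.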

GEE is mostly used when the study focuses on the population-average effect of a covariate rather than the individual-specific effect. These two effects are equivalent only for linear models, not for non-linear models.

For instance, suppose $Y_{i,j}$, the $j^{th}$ observation of the $i^{th}$ subject, follows a random effects logistic model, then $ \log\Bigg(\frac{p_{i,j}}{1-p_{i,j}} \Bigg)=μ+ν_i, $ where $ν_i \sim N(0,σ^2)$ is a random effect for subject i and $p_{i,j}=P(Y_{i,j}=1|ν_i).$

(1) When using a random effects model on such data, the estimate of μ accounts for the fact that a mean zero normally distributed perturbation was applied to each individual, making it individual-specific.

(2) When using a GEE model on the same data, we estimate the population average log odds,

\begin{equation} δ=\log\Bigg(\frac{E_ν\Big(\frac{1}{1+e^{-(μ+ν_i)}}\Big)}{1-E_ν\Big(\frac{1}{1+e^{-(μ+ν_i)}}\Big)} \Bigg), \end{equation}

in general $μ≠δ$.

If $μ=1$ and $σ^2=1$, then $δ≈.83$.
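The value of δ can also be obtained by numerical integration instead of simulation. The sketch below uses only base R; the names p.marginal and delta are introduced here for illustration.

```r
mu <- 1; sigma <- 1

# Population-average probability: E_nu[ 1 / (1 + exp(-(mu + nu))) ], nu ~ N(0, sigma^2)
p.marginal <- integrate(function(nu) plogis(mu + nu) * dnorm(nu, 0, sigma),
                        lower = -Inf, upper = Inf)$value

delta <- qlogis(p.marginal)   # population-average log odds
c(p.marginal = p.marginal, delta = delta, mu = mu)
```

The resulting delta should be close to the 0.83 quoted above and smaller than mu, which illustrates the attenuation of the population-average log odds relative to the subject-specific intercept.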

empirically:

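m <- 1; s <- 1; v <- rnorm(1000, 0, s)   # intercept, random-effect SD, simulated random effects

v2 <- 1/(1+exp(-(m+v))); v_mean <- mean(v2)   # subject-specific probabilities and their average

d <- log(v_mean/(1-v_mean)); d   # empirical population-average log odds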

Note that the random effects have mean zero on the transformed, linked, scale, but their effect is not mean zero on the original scale of the data. We can also simulate data from a mixed effects logistic regression model and compare the population level average with the inverse-logit of the intercept to see that they are not equal. This leads to a difference of the interpretation of the coefficients between GEE and random effects models, or SEM.

That is, there will be a difference between the GEE population average coefficients and the individual specific coefficients (random effects models).
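The simulation mentioned above can be sketched in base R as follows. The sample sizes, seed, and object names are arbitrary choices for illustration.

```r
set.seed(2016)
mu <- 1; sigma <- 1
n.subj <- 5000; n.obs <- 10

nu <- rnorm(n.subj, 0, sigma)                          # subject-specific random intercepts
p  <- plogis(mu + nu)                                  # subject-specific P(Y = 1 | nu_i)
y  <- rbinom(n.subj * n.obs, 1, rep(p, each = n.obs))  # simulated binary outcomes

pop.avg   <- mean(y)      # empirical population-level average of Y
inv.logit <- plogis(mu)   # inverse-logit of the intercept

c(pop.avg = pop.avg, inv.logit.mu = inv.logit, pop.log.odds = qlogis(pop.avg))
```

The population-level average falls below the inverse-logit of the intercept, so its log odds are smaller than mu, as the GEE interpretation predicts.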

# theoretically, if it can be computed:

The subject-specific log odds have expectation $E_ν\Big[\log\frac{p_{i,j}}{1-p_{i,j}}\Big]=μ=1$ (in this specific case), but the population average log odds $δ=\log\Bigg[\frac{E_ν(p_{i,j})}{1-E_ν(p_{i,j})}\Bigg]$, where $p_{i,j}=P(Y_{i,j}=1|ν_i)$, would be $< 1$ (see footnote 1). Note that this is related to the fact that a grand-total average need not be equal to an average of partial averages.

The mean of the $i^{th}$ person in the $j^{th}$ observation (e.g., location, time, etc.) can be expressed by:

$E(Y_{ij} | X_{ij},α_j)= g[μ(X_{ij}|β)+U_{ij}(α_j,X_{ij})]$,

where $μ(X_{ij}|β)$ is the average “response” of a person with the same covariates $X_{ij}$, $β$ is a set of fixed effect coefficients, $U_{ij}(α_j,X_{ij})$ is an error term that is a function of the (time, space) random effects, $α_j$, and of the covariates $X_{ij}$, with $E(U_{ij}(α_j,X_{ij})|X_{ij})=0$, and $g$ is the inverse of the link function, $g^{-1}$, which specifies the regression type, e.g.,

  • linear: $g^{-1} (u)=u,$
  • log: $g^{-1} (u)= \log(u),$
  • logistic: $g^{-1} (u)=\log(\frac{u}{1-u}).$

The link function, $g^{-1}(u)$, provides the relationship between the linear predictor and the mean of the distribution function. For practical applications there are many commonly used link functions. It makes sense to try to match the domain of the link function to the range of the distribution function's mean.
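The three link functions listed above are available in base R through make.link() from the stats package. The short sketch below applies each link to an illustrative mean value and then maps the result back with the inverse link; the value 0.25 and the object names are arbitrary.

```r
# linkfun is the link (mean -> linear predictor); linkinv is its inverse (linear predictor -> mean)
links <- list(linear   = make.link("identity"),
              log      = make.link("log"),
              logistic = make.link("logit"))

mu.val <- 0.25   # an illustrative mean value that is valid for all three links
for (nm in names(links)) {
  eta  <- links[[nm]]$linkfun(mu.val)   # value on the linear-predictor scale
  back <- links[[nm]]$linkinv(eta)      # mapped back to the mean scale
  cat(sprintf("%-8s eta = %8.4f  mean = %.2f\n", nm, eta, back))
}
```

The same link objects are what family functions such as binomial(link = "logit") use internally when fitting generalized linear models.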

<center>Common distributions with typical uses and canonical link functions</center>
<center>
{| class="wikitable" style="text-align:center; " border="1"
|-
|<b>Distribution</b> ||<b>Support of distribution</b>||<b>Typical uses</b>||<b>Link name</b>||<b>Link function</b>||<b>Mean function</b>
|-
|Normal||real: $(-\infty, +\infty)$||Linear-response data||Identity||$X\beta=\mu$||$\mu=X\beta$
|-
|Exponential, Gamma||real: $(0, +\infty)$||Exponential-response data, scale parameters||Inverse||$X\beta=-\mu^{-1}$||$\mu=-(X\beta)^{-1}$
|-
|Inverse Gaussian||real: $(0, +\infty)$|| ||Inverse squared||$X\beta=-\mu^{-2}$||$\mu=(-X\beta)^{-1/2}$
|}
</center>

Footnotes

  • 1 http://www.researchgate.net/publication/41895248

Model-based Analytics

Structural Equation Modeling (SEM)

Growth Curve Modeling (GCM)

Generalized Estimating Equation (GEE) Modeling

Internal Validation - Statistical n-fold cross-validation



