Difference between revisions of "SMHS BigDataBigSci"

Latest revision as of 09:57, 24 May 2016

Scientific Methods for Health Sciences - Model-based Analyses

Structural Equation Modeling (SEM), Growth Curve Models (GCM), and Generalized Estimating Equation (GEE) Modeling

Questions

How to represent dependencies in linear models and examine causal effects?
Is there a way to study the population-average effect of a covariate as opposed to individual-specific effects?

Overview

SEM allows re-parameterization of random effects to specify latent variables that may affect measures at different time points using structural equations. SEM shows variables having predictive (possibly causal) effects on other variables (denoted by arrows), where coefficients index the strength and direction of the predictive relations. SEM does not offer much more than classical regression methods do, but it does allow simultaneous estimation of multiple equations modeling complementary relations.

Growth Curve (or latent growth) modeling is a statistical technique employed in SEM for estimating growth trajectories from longitudinal data (over time). It represents repeated measures of dependent variables as functions of time and other covariates. When subjects or units are observed repeatedly at known time points, latent growth curve models reveal the trend of an individual as a function of an underlying growth process, where the growth curve parameters can be estimated for each subject/unit.

GEE is a marginal longitudinal method that directly assesses the mean relations of interest (i.e., how the mean dependent variable changes over time) while accounting for covariances among the observations within subjects, which yields better estimates and valid significance tests of the relations. Thus, GEE estimates two different equations: (1) one for the mean relations, and (2) one for the covariance structure. An advantage of GEE over random-effect models is that it does not require the dependent variable to be normally distributed. However, a disadvantage of GEE is that it is less flexible and versatile: commonly employed algorithms for it require a small-to-moderate number of time points that are evenly (or approximately evenly) spaced, and similarly spaced across subjects. Nevertheless, it is a little more flexible than repeated-measures ANOVA because it permits some missing values and offers an easy way to test for and model away the specific form of autocorrelation within subjects.

GEE is mostly used when the study focuses on uncovering the population-average effect of a covariate rather than the individual-specific effect. The two coincide for linear models, but generally not for non-linear models.

For instance, suppose $Y_{i,j}$ is the binary response for the $j^{th}$ observation of the $i^{th}$ subject and follows a random effects logistic model, then $ \log\Bigg(\frac{p_{i,j}}{1-p_{i,j}} \Bigg)=μ+ν_i, $ where $ν_i \sim N(0,σ^2)$ is a random effect for subject $i$ and $p_{i,j}=P(Y_{i,j}=1|ν_i).$

(1) When using a random effects model on such data, the estimate of μ accounts for the fact that a mean zero normally distributed perturbation was applied to each individual, making it individual-specific.

(2) When using a GEE model on the same data, we estimate the population average log odds,

\begin{equation} δ=\log\Bigg(\frac{E_ν\Big(\frac{1}{1+e^{-(μ+ν_i)}}\Big)}{1-E_ν\Big(\frac{1}{1+e^{-(μ+ν_i)}}\Big)} \Bigg), \end{equation}

in general $μ≠δ$.

If $μ=1$ and $σ^2=1$, then $δ≈0.83$.

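Empirically:

m <- 1; s <- 1; v <- rnorm(1000, 0, s)        # random effects v_i ~ N(0, s^2)

v2 <- 1/(1+exp(-(m+v))); v_mean <- mean(v2)   # subject-specific probabilities and their average

d <- log(v_mean/(1-v_mean)); d                # population-average log odds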

Note that the random effects have mean zero on the transformed (linked) scale, but their effect is not mean zero on the original scale of the data. We can also simulate data from a mixed effects logistic regression model and compare the population-level average with the inverse-logit of the intercept to see that they are not equal. This leads to a difference in the interpretation of the coefficients between GEE and random effects models, or SEM.

That is, there will be a difference between the GEE population-average coefficients and the individual-specific coefficients (random effects models).
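The comparison described above can be sketched in base R. This is a minimal illustration, not the authors' code: the seed, the number of subjects (`n_subj`), and the number of observations per subject (`n_obs`) are arbitrary assumptions, and only $μ=1$, $σ=1$ come from the text.

```r
# Simulate a random-intercept logistic model (sizes and seed are illustrative assumptions)
set.seed(1234)
mu <- 1; sigma <- 1
n_subj <- 2000; n_obs <- 10

nu <- rnorm(n_subj, 0, sigma)              # one random intercept per subject
p  <- plogis(mu + rep(nu, each = n_obs))   # subject-specific P(Y = 1 | nu_i)
y  <- rbinom(length(p), 1, p)              # simulated binary responses

pop_avg   <- mean(y)      # population-average probability of Y = 1
subj_spec <- plogis(mu)   # inverse-logit of the intercept (a subject with nu_i = 0)

c(population_average = pop_avg, inverse_logit_intercept = subj_spec)
qlogis(pop_avg)           # population-average log odds, expected to fall below mu
```

With these settings the population average should come out noticeably below the inverse-logit of the intercept, which is the gap between the marginal (GEE) and subject-specific (random effects) interpretations.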

Theoretically, if it can be computed:

The expected subject-specific log odds is $E(μ+ν_i)=μ=1$ (in this specific case), but the population average log odds, $δ=\log\Bigg[\frac{E_ν\big(P(Y_{i,j}=1|ν_i)\big)}{1-E_ν\big(P(Y_{i,j}=1|ν_i)\big)}\Bigg]$, would be $< 1$ ¹. This is related to the fact that a grand-total average need not be equal to an average of partial averages.
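When the expectation over the random effect has no closed form, it can still be evaluated numerically. The following sketch, assuming $μ=1$ and $σ=1$ as in the example, uses base R's `integrate` to average the subject-specific probability over the $N(0,σ^2)$ density; the variable names are illustrative.

```r
mu <- 1; sigma <- 1

# Integrand: subject-specific probability weighted by the N(0, sigma^2) density
integrand <- function(v) plogis(mu + v) * dnorm(v, mean = 0, sd = sigma)

# Marginal (population-average) probability E_v[ P(Y = 1 | v) ]
p_marginal <- integrate(integrand, lower = -Inf, upper = Inf)$value

delta <- qlogis(p_marginal)   # population-average log odds
delta
```

The result should agree with the value $δ≈0.83$ quoted earlier and is smaller than $μ=1$.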

The mean response of the $i^{th}$ person at the $j^{th}$ observation (e.g., location, time, etc.) can be expressed by:

$E(Y_{ij} | X_{ij},α_j)= g[μ(X_{ij}|β)+U_{ij}(α_j,X_{ij})]$,

where $μ(X_{ij}|β)$ is the average “response” of a person with the same covariates $X_{ij}$, $β$ is a set of fixed effect coefficients, $U_{ij}(α_j,X_{ij})$ is an error term that is a function of the (time, space) random effects, $α_j$, and also a function of the covariates $X_{ij}$, and $g$ maps the linear predictor to the mean, so that its inverse, $g^{-1}$, is the link function which specifies the regression type, e.g.,

linear: $g^{-1} (u)=u,$

log: $g^{-1} (u)= \log(u),$

logistic: $g^{-1} (u)=\log\Big(\frac{u}{1-u}\Big),$

and the error term has conditional mean zero: $E(U_{ij}(α_j,X_{ij})|X_{ij})=0.$

The link function, $g^{-1}(u)$, provides the relationship between the linear predictor and the mean of the distribution function. For practical applications there are many commonly used link functions. It makes sense to try to match the domain of the link function to the range of the distribution function's mean.
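As a small illustration of the three links listed above, base R exposes link objects through `make.link`, where `linkfun` maps a mean to the linear predictor and `linkinv` maps it back. The test value `mu_val` is an arbitrary assumption chosen to lie in $(0,1)$ so that all three links are defined.

```r
# Base-R link objects: linkfun maps the mean to the linear predictor,
# linkinv maps the linear predictor back to the mean
link_names <- c("identity", "log", "logit")
links <- lapply(link_names, make.link)
names(links) <- link_names

mu_val <- 0.3   # illustrative mean, valid for all three links

eta  <- sapply(links, function(l) l$linkfun(mu_val))        # linear predictors
back <- mapply(function(l, e) l$linkinv(e), links, eta)     # recovered means

rbind(linear_predictor = eta, recovered_mean = back)
```

Each column recovers the original mean, showing that the link and the mean function are inverses of one another.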

Common distributions with typical uses and canonical link functions

Distribution	Support of distribution	Typical uses	Link name	Link function	Mean function
Normal	real: $(-∞, +∞)$	Linear-response data	Identity	$X\beta=\mu$	$\mu=X\beta$
Exponential, Gamma	real:$(0, +∞)$	Exponential-response data, scale parameters	Inverse	$X\beta=-\mu^{-1}$	$\mu=-(X\beta)^{-1}$
Inverse Gaussian	real:$(0, +∞)$		Inverse squared	$X\beta=-\mu^{-2}$	$\mu=(-X\beta)^{-1/2}$
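The table can be cross-checked against the default links of R's built-in family objects. This is a hedged sketch: note that R parameterizes the inverse and inverse-squared links as $1/μ$ and $1/μ^2$, without the minus sign used in the table, and `mu_val` is an arbitrary positive test value.

```r
# Default links of base-R family objects corresponding to the table rows
fams <- list(Normal = gaussian(),
             Gamma = Gamma(),
             `Inverse Gaussian` = inverse.gaussian())

default_links <- sapply(fams, function(f) f$link)
default_links

mu_val <- 2   # illustrative positive mean
# Linear predictor implied by each default link at mu_val
sapply(fams, function(f) f$linkfun(mu_val))
```

The sign difference does not change the family of fitted models; it only flips the sign of the fitted coefficients.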

Footnotes

¹ http://www.researchgate.net/publication/41895248

SOCR Home page: http://www.socr.umich.edu


@@ Line 1: / Line 1: @@
-==[[SMHS| Scientific Methods for Health Sciences]] - Structural Equation Modeling (SEM) and Generalized Estimating Equation (GEE) Modeling ==
+==[[SMHS| Scientific Methods for Health Sciences]] - Model-based Analyses ==
-== Questions ==
+Structural Equation Modeling (SEM), Growth Curve Models (GCM), and Generalized Estimating Equation (GEE) Modeling
-* How to represent dependencies in linear models and examine causal effects?
-* Is there a way to study population average effects of a covariate against specific individual effects?
+==Questions ==
+*How to represent dependencies in linear models and examine causal effects?
+*Is there a way to study population average effects of a covariate against specific individual effects?
 ==Overview==
 SEM allow re-parameterization of random-effects to specify latent variables that may affect measures at different time points using structural equations. SEM show variables having predictive (possibly causal) effects on other variables (denoted by arrows) where coefficients index the strength and direction of predictive relations. SEM does not offer much more than what classical regression methods do, but it does allow simultaneous estimation of multiple equations modeling complementary relations.
+Growth Curve (or latent growth) modeling is a statistical technique employed in SEM for estimating growth trajectories for longitudinal data (over time). It represent repeated measures of dependent variables as functions of time and other covariates. When subjects or units are observed repeatedly over known time points latent growth curve models reveal the trend of an individual as a function of an underlying growth process where the growth curve parameters can be estimated for each subject/unit.
 GEE is a marginal longitudinal method that directly assesses the mean relations of interest (i.e., how the mean dependent variable changes over time), accounting for covariances among the observations within subjects, and getting a better estimate and valid significance tests of the relations. Thus, GEE estimates two different equations, (1) for the mean relations, and (2) for the covariance structure. An advantage of GEE over random-effect models is that it does not require the dependent variable to be normally distributed. However, a disadvantage of GEE is that it is less flexible and versatile – commonly employed algorithms for it require a small-to-moderate number of time points evenly (or approximately evenly) spaced, and similarly spaced across subjects. Nevertheless, it is a little more flexible than repeated-measure ANOVA because it permits some missing values and has an easy way to test for and model away the specific form of autocorrelation within subjects.
@@ Line 13: / Line 18: @@
 GEE is mostly used when the study is focused on uncovering the population average effect of a covariate vs. the individual specific effect. These two things are only equivalent for linear models, but not in non-linear models.
- <mark>FIX FORMULA For instance, suppose Y<sub>i</sub>, <sub>j</sub>,   is the random effects <b>logistic model</b> of the j<sup>th</sup>,  observation of the i<sup>th</sup> subject, then log(p_(i,j)/(1-p_(i,j) ))=μ+ν_i, where  ν_i~N(0,σ^2) is a random effect for subject i and p_(i,j)=P(Y_(i,j)=1|ν_i).</mark>
+For instance, suppose $Y_{i,j}$ is the random effects <b>logistic model</b> of the $j^{th}$,  observation of the $i^{th}$ subject, then
+$
+log\Bigg(\frac{p_{i,j}}{1-p_{i,j}} \Bigg)=μ+ν_i,
+$
+where  $ν_i \sim N(0,σ^2)$ is a random effect for <u>subject i</u> and $p_{i,j}=P(Y_{i,j}=1|ν_i).$
-(1) When using a random effects model on such data, the estimate of μ accounts for the fact that a mean zero normally distributed perturbation was applied to each individual, making it individual-specific.
+(1) When using a random effects model on such data, the estimate of μ accounts for the fact that a mean zero normally distributed perturbation was applied to each individual, making it ''individual-specific''.
 (2) When using a GEE model on the same data, we estimate the <i>population average log odds</i>,
- <mark>FIX FORMULA δ=log((E_ν (1/(1+e^(-μ+ν_i ) )))/(1-E_ν (1/(1+e^(-μ+ν_i ) )) )), in general μ≠δ.
+\begin{equation}
- If μ=1 and σ^2=1, then δ≈.83. </mark>
+δ=log\Bigg(\frac{E_v(\frac{1}{1+e^{-μ+v}i})}{1-E_v(\frac{1}{1+e^{-μ+v}i})}
+\Bigg),
+\end{equation}
+in general $μ≠δ$.
+If $μ=1$ and $σ^2=1$, then $δ≈.83$.
 empirically:
@@ Line 34: / Line 49: @@
 <b># theoretically</b>, if it can be computed:
- <mark>FIX FORMULAS!!
+$E(Y)=μ=1$ (in this specific case), but the expectation of the population average log odds
- E(Y)=μ=1 (in this specific case), but the expectation of the population average log odds δ=  log[(P(Y_(i,j)=1|ν_i))/(1-P(Y_(i,j)=1|ν_i))] would be < 1  . Note that this is kind of related to the fact that a grand-total average  need not be equal to an average of partial averages. </mark>
+$δ=log\Bigg[\frac{P(Y_{i,j}=1|v_i)}{1-P(Y_{i,j}=1|v_i)}\Bigg]$  would be $< 1$ <SUP>1</SUP>.
+Note that this is kind of related to the fact that a grand-total average  need not be equal to an average of partial averages.
- <mark>The mean of the ith person in the jth observation (e.g., location, time, etc.) can be expressed by:
+The mean of the $i^{th}$ person in the $j^{th}$ observation (e.g., location, time, etc.) can be expressed by:
- E(Yij | Xij,α_j)= g[μ(Xij ┤|β)+Uij(α_j,Xij)],
- Where μ(X_ij |β) is the average “response” of a person with the same covariates X_ij, β a set of fixed effect coefficients, and Uij(α_j,Xij) is an error term that is a function of the (time, space) random effects, α_j, and also a  function of the covariates X_ij, and g is the link function which specifies the regression type -- e.g.,
- linear: 		g^(-1) (u)=u,
- log:		g^(-1) (u)= log(u),
- logistic: 		g^(-1) (u)=log(u/(1-u))
- E(Uij(α_j,Xij)|Xij)=0.</mark>
-The link function, g(u), provides the relationship between the linear predictor and the mean of the distribution function. For practical applications there are many commonly used link functions. It makes sense to try to match the domain of the link function to the range of the distribution function's mean.
+$E(Yij | Xij,α_j)= g[μ(Xij|β)+Uij(α_j,Xij)]$,
-  <mark>INSERT TABLE!!!!!!!!!!!</mark>
+Where $μ(X_{ij}|β)$ is the average “response” of a person with the same covariates $X_{ij}$, $β$ a set of fixed effect coefficients, and $Uij(α_j,Xij)$ is an error term that is a function of the (time, space) random effects, $α_j$, and also a  function of the covariates $X_{ij}$, and $g$ is the '''link function''' which specifies the regression type -- e.g.,
+*<u>linear</u>:''' 		$g^{-1} (u)=u,$
+*<u>log</u>:'''		        $g^{-1} (u)= log(u),$
+*<u>logistic</u>:''' 		$g^{-1} (u)=log(\frac{u}{1-u})$
+*$E(Uij(α_j,Xij)|Xij)=0.$
+The link function, $g(u)$, provides the relationship between the linear predictor and the mean of the distribution function. For practical applications there are many commonly used link functions. It makes sense to try to match the domain of the link function to the range of the distribution function's mean.
+<center>Common distributions with typical uses and canonical link functions</center>
+<center>
+{| class="wikitable" style="text-align:center; " border="1"
+|-
+|<b>Distribution</b> ||<b>Support of distribution</b>||<b>Typical uses</b>||<b>Link name</b>||<b>Link function</b>||<b>Mean function</b>
+|-
+|Normal||real: $(-&#8734;, +&#8734;)$||Linear-response data||Identity||$X\beta=\mu$||$\mu=X\beta$
+|-
+|Exponential, Gamma||real:$(0, +&#8734;)$||Exponential-response data, scale parameters||Inverse||$X\beta=-\mu^{-1}$||$\mu=-(X\beta)^{-1}$
+|-
+|Inverse Gaussian||real:$(0, +&#8734;)$|| ||Inverse squared||$X\beta=-\mu^{-2}$||$\mu=(-X\beta)^{-1/2}$
+|}
+</center>
+===Footnotes===
+*<sup>1</sup> http://www.researchgate.net/publication/41895248
 ==Model-based Analytics==
@@ Line 53: / Line 91: @@
 ===[[SMHS_BigDataBigSci_SEM| Structural Equation Modeling (SEM)]]===
-===[[SMHS_BigDataBigSci_GEE| Generalized Estimating Equation (GEE) Modeling]]===
+===[[SMHS_BigDataBigSci_GCM| Growth Curve Modeling (GCM)]]===
+===[[SMHS_BigDataBigSci_GCM| Generalized Estimating Equation (GEE) Modeling]]===
+===[[SMHS_BigDataBigSci_CrossVal|Internal Validation - Statistical n-fold cross-validaiton]]===
 <hr>
