https://wiki.socr.umich.edu/api.php?action=feedcontributions&user=Clgalla&feedformat=atomSOCR - User contributions [en]2022-08-17T13:05:23ZUser contributionsMediaWiki 1.31.6https://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14538SMHS ANCOVA2014-10-29T15:05:50Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is a common method for analyzing the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups; it is a generalized form of ANOVA. Similarly, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where concomitant continuous independent variables must be controlled. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but additionally adjusts for continuous covariates. What if we have more than one dependent variable and study multivariate observations? What if we want to see whether the interactions among dependent variables, or changes in the independent variables, influence the dependent variables? Then we need the multivariate extensions of ANOVA and ANCOVA, namely MANOVA and MANCOVA respectively. So the questions are: how do these methods work, and what kinds of conclusions can we draw from them?<br />
<br />
===Theory===<br />
<br />
::'''ANOVA'''<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
*Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}}$, and the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
*Difference between the means (compare each group mean to the grand mean): total sum of squares $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; difference between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} - \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; sum of squares due to error (combination of variation within groups): $SSE(within)=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$<br />
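<br />
The decomposition can be checked numerically. The following is a minimal R sketch on simulated data; the group labels, group sizes, means and the seed are illustrative assumptions, not values from this section.<br />

```r
# Simulated data: k = 3 groups with unequal sizes (all values are illustrative)
set.seed(1)
g <- factor(rep(c("g1", "g2", "g3"), times = c(5, 7, 6)))
y <- rnorm(length(g), mean = c(10, 12, 15)[as.integer(g)], sd = 2)

grand_mean  <- mean(y)                # \bar y_{..}
group_means <- tapply(y, g, mean)     # \bar y_{i.}
n_i         <- tapply(y, g, length)   # n_i

sst_total   <- sum((y - grand_mean)^2)
sst_between <- sum(n_i * (group_means - grand_mean)^2)
sse_within  <- sum((y - group_means[as.character(g)])^2)

# The two sides of the decomposition agree up to floating-point error
all.equal(sst_total, sst_between + sse_within)
```

The same check works for any grouping, because the identity is algebraic and does not depend on the data being normal or balanced.<br />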
<br />
*Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistics||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
<br />
<br />
:::ANOVA hypotheses (general form)<br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}; H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation, so the discrepancies between the group means are large compared to the variability within the groups (error). That is, a large $F_{0}$ provides strong evidence against $H_{0}$.<br />
**ANOVA conditions: valid if (1) design conditions: all groups of observations represent random samples from their respective populations, and all the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions must be approximately normal (if the sample size is large, the normality condition is less crucial), and the standard deviations of all populations are equal. The equal-spread condition can be slightly relaxed to $0.5\le\frac{\sigma_{i}} {\sigma_{j}}\le 2$ for all $i$ and $j$, that is, no population standard deviation is more than twice as large as any other. <br />
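<br />
As a hedged illustration of the test, the R sketch below computes $F_{0}$ and its p-value by hand, cross-checks them against <code>aov</code>, and applies the standard-deviation rule of thumb to the sample standard deviations. The data are simulated and all numeric settings are assumptions made for the example.<br />

```r
# Simulated balanced data: 3 groups of 8 observations (illustrative values)
set.seed(2)
g <- factor(rep(c("g1", "g2", "g3"), each = 8))
y <- rnorm(24, mean = c(10, 12, 15)[as.integer(g)], sd = 2)

k <- nlevels(g)
n <- length(y)
group_means <- tapply(y, g, mean)
n_i <- tapply(y, g, length)

mst <- sum(n_i * (group_means - mean(y))^2) / (k - 1)        # MST(between)
mse <- sum((y - group_means[as.character(g)])^2) / (n - k)   # MSE(within)
f0  <- mst / mse
p_value <- pf(f0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against R's built-in one-way ANOVA table
tab <- summary(aov(y ~ g))[[1]]
all.equal(f0, tab[["F value"]][1])
all.equal(p_value, tab[["Pr(>F)"]][1])

# Rule of thumb: largest sample SD should be at most twice the smallest
sd_ratio <- max(tapply(y, g, sd)) / min(tapply(y, g, sd))
sd_ratio <= 2
```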
<br />
<br />
::'''Two-way ANOVA'''<br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notations first: two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\le i\le a$, $1\le j\le b$ and $1\le k\le r$. Here $y_{ijk}$ is the measurement at A-factor level $i$ and B-factor level $j$, with observation index $k$; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group; and $N$ is the total number of observations, $N=r*a*b.$ Further, $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$; the factor-level means $\bar y_{i..}$ and $\bar y_{.j.}$ average over the remaining indices in the same way; and the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$. We have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
<br />
:::Hypotheses <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
*Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
*Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
*Main effect: involves the independent variables one at a time. The interaction is ignored for this part. <br />
*Interaction effect: the effect that one factor has on the other factor. The degree of freedom is the product of the two degrees of freedom of each factor.<br />
*Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistics||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$
|-
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$
|-
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
*ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
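<br />
For a balanced design, the sums of squares in the two-way table can be reproduced directly from the cell, row and column means. The R sketch below does this on simulated data and compares the result with <code>aov</code>; the choices of $a=3$, $b=2$, $r=4$ and the simulated effects are assumptions for illustration only.<br />

```r
# Balanced two-way layout: a levels of A, b levels of B, r replicates per cell
set.seed(3)
a <- 3; b <- 2; r <- 4
d <- expand.grid(rep = 1:r, A = factor(1:a), B = factor(1:b))
d$y <- rnorm(nrow(d), mean = 10 + as.integer(d$A) + 2 * as.integer(d$B))

m_all <- mean(d$y)                             # \bar y_{...}
m_A   <- tapply(d$y, d$A, mean)                # \bar y_{i..}
m_B   <- tapply(d$y, d$B, mean)                # \bar y_{.j.}
m_AB  <- tapply(d$y, list(d$A, d$B), mean)     # \bar y_{ij.}, an a x b matrix

ss_A  <- r * b * sum((m_A - m_all)^2)
ss_B  <- r * a * sum((m_B - m_all)^2)
ss_AB <- r * sum((m_AB - outer(m_A, m_B, "+") + m_all)^2)
ss_E  <- sum((d$y - m_AB[cbind(as.integer(d$A), as.integer(d$B))])^2)
ss_T  <- sum((d$y - m_all)^2)

# SST(total) = SS(A) + SS(B) + SS(AB) + SSE
all.equal(ss_T, ss_A + ss_B + ss_AB + ss_E)

# Cross-check against the built-in two-way ANOVA (rows: A, B, A:B, Residuals)
tab <- summary(aov(y ~ A * B, data = d))[[1]]
all.equal(c(ss_A, ss_B, ss_AB, ss_E), tab[["Sum Sq"]])
```

The hand formulas rely on equal cell sizes; with unbalanced data the sums of squares reported by <code>aov</code> depend on the order of the terms and will generally not match these expressions.<br />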
<br />
<br />
::'''ANCOVA'''<br />
Analysis of Covariance (ANCOVA) blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
*Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. We use the F-test to evaluate differences between groups by dividing the explained variance between groups by the unexplained variance within the groups: $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. Without adjustment, the influence of the CVs is absorbed into the denominator. When we control for the effect of the CVs on the DV, we remove that influence from the denominator, which makes $F$ bigger and thereby increases our power to find a significant effect if one exists.<br />
*Adjusting preexisting differences in nonequivalent groups: ANCOVA can correct for initial group differences on the DV that exist among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than they would be without the CV, since the participants cannot be made equal through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would also remove considerable variance on the DV due to the IV, which would make the result meaningless.<br />
*Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it won’t adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: check whether the CV significantly interacts with the IV by running an ANCOVA model that includes the IV, the CV and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and the adjusted $MS_{error}$. (5) Follow-up analyses: if there is a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
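<br />
Steps (3) and (4) above can be sketched in base R as follows. This is a minimal illustration on simulated data, not a complete workflow: the variable names (<code>group</code>, <code>x</code>, <code>y</code>), the effect sizes and the seed are assumptions, and Levene's test from step (2) is omitted because it is not part of base R.<br />

```r
# Simulated data: one 3-level factor (IV), one covariate x (CV), one outcome y (DV)
set.seed(4)
n <- 60
d <- data.frame(group = factor(rep(c("ctrl", "trtA", "trtB"), each = n / 3)),
                x = rnorm(n, mean = 50, sd = 10))
d$y <- 5 + 0.4 * d$x + c(0, 3, 6)[as.integer(d$group)] + rnorm(n, sd = 2)

# Step 3: homogeneity of regression slopes (inspect the x:group row)
fit_slopes <- aov(y ~ x * group, data = d)
summary(fit_slopes)

# Step 4: ANCOVA without the interaction term; the covariate is entered first
fit_ancova <- aov(y ~ x + group, data = d)
summary(fit_ancova)

# Adjusted group means: model predictions at the overall mean of the covariate
new_d <- data.frame(group = factor(levels(d$group), levels = levels(d$group)),
                    x = mean(d$x))
adj_means <- predict(lm(y ~ x + group, data = d), newdata = new_d)
names(adj_means) <- levels(d$group)
adj_means

# Error mean square without and with the covariate
mse_anova  <- summary(aov(y ~ group, data = d))[[1]][["Mean Sq"]][2]
mse_ancova <- summary(fit_ancova)[[1]][["Mean Sq"]][3]
c(without_cv = mse_anova, with_cv = mse_ancova)
```

Because the simulated covariate is strongly related to the outcome, the error mean square drops once it is included, which is the mechanism behind the power gain described above.<br />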
<br />
<br />
<br />
::'''MANOVA'''<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
*Relationship with ANOVA: <br />
**MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows adding of interval independents as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
**Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, that is, $A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular values (eigenvalues) of this matrix product; however, there is no unique choice, owing to the multi-dimensional nature of the alternative hypothesis.<br />
**MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variable, as in the one-way ANOVA, there are three dependent variables, as in our neuroimaging example below. Under the null hypothesis, it is assumed that scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where, for instance, the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all four groups are sampled from the same distribution.<br />
Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or Placebo), the therapy (cognitive/physical), MMSE, CDR and Imaging. It is better when the study design is balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., on variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medication. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effects: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as ANOVA. The only difference is that ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while MANOVA deals with a $(k*1)$ mean vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value, as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
<br />
*Common statistics are summarized based on the roots (eigenvalues) $\lambda_{p}$ of the $A$ matrix: (1) Samuel Stanley Wilks’ lambda, $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})=\left \| A \right \|_{\infty}$. In terms of robustness, these four major MANOVA tests are commonly ordered as Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis and the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}$ and $E^{-1} H$ are transposes of each other and therefore have the same eigenvalues. Since the test statistics depend only on these eigenvalues, the order of the matrix multiplication does not matter here.<br />
<br />
*All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis, and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda:\, 1- \Lambda$ is an index of variance explained by the model, $\eta^{2}$, a measure of effect size analogous to $R^{2}$ in regression. Wilks’ $\Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.<br />
**Pillai’s Trace: Pillai’s criterion is the pooled effect variances. $Pillai’s\: trace=trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
**Hotelling’s Trace: the pooled ratio of effect variance to error variance:$trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
**Roy’s largest root: gives an upper bound for the F statistic, $Roy’s\: largest \,root=max(\hat{\lambda_{i}})$. <br />
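<br />
The four statistics can be computed from the eigenvalues of $E^{-1}H$ and compared with the values reported by R's <code>summary.manova</code>. The sketch below uses simulated data with three dependent variables and one three-level factor; the data, effect sizes and seed are illustrative assumptions.<br />

```r
# Simulated one-way MANOVA: 3 groups, 3 dependent variables
set.seed(5)
g <- factor(rep(c("g1", "g2", "g3"), each = 15))
shift <- c(0, 0.8, 1.6)[as.integer(g)]
Y <- cbind(y1 = rnorm(45) + shift,
           y2 = rnorm(45) + 0.5 * shift,
           y3 = rnorm(45))

fit <- manova(Y ~ g)
ssp <- summary(fit)$SS   # SSCP matrices: hypothesis term first, residuals last
H <- ssp[[1]]            # hypothesis sums of squares and cross products
E <- ssp[[2]]            # error sums of squares and cross products

# Eigenvalues of A = E^{-1} H (Re() drops any numerically zero imaginary part)
lambda <- Re(eigen(solve(E) %*% H)$values)

wilks     <- prod(1 / (1 + lambda))
pillai    <- sum(lambda / (1 + lambda))
hotelling <- sum(lambda)
roy       <- max(lambda)

# Cross-check against the statistics reported by summary.manova
all.equal(wilks,     unname(summary(fit, test = "Wilks")$stats[1, 2]))
all.equal(pillai,    unname(summary(fit, test = "Pillai")$stats[1, 2]))
all.equal(hotelling, unname(summary(fit, test = "Hotelling-Lawley")$stats[1, 2]))
all.equal(roy,       unname(summary(fit, test = "Roy")$stats[1, 2]))
```

With three groups the hypothesis matrix has rank at most two, so one eigenvalue is numerically zero and contributes nothing to any of the four statistics.<br />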
<br />
<br />
::'''MANCOVA'''<br />
A multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where concomitant continuous independent variables must be controlled. Characterizing a covariate in a data source allows the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$, to be reduced. MANCOVA characterizes the difference in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
*Assumptions: (1) normality: within each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix between the dependent variables is equal across all levels of the independent variable.<br />
*Covariate represents the source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. And methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
*In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
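<br />
A MANCOVA can be fitted in R by adding the covariate to the right-hand side of a <code>manova</code> formula. The sketch below is an illustration on simulated data (two dependent variables, one two-level factor, one covariate); all names and numeric settings are assumptions. It compares the error sums of squares and cross products matrix with and without the covariate.<br />

```r
# Simulated data: two outcomes, one 2-level factor, one covariate x
set.seed(6)
n <- 60
d <- data.frame(group = factor(rep(c("ctrl", "trt"), each = n / 2)),
                x = rnorm(n))
shift <- ifelse(d$group == "trt", 0.7, 0)
d$y1 <- 1.0 * d$x + shift + rnorm(n)
d$y2 <- 0.5 * d$x + shift + rnorm(n)

# MANOVA (no covariate) versus MANCOVA (covariate entered before the factor)
fit_manova  <- manova(cbind(y1, y2) ~ group, data = d)
fit_mancova <- manova(cbind(y1, y2) ~ x + group, data = d)
summary(fit_manova, test = "Pillai")
summary(fit_mancova, test = "Pillai")

# Error SSCP matrices: including the covariate can only shrink the error term
E_without <- summary(fit_manova)$SS$Residuals
E_with    <- summary(fit_mancova)$SS$Residuals
det(E_with) < det(E_without)
```

A smaller error matrix does not guarantee a larger test statistic for the factor: if the covariate is also correlated with the grouping, part of the between-group effect is removed as well, which is the situation described in the last bullet above.<br />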
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid. <br />
<br />
===Software=== <br />
<br />
RCODE:<br />
fit <- aov(y ~ A, data = mydata) #one way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A+B, data=mydata) # randomized block design where B is the blocking factor<br />
fit <- aov(y ~ A+B+A*B, data=mydata) ## two way factorial design<br />
fit <- aov(y ~ A+x, data=mydata) ## analysis of covariance<br />
## for within subject designs, the data frame has to be rearranged for each measurement on a subject to be a separate observation<br />
fit <- aov(y ~ A+Error(subject/A), data=mydata) ## one within factor<br />
fit <- aov(y ~ (W1*W2*B1*B2) + Error(Subject/(W1*W2)) + (B1*B2), data=mydata) ## two within factors W1 and W2, two between factors B1 and B2<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1,y2,y3)<br />
fit <- manova(Y ~ A*B)<br />
<br />
===Problems===<br />
<br />
Use the CPI (consumer price index) data for food, housing, transportation and medical care from 1981 to 2007 to carry out a two-way analysis of variance in R. We take ‘Month’ as one factor (12 levels) and the item on which the CPI is measured as the other factor (4 levels), which gives a 12*4 factorial design. The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv',header=T)<br />
attach(CPI)<br />
summary(CPI)<br />
Month <- factor(Month)<br />
CPI_Item <- factor(CPI_Item)<br />
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
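Based on this dataset, CPI_Item is a significant factor of CPI (F = 150.2, p < 2e-16), while neither Month nor the Month by CPI_Item interaction is significant (both p-values are approximately 1). Dropping Month from the model leaves the conclusion about CPI_Item unchanged (F = 155.3, p < 2e-16).<br />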
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14537SMHS ANCOVA2014-10-29T15:00:40Z<p>Clgalla: /* Scientific Methods for Health Sciences - Analysis of Covariance (ANCOVA) */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is a common method for analyzing the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups; it is a generalized form of ANOVA. Similarly, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where concomitant continuous independent variables must be controlled. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but additionally adjusts for continuous covariates. What if we have more than one dependent variable and study multivariate observations? What if we want to see whether the interactions among dependent variables, or changes in the independent variables, influence the dependent variables? Then we need the multivariate extensions of ANOVA and ANCOVA, namely MANOVA and MANCOVA respectively. So the questions are: how do these methods work, and what kinds of conclusions can we draw from them?<br />
<br />
===Theory===<br />
<br />
::'''ANOVA'''<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
*Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+⋯+n_{k}$. The group mean for group $i$ is $\bar y_{i}. =\frac{\sum_{j=1}^n_{i}y_{ij}}{n_{i}},$ the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
*Difference between the means (compare each group mean to the grand mean): total variance $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; difference between each group mean and grand mean: $SST(between)=\sum_{i=}^{k}n_{i} \left(\bar y_{i.} \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; Sum square due to error (combination of variation within group): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within)$.<br />
<br />
*Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degrees of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistic||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$<br />
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
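<br />
The entries of the one-way ANOVA table can be reproduced by hand. The following is a minimal R sketch, assuming a small made-up data set (the values of <code>y</code> and the group labels in <code>g</code> are hypothetical and serve only as an illustration); it computes the sums of squares, $F_{0}$ and the p-value from the formulas above and cross-checks them against R's built-in <code>aov</code>.<br />

```r
# Hypothetical toy data: k = 3 groups with unequal sizes (values made up for illustration)
y <- c(5.1, 4.9, 6.2, 5.8, 7.4, 6.9, 7.7, 8.1, 9.0, 8.4, 9.3)
g <- factor(c("g1", "g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3", "g3"))

n <- length(y)                  # total number of observations
k <- nlevels(g)                 # number of groups
n_i <- tapply(y, g, length)     # group sizes n_i
ybar_i <- tapply(y, g, mean)    # group means
ybar <- mean(y)                 # grand mean

sst_total   <- sum((y - ybar)^2)               # SST(total)
sst_between <- sum(n_i * (ybar_i - ybar)^2)    # SST(between)
sse_within  <- sum((y - ave(y, g))^2)          # SSE(within); ave() returns each observation's group mean

mst <- sst_between / (k - 1)    # MST(between)
mse <- sse_within / (n - k)     # MSE(within)
F0  <- mst / mse
p_value <- pf(F0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against the built-in one-way ANOVA fit
tab <- summary(aov(y ~ g))[[1]]
print(tab)
print(c(F0 = F0, p_value = p_value))
```

The hand-computed $F_{0}$ and p-value should agree with the first row of the table printed by <code>summary(aov(...))</code>.<br />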
<br />
<br />
:::ANOVA hypotheses (general form)<br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}; H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared to the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.<br />
**ANOVA conditions: valid if (1) design conditions: each group of observations represents a random sample from its own population, and all observations within each group are independent of each other; (2) population conditions: the $k$ population distributions must be approximately normal (if the sample sizes are large, the normality condition is less crucial), and the standard deviations of all populations must be equal. The equal-spread condition can be slightly relaxed to $0.5\leq\frac{\sigma_{i}} {\sigma_{j}}\leq 2$ for all $i$ and $j$, that is, no population standard deviation is more than twice as large as any other. <br />
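<br />
The population conditions can be screened informally in R. The sketch below is only an illustration under stated assumptions: the data are made up, the sample standard deviations stand in for the unknown $\sigma_{i}$, and the Shapiro-Wilk and Bartlett tests are one possible choice of diagnostics rather than the only one.<br />

```r
# Hypothetical toy data (same made-up layout as a one-way design with k = 3 groups)
y <- c(5.1, 4.9, 6.2, 5.8, 7.4, 6.9, 7.7, 8.1, 9.0, 8.4, 9.3)
g <- factor(c("g1", "g1", "g1", "g1", "g2", "g2", "g2", "g3", "g3", "g3", "g3"))

# Rule of thumb from the text: 0.5 <= sigma_i / sigma_j <= 2 for all pairs,
# which is equivalent to (largest SD) / (smallest SD) <= 2
sd_i <- tapply(y, g, sd)
sd_ratio <- max(sd_i) / min(sd_i)
equal_spread_ok <- sd_ratio <= 2

# Normality of the residuals (Shapiro-Wilk) and equality of variances (Bartlett)
res <- residuals(aov(y ~ g))
normality_test <- shapiro.test(res)
variance_test  <- bartlett.test(y ~ g)

print(sd_i)
print(c(sd_ratio = sd_ratio))
print(equal_spread_ok)
print(normality_test$p.value)
print(variance_test$p.value)
```

With samples this small the tests have little power, so a non-significant result should not be read as proof that the conditions hold.<br />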
<br />
<br />
::'''Two-way ANOVA'''<br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notations first: two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the measurement at A-factor level $i$ and B-factor level $j$ with observation index $k$; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group (a balanced design); and $N$ is the total number of observations, $N=r*a*b$. In the model, $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$; the marginal means $\bar y_{i..}$ and $\bar y_{.j.}$ average over the omitted indices in the same way; and the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$. The total sum of squares decomposes as $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
<br />
:::Hypotheses <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
*Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
*Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
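*Main effect: the effect of one independent variable (factor) considered on its own, averaged over the levels of the other factor. The interaction is ignored for this part. <br />
*Interaction effect: the extent to which the effect of one factor on the dependent variable depends on the level of the other factor. Its degrees of freedom equal the product of the degrees of freedom of the two factors, $(a-1)(b-1)$.<br />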
*Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degrees of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistic||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$<br />
|-<br />
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$<br />
|-<br />
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$<br />
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
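<br />
For a balanced design, the sums of squares in the two-way table can be verified numerically. The following R sketch assumes a made-up balanced layout with $a=2$, $b=3$ and $r=2$ (all data values are hypothetical); it evaluates $SS(A)$, $SS(B)$, $SS(AB)$ and $SSE$ from the cell, marginal and grand means and compares them with <code>aov</code>.<br />

```r
# Hypothetical balanced design: a = 2 levels of A, b = 3 levels of B, r = 2 replicates per cell
d <- expand.grid(rep = 1:2,
                 A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2", "b3")))
d$y <- c(4.1, 4.5, 5.0, 5.6, 6.2, 5.8, 7.9, 8.3, 5.1, 4.7, 9.4, 9.0)  # made-up responses

m    <- mean(d$y)              # grand mean
m_i  <- ave(d$y, d$A)          # marginal mean of factor A, repeated per observation
m_j  <- ave(d$y, d$B)          # marginal mean of factor B, repeated per observation
m_ij <- ave(d$y, d$A, d$B)     # cell mean, repeated per observation

# Summing over all N observations supplies the multipliers rb, ra and r automatically
ss_a  <- sum((m_i - m)^2)
ss_b  <- sum((m_j - m)^2)
ss_ab <- sum((m_ij - m_i - m_j + m)^2)
sse   <- sum((d$y - m_ij)^2)
sst   <- sum((d$y - m)^2)

# Cross-check against the built-in two-way factorial fit
tab <- summary(aov(y ~ A * B, data = d))[[1]]
print(tab)
print(c(SS_A = ss_a, SS_B = ss_b, SS_AB = ss_ab, SSE = sse, SST = sst))
```

The agreement with <code>aov</code> relies on the design being balanced; with unequal cell sizes the sequential sums of squares reported by <code>aov</code> depend on the order of the terms.<br />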
<br />
*ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
<br />
<br />
::'''ANCOVA'''<br />
Analysis of Covariance (ANCOVA) is a common method that blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
*Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. The F-test evaluates the difference between groups by dividing the explained variance between groups by the unexplained variance within the groups, $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. Without adjustment, the influence of the CVs is lumped into the denominator. When we control for the effect of the CVs on the DV, we remove it from the denominator, which makes $F$ larger and thereby increases our power to find a significant effect if one exists.<br />
*Adjusting preexisting differences in nonequivalent groups: ANCOVA can correct for initial group differences that exist on the DV among several intact groups. In this case, the CVs are used to adjust scores and make participants more similar than they would be without the CV, since the participants cannot be made equal through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would also remove considerable variance on the DV, which would make the result meaningless.<br />
*Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it won’t adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: check whether the CV significantly interacts with the IV by fitting an ANCOVA model that includes the IV, the CV and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and the adjusted $MS_{error}$. (5) Follow-up analyses: if there was a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
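<br />
Steps (3) and (4) of the procedure above, together with the power argument, can be sketched in R as follows. This is a minimal illustration on simulated data, not a prescribed workflow: the variable names (<code>y</code>, <code>x</code>, <code>group</code>), the effect sizes and the random seed are all assumptions made for the example, and Levene's test from step (2) is omitted because it requires an add-on package.<br />

```r
set.seed(1)
n_per <- 20
group <- factor(rep(c("ctrl", "trtA", "trtB"), each = n_per))   # categorical IV
x <- rnorm(3 * n_per, mean = 50, sd = 10)                       # continuous covariate (CV)
effect <- c(ctrl = 0, trtA = 3, trtB = 6)[as.character(group)]  # assumed group effects
y <- 10 + 0.5 * x + effect + rnorm(3 * n_per, sd = 2)           # DV simulated with parallel slopes
dat <- data.frame(y = y, x = x, group = group)

# Step (3): homogeneity of regression slopes, tested through the CV-by-IV interaction
fit_add <- lm(y ~ x + group, data = dat)   # common slope
fit_int <- lm(y ~ x * group, data = dat)   # separate slope per group
slopes_test <- anova(fit_add, fit_int)
print(slopes_test)

# Step (4): ANCOVA without the interaction; with sequential sums of squares the
# covariate is entered first so that the group effect is adjusted for x
ancova_tab <- anova(fit_add)
print(ancova_tab)

# Power argument: the covariate shrinks the error mean square in the F denominator
mse_anova  <- anova(lm(y ~ group, data = dat))["Residuals", "Mean Sq"]
mse_ancova <- ancova_tab["Residuals", "Mean Sq"]
print(c(mse_anova = mse_anova, mse_ancova = mse_ancova))

# Adjusted means: predicted group means at the overall mean of the covariate
new_dat <- data.frame(x = mean(dat$x),
                      group = factor(levels(dat$group), levels = levels(dat$group)))
adj_means <- predict(fit_add, newdata = new_dat)
names(adj_means) <- levels(dat$group)
print(adj_means)
```

Because the data are simulated with a common slope, the interaction test is expected to be non-significant in most runs, but that outcome depends on the random draw.<br />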
<br />
<br />
<br />
::'''MANOVA'''<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
*Relationship with ANOVA: <br />
**MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows adding of interval independents as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
**Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, that is, $A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular values (eigenvalues) of this matrix product, but there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
**MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variable, as in the one-way ANOVA, there are three dependent variables, as in the neuroimaging example below. Under the null hypothesis, it is assumed that scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all four groups are sampled from the same distribution.<br />
Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or Placebo), the therapy (Cognitive or Physical), and the outcome measures MMSE, CDR and Imaging. It is better when the study design is balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., in variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medications. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effect: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while a MANOVA deals with a $(k*1)$ mean vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now, $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value, as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)<br />
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
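<br />
The matrix partition can be illustrated numerically for the simplest case of a single grouping factor. The R sketch below is an assumption-laden illustration: the four groups and the three outcome columns (named MMSE, CDR and Imaging only to mirror the example) are simulated, not real patient data. It builds the hypothesis and error sums of squares and cross products (SSCP) matrices and checks that they add up to the total SSCP matrix.<br />

```r
set.seed(2)
g <- factor(rep(c("g1", "g2", "g3", "g4"), each = 10))   # four hypothetical groups
Y <- cbind(MMSE    = rnorm(40, mean = 25,  sd = 3),      # simulated outcomes, for illustration only
           CDR     = rnorm(40, mean = 1,   sd = 0.5),
           Imaging = rnorm(40, mean = 100, sd = 15))

fit <- lm(Y ~ g)                                   # multivariate linear model (one fit, three responses)

E   <- crossprod(residuals(fit))                   # error SSCP matrix (3 x 3)
H   <- crossprod(sweep(fitted(fit), 2, colMeans(Y)))            # hypothesis (between-group) SSCP matrix
Tot <- crossprod(scale(Y, center = TRUE, scale = FALSE))        # total SSCP matrix

# Dividing the error SSCP by its degrees of freedom gives the error covariance matrix:
# variances on the diagonal, covariances off the diagonal
V_error <- E / fit$df.residual
print(V_error)
print(max(abs(Tot - (H + E))))   # numerically zero: total = hypothesis + error
```

The printed <code>V_error</code> has the same layout as the error matrix in the table above.<br />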
<br />
*Common statistics are summarized based on the roots (eigenvalues) $\lambda_{p}$ of the $A$ matrix: (1) Samuel Stanley Wilks’ $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as Wilks’ lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})=\left \| A \right \|_{\infty}$. These are the four major MANOVA tests; the commonly cited ordering of the tests in terms of power and robustness is Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis and the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}$ and $E^{-1}H$ are transposes of each other. The two products are generally not equal, but they have the same eigenvalues, so test statistics that depend only on the eigenvalues do not depend on the order of multiplication.<br />
<br />
*All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda$: $1- \Lambda$ is an index of the variance explained by the model, and the corresponding $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression. Wilks’ $\Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.<br />
**Pillai’s Trace: Pillai’s criterion is the pooled effect variance: Pillai’s trace $=trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
**Hotelling’s Trace: the pooled ratio of effect variance to error variance: $trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
**Roy’s largest root: gives an upper bound for the F statistic, $Roy’s\: largest \,root=max(\hat{\lambda_{i}})$. <br />
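<br />
The four statistics can be computed directly from the eigenvalues of $E^{-1}H$. The following R sketch uses simulated data (four hypothetical groups, three hypothetical outcome columns) and compares the hand-computed values with those reported by <code>summary()</code> for a <code>manova</code> fit; it is an illustration of the formulas above rather than a recommended analysis.<br />

```r
set.seed(2)
g <- factor(rep(c("g1", "g2", "g3", "g4"), each = 10))   # four hypothetical groups
Y <- cbind(y1 = rnorm(40, mean = 25,  sd = 3),           # simulated outcomes, for illustration only
           y2 = rnorm(40, mean = 1,   sd = 0.5),
           y3 = rnorm(40, mean = 100, sd = 15))

fit <- manova(Y ~ g)
E <- crossprod(residuals(fit))                           # error SSCP matrix
H <- crossprod(sweep(fitted(fit), 2, colMeans(Y)))       # hypothesis SSCP matrix

# Eigenvalues of A = E^{-1} H (real and non-negative in theory; Re() guards against round-off)
lambda <- Re(eigen(solve(E) %*% H)$values)

wilks     <- prod(1 / (1 + lambda))      # Wilks' Lambda
pillai    <- sum(lambda / (1 + lambda))  # Pillai's trace
hotelling <- sum(lambda)                 # Hotelling-Lawley trace
roy       <- max(lambda)                 # Roy's largest root

# Cross-check: the statistic is in the second column of the stats matrix, first row (term g)
r_wilks     <- summary(fit, test = "Wilks")$stats[1, 2]
r_pillai    <- summary(fit, test = "Pillai")$stats[1, 2]
r_hotelling <- summary(fit, test = "Hotelling-Lawley")$stats[1, 2]
r_roy       <- summary(fit, test = "Roy")$stats[1, 2]

print(rbind(by_hand = c(wilks, pillai, hotelling, roy),
            summary = c(r_wilks, r_pillai, r_hotelling, r_roy)))
```

The equality of <code>wilks</code> and $|E|/|H+E|$ can be checked in the same session with <code>det(E) / det(H + E)</code>.<br />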
<br />
<br />
::'''MANCOVA'''<br />
Multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables (covariates) is required. Characterizing a covariate in a data source allows the reduction of the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$. MANCOVA allows the characterization of differences in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
*Assumptions: (1) normality: within each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix of the dependent variables is equal across all levels of the independent variable.<br />
*Covariate represents the source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. And methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
*In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
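<br />
A MANCOVA can be fitted in R by adding the covariate to a <code>manova</code> formula. The sketch below is a minimal illustration on simulated data; the variable names (<code>y1</code>, <code>y2</code>, <code>x</code>, <code>group</code>), the effect sizes and the seed are assumptions made for the example. Because <code>summary()</code> of a <code>manova</code> fit reports sequential tests, the covariate is entered before the grouping factor so that the group effect is adjusted for it.<br />

```r
set.seed(3)
n_per <- 15
group <- factor(rep(c("placebo", "vitE"), each = n_per))   # hypothetical two-level factor
x <- rnorm(2 * n_per, mean = 70, sd = 8)                   # hypothetical covariate (e.g., age)
shift <- ifelse(group == "vitE", 1.5, 0)                   # assumed treatment effect
y1 <- 20 + 0.10 * x + shift + rnorm(2 * n_per)             # first simulated dependent variable
y2 <- 5 + 0.05 * x + 0.5 * shift + rnorm(2 * n_per)        # second simulated dependent variable
dat <- data.frame(y1 = y1, y2 = y2, x = x, group = group)

# MANCOVA: covariate first, then the factor (sequential multivariate tests)
fit_mancova <- manova(cbind(y1, y2) ~ x + group, data = dat)
mancova_tab <- summary(fit_mancova, test = "Pillai")$stats
print(mancova_tab)

# For comparison: MANOVA without the covariate
fit_manova <- manova(cbind(y1, y2) ~ group, data = dat)
manova_tab <- summary(fit_manova, test = "Pillai")$stats
print(manova_tab)
```

Comparing the two tables shows how the test for <code>group</code> changes once the covariate is controlled; whether it becomes larger or smaller depends on how the covariate relates to both the outcomes and the groups.<br />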
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid. <br />
<br />
===Software=== <br />
<br />
RCODE:<br />
fit <- aov(y ~ A, data = mydata) # one-way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A + B, data = mydata) # randomized block design, where B is the blocking factor<br />
fit <- aov(y ~ A * B, data = mydata) # two-way factorial design (A * B expands to A + B + A:B)<br />
fit <- aov(y ~ A + x, data = mydata) # analysis of covariance with covariate x (aov uses sequential sums of squares; use y ~ x + A to test A adjusted for x)<br />
## for within-subject designs, the data frame has to be rearranged so that each measurement on a subject is a separate observation<br />
fit <- aov(y ~ A + Error(Subject/A), data = mydata) # one within-subject factor A<br />
fit <- aov(y ~ (W1*W2*B1*B2) + Error(Subject/(W1*W2)) + (B1*B2), data = mydata) # two within factors W1 and W2, two between factors B1 and B2<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1,y2,y3)<br />
fit <- manova(Y ~ A*B)<br />
<br />
===Problems===<br />
<br />
Use the CPI (consumer price index) data for food, housing, transportation and medical care from 1981 to 2007 to carry out a two-way analysis of variance in R. We take ‘Month’ as one factor and the item on which the CPI is measured (‘CPI_Item’) as the other factor, which gives a $12*4$ factorial design (12 months by 4 items). The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
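CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv', header=TRUE) # adjust the path to your local copy of the data<br />
summary(CPI)<br />
CPI$Month <- factor(CPI$Month) # convert inside the data frame so that aov(..., data=CPI) uses the factor<br />
CPI$CPI_Item <- factor(CPI$CPI_Item)<br />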
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
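Based on this dataset, CPI_Item is a highly significant factor ($p<2*10^{-16}$), whereas neither Month nor the Month by CPI_Item interaction is significant (both p-values are approximately 1). The reduced model with CPI_Item alone (fit2) leaves almost the same residual sum of squares (2991663 versus 2987674), so it describes the data essentially as well as the full factorial model.<br />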
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14536SMHS ANCOVA2014-10-29T13:54:15Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) is the common method applied to blend ANOVA and regression and evaluate whether population means of a dependent variance (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups, which is a generalized form of ANOVA. Similar to MANOVA, MANOCVA (multivariate analysis of covariance) is an extension of ANCOVA and is designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have already discussed analysis of variance (ANOVA), which compares group means. ANCOVA extends ANOVA by adjusting the group comparison for one or more continuous covariates. What if we have more than one dependent variable and need to study multivariate observations? What if we want to know whether changes in the independent variables, or the relationships among the dependent variables, influence the outcomes? In those cases we need the multivariate extensions of ANOVA and ANCOVA: MANOVA and MANCOVA, respectively. The questions are then how these methods work and what kinds of conclusions we can draw from them.<br />
<br />
===Theory===<br />
<br />
::'''ANOVA'''<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
*Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}}$, and the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
*Difference between the means (compare each group mean to the grand mean): total sum of squares $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; sum of squares between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.}-\bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; sum of squares due to error (the pooled variation within groups): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within)$.<br />
<br />
*Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degrees of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistic||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$<br />
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
<br />
<br />
:::ANOVA hypotheses (general form)<br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}; H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared to the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.<br />
**ANOVA conditions: valid if (1) design conditions: each group of observations represents a random sample from its own population, and all observations within each group are independent of each other; (2) population conditions: the $k$ population distributions must be approximately normal (if the sample sizes are large, the normality condition is less crucial), and the standard deviations of all populations must be equal. The equal-spread condition can be slightly relaxed to $0.5\leq\frac{\sigma_{i}} {\sigma_{j}}\leq 2$ for all $i$ and $j$, that is, no population standard deviation is more than twice as large as any other. <br />
<br />
<br />
::'''Two-way ANOVA'''<br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notations first: two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the measurement at A-factor level $i$ and B-factor level $j$ with observation index $k$; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group (a balanced design); and $N$ is the total number of observations, $N=r*a*b$. In the model, $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$; the marginal means $\bar y_{i..}$ and $\bar y_{.j.}$ average over the omitted indices in the same way; and the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$. The total sum of squares decomposes as $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
<br />
:::Hypotheses <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
*Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
*Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
*Main effect: involves the independent variables (factors) one at a time. The interaction is ignored for this part. <br />
*Interaction effect: the effect that one factor has on the other factor. Its degrees of freedom are the product of the degrees of freedom of the two factors, $(a-1)(b-1)$.<br />
*Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistics||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$
|-
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$
|-
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
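As a hedged illustration of the table, the following R sketch computes the two-way sums of squares by hand for a balanced design and compares them with `aov`. The data are simulated and the names `d`, `mean_AB`, `SS_A` and so on are assumptions made for this example. The interaction term uses the standard form $(\bar y_{ij.}-\bar y_{i..}-\bar y_{.j.}+\bar y_{...})^{2}$.<br />

```r
# Hypothetical balanced design: a = 2 levels of A, b = 3 levels of B, r = 4 replicates
set.seed(2)
a <- 2; b <- 3; r <- 4
d <- expand.grid(rep = 1:r, A = factor(1:a), B = factor(1:b))
d$y <- rnorm(nrow(d), mean = as.integer(d$A) + 0.5 * as.integer(d$B))
N <- nrow(d)                                  # N = r * a * b

grand   <- mean(d$y)                          # grand mean
mean_A  <- tapply(d$y, d$A, mean)             # A-level means
mean_B  <- tapply(d$y, d$B, mean)             # B-level means
mean_AB <- tapply(d$y, list(d$A, d$B), mean)  # cell means (a x b matrix)

SS_A  <- r * b * sum((mean_A - grand)^2)
SS_B  <- r * a * sum((mean_B - grand)^2)
SS_AB <- r * sum((mean_AB - outer(mean_A, mean_B, "+") + grand)^2)
SS_E  <- sum((d$y - mean_AB[cbind(as.integer(d$A), as.integer(d$B))])^2)
SS_T  <- sum((d$y - grand)^2)

# Built-in table for comparison (rows: A, B, A:B, Residuals)
tab <- summary(aov(y ~ A * B, data = d))[[1]]
```

In a balanced design the hand-computed values should match the `Sum Sq` column of `tab`, and `SS_T` should equal `SS_A + SS_B + SS_AB + SS_E`. With unequal cell sizes the simple formulas above no longer apply.<br />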
<br />
*ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
<br />
<br />
::'''ANCOVA'''<br />
Analysis of Covariance (ANCOVA) blends ANOVA and regression. It evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
*Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. The F-test evaluates the difference between groups by dividing the explained variance between groups by the unexplained variance within the groups: $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. Without adjustment, the influence of the CVs is part of the denominator. When we control for the effect of the CVs on the DV, we remove it from the denominator, making $F$ bigger and thereby increasing our power to find a significant effect if one exists.<br />
*Adjusting preexisting differences in nonequivalent groups: ANCOVA can correct for initial group differences that exist on the DV among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than they would be without the CV, since the participants cannot be made equal through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would also remove considerable variance on the DV, which would make the result meaningless.<br />
*Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it won’t adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: check whether the CV significantly interacts with the IV by running an ANCOVA model that includes both the IV and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA. Instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and adjusted $MS_{error}$. (5) Follow-up analyses: if there was a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
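Steps (3) to (5) above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the authors' procedure: `mydata`, `x`, `A`, `slopes`, `ancova` and `adjusted_means` are hypothetical names, and Levene's test (step 2) is omitted because it needs an add-on package.<br />

```r
# Hypothetical data: DV y, three-level IV A, one continuous CV x (all simulated)
set.seed(3)
mydata <- data.frame(A = factor(rep(c("a1", "a2", "a3"), each = 20)),
                     x = rnorm(60, mean = 50, sd = 10))
mydata$y <- 2 + 0.3 * mydata$x + 1.5 * as.integer(mydata$A) + rnorm(60)

# Step 3: homogeneity of regression slopes, tested through the CV x IV interaction
slopes <- aov(y ~ x * A, data = mydata)
p_interaction <- summary(slopes)[[1]][["Pr(>F)"]][3]   # row 3 is the x:A term

# Step 4: if the interaction is not significant, refit without it.
# The covariate is entered first so the sequential test of A is adjusted for x.
ancova <- aov(y ~ x + A, data = mydata)
ancova_table <- summary(ancova)[[1]]

# Step 5 (starting point): adjusted group means, predicted at the mean of the covariate
newdata <- data.frame(A = factor(levels(mydata$A), levels = levels(mydata$A)),
                      x = mean(mydata$x))
adjusted_means <- predict(ancova, newdata)
```

Because `aov` uses sequential sums of squares, the order `x + A` matters: it tests the group effect after the covariate has been accounted for.<br />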
<br />
<br />
<br />
::'''MANOVA'''<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
*Relationship with ANOVA: <br />
**MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows adding of interval independents as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
**Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, that is, $A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular value decomposition of this matrix product, but there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
**MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variable, as in the one-way ANOVA, there are three dependent variables, as in the neuroimaging example below. Under the null hypothesis, it is assumed that scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all four groups are sampled from the same distribution.<br />
Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or Placebo), Therapy (Cognitive/Physical), MMSE, CDR, Imaging. It is better when the study design is balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects -- a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: The univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medications. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: The univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group irrespective of therapy. Interaction effects: (3) The univariate ANOVA interaction tells whether the four means for a single variable differ from the value predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vector predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while a MANOVA deals with a $(k*1)$ vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now, $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value, as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
<br />
*Common statistics are summarized based on the roots (eigenvalues) $\lambda_{p}$ of the $A$ matrix: (1) Samuel Stanley Wilks’ $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})=\left \| A \right \|_{\infty}$. A commonly cited ordering of these 4 major MANOVA tests, in terms of statistical power and robustness, is Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis and the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that, although $HE^{-1}$ and $E^{-1}H$ are in general different matrices, they are similar matrices ($E^{-1}H=E^{-1}(HE^{-1})E$) and so share the same eigenvalues. Because the test statistics depend only on these eigenvalues, the order of multiplication does not matter here.<br />
<br />
*All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda:\, 1- \Lambda$ is an index of variance explained by the model, $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression. Wilks’ $\Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.<br />
**Pillai’s Trace: Pillai’s criterion is the pooled effect variances. $Pillai’s\: trace=trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
**Hotelling’s Trace: the pooled ratio of effect variance to error variance:$trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
**Roy’s largest root: gives an upper bound for the F statistic, $Roy’s\: largest \,root=max(\hat{\lambda_{i}})$. <br />
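The four statistics can be checked numerically. The R sketch below is an illustration on simulated data: the names `g`, `Y`, `H`, `E` and `lambda` are assumptions for this example. It extracts $H$ and $E$ from a fitted `manova` object, computes the eigenvalues of $E^{-1}H$, and forms each statistic using the standard definitions $\prod 1/(1+\lambda_{i})$, $\sum \lambda_{i}/(1+\lambda_{i})$, $\sum \lambda_{i}$ and $max(\lambda_{i})$.<br />

```r
# Hypothetical data: three dependent variables, one four-level factor (simulated)
set.seed(4)
g <- factor(rep(1:4, each = 15))
Y <- cbind(y1 = rnorm(60) + as.integer(g),
           y2 = rnorm(60),
           y3 = rnorm(60) - 0.5 * as.integer(g))
fit <- manova(Y ~ g)

# H and E: hypothesis and error sums of squares and cross products matrices
ss <- summary(fit)$SS
H <- ss$g
E <- ss$Residuals

# Eigenvalues of E^{-1} H (real and non-negative up to rounding error)
lambda <- Re(eigen(solve(E) %*% H)$values)

wilks     <- prod(1 / (1 + lambda))      # |E| / |H + E|
pillai    <- sum(lambda / (1 + lambda))  # trace[H (H + E)^{-1}]
hotelling <- sum(lambda)                 # trace[H E^{-1}]
roy       <- max(lambda)                 # largest eigenvalue
```

Each value can be compared with the corresponding column of `summary(fit, test = "Wilks")$stats` (and likewise for `"Pillai"`, `"Hotelling-Lawley"` and `"Roy"`), which also reports the approximate $F$ ratio and p-value.<br />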
<br />
::'''MANCOVA'''<br />
A multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. The process of characterizing a covariate in a data source allows the reduction of the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$. MANCOVA allows the characterization of the difference in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
*Assumptions: (1) normality: for each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix between the dependent variables is equal across all levels of the independent variable.<br />
*A covariate represents a source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. Methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
*In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
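A minimal R sketch of a MANCOVA, assuming simulated data with two dependent variables, one covariate `x` and one three-level factor `A` (all names here are hypothetical). It fits the model with and without the covariate so that the multivariate test for the factor can be compared between the two.<br />

```r
# Hypothetical data: two dependent variables, one covariate x, one three-level factor A
set.seed(5)
A <- factor(rep(c("a1", "a2", "a3"), each = 20))
x <- rnorm(60)
Y <- cbind(y1 = 1 + 0.8 * x + as.integer(A) + rnorm(60),
           y2 = 2 - 0.5 * x + rnorm(60))

# Covariate first: the sequential multivariate test for A is then adjusted for x
fit_mancova <- manova(Y ~ x + A)
stats_mancova <- summary(fit_mancova, test = "Pillai")$stats

# For comparison, the same test without the covariate
fit_manova <- manova(Y ~ A)
stats_manova <- summary(fit_manova, test = "Pillai")$stats
```

Comparing the `approx F` entry for `A` in `stats_mancova` and `stats_manova` shows how the covariate changes the test. Whether the $F$ value rises or falls depends on how strongly the covariate is related to the dependent variables and to the grouping factor.<br />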
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid. <br />
<br />
===Software=== <br />
<br />
RCODE:<br />
fit <- aov(y ~ A, data = mydata) #one way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A+B, data=mydata) # randomized block design where B is the blocking factor<br />
fit <- aov(y ~ A+B+A*B, data=mydata) ## two way factorial design<br />
fit <- aov(y ~ A+x, data=mydata) ## analysis of covariance<br />
## for within subject designs, the data frame has to be rearranged for each measurement on a subject to be a separate observation<br />
 fit <- aov(y ~ A+Error(Subject/A), data=mydata) ## one within factor A; Subject identifies the subject<br />
 fit <- aov(y ~(W1*W2*B1*B2)+Error(Subject/(W1*W2))+(B1*B2),data=mydata) # two within factors W1 and W2, two between factors B1 and B2.<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1,y2,y3)<br />
fit <- manova(Y ~ A*B)<br />
<br />
===Problems===<br />
<br />
Use data on the CPI (consumer price index) for food, housing, transportation and medical care from 1981 to 2007 to do a two-way analysis of variance in R. We take ‘Month’ (12 levels) as one factor and the item the CPI is measured on (4 levels) as the other factor, which gives a 12*4 factorial design. The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv',header=T)<br />
 summary(CPI)<br />
 CPI$Month <- factor(CPI$Month) # convert inside the data frame so that aov(..., data=CPI) uses the factor<br />
 CPI$CPI_Item <- factor(CPI$CPI_Item)<br />
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
The results suggest that CPI_Item is a significant factor (p < 2e-16), while Month and the Month:CPI_Item interaction are not significant (p ≈ 1) for the CPI values in this dataset.<br />
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14535SMHS ANCOVA2014-10-29T13:42:08Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression and evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups, which is a generalized form of ANOVA. Similarly, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA and is designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but also adjusts for continuous covariates. What if we have more than one dependent variable and study multivariate observations? What if we want to see whether the interactions among dependent variables or changes in the independent variables influence the dependent variables? Then we will need to use the extensions of ANOVA and ANCOVA, MANOVA and MANCOVA respectively. So the question is how those methods work and what kind of conclusions we can draw from them.<br />
<br />
===Theory===<br />
<br />
*==ANOVA==<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
**One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
**Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+⋯+n_{k}$. The group mean for group $i$ is $\bar y_{i.} =\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}},$ the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
**Difference between the means (compare each group mean to the grand mean): total variation $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; difference between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} - \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; sum of squares due to error (combination of the variation within groups): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$<br />
<br />
*Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistics||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
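The entries of the one-way table can be reproduced by hand. The R sketch below is an illustration on simulated data with unequal group sizes: the vectors `y` and `group` and the names `SST_between`, `SSE_within` and so on are assumptions for this example. It computes the sums of squares, $F_{0}$ and the p-value from the $F$ distribution.<br />

```r
# Hypothetical data: k = 3 groups with unequal sizes (simulated)
set.seed(6)
y <- c(rnorm(8, 10), rnorm(12, 11), rnorm(10, 12))
group <- factor(rep(c("g1", "g2", "g3"), times = c(8, 12, 10)))

n <- length(y)                          # total number of observations
k <- nlevels(group)                     # number of groups
n_i    <- tapply(y, group, length)      # group sizes
ybar_i <- tapply(y, group, mean)        # group means
ybar   <- mean(y)                       # grand mean

SST_total   <- sum((y - ybar)^2)
SST_between <- sum(n_i * (ybar_i - ybar)^2)
SSE_within  <- sum((y - ybar_i[as.integer(group)])^2)

# F statistic and its upper-tail p-value
F0 <- (SST_between / (k - 1)) / (SSE_within / (n - k))
p_value <- pf(F0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
```

The same numbers are returned by `anova(lm(y ~ group))`, and `SST_total` equals `SST_between + SSE_within` up to rounding.<br />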
<br />
*==ANOVA hypotheses(general form)== <br />
$H_{0}:\mu_{1}=\mu_{2}=⋯=\mu_{k}; H_{a}:\mu_{i}≠\mu_{j}$ for some $i≠j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then the between-group variation is large relative to the within-group variation, that is, the discrepancies between the group means are large compared to the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.<br />
**ANOVA conditions: the test is valid if (1) design conditions: each group of observations is a random sample from its population, and the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions are approximately normal (if the sample sizes are large, the normality condition is less crucial), and the standard deviations of all populations are equal. The equal-spread condition can be slightly relaxed to $0.5≤\frac{\sigma_{i}} {\sigma_{j}}≤2$ for all $i$ and $j$, that is, no population standard deviation is more than twice any other. <br />
<br />
*==Two-way ANOVA== <br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
**Notations first: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1≤i≤a,1≤j≤b$ and $1≤k≤r$. Here $y_{ijk}$ is the measurement at A-factor level $i$, B-factor level $j$ and observation index $k$; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group; $N$ is the total number of observations and $N=r*a*b.$ $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and $j^{th}$ level of factor B. The group mean at A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$, the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
*==Hypotheses:== <br />
**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
**Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
**Main effect: involves the independent variables (factors) one at a time. The interaction is ignored for this part. <br />
**Interaction effect: the effect that one factor has on the other factor. Its degrees of freedom are the product of the degrees of freedom of the two factors, $(a-1)(b-1)$.<br />
**Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistics||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$
|-
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$
|-
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
<br />
**ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
<br />
*==ANCOVA==<br />
Analysis of Covariance (ANCOVA) blends ANOVA and regression. It evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV).<br />
**Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
**Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. The F-test evaluates the difference between groups by dividing the explained variance between groups by the unexplained variance within the groups, $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. Without adjustment, the influence of the CVs is part of the denominator. When we control for the effect of the CVs on the DV, we remove it from the denominator, which makes $F$ bigger and increases our power to find a significant effect if one exists.<br />
**Adjusting preexisting differences in nonequivalent groups: ANCOVA can correct for initial group differences that exist on the DV among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than they would be without the CV, since the participants cannot be made equivalent through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would remove considerable variance on the DV, which would make the result meaningless.<br />
**Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it will not adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: check whether the CV significantly interacts with the IV by running an ANCOVA model that includes both the IV and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and the adjusted $MS_{error}$. (5) Follow-up analyses: if there is a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
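<br />
The ANCOVA steps above can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated, all variable names are hypothetical, and base R's bartlett.test() is used as a stand-in for Levene’s test (which is provided by add-on packages rather than base R).<br />

```r
# Simulated data (hypothetical): three groups, one continuous covariate x
set.seed(2)
n <- 30
d <- data.frame(group = factor(rep(c("g1", "g2", "g3"), each = n)),
                x = rnorm(3 * n, mean = 50, sd = 10))
shift <- c(g1 = 0, g2 = 4, g3 = 8)
d$y <- 5 + 0.8 * d$x + unname(shift[as.character(d$group)]) + rnorm(3 * n, sd = 3)

# Step 3: homogeneity of regression slopes, tested through the CV-by-IV interaction
fit_int <- aov(y ~ x * group, data = d)
p_int <- summary(fit_int)[[1]][["Pr(>F)"]][3]   # p-value of the x:group term

# Step 4: ANCOVA without the interaction; aov() uses sequential sums of squares,
# so entering the covariate first gives a group effect adjusted for x
fit_ancova <- aov(y ~ x + group, data = d)
print(summary(fit_ancova))

# Step 2 (stand-in): equality of error variances across groups
print(bartlett.test(residuals(fit_ancova), d$group))

# Error mean squares with and without the covariate
mse_anova  <- summary(aov(y ~ group, data = d))[[1]][["Mean Sq"]][2]
mse_ancova <- summary(fit_ancova)[[1]][["Mean Sq"]][3]

# Adjusted group means: predictions at the overall mean of the covariate
nd  <- data.frame(group = factor(levels(d$group), levels = levels(d$group)),
                  x = mean(d$x))
adj <- predict(fit_ancova, newdata = nd)
```

In this simulation the covariate explains much of the within-group variation, so mse_ancova is expected to be considerably smaller than mse_anova, which is the mechanism behind the power gain described above.<br />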
<br />
::'''MANOVA'''<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
Relationship with ANOVA: <br />
**MANOVA is an extension of ANOVA; unlike ANOVA, it uses the variance-covariance between the dependent variables in testing the statistical significance of the mean differences. It also allows interval-scaled independent variables to be added as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
**Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, that is, $A=\Sigma_{model} \times \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular values of this matrix product; however, there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
**MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix form. Assume that instead of a single dependent variable in the one-way ANOVA, there are three dependent variables, as in the neuroimaging example. Under the null hypothesis, it is assumed that scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all four groups are sampled from the same distribution.<br />
**Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or placebo), the therapy (cognitive or physical), MMSE, CDR and Imaging. It is preferable for the study design to be balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., on variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medication. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effects: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while a MANOVA deals with a $(k*1)$ vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value, as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)<br />
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
<br />
*Common test statistics are based on the roots (eigenvalues) $\lambda_{p}$ of the matrix $A$: (1) Samuel Stanley Wilks’ lambda, $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})=\left \| A \right \|_{\infty}$. A commonly cited ordering of these four major MANOVA tests by robustness is Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis and the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}$. Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}$ and $E^{-1}H$ are transposes of each other and therefore have the same eigenvalues, so the order of multiplication does not affect the test statistics.<br />
**All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda$: $1- \Lambda$ is an index of the variance explained by the model, and $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression. Wilks’ $\Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.<br />
**Pillai’s trace: Pillai’s criterion is the pooled effect variance: $trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
**Hotelling’s trace: the pooled ratio of effect variance to error variance: $trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
**Roy’s largest root: gives an upper bound for the $F$ statistic: $\max_{i}(\lambda_{i})$.<br />
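<br />
The four statistics can be reproduced from the eigenvalues of $E^{-1}H$. The following is a minimal sketch in R on simulated data (the data, the grouping factor g and all variable names are illustrative assumptions); it extracts $H$ and $E$ from summary() of a manova() fit and compares the hand-computed values with those reported by R.<br />

```r
# Simulated one-way MANOVA: 3 groups, 3 dependent variables (hypothetical data)
set.seed(3)
g <- factor(rep(c("a", "b", "c"), each = 20))
Y <- cbind(y1 = rnorm(60), y2 = rnorm(60), y3 = rnorm(60))
Y[g == "b", 1] <- Y[g == "b", 1] + 1        # shift one group mean on y1

fit <- manova(Y ~ g)
ss  <- summary(fit)$SS                      # list of SSCP matrices by term
H   <- ss[["g"]]                            # hypothesis SSCP matrix
E   <- ss[["Residuals"]]                    # error SSCP matrix

# Eigenvalues of A = E^{-1} H (real in theory; Re() drops numerical noise)
lambda <- Re(eigen(solve(E) %*% H)$values)

wilks     <- prod(1 / (1 + lambda))
pillai    <- sum(lambda / (1 + lambda))
hotelling <- sum(lambda)
roy       <- max(lambda)

# Values reported by R for each test (second column of the stats matrix)
r_stat <- function(test) unname(summary(fit, test = test)$stats[1, 2])
print(c(wilks = wilks, pillai = pillai, hotelling = hotelling, roy = roy))
```

With three groups, $H$ has rank at most 2, so one eigenvalue is numerically zero; it contributes a factor of 1 to Wilks’ $\Lambda$ and nothing to the other three statistics.<br />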
<br />
::'''MANCOVA'''<br />
Multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. Characterizing a covariate in a data source allows the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$, to be reduced. MANCOVA allows the characterization of differences in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
**Assumptions: (1) normality: for each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrices between the dependent variables are equal across all levels of the independent variable.<br />
**A covariate represents a source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. Methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
**In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
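<br />
A MANCOVA can be fitted in R with manova() by adding the covariate to the model formula. The sketch below assumes simulated data with two dependent variables, one covariate x and a two-level group factor g (all names are hypothetical); the covariate is entered first so that the group effect is assessed after adjusting for it.<br />

```r
# Simulated data (hypothetical): two groups, one covariate, two dependent variables
set.seed(4)
n  <- 25
g  <- factor(rep(c("ctl", "trt"), each = n))
x  <- rnorm(2 * n)                                   # covariate
y1 <- 1.0 * x + 0.8 * (g == "trt") + rnorm(2 * n)
y2 <- 0.5 * x + 0.5 * (g == "trt") + rnorm(2 * n)

# MANCOVA: sequential tests, covariate entered before the group factor
fit_mancova <- manova(cbind(y1, y2) ~ x + g)
s_mancova   <- summary(fit_mancova, test = "Pillai")
print(s_mancova)

# MANOVA without the covariate, for comparison
s_manova <- summary(manova(cbind(y1, y2) ~ g), test = "Pillai")
print(s_manova)
```

Comparing the two summaries shows how the test of the group effect changes once the covariate is included; the direction of the change depends on how strongly the covariate is related to the dependent variables and to the groups.<br />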
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid. <br />
<br />
===Software=== <br />
<br />
RCODE:<br />
fit <- aov(y ~ A, data = mydata) #one way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A+B, data=mydata) # randomized block design where B is the blocking factor<br />
fit <- aov(y ~ A+B+A*B, data=mydata) ## two way factorial design<br />
fit <- aov(y ~ A+x, data=mydata) ## analysis of covariance<br />
## for within subject designs, the data frame has to be rearranged for each measurement on a subject to be a separate observation<br />
fit <- aov(y ~ A+Error(subject/A), data=mydata) ## one within factor<br />
fit <- aov(y ~ (W1*W2*B1*B2) + Error(Subject/(W1*W2)) + (B1*B2), data=mydata) # two within factors W1 and W2, two between factors B1 and B2<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1,y2,y3)<br />
fit <- manova(Y ~ A*B)<br />
<br />
===Problems===<br />
<br />
Use the CPI (consumer price index) data for food, housing, transportation and medical care from 1981 to 2007 to do a two-way analysis of variance in R. We take ‘Month’ as one factor (12 levels) and the item on which the CPI was measured as the other factor (4 levels), which gives a 12 * 4 factorial design. The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv', header=TRUE)<br />
summary(CPI)<br />
CPI$Month <- factor(CPI$Month) # convert inside the data frame so that aov(..., data=CPI) uses the factors<br />
CPI$CPI_Item <- factor(CPI$CPI_Item)<br />
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
Based on this dataset, CPI_Item is a significant factor for the CPI (p < 2e-16), while neither Month nor the Month by CPI_Item interaction is significant (p close to 1).<br />
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14534SMHS ANCOVA2014-10-29T13:38:19Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups, and is a generalized form of ANOVA. Similarly, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but additionally adjusts for covariates. What if we have more than one dependent variable and study multivariate observations? What if we want to see whether interactions among dependent variables, or changes in the independent variables, influence the dependent variables? Then we need the extensions of ANOVA and ANCOVA, namely MANOVA and MANCOVA respectively. So the questions are: how do these methods work, and what kinds of conclusions can we draw from them?<br />
<br />
===Theory===<br />
<br />
==ANOVA==<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
**Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.} =\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}},$ the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
**Difference between the means (compare each group mean to the grand mean): total variation $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; difference between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} - \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; sum of squares due to error (combination of variation within groups): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$<br />
<br />
**Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistics||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$<br />
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
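<br />
The one-way ANOVA table can be reproduced step by step. The following is a minimal sketch in R on simulated data with unequal group sizes (the data and variable names are illustrative assumptions); it computes the between-group and within-group sums of squares from their definitions and checks the resulting $F_{0}$ against aov().<br />

```r
# Simulated one-way layout with k = 3 groups of unequal size (hypothetical data)
set.seed(5)
y   <- c(rnorm(8, mean = 10), rnorm(12, mean = 11), rnorm(10, mean = 13))
grp <- factor(rep(c("g1", "g2", "g3"), times = c(8, 12, 10)))

n      <- length(y)                 # total number of observations
k      <- nlevels(grp)              # number of groups
n_i    <- tapply(y, grp, length)    # group sizes
ybar_i <- tapply(y, grp, mean)      # group means
ybar   <- mean(y)                   # grand mean

SSB <- sum(n_i * (ybar_i - ybar)^2) # SST(between)
SSE <- sum((y - ave(y, grp))^2)     # SSE(within); ave() returns each observation's group mean
SST <- sum((y - ybar)^2)            # SST(total)

F0 <- (SSB / (k - 1)) / (SSE / (n - k))
p  <- pf(F0, k - 1, n - k, lower.tail = FALSE)

tab <- summary(aov(y ~ grp))[[1]]   # reference table from aov()
print(tab)
```

The decomposition SST(total) = SST(between) + SSE(within) holds for unequal group sizes as well, which the sketch illustrates numerically.<br />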
<br />
==ANOVA hypotheses(general form)== <br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}; H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation, that is, the discrepancies between the group means are large compared to the variability within the groups (error). Hence a large $F_{0}$ provides strong evidence against $H_{0}$.<br />
*ANOVA conditions: the analysis is valid if (1) design conditions: each group of observations represents a random sample from its population, and all the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions must be approximately normal (if the sample size is large, the normality condition is less crucial), and the standard deviations of all populations must be equal. The latter can be slightly relaxed to $0.5\le\frac{\sigma_{i}} {\sigma_{j}}\le 2$ for all $i$ and $j$, that is, no population standard deviation is more than twice as large as any other. <br />
<br />
==Two-way ANOVA== <br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notations first: two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\le i\le a$, $1\le j\le b$ and $1\le k\le r$. Here $y_{ijk}$ is the measurement at A-factor level $i$, B-factor level $j$ and observation index $k$; $r$ is the number of replications; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $N$ is the total number of observations and $N=r*a*b.$ Further, $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean for the group at A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$, the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
==Hypotheses:== <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
**Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
**Main effect: involves the independent variables one at a time. The interaction is ignored for this part. <br />
**Interaction effect: the effect that one factor has on the other factor. The degree of freedom is the product of the two degrees of freedom of each factor.<br />
**Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistics||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$<br />
|-<br />
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$<br />
|-<br />
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..}-\bar y_{.j.}+\bar y_{...})^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$<br />
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
<br />
*ANOVA conditions: the analysis is valid if (1) the populations from which the samples were obtained are normally or approximately normally distributed; (2) the samples are independent; (3) the population variances are equal; (4) the groups have the same sample size.<br />
<br />
==ANCOVA==<br />
Analysis of Covariance (ANCOVA) blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
*Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. The F-test evaluates the difference between groups by dividing the explained variance between groups by the unexplained variance within the groups, $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. Without adjustment, the influence of the CVs is part of the denominator. When we control for the effect of the CVs on the DV, we remove it from the denominator, which makes $F$ bigger and increases our power to find a significant effect if one exists.<br />
**Adjusting preexisting differences in nonequivalent groups: ANCOVA can correct for initial group differences that exist on the DV among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than they would be without the CV, since the participants cannot be made equivalent through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would remove considerable variance on the DV, which would make the result meaningless.<br />
**Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it will not adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: check whether the CV significantly interacts with the IV by running an ANCOVA model that includes both the IV and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and the adjusted $MS_{error}$. (5) Follow-up analyses: if there is a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
<br />
==MANOVA==<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
Relationship with ANOVA: <br />
*MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows adding of interval independents as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
*Analogous to ANOVA, MANOVA is based on the product of model variance matrix, $\Sigma_{model}$ and inverse of the error variance matrix, $\Sigma_{res}^{-1}, or A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{residual}$ implies that the product $A \sim I$. Invariance considerations imply the MANOVA statistic should be a measure of magnitude of the singular value decomposition of this matrix product, there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
*MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variable in the one-way ANOVA, there are three dependent variables as in our neuroimaging example. Under the null hypothesis, it is assumed that scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all four groups are sampled from the same distribution.<br />
*Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or Placebo), the therapy (cognitive or physical), MMSE, CDR and Imaging. It is better when the study design is balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., on variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medication. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effects: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while a MANOVA deals with a $(k*1)$ vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now, $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value, as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)<br />
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
<br />
*Common statistics are summarized based on the roots (eigenvalues) $\lambda_{p}$ of the $A$ matrix: (1) Samuel Stanley Wilks’ lambda, $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})$. In terms of robustness, these four major MANOVA tests are commonly ordered as Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis to the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}=(E^{-1}H)^{T}$, so $HE^{-1}$ and $E^{-1}H$ have the same eigenvalues. Because the test statistics depend only on these eigenvalues, the order of the matrix multiplication does not matter here.<br />
*All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
*Wilks’ $\Lambda$: $1- \Lambda$ is an index of the variance explained by the model; $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression. Wilks’ $\Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.<br />
*Pillai’s trace: Pillai’s criterion is the pooled effect variance: $trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
*Hotelling’s trace: the pooled ratio of effect variance to error variance: $trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
*Roy’s largest root: gives an upper bound for the $F$ statistic: $\max_{i}(\lambda_{i})$. <br />
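<br />
As a numerical check of the four statistics, the following R sketch computes them from the eigenvalues of $E^{-1}H$ and compares the Wilks and Pillai values with those reported by <code>summary()</code> for a <code>manova</code> fit. The use of R's built-in <code>iris</code> data is an assumption made purely for illustration; it is not the dataset discussed in this section.<br />

```r
# One-way MANOVA on the built-in iris data (illustrative choice of data)
Y   <- as.matrix(iris[, 1:4])
fit <- manova(Y ~ Species, data = iris)

# summary() of a manova fit stores the hypothesis and residual SSCP matrices in $SS
ss <- summary(fit)$SS
H  <- ss$Species
E  <- ss$Residuals

# Eigenvalues of E^{-1} H (real and non-negative up to rounding error)
lambda <- Re(eigen(solve(E) %*% H)$values)

wilks     <- prod(1 / (1 + lambda))       # |E| / |H + E|
pillai    <- sum(lambda / (1 + lambda))   # trace[H (H + E)^{-1}]
hotelling <- sum(lambda)                  # trace[H E^{-1}]
roy       <- max(lambda)                  # largest eigenvalue

# Compare with the values computed by R
wilks_R  <- summary(fit, test = "Wilks")$stats["Species", "Wilks"]
pillai_R <- summary(fit, test = "Pillai")$stats["Species", "Pillai"]
c(wilks = wilks, wilks_R = wilks_R, pillai = pillai, pillai_R = pillai_R,
  hotelling = hotelling, roy = roy)
```

With three groups, $H$ has rank two, so only two eigenvalues are materially different from zero; the remaining ones contribute nothing to the sums and a factor of one to the product.<br />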
<br />
==MANCOVA==<br />
A multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. Characterizing a covariate in a data source allows the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$, to be reduced. MANCOVA allows the characterization of the difference in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
*Assumptions: (1) normality: for each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix between the dependent variables is equal across all levels of the independent variable.<br />
*A covariate represents a source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. Methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
*In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
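<br />
A MANCOVA can be sketched in R by adding a continuous covariate to the right-hand side of a <code>manova</code> formula. The example below is a hedged illustration on simulated data: the data frame <code>dat</code>, the variables <code>y1</code>, <code>y2</code>, <code>x</code> and <code>group</code>, and all effect sizes are hypothetical. Because <code>manova</code> tests terms sequentially, the covariate is listed before the grouping factor.<br />

```r
# Hypothetical simulated data: two DVs (y1, y2), one covariate (x), one 3-level factor (group)
set.seed(2)
n <- 40
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = n)),
  x     = rnorm(3 * n)
)
shift  <- c(0, 1, 2)[as.integer(dat$group)]   # group effect on both DVs
dat$y1 <- 1.0 * dat$x + shift + rnorm(3 * n)
dat$y2 <- 0.5 * dat$x - shift + rnorm(3 * n)

# MANCOVA: covariate first, so the sequential test for group is adjusted for x
fit_mancova <- manova(cbind(y1, y2) ~ x + group, data = dat)
res <- summary(fit_mancova, test = "Pillai")
res

# MANOVA without the covariate, for comparison of the group test
fit_manova <- manova(cbind(y1, y2) ~ group, data = dat)
summary(fit_manova, test = "Pillai")
```

Comparing the two summaries shows how much of the residual variation the covariate absorbs in this simulated setting.<br />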
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid. <br />
<br />
===Software=== <br />
<br />
RCODE:<br />
fit <- aov(y ~ A, data = mydata) #one way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A+B, data=mydata) # randomized block design where B is the blocking factor<br />
fit <- aov(y ~ A+B+A*B, data=mydata) ## two way factorial design<br />
fit <- aov(y ~ A+x, data=mydata) ## analysis of covariance<br />
## for within subject designs, the data frame has to be rearranged for each measurement on a subject to be a separate observation<br />
 fit <- aov(y ~ A+Error(Subject/A), data=mydata) ## one within factor A; Subject identifies the subject<br />
 fit <- aov(y ~ (W1*W2*B1*B2)+Error(Subject/(W1*W2))+(B1*B2), data=mydata) # two within factors W1 and W2, two between factors B1 and B2<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1,y2,y3)<br />
fit <- manova(Y ~ A*B)<br />
<br />
===Problems===<br />
<br />
Use the CPI (consumer price index) data for food, housing, transportation and medical care from 1981 to 2007 to carry out a two-way analysis of variance in R. We take ‘Month’ as one factor and the item on which the CPI is measured as the other factor, giving a 12*4 factorial design (12 months by 4 items). The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
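 CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv', header=TRUE)<br />
 summary(CPI)<br />
 ## convert both grouping variables to factors inside the data frame, so that aov(..., data=CPI) uses them<br />
 CPI <- transform(CPI, Month = factor(Month), CPI_Item = factor(CPI_Item))<br />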
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
                   Month  CPI_Item Month:CPI_Item Residuals<br />
 Sum of Squares   3282.6 1078702.9          706.8 2987673.8<br />
 Deg. of Freedom      11         3             33      1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
                  Df  Sum Sq Mean Sq F value Pr(>F)    <br />
 Month            11    3283     298   0.125      1    <br />
 CPI_Item          3 1078703  359568 150.197 <2e-16 ***<br />
 Month:CPI_Item   33     707      21   0.009      1    <br />
 Residuals      1248 2987674    2394                   <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
                 CPI_Item Residuals<br />
 Sum of Squares   1078703   2991663<br />
 Deg. of Freedom        3      1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
               Df  Sum Sq Mean Sq F value Pr(>F)    <br />
 CPI_Item       3 1078703  359568   155.3 <2e-16 ***<br />
 Residuals   1292 2991663    2316                   <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
Based on this dataset, CPI_Item is a significant factor (p < 2e-16), whereas Month and the Month by CPI_Item interaction are not (p close to 1).<br />
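<br />
The F value reported for CPI_Item in the two-way table can be recomputed from the printed sums of squares and degrees of freedom. The short R sketch below uses only numbers copied from that output; the variable names are illustrative.<br />

```r
# Values copied from the printed two-way ANOVA output
ss_item  <- 1078702.9; df_item  <- 3
ss_resid <- 2987673.8; df_resid <- 1248

ms_item  <- ss_item / df_item       # mean square for CPI_Item
ms_resid <- ss_resid / df_resid     # mean square error
f_item   <- ms_item / ms_resid      # F statistic for CPI_Item
p_item   <- pf(f_item, df_item, df_resid, lower.tail = FALSE)

round(c(MS_item = ms_item, MSE = ms_resid, F = f_item), 3)
p_item
```

The same arithmetic applies to the Month and interaction rows by substituting their sums of squares and degrees of freedom.<br />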
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14533SMHS ANCOVA2014-10-28T18:20:31Z<p>Clgalla: /MANCOVA/</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) is the common method applied to blend ANOVA and regression and evaluate whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups, which is a generalized form of ANOVA. Similar to MANOVA, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA and is designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but also adjusts for continuous covariates. What if we have more than one dependent variable and study multivariate observations? What if we want to see whether interactions among the dependent variables or changes in the independent variables influence the dependent variables? Then we need the multivariate extensions of ANOVA and ANCOVA: MANOVA and MANCOVA, respectively. So the questions are how these methods work and what kinds of conclusions we can draw from them.<br />
<br />
===Theory===<br />
<br />
==ANOVA==<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
**Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.} =\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}},$ the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
**Difference between the means (compare each group mean to the grand mean): total variation $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; difference between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} - \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; sum of squares due to error (combination of variation within groups): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$<br />
<br />
**Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degrees of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistic||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$<br />
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
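<br />
The one-way decomposition and the F statistic in the table above can be verified numerically. The R sketch below uses hypothetical simulated data with unequal group sizes; the names <code>dat</code>, <code>group</code> and <code>y</code> are illustrative assumptions. It computes the sums of squares by hand and compares the resulting F value with the one from <code>aov</code>.<br />

```r
# Hypothetical simulated data: k = 3 groups with unequal sizes
set.seed(3)
dat <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), times = c(8, 10, 12))),
  y     = c(rnorm(8, 10), rnorm(10, 11), rnorm(12, 13))
)
n <- nrow(dat)
k <- nlevels(dat$group)

grand_mean  <- mean(dat$y)
group_means <- tapply(dat$y, dat$group, mean)     # ybar_i.
n_i         <- tapply(dat$y, dat$group, length)   # group sizes

sst_total   <- sum((dat$y - grand_mean)^2)                            # df = n - 1
sst_between <- sum(n_i * (group_means - grand_mean)^2)                # df = k - 1
sse_within  <- sum((dat$y - group_means[as.character(dat$group)])^2)  # df = n - k

F0 <- (sst_between / (k - 1)) / (sse_within / (n - k))
p0 <- pf(F0, k - 1, n - k, lower.tail = FALSE)

# Compare with aov()
tab <- summary(aov(y ~ group, data = dat))[[1]]
c(F_by_hand = F0, F_aov = tab[["F value"]][1], p_by_hand = p0)
```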
<br />
==ANOVA hypotheses(general form)== <br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}; H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation. Therefore, the discrepancies between the group means are large compared to the variability within the groups (error). That is, a large $F_{0}$ provides strong evidence against $H_{0}$.<br />
*ANOVA conditions: valid if (1) design conditions: all groups of observations represent random samples from their respective populations, and all the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions must be approximately normal. If the sample size is large, the normality condition is less crucial. In addition, the standard deviations of all populations are equal, which can be slightly relaxed to $0.5\leq\frac{\sigma_{i}} {\sigma_{j}}\leq 2$ for all $i$ and $j$, i.e., no population standard deviation is more than twice as large as any other. <br />
<br />
==Two-way ANOVA== <br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notations first: two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the measurement at A-factor level $i$, B-factor level $j$ and observation index $k$; $r$ is the number of replications; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $N$ is the total number of observations and $N=r*a*b.$ In the model, $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The cell mean for A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$, the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
==Hypotheses:== <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
**Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
**Main effect: involves the independent variables one at a time. The interaction is ignored for this part. <br />
**Interaction effect: the effect that one factor has on the other factor. The degrees of freedom of the interaction are the product of the degrees of freedom of the two factors.<br />
**Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degrees of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistic||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$<br />
|-<br />
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$<br />
|-<br />
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$<br />
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
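<br />
For a balanced design, the two-way sums of squares in the table above can be checked by hand against <code>aov</code>. The R sketch below is an illustration on hypothetical simulated data with $a=2$, $b=3$ and $r=5$; all object names and the simulated effects are assumptions.<br />

```r
# Hypothetical balanced design: a = 2 levels of A, b = 3 levels of B, r = 5 replicates per cell
set.seed(4)
a <- 2; b <- 3; r <- 5
dat <- expand.grid(rep = 1:r, A = factor(1:a), B = factor(1:b))
dat$y <- rnorm(nrow(dat), mean = as.integer(dat$A) + 2 * as.integer(dat$B))

m_all <- mean(dat$y)                                # grand mean
m_A   <- tapply(dat$y, dat$A, mean)                 # ybar_i..
m_B   <- tapply(dat$y, dat$B, mean)                 # ybar_.j.
m_AB  <- tapply(dat$y, list(dat$A, dat$B), mean)    # ybar_ij. (a x b matrix)

ss_A  <- r * b * sum((m_A - m_all)^2)
ss_B  <- r * a * sum((m_B - m_all)^2)
ss_AB <- r * sum((m_AB - outer(m_A, m_B, "+") + m_all)^2)
ss_E  <- sum((dat$y - m_AB[cbind(as.integer(dat$A), as.integer(dat$B))])^2)
ss_T  <- sum((dat$y - m_all)^2)

# Compare with aov(); rows are A, B, A:B, Residuals
tab <- summary(aov(y ~ A * B, data = dat))[[1]]
rbind(by_hand = c(ss_A, ss_B, ss_AB, ss_E), aov = tab[["Sum Sq"]])
```

The agreement relies on the design being balanced; with unequal cell sizes the sequential sums of squares reported by <code>aov</code> depend on the order of the terms.<br />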
<br />
*ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
<br />
==ANCOVA==<br />
Analysis of Covariance is the common method applied to blend ANOVA and regression and evaluate whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
*Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. Use the F-test to evaluate differences between groups by dividing the explained variance between groups by the unexplained variance within the groups: $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. The influence of the CVs is grouped into the denominator. When we control for the effect of the CVs on the DV, we remove it from the denominator, making $F$ bigger and thereby increasing our power to find a significant effect if one exists.<br />
**Adjusting preexisting differences in nonequivalent groups: correct for initial group differences that exist on the DV among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than without the CV, since the participants cannot be made equal through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would remove considerable variance on the DV, which would make the result meaningless.<br />
**Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it won’t adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: check whether the CV significantly interacts with the IV by running an ANCOVA model that includes the IV, the CV, and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and adjusted MSerror. (5) Follow-up analyses: a significant main effect indicates a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
<br />
==MANOVA==<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
Relationship with ANOVA: <br />
*MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows interval-scaled independent variables to be added as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
*Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, that is, $A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular value decomposition of this matrix product, but there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
*MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variable in the one-way ANOVA, there are three dependent variables as in our neuroimaging example. Under the null hypothesis, it is assumed that scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all four groups are sampled from the same distribution.<br />
*Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or Placebo), the therapy (cognitive or physical), MMSE, CDR and Imaging. It is better when the study design is balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., on variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medication. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effects: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while a MANOVA deals with a $(k*1)$ vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now, $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value, as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)<br />
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
<br />
*Common statistics are summarized based on the roots (eigenvalues) $\lambda_{p}$ of the $A$ matrix: (1) Samuel Stanley Wilks’ lambda, $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as Wilks’ lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})$. A commonly cited ordering of these four MANOVA tests in terms of power and robustness is Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis to the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}$ is the transpose of $E^{-1}H$, so the two products have the same eigenvalues. The order of multiplication therefore does not affect the test statistics, even though the two matrices are generally not equal.<br />
**All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis, and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda$: $1- \Lambda$ is an index of variance explained by the model, and $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression. Wilks’ $\Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.<br />
*Pillai’s Trace: Pillai’s criterion is the pooled effect variances. $Pillai’s\: trace=trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
*Hotelling’s Trace: the pooled ratio of effect variance to error variance:$trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
*Roy’s largest root: gives an upper bound for the F statistic, $Roy’s\: largest \,root=max(\hat{\lambda_{i}})$. <br />
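<br />
The four statistics listed above can be computed directly from the eigenvalues of $A=E^{-1}H$. The following R sketch does this for the therapy main effect. The data are simulated, and the sample size, effect sizes and variable values are assumptions made only for illustration; the $H$ and $E$ matrices are taken from the SS component that summary() returns for a manova fit.<br />

```r
# Synthetic 2*2 data: names mirror the example in the text, values are made up.
set.seed(1)
n <- 80
dat <- data.frame(
  therapy    = factor(rep(c("cognitive", "physical"), each = n / 2)),
  medication = factor(rep(c("placebo", "vitaminE"), times = n / 2))
)
dat$MMSE    <- rnorm(n, mean = 25 + 1.5 * (dat$therapy == "physical"))
dat$CDR     <- rnorm(n, mean = 1 + 0.5 * (dat$medication == "vitaminE"))
dat$Imaging <- rnorm(n, mean = 10)

fit <- manova(cbind(MMSE, CDR, Imaging) ~ therapy * medication, data = dat)

ss <- summary(fit)$SS      # list of SSCP matrices, one per model term
H  <- ss$therapy           # hypothesis SSCP matrix for the therapy main effect
E  <- ss$Residuals         # error SSCP matrix

# Eigenvalues of A = E^{-1} H (Re() drops numerically negligible imaginary parts)
lambda <- Re(eigen(solve(E) %*% H)$values)

wilks     <- prod(1 / (1 + lambda))       # Wilks' lambda
pillai    <- sum(lambda / (1 + lambda))   # Pillai's trace
hotelling <- sum(lambda)                  # Hotelling-Lawley trace
roy       <- max(lambda)                  # Roy's largest root

print(c(Wilks = wilks, Pillai = pillai, Hotelling = hotelling, Roy = roy))
```

Each value can be compared with the corresponding entry reported by summary(fit, test = "Wilks"), and likewise for "Pillai", "Hotelling-Lawley" and "Roy".<br />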
<br />
<br />
===MANCOVA===<br />
A multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. Characterizing a covariate in a data source reduces the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$. MANCOVA characterizes the difference in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
**Assumptions: (1) normality: for each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix between dependent variables is equal across all levels of each independent variable.<br />
**A covariate represents a source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. Methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
**In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
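<br />
A minimal MANCOVA sketch in R, under stated assumptions: the data are simulated, and the names group, age, y1 and y2 are hypothetical. It fits the model with and without the covariate to illustrate how adjusting for a covariate shrinks the error sums of squares.<br />

```r
# Synthetic data: two outcomes that both depend on a covariate (age) and on group.
set.seed(2)
n     <- 60
group <- factor(rep(c("control", "treated"), each = n / 2))
age   <- rnorm(n, mean = 70, sd = 5)                  # hypothetical covariate
y1    <- 20 + 0.3 * age + 2 * (group == "treated") + rnorm(n)
y2    <-  5 + 0.1 * age + 1 * (group == "treated") + rnorm(n)
dat   <- data.frame(group, age, y1, y2)

# The covariate is entered first, so the sequential test for group is adjusted for age.
fit_mancova <- manova(cbind(y1, y2) ~ age + group, data = dat)
print(summary(fit_mancova, test = "Pillai"))

# Same model without the covariate, for comparison of the error SSCP matrices.
fit_manova <- manova(cbind(y1, y2) ~ group, data = dat)
E_with    <- summary(fit_mancova)$SS$Residuals
E_without <- summary(fit_manova)$SS$Residuals

# Error sums of squares (diagonal of E) with and without the covariate
print(rbind(with_covariate = diag(E_with), without_covariate = diag(E_without)))
```

Because manova() uses sequential sums of squares, the order of terms in the formula matters when the covariate and the grouping factor are correlated.<br />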
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid. <br />
<br />
===Software=== <br />
RCODE:<br />
fit <- aov(y ~ A, data = mydata) #one way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A+B, data=mydata) # randomized block design where B is the blocking factor<br />
fit <- aov(y ~ A+B+A*B, data=mydata) ## two way factorial design<br />
fit <- aov(y ~ A+x, data=mydata) ## analysis of covariance<br />
## for within subject designs, the data frame has to be rearranged for each measurement on a subject to be a separate observation<br />
fit <- aov(y ~ A+Error(subject/A), data=mydata) ## one within factor<br />
fit <- aov(y ~ (W1*W2*B1*B2) + Error(Subject/(W1*W2)) + (B1*B2), data=mydata) # two within factors W1 and W2, two between factors B1 and B2<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1,y2,y3)<br />
fit <- manova(Y ~ A*B)<br />
<br />
===Problems===<br />
<br />
Use data on the CPI (consumer price index) for food, housing, transportation and medical care from 1981 to 2007 to do a two-way analysis of variance in R. We take ‘Month’ as one factor and the item on which the CPI is measured (‘CPI_Item’) as the other factor, giving a $12*4$ factorial design (12 months by 4 items). The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv', header=TRUE)<br />
summary(CPI)<br />
CPI$Month <- factor(CPI$Month) # convert inside the data frame so that aov(..., data=CPI) uses the factors<br />
CPI$CPI_Item <- factor(CPI$CPI_Item)<br />
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
Based on this dataset, CPI_Item is a significant factor (p < 2e-16), while neither Month nor the Month:CPI_Item interaction has a significant effect on the CPI (both p-values are close to 1).<br />
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14532SMHS ANCOVA2014-10-28T18:19:30Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression to evaluate whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups, which is a generalized form of ANOVA. Similarly, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA and is designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but additionally adjusts for continuous covariates. What if we have more than one dependent variable and study multivariate observations? What if we want to see whether the interactions among dependent variables or changes in the independent variables influence the dependent variables? Then we will need to use the extensions of ANOVA and ANCOVA, MANOVA and MANCOVA respectively. So the questions are how these methods work and what kind of conclusions we can draw from them.<br />
<br />
===Theory===<br />
<br />
==ANOVA==<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
**Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.} =\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}},$ the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
**Difference between the means (compare each group mean to the grand mean): total variation $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; difference between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} - \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; sum of squares due to error (combination of variation within groups): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$<br />
<br />
**Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistic||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$<br />
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
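<br />
The entries of the one-way ANOVA table can be reproduced by hand. The R sketch below does so on simulated data (the three groups, their sizes and their means are assumptions for illustration) and compares the result with aov().<br />

```r
# Synthetic one-way layout with k = 3 groups of unequal size.
set.seed(3)
y     <- c(rnorm(10, mean = 5), rnorm(12, mean = 6), rnorm(8, mean = 7))
group <- factor(rep(c("g1", "g2", "g3"), times = c(10, 12, 8)))

n <- length(y)          # total number of observations
k <- nlevels(group)     # number of groups

grand_mean  <- mean(y)                    # \bar y_{..}
group_means <- tapply(y, group, mean)     # \bar y_{i.}
n_i         <- tapply(y, group, length)   # n_i

sst_total   <- sum((y - grand_mean)^2)
sst_between <- sum(n_i * (group_means - grand_mean)^2)
sse_within  <- sum((y - ave(y, group))^2)   # ave() returns each observation's group mean

F0 <- (sst_between / (k - 1)) / (sse_within / (n - k))
p  <- pf(F0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Reference table from aov(); rows are the group term and the residuals.
tab <- summary(aov(y ~ group))[[1]]
print(tab)
print(c(F0 = F0, p_value = p))
```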
<br />
==ANOVA hypotheses(general form)== <br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}; H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation. Therefore, the discrepancies between the group means are large compared to the variability within the groups (error). That is, a large $F_{0}$ provides strong evidence against $H_{0}$.<br />
*ANOVA conditions: valid if (1) design conditions: all groups of observations represent random samples from their respective populations, and all the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions must be approximately normal. If the sample size is large, the normality condition is less crucial. In addition, the standard deviations of all populations should be equal, which can be slightly relaxed to $0.5\leq\frac{\sigma_{i}} {\sigma_{j}}\leq 2$ for all $i$ and $j$, i.e., no population standard deviation is more than twice as large as any other. <br />
<br />
==Two-way ANOVA== <br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notations first: two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\leq i\leq a, 1\leq j\leq b$ and $1\leq k\leq r$. $y_{ijk}$ is the measurement at A-factor level $i$, B-factor level $j$ and observation index $k$; $r$ is the number of replications per treatment group; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $N$ is the total number of observations and $N=r*a*b.$ Here $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The group mean at A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$, the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
==Hypotheses:== <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
**Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
**Main effect: involves the independent variables one at a time. The interaction is ignored for this part. <br />
**Interaction effect: the effect that one factor has on the other factor. The degrees of freedom for the interaction is the product of the degrees of freedom of the two factors.<br />
**Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistic||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$<br />
|-<br />
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$<br />
|-<br />
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$<br />
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
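<br />
For a balanced design, the sums of squares in the two-way table can be checked numerically. The R sketch below uses simulated data with $a=3$, $b=2$ and $r=5$ (all values are assumptions chosen for illustration) and compares the hand-computed sums of squares with aov().<br />

```r
# Synthetic balanced two-way layout: a levels of A, b levels of B, r replicates per cell.
set.seed(4)
a <- 3; b <- 2; r <- 5
dat <- expand.grid(rep = 1:r, A = factor(1:a), B = factor(1:b))
dat$y <- rnorm(nrow(dat), mean = as.integer(dat$A) + 2 * as.integer(dat$B))

grand   <- mean(dat$y)                 # \bar y_{...}
mean_A  <- ave(dat$y, dat$A)           # \bar y_{i..}, repeated for every row
mean_B  <- ave(dat$y, dat$B)           # \bar y_{.j.}
mean_AB <- ave(dat$y, dat$A, dat$B)    # \bar y_{ij.}

# Summing over all N rows supplies the multipliers rb, ra and r automatically.
ss_A  <- sum((mean_A - grand)^2)
ss_B  <- sum((mean_B - grand)^2)
ss_AB <- sum((mean_AB - mean_A - mean_B + grand)^2)
sse   <- sum((dat$y - mean_AB)^2)
sst   <- sum((dat$y - grand)^2)

# Reference sums of squares from aov(): rows are A, B, A:B, Residuals.
ref <- summary(aov(y ~ A * B, data = dat))[[1]][, "Sum Sq"]
print(cbind(by_hand = c(ss_A, ss_B, ss_AB, sse), aov = ref))
```

The equality $SST=SS(A)+SS(B)+SS(AB)+SSE$ holds exactly here because the design is balanced; with unequal cell sizes the sequential sums of squares from aov() depend on the order of the terms.<br />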
<br />
*ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
<br />
==ANCOVA==<br />
Analysis of Covariance (ANCOVA) blends ANOVA and regression to evaluate whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
*Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. The F-test evaluates differences between groups by dividing the explained variance between groups by the unexplained variance within the groups: $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. Without adjustment, the influence of the CVs is included in the denominator. When we control for the effect of the CVs on the DV, we remove it from the denominator, making $F$ larger and thereby increasing our power to find a significant effect if one exists.<br />
**Adjusting preexisting differences in nonequivalent groups: ANCOVA can correct for initial group differences that exist on the DV among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than without the CV, since the participants cannot be made equal through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would remove considerable variance on the DV, which would make the result meaningless.<br />
**Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it won’t adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: test whether the CV significantly interacts with the IV by running an ANCOVA model that includes the IV, the CV and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and adjusted $MS_{error}$. (5) Follow-up analyses: if there was a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
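<br />
Steps (2) to (4) above can be sketched in R as follows. The data are simulated with one covariate x and a three-level factor group (both hypothetical), and Bartlett's test from base R is used here as a stand-in for Levene’s test, which is not part of base R.<br />

```r
# Synthetic data: DV y depends linearly on covariate x, with parallel slopes across groups.
set.seed(5)
n     <- 90
group <- factor(rep(c("A", "B", "C"), each = n / 3))
x     <- rnorm(n, mean = 50, sd = 10)                      # covariate (CV)
y     <- 10 + 0.5 * x + c(0, 3, 6)[as.integer(group)] + rnorm(n, sd = 2)
dat   <- data.frame(y, x, group)

# Step (3): homogeneity of regression slopes, tested via the CV*IV interaction.
slopes        <- aov(y ~ x * group, data = dat)
p_interaction <- summary(slopes)[[1]][3, "Pr(>F)"]         # row 3 is the x:group term

# Step (4): ANCOVA without the interaction term (covariate entered first).
ancova <- aov(y ~ x + group, data = dat)
print(summary(ancova))

# Step (2): homogeneity of error variance, checked on the ANCOVA residuals.
# Assumption: bartlett.test() is substituted for Levene's test.
print(bartlett.test(residuals(ancova), dat$group))

# Effect of the covariate on the error term: compare with a plain one-way ANOVA.
anova_only <- aov(y ~ group, data = dat)
mse_ancova <- summary(ancova)[[1]][3, "Mean Sq"]           # residual row of the ANCOVA
mse_anova  <- summary(anova_only)[[1]][2, "Mean Sq"]       # residual row of the ANOVA
print(c(p_interaction = p_interaction, MSE_ANCOVA = mse_ancova, MSE_ANOVA = mse_anova))
```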
<br />
==MANOVA==<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
Relationship with ANOVA: <br />
*MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows interval-scale independent variables to be added as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
*Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, or $A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular value decomposition of this matrix product, but there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
*MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variable in the one-way ANOVA, there are three dependent variables, as in our neuroimaging example. Under the null hypothesis, it is assumed that scores on the three variables for each of the groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all groups are sampled from the same distribution.<br />
*Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or placebo), the therapy (cognitive or physical), MMSE, CDR, and Imaging. It is preferable for the study design to be balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., in variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medication. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effects: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while a MANOVA deals with a $(k*1)$ mean vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now, $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value, as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)<br />
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
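<br />
The matrix partition described above can be checked numerically. The following R sketch fits a $2*2$ MANOVA to simulated MMSE, CDR and Imaging scores (all values are synthetic assumptions) and verifies that the sums of squares and cross products (SSCP) matrices for therapy, medication, their interaction and error add up to the total SSCP matrix. Dividing the error SSCP matrix by its degrees of freedom gives the error covariance matrix laid out in the table.<br />

```r
# Synthetic balanced 2*2 design with three outcome measures.
set.seed(6)
n <- 40
dat <- data.frame(
  therapy    = factor(rep(c("cognitive", "physical"), each = n / 2)),
  medication = factor(rep(c("placebo", "vitaminE"), times = n / 2))
)
Y <- cbind(MMSE    = rnorm(n, mean = 25, sd = 3),
           CDR     = rnorm(n, mean = 1,  sd = 0.5),
           Imaging = rnorm(n, mean = 10, sd = 2))

fit <- manova(Y ~ therapy * medication, data = dat)
ss  <- summary(fit)$SS     # SSCP matrices: therapy, medication, interaction, Residuals

# Total SSCP matrix about the grand mean vector
Yc      <- scale(Y, center = TRUE, scale = FALSE)
T_total <- crossprod(Yc)

# Matrix analogue of V_total = V_therapy + V_medication + V_interaction + V_error
partition_sum <- Reduce(`+`, ss)
print(max(abs(partition_sum - T_total)))   # should be numerically negligible

# Error covariance matrix: variances on the diagonal, covariances off the diagonal
V_error <- ss$Residuals / df.residual(fit)
print(V_error)
```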
<br />
*Common statistics are summarized based on the roots (eigenvalues) $\lambda_{p}$ of the $A$ matrix: (1) Samuel Stanley Wilks’ lambda, $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as Wilks’ lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})$. A commonly cited ordering of these four MANOVA tests in terms of power and robustness is Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis to the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}$ is the transpose of $E^{-1}H$, so the two products have the same eigenvalues. The order of multiplication therefore does not affect the test statistics, even though the two matrices are generally not equal.<br />
**All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis, and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda$: $1- \Lambda$ is an index of variance explained by the model, and $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression. Wilks’ $\Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.<br />
*Pillai’s Trace: Pillai’s criterion is the pooled effect variances. $Pillai’s\: trace=trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
*Hotelling’s Trace: the pooled ratio of effect variance to error variance:$trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
*Roy’s largest root: gives an upper bound for the F statistic, $Roy’s\: largest \,root=max(\hat{\lambda_{i}})$. <br />
<br />
===MANCOVA===<br />
Multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. Characterizing a covariate in a data source allows the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$, to be reduced. MANCOVA characterizes the difference in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
**Assumptions: (1) normality: for each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix between the dependent variables is equal across all levels of the independent variable.<br />
**A covariate represents a source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. Methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
**In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
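The effect of adding a covariate can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, and the names <code>group</code>, <code>x</code>, <code>y1</code> and <code>y2</code> are hypothetical. The covariate is entered before the factor because <code>summary()</code> on a <code>manova</code> fit uses sequential sums of squares.<br />

```r
set.seed(1)                                    # simulated data, for illustration only
n     <- 30
group <- factor(rep(c("a", "b", "c"), each = n))
x     <- rnorm(3 * n)                          # hypothetical continuous covariate
shift <- rep(c(0, 0.5, 1), each = n)           # assumed group effects
y1    <- shift + 0.8 * x + rnorm(3 * n)
y2    <- shift - 0.5 * x + rnorm(3 * n)
dat   <- data.frame(group, x, y1, y2)

fit_manova  <- manova(cbind(y1, y2) ~ group, data = dat)       # no covariate
fit_mancova <- manova(cbind(y1, y2) ~ x + group, data = dat)   # covariate entered first

summary(fit_manova,  test = "Wilks")
summary(fit_mancova, test = "Wilks")

# The determinant of the error SSCP matrix shrinks once the covariate is in the model
det_without <- det(crossprod(residuals(fit_manova)))
det_with    <- det(crossprod(residuals(fit_mancova)))
c(without = det_without, with = det_with)
```

Because the covariate here is simulated independently of the groups, it mainly reduces the error term; a covariate that is also correlated with the grouping factor can instead reduce the group effect, as described above.<br />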
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid. <br />
<br />
===Software=== <br />
RCODE:<br />
 fit <- aov(y ~ A, data = mydata) ## one-way ANOVA (completely randomized design)<br />
 fit <- aov(y ~ A + B, data = mydata) ## randomized block design, where B is the blocking factor<br />
 fit <- aov(y ~ A * B, data = mydata) ## two-way factorial design; A * B expands to A + B + A:B<br />
 fit <- aov(y ~ x + A, data = mydata) ## analysis of covariance; the covariate x is entered first so that A is adjusted for x<br />
 ## for within-subject designs, the data frame has to be rearranged so that each measurement on a subject is a separate observation<br />
 fit <- aov(y ~ A + Error(Subject/A), data = mydata) ## one within factor<br />
 fit <- aov(y ~ (W1*W2*B1*B2) + Error(Subject/(W1*W2)), data = mydata) ## two within factors W1 and W2, two between factors B1 and B2<br />
 ## 2*2 factorial MANOVA with 3 dependent variables<br />
 Y <- with(mydata, cbind(y1, y2, y3))<br />
 fit <- manova(Y ~ A * B, data = mydata)<br />
 summary(fit, test = "Pillai") ## other options: "Wilks", "Hotelling-Lawley", "Roy"<br />
<br />
===Problems===<br />
<br />
Use the CPI (consumer price index) data for food, housing, transportation and medical care from 1981 to 2007 to carry out a two-way analysis of variance in R. We take ‘Month’ as one factor and the item on which the CPI is measured as the other factor, which gives a 12 * 4 factorial design. The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
 CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv', header=TRUE)<br />
 summary(CPI)<br />
 CPI$Month <- factor(CPI$Month) ## convert inside the data frame so that aov(..., data=CPI) uses the factors<br />
 CPI$CPI_Item <- factor(CPI$CPI_Item)<br />
 fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
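CPI_Item is a significant factor ($F=150.2$, $p<2\times 10^{-16}$), whereas neither Month ($F=0.125$, $p\approx 1$) nor the Month-by-CPI_Item interaction ($F=0.009$, $p\approx 1$) has a significant effect on the CPI in this dataset. Dropping the non-significant terms (fit2) leaves the CPI_Item effect essentially unchanged ($F=155.3$).<br />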
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14531SMHS ANCOVA2014-10-28T18:17:46Z<p>Clgalla: /* MANOVA: */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression to evaluate whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables (covariates, CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups, which is a generalized form of ANOVA. Similarly, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA and is designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but additionally adjusts for continuous covariates. What if we have more than one dependent variable and work with multivariate observations? What if we want to see whether the interactions among dependent variables or changes in the independent variables influence the dependent variables? Then we need the multivariate extensions of ANOVA and ANCOVA, namely MANOVA and MANCOVA, respectively. The questions are then how these methods work and what kind of conclusions we can draw from them.<br />
<br />
===Theory===<br />
<br />
==ANOVA:==<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
**Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.} =\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}},$ the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
**Sums of squares (compare each group mean to the grand mean): total variation $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; variation between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} - \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; sum of squares due to error (the combined variation within groups): $SSE(within)=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$. <br />
<br />
With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$<br />
<br />
**Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistic||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{(df(between),df(within))}>F_{0})$<br />
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
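The entries of the one-way ANOVA table can be reproduced by hand and compared with R's built-in fit. The sketch below assumes a small set of hypothetical measurements for $k=3$ groups; the values and the names <code>y</code> and <code>g</code> are illustrative only.<br />

```r
# Hypothetical measurements for k = 3 groups of 4 observations each
y <- c(4.1, 5.0, 4.6, 5.3,   6.2, 6.8, 5.9, 7.1,   5.1, 4.8, 5.6, 5.2)
g <- factor(rep(c("g1", "g2", "g3"), each = 4))

n <- length(y)
k <- nlevels(g)
grand      <- mean(y)       # grand mean
group_mean <- ave(y, g)     # group mean repeated for every observation

sst_total   <- sum((y - grand)^2)
sst_between <- sum((group_mean - grand)^2)   # equals sum_i n_i (ybar_i. - ybar_..)^2
sse_within  <- sum((y - group_mean)^2)

all.equal(sst_total, sst_between + sse_within)   # the decomposition holds

F0 <- (sst_between / (k - 1)) / (sse_within / (n - k))
p  <- pf(F0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Compare with the built-in one-way ANOVA
tab <- summary(aov(y ~ g))[[1]]
tab
```

The hand-computed <code>F0</code> and <code>p</code> should agree with the <code>F value</code> and <code>Pr(>F)</code> columns of <code>tab</code>.<br />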
<br />
==ANOVA hypotheses(general form):== <br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}; H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared to the variability within the groups (error). Thus a large $F_{0}$ provides strong evidence against $H_{0}$.<br />
*ANOVA conditions: valid if (1) design conditions: each group of observations represents a random sample from its respective population, and all the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions must be approximately normal (if the sample size is large, the normality condition is less crucial), and the standard deviations of all populations must be equal. The latter condition can be slightly relaxed to $0.5\leq\frac{\sigma_{i}} {\sigma_{j}}\leq 2$ for all $i$ and $j$, i.e., no population standard deviation is more than twice as large as any other. <br />
<br />
==Two-way ANOVA:== <br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notations first: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the measurement at A-factor level $i$ and B-factor level $j$ with observation index $k$; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications per treatment group; and $N=r*a*b$ is the total number of observations. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at A-factor level $i$ and B-factor level $j$ is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$, the marginal means $\bar y_{i..}$ and $\bar y_{.j.}$ average over all observations at level $i$ of factor A and at level $j$ of factor B respectively, and the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$. We have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
==Hypotheses:== <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
**Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
**Main effect: involves the independent variables one at a time. The interaction is ignored for this part. <br />
**Interaction effect: the effect that one factor has on the other factor. The degree of freedom is the product of the two degrees of freedom of each factor.<br />
**Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistic||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{(df(A),df(E))}>F_{0})$<br />
|-<br />
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{(df(B),df(E))}>F_{0})$<br />
|-<br />
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{(df(AB),df(E))}>F_{0})$<br />
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
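For a balanced design, the sums of squares in the two-way table can be checked numerically. The following sketch uses simulated data with assumed sizes $a=3$, $b=2$ and $r=4$; the data frame <code>d</code> and its columns are hypothetical names introduced only for this illustration.<br />

```r
set.seed(2)                      # simulated balanced design, illustration only
a <- 3; b <- 2; r <- 4
d <- expand.grid(rep = 1:r, A = factor(1:a), B = factor(1:b))
d$y <- rnorm(nrow(d), mean = as.integer(d$A) + 2 * as.integer(d$B))

grand <- mean(d$y)               # grand mean
mA    <- ave(d$y, d$A)           # marginal mean of factor A, per observation
mB    <- ave(d$y, d$B)           # marginal mean of factor B, per observation
mAB   <- ave(d$y, d$A, d$B)      # cell (treatment group) mean, per observation

# Summing over all N observations supplies the rb, ra and r multipliers
ss_A   <- sum((mA - grand)^2)
ss_B   <- sum((mB - grand)^2)
ss_AB  <- sum((mAB - mA - mB + grand)^2)
ss_E   <- sum((d$y - mAB)^2)
ss_tot <- sum((d$y - grand)^2)

all.equal(ss_tot, ss_A + ss_B + ss_AB + ss_E)   # SST = SS(A) + SS(B) + SS(AB) + SSE

# Compare with the built-in two-way ANOVA
tab <- summary(aov(y ~ A * B, data = d))[[1]]
tab
```

The agreement with <code>aov</code> relies on the design being balanced; with unequal cell sizes the sequential sums of squares reported by <code>aov</code> depend on the order of the terms.<br />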
<br />
*ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
<br />
==ANCOVA:==<br />
Analysis of Covariance (ANCOVA) blends ANOVA and regression to evaluate whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables (covariates, CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) homogeneity of regression slopes, i.e., the regression lines should be parallel among groups; (4) linearity of regression; (5) independence of error terms.<br />
*Increase statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. The F-test evaluates differences between groups by dividing the explained variance between groups by the unexplained variance within the groups, $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, then there is a significant difference between groups. Without adjustment, the influence of the CVs is absorbed into the denominator. When we control for the effect of the CVs on the DV, we remove it from the denominator, which makes $F$ bigger and thereby increases our power to find a significant effect if one exists.<br />
**Adjusting preexisting differences in nonequivalent groups: correct for initial group differences that exist on the DV among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than they would be without the CV, since the participants cannot be made equal through random assignment. However, a CV may be so intimately related to the IV that removing the variance on the DV associated with the CV would remove considerable variance on the DV, which would make the result meaningless.<br />
**Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it won’t adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: use Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption: check whether the CV significantly interacts with the IV by running an ANCOVA model that includes both the IV and the CV*IV interaction term. If the interaction term is significant, then we should not perform ANCOVA; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and the adjusted $MS_{error}$. (5) Follow-up analyses: if there was a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
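Steps (2) to (4) above can be sketched in base R as follows. The data are simulated and the names <code>IV</code>, <code>CV</code> and <code>DV</code> are placeholders. Bartlett's test is used here only as a base-R stand-in for Levene's test, which is not part of base R.<br />

```r
set.seed(3)                                    # simulated data, illustration only
n   <- 20
IV  <- factor(rep(c("ctrl", "trtA", "trtB"), each = n))
CV  <- rnorm(3 * n, mean = 50, sd = 10)        # hypothetical covariate
DV  <- 10 + 0.6 * CV + rep(c(0, 3, 6), each = n) + rnorm(3 * n, sd = 4)
dat <- data.frame(IV, CV, DV)

# Step (3): homogeneity of regression slopes, via the CV-by-IV interaction
slopes <- aov(DV ~ CV * IV, data = dat)
summary(slopes)

# Step (4): if the interaction is not significant, refit without it
# (the covariate is entered first so that IV is adjusted for CV)
ancova <- aov(DV ~ CV + IV, data = dat)
summary(ancova)

# Step (2): equality of error variances across groups (stand-in for Levene's test)
bartlett.test(residuals(ancova), dat$IV)

# Adjusted group means: predictions at the overall mean of the covariate
newd <- data.frame(IV = factor(levels(dat$IV), levels = levels(dat$IV)),
                   CV = mean(dat$CV))
adj <- predict(ancova, newdata = newd)
names(adj) <- levels(dat$IV)
adj

# Residual sum of squares with and without the covariate
c(anova = deviance(aov(DV ~ IV, data = dat)), ancova = deviance(ancova))
```

The smaller residual sum of squares of the ANCOVA fit is the reduction in the denominator of $F$ that gives the gain in statistical power.<br />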
<br />
==MANOVA==<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
Relationship with ANOVA: <br />
*MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows adding of interval independents as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
*Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, that is, $A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular value decomposition of this matrix product, but there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
*MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variable as in the one-way ANOVA, there are three dependent variables, as in the neuroimaging example below. Under the null hypothesis, it is assumed that the scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where, for example, the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all groups are sampled from the same distribution.<br />
*Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or placebo), the therapy (cognitive or physical), MMSE, CDR and Imaging. It is better when the study design is balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medication. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effect: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while a MANOVA deals with a $(k*1)$ vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now, $V$ stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value as in ANOVA. The second equation is the matrix form of the first one. Here is how we interpret these matrices. The error matrix looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)<br />
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
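The partition of the total matrix into hypothesis and error parts can be illustrated numerically. The sketch below uses R's built-in <code>iris</code> data purely as a stand-in with $k=3$ dependent variables and a single grouping factor; it is not the therapy-by-medication example, so there is only one hypothesis matrix.<br />

```r
# Built-in iris data as a stand-in: 3 dependent variables, one grouping factor
Y   <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])
fit <- manova(Y ~ Species, data = iris)

E     <- crossprod(residuals(fit))                           # error SSCP (3 x 3)
T_tot <- crossprod(scale(Y, center = TRUE, scale = FALSE))   # total SSCP (3 x 3)
H     <- T_tot - E                                           # hypothesis SSCP (3 x 3)

# The same hypothesis matrix, computed directly from the fitted group means
H_direct <- crossprod(sweep(fitted(fit), 2, colMeans(Y)))
all.equal(H, H_direct, check.attributes = FALSE)

# Error covariance matrix: variances on the diagonal, covariances off the diagonal
V_error <- E / fit$df.residual
V_error
```

<code>V_error</code> has the same layout as the error matrix shown above, with one row and column per dependent variable.<br />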
<br />
*Common statistics are summarized based on the roots (eigenvalues) $\lambda_{p}$ of the $A$ matrix: (1) Samuel Stanley Wilks’ $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$, distributed as lambda $(\Lambda)$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})=\left \| A \right \|_{\infty}$. For these four major types of MANOVA test, the commonly cited ordering by statistical power and robustness is Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis to the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}$ and $E^{-1}H$ are transposes of each other and therefore have the same eigenvalues. Since the test statistics depend only on these eigenvalues, either order of multiplication may be used.<br />
**All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda$: the pooled ratio of error variance to effect variance plus error variance, $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$. The quantity $1- \Lambda$ is an index of the variance explained by the model, and the associated $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression.<br />
**Pillai’s trace: Pillai’s criterion is the pooled effect variances, $trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
**Hotelling’s trace: the pooled ratio of effect variance to error variance, $trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
**Roy’s largest root: gives an upper bound for the $F$ statistic, $max_{i}(\lambda_{i})$. <br />
<br />
===MANCOVA===<br />
Multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. Characterizing a covariate in a data source allows the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$, to be reduced. MANCOVA characterizes the difference in group means with regard to a linear combination of multiple dependent variables, while simultaneously controlling for covariates. <br />
**Assumptions: (1) normality: for each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable demonstrates similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix between the dependent variables is equal across all levels of the independent variable.<br />
**A covariate represents a source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. Methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables. <br />
**In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.<br />
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulations assessed how ANOVA with the Huynh-Feldt adjustment performed. A particular structure of the covariance matrix, known as sphericity or the H-F condition, is a sufficient condition for the usual F tests to be valid.<br />
<br />
===Software=== <br />
R code:<br />
fit <- aov(y ~ A, data=mydata) # one-way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A + B, data=mydata) # randomized block design, where B is the blocking factor<br />
fit <- aov(y ~ A*B, data=mydata) # two-way factorial design; A*B expands to A + B + A:B<br />
fit <- aov(y ~ x + A, data=mydata) # analysis of covariance; the covariate x is entered first so that the sequential test of A is adjusted for x<br />
## for within-subject designs, the data frame has to be rearranged so that each measurement on a subject is a separate observation<br />
fit <- aov(y ~ A + Error(subject/A), data=mydata) # one within-subject factor A<br />
fit <- aov(y ~ (W1*W2*B1*B2) + Error(subject/(W1*W2)) + (B1*B2), data=mydata) # two within-subject factors W1 and W2, two between-subject factors B1 and B2<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1, y2, y3) # bind the three response vectors into a matrix<br />
fit <- manova(Y ~ A*B)<br />
summary(fit, test="Pillai") # multivariate tests; other options are "Wilks", "Hotelling-Lawley" and "Roy"<br />
<br />
===Problems===<br />
<br />
Use the CPI (consumer price index) data for food, housing, transportation and medical care from 1981 to 2007 to carry out a two-way analysis of variance in R. We take ‘Month’ as one factor (12 levels) and the item on which the CPI was measured as the other factor (4 levels), which gives a 12 * 4 factorial design. The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv', header=TRUE) # adjust the path to the local copy of the data<br />
summary(CPI)<br />
CPI$Month <- factor(CPI$Month) # convert inside the data frame so that aov(..., data=CPI) uses the factors<br />
CPI$CPI_Item <- factor(CPI$CPI_Item)<br />
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
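The results suggest that CPI_Item is a significant factor (F = 150.2, p < 2e-16), while neither Month (F = 0.125, p close to 1) nor the Month by CPI_Item interaction (F = 0.009, p close to 1) has a significant effect on the CPI in this dataset. Dropping the non-significant terms (fit2) leaves the conclusion about CPI_Item unchanged (F = 155.3).<br />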
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA&diff=14530SMHS ANCOVA2014-10-28T18:16:33Z<p>Clgalla: /* MANOVA: */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Covariance (ANCOVA) ==<br />
<br />
===Overview===<br />
Analysis of Variance (ANOVA) is a common method for analyzing differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups; it is a generalized form of ANOVA. Similarly, MANCOVA (multivariate analysis of covariance) is an extension of ANCOVA designed for cases where there is more than one dependent variable and where control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.<br />
<br />
===Motivation===<br />
We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA but additionally adjusts for continuous covariates. What if we have more than one dependent variable, that is, multivariate observations? What if we want to know whether changes in the independent variables, or interactions among them, affect the dependent variables? Then we need the multivariate extensions of ANOVA and ANCOVA: MANOVA and MANCOVA, respectively. So the questions are how these methods work and what kinds of conclusions we can draw from them.<br />
<br />
===Theory===<br />
<br />
==ANOVA:==<br />
Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.<br />
*One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.<br />
**Notation: $y_{ij}$ is the measurement for observation $j$ in group $i$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations, $n=n_{1}+n_{2}+\cdots+n_{k}$. The mean of group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}}y_{ij}}{n_{i}}$, and the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.<br />
**Sums of squares: the total sum of squares (each observation compared with the grand mean) is $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, with degrees of freedom $df(total)=n-1$; the between-group sum of squares (each group mean compared with the grand mean) is $SST(between)=\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} - \bar y_{..}\right )^{2}$, with degrees of freedom $df(between)=k-1$; the sum of squares due to error (the variation within groups) is $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, with degrees of freedom $df(within)=n-k$.<br />
<br />
With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=1}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$<br />
<br />
**Calculations: <br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistics||P-value<br />
|-<br />
|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $P(F_{df(between),df(within)}>F_{0})$<br />
|-<br />
|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$ || $\frac{SSE(within)}{df(within)}$ || ||<br />
|- <br />
|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ || || ||<br />
|- <br />
|}<br />
</center><br />
<br />
==ANOVA hypotheses(general form):== <br />
$H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}$; $H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared to the variability within the groups (error). Thus a large $F_{0}$ provides strong evidence against $H_{0}$.<br />
*ANOVA conditions: the test is valid if (1) design conditions: each group of observations is a random sample from its population, and all observations within each group are independent of each other; (2) population conditions: the $k$ population distributions are approximately normal (the normality condition is less crucial when the sample sizes are large), and the standard deviations of all populations are equal. The equal-variance condition can be relaxed slightly to $0.5\leq\frac{\sigma_{i}} {\sigma_{j}}\leq 2$ for all $i$ and $j$, that is, no population standard deviation is more than twice as large as any other.<br />
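<br />
The one-way decomposition and F test can be checked numerically. The following is a minimal R sketch, assuming the built-in <code>PlantGrowth</code> data as a stand-in for a generic one-way layout (it is not a dataset used in this section); it computes the sums of squares from their definitions and compares the result with R's built-in routines.<br />

```r
# Assumption: the built-in PlantGrowth data (plant weight by treatment group)
# stands in for a generic one-way layout; it is not a dataset from this section.
y <- PlantGrowth$weight
g <- PlantGrowth$group

n <- length(y)             # total number of observations
k <- nlevels(g)            # number of groups
grand_mean <- mean(y)      # grand mean
group_mean <- ave(y, g)    # group mean, repeated for each observation

sst_total   <- sum((y - grand_mean)^2)            # df = n - 1
sst_between <- sum((group_mean - grand_mean)^2)   # same as sum_i n_i * (ybar_i. - ybar..)^2
sse_within  <- sum((y - group_mean)^2)            # df = n - k

# F statistic and its upper-tail p-value
f0 <- (sst_between / (k - 1)) / (sse_within / (n - k))
p_value <- pf(f0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Reference ANOVA table from the built-in routines
tab <- anova(lm(weight ~ group, data = PlantGrowth))
print(c(F0 = f0, p = p_value))
print(tab)
```

The hand-computed $F_{0}$ and p-value should agree with the first row of the printed ANOVA table, and the total sum of squares should equal the sum of the between-group and within-group parts.<br />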
<br />
==Two-way ANOVA:== <br />
We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.<br />
*Notation: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the measurement for observation $k$ at level $i$ of factor A and level $j$ of factor B; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group; and $N=r*a*b$ is the total number of observations. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at level $i$ of factor A and level $j$ of factor B is $\bar y_{ij.}=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$; the marginal means $\bar y_{i..}$ and $\bar y_{.j.}$ average over all observations at level $i$ of factor A and at level $j$ of factor B, respectively; and the grand mean is $\bar y=\bar y_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {N}$. The total sum of squares decomposes as $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$<br />
<br />
==Hypotheses:== <br />
*Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.<br />
**Factors: factor A and factor B are independent variables in two-way ANOVA.<br />
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.<br />
**Main effect: the effect of one independent variable (factor) considered on its own, averaging over the levels of the other factor. The interaction is ignored for this part.<br />
**Interaction effect: the extent to which the effect of one factor on the dependent variable depends on the level of the other factor. Its degrees of freedom equal the product of the degrees of freedom of the two factors.<br />
**Calculations:<br />
<br />
ANOVA table: <br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
|Variance source||Degree of freedom (df)||Sum of squares (SS)|| Mean sum of squares (MS)||F-statistics||P-value<br />
|-<br />
|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$ ||$P(F_{df(A),df(E)}>F_{0})$<br />
|-<br />
|Main effect B || $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$|| $F_{0}=\frac{MS(B)}{MSE}$||$P(F_{df(B),df(E)}>F_{0})$<br />
|-<br />
|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (\bar y_{ij.}-\bar y_{i..} -\bar y_{.j.} +\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||$P(F_{df(AB),df(E)}>F_{0})$<br />
|- <br />
|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$|| ||<br />
|- <br />
|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||<br />
|- <br />
|}<br />
</center><br />
<br />
*ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.<br />
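<br />
The two-way decomposition can be verified in the same spirit. The sketch below is an illustration under assumptions: it uses the built-in, balanced <code>ToothGrowth</code> data (factor A = supplement type with $a=2$ levels, factor B = dose with $b=3$ levels, $r=10$ replicates per cell), which is not a dataset from this section.<br />

```r
# Assumption: the built-in ToothGrowth data serves as a balanced two-way layout
# (A = supp, B = dose); it is not a dataset from this section.
d <- ToothGrowth
d$dose <- factor(d$dose)            # treat dose as a categorical factor

a <- nlevels(d$supp)                # levels of factor A
b <- nlevels(d$dose)                # levels of factor B
r <- nrow(d) / (a * b)              # replicates per treatment group

y       <- d$len
grand   <- mean(y)                  # ybar_...
mean_A  <- ave(y, d$supp)           # ybar_i.. repeated for each observation
mean_B  <- ave(y, d$dose)           # ybar_.j. repeated for each observation
mean_AB <- ave(y, d$supp, d$dose)   # ybar_ij. repeated for each observation

# Summing over all N observations supplies the multipliers rb, ra and r.
ss_A  <- sum((mean_A - grand)^2)
ss_B  <- sum((mean_B - grand)^2)
ss_AB <- sum((mean_AB - mean_A - mean_B + grand)^2)
sse   <- sum((y - mean_AB)^2)
sst   <- sum((y - grand)^2)

# Reference table from the built-in routines
tab <- anova(lm(len ~ supp * dose, data = d))
print(c(SS_A = ss_A, SS_B = ss_B, SS_AB = ss_AB, SSE = sse, SST = sst))
print(tab)
```

Because the design is balanced, the four hand-computed sums of squares add up to the total and match the "Sum Sq" column of the printed table; with unequal cell sizes this simple decomposition no longer holds in general.<br />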
<br />
==ANCOVA:==<br />
Analysis of Covariance blends ANOVA and regression: it evaluates whether the population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables, the covariates (CV).<br />
*Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.<br />
*Increasing statistical power: ANCOVA reduces the within-group error variance and thereby increases statistical power. The F-test evaluates differences between groups by dividing the explained variance between groups by the unexplained variance within groups: $F=\frac{MS_{between}}{MS_{within}}$. If this value is greater than the critical value, there is a significant difference between groups. Without adjustment, the influence of the CVs on the DV is absorbed into the denominator. Controlling for the CVs removes that variation from the denominator, which makes $F$ larger and increases the power to find a significant effect if one exists.<br />
**Adjusting for preexisting differences in nonequivalent groups: ANCOVA can correct for initial group differences on the DV among several intact groups. Because the participants cannot be made equivalent through random assignment, CVs are used to adjust the scores and make the participants more similar than they would be without the CV. However, a CV may be so closely related to the IV that removing the variance on the DV associated with the CV also removes considerable variance associated with the IV, which makes the result meaningless.<br />
**Conducting ANCOVA: (1) Test for multicollinearity: if a CV is highly correlated with another CV, it will not adjust the DV over and above the other CV; one of the two should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption with Levene’s test of equality of error variances. (3) Test the homogeneity of regression slopes assumption by checking whether the CV interacts significantly with the IV: run an ANCOVA model that includes the IV, the CV and the CV*IV interaction term. If the interaction term is significant, ANCOVA should not be performed; instead, assess group differences on the DV at particular levels of the CV. (4) Run the ANCOVA analysis: if the interaction is not significant, rerun the ANCOVA without the interaction term. In this analysis, use the adjusted means and the adjusted MSerror. (5) Follow-up analyses: if there is a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels differ significantly from the others, use the same follow-up tests as for ANOVA.<br />
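<br />
Steps (3) and (4) of this workflow can be sketched in R as follows. This is only an illustration on simulated data: the variable names (<code>y</code>, <code>x</code>, <code>group</code>), the group labels and all numeric settings are hypothetical choices, not values from this section.<br />

```r
# Hypothetical simulated data: three groups, one covariate x, common slope 0.5.
set.seed(1)
n_per_group <- 30
group <- factor(rep(c("control", "treatA", "treatB"), each = n_per_group))
x <- rnorm(3 * n_per_group, mean = 50, sd = 10)                 # covariate (CV)
shift <- c(control = 0, treatA = 3, treatB = 6)[as.character(group)]
y <- 10 + 0.5 * x + shift + rnorm(3 * n_per_group, sd = 4)      # outcome (DV)
dat <- data.frame(y = y, x = x, group = group)

# Step 3: homogeneity of regression slopes, tested via the CV-by-IV interaction
fit_slopes <- lm(y ~ x * group, data = dat)
p_interaction <- anova(fit_slopes)["x:group", "Pr(>F)"]

# Step 4: ANCOVA without the interaction (covariate entered before the factor)
fit_ancova <- lm(y ~ x + group, data = dat)
fit_anova  <- lm(y ~ group, data = dat)                         # unadjusted comparison

# Adjusted group means: predictions at the overall mean of the covariate
new_dat <- data.frame(x = mean(dat$x),
                      group = factor(levels(dat$group), levels = levels(dat$group)))
adjusted_means <- setNames(predict(fit_ancova, newdata = new_dat), levels(dat$group))

print(p_interaction)
print(anova(fit_ancova))
print(c(residual_sd_anova  = summary(fit_anova)$sigma,
        residual_sd_ancova = summary(fit_ancova)$sigma))
print(adjusted_means)
```

In this simulated setting the covariate explains much of the outcome, so the residual standard deviation of the ANCOVA fit is smaller than that of the plain ANOVA fit, which is the mechanism behind the gain in power described above. Levene's test from step (2) is not shown because it is not part of base R.<br />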
<br />
==MANOVA:==<br />
Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.<br />
Relationship with ANOVA: <br />
*MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows adding of interval independents as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables; (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.<br />
*Analogous to ANOVA, MANOVA is based on the product of the model variance matrix, $\Sigma_{model}$, and the inverse of the error variance matrix, $\Sigma_{res}^{-1}$, that is, $A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{res}$ implies that the product $A \sim I$. Invariance considerations imply that the MANOVA statistic should be a measure of the magnitude of the singular value decomposition of this matrix product, but there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.<br />
*MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix form. Assume that instead of a single dependent variable, as in the one-way ANOVA, there are three dependent variables, as in the neuroimaging example below. Under the null hypothesis, it is assumed that the scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution with mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma_{2} &\rho_{1,3}\sigma_{1}\sigma_{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2} & \rho_{2,3}\sigma_{2}\sigma_{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$, where, for example, the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and their individual standard deviations $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in all four groups are sampled from the same distribution.<br />
*Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The data matrix includes the patient ID, the drug treatment (Vitamin-E or placebo), the therapy (cognitive or physical), and the outcomes MMSE, CDR and Imaging. It is preferable for the design to be balanced, with equal numbers of patients in all four conditions, as this avoids potential problems of sample-size-driven effects (e.g., on variance estimates). Recall that a univariate ANOVA (on any single outcome measure) contains three types of effects: a main effect for therapy, a main effect for medication, and an interaction between therapy and medication. MANOVA contains the same three effects. Main effects: (1) Therapy: the univariate ANOVA main effect for therapy tells whether the physical and cognitive therapy groups have different means, irrespective of their medication. The MANOVA main effect for therapy tells whether the physical and cognitive therapy groups have different mean vectors, irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: the univariate ANOVA main effect for medication tells whether the placebo group has a different mean from the Vitamin-E group, irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the Vitamin-E group, irrespective of therapy. Interaction effect: (3) the univariate ANOVA interaction tells whether the four means for a single variable differ from the values predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vectors predicted from knowledge of the main effects of therapy and medication.<br />
**Variance partitioning: MANOVA partitions variance in the same way as ANOVA. The only difference is that ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univariate), while MANOVA deals with a $(k*1)$ mean vector for any group, $k$ being the number of dependent variables (3 in our example). The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. The variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances as the off-diagonal elements. ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to the research hypotheses and a part due to error. Thus, in our example, MANOVA has a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$, where $V$ now stands for the appropriate $(3*3)$ matrix, as opposed to a $(1*1)$ value as in ANOVA. The second equation is the matrix form of the first. The error matrix, for example, looks like:<br />
<center><br />
{| class="wikitable" style="text-align:center; width:45%" border="1"<br />
|-<br />
| ||MMSE ||CDR ||Imaging<br />
|-<br />
|MMSE|| $V_{error1}$|| COV(error1, error2)|| COV(error1, error3)<br />
|-<br />
|CDR|| COV(error2, error1)||$V_{error2}$|| COV(error2, error3)<br />
|-<br />
|Imaging|| COV(error3, error1)|| COV(error3, error2) ||$V_{error3}$<br />
|-<br />
|}<br />
</center><br />
<br />
*Common test statistics are summaries of the roots (eigenvalues) $\lambda_{p}$ of the matrix $A$: (1) Samuel Stanley Wilks’ lambda, $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$; (2) the Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(\lambda_{p}/(1+\lambda_{p}))=tr(A(I+A)^{-1})$; (3) the Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})$. These four major MANOVA tests are commonly ranked, in terms of power and robustness, as Pillai’s > Wilks’ > Hotelling’s > Roy’s.<br />
**Let the $A$ statistic be the ratio of the sum of squares for a hypothesis to the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE^{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that, because both $H$ and $E$ are symmetric, $HE^{-1}$ and $E^{-1}H$ are transposes of each other and therefore have the same eigenvalues. The two products are not equal in general, but the test statistics, which depend only on the eigenvalues, do not depend on the order of multiplication.<br />
**All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis, and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly). Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda_{i}$ denote the $i^{th}$ eigenvalue of $A=E^{-1}H$.<br />
**Wilks’ $\Lambda$: the pooled ratio of error variance to effect variance plus error variance, $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$. $1- \Lambda$ is an index of the variance explained by the model, and the associated $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression.<br />
**Pillai’s trace: Pillai’s criterion is the pooled effect variance, $Pillai’s\: trace=trace[H(H+E)^{-1}]=\sum_{i=1}^{k}\frac{\lambda_{i}}{1+\lambda_{i}}$.<br />
**Hotelling’s trace: the pooled ratio of effect variance to error variance, $trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.<br />
**Roy’s largest root: gives an upper bound for the F statistic, $Roy’s\: largest \,root=max(\hat{\lambda_{i}})$.<br />
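<br />
The four statistics can be computed directly from the eigenvalues of $E^{-1}H$. The following R sketch assumes the built-in <code>iris</code> data (four measurements, three species) as an illustrative one-way MANOVA; it is not the neuroimaging example of this section. It uses the standard definitions, in particular Pillai's trace as $\sum \lambda_{i}/(1+\lambda_{i})$, and compares them with the values reported by <code>summary()</code> for a <code>manova</code> fit.<br />

```r
# Assumption: the built-in iris data is used as an illustrative one-way MANOVA.
Y <- as.matrix(iris[, 1:4])                  # four dependent variables
fit <- manova(Y ~ Species, data = iris)

ssp <- summary(fit)$SS                       # sums of squares and cross products
H <- ssp$Species                             # hypothesis SSCP matrix
E <- ssp$Residuals                           # error SSCP matrix

# Eigenvalues of E^{-1} H; Re() drops negligible imaginary parts from rounding
lambda <- Re(eigen(solve(E) %*% H)$values)

wilks     <- prod(1 / (1 + lambda))          # |E| / |H + E|
pillai    <- sum(lambda / (1 + lambda))      # trace[H (H + E)^{-1}]
hotelling <- sum(lambda)                     # trace[H E^{-1}]
roy       <- max(lambda)                     # largest eigenvalue

print(c(Wilks = wilks, Pillai = pillai, Hotelling = hotelling, Roy = roy))
print(summary(fit, test = "Wilks")$stats)    # for comparison with R's own value
```

With three groups, $H$ has rank at most 2, so only two of the four eigenvalues are non-zero (up to rounding). The determinant form of Wilks' $\Lambda$, <code>det(E) / det(H + E)</code>, gives the same value as the product form.<br />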
<br />
'''MANCOVA:''' multivariate analysis of covariance is an extension of ANCOVA designed for cases where there is more than one dependent variable and where control of concomitant continuous independent variables is required. Characterizing a covariate in a data source allows the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$, to be reduced. MANCOVA characterizes the difference in group means with respect to a linear combination of multiple dependent variables, while simultaneously controlling for covariates.<br />
**Assumptions: (1) normality: within each group, each dependent variable follows a normal distribution and any linear combination of the dependent variables is normally distributed; (2) independence: each observation is independent of all other observations; (3) homogeneity of variance: each dependent variable shows similar levels of variance across the levels of each independent variable; (4) homogeneity of covariance: the intercorrelation matrix of the dependent variables is the same across all levels of the independent variable.<br />
**A covariate represents a source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. Methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to obtain a more accurate estimate of the relationship between the independent and dependent variables.<br />
**In some studies, the F value actually becomes smaller (less significant) after covariates are included in the design. This indicates that the covariates are correlated not only with the dependent variable but also with the between-groups factors.<br />
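<br />
A MANCOVA can be fitted in R by adding the covariate to a <code>manova</code> formula. The sketch below is a hedged illustration, assuming the built-in <code>iris</code> data with two dependent variables, <code>Sepal.Width</code> playing the role of the covariate and <code>Species</code> as the grouping factor; these role assignments are arbitrary and serve only to show the mechanics.<br />

```r
# Assumption: iris is used only for illustration; treating Sepal.Width as a
# covariate and Species as the grouping factor is an arbitrary choice.
fit_manova  <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# The covariate is entered first, so the sequential test of Species is adjusted for it.
fit_mancova <- manova(cbind(Sepal.Length, Petal.Length) ~ Sepal.Width + Species,
                      data = iris)

res_manova  <- summary(fit_manova,  test = "Pillai")
res_mancova <- summary(fit_mancova, test = "Pillai")

print(res_manova)    # group effect without covariate adjustment
print(res_mancova)   # covariate effect, then group effect adjusted for the covariate
```

Comparing the two summaries shows how the test of the grouping factor changes once the covariate is accounted for; as noted above, the statistic may decrease when the covariate is also related to the grouping factor.<br />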
<br />
===Applications===<br />
<br />
[http://rer.sagepub.com/content/68/3/350.short This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.<br />
<br />
[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulations assessed how ANOVA with the Huynh-Feldt adjustment performed. A particular structure of the covariance matrix, known as sphericity or the H-F condition, is a sufficient condition for the usual F tests to be valid.<br />
<br />
===Software=== <br />
R code:<br />
fit <- aov(y ~ A, data=mydata) # one-way ANOVA (completely randomized design)<br />
fit <- aov(y ~ A + B, data=mydata) # randomized block design, where B is the blocking factor<br />
fit <- aov(y ~ A*B, data=mydata) # two-way factorial design; A*B expands to A + B + A:B<br />
fit <- aov(y ~ x + A, data=mydata) # analysis of covariance; the covariate x is entered first so that the sequential test of A is adjusted for x<br />
## for within-subject designs, the data frame has to be rearranged so that each measurement on a subject is a separate observation<br />
fit <- aov(y ~ A + Error(subject/A), data=mydata) # one within-subject factor A<br />
fit <- aov(y ~ (W1*W2*B1*B2) + Error(subject/(W1*W2)) + (B1*B2), data=mydata) # two within-subject factors W1 and W2, two between-subject factors B1 and B2<br />
## 2*2 factorial MANOVA with 3 dependent variables<br />
Y <- cbind(y1, y2, y3) # bind the three response vectors into a matrix<br />
fit <- manova(Y ~ A*B)<br />
summary(fit, test="Pillai") # multivariate tests; other options are "Wilks", "Hotelling-Lawley" and "Roy"<br />
<br />
===Problems===<br />
<br />
Use the CPI (consumer price index) data for food, housing, transportation and medical care from 1981 to 2007 to carry out a two-way analysis of variance in R. We take ‘Month’ as one factor (12 levels) and the item on which the CPI was measured as the other factor (4 levels), which gives a 12 * 4 factorial design. The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index].<br />
<br />
In R:<br />
CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv', header=TRUE) # adjust the path to the local copy of the data<br />
summary(CPI)<br />
CPI$Month <- factor(CPI$Month) # convert inside the data frame so that aov(..., data=CPI) uses the factors<br />
CPI$CPI_Item <- factor(CPI$CPI_Item)<br />
fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)<br />
fit<br />
Call:<br />
aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month, <br />
data = CPI)<br />
<br />
Terms:<br />
Month CPI_Item Month:CPI_Item Residuals<br />
Sum of Squares 3282.6 1078702.9 706.8 2987673.8<br />
Deg. of Freedom 11 3 33 1248<br />
<br />
Residual standard error: 48.92821 <br />
Estimated effects may be unbalanced<br />
<br />
summary(fit)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
Month 11 3283 298 0.125 1 <br />
CPI_Item 3 1078703 359568 150.197 <2e-16 ***<br />
Month:CPI_Item 33 707 21 0.009 1 <br />
Residuals 1248 2987674 2394 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)<br />
fit2<br />
Call:<br />
aov(formula = CPI_Value ~ CPI_Item, data = CPI)<br />
<br />
Terms:<br />
CPI_Item Residuals<br />
Sum of Squares 1078703 2991663<br />
Deg. of Freedom 3 1292<br />
<br />
Residual standard error: 48.11994 <br />
Estimated effects may be unbalanced<br />
summary(fit2)<br />
Df Sum Sq Mean Sq F value Pr(>F) <br />
CPI_Item 3 1078703 359568 155.3 <2e-16 ***<br />
Residuals 1292 2991663 2316 <br />
---<br />
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1<br />
<br />
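CPI_Item is a highly significant factor (F = 150.2, p < 2e-16), while neither Month (F = 0.125) nor the Month:CPI_Item interaction (F = 0.009) is significant for this dataset. Dropping Month and the interaction (fit2) leaves the CPI_Item effect essentially unchanged (F = 155.3).<br />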
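<br />
The F values in the summary table can be reproduced by hand from the sums of squares and degrees of freedom printed by fit. The sketch below copies those numbers from the output above; the object names are chosen for illustration.<br />

```r
# Sums of squares and degrees of freedom copied from the printed aov output
ss <- c(Month = 3282.6, CPI_Item = 1078702.9, Interaction = 706.8)
df <- c(Month = 11, CPI_Item = 3, Interaction = 33)
ss.res <- 2987673.8
df.res <- 1248

ms <- ss / df                # mean square of each term
ms.res <- ss.res / df.res    # residual mean square
f.value <- ms / ms.res       # F statistic of each term
p.value <- pf(f.value, df, df.res, lower.tail = FALSE)

round(cbind(Df = df, MeanSq = ms, F = f.value, p = p.value), 3)
```

Each mean square is the sum of squares divided by its degrees of freedom, and each F value is the ratio of a term's mean square to the residual mean square.<br />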
<br />
<br />
===References===<br />
[http://www.statsoft.com/Textbook/ANOVA-MANOVA ANOVA-MANOVA]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]<br />
<br />
[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]<br />
<br />
[http://en.wikipedia.org/wiki/Analysis_of_covariance ANCOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance MANOVA Wikipedia]<br />
<br />
[http://en.wikipedia.org/wiki/MANCOVA MANCOVA Wikipedia]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting&diff=14485SMHS ModelFitting2014-10-21T13:19:20Z<p>Clgalla: /* Scientific Methods for Health Sciences - Model Fitting and Model Quality (KS-test) */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Model Fitting and Model Quality (KS-test) ==<br />
<br />
===Overview===<br />
The Kolmogorov-Smirnov test (K-S test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution (the one-sample K-S test) or to compare two samples (the two-sample K-S test), that is, to determine whether two datasets differ significantly. In this lecture, we present a general introduction to the K-S test and illustrate its application with examples. The implementation of the K-S test in the statistical package R is also discussed with specific examples.<br />
<br />
===Motivation===<br />
When we want to test whether two datasets are equal, the first idea that comes to mind is often the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. First, the control and treatment groups may not differ in mean but only in some other way: for two datasets with the same mean and very different variances, the t-test cannot see the difference. Second, the treatment and control groups may be smallish datasets (say fewer than 20 observations) that differ in mean, but a substantially non-normal distribution masks the difference: for two datasets drawn from lognormal distributions that differ substantially in mean, the t-test can also fail. In all of these situations, the K-S test is a useful alternative. How does the K-S test work?<br />
<br />
===Theory===<br />
'''1) K-S test:''' a nonparametric test of the equality of continuous, one-dimensional probability distributions, which can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S statistic quantifies the distance between the empirical distribution function of the sample and the cdf of the reference distribution, or between the empirical distribution functions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (one-sample case) or that the two samples are drawn from the same distribution (two-sample case).<br />
*The K-S test is sensitive to differences in both the location and the shape of the empirical cdfs of the samples, and it is widely used as a nonparametric method for comparing two samples.<br />
*The empirical distribution function $F_{n}$ for $n$ i.i.d. observations $X_{i}$ is $F_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals $1$ when $X_{i} \leq x$ and $0$ otherwise.<br />
*The K-S statistic for a given cdf $F(x)$ is $D_{n}=\sup_{x}|F_{n}(x)-F(x)|,$ where $\sup_{x}$ is the supremum of the set of distances. By the Glivenko-Cantelli theorem, if the sample comes from the distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely as $n$ goes to infinity.<br />
*Kolmogorov distribution: the distribution of the random variable $K=\sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The cdf of $K$ is $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} / (8x^{2})}.$ Under the null hypothesis that the sample comes from the hypothesized distribution $F(x)$, $\sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} \sup_{t} |B(F(t))|$ in distribution. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$<br />
*The goodness-of-fit test, or K-S test, is constructed from the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.<br />
*Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{i}$, the critical values determined in this way are invalid. In such cases, modifications to the test statistic and critical values are required, and several have been proposed.<br />
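<br />
The definitions above can be sketched in R as follows. The simulated sample, the choice of the standard normal as reference distribution, and the names D.n and kolm.cdf are assumptions made for illustration. The block computes $D_{n}$ directly from the sorted sample, compares it with the statistic returned by ks.test(), and evaluates the first series for the Kolmogorov cdf, truncated at a finite number of terms.<br />

```r
set.seed(42)
x <- rnorm(50)            # hypothetical sample
n <- length(x)
Fx <- pnorm(sort(x))      # reference cdf evaluated at the sorted sample

# The supremum is attained just before or at a jump of the empirical cdf
D.plus  <- max(seq_len(n) / n - Fx)
D.minus <- max(Fx - (seq_len(n) - 1) / n)
D.n <- max(D.plus, D.minus)

res <- ks.test(x, "pnorm")
c(manual = D.n, ks.test = unname(res$statistic))

# Kolmogorov cdf from the alternating series, truncated after 'terms' terms
kolm.cdf <- function(x, terms = 100) {
  k <- seq_len(terms)
  1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
}

# Asymptotic p-value of the observed statistic
1 - kolm.cdf(sqrt(n) * D.n)
```

For small samples ks.test() reports an exact p-value by default, so it need not coincide with the asymptotic value in the last line.<br />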
<br />
'''2) Two-sample K-S test:''' tests whether two underlying one-dimensional probability distributions differ. The K-S statistic is $D_{n,{n}'}=\sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second sample respectively. The null hypothesis is rejected at level $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|$\alpha$|| 0.1|| 0.05|| 0.025|| 0.01|| 0.005 ||0.001<br />
|-<br />
|c($\alpha$)|| 1.22|| 1.36|| 1.48|| 1.63|| 1.73|| 1.95<br />
|}<br />
</center><br />
<br />
Note: the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is. While the K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution of $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistic $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$<br />
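<br />
The tabulated values of $c(\alpha)$ agree, to two decimals, with the large-sample expression $c(\alpha)=\sqrt{-\frac{1}{2}\ln(\alpha/2)}$. The sketch below uses that expression to compute the rejection threshold for $D_{n,{n}'}$; the function names c.alpha and ks.crit and the sample sizes of 20 are assumptions chosen for illustration.<br />

```r
# Large-sample approximation for c(alpha)
c.alpha <- function(alpha) sqrt(-0.5 * log(alpha / 2))

alpha <- c(0.1, 0.05, 0.025, 0.01, 0.005, 0.001)
round(c.alpha(alpha), 2)   # compare with the table above

# Rejection threshold for D with sample sizes n1 and n2
ks.crit <- function(alpha, n1, n2) c.alpha(alpha) * sqrt((n1 + n2) / (n1 * n2))

# Hypothetical example: two samples of 20 observations each, alpha = 0.05
ks.crit(0.05, n1 = 20, n2 = 20)
```

An observed $D_{n,{n}'}$ above the returned threshold leads to rejection at the chosen level.<br />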
<br />
'''3) Illustration of how the K-S test works with an example:''' consider the two datasets control $=\{1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38\}$ and treatment $=\{2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19\}$.<br />
*Descriptive statistics: reduce the list of all the data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the maximum is 50.57, the minimum is 0.08 and the standard deviation is 11.165. With the mean far above the median and a standard deviation about three times the mean, this dataset is clearly not normally distributed.<br />
*Sort the control data: control sorted $=\{0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57\}.$ Evidently no data lie strictly below $0.08$, $5\% = 1/20$ of the data are strictly smaller than $0.10$, $10\%$ of the data are strictly smaller than $0.15$, $15\%$ of the data are strictly smaller than $0.17$, $\cdots.$ There are $17$ data points smaller than $\pi$, so the cumulative fraction of the data smaller than $\pi$ is $17/20 = 0.85.$<br />
<br />
RCODE:<br />
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, <br />
0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)<br />
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, <br />
27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)<br />
summary(control)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
0.080 0.315 0.600 3.607 1.415 50.570<br />
sd(control)<br />
[1] 11.16464<br />
*Plot the empirical cumulative distribution of the control data. From the chart, we can see that the majority of the data fall in a small fraction of the plot at the top left, which is a clear sign that the control dataset does not follow a normal distribution.<br />
library(stats)<br />
ecdf(control)<br />
plot(ecdf(control),verticals=TRUE,main='Cumulative Fraction Plot')<br />
<br />
[[Image:SMHS_Fig_1_Model_Fitting.png|500px]]<br />
<br />
*Now plot the control group using a log scale, which gives more space to the small x values. The plot now splits the data points evenly into two halves around the median (half above, half below), which is a little below 1.<br />
<br />
<br />
[[Image:SMHS_Fig_2_Model_Fitting.png|500px]]<br />
<br />
log.con <- log(control)<br />
plot(ecdf(log.con),verticals=T, main='Cumulative Fraction Plot of Log(Control)')<br />
<br />
*Now, plot the cumulative fraction of both the treatment and the control group on the same graph. <br />
<br />
[[Image:SMHS_Fig_3_Model_Fitting.png|500px]]<br />
<br />
From the chart, we can see that both datasets span much of the same range of values, but for most x values, the cumulative fraction of the treatment group (red) is less than that of the control group (blue). The K-S test uses the maximum vertical deviation between the two curves as its statistic D. In this case, the maximum deviation occurs near x=1, where D=0.45: the fraction of the treatment group that is less than 1 is 0.2 (4 out of 20 values), and that of the control group is 0.65 (13 out of 20 values), so the maximum difference in the cumulative fractions is D=0.45.<br />
Note that the value of D is not affected by monotone changes of scale such as taking logs, unlike the t-statistic. Hence, the K-S test is a robust test that depends only on the relative distribution (ordering) of the data.<br />
<br />
log.treat <- log(treatment)<br />
plot(ecdf(log.con), verticals=TRUE, main='Cumulative Fraction Plot on Log Scale', <br />
col.p='bisque', col.h='blue', col.v='blue', xlim=c(-3,5))<br />
par(new=TRUE)<br />
plot(ecdf(log.treat), verticals=TRUE, <br />
col.p='bisque', col.h='red', col.v='red', main='', xlim=c(-3,5))<br />
con.p <- ecdf(log.con)<br />
treat.p <- ecdf(log.treat)<br />
con.p(0)-treat.p(0) ### D=0.45<br />
(ecdf(control))(1)-(ecdf(treatment))(1) ## D=0.45 same as using the log-scale<br />
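<br />
The same statistic can be obtained directly with ks.test() from the stats package. The sketch below repeats the two datasets so that it runs on its own; D.manual and pooled are names introduced here for illustration. It also recomputes D over all pooled data points and checks that the log transformation leaves D unchanged.<br />

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15,
             0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11,
               27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)

# Two-sample K-S test
res <- ks.test(control, treatment)
res$statistic
res$p.value

# D by hand: largest gap between the two empirical cdfs over all data points
pooled <- sort(c(control, treatment))
D.manual <- max(abs(ecdf(control)(pooled) - ecdf(treatment)(pooled)))
D.manual

# D is unchanged on the log scale
D.log <- ks.test(log(control), log(treatment))$statistic
D.log
```

The p-value returned by ks.test() can then be compared with the chosen significance level.<br />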
<br />
'''4) Using the Percentile Plot:''' we are used to observing and comparing continuous curves, so we may want something similar to the cumulative fraction plot but without the odd steps; percentiles provide this. Consider the dataset $\{-0.45, 1.11, 0.48, -0.82, -1.26\}.$ Sorted from small to large, it is $\{-1.26, -0.82, -0.45, 0.48, 1.11\}.$ The median is $-0.45$, which is the 50th percentile. To calculate the percentile of a point, take its rank $r$ in the sorted dataset and divide by the number of points plus one: percentile = $r/(N+1)$. This gives the set of (data, percentile) pairs: $\{(-1.26, 0.167), (-0.82, 0.333), (-0.45, 0.5), (0.48, 0.667), (1.11, 0.833)\}$. Connecting adjacent data points with straight lines gives a collection of connected line segments called an '''''ogive'''''.<br />
<br />
<br />
RCODE:<br />
data <- c(-0.45, 1.11, 0.48, -0.82, -1.26)<br />
sort.data <- sort(data)<br />
percentile <- c(0.167, 0.333, 0.5, 0.667, 0.833)<br />
plot(ecdf(data),verticals=TRUE,xlim=c(-1.5,1.5),ylim=c(0,1), xlab='Data',<br />
ylab='' ,main='Cumulative Fraction Plot vs. Percentile Plot')<br />
par(new=T)<br />
plot(sort.data,percentile,type='o',xlim=c(-1.5,1.5),ylim=c(0,1),col=2,xlab='', ylab='', main='')<br />
<br />
[[Image:SMHS_Fig_4_Model_Fitting.png|500px]]<br />
<br />
Reasons to prefer the percentile plot to the cumulative fraction plot: the percentile plot is a better estimate of the distribution function, and percentiles allow us to use ‘probability graph paper’, that is, plots with specially scaled axis divisions. A probability scale on the y-axis lets us see how close to normal the data are: normally distributed data show a straight line on probability paper, while log-normal data show a straight line on probability-log scaled axes.<br />
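<br />
A minimal sketch of both ideas, assuming the five-point dataset above: the percentiles are computed from the rule $r/(N+1)$ instead of being typed in, and a probability-scaled y-axis is imitated by plotting the standard normal quantiles of the percentiles against the sorted data. The names N and z are introduced here for illustration.<br />

```r
data <- c(-0.45, 1.11, 0.48, -0.82, -1.26)
sort.data <- sort(data)
N <- length(sort.data)

# Percentile of the point with rank r in the sorted data: r / (N + 1)
percentile <- seq_len(N) / (N + 1)
round(percentile, 3)

# Probability-paper style plot: normal quantile of each percentile vs. the data
z <- qnorm(percentile)
plot(sort.data, z, xlab = 'Data', ylab = 'Normal quantile of percentile',
     main = 'Normal Probability Plot (sketch)')
abline(lm(z ~ sort.data), lty = 2)   # reference line; near-normal data follow it closely
```

With only five points the plot is merely illustrative; for log-normal data the same plot would be drawn against log(sort.data).<br />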
<br />
'''5) The K-S statistic in more than one dimension:''' the K-S test statistic needs to be modified to accommodate multivariate data. This is not straightforward, because the maximum difference between two joint cdfs is not generally the same as the maximum difference of any of the complementary distribution functions. The maximum difference therefore depends on which of $Pr(X<x \wedge Y<y)$, $Pr(X<x \wedge Y>y)$, or either of the other two possible arrangements is used. One approach is to compare the cdfs of the two samples under all possible orderings and take the largest difference as the K-S statistic. In $d$ dimensions, there are $2^{d}-1$ such orderings. The critical values for the test statistic can be obtained by simulation, but they depend on the dependence structure in the joint distribution.<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_KolmogorovSmirnoff This article] presents the SOCR analyses example on the Kolmogorov-Smirnoff test, which compares how distinct two samples are. It presents a general introduction to the K-S test and illustrates its application with the Oats example and a control-treatment test through the SOCR program.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and the fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and a description of the data, followed by exploratory data analysis and model fitting in the SOCR environment.<br />
<br />
3) [http://jhc.sagepub.com/content/25/7/935.short This article] presents the Kolmogorov-Smirnov statistical test for the analysis of histograms and discusses the test for both the two-sample case (comparing fn1(X) to fn2(X)) and the one-sample case (comparing fn1(X) to f(X)). The specific algorithmic steps are presented through an example where the data are from an experiment discussed elsewhere in the same issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level.<br />
<br />
4) [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4310069&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4310069 This article] developed digital techniques for detecting changes between two Landsat images. A data matrix containing 16 × 16 picture elements was used for this purpose. The Landsat imagery data were corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites and successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.<br />
<br />
===Software=== <br />
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ks.test.html <br />
<br />
http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html <br />
<br />
The R code is included with the examples given in this lecture.<br />
<br />
===Problems===<br />
<br />
1) Consider the following dataset, control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}, use the K-S test to test the equality of the two samples. Include all necessary steps and plots.<br />
<br />
2) Revise the example given in the lecture with a different control group: control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, run the K-S test again and state clearly your conclusions with necessary plots.<br />
<br />
3) Two types of trees are in bloom in a wood and we want to study whether bees prefer one tree to the other. We collect data by using a stop-watch to time how long a bee stays near a particular tree. We begin to time when the bee touches the tree; we stop timing when the bee is more than a meter from the tree. (As a result all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset size is not a problem for the K-S test.) Apply the K-S test to this example. The data are given below:<br />
T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5}<br />
T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7}<br />
Note that this example is based on data drawn from a Cauchy distribution, a particularly abnormal case. The plots do not look particularly abnormal; however, the large number of outliers is a tip-off of a non-normal distribution.<br />
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test K-S Test Wikipedia]<br />
<br />
[http://www.physics.csbsju.edu/stats/KS-test.html K-S Test]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting&diff=14484SMHS ModelFitting2014-10-21T13:05:26Z<p>Clgalla: /* Theory */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Model Fitting and Model Quality (KS-test) ==<br />
<br />
===Overview===<br />
The Kolmogorov-Smirnov test (K-S test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution (the one-sample K-S test) or to compare two samples (the two-sample K-S test), that is, to determine whether two datasets differ significantly. In this lecture, we present a general introduction to the K-S test and illustrate its application with examples. The implementation of the K-S test in the statistical package R is also discussed with specific examples.<br />
<br />
===Motivation===<br />
When we want to test whether two datasets are equal, the first idea that comes to mind is often the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. First, the control and treatment groups may not differ in mean but only in some other way: for two datasets with the same mean and very different variances, the t-test cannot see the difference. Second, the treatment and control groups may be smallish datasets (say fewer than 20 observations) that differ in mean, but a substantially non-normal distribution masks the difference: for two datasets drawn from lognormal distributions that differ substantially in mean, the t-test can also fail. In all of these situations, the K-S test is a useful alternative. How does the K-S test work?<br />
<br />
===Theory===<br />
'''1) K-S test:''' a nonparametric test of the equality of continuous, one-dimensional probability distributions, which can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S statistic quantifies the distance between the empirical distribution function of the sample and the cdf of the reference distribution, or between the empirical distribution functions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (one-sample case) or that the two samples are drawn from the same distribution (two-sample case).<br />
*The K-S test is sensitive to differences in both the location and the shape of the empirical cdfs of the samples, and it is widely used as a nonparametric method for comparing two samples.<br />
*The empirical distribution function $F_{n}$ for $n$ i.i.d. observations $X_{i}$ is $F_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals $1$ when $X_{i} \leq x$ and $0$ otherwise.<br />
*The K-S statistic for a given cdf $F(x)$ is $D_{n}=\sup_{x}|F_{n}(x)-F(x)|,$ where $\sup_{x}$ is the supremum of the set of distances. By the Glivenko-Cantelli theorem, if the sample comes from the distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely as $n$ goes to infinity.<br />
*Kolmogorov distribution: the distribution of the random variable $K=\sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The cdf of $K$ is $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} / (8x^{2})}.$ Under the null hypothesis that the sample comes from the hypothesized distribution $F(x)$, $\sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} \sup_{t} |B(F(t))|$ in distribution. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$<br />
*The goodness-of-fit test, or K-S test, is constructed from the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.<br />
*Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{i}$, the critical values determined in this way are invalid. In such cases, modifications to the test statistic and critical values are required, and several have been proposed.<br />
<br />
'''2) Two-sample K-S test:''' tests whether two underlying one-dimensional probability distributions differ. The K-S statistic is $D_{n,{n}'}=\sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second sample respectively. The null hypothesis is rejected at level $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|$\alpha$|| 0.1|| 0.05|| 0.025|| 0.01|| 0.005 ||0.001<br />
|-<br />
|c($\alpha$)|| 1.22|| 1.36|| 1.48|| 1.63|| 1.73|| 1.95<br />
|}<br />
</center><br />
<br />
Note: the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is. While the K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution of $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistic $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$<br />
<br />
'''3) Illustration of how the K-S test works with an example:''' consider the two datasets control $=\{1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38\}$ and treatment $=\{2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19\}$.<br />
*Descriptive statistics: reduce the list of all the data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the maximum is 50.57, the minimum is 0.08 and the standard deviation is 11.165. With the mean far above the median and a standard deviation about three times the mean, this dataset is clearly not normally distributed.<br />
*Sort the control data: control sorted $=\{0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57\}.$ Evidently no data lie strictly below $0.08$, $5\% = 1/20$ of the data are strictly smaller than $0.10$, $10\%$ of the data are strictly smaller than $0.15$, $15\%$ of the data are strictly smaller than $0.17$, $\cdots.$ There are $17$ data points smaller than $\pi$, so the cumulative fraction of the data smaller than $\pi$ is $17/20 = 0.85.$<br />
<br />
RCODE:<br />
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, <br />
0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)<br />
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, <br />
27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)<br />
summary(control)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
0.080 0.315 0.600 3.607 1.415 50.570<br />
sd(control)<br />
[1] 11.16464<br />
*Plot the empirical cumulative distribution of the control data. From the chart, we can see that the majority of the data fall in a small fraction of the plot at the top left, which is a clear sign that the control dataset does not follow a normal distribution.<br />
library(stats)<br />
ecdf(control)<br />
plot(ecdf(control),verticals=TRUE,main='Cumulative Fraction Plot')<br />
<br />
[[Image:SMHS_Fig_1_Model_Fitting.png|500px]]<br />
<br />
*Now plot the control group using a log scale, which gives more space to the small x values. The plot now splits the data points evenly into two halves around the median (half above, half below), which is a little below 1.<br />
<br />
<br />
[[Image:SMHS_Fig_2_Model_Fitting.png|500px]]<br />
<br />
log.con <- log(control)<br />
plot(ecdf(log.con),verticals=T, main='Cumulative Fraction Plot of Log(Control)')<br />
<br />
*Now, plot the cumulative fraction of both the treatment and the control group on the same graph. <br />
<br />
[[Image:SMHS_Fig_3_Model_Fitting.png|500px]]<br />
<br />
From the chart, we can see that both datasets span much of the same range of values, but for most x values, the cumulative fraction of the treatment group (red) is less than that of the control group (blue). The K-S test uses the maximum vertical deviation between the two curves as its statistic D. In this case, the maximum deviation occurs near x=1, where D=0.45: the fraction of the treatment group that is less than 1 is 0.2 (4 out of 20 values), and that of the control group is 0.65 (13 out of 20 values), so the maximum difference in the cumulative fractions is D=0.45.<br />
Note that the value of D is not affected by monotone changes of scale such as taking logs, unlike the t-statistic. Hence, the K-S test is a robust test that depends only on the relative distribution (ordering) of the data.<br />
<br />
log.treat <- log(treatment)<br />
plot(ecdf(log.con), verticals=TRUE, main='Cumulative Fraction Plot on Log Scale', <br />
col.p='bisque', col.h='blue', col.v='blue', xlim=c(-3,5))<br />
par(new=TRUE)<br />
plot(ecdf(log.treat), verticals=TRUE, <br />
col.p='bisque', col.h='red', col.v='red', main='', xlim=c(-3,5))<br />
con.p <- ecdf(log.con)<br />
treat.p <- ecdf(log.treat)<br />
con.p(0)-treat.p(0) ### D=0.45<br />
(ecdf(control))(1)-(ecdf(treatment))(1) ## D=0.45 same as using the log-scale<br />
<br />
'''4) Using the Percentile Plot:''' we are used to observing and comparing continuous curves, so we may want something similar to the cumulative fraction plot but without the odd steps; percentiles provide this. Consider the dataset $\{-0.45, 1.11, 0.48, -0.82, -1.26\}.$ Sorted from small to large, it is $\{-1.26, -0.82, -0.45, 0.48, 1.11\}.$ The median is $-0.45$, which is the 50th percentile. To calculate the percentile of a point, take its rank $r$ in the sorted dataset and divide by the number of points plus one: percentile = $r/(N+1)$. This gives the set of (data, percentile) pairs: $\{(-1.26, 0.167), (-0.82, 0.333), (-0.45, 0.5), (0.48, 0.667), (1.11, 0.833)\}$. Connecting adjacent data points with straight lines gives a collection of connected line segments called an '''''ogive'''''.<br />
<br />
<br />
RCODE:<br />
data <- c(-0.45, 1.11, 0.48, -0.82, -1.26)<br />
sort.data <- sort(data)<br />
percentile <- c(0.167, 0.333, 0.5, 0.667, 0.833)<br />
plot(ecdf(data),verticals=TRUE,xlim=c(-1.5,1.5),ylim=c(0,1), xlab='Data', ylab='' ,main='Cumulative Fraction Plot vs. Percentile Plot')<br />
par(new=T)<br />
plot(sort.data,percentile,type='o',xlim=c(-1.5,1.5),ylim=c(0,1),col=2,xlab='', ylab='', main='')<br />
<br />
[[Image:SMHS_Fig_4_Model_Fitting.png|500px]]<br />
<br />
Reasons to prefer the percentile plot to the cumulative fraction plot: the percentile plot is a better estimate of the distribution function, and percentiles allow us to use ‘probability graph paper’, that is, plots with specially scaled axis divisions. A probability scale on the y-axis lets us see how close to normal the data are: normally distributed data show a straight line on probability paper, while log-normal data show a straight line on probability-log scaled axes.<br />
<br />
'''5) The K-S statistic in more than one dimension:''' the K-S test statistic needs to be modified to accommodate multivariate data. This is not straightforward, because the maximum difference between two joint cdfs is not generally the same as the maximum difference of any of the complementary distribution functions. The maximum difference therefore depends on which of $Pr(X<x \wedge Y<y)$, $Pr(X<x \wedge Y>y)$, or either of the other two possible arrangements is used. One approach is to compare the cdfs of the two samples under all possible orderings and take the largest difference as the K-S statistic. In $d$ dimensions, there are $2^{d}-1$ such orderings. The critical values for the test statistic can be obtained by simulation, but they depend on the dependence structure in the joint distribution.<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_KolmogorovSmirnoff This article] presents the SOCR analyses example on the Kolmogorov-Smirnov test, which compares how distinct two samples are. It gives a general introduction to the K-S test and illustrates its application with the Oats example and a control-treatment test in the SOCR program.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and the fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and a description of the data, followed by exploratory data analysis and model fitting in the SOCR environment.<br />
<br />
3) [http://jhc.sagepub.com/content/25/7/935.short This article] presents the Kolmogorov-Smirnov statistical test for the analysis of histograms and discusses the test for both the two-sample case (comparing fn1(X) to fn2(X)) and the one-sample case (comparing fn1(X) to f(X)). The specific algorithmic steps are presented by working through an example whose data come from an experiment discussed elsewhere in the same journal issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level. <br />
<br />
4) [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4310069&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4310069 This article] develops digital techniques for detecting changes between two Landsat images. A data matrix containing 16 × 16 picture elements was used for this purpose. The Landsat imagery data were corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites. It successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.<br />
<br />
===Software=== <br />
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ks.test.html <br />
<br />
http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html <br />
<br />
The R code is included with the examples given in this lecture.<br />
<br />
===Problems===<br />
<br />
1) Consider the following datasets: control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}. Use the K-S test to test the equality of the two samples. Include all necessary steps and plots.<br />
<br />
2) Revise the example given in the lecture with a different control group, control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, then run the K-S test again and state your conclusions clearly, with the necessary plots.<br />
<br />
3) Two types of trees are in bloom in a wood, and we want to study whether bees prefer one tree to the other. We collect data by using a stopwatch to time how long a bee stays near a particular tree. We start timing when the bee touches the tree and stop when the bee is more than a meter from the tree. (As a result, all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset sizes are not a problem for the K-S test.) Apply the K-S test to this example. The data are given below:<br />
T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5}<br />
T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7}<br />
Note that this example is based on data distributed according to the Cauchy distribution, a particularly abnormal case. The plots do not look particularly abnormal; however, the large number of outliers is a tip-off of a non-normal distribution. <br />
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test K-S Test Wikipedia]<br />
<br />
[http://www.physics.csbsju.edu/stats/KS-test.html K-S Test]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting&diff=14483SMHS ModelFitting2014-10-21T12:55:33Z<p>Clgalla: /* Scientific Methods for Health Sciences - Model Fitting and Model Quality (KS-test) */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Model Fitting and Model Quality (KS-test) ==<br />
<br />
===Overview===<br />
The Kolmogorov-Smirnov test (K-S test) is a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution (the one-sample K-S test) or to compare two samples (the two-sample K-S test). In the two-sample case, the K-S test is used to determine whether two datasets differ significantly. In this lecture, we present a general introduction to the K-S test and illustrate its application with examples. The implementation of the K-S test in the statistical package R is also discussed with specific examples. <br />
<br />
===Motivation===<br />
When we talk about testing the equality of two datasets, the first idea that comes to mind is usually the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. First, the control and treatment groups may not differ in mean, but only in some other way: for two datasets with the same mean and very different variances, the t-test cannot see the difference. Second, the treatment and control groups may be smallish datasets (say, fewer than 20 points) that differ in mean, but a substantially non-normal distribution masks the difference: for two datasets drawn from lognormal distributions that differ substantially in mean, the t-test can also fail. In all of these situations, the K-S test is a useful alternative. How does the K-S test work?<br />
<br />
===Theory===<br />
'''1) K-S test:''' a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S test quantifies a distance between the empirical distribution function of the sample and the cdf of the reference distribution, or between the empirical distribution functions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (one-sample case) or that the samples are drawn from the same distribution (two-sample case).<br />
*The K-S test is sensitive to differences in both the location and the shape of the empirical cdfs of the samples, and it is widely used as a nonparametric method for comparing two samples.<br />
*The empirical distribution function $F_{n}$ for $n$ i.i.d. observations $X_{i}$ is $F_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals $1$ when $X_{i} \leq x$ and $0$ otherwise.<br />
*The K-S statistic for a given cdf $F(x)$ is $D_{n}=\sup_{x}|F_{n}(x)-F(x)|,$ where $\sup_{x}$ is the supremum of the set of distances. By the Glivenko-Cantelli theorem, if the sample comes from the distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely as $n$ goes to infinity. <br />
*Kolmogorov distribution: the distribution of the random variable $K=\sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The cdf of $K$ is given by $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} / (8x^{2})}.$ Under the null hypothesis that the sample comes from the hypothesized distribution $F(x)$, $\sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} \sup_{t} |B(F(t))|$ in distribution. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$<br />
*The goodness-of-fit test, or K-S test, is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.<br />
*Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{i}$, the critical values determined in this way are invalid. Modifications to the test statistic and critical values have been proposed for this case.<br />
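The definitions above can be checked numerically. The following is a minimal sketch in R, not part of the original lecture code: it computes the one-sample statistic $D_{n}$ by hand for a small illustrative sample against a standard normal reference cdf, compares it with the value reported by <code>ks.test</code>, and evaluates the Kolmogorov cdf by truncating the alternating series. The helper name <code>pkolmogorov</code>, the choice of sample, and the truncation at 100 terms are assumptions made for illustration.<br />

```r
# Small illustrative sample; the reference cdf F is the standard normal (pnorm)
x  <- c(-0.45, 1.11, 0.48, -0.82, -1.26)
n  <- length(x)
xs <- sort(x)
Fx <- pnorm(xs)

# The supremum of |F_n - F| is attained at a data point, at or just before a jump
d.plus  <- max(seq_len(n) / n - Fx)        # F_n above F (at the jump)
d.minus <- max(Fx - (seq_len(n) - 1) / n)  # F above F_n (just before the jump)
D.n <- max(d.plus, d.minus)

# The same statistic from the built-in one-sample test
ks <- ks.test(x, "pnorm")

# Kolmogorov cdf: Pr(K <= x) = 1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 x^2)
# (hypothetical helper; the series is truncated at `terms`)
pkolmogorov <- function(x, terms = 100) {
  k <- seq_len(terms)
  1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
}

# The critical value K_alpha solves Pr(K <= K_alpha) = 1 - alpha
alpha   <- 0.05
K.alpha <- uniroot(function(q) pkolmogorov(q) - (1 - alpha),
                   interval = c(0.5, 3), tol = 1e-8)$root

print(c(D.n = D.n, ks.D = unname(ks$statistic), K.alpha = K.alpha))
```

Because the comparison of $\sqrt{n} D_{n}$ with $K_{\alpha}$ is asymptotic, a sample of size 5 is only meant to illustrate the mechanics.<br />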
<br />
'''2) Two-sample K-S test:''' tests whether two underlying one-dimensional probability distributions differ. The K-S statistic is $D_{n,{n}'}=\sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second samples, respectively. The null hypothesis is rejected at level $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|$\alpha$|| 0.1|| 0.05|| 0.025|| 0.01|| 0.005 ||0.001<br />
|-<br />
|c($\alpha$)|| 1.22|| 1.36|| 1.48|| 1.63|| 1.73|| 1.95<br />
|}<br />
</center><br />
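The tabulated values agree with the commonly used large-sample approximation $c(\alpha)=\sqrt{-\frac{1}{2}\ln(\alpha/2)}$. This formula is not stated in the lecture and is brought in here as an assumption. The sketch below reproduces the table and computes the rejection threshold for two samples of size 20; the function name <code>c.alpha</code> is hypothetical.<br />

```r
# Large-sample approximation to c(alpha); an assumption, not taken from the lecture
c.alpha <- function(alpha) sqrt(-0.5 * log(alpha / 2))

alphas <- c(0.1, 0.05, 0.025, 0.01, 0.005, 0.001)
tab    <- round(c.alpha(alphas), 2)

# Rejection threshold for D_{n,n'} at alpha = 0.05 with n = n' = 20
n1 <- 20
n2 <- 20
threshold <- c.alpha(0.05) * sqrt((n1 + n2) / (n1 * n2))

print(rbind(alpha = alphas, c.alpha = tab))
print(threshold)
```

With two samples of 20 points each, the null hypothesis is rejected at the 5% level when the observed statistic exceeds this threshold.<br />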
<br />
Note: the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is. While the K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution of $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistic $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$<br />
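The confidence band described in the note can be sketched in R as follows. This is an illustrative sketch under stated assumptions: it uses the control data from the example in this lecture and approximates $D_{\alpha}$ by the asymptotic value $K_{\alpha}/\sqrt{n}$ with $K_{0.05}\approx 1.358$, which is only approximate for a sample of 20 points.<br />

```r
# Control data from the lecture example
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
n  <- length(control)
Fn <- ecdf(control)

# Asymptotic 5% critical value (assumption: K_0.05 is approximately 1.358)
D.alpha <- 1.358 / sqrt(n)

# Band F_n(x) +/- D.alpha evaluated at the data points, clipped to [0, 1]
grid  <- sort(control)
lower <- pmax(Fn(grid) - D.alpha, 0)
upper <- pmin(Fn(grid) + D.alpha, 1)
band  <- data.frame(x = grid, lower = lower, Fn = Fn(grid), upper = upper)

print(head(band))
```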
<br />
'''3) Illustration of how the K-S test works with an example:''' consider the two datasets control $=\{1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38\}$ and treatment $=\{2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19\}$.<br />
*Descriptive statistics: reduce the list of all the data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the maximum is 50.57, the minimum is 0.08, and the standard deviation is 11.165. With a mean far above the median and a standard deviation much larger than the mean, this dataset is clearly not normally distributed.<br />
*Sort the control data: control sorted $=\{0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57\}.$ Evidently no data lie strictly below $0.08$; 5% = 1/20 of the data are strictly smaller than $0.10$; 10% of the data are strictly smaller than $0.15$; 15% of the data are strictly smaller than $0.17$; and so on. There are $17$ data points smaller than $\pi$, so the cumulative fraction of the data smaller than $\pi$ is $17/20 = 0.85.$ <br />
<br />
RCODE:<br />
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, <br />
0.49, 0.95 , 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)<br />
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, <br />
27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)<br />
summary(control)<br />
# Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
# 0.080 0.315 0.600 3.607 1.415 50.570<br />
sd(control)<br />
# [1] 11.16464<br />
*Plot the empirical cumulative distribution of the control data. From the chart, we can see that the majority of the data fall in a small region at the top left of the plot, a clear sign that the control dataset does not follow a normal distribution.<br />
library(stats)<br />
ecdf(control)<br />
plot(ecdf(control),verticals=TRUE,main='Cumulative Fraction Plot')<br />
<br />
[[Image:SMHS_Fig_1_Model_Fitting.png|500px]]<br />
<br />
*Now plot the control group using a log scale, which gives more space to the data points with small x. The plot now divides the data points evenly into two halves (half above the median, half below), with the median (0.60) a little below 1.<br />
<br />
<br />
[[Image:SMHS_Fig_2_Model_Fitting.png|500px]]<br />
<br />
log.con <- log(control)<br />
plot(ecdf(log.con),verticals=TRUE, main='Cumulative Fraction Plot of Log(Control)')<br />
<br />
*Now, plot the cumulative fraction of both the treatment and the control group on the same graph. <br />
<br />
[[Image:SMHS_Fig_3_Model_Fitting.png|500px]]<br />
<br />
From the chart, we can see that both datasets span much of the same range of values, but for most x values the cumulative fraction of the treatment group (red) is less than that of the control group (blue). The K-S test considers the difference between the two fractions at each x value and uses the maximum vertical deviation between the two curves as its statistic. In this case, the maximum deviation occurs near x=1 and is D=0.45: the fraction of the treatment group that is less than 1 is 0.2 (4 out of the 20 values), and that of the control group is 0.65 (13 out of the 20 values), so the maximum difference in the cumulative fractions is D=0.45.<br />
Note that the value of D is not affected by monotone changes of scale such as taking logs, unlike the t-statistic. Hence, the K-S test is a robust test that cares only about the relative distribution of the data. <br />
<br />
log.treat <- log(treatment)<br />
plot(ecdf(log.con),verticals=TRUE, main='Cumulative Fraction Plot on Log Scale',<br />
col.points='bisque',col.hor='blue',col.vert='blue',xlim=c(-3,5))<br />
par(new=TRUE)<br />
plot(ecdf(log.treat),verticals=TRUE, <br />
col.points='bisque',col.hor='red',col.vert='red',main='',xlim=c(-3,5))<br />
con.p <- ecdf(log.con)<br />
treat.p <- ecdf(log.treat)<br />
con.p(0)-treat.p(0) ### D=0.45<br />
(ecdf(control))(1)-(ecdf(treatment))(1) ## D=0.45 same as using the log-scale<br />
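<br />
The hand calculation of D can be cross-checked against the built-in two-sample test. The sketch below is an illustration rather than part of the original lecture code: it repeats the data so that it runs on its own, calls <code>ks.test</code> on the raw and on the log-transformed data, and compares D with the 5% threshold $c(\alpha)\sqrt{(n+n')/(nn')}$ using $c(0.05)=1.36$ from the table in this lecture.<br />

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11,
               27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)

# Two-sample K-S test on the raw data and on the log scale
res     <- ks.test(control, treatment)
res.log <- ks.test(log(control), log(treatment))

D     <- unname(res$statistic)
D.log <- unname(res.log$statistic)

# Rejection threshold at alpha = 0.05 from the c(alpha) table
n1 <- length(control)
n2 <- length(treatment)
threshold <- 1.36 * sqrt((n1 + n2) / (n1 * n2))

print(res)
print(c(D = D, D.log = D.log, threshold = threshold))
```

Both calls give the same D, which illustrates the invariance under the log transformation, and D exceeds the threshold, so the table criterion rejects the null hypothesis at the 5% level.<br />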
<br />
'''4) Using the Percentile Plot:''' we are used to observing and comparing continuous curves, so we may want something similar to the cumulative fraction plot but without the abrupt steps; percentiles provide this. Consider the dataset $\{-0.45, 1.11, 0.48, -0.82, -1.26\}.$ Sorted from smallest to largest, it is $\{-1.26, -0.82, -0.45, 0.48, 1.11\}.$ The median is $-0.45$, which is the 50th percentile. To calculate the percentile of a point, take its rank $r$ in the sorted dataset and divide by the number of points plus one: percentile = $r/(N+1)$. This gives the set of (data, percentile) pairs: $\{(-1.26, 0.167), (-0.82, 0.333), (-0.45, 0.5), (0.48, 0.667), (1.11, 0.833)\}$. Connecting adjacent data points with straight lines yields a collection of connected line segments called an '''''ogive'''''.<br />
<br />
<br />
RCODE:<br />
data <- c(-0.45, 1.11, 0.48, -0.82, -1.26)<br />
sort.data <- sort(data)<br />
# percentile = r/(N+1), where r is the rank in the sorted data<br />
percentile <- seq_along(sort.data)/(length(sort.data)+1)<br />
plot(ecdf(data),verticals=TRUE,xlim=c(-1.5,1.5),ylim=c(0,1), xlab='Data', ylab='' ,main='Cumulative Fraction Plot vs. Percentile Plot')<br />
par(new=TRUE)<br />
plot(sort.data,percentile,type='o',xlim=c(-1.5,1.5),ylim=c(0,1),col=2,xlab='', ylab='', main='')<br />
<br />
[[Image:SMHS_Fig_4_Model_Fitting.png|500px]]<br />
<br />
Reasons to prefer the percentile plot to the cumulative fraction plot: the percentile plot is a better estimate of the distribution function, and percentiles allow us to use ‘probability graph paper’, that is, plots with specially scaled axis divisions. A probability scale on the y-axis lets us see how close to normal the data are: normally distributed data fall on a straight line on probability paper, while log-normal data fall on a straight line on probability-log scaled axes. <br />
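<br />
A probability-scaled axis can be imitated in R by plotting the normal quantiles of the percentiles instead of the percentiles themselves. The following is a minimal sketch under that assumption, using the five-point dataset above; the axis labels and the dashed reference line are illustrative choices, not part of the original lecture code.<br />

```r
data <- c(-0.45, 1.11, 0.48, -0.82, -1.26)
sort.data  <- sort(data)
percentile <- seq_along(sort.data) / (length(sort.data) + 1)  # r/(N+1)

# Probability scale: position each percentile at its standard normal quantile
z <- qnorm(percentile)

plot(sort.data, z, type = 'o', yaxt = 'n', xlab = 'Data',
     ylab = 'Percentile (probability scale)',
     main = 'Percentile Plot on a Normal Probability Scale')
axis(2, at = z, labels = round(percentile, 3))

# Normally distributed data should lie close to a straight line
abline(lm(z ~ sort.data), lty = 2)
```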
<br />
'''5) The K-S statistic in more than one dimension:''' the K-S test statistic must be modified to accommodate multivariate data. The modification is not straightforward, because the maximum difference between two joint cdfs is generally not the same as the maximum difference between any of the complementary distribution functions. Instead, the maximum difference depends on which of $Pr(X<x \wedge Y<y)$, $Pr(X<x \wedge Y>y)$, or either of the other two possible arrangements is used. One approach is to compare the cdfs of the two samples under all possible orderings and take the largest difference as the K-S statistic. In $d$ dimensions, there are $2^{d}-1$ such orderings. The critical values for the test statistic can be obtained by simulation, but they depend on the dependence structure of the joint distribution.<br />
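<br />
For $d=2$, the "all orderings" idea can be sketched by brute force. The code below is one possible reading, offered as an assumption rather than as the lecture's method: it takes every observed point as an origin, counts the fraction of each sample falling in each of the four open quadrants around it, and keeps the largest absolute difference. The function name <code>ks2d.stat</code> and the toy data are hypothetical, and critical values would still have to come from simulation.<br />

```r
# Brute-force two-dimensional K-S-type statistic (illustrative sketch).
# a, b: numeric matrices with two columns (x, y), one row per observation.
ks2d.stat <- function(a, b) {
  pooled <- rbind(a, b)
  d.max <- 0
  for (i in seq_len(nrow(pooled))) {
    x0 <- pooled[i, 1]
    y0 <- pooled[i, 2]
    # sx, sy select one of the four quadrant orientations around (x0, y0)
    for (sx in c(-1, 1)) {
      for (sy in c(-1, 1)) {
        fa <- mean(sx * (a[, 1] - x0) > 0 & sy * (a[, 2] - y0) > 0)
        fb <- mean(sx * (b[, 1] - x0) > 0 & sy * (b[, 2] - y0) > 0)
        d.max <- max(d.max, abs(fa - fb))
      }
    }
  }
  d.max
}

# Toy data: identical samples versus completely separated samples
a <- cbind(c(1, 2, 3), c(1, 2, 3))
b <- cbind(c(11, 12, 13), c(11, 12, 13))
same  <- ks2d.stat(a, a)
apart <- ks2d.stat(a, b)
print(c(same = same, apart = apart))
```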
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_KolmogorovSmirnoff This article] presents the SOCR analyses example on the Kolmogorov-Smirnov test, which compares how distinct two samples are. It gives a general introduction to the K-S test and illustrates its application with the Oats example and a control-treatment test in the SOCR program.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and the fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and a description of the data, followed by exploratory data analysis and model fitting in the SOCR environment.<br />
<br />
3) [http://jhc.sagepub.com/content/25/7/935.short This article] presents the Kolmogorov-Smirnov statistical test for the analysis of histograms and discusses the test for both the two-sample case (comparing fn1(X) to fn2(X)) and the one-sample case (comparing fn1(X) to f(X)). The specific algorithmic steps are presented by working through an example whose data come from an experiment discussed elsewhere in the same journal issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level. <br />
<br />
4) [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4310069&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4310069 This article] develops digital techniques for detecting changes between two Landsat images. A data matrix containing 16 × 16 picture elements was used for this purpose. The Landsat imagery data were corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites. It successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.<br />
<br />
===Software=== <br />
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ks.test.html <br />
<br />
http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html <br />
<br />
The R code is included with the examples given in this lecture.<br />
<br />
===Problems===<br />
<br />
1) Consider the following datasets: control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}. Use the K-S test to test the equality of the two samples. Include all necessary steps and plots.<br />
<br />
2) Revise the example given in the lecture with a different control group, control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, then run the K-S test again and state your conclusions clearly, with the necessary plots.<br />
<br />
3) Two types of trees are in bloom in a wood, and we want to study whether bees prefer one tree to the other. We collect data by using a stopwatch to time how long a bee stays near a particular tree. We start timing when the bee touches the tree and stop when the bee is more than a meter from the tree. (As a result, all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset sizes are not a problem for the K-S test.) Apply the K-S test to this example. The data are given below:<br />
T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5}<br />
T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7}<br />
Note that this example is based on data distributed according to the Cauchy distribution, a particularly abnormal case. The plots do not look particularly abnormal; however, the large number of outliers is a tip-off of a non-normal distribution. <br />
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test K-S Test Wikipedia]<br />
<br />
[http://www.physics.csbsju.edu/stats/KS-test.html K-S Test]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_Fig_4_Model_Fitting.png&diff=14482File:SMHS Fig 4 Model Fitting.png2014-10-20T20:28:43Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_Fig_3_Model_Fitting.png&diff=14481File:SMHS Fig 3 Model Fitting.png2014-10-20T20:28:29Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_Fig_2_Model_Fitting.png&diff=14480File:SMHS Fig 2 Model Fitting.png2014-10-20T20:28:16Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_Fig_1_Model_Fitting.png&diff=14479File:SMHS Fig 1 Model Fitting.png2014-10-20T20:28:01Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting&diff=14478SMHS ModelFitting2014-10-20T20:27:40Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Model Fitting and Model Quality (KS-test) ==<br />
<br />
===Overview===<br />
The Kolmogorov-Smirnov test (K-S test) is a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution (the one-sample K-S test) or to compare two samples (the two-sample K-S test). In the two-sample case, the K-S test is used to determine whether two datasets differ significantly. In this lecture, we present a general introduction to the K-S test and illustrate its application with examples. The implementation of the K-S test in the statistical package R is also discussed with specific examples. <br />
<br />
===Motivation===<br />
When we talk about testing the equality of two datasets, the first idea that comes to mind is usually the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. First, the control and treatment groups may not differ in mean, but only in some other way: for two datasets with the same mean and very different variances, the t-test cannot see the difference. Second, the treatment and control groups may be smallish datasets (say, fewer than 20 points) that differ in mean, but a substantially non-normal distribution masks the difference: for two datasets drawn from lognormal distributions that differ substantially in mean, the t-test can also fail. In all of these situations, the K-S test is a useful alternative. How does the K-S test work?<br />
<br />
===Theory===<br />
'''1) K-S test:''' a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S test quantifies a distance between the empirical distribution function of the sample and the cdf of the reference distribution, or between the empirical distribution functions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (one-sample case) or that the samples are drawn from the same distribution (two-sample case).<br />
*The K-S test is sensitive to differences in both the location and the shape of the empirical cdfs of the samples, and it is widely used as a nonparametric method for comparing two samples.<br />
*The empirical distribution function $F_{n}$ for $n$ i.i.d. observations $X_{i}$ is $F_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals $1$ when $X_{i} \leq x$ and $0$ otherwise.<br />
*The K-S statistic for a given cdf $F(x)$ is $D_{n}=\sup_{x}|F_{n}(x)-F(x)|,$ where $\sup_{x}$ is the supremum of the set of distances. By the Glivenko-Cantelli theorem, if the sample comes from the distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely as $n$ goes to infinity. <br />
*Kolmogorov distribution: the distribution of the random variable $K=\sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The cdf of $K$ is given by $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} / (8x^{2})}.$ Under the null hypothesis that the sample comes from the hypothesized distribution $F(x)$, $\sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} \sup_{t} |B(F(t))|$ in distribution. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$<br />
*The goodness-of-fit test, or K-S test, is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.<br />
*Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{i}$, the critical values determined in this way are invalid. Modifications to the test statistic and critical values have been proposed for this case.<br />
<br />
'''2) Two-sample K-S test:''' tests whether two underlying one-dimensional probability distributions differ. The K-S statistic is $D_{n,{n}'}=\sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second samples, respectively. The null hypothesis is rejected at level $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|$\alpha$|| 0.1|| 0.05|| 0.025|| 0.01|| 0.005 ||0.001<br />
|-<br />
|c($\alpha$)|| 1.22|| 1.36|| 1.48|| 1.63|| 1.73|| 1.95<br />
|}<br />
</center><br />
<br />
Note: the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is. While the K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution of $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistic $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$<br />
<br />
'''3) Illustration of how the K-S test works with an example:''' consider the two datasets control $=\{1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38\}$ and treatment $=\{2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19\}$.<br />
*Descriptive statistics: reduce the list of all the data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the maximum is 50.57, the minimum is 0.08, and the standard deviation is 11.165. With a mean far above the median and a standard deviation much larger than the mean, this dataset is clearly not normally distributed.<br />
*Sort the control data: control sorted $=\{0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57\}.$ Evidently no data lie strictly below $0.08$; 5% = 1/20 of the data are strictly smaller than $0.10$; 10% of the data are strictly smaller than $0.15$; 15% of the data are strictly smaller than $0.17$; and so on. There are $17$ data points smaller than $\pi$, so the cumulative fraction of the data smaller than $\pi$ is $17/20 = 0.85.$ <br />
<br />
RCODE:<br />
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, <br />
0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)<br />
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, <br />
27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)<br />
summary(control)  # quartiles, mean and range of the control group<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
0.080 0.315 0.600 3.607 1.415 50.570<br />
sd(control)  # sample standard deviation<br />
[1] 11.16464<br />
*Plot the empirical cumulative distribution of the control data. In the chart, most of the data are squeezed into a small region at the top left of the plot because the largest value (50.57) stretches the horizontal axis, a clear sign that the control dataset does not follow a normal distribution.<br />
library(stats)  # attached by default in R; shown for completeness<br />
ecdf(control)<br />
plot(ecdf(control), verticals=TRUE, main='Cumulative Fraction Plot')<br />
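<br />
The comparison of the two groups can be completed along the following lines. This is a sketch that is not part of the original lecture code: it computes the two-sample statistic directly from the two empirical cdfs and then calls ks.test from the stats package; the names pooled, D_manual and res are chosen here for illustration.<br />

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11,
               27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)

# Both empirical cdfs are step functions that only change at data points,
# so the supremum of |F1 - F2| is attained at one of the pooled values.
pooled <- sort(c(control, treatment))
D_manual <- max(abs(ecdf(control)(pooled) - ecdf(treatment)(pooled)))
print(D_manual)

# The same statistic, together with a p-value, from the built-in test.
res <- ks.test(control, treatment)
print(res)
```

The statistic is $D = 0.45$: for example, 13 of the 20 control values but only 4 of the 20 treatment values are at most $0.95$. With $n = {n}' = 20$ this exceeds the 5% threshold $1.36\sqrt{40/400} \approx 0.430$ but not the 2.5% threshold $1.48\sqrt{40/400} \approx 0.468$.<br />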
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_KolmogorovSmirnoff This article] presents the SOCR analysis example on the Kolmogorov-Smirnov test, which assesses how distinct two samples are. It gives a general introduction to the K-S test and illustrates its application with the Oats example and a control-treatment test using the SOCR program.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and the fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and a description of the data, followed by exploratory data analysis and model fitting in the SOCR environment.<br />
<br />
3) [http://jhc.sagepub.com/content/25/7/935.short This article] presents the Kolmogorov-Smirnov statistical test for the analysis of histograms and discusses the test for both the two-sample case (comparing fn1(X) to fn2(X)) and the one-sample case (comparing fn1(X) to f(X)). The specific algorithmic steps are presented through an example whose data come from an experiment discussed elsewhere in the same journal issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level.<br />
<br />
4) [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4310069&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4310069 This article] develops digital techniques for detecting changes between two Landsat images. A data matrix containing 16 × 16 picture elements was used for this purpose. The Landsat imagery data were corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites. It successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.<br />
<br />
===Software=== <br />
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ks.test.html <br />
<br />
http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html <br />
<br />
The R code is included with the examples given in this lecture.<br />
<br />
===Problems===<br />
<br />
6.1) Consider the following dataset, control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}, use the K-S test to test the equality of the two samples. Include all necessary steps and plots.<br />
<br />
6.2) Revise the example given in the lecture with a different control group: control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, run the K-S test again and state clearly your conclusions with necessary plots.<br />
<br />
6.3) Two types of trees are in bloom in a wood, and we want to study whether bees prefer one tree to the other. We collect data by using a stopwatch to time how long a bee stays near a particular tree. We begin timing when the bee touches the tree; we stop timing when the bee is more than a meter from the tree. (As a result, all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset size is not a problem for the K-S test.) Apply the K-S test to this example. The data are given below:<br />
T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5}<br />
T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7}<br />
Note that this example is based on data distributed according to the Cauchy distribution, a particularly abnormal case. The plots do not look particularly abnormal; however, the large number of outliers is a tip-off of a non-normal distribution.<br />
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test K-S Test Wikipedia]<br />
<br />
[http://www.physics.csbsju.edu/stats/KS-test.html K-S Test]<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting&diff=14477SMHS ModelFitting2014-10-20T20:24:06Z<p>Clgalla: /* Scientific Methods for Health Sciences - Model Fitting and Model Quality (KS-test) */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Model Fitting and Model Quality (KS-test) ==<br />
<br />
===Overview===<br />
The Kolmogorov-Smirnov test (K-S test) is a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution (the one-sample K-S test) or to compare two samples (the two-sample K-S test). In the two-sample case, it is used to determine whether two datasets differ significantly. In this lecture, we present a general introduction to the K-S test and illustrate its application with examples. The implementation of the K-S test in the statistical package R is also discussed with specific examples.<br />
<br />
===Motivation===<br />
When we talk about testing the equality of two datasets, the first idea that comes to mind is often the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. First, the control and treatment groups may not differ in mean, but only in some other way: for two datasets with the same mean and very different variances, the t-test cannot see the difference. Second, the treatment and control groups may be smallish datasets (say, fewer than 20 observations) that differ in mean, but a substantially non-normal distribution masks the difference: for two datasets drawn from lognormal distributions that differ substantially in mean, the t-test can also fail. In all of these situations, the K-S test can be the answer. How does the K-S test work?<br />
<br />
===Theory===<br />
'''1) K-S test:''' a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions; it can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S test quantifies a distance between the empirical distribution function of the sample and the cdf of the reference distribution, or between the empirical distribution functions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (one-sample case) or that the samples are drawn from the same distribution (two-sample case).<br />
*The K-S test is sensitive to differences in both the location and the shape of the empirical cdfs of the samples, and it is widely used as a nonparametric method for comparing two samples.<br />
*The empirical distribution function $F_{n}$ for $n$ i.i.d. observations $X_{i}$ is $F_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals $1$ when $X_{i} \leq x$ and $0$ otherwise.<br />
*The K-S statistic for a given cdf $F(x)$ is $D_{n}=\sup_{x}|F_{n}(x)-F(x)|,$ where $\sup_{x}$ is the supremum of the set of distances. By the Glivenko-Cantelli theorem, if the sample comes from the distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely as $n$ goes to infinity.<br />
*Kolmogorov distribution: the distribution of the random variable $K=\sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The cdf of $K$ is given by $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} / (8x^{2})}.$ Under the null hypothesis that the sample comes from the hypothesized distribution $F(x)$, $\sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} \sup_{t} |B(F(t))|$ in distribution, where $B(t)$ is the Brownian bridge. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$<br />
*The goodness-of-fit test, or K-S test, is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.<br />
*Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{i}$, the critical values determined in this way are invalid. Modifications to the test statistic and critical values have been proposed for such cases.<br />
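<br />
The definition of $D_{n}$ above can be turned into a few lines of R. This is a minimal sketch under stated assumptions: ks_stat_one_sample is a hypothetical helper, the five data values are made up for illustration, and the reference distribution is taken to be Uniform(0,1) via punif.<br />

```r
# One-sample K-S statistic D_n = sup_x |F_n(x) - F(x)| for a continuous cdf F.
# ks_stat_one_sample is a hypothetical helper written for this sketch.
ks_stat_one_sample <- function(x, cdf) {
  x <- sort(x)
  n <- length(x)
  Fx <- cdf(x)
  i <- seq_len(n)
  # F_n jumps by 1/n at each order statistic, so the supremum is reached
  # either just after a jump (i/n - F) or just before it (F - (i-1)/n).
  max(pmax(i / n - Fx, Fx - (i - 1) / n))
}

x_demo <- c(0.05, 0.20, 0.35, 0.60, 0.95)  # made-up sample for illustration
D_demo <- ks_stat_one_sample(x_demo, punif)
print(D_demo)

# Cross-check against the built-in one-sample test.
print(ks.test(x_demo, "punif")$statistic)
```

For this small sample the largest gap occurs at the third order statistic, where $F_{n}(0.35) = 3/5$ and $F(0.35) = 0.35$, giving $D_{n} = 0.25$.<br />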
<br />
'''2) Two-sample K-S test:''' tests whether two underlying one-dimensional probability distributions differ. The K-S statistic is $D_{n,{n}'}=\sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second sample respectively. The null hypothesis is rejected at level $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|$\alpha$|| 0.1|| 0.05|| 0.025|| 0.01|| 0.005 ||0.001<br />
|-<br />
|c($\alpha$)|| 1.22|| 1.36|| 1.48|| 1.63|| 1.73|| 1.95<br />
|}<br />
</center><br />
<br />
Note: the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is. While the K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution of $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistic $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$<br />
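<br />
The confidence band described in the note can be sketched in R for the control data used in the example of this lecture. This sketch rests on assumptions that are not in the original text: it uses the asymptotic 95% point of the Kolmogorov distribution, $K_{0.05} \approx 1.358$, and sets $D_{\alpha} = K_{\alpha}/\sqrt{n}$; for a sample as small as $n = 20$ the exact critical value is somewhat different, so the band is only approximate.<br />

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
n <- length(control)

K_alpha <- 1.358               # asymptotic 95% point (assumption of this sketch)
D_alpha <- K_alpha / sqrt(n)   # half-width of the band

Fn <- ecdf(control)
x <- sort(control)

# Band F_n(x) +/- D_alpha, clipped to [0, 1] because a cdf cannot leave it.
band <- data.frame(x = x,
                   Fn = Fn(x),
                   lower = pmax(Fn(x) - D_alpha, 0),
                   upper = pmin(Fn(x) + D_alpha, 1))
print(head(band))
```

With $n = 20$ the half-width is about $0.30$, which shows how wide the band is for a sample of this size.<br />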
<br />
3) Illustration of how the K-S test works, with an example: consider the two datasets control $=\{1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38\}$ and treatment $=\{2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19\}$.<br />
*Descriptive statistics reduce the list of data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the maximum is 50.57, the minimum is 0.08 and the standard deviation is 11.165. The mean lies far above the median and the standard deviation is about three times the mean, so this dataset is clearly not normally distributed.<br />
*Sort the control data: control sorted $=\{0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57\}$. Evidently no data lie strictly below $0.08$; $5\% = 1/20$ of the data are strictly smaller than $0.10$, $10\%$ are strictly smaller than $0.15$, $15\%$ are strictly smaller than $0.17$, and so on. There are $17$ data points smaller than $\pi$, so the cumulative fraction of the data smaller than $\pi$ is $17/20 = 0.85.$<br />
<br />
RCODE:<br />
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)<br />
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)<br />
summary(control)  # quartiles, mean and range of the control group<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
0.080 0.315 0.600 3.607 1.415 50.570<br />
sd(control)  # sample standard deviation<br />
[1] 11.16464<br />
*Plot the empirical cumulative distribution of the control data. In the chart, most of the data are squeezed into a small region at the top left of the plot because the largest value (50.57) stretches the horizontal axis, a clear sign that the control dataset does not follow a normal distribution.<br />
library(stats)  # attached by default in R; shown for completeness<br />
ecdf(control)<br />
plot(ecdf(control), verticals=TRUE, main='Cumulative Fraction Plot')<br />
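<br />
The cumulative fractions quoted for the sorted control data can be checked numerically. The following is a small sketch, not part of the original lecture code; the names frac_below_pi and strict_fracs are chosen here for illustration.<br />

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)

# Fraction of control values strictly smaller than pi.
frac_below_pi <- mean(control < pi)
print(frac_below_pi)

# ecdf() uses "<=", which gives the same answer here because pi is not a data value.
Fn <- ecdf(control)
print(Fn(pi))

# Fractions strictly below each of the four smallest values (0.08, 0.10, 0.15, 0.17).
strict_fracs <- sapply(sort(control)[1:4], function(v) mean(control < v))
print(strict_fracs)
```

The first two printed values are both $0.85 = 17/20$, and the last line reproduces the fractions $0$, $0.05$, $0.10$ and $0.15$ listed in the text.<br />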
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_KolmogorovSmirnoff This article] presents the SOCR analysis example on the Kolmogorov-Smirnov test, which assesses how distinct two samples are. It gives a general introduction to the K-S test and illustrates its application with the Oats example and a control-treatment test using the SOCR program.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and the fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and a description of the data, followed by exploratory data analysis and model fitting in the SOCR environment.<br />
<br />
3) [http://jhc.sagepub.com/content/25/7/935.short This article] presents the Kolmogorov-Smirnov statistical test for the analysis of histograms and discusses the test for both the two-sample case (comparing fn1(X) to fn2(X)) and the one-sample case (comparing fn1(X) to f(X)). The specific algorithmic steps are presented through an example whose data come from an experiment discussed elsewhere in the same journal issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level.<br />
<br />
4) [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4310069&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4310069 This article] develops digital techniques for detecting changes between two Landsat images. A data matrix containing 16 × 16 picture elements was used for this purpose. The Landsat imagery data were corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites. It successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.<br />
<br />
===Software=== <br />
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ks.test.html <br />
<br />
http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html <br />
<br />
The R code is included with the examples given in this lecture.<br />
<br />
===Problems===<br />
<br />
6.1) Consider the following dataset, control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}, use the K-S test to test the equality of the two samples. Include all necessary steps and plots.<br />
<br />
6.2) Revise the example given in the lecture with a different control group: control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, run the K-S test again and state clearly your conclusions with necessary plots.<br />
<br />
6.3) Two types of trees are in bloom in a wood, and we want to study whether bees prefer one tree to the other. We collect data by using a stopwatch to time how long a bee stays near a particular tree. We begin timing when the bee touches the tree; we stop timing when the bee is more than a meter from the tree. (As a result, all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset size is not a problem for the K-S test.) Apply the K-S test to this example. The data are given below:<br />
T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5}<br />
T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7}<br />
Note that this example is based on data distributed according to the Cauchy distribution, a particularly abnormal case. The plots do not look particularly abnormal; however, the large number of outliers is a tip-off of a non-normal distribution.<br />
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test K-S Test Wikipedia]<br />
<br />
[http://www.physics.csbsju.edu/stats/KS-test.html K-S Test]<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting&diff=14476SMHS ModelFitting2014-10-20T19:43:24Z<p>Clgalla: /* Scientific Methods for Health Sciences - Model Fitting and Model Quality (KS-test) */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Model Fitting and Model Quality (KS-test) ==<br />
<br />
===Overview===<br />
The Kolmogorov-Smirnov test (K-S test) is a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution (the one-sample K-S test) or to compare two samples (the two-sample K-S test). In the two-sample case, it is used to determine whether two datasets differ significantly. In this lecture, we present a general introduction to the K-S test and illustrate its application with examples. The implementation of the K-S test in the statistical package R is also discussed with specific examples.<br />
<br />
===Motivation===<br />
When we talk about testing the equality of two datasets, the first idea that comes to mind is often the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. First, the control and treatment groups may not differ in mean, but only in some other way: for two datasets with the same mean and very different variances, the t-test cannot see the difference. Second, the treatment and control groups may be smallish datasets (say, fewer than 20 observations) that differ in mean, but a substantially non-normal distribution masks the difference: for two datasets drawn from lognormal distributions that differ substantially in mean, the t-test can also fail. In all of these situations, the K-S test can be the answer. How does the K-S test work?<br />
<br />
===Theory===<br />
'''1) K-S test:''' a nonparametric test, commonly applied in various fields, of the equality of continuous, one-dimensional probability distributions; it can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S test quantifies a distance between the empirical distribution function of the sample and the cdf of the reference distribution, or between the empirical distribution functions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (one-sample case) or that the samples are drawn from the same distribution (two-sample case).<br />
*The K-S test is sensitive to differences in both the location and the shape of the empirical cdfs of the samples, and it is widely used as a nonparametric method for comparing two samples.<br />
*The empirical distribution function $F_{n}$ for $n$ i.i.d. observations $X_{i}$ is $F_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals $1$ when $X_{i} \leq x$ and $0$ otherwise.<br />
*The K-S statistic for a given cdf $F(x)$ is $D_{n}=\sup_{x}|F_{n}(x)-F(x)|,$ where $\sup_{x}$ is the supremum of the set of distances. By the Glivenko-Cantelli theorem, if the sample comes from the distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely as $n$ goes to infinity.<br />
*Kolmogorov distribution: the distribution of the random variable $K=\sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The cdf of $K$ is given by $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} / (8x^{2})}.$ Under the null hypothesis that the sample comes from the hypothesized distribution $F(x)$, $\sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} \sup_{t} |B(F(t))|$ in distribution, where $B(t)$ is the Brownian bridge. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$<br />
*The goodness-of-fit test, or K-S test, is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.<br />
*Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{i}$, the critical values determined in this way are invalid. Modifications to the test statistic and critical values have been proposed for such cases.<br />
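<br />
The two series for the cdf of the Kolmogorov distribution can be evaluated numerically to see that they agree and to recover a critical value $K_{\alpha}$. This is a sketch under stated assumptions: the function names pkolm_alt and pkolm_theta are hypothetical, the infinite sums are truncated at 100 terms, and $K_{0.05}$ is found with uniroot on the interval $[0.5, 3]$.<br />

```r
# cdf of the Kolmogorov distribution, alternating-series form.
# pkolm_alt and pkolm_theta are hypothetical names used only in this sketch.
pkolm_alt <- function(x, terms = 100) {
  k <- seq_len(terms)
  1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
}

# The same cdf, second (theta-function) form.
pkolm_theta <- function(x, terms = 100) {
  k <- seq_len(terms)
  sqrt(2 * pi) / x * sum(exp(-(2 * k - 1)^2 * pi^2 / (8 * x^2)))
}

p1 <- pkolm_alt(1.36)
p2 <- pkolm_theta(1.36)
print(c(p1, p2))

# Critical value K_alpha solving Pr(K <= K_alpha) = 1 - alpha for alpha = 0.05.
K_05 <- uniroot(function(x) pkolm_alt(x) - 0.95, c(0.5, 3))$root
print(K_05)
```

Both forms give a probability of about $0.95$ at $x = 1.36$, which is consistent with the value $c(0.05) = 1.36$ used in the two-sample table.<br />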
<br />
'''2) Two-sample K-S test:''' tests whether two underlying one-dimensional probability distributions differ. The K-S statistic is $D_{n,{n}'}=\sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second sample respectively. The null hypothesis is rejected at level $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|$\alpha$|| 0.1|| 0.05|| 0.025|| 0.01|| 0.005 ||0.001<br />
|-<br />
|c($\alpha$)|| 1.22|| 1.36|| 1.48|| 1.63|| 1.73|| 1.95<br />
|}<br />
</center><br />
<br />
Note: the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is. While the K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution of $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistic $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$<br />
<br />
3) Illustration of how the K-S test works, with an example: consider the two datasets control $=\{1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38\}$ and treatment $=\{2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19\}$.<br />
*Descriptive statistics reduce the list of data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the maximum is 50.57, the minimum is 0.08 and the standard deviation is 11.165. The mean lies far above the median and the standard deviation is about three times the mean, so this dataset is clearly not normally distributed.<br />
*Sort the control data: control sorted $=\{0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57\}$. Evidently no data lie strictly below $0.08$; $5\% = 1/20$ of the data are strictly smaller than $0.10$, $10\%$ are strictly smaller than $0.15$, $15\%$ are strictly smaller than $0.17$, and so on. There are $17$ data points smaller than $\pi$, so the cumulative fraction of the data smaller than $\pi$ is $17/20 = 0.85.$<br />
<br />
RCODE:<br />
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)<br />
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)<br />
summary(control)  # quartiles, mean and range of the control group<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
0.080 0.315 0.600 3.607 1.415 50.570<br />
sd(control)  # sample standard deviation<br />
[1] 11.16464<br />
*Plot the empirical cumulative distribution of the control data. In the chart, most of the data are squeezed into a small region at the top left of the plot because the largest value (50.57) stretches the horizontal axis, a clear sign that the control dataset does not follow a normal distribution.<br />
library(stats)  # attached by default in R; shown for completeness<br />
ecdf(control)<br />
plot(ecdf(control), verticals=TRUE, main='Cumulative Fraction Plot')<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_KolmogorovSmirnoff This article] presents the SOCR analysis example on the Kolmogorov-Smirnov test, which assesses how distinct two samples are. It gives a general introduction to the K-S test and illustrates its application with the Oats example and a control-treatment test using the SOCR program.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and the fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and a description of the data, followed by exploratory data analysis and model fitting in the SOCR environment.<br />
<br />
3) [http://jhc.sagepub.com/content/25/7/935.short This article] presents the Kolmogorov-Smirnov statistical test for the analysis of histograms and discusses the test for both the two-sample case (comparing fn1(X) to fn2(X)) and the one-sample case (comparing fn1(X) to f(X)). The specific algorithmic steps are presented through an example whose data come from an experiment discussed elsewhere in the same journal issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level.<br />
<br />
4) [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4310069&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4310069 This article] develops digital techniques for detecting changes between two Landsat images. A data matrix containing 16 × 16 picture elements was used for this purpose. The Landsat imagery data were corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites. It successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.<br />
<br />
===Software=== <br />
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ks.test.html <br />
<br />
http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html <br />
<br />
R code is attached to the examples given in this lecture.<br />
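<br />
As a minimal sketch of how the ks.test() function documented above might be called, the following uses simulated data (an assumption for illustration, not the lecture's data sets). It also recomputes the test statistic by hand as the largest vertical distance between the two empirical CDFs; the variable names are hypothetical.<br />

```r
# Simulated samples (assumption: illustrative only, not the lecture data)
set.seed(1)
x <- rnorm(50, mean = 0, sd = 1)
y <- rnorm(60, mean = 1, sd = 1)

# Two-sample K-S test; the sample sizes do not need to match
ks <- ks.test(x, y)
D <- unname(ks[["statistic"]])   # largest distance between the two ECDFs
p <- ks[["p.value"]]

# The same statistic computed directly from the empirical CDFs
pooled <- sort(c(x, y))
D.manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

# Both empirical CDFs on one plot for visual comparison
plot(ecdf(x), main = "Empirical CDFs", xlim = range(pooled))
plot(ecdf(y), add = TRUE, col = "red")
```

Comparing D with D.manual is a quick check that the reported statistic really is the maximum gap between the two step functions shown in the plot.<br />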
<br />
===Problems===<br />
<br />
6.1) Consider the following dataset, control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}, use the K-S test to test the equality of the two samples. Include all necessary steps and plots.<br />
<br />
6.2) Revise the example given in the lecture with a different control group: control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, run the K-S test again and state clearly your conclusions with necessary plots.<br />
<br />
6.3) Two types of trees are in bloom in a wood, and we want to study whether bees prefer one tree to the other. We collect data by using a stop-watch to time how long a bee stays near a particular tree. We begin to time when the bee touches the tree; we stop timing when the bee is more than a meter from the tree. (As a result all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset size is not a problem for the K-S test.) Apply the K-S test to this example. The data are given below:<br />
T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5}<br />
T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7}<br />
Note that this example is based on data distributed according to the Cauchy distribution, a markedly non-normal case. The plots may not look particularly abnormal; however, the large number of outliers is a tip-off of a non-normal distribution.<br />
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test K-S Test Wikipedia]<br />
<br />
[http://www.physics.csbsju.edu/stats/KS-test.html K-S Test]<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14472SMHS MixtureModeling2014-10-20T18:57:09Z<p>Clgalla: /* Theory */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
===Overview=== <br />
A mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that the observed data set identify the sub-population to which each individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of a mixture model, various types of mixture models, the estimation of their parameters, and applications of mixture models in studies. The implementation of mixture modeling in R is also discussed in the attached articles.<br />
<br />
===Motivation===<br />
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the sub-populations. We use mixture models to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. This is not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
===Theory===<br />
'''1) Structure of mixture model:''' a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, where the $\lambda_{k}$ are the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for the $f_{k}$ generate different parametric mixture models, such as Gaussian, Binomial, or Poisson mixtures. The components may all be Gaussian with different parameters, or all Poisson with different means. The model can then be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If $K=n$, the number of observations, we are back to kernel density estimation.<br />
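<br />
To make the definition concrete, here is a small sketch of a two-component Gaussian mixture density. The weights, means and standard deviations are hypothetical values chosen only for illustration. The code checks numerically that the weighted sum of component densities is itself a density, and that its mean is the weighted average of the component means.<br />

```r
# Hypothetical two-component Gaussian mixture (illustrative values only)
lambda <- c(0.3, 0.7)   # mixing weights: positive and summing to 1
mu     <- c(-2, 3)      # component means
sigma  <- c(1, 2)       # component standard deviations

# f(x) = sum over k of lambda_k * f_k(x)
dmix <- function(x) {
  lambda[1] * dnorm(x, mu[1], sigma[1]) +
    lambda[2] * dnorm(x, mu[2], sigma[2])
}

# A valid density integrates to 1 over the real line
total <- integrate(dmix, lower = -Inf, upper = Inf)[["value"]]

# The mixture mean equals the weighted average of the component means
mix.mean <- integrate(function(x) x * dmix(x), -Inf, Inf)[["value"]]
```

Because the weights sum to one and each component integrates to one, total should be numerically close to 1, and mix.mean close to sum(lambda * mu).<br />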
<br />
'''2) Estimating parametric mixture models:''' assume independent samples, so that the joint density (likelihood) of the observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$.<br />
<br />
<br />
We take the logarithm to turn multiplication into addition: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$.<br />
<br />
<br />
Taking the derivative with respect to one parameter, say $\theta_{j}$, and using $\frac{\partial f}{\partial \theta_{j}} = f \frac{\partial \log f}{\partial \theta_{j}}$, gives $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$.<br />
<br />
<br />
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of $x_{i}$ depends on the cluster, being $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$<br />
<br />
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$<br />
*EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
*Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of parameters, or it could even be a non-parametric density estimator.<br />
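<br />
The EM steps listed above can be sketched in a few lines of base R for the special case of two one-dimensional Gaussian components. This is an illustrative sketch under stated assumptions, not the implementation used by mixtools: the function name em.two.gaussians, the stopping rule based on the change in log-likelihood, and the simulated data are all hypothetical choices.<br />

```r
# Minimal EM for a two-component 1-D Gaussian mixture (illustrative sketch)
em.two.gaussians <- function(x, lambda, mu, sigma, maxit = 200, tol = 1e-8) {
  loglik.old <- -Inf
  for (iter in 1:maxit) {
    # E-step: w[i, j] = P(Z = j | X = x_i) under the current parameters
    dens <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
                  lambda[2] * dnorm(x, mu[2], sigma[2]))
    w <- dens / rowSums(dens)
    # Log-likelihood at the current parameters, used as the stopping rule
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - loglik.old) < tol) break
    loglik.old <- loglik
    # M-step: weighted maximum-likelihood updates
    n.j    <- colSums(w)                     # effective size of each component
    lambda <- n.j / length(x)
    mu     <- colSums(w * x) / n.j
    sigma  <- sqrt(colSums(w * outer(x, mu, "-")^2) / n.j)
  }
  list(lambda = lambda, mu = mu, sigma = sigma, loglik = loglik, posterior = w)
}

# Simulated data with known structure (assumption: not the Snoqualmie data)
set.seed(42)
x <- c(rnorm(300, mean = 0, sd = 1), rnorm(700, mean = 5, sd = 1.5))
fit <- em.two.gaussians(x, lambda = c(0.5, 0.5), mu = c(-1, 4), sigma = c(1, 1))
```

For Gaussian components the M-step has a closed form (weighted proportions, weighted means and weighted standard deviations), which is why no numerical optimizer appears in the loop. The returned posterior matrix holds the weights $w_{ij}$, one row per observation.<br />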
<br />
'''3) Computational examples in R:''' Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$\$$lambda[component.number]*<br />
dnorm(x,mean=mixture$\$$mu[component.number],<br />
sd=mixture$\$$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$\$$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component){<br />
lambda[component]*pnorm(x,mean=mixture$\$$mu[component],<br />
sd=mixture$\$$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.<br />
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$\$$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$\$$mu[component], <br />
sd=mixture$\$$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu<-mean(snoq[train]) # MLE of mean<br />
sigma<- sd(snoq[train])*sqrt((n-1)/n) # MLE of standard deviation<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models with different numbers of components, fit to a random half of the data for training and evaluated on the other half for testing.<br />
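<br />
One way to read the held-out log-likelihoods is to find where they peak and how much each extra component adds. The sketch below copies the values printed above into a new vector (the names loglikes.shown, best.k and gains are hypothetical). Because the training/testing split is random, a different run could give different values.<br />

```r
# Held-out log-likelihoods copied from the printed output, for k = 1, ..., 10
loglikes.shown <- c(-17647.93, -16336.12, -15796.02, -15554.33, -15398.04,
                    -15337.47, -15297.61, -15285.60, -15286.75, -15288.88)

# Number of components with the highest held-out log-likelihood
best.k <- which.max(loglikes.shown)

# Change in held-out log-likelihood from adding one more component
gains <- diff(loglikes.shown)
```

For the values shown, the gains shrink steadily and the curve is nearly flat from k=8 through k=10, where the log-likelihoods differ by only about 3 units.<br />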
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
With the nine-component Gaussian mixture.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if the Figure 3 code has already been run<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$\$$mu),ylim=range(snoq.k9$\$$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
<br />
points(x=snoq.k9$\$$mu,y=snoq.k9$\$$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$\$$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the nine-component Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component's mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$\$$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$\$$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$\$$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# automatically produced as a side-effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp(), using the patch sourced from http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R. The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture This article] demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. It illustrates ways to use EM and mixture modeling to obtain cluster classification of points in 2D using the SOCR charts activity and the SOCR modeler, with specific examples.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes; it allows computing probabilities and critical values for these mixture distributions and enables inference on such complicated processes.<br />
<br />
3) [http://www.sciencedirect.com/science/article/pii/S0167947301000469 This article] presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4) [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899 This article] is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
===Software===<br />
[http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf Mixtool Vignettes]<br />
<br />
[http://www.stat.washington.edu/mclust/ mclust]<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R R code for examples in Chapter 20 (see references)]<br />
<br />
===Problems===<br />
<br />
1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if given the actual Gaussian parameters as the initial guess, and how does this change if given other initial parameters?<br />
<br />
4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
===References===<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf Chapter 20. Mixture Models]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14471SMHS MixtureModeling2014-10-20T18:51:40Z<p>Clgalla: /* Theory */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
===Overview=== <br />
A mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that the observed data set identify the sub-population to which each individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of a mixture model, various types of mixture models, the estimation of their parameters, and applications of mixture models in studies. The implementation of mixture modeling in R is also discussed in the attached articles.<br />
<br />
===Motivation===<br />
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the sub-populations. We use mixture models to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. This is not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
===Theory===<br />
'''1) Structure of mixture model:''' a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, where the $\lambda_{k}$ are the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for the $f_{k}$ generate different parametric mixture models, such as Gaussian, Binomial, or Poisson mixtures. The components may all be Gaussian with different parameters, or all Poisson with different means. The model can then be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If $K=n$, the number of observations, we are back to kernel density estimation.<br />
<br />
'''2) Estimating parametric mixture models:''' assume independent samples, so that the joint density (likelihood) of the observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$.<br />
<br />
<br />
We take the logarithm to turn multiplication into addition: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$.<br />
<br />
<br />
Taking the derivative with respect to one parameter, say $\theta_{j}$, and using $\frac{\partial f}{\partial \theta_{j}} = f \frac{\partial \log f}{\partial \theta_{j}}$, gives $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$.<br />
<br />
<br />
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of $x_{i}$ depends on the cluster, being $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$<br />
<br />
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$<br />
*EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
*Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of parameters, or it could even be a non-parametric density estimator.<br />
<br />
'''3) Computational examples in R:''' Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$\$$lambda[component.number]*<br />
dnorm(x,mean=mixture$\$$mu[component.number],<br />
sd=mixture$\$$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$\$$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component){<br />
lambda[component]*pnorm(x,mean=mixture$\$$mu[component],<br />
sd=mixture$\$$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.<br />
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$\$$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$\$$mu[component], <br />
sd=mixture$\$$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu<-mean(snoq[train]) # MLE of mean<br />
sigma<- sd(snoq[train])*sqrt((length(train)-1)/length(train)) # MLE of standard deviation (rescale by the training-set size)<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models of different sizes, fit to a random half of the data for training and evaluated on the other half of the data for testing.<br />
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
With the nine-component Gaussian mixture.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if you've already run the Figure 3 code<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$\$$mu),ylim=range(snoq.k9$\$$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
<br />
points(x=snoq.k9$\$$mu,y=snoq.k9$\$$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$\$$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the nine-component Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$\$$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$\$$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$\$$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# The histograms are produced automatically as a side-effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture This article] demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. It illustrates ways to use EM and mixture modeling to obtain cluster classification of points in 2D using the SOCR charts activity and SOCR modeler with specific examples.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes, allows computing probability and critical values for these mixture distributions, and enables inference on such complicated processes.<br />
<br />
3) [http://www.sciencedirect.com/science/article/pii/S0167947301000469 This article] presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4) [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899 This article] is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
===Software===<br />
[http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf mixtools vignette]<br />
<br />
[http://www.stat.washington.edu/mclust/ mclust]<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R R code for examples in Chapter 20 (see references)]<br />
<br />
===Problems===<br />
<br />
1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if it is given the actual Gaussian parameters as the initial guess, and how does this change if it is given other initial parameters?<br />
<br />
4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
===References===<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf Chapter 20. Mixture Models]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14470SMHS MixtureModeling2014-10-20T18:49:20Z<p>Clgalla: /* Theory */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
===Overview=== <br />
A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set identify the sub-population to which an individual observation belongs. In this section, we present a general introduction to mixture modeling, the structure of a mixture model, various types of mixture models, the estimation of parameters in mixture models, and applications of mixture models in studies. The implementation of mixture modeling in R is also discussed in the attached articles.<br />
<br />
===Motivation===<br />
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the sub-populations. We use mixture models to make statistical inference about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. This is not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
===Theory===<br />
'''1) Structure of mixture model:''' a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, with the $\lambda_{k}$ being the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for $f_{k}$ generate different parametric mixture models, such as Gaussian, Binomial or Poisson mixtures: the components may all be Gaussian with different parameters, or all Poisson with different means. The model can be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If $K=n$, the number of observations, we are back to kernel density estimation. <br />
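
As a hedged illustration of the definition above, the following R sketch evaluates a two-component Gaussian mixture density and draws from it by the two-stage scheme ($Z$ first, then $X|Z$). The weights, means and standard deviations are made-up values chosen only for illustration, and the names dmix, z and x are not taken from the original text.

```r
# Hypothetical two-component Gaussian mixture (illustrative values only)
lambda <- c(0.6, 0.4)   # mixing weights: positive and summing to 1
mu     <- c(0, 5)       # component means
sigma  <- c(1, 2)       # component standard deviations

# Mixture density f(x) = sum_k lambda_k * f_k(x)
dmix <- function(x) {
  lambda[1] * dnorm(x, mu[1], sigma[1]) +
    lambda[2] * dnorm(x, mu[2], sigma[2])
}

# Two-stage sampling: draw Z ~ Mult(lambda), then X | Z ~ f_Z
set.seed(1)
n <- 10000
z <- sample(1:2, size = n, replace = TRUE, prob = lambda)
x <- rnorm(n, mean = mu[z], sd = sigma[z])

# Sanity checks: the density should integrate to about 1 over a wide range,
# and the sample mean should be close to sum(lambda * mu)
total.mass <- integrate(dmix, lower = -20, upper = 30)[["value"]]
sample.mean <- mean(x)
```

Under these assumptions total.mass should be very close to 1 and sample.mean close to the mixture mean sum(lambda * mu); a histogram of x with curve(dmix(x), add=TRUE) overlaid gives a visual check.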
<br />
'''2) Estimating parametric mixture models:''' assume independent samples, so that the joint density (the likelihood) of the observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$. <br />
<br />
<br />
We take the logarithm to turn multiplication into addition: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$, <br />
<br />
<br />
we take the derivative of this with respect to one parameter, say $\theta_{j}$: $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$. <br />
<br />
<br />
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is like doing a weighted likelihood maximization, where the weight of $x_{i}$ depends on the cluster, being $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$ <br />
<br />
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K} \lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$<br />
*EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
*Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of the parameters, or it could even be a non-parametric density estimator.<br />
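
The EM steps listed above can be sketched in a few lines of R for the special case of two one-dimensional Gaussian components. This is a minimal illustration under stated assumptions, not the implementation used by mixtools: the function name em.normal2, the simulated data, the starting values and the stopping rule (change in log-likelihood below tol) are all choices made here for the example. The returned list uses the names lambda, mu, sigma, posterior and loglik only to resemble the fields that the code later in this section reads from a fitted mixture object.

```r
# Simulated data from two Gaussians (illustrative values only)
set.seed(42)
x <- c(rnorm(300, mean = 0, sd = 1), rnorm(200, mean = 6, sd = 1.5))

# Minimal EM sketch for a two-component 1-D Gaussian mixture
em.normal2 <- function(x, lambda, mu, sigma, max.iter = 200, tol = 1e-8) {
  loglik.old <- -Inf
  for (iter in 1:max.iter) {
    # E-step: w[i, j] = P(Z = j | X = x_i) under the current parameters
    joint <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
                   lambda[2] * dnorm(x, mu[2], sigma[2]))
    marginal <- rowSums(joint)
    w <- joint / marginal
    # Log-likelihood at the current parameters, used as the stopping rule
    loglik <- sum(log(marginal))
    if (abs(loglik - loglik.old) < tol) break
    loglik.old <- loglik
    # M-step: weighted maximum-likelihood updates
    n.j    <- colSums(w)                 # effective number of points per component
    lambda <- n.j / length(x)            # new mixing weights
    mu     <- colSums(w * x) / n.j       # weighted means
    sigma  <- sqrt(colSums(w * outer(x, mu, "-")^2) / n.j)  # weighted std. deviations
  }
  list(lambda = lambda, mu = mu, sigma = sigma, posterior = w, loglik = loglik)
}

fit <- em.normal2(x, lambda = c(0.5, 0.5), mu = c(-1, 4), sigma = c(1, 1))
```

With well-separated components like these, the estimated means in fit should land near the simulated values 0 and 6. EM only finds a local maximum of the likelihood, so different starting values can give different answers.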
<br />
'''3) Computational examples in R:''' Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$\$$lambda[component.number] *<br />
dnorm(x,mean=mixture$\$$mu[component.number],<br />
sd=mixture$\$$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$\$$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component){<br />
lambda[component]*pnorm(x,mean=mixture$\$$mu[component],<br />
sd=mixture$\$$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.<br />
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$\$$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$\$$mu[component], <br />
sd=mixture$\$$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu<-mean(snoq[train]) # MLE of mean<br />
sigma<- sd(snoq[train])*sqrt((length(train)-1)/length(train)) # MLE of standard deviation (rescale by the training-set size)<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models of different sizes, fit to a random half of the data for training and evaluated on the other half of the data for testing.<br />
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
With the nine-component Gaussian mixture.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if you've already run the Figure 3 code<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$\$$mu),ylim=range(snoq.k9$\$$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
<br />
points(x=snoq.k9$\$$mu,y=snoq.k9$\$$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$\$$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the nine-component Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$\$$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$\$$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$\$$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# The histograms are produced automatically as a side-effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture This article] demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. It illustrates ways to use EM and mixture modeling to obtain cluster classification of points in 2D using the SOCR charts activity and SOCR modeler with specific examples.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes, allows computing probability and critical values for these mixture distributions, and enables inference on such complicated processes.<br />
<br />
3) [http://www.sciencedirect.com/science/article/pii/S0167947301000469 This article] presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4) [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899 This article] is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
===Software===<br />
[http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf mixtools vignette]<br />
<br />
[http://www.stat.washington.edu/mclust/ mclust]<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R R code for examples in Chapter 20 (see references)]<br />
<br />
===Problems===<br />
<br />
1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if it is given the actual Gaussian parameters as the initial guess, and how does this change if it is given other initial parameters?<br />
<br />
4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
===References===<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf Chapter 20. Mixture Models]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries&diff=14469SMHS TimeSeries2014-10-20T17:53:23Z<p>Clgalla: /* Problems */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Time Series Analysis ==<br />
<br />
===Overview===<br />
Time series data is a sequence of data points measured at successive points in time, spaced at uniform intervals. Time series analysis is commonly used in a variety of studies, such as monitoring industrial processes and tracking business metrics. In this lecture, we present a general introduction to time series data and introduce some of the most commonly used techniques in the rich and rapidly growing field of time series modeling and analysis for extracting meaningful statistics and other characteristics of the data. We also illustrate the application of time series analysis techniques with examples in R. <br />
<br />
===Motivation===<br />
Economic data such as daily share prices, monthly sales and annual income, or physical data such as daily temperature and ECG readings, are all examples of time series data. A time series is just an ordered sequence of values of a variable measured at equally spaced time intervals. So, what are effective ways to measure time series data? How can we extract information from time series data and make inferences afterwards? <br />
<br />
===Theory===<br />
Time series: a sequence of data points measured at successive points in time, spaced at uniform intervals.<br />
*Components: (1) trend component: long-run increase or decrease over time (overall upward or downward movement), typically refers to data taken over a long period of time; the trend can be upward or downward, linear or non-linear. (2) seasonal component: short-term regular wave-like patterns, usually refers to data observed within one year, which may be measured monthly or quarterly. (3) cyclical component: long-term wave-like patterns; regularly occur but may vary in length; often measured peak to peak or trough to trough. (4) irregular component: unpredictable, random, ‘residual’ fluctuation, which may be due to random variations of nature or accidents or unusual events; ‘noise’ in the time series.<br />
*Probabilistic models: The components still add or multiply together across time.<br />
[[Image:SMHS Fig 1 Times Series Analysis.png|400px]]<br />
*Simple Time Series Models: <br />
**Take $T_{t}$ as the trend component, $S_{t}$ as the seasonal component, $C_{t}$ as the cyclical component and $I_{t}$ as the irregular or random component. Then we have an additive model as: $x_{t}=T_{t}+S_{t}+C_{t}+I_{t};$ the multiplicative model says $x_{t}=T_{t}*S_{t}*C_{t}*I_{t};$ sometimes, we take logs on both sides of the multiplicative model to make it additive, $\log x_{t}=\log\left ( T_{t}*S_{t}*C_{t}*I_{t} \right )$, which can be written as $x_{t}{}'=T_{t}{}'+S_{t}{}'+C_{t}{}'+I_{t}{}'$.<br />
**Most time series models are written in terms of an (unobserved) white noise process, which is often assumed to be Gaussian: $x_{t}=w_{t}$, where $w_{t}\sim WN\left ( 0,\sigma ^{2} \right )$, that is, $E\left ( w_{t} \right )=0, Var\left(w_{t} \right )=\sigma ^{2}$. Examples of probabilistic models include: (1) Autoregressive model $x_{t}=0.6x_{t-1}+w_{t};$ (2) Moving average model $x_{t}= \frac{1}{3}\left(w_{t}+w_{t-1}+w_{t-2}\right).$<br />
**Fitting time series models: a time series model generates a process whose pattern can then be matched in some way to the observed data; since perfect matches are impossible, it is possible that more than one model will be appropriate for a set of data; to decide which model is appropriate: patterns suggest choices, assess within sample adequacy (diagnostics, tests), outside sample adequacy (forecast evaluation), simulation from suggested model and compare with observed data; next, turn to some theoretical aspects like how to characterize a time series, and then investigate some special processes. <br />
**Characterizing time series (the mean and covariance of time series): suppose the data are $x_{1},x_{2},\cdots ,x_{t}$. For a regularly spaced time series, $x_{1}$ is observed at time $t_{0}$, $x_{2}$ is observed at $t_{0}+\Delta$, and $x_{t}$ is observed at $t_{0}+\left( t-1\right) \Delta$; the expected value of $x_{t}$ is $\mu _{t}=E\left [ x_{t} \right ]$; the covariance function is $\gamma \left(s,t \right )=cov\left(x_{s},x_{t} \right )$. Note that we do not distinguish between random variables and their realizations. Properly, $x_{1},x_{2},\cdots ,x_{t}$ is a realization of some process $X_{1},X_{2},\cdots ,X_{t}$, and so $\mu _{t}=E\left [ X_{t} \right ]$, etc.<br />
**Characterizing time series data: weak stationarity: usually we have only one realization of the time series, so there are not enough data to estimate all the means and covariances. If the mean is constant and the covariance depends only on the separation in time, then the time series is (weakly) stationary. In this case $\mu_{t}=\mu$ and $\gamma\left(s,t\right)=\gamma\left(t-s\right)$. Stationary processes are much easier to handle statistically. Real time series are usually non-stationary, but can often be transformed to give a stationary process.<br />
**Estimation of $\gamma(h)$ (Estimating the autocovariance function): (1) for a stationary process: $\gamma(h)=\gamma(s,s+h)=cov(x_{s},x_{s+h})$ for any $s$; $\gamma (h)$ is called the autocovariance function because it measures the covariance between lags of the process $x_{t}$; (2) we observe $T-h$ pairs separated at lag $h$, namely $\left(x_{1},x_{h+1}\right),\cdots,\left(x_{T-h},x_{T}\right)$; the sample autocovariance function is $\hat{\gamma}(h)=\frac{1}{T}\sum_{t=1}^{T-h}\left(x_{t}-\bar{X} \right )\left(x_{t+h}-\bar{X} \right )$; note that we divide by $T$ although there are $T-h$ terms in the sum. The autocorrelation function (ACF) is $\rho (h)=\frac{\gamma(h)}{\gamma(0)}$.<br />
**$\rho(h)$ properties: for $x_{t}$ stationary, $\gamma(h)=E\left[(x_{t}-\mu)(x_{t+h}-\mu)\right]$, where $\gamma(h)$ is an even function, that is $\gamma(h)=\gamma(-h)$; also $\left | \gamma(h) \right |\leq \left | \gamma(0) \right |$, that is $\left|\rho(h)\right|\leq1, h=\pm 1,\pm 2,\cdots$. The autocorrelation matrix, $P(h)$, is positive semi-definite, which means that the autocorrelations cannot be chosen arbitrarily but must satisfy certain relations among themselves. <br />
**To study a set of time series data: given that we only have one realization, we need to match plot of data from proposed model with data (we can see broad/main patterns); match acf, pacf of proposed model with data (we can see if data are stationary uncorrelated, stationary autocorrelated or non-stationary); determine if the proposed model is appropriate; obtain forecasts from proposed model; compare across models that seem appropriate. <br />
**Large sample distribution of the $ACF$ for a $WN$ series: if $x_{t}$ is $WN$ then for $n$ large, the sample $ACF$, $\hat{\rho}_{x}(h), h=1,2,\cdots,H$, where $H$ is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation given by $\sigma_{\hat{\rho}_{x}(h)}=\frac{1}{\sqrt{n}}$. A rule of thumb for assessing whether sample autocorrelations resemble those for a $WN$ series is to check whether approximately 95% of the sample autocorrelations are within the interval $0\pm \frac{2}{\sqrt{n}}$. <br />
**White noise, or stationary uncorrelated process: we say that a time series is white noise if it is weakly stationary with $E\left[w_{t} \right ]=0$ and $ACF$ $\rho(h)=\begin{cases} 1 & \text{ if } h=0 \\ 0 & \text{ if } h> 0 \end{cases}$, i.e. white noise consists of a sequence of uncorrelated, zero mean random variables with common variance $\sigma^{2}$.<br />
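<br />
The sample autocovariance and the white noise rule of thumb can be checked numerically. The following is a minimal sketch, not part of the original lecture code: the helper name ''sample_acov'', the seed, and the choices $n=500$ and $H=20$ are assumptions made for illustration. It computes $\hat{\gamma}(h)$ directly from the definition (dividing by the series length rather than by the number of terms), compares the resulting sample ACF with the output of ''acf'', and reports the share of lags whose sample autocorrelation falls inside $0\pm \frac{2}{\sqrt{n}}$.<br />

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 500
x <- rnorm(n)                    # simulated Gaussian white noise

# Sample autocovariance at lag h, dividing by n (not n - h)
sample_acov <- function(x, h) {
  n <- length(x)
  xbar <- mean(x)
  sum((x[1:(n - h)] - xbar) * (x[(1 + h):n] - xbar)) / n
}

H <- 20
gamma_hat <- sapply(0:H, function(h) sample_acov(x, h))
rho_hat <- gamma_hat / gamma_hat[1]   # sample ACF: gamma(h) / gamma(0)

# acf() uses the same divisor, so the two versions should agree
rho_acf <- as.numeric(acf(x, lag.max = H, plot = FALSE)[["acf"]])
max_diff <- max(abs(rho_hat - rho_acf))

# Share of lags 1..H with sample autocorrelation inside 0 +/- 2/sqrt(n)
inside <- mean(abs(rho_hat[-1]) <= 2 / sqrt(n))
```

For Gaussian white noise, ''inside'' should be close to 0.95, although with only 20 lags it will vary from run to run.<br />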
<br />
*Example of a simple moving average model in R: $v_{t}=\frac{1}{3} (w_{t-1}+w_{t}+w_{t+1})$. <br />
<br />
R CODE: <br />
w <- ts(rnorm(150)) ## generate 150 data from the standard normal distribution<br />
v <- ts(filter(w, sides=2, rep(1/3, 3))) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot(w,main='white noise')<br />
plot(v,main='moving average')<br />
<br />
or can use code:<br />
<br />
w <- rnorm(150) ## generate 150 data from the standard normal distribution<br />
v <- filter(w, sides=2, rep(1/3, 3)) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot.ts(w,main='white noise')<br />
plot.ts(v,main='moving average')<br />
<br />
[[Image:SMHS Fig2 Timeseries Analysis.png|500px]]<br />
<br />
## sums based on WN processes<br />
'''ts.plot(w,v,lty=2:1,col=1:2,lwd=1:2)'''<br />
<br />
[[Image:SMHS Fig3 TimeSeries Analysis.png|500px]]<br />
<br />
The ACF and the sample ACF of some stationary processes<br />
*White noise $x(t)=w(t)$ and the moving average $x(t)=w(t)+\frac{1}{3} \left(w(t-1)+w(t-2)+w(t-3) \right)$: plot bar plots of the theoretical ACF of each process:<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(0,lag.max=10),main='white noise=w(t)')<br />
barplot(ARMAacf(ma=c(1/3,1/3,1/3),lag.max=10),main='acfx(t)=w(t)+1/3w(t-1)+1/3w(t-2)+1/3w(t-3)')<br />
<br />
[[Image:SMHS_Fig4_TimeSeries_Analysis.png|500px]]<br />
<br />
*Theoretical acf of some processes: the autoregressive processes $x_{t}=0.9x_{t-1}+w_{t}$ and $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}.$<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(c(0.9),lag.max=30),main='acf x(t)=0.9x(t-1)+w(t)')<br />
barplot(ARMAacf(c(1,-0.9),lag.max=30),main='acf x(t)=x(t-1)-0.9x(t-2)+w(t)')<br />
<br />
<br />
[[Image:SMHA_Fig5_TimeSeries_Analysis.png|500px]]<br />
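<br />
The theoretical ACFs plotted above can be checked against closed forms. This is a small sketch under standard textbook results rather than part of the original lecture: for an AR(1) process with coefficient $\phi$ the ACF is $\rho(h)=\phi^{h}$, and for the moving average with three coefficients equal to $\frac{1}{3}$ the lag-1 autocorrelation is $\frac{5/9}{4/3}=\frac{5}{12}$, with zero autocorrelation beyond lag 3. The variable names are illustrative.<br />

```r
phi <- 0.9
h <- 0:10

# AR(1): ARMAacf should reproduce the closed form rho(h) = phi^h
acf_ar1 <- as.numeric(ARMAacf(ar = phi, lag.max = 10))
closed_form <- phi^h
ar1_diff <- max(abs(acf_ar1 - closed_form))

# MA(3) with all coefficients 1/3:
# gamma(0) = 1 + 3 * (1/9) = 4/3 and gamma(1) = 1/3 + 1/9 + 1/9 = 5/9
theta <- rep(1/3, 3)
acf_ma3 <- as.numeric(ARMAacf(ma = theta, lag.max = 10))
rho1_ma3 <- acf_ma3[2]            # lag-1 autocorrelation
beyond_q <- acf_ma3[5:11]         # lags 4..10
```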
<br />
*Recognizing patterns in probabilistic data <br />
**Example 1: compare white noise $x_{t}=w_{t}$ and the autoregressive process $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$<br />
<br />
RCODE:<br />
w1 <- rnorm(200) ## generate 200 data from the standard normal distribution<br />
x <- filter(w1, filter=c(1,-0.9),method='recursive')[-(1:50)]<br />
w2 <- w1[-(1:50)]<br />
xt <- ts(x)<br />
par(mfrow=c(2,2))<br />
 plot.ts(w2,main='white noise')<br />
acf(w2,lag.max=25)<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig6_TimeSeries_Analysis.png|500px]]<br />
<br />
*There is almost no autocorrelation in the white noise series, while the autoregressive series shows large autocorrelation.<br />
<br />
**Example 2: compare the autoregressive process $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$ and the moving average $v_{t}=\frac{1}{3}\left(w_{t-1}+w_{t}+w_{t+1} \right)$<br />
<br />
RCODE:<br />
v <- filter(w1, sides=2, rep(1/3, 3))<br />
vt <- ts(v)[2:199]<br />
par(mfrow=c(2,2))<br />
 plot.ts(xt,main='autoregression')<br />
 acf(xt,lag.max=25)<br />
 plot.ts(vt,main='moving average')<br />
acf(vt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig7_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
*ACF, PACF and some examples of data in R<br />
<br />
RCODE:<br />
 data<-<br />
 c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
 25,26,26,25,-44,23,21,30,33,29,27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
 37,23,32,29,-2,24,25,27,24,16,29,20,28,27,39,23)<br />
par(mfrow=c(1,2))<br />
plot(data)<br />
qqnorm(data)<br />
qqline(data)<br />
<br />
[[Image:SMHS_Fig8_TimeSeries_Analysis.png|500px]]<br />
<br />
data1<-data[c(-31,-55)]<br />
par(mfrow=c(1,2))<br />
plot.ts(data1)<br />
qqnorm(data1)<br />
qqline(data1)<br />
<br />
[[Image:SMHS_Fig9_TimeSeries_Analysis.png|500px]]<br />
<br />
library(astsa)<br />
acf2(data1)<br />
<br />
*Examining the data: <br />
Diagnostics of outliers<br />
<br />
RCODE:<br />
dat<- <br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30, 33,29, 27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27, 39,23)<br />
plot.ts(dat)<br />
<br />
<br />
[[Image:SMHS_Fig10_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
Then we can easily tell that there are outliers in this data series.<br />
<br />
# what if we use a boxplot here?<br />
boxplot(dat)<br />
<br />
[[Image:SMHS_Fig11_TimeSeries_Analysis.png|500px]]<br />
<br />
identify(rep(1,length(dat)),dat) ## identify outliers by clicking on screen<br />
 ## the outliers are observations 31 and 55:<br />
dat[31]<br />
[1] -44<br />
dat[55]<br />
[1] -2<br />
datr <- dat[c(-31,-55)] ## this is the dat with outlier removed<br />
qqnorm(datr)<br />
qqline(datr)<br />
<br />
<br />
[[Image:SMHS_Fig12_TimeSeries.png|500px]]<br />
<br />
<br />
 ## this plot shows that the distribution of the new data series (right) is closer to normal compared to the original data (left).<br />
 ## The effect of outliers can be seen by comparing the ACF and PACF of the two data series:<br />
library(astsa)<br />
acf2(dat)<br />
acf2(datr)<br />
[[Image:SMHS_Fig13_TimeSeries.png|800px]]<br />
<br />
*Diagnostics of the error: consider the model $x_{t}=\mu+w_{t}$; check whether the mean of the probability distribution of the error is 0; whether the variance of the error is constant across time; whether the errors are independent and uncorrelated; and whether the probability distribution of the error is normal. <br />
:::Residual plot for functional form<br />
<br />
[[Image:SMHS_Fig14_TimeSeries.png|500px]]<br />
<br />
:::Residual plot for equal variance<br />
<br />
[[Image:SMHS_Fig15_TimeSeries.png|500px]]<br />
<br />
:::Check if residuals are correlated by plotting $e_{t}$ vs. $e_{t-1}$: there are three possible cases – positive autocorrelation, negative autocorrelation and no autocorrelation. <br />
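<br />
The three cases can be illustrated with simulated residuals. The following is a sketch only: the series are generated with ''arima.sim'' using illustrative AR(1) coefficients of 0.8 and -0.8, and the helper name ''lag1_cor'' is hypothetical. It plots each series against its own lag and computes the lag-1 correlation.<br />

```r
set.seed(6)                                   # arbitrary seed
n <- 200
e_none <- rnorm(n)                                         # no autocorrelation
e_pos <- as.numeric(arima.sim(list(ar = 0.8), n = n))      # positive autocorrelation
e_neg <- as.numeric(arima.sim(list(ar = -0.8), n = n))     # negative autocorrelation

# Correlation between e[t] and e[t-1]
lag1_cor <- function(e) cor(e[-1], e[-length(e)])

par(mfrow = c(1, 3))
plot(e_none[-n], e_none[-1], xlab = "e(t-1)", ylab = "e(t)", main = "none")
plot(e_pos[-n], e_pos[-1], xlab = "e(t-1)", ylab = "e(t)", main = "positive")
plot(e_neg[-n], e_neg[-1], xlab = "e(t-1)", ylab = "e(t)", main = "negative")

cors <- c(none = lag1_cor(e_none), positive = lag1_cor(e_pos), negative = lag1_cor(e_neg))
```

An upward-sloping cloud indicates positive autocorrelation, a downward-sloping cloud indicates negative autocorrelation, and a shapeless cloud indicates no autocorrelation.<br />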
<br />
*Smoothing: a series consists of some structure plus random variation, so we identify the structure by averaging out the noise. <br />
::Moving averages:<br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma9<-filter(datr,sides=2,rep(1/9,9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma9,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma9<br />
<br />
[[Image:SMHS_Fig16_TimeSeries.png|500px]]<br />
<br />
*Apparently, the wider the moving average window, the smoother the time series plot of the data series. <br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma31<br />
<br />
[[Image:SMHS_Fig17_TimeSeries.png|500px]]<br />
<br />
Two sided and one sided filters with same coefficients seem to be similar in their smoothing effect. <br />
<br />
## the effect of equal vs. unequal weights? <br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
mawt<-filter(datr,sides=1,c(1/9,1/9,7/9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma31<br />
par(new=T)<br />
plot.ts(mawt,ylim=c(15,50),col=3,ylab='data') ## green line indicates mawt<br />
<br />
[[Image:SMHS_Fig18_TimeSeries.png|500px]]<br />
<br />
The smoothing effect of the filter with equal weights seems to be greater than that of the filter with unequal weights.<br />
<br />
*Simple Exponential Smoothing (SES): premise – the most recent observations might have the highest predictive value: $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$, that is, new forecast = $\alpha \times$ actual value + $(1-\alpha) \times$ previous forecast. This method is appropriate for series with no trend or seasonality.<br />
<br />
*For series with trend, we can use the Holt-Winters exponential smoothing additive model $\hat{x}_{t+h}=a_{t}+h*b_{t}$, where $a_{t}$ is the level and $b_{t}$ is the trend. In general, Holt-Winters smooths the level (or permanent component), the trend component and the seasonal component.<br />
<br />
RCODE:<br />
HoltWinters(series,beta=FALSE,gamma=FALSE) #simple exponential smoothing<br />
HoltWinters(series,gamma=FALSE) # HoltWinters trend<br />
HoltWinters(series,seasonal= 'additive') # HoltWinters additive trend+seasonal<br />
HoltWinters(series,seasonal= 'multiplicative') # HoltWinters multiplicative trend+seasonal<br />
<br />
<br />
<br />
Example: RCODE: (effect of different values of alpha on datr SES)<br />
par(mfrow=c(3,1))<br />
exp05=HoltWinters(datr,alpha=0.05,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p05=predict(exp05,20) <br />
plot(exp05,p05, main = 'Holt- Winters filtering, alpha=0.05')<br />
exp50=HoltWinters(datr, alpha=0.5,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p50=predict(exp50,20)<br />
plot(exp50,p50, main = 'Holt-Winters filtering, alpha=0.5')<br />
exp95=HoltWinters(datr, alpha=0.95,beta=FALSE,gamma= FALSE) #simple exponential smoothing <br />
p95=predict(exp95,20)<br />
plot(exp95,p95,main = "Holt-Winters filtering, alpha=0.95")<br />
<br />
<br />
[[Image:SMHS_Fig19_TimeSeries.png|500px]]<br />
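<br />
The SES recursion $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$ can also be written out by hand and compared with ''HoltWinters''. This is a minimal sketch on simulated data: the function name ''ses_forecasts'', the seed and the series are assumptions for illustration, and the recursion is started from the first observation, which is assumed here to match the starting level ''HoltWinters'' uses when the trend and seasonal parts are switched off.<br />

```r
# One-step-ahead SES forecasts by direct recursion:
#   xhat[t] = alpha * x[t-1] + (1 - alpha) * xhat[t-1]
ses_forecasts <- function(x, alpha) {
  n <- length(x)
  xhat <- rep(NA_real_, n)   # no forecast exists for the first observation
  xhat[2] <- x[1]            # assumed starting level: the first observation
  if (n > 2) {
    for (t in 3:n) {
      xhat[t] <- alpha * x[t - 1] + (1 - alpha) * xhat[t - 1]
    }
  }
  xhat
}

set.seed(2)                       # arbitrary seed
x <- 25 + rnorm(60, sd = 4)       # simulated series with no trend or seasonality
alpha <- 0.5
by_hand <- ses_forecasts(x, alpha)[-1]        # forecasts for t = 2..n

hw <- HoltWinters(x, alpha = alpha, beta = FALSE, gamma = FALSE)
by_hw <- as.numeric(hw[["fitted"]][, 1])      # fitted one-step-ahead forecasts
max_diff <- max(abs(by_hand - by_hw))
```

A small ''max_diff'' indicates that the hand-written recursion and the fitted values from ''HoltWinters'' agree; a larger $\alpha$ puts more weight on the most recent observation.<br />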
<br />
*Comparing two models: we can compare ‘fit’ and forecast ability. <br />
<br />
RCODE:<br />
obs <- exp50$\$$x[-1]<br />
fit <- exp50$\$$fitted[,1]<br />
r1 <- obs-fit<br />
plot(r1)<br />
par(mfrow=c(1,2))<br />
plot(r1)<br />
acf(r1)<br />
<br />
[[Image:SMHS_Fig20_TimeSeries.png|500px]]<br />
<br />
*Regression based (nonparametric) smoothing: (1) kernel: extension of a two-sided MA. Use a bandwidth (number of terms) plus a kernel function to smooth a set of values, ''ksmooth''; (2) local linear regression (loess): fit local polynomial regressions and join them together, ''loess''; (3) fitting cubic splines: extension of polynomial regression, partition data and fit separate piecewise regression to each section, smooth them together where they join, ''smooth.spline''; (4) many more, ''lowess'', ''supsmu''.<br />
::•Kernel smoother: define bandwidth; choose a kernel; use robust regression to get smooth estimate; bigger the bandwidth, smoother the result<br />
::•Loess: define the window width; choose a weight function; use polynomial regression and weighted least squares to get smoothest estimate; the bigger span, the smoother the result.<br />
::•Splines: divide data into intervals; in each interval, fit a polynomial regression of order p such that the splines are continuous at knots; use a parameter (spar in R) to smooth the splines; the bigger the spar, the smoother the result.<br />
::•Issues with regression based (nonparametric) smoothing: (1) the size of the span (window) $S$ has an important effect on the curve; (2) a span that is too small (meaning that insufficient data fall within the window) produces a curve characterized by a lot of noise, which results in a large variance; (3) if the span is too large, the regression will be oversmoothed, and thus the local polynomial may not fit the data well. This may result in loss of important information, and thus the fit will have large bias; (4) we want the smallest span that provides a smooth structure.<br />
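<br />
The three smoothers named above can be compared on one series. The following is a sketch on simulated data, not the lecture's own example: the series, the seed, and the tuning values (''bandwidth = 10'', ''span = 0.3'', ''spar = 0.8'') are illustrative assumptions and would need to be tuned for real data.<br />

```r
set.seed(3)                                   # arbitrary seed
tt <- 1:100
x <- 30 + 5 * sin(2 * pi * tt / 50) + rnorm(100, sd = 2)   # structure plus noise

ks <- ksmooth(tt, x, kernel = "normal", bandwidth = 10)    # kernel smoother
lo <- loess(x ~ tt, span = 0.3)                            # local polynomial regression
ss <- smooth.spline(tt, x, spar = 0.8)                     # cubic smoothing spline

plot(tt, x, type = "l", col = "grey", xlab = "time", ylab = "data",
     main = "Nonparametric smoothers")
lines(ks, col = 2)                  # red: kernel
lines(tt, fitted(lo), col = 3)      # green: loess
lines(ss, col = 4)                  # blue: spline

# Roughness (sum of squared second differences) of the raw and smoothed series
rough <- function(v) sum(diff(v, differences = 2)^2)
rough_raw <- rough(x)
rough_loess <- rough(fitted(lo))
```

Increasing ''bandwidth'', ''span'' or ''spar'' gives a smoother curve, at the risk of oversmoothing as described in the list of issues.<br />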
<br />
*A regression model for trend: given a time series $y_{t}, t=1,2,\cdots,T$, fit a regression model for trend. If a plot of the data suggests a linear trend, we fit a trend line, say $y_{t}=\beta_{1}+\beta_{2} t + \varepsilon_{t}$. <br />
::•Assumptions: variance of error is constant across observations; errors are independent / uncorrelated across observations; the regressor variables and error are independent; we may also assume probability distribution of error is normal.<br />
<br />
RCODE<br />
fit <- lm(x~z)<br />
 summary(fit)$\$$coeff # coefficient table of the fitted regression model<br />
plot.ts(fit$\$$resid) # plot the residuals of the fitted regression model<br />
acf2(fit$\$$resid) # the ACF and PACF of the residuals of the fitted regression model<br />
qqnorm(fit$\$$resid)<br />
qqline(fit$\$$resid)<br />
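<br />
The variables in the code above are placeholders, so here is a self-contained sketch of the same trend regression on simulated data. The coefficients $\beta_{1}=10$ and $\beta_{2}=0.5$, the noise level and the object names are assumptions chosen for illustration; only base R functions are used.<br />

```r
set.seed(4)                                   # arbitrary seed
n <- 120
tt <- 1:n
beta1 <- 10                                   # illustrative intercept
beta2 <- 0.5                                  # illustrative slope
yt <- beta1 + beta2 * tt + rnorm(n, sd = 3)   # linear trend plus white noise

fit_trend <- lm(yt ~ tt)
est <- coef(fit_trend)            # estimates of beta1 and beta2
res <- residuals(fit_trend)

# Residual diagnostics: time plot, autocorrelation and normality
par(mfrow = c(1, 3))
plot.ts(res, main = "residuals")
acf(res, main = "ACF of residuals")
qqnorm(res)
qqline(res)
```

If the trend model is adequate, the residuals should show no remaining pattern over time, no significant autocorrelation, and an approximately straight normal Q-Q plot.<br />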
<br />
<br />
*Analyzing patterns in data and modeling: seasonality<br />
**Exploratory analysis<br />
<br />
RCODE:<br />
data=c( 118,132,129,121,135,148,148,136,119,104,118,115,126,141,135,125,<br />
149,170,170,158,133,114,140,145,150,178,163,172,178,199,199,184,162,146,<br />
166,171,180,193,181,183,218,230,242,209,191,172,194,196,196,236,235,<br />
229,243,264,272,237,211,180,201,204,188,235,227,234,264,302,293,259,229,<br />
203,229,242,233,267,269,270,315,364,347,312,274,237,278,284,277,317,313,<br />
318,374,413,405,255,306,271,306,300)<br />
air <- ts(data,start=c(1949,1),frequency=12)<br />
y <- log(air)<br />
par(mfrow=c(1,2))<br />
plot(air)<br />
plot(y)<br />
<br />
<br />
[[Image:SMHS_Fig21_TimeSeries.png|500px]]<br />
<br />
<br />
## from the charts above, taking log helps a little bit with the increasing variability. <br />
<br />
monthplot(y)<br />
<br />
<br />
[[Image:SMHS_Fig22_TimeSeries.png|500px]]<br />
<br />
<br />
acf2(y)<br />
<br />
[[Image:SMHS_Fig23_TimeSeries.png|500px]]<br />
<br />
*Smoothing: the Holt-Winters method smooths and obtains three parts of a series: level, trend and seasonal. The method requires smoothing constants: either your own choice or those which minimize the one-step-ahead forecast error $\sum_{t=1}^{T} \left(\hat{x}_{t} - x_{t} \right )^{2}$. At a given time point, forecast value = (level+h*trend)+seasonal index (additive); forecast value = (level+h*trend)*seasonal index (multiplicative). Additive: $\hat{x}_{t+h}=a_{t}+h*b_{t}+s_{t+1+(h-1)\mod{p}}$; Multiplicative: $\hat{x}_{t+h}=(a_{t}+h*b_{t})* s_{t+1+(h-1)\mod{p}}.$<br />
<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
sa <- HoltWinters(air,seasonal=c('additive'))<br />
psa <- predict(sa,60)<br />
plot(sa,psa,main='Holt-Winters filtering, airline passengers, HW additive')<br />
sm <- HoltWinters(air,seasonal=c('multiplicative'))<br />
psam <- predict(sm,60)<br />
plot(sm,psam,main='Holt-Winters filtering, airline passengers, HW multiplicative')<br />
<br />
[[Image:SMHS_Fig24_TimeSeries.png|500px]]<br />
<br />
## seasonal decomposition using loess, log(airpassengers)<br />
<br />
[[Image:SMHS_Fig25_TimeSeries.png|500px]]<br />
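<br />
No code accompanies the decomposition figure, so the following is a hedged sketch of how such a plot can be produced with ''stl''. It uses R's built-in ''AirPassengers'' series as a stand-in for the ''air'' series defined earlier, and the choice ''s.window = "periodic"'' is an assumption, not necessarily the setting used for the figure.<br />

```r
y_log <- log(AirPassengers)                 # built-in monthly series, 1949-1960
dec <- stl(y_log, s.window = "periodic")    # seasonal decomposition using loess
plot(dec)                                   # panels: data, seasonal, trend, remainder

comp <- dec[["time.series"]]                # columns: seasonal, trend, remainder
recon <- rowSums(comp)                      # components add back to the log series
recon_err <- max(abs(recon - as.numeric(y_log)))
```

Because the decomposition is additive on the log scale, it corresponds to a multiplicative decomposition of the original passenger counts.<br />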
<br />
*Regression: estimation, diagnostics, interpretation. Defining dummy variables to capture the seasonal effect in regression. If there is a variable with k categories, there will be k-1 dummies if the intercept is included in the regression (exclude one group); there will be k dummies if the intercept is not included in the regression; if the intercept is included in the regression, the coefficient of a dummy variable is always the difference of means between that group and the excluded group. <br />
<br />
RCODE:<br />
t <- time(y)<br />
Q <- factor(rep(1:12,8))<br />
h <- lm(y~time(y)+Q) # regression with intercept, drops first category = January<br />
model.matrix(h)<br />
h1 <- lm(y~0+time(y)+Q) # regression without intercept<br />
model.matrix(h1)<br />
x1 <- ts(y[1:72],start=c(1949,1),frequency=12)<br />
t1 <- t[1:72]<br />
Q1 <- Q[1:72]<br />
contrasts(Q1)=contr.treatment(12,base=10)<br />
h2 <- lm(x1~t1+Q1) # regression with intercept, drops 10th category = October<br />
model.matrix(h2)<br />
reg <- lm(x1~t1+Q1)<br />
par(mfrow=c(2,2))<br />
plot.ts(reg$\$$resid)<br />
qqnorm(reg$\$$resid)<br />
qqline(reg$\$$resid)<br />
acf(reg$\$$resid)<br />
pacf(reg$\$$resid)<br />
<br />
[[Image:SMHS_Fig26_TimeSeries.png|500px]]<br />
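<br />
The dummy variable rules stated before the regression code can be verified in the simplest case, a regression on a single categorical variable with no other regressors. This is a sketch on simulated data; the group labels, means and object names are illustrative assumptions.<br />

```r
set.seed(5)                                            # arbitrary seed
g <- factor(rep(c("A", "B", "C"), each = 20))          # k = 3 categories
yv <- c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 15))   # simulated responses

m_int <- lm(yv ~ g)         # intercept plus k - 1 = 2 dummies (A is excluded)
m_noint <- lm(yv ~ 0 + g)   # no intercept: k = 3 dummies

means <- tapply(yv, g, mean)
diff_B <- unname(means["B"] - means["A"])   # difference of group means
coef_B <- coef(m_int)[["gB"]]               # coefficient of the dummy for B
```

Here ''coef_B'' equals ''diff_B'', and the coefficients of ''m_noint'' are the three group means. When other regressors such as time are included, as in the airline regression, the dummy coefficients are differences adjusted for those regressors.<br />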
<br />
## forecasts RCODE<br />
newdata=data.frame(t1=t[73:96], Q1=Q[73:96]) <br />
preg=predict(reg,newdata,se.fit=TRUE) <br />
seup=preg$\$$fit+2*preg$\$$se.fit<br />
seup <br />
1 2 3 4 5 6 7 8 9 <br />
5.577144 5.722867 5.679936 5.667979 5.780376 5.881420 5.889829 5.782526 5.655445 <br />
10 11 12 13 14 15 16 17 18 <br />
5.525933 5.661158 5.681042 5.716577 5.862300 5.819369 5.807412 5.919809 6.020853 <br />
19 20 21 22 23 24 <br />
6.029262 5.921959 5.794878 5.665366 5.800591 5.820475 <br />
selow=preg$\$$fit-2*preg$\$$se.fit<br />
selow<br />
1 2 3 4 5 6 7 8 9 <br />
5.479442 5.625165 5.582234 5.570277 5.682674 5.783718 5.792127 5.684824 5.557743 <br />
10 11 12 13 14 15 16 17 18 <br />
5.428232 5.563456 5.583340 5.610927 5.756650 5.713720 5.701763 5.814160 5.915203 <br />
19 20 21 22 23 24 <br />
5.923613 5.816309 5.689228 5.559717 5.694941 5.714825<br />
<br />
x1r=y[73:96]<br />
t1r=t[73:96]<br />
plot(t1r,x1r, col='red',lwd=1:2,main='Actual vs Forecast', type='l')<br />
lines(t1r,preg$\$$fit, col='green', lwd=1:2)<br />
lines(t1r,seup, col='blue', lwd=1:2)<br />
lines(t1r,selow, col='blue', lwd=1:2)<br />
 ## red refers to actual; green refers to forecast; blue refers to forecast +/- 2 standard errors<br />
<br />
[[Image:SMHS_Fig27_TimeSeries.png|500px]]<br />
<br />
## Holt Winters smoothing: forecasts<br />
sa=HoltWinters(x1, seasonal = c("additive"))<br />
psa=predict(sa,24)<br />
plot(t1r,x1r, col="red",type="l",main = "Holt- Winters filtering, airline passengers, HW additive")<br />
 lines(t1r, psa, col="green")<br />
<br />
## red represents the actual, green: forecast<br />
<br />
[[Image:SMHS_Fig28_TimeSeries.png|500px]]<br />
<br />
 sqrt(sum((x1r-psa)^2)) ##HW<br />
[1] 0.3988259<br />
sqrt(sum((x1r-preg$\$$fit)^2)) ##reg<br />
[1] 0.3807665<br />
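<br />
The two values above are square roots of the summed squared forecast errors over the hold-out period, and the smaller value indicates the better forecast. Dividing by the number of forecasts before taking the square root gives the root mean squared error, which ranks models the same way when they are compared over the same period. The helper names ''root_sse'' and ''rmse'' and the toy numbers below are hypothetical, shown only to make the two measures concrete.<br />

```r
# Square root of the sum of squared forecast errors (the measure used above)
root_sse <- function(actual, forecast) sqrt(sum((actual - forecast)^2))

# Root mean squared error: the same quantity scaled by the number of forecasts
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))

actual <- c(1, 2, 3, 4)       # toy hold-out values
forecast <- c(1, 2, 3, 6)     # toy forecasts: a single error of size 2

err_sse <- root_sse(actual, forecast)    # sqrt(4) = 2
err_rmse <- rmse(actual, forecast)       # sqrt(4 / 4) = 1
```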
<br />
==Applications==<br />
<br />
1)[http://www.statsoft.com/Textbook/Time-Series-Analysis This article] presents a comprehensive introduction to the field of time series analysis. It discusses the common patterns found in time series data and introduces some commonly used techniques for handling and analyzing time series. <br />
<br />
2)[http://www.sciencedirect.com/science/article/pii/0304393282900125 This article] investigates whether macroeconomic time series are better characterized as stationary fluctuations around a deterministic trend or as non-stationary processes that have no tendency to return to a deterministic path. Using long historical time series for the U.S. we are unable to reject the hypothesis that these series are non-stationary stochastic processes with no tendency to return to a trend line. Based on these findings and an unobserved components model for output that decomposes fluctuations into a secular or growth component and a cyclical component we infer that shocks to the former, which we associate with real disturbances, contribute substantially to the variation in observed output. We conclude that macroeconomic models that focus on monetary disturbances as a source of purely transitory fluctuations may never be successful in explaining a large fraction of output variation and that stochastic variation due to real factors is an essential element of any model of macroeconomic fluctuations.<br />
<br />
<br />
==Software== <br />
See the CODE examples given in this lecture.<br />
<br />
==Problems==<br />
<br />
1) Consider a signal-plus-noise model of the general form $x_{t}=s_{t}+w_{t}$, where $w_{t}$ is Gaussian white noise with $\sigma ^{2} _{w} = 1$. Simulate and plot $n=200$ observations from each of the following two models.<br />
<br />
(a) $x_{t} = s_{t}+w_{t},$ for $t=1,2,\cdots,200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\,\exp \left \{ -(t-100)/20 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(b) $x_{t}=s_{t}+w_{t},$ for $t=1,2,\cdots, 200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\,\exp \left \{ -(t-100)/200 \right \} \cos\left ( 2\pi t/4\right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(c) Compare the signal modulators (a) $\exp \left \{-t /20 \right \}$ and (b) $\exp \left \{-t/200 \right \},$ for $t=1,2, \cdots,100.$<br />
<br />
<br />
2) (a) Generate $n=100$ observations from the autoregression $x_{t}=-0.9 x_{t-2} +w_{t},$ with $\sigma_{w}=1,$ using the method described in Example 1.10, page 13 of the textbook. Next, apply the moving average filter $v_{t}=\left(x_{t}+x_{t-1}+x_{t-2}+x_{t-3} \right )/4$ to $x_{t},$ the data you generated. Now, plot $x_{t}$ as a line and superimpose $v_{t}$ as a dashed line. Comment on the behavior of $x_{t}$ and how applying the moving average filter changes that behavior. [Hint: use ''v=filter(x,rep(1/4,4),sides=1)'' for the filter].<br />
<br />
(b) Repeat (a) but with $x_{t}=\cos \left( 2 \pi t/4 \right).$<br />
<br />
(c) Repeat (b) but with added $N(0,1)$ noise, $x_{t}=\cos \left(2 \pi t/4 \right) +w_{t}.$<br />
<br />
(d) Compare and contrast (a)–(c).<br />
<br />
3) For the two series, $x_{t}$, in Problem 1 (a) and (b):<br />
<br />
(a) Compute and plot the mean functions $\mu _{x} (t)$ for $t=1,2,\cdots,200.$<br />
<br />
(b) Calculate the autocovariance functions, $\gamma _{x} (s,t),$ for $s,t=1,2, \cdots, 200.$<br />
<br />
4) Consider the time series $x_{t} = \beta_{1} + \beta_{2} t + w_{t},$ where $\beta _{1}$ and $\beta_{2}$ are known constants and $w_{t}$ is a white noise process with variance $\sigma ^{2} _{w}.$<br />
<br />
(a) Determine whether $x_{t}$ is stationary.<br />
<br />
(b) Show that the process $y_{t}=x_{t} - x_{t-1}$ is stationary.<br />
<br />
(c) Show that the mean of the moving average $v_{t}= \frac {1}{2q+1} \sum_{j=-q}^{q} x_{t-j},$ is $\beta _{1} + \beta _{2} t,$ and give a simplified expression for the autocovariance function. <br />
<br />
<br />
5) A time series with a periodic component can be constructed from $x_{t} = U_{1} \sin(2 \pi w_{0} t) + U_{2} \cos (2 \pi w_{0} t),$ where $U_{1}$ and $U_{2}$ are independent random variables with zero means and $E(U^{2}_{1} ) = E( U^{2}_{2} ) = \sigma ^{2}.$ The constant $w_{0}$ determines the period or time it takes the process to make one complete cycle. Show that this series is weakly stationary with autocovariance function $\gamma (h) = \sigma ^{2} \cos(2 \pi w_{0} h ).$ [You will need to refer to a standard trig identity.]<br />
<br />
<br />
6) Suppose we would like to predict a single stationary series $x_{t}$ with zero mean and autocovariance function $\gamma (h)$ at some time in the future, say $t+l,$ for $l>0.$<br />
<br />
(a) If we predict using only $x_{t}$ and some scale multiplier $A$, show that the mean-square prediction error $MSE(A)=E \left[ (x_{t+l}-A x_{t}) ^{2} \right]$ is minimized by the value $A = \rho (l).$<br />
<br />
(b) Show that the minimum mean-square prediction error is $MSE(A)= \gamma (0) \left[ 1- \rho ^{2} (l) \right].$<br />
<br />
(c) Show that if $x_{t+l} = Ax_{t}$, then $\rho (l) =1$ if $A >0,$ and $\rho (l) = -1$ if $A<0.$<br />
<br />
7) Let $w_{t},$ for $t=0, \pm 1, \pm2, \cdots$ be a normal white noise process, and consider the series, $x_{t}=w_{t} w_{t-1}.$ Determine the mean and autocovariance function of $x_{t},$ and state whether it is stationary. <br />
<br />
8) Suppose $x_{1}, x_{2}, \cdots, x_{n}$ is a sample from the process $x_{t} = \mu + w_{t} - 0.8 w_{t-1},$ where $w_{t} \sim wn(0, \sigma ^{2} _{w} ).$<br />
<br />
(a) Show that mean function is $E(x_{t} )= \mu.$<br />
<br />
(b) Calculate the standard error of $\bar{x}$ for estimating $\mu.$<br />
<br />
<br />
(c) Compare (b) to the case where $x_{t}$ is white noise and show that (b) is smaller.<br />
Explain the result.<br />
<br />
<br />
For the following problems, the aim is for you to:<br />
(i) examine and appreciate the patterns in the data<br />
<br />
(ii) examine and appreciate the patterns in the sample ACF and<br />
<br />
(iii) examine and appreciate how the sample ACF may differ from the theoretical ACF.<br />
<br />
9) Although the model in problem 1 (a) is not stationary (why?), the sample ACF can be informative. For the data you generated in that problem, calculate and plot the sample ACF, and then comment.<br />
<br />
10) (a) Simulate a series of $n=500$ Gaussian white noise observations and compute the sample ACF, $\hat{ \rho},$ to lag $20$. Compare the sample ACF you obtain to the actual ACF, $\rho (h).$<br />
<br />
(b) Repeat part (a) using only $n=50$. How does changing $n$ affect the results?<br />
<br />
===References===<br />
http://link.springer.com/book/10.1007/978-1-4419-7865-3/page/1<br />
<br />
http://mirlyn.lib.umich.edu/Record/004199238 <br />
<br />
http://mirlyn.lib.umich.edu/Record/004232056 <br />
<br />
http://mirlyn.lib.umich.edu/Record/004133572 <br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries&diff=14465SMHS TimeSeries2014-10-20T15:41:56Z<p>Clgalla: /* Problems */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Time Series Analysis ==<br />
<br />
===Overview===<br />
Time series data are a sequence of data points measured at successive points in time, spaced at uniform intervals. Time series analysis is used in a wide variety of studies, such as monitoring industrial processes and tracking business metrics. In this lecture, we present a general introduction to time series data and introduce some of the most commonly used techniques in the rich and rapidly growing field of time series modeling and analysis, which aim to extract meaningful statistics and other characteristics of the data. We also illustrate the application of time series analysis techniques with examples in R. <br />
<br />
===Motivation===<br />
Economic data such as daily share prices, monthly sales and annual income, or physical data such as daily temperature and ECG readings, are all examples of time series data. A time series is just an ordered sequence of values of a variable measured at equally spaced time intervals. So, what are effective ways to measure time series data? How can we extract information from time series data and make inferences afterwards? <br />
<br />
===Theory===<br />
Time series: a sequence of data points measured at successive points in time, spaced at uniform intervals.<br />
*Components: (1) trend component: long-run increase or decrease over time (overall upward or downward movement), typically refers to data taken over a long period of time; the trend can be upward or downward, linear or non-linear. (2) seasonal component: short-term regular wave-like patterns, usually refers to data observed within one year, which may be measured monthly or quarterly. (3) cyclical component: long-term wave-like patterns; regularly occur but may vary in length; often measured peak to peak or trough to trough. (4) irregular component: unpredictable, random, ‘residual’ fluctuation, which may be due to random variations of nature or accidents or unusual events; ‘noise’ in the time series.<br />
*Probabilistic models: The components still add or multiply together across time.<br />
[[Image:SMHS Fig 1 Times Series Analysis.png|400px]]<br />
*Simple Time Series Models: <br />
**Take $T_{t}$ as the trend component, $S_{t}$ as the seasonal component, $C_{t}$ as the cyclical component and $I_{t}$ as the irregular or random component. Then we have an additive model as: $x_{t}=T_{t}+S_{t}+C_{t}+I_{t};$ the multiplicative model says $x_{t}=T_{t}*S_{t}*C_{t}*I_{t};$ sometimes, we take logs on both sides of the multiplicative model to make it additive, $\log x_{t}=\log\left ( T_{t}*S_{t}*C_{t}*I_{t} \right )$, which can be written as $x_{t}{}'=T_{t}{}'+S_{t}{}'+C_{t}{}'+I_{t}{}'$, where a prime denotes the logarithm of the corresponding component.<br />
**Most time series models are written in terms of an (unobserved) white noise process, which is often assumed to be Gaussian: $x_{t}=w_{t}$, where $w_{t}\sim WN\left ( 0,\sigma ^{2} \right )$, that is, $E\left ( w_{t} \right )=0, Var\left(w_{t} \right )=\sigma ^{2}$. Examples of probabilistic models include: (1) Autoregressive $x_{t}=0.6x_{t-1}+w_{t};$ (2) Moving average model $x_{t}= \frac{1}{3}\left(w_{t}+w_{t-1}+w_{t-2}\right).$<br />
**Fitting time series models: a time series model generates a process whose pattern can then be matched in some way to the observed data; since perfect matches are impossible, it is possible that more than one model will be appropriate for a set of data; to decide which model is appropriate: patterns suggest choices, assess within sample adequacy (diagnostics, tests), outside sample adequacy (forecast evaluation), simulation from suggested model and compare with observed data; next, turn to some theoretical aspects like how to characterize a time series, and then investigate some special processes. <br />
**Characterizing time series (the mean and covariance of time series): suppose the data are $x_{1},x_{2},\cdots ,x_{t};$ for a regularly spaced time series, $x_{1}$ is observed at time $t_{0}$, $x_{2}$ is observed at $t_{0}+\Delta$, and $x_{t}$ is observed at $t_{0}+\left( t-1\right) \Delta$; the expected value of $x_{t}$ is $\mu _{t}=E\left [ x_{t} \right ]$; the covariance function is $\gamma \left(s,t \right )=cov\left(x_{s},x_{t} \right )$. Note that we don’t distinguish in notation between random variables and their realizations. Properly, $x_{1},x_{2},\cdots ,x_{t}$ is a realization of some process $X_{1},X_{2},\cdots ,X_{t}$, and so $\mu _{t}=E\left [ X_{t} \right ]$, etc.<br />
**Characterizing time series data: weak stationarity: usually we have only one realization of the time series, so there are not enough data to estimate all the means and covariances without further assumptions. If the mean is constant and the covariance depends only on the separation in time, then the time series is (weakly) stationary. In this case $\mu_{t}=\mu$ and $\gamma\left(s,t\right)=\gamma\left(t-s\right)$. Stationary processes are much easier to handle statistically. Real time series are usually non-stationary, but can often be transformed to give a stationary process.<br />
**Estimation of $\gamma(h)$ (Estimating the autocovariance function): (1) for a stationary process: $\gamma(h)=\gamma(s,s+h)=cov(x_{s},x_{s+h})$ for any $s$. $\gamma (h)$ is called the autocovariance function because it measures the covariance between lags of the process $x_{t}$; (2) we observe $T-h$ pairs separated at lag $h$, namely $\left(x_{1},x_{h+1}\right),\cdots,\left(x_{T-h},x_{T}\right)$; the sample autocovariance function is $\hat{\gamma}(h)=\frac{1}{T}\sum_{t=1}^{T-h}\left(x_{t}-\bar{X} \right )\left(x_{t+h}-\bar{X} \right )$; note that we divide by $T$ although there are $T-h$ terms in the sum. The autocorrelation function (ACF) is $\rho (h)=\frac{\gamma(h)}{\gamma(0)}$.<br />
**$\rho(h)$ properties: for $x_{t}$ stationary, $\gamma(h)=E\left[(x_{t}-\mu)(x_{t+h}-\mu)\right]$, where $\gamma(h)$ is an even function, that is $\gamma(h)=\gamma(-h)$; $\left | \gamma(h) \right |\leq \gamma(0)$, that is $\left|\rho(h)\right|\leq1, h=\pm 1,\pm 2,\cdots$. The autocorrelation matrix, $P(h)$, is positive semi-definite, which means that the autocorrelations cannot be chosen arbitrarily but must have some relations among themselves. <br />
**To study a set of time series data: given that we only have one realization, we need to match plot of data from proposed model with data (we can see broad/main patterns); match acf, pacf of proposed model with data (we can see if data are stationary uncorrelated, stationary autocorrelated or non-stationary); determine if the proposed model is appropriate; obtain forecasts from proposed model; compare across models that seem appropriate. <br />
**Large sample distribution of the $ACF$ for a $WN$ series: if $x_{t}$ is $WN$, then for large $n$ the sample $ACF$, $\hat{\rho}_{x}(h), h=1,2,\cdots,H$, where $H$ is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation $\sigma_{\hat{\rho}_{x}(h)}=\frac{1}{\sqrt{n}}$. A rule of thumb for assessing whether sample autocorrelations resemble those for a $WN$ series is to check whether approximately 95% of the sample autocorrelations are within the interval $0\pm \frac{2}{\sqrt{n}}$. <br />
**White noise, or stationary uncorrelated process: we say that a time series $w_{t}$ is white noise if it is weakly stationary with $E\left[w_{t} \right ]=0$ and $ACF$ $\rho(h)=\begin{cases} 1 & \text{ if } h=0 \\ 0 & \text{ if } h\neq 0 \end{cases}$, i.e. white noise consists of a sequence of uncorrelated, zero mean random variables with common variance $\sigma^{2}$.<br />
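<br />
The sample autocovariance estimator and the $0\pm \frac{2}{\sqrt{n}}$ rule of thumb described above can be sketched in R as follows. This is a minimal illustration on simulated data; the seed, the sample size and the helper name sample_acov are assumptions made for this sketch, not part of the lecture code.<br />

```r
set.seed(1)                 # assumed seed, only for reproducibility
n <- 500
w <- rnorm(n)               # simulated Gaussian white noise

# Sample autocovariance at lag h, dividing by n (not n - h) as in the formula above
# (sample_acov is a hypothetical helper written for this sketch)
sample_acov <- function(x, h) {
  n <- length(x)
  xbar <- mean(x)
  sum((x[1:(n - h)] - xbar) * (x[(1 + h):n] - xbar)) / n
}

H <- 20
gamma_hat <- sapply(0:H, function(h) sample_acov(w, h))
rho_hat <- gamma_hat / gamma_hat[1]     # sample ACF; the first entry is lag 0

# acf() uses the same estimator, so the two results should agree
rho_r <- as.numeric(acf(w, lag.max = H, plot = FALSE)[["acf"]])
max_diff <- max(abs(rho_hat - rho_r))

# Rule of thumb: roughly 95% of lags 1..H should fall within 0 +/- 2/sqrt(n)
inside <- mean(abs(rho_hat[-1]) <= 2 / sqrt(n))
```

Here max_diff should be numerically zero, and inside gives the fraction of the first 20 sample autocorrelations that fall within the white noise bounds.<br />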
<br />
*Example of a simple moving average model in R: $v_{t}=\frac{1}{3} (w_{t-1}+w_{t}+w_{t+1})$. <br />
<br />
R CODE: <br />
w <- ts(rnorm(150)) ## generate 150 data from the standard normal distribution<br />
v <- ts(filter(w, sides=2, rep(1/3, 3))) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot(w,main='white noise')<br />
plot(v,main='moving average')<br />
<br />
Alternatively, skip the conversion to ts objects and plot with plot.ts:<br />
<br />
w <- rnorm(150) ## generate 150 data from the standard normal distribution<br />
v <- filter(w, sides=2, rep(1/3, 3)) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot.ts(w,main='white noise')<br />
plot.ts(v,main='moving average')<br />
<br />
[[Image:SMHS Fig2 Timeseries Analysis.png|500px]]<br />
<br />
## sums based on WN processes<br />
'''ts.plot(w,v,lty=2:1,col=1:2,lwd=1:2)'''<br />
<br />
[[Image:SMHS Fig3 TimeSeries Analysis.png|500px]]<br />
<br />
The ACF and the sample ACF of some stationary processes<br />
*Theoretical ACF of white noise, $x(t)=w(t)$, and of the moving average $x(t)=w(t)+\frac{1}{3} \left(w(t-1)+w(t-2)+w(t-3) \right)$, shown as barplots:<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(0,lag.max=10),main='white noise=w(t)')<br />
barplot(ARMAacf(ma=c(1/3,1/3,1/3),lag.max=10),main='acfx(t)=w(t)+1/3w(t-1)+1/3w(t-2)+1/3w(t-3)')<br />
<br />
[[Image:SMHS_Fig4_TimeSeries_Analysis.png|500px]]<br />
<br />
*Theoretical acf of some processes: autoregressive processes $x_{t}=0.9x_{t-1}+w_{t}$ and $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}.$<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(c(0.9),lag.max=30),main='acf x(t)=0.9x(t-1)+w(t)')<br />
barplot(ARMAacf(c(1,-0.9),lag.max=30),main='acf x(t)=x(t-1)-0.9x(t-2)+w(t)')<br />
<br />
<br />
[[Image:SMHA_Fig5_TimeSeries_Analysis.png|500px]]<br />
<br />
*Recognizing patterns in probabilistic data <br />
**Example 1: compare white noise and an autoregressive process: $x_{t}=w_{t}$ and $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$<br />
<br />
RCODE:<br />
w1 <- rnorm(200) ## generate 200 data from the standard normal distribution<br />
x <- filter(w1, filter=c(1,-0.9),method='recursive')[-(1:50)]<br />
w2 <- w1[-(1:50)]<br />
xt <- ts(x)<br />
par(mfrow=c(2,2))<br />
plot.ts(w2,main='white noise') ## w2 is the white noise with the first 50 values dropped<br />
acf(w2,lag.max=25)<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig6_TimeSeries_Analysis.png|500px]]<br />
<br />
*There is almost no autocorrelation in the white noise, while the second series shows large autocorrelation.<br />
<br />
**Example 2: compare AR and MA: $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$ and $v_{t}=\frac{1}{3}\left(w_{t-1}+w_{t}+w_{t+1} \right)$<br />
<br />
RCODE:<br />
v <- filter(w1, sides=2, rep(1/3, 3))<br />
vt <- ts(v)[2:199]<br />
par(mfrow=c(2,2))<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
plot.ts(vt,main='moving average')<br />
acf(vt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig7_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
*ACF, PACF and some examples of data in R<br />
<br />
RCODE:<br />
data<-<br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30,33,29,27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27,39,23)<br />
par(mfrow=c(1,2))<br />
plot(data)<br />
qqnorm(data)<br />
qqline(data)<br />
<br />
[[Image:SMHS_Fig8_TimeSeries_Analysis.png|500px]]<br />
<br />
data1<-data[c(-31,-55)]<br />
par(mfrow=c(1,2))<br />
plot.ts(data1)<br />
qqnorm(data1)<br />
qqline(data1)<br />
<br />
[[Image:SMHS_Fig9_TimeSeries_Analysis.png|500px]]<br />
<br />
library(astsa)<br />
acf2(data1)<br />
<br />
*Examining the data: <br />
Diagnostics of outliers<br />
<br />
RCODE:<br />
dat<- <br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30, 33,29, 27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27, 39,23)<br />
plot.ts(dat)<br />
<br />
<br />
[[Image:SMHS_Fig10_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
The time plot clearly shows that this data series contains outliers.<br />
<br />
# what if we use a boxplot here?<br />
boxplot(dat)<br />
<br />
[[Image:SMHS_Fig11_TimeSeries_Analysis.png|500px]]<br />
<br />
identify(rep(1,length(dat)),dat) ## identify outliers by clicking on screen<br />
## the outliers identified are observations 31 and 55:<br />
dat[31]<br />
[1] -44<br />
dat[55]<br />
[1] -2<br />
datr <- dat[c(-31,-55)] ## this is the dat with outlier removed<br />
qqnorm(datr)<br />
qqline(datr)<br />
<br />
<br />
[[Image:SMHS_Fig12_TimeSeries.png|500px]]<br />
<br />
<br />
## this plot shows that the distribution of the new data series (right) is closer to normal than that of the original data (left).<br />
## the effect of the outliers can be seen by comparing the ACF and PACF of the two data series:<br />
library(astsa)<br />
acf2(dat)<br />
acf2(datr)<br />
[[Image:SMHS_Fig13_TimeSeries.png|800px]]<br />
<br />
*Diagnostics of the error: consider the model $x_{t}=\mu+w_{t}$. Check whether the mean of the error distribution is 0, whether the variance of the error is constant across time, whether the errors are independent and uncorrelated, and whether the error distribution is normal. <br />
:::Residual plot for functional form<br />
<br />
[[Image:SMHS_Fig14_TimeSeries.png|500px]]<br />
<br />
:::Residual plot for equal variance<br />
<br />
[[Image:SMHS_Fig15_TimeSeries.png|500px]]<br />
<br />
:::Check if residuals are correlated by plotting $e_{t}$ vs. $e_{t-1}$: there are three possible cases – positive autocorrelation, negative autocorrelation and no autocorrelation. <br />
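<br />
The residual checks listed above can be sketched for the constant mean model $x_{t}=\mu+w_{t}$ as follows. The series is simulated, and its mean, standard deviation and seed are illustrative assumptions rather than values from the lecture.<br />

```r
set.seed(2)                       # assumed seed
x <- 25 + rnorm(100, sd = 4)      # simulated series with constant mean (illustrative values)
e <- x - mean(x)                  # residuals from the fitted model x_t = mu + w_t

n <- length(e)
lag1_cor <- cor(e[-1], e[-n])     # correlation between e_t and e_{t-1}

par(mfrow = c(1, 2))
plot(e[-n], e[-1], xlab = "e(t-1)", ylab = "e(t)", main = "Lag-1 residual plot")
abline(h = 0, v = 0, lty = 2)
acf(e, main = "ACF of residuals")
```

A point cloud with no visible slope, together with a lag1_cor close to zero, suggests no autocorrelation; an upward or downward slope suggests positive or negative autocorrelation respectively.<br />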
<br />
*Smoothing: a series consists of some structure plus random variation; smoothing identifies the structure by averaging out the noise. <br />
::Moving averages:<br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma9<-filter(datr,sides=2,rep(1/9,9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma9,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma9<br />
<br />
[[Image:SMHS_Fig16_TimeSeries.png|500px]]<br />
<br />
*The wider the window of the moving average, the smoother the time series plot of the data series. <br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma31<br />
<br />
[[Image:SMHS_Fig17_TimeSeries.png|500px]]<br />
<br />
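Two-sided and one-sided filters with the same coefficients have a similar smoothing effect; the one-sided filter uses only current and past values, so its output lags behind that of the two-sided filter. <br />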
<br />
## the effect of equal vs. unequal weights? <br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
mawt<-filter(datr,sides=1,c(1/9,1/9,7/9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma31<br />
par(new=T)<br />
plot.ts(mawt,ylim=c(15,50),col=3,ylab='data') ## green line indicates mawt<br />
<br />
[[Image:SMHS_Fig18_TimeSeries.png|500px]]<br />
<br />
The smoothing effect of filter with equal weights seems to be bigger than the filter with unequal weights.<br />
<br />
*Simple Exponential Smoothing (SES): premise – the most recent observations might have the highest predictive value. The forecast is $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$, that is, new forecast = $\alpha \times$ actual value + $(1-\alpha) \times$ previous forecast. This method is appropriate for series with no trend or seasonality.<br />
<br />
*For series with trend, we can use the Holt-Winters exponential smoothing additive model $\hat{x}_{t+h}=a_{t}+h*b_{t}$, where $a_{t}$ is the level and $b_{t}$ is the trend. In general, the method smooths the level (permanent) component, the trend component and the seasonal component.<br />
<br />
RCODE:<br />
HoltWinters(series,beta=FALSE,gamma=FALSE) #simple exponential smoothing<br />
HoltWinters(series,gamma=FALSE) # HoltWinters trend<br />
HoltWinters(series,seasonal= 'additive') # HoltWinters additive trend+seasonal<br />
HoltWinters(series,seasonal= 'multiplicative') # HoltWinters multiplicative trend+seasonal<br />
<br />
<br />
<br />
Example: RCODE: (effect of different values of alpha on datr SES)<br />
par(mfrow=c(3,1))<br />
exp05=HoltWinters(datr,alpha=0.05,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p05=predict(exp05,20) <br />
plot(exp05,p05, main = 'Holt- Winters filtering, alpha=0.05')<br />
exp50=HoltWinters(datr, alpha=0.5,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p50=predict(exp50,20)<br />
plot(exp50,p50, main = 'Holt-Winters filtering, alpha=0.5')<br />
exp95=HoltWinters(datr, alpha=0.95,beta=FALSE,gamma= FALSE) #simple exponential smoothing <br />
p95=predict(exp95,20)<br />
plot(exp95,p95,main = "Holt-Winters filtering, alpha=0.95")<br />
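<br />
To make the SES recursion concrete, the following sketch computes the forecasts by hand and compares them with HoltWinters. It uses a simulated series rather than datr, and it assumes that HoltWinters starts the level at the first observation, which is its documented default for simple exponential smoothing.<br />

```r
set.seed(3)                       # assumed seed
x <- 27 + rnorm(60, sd = 5)       # illustrative series without trend or seasonality
alpha <- 0.5

n <- length(x)
xhat <- numeric(n)
xhat[1] <- NA                     # no forecast is available for the first observation
xhat[2] <- x[1]                   # start value: the first forecast equals the first observation
for (t in 3:n) {
  # new forecast = alpha * actual value + (1 - alpha) * previous forecast
  xhat[t] <- alpha * x[t - 1] + (1 - alpha) * xhat[t - 1]
}

hw <- HoltWinters(x, alpha = alpha, beta = FALSE, gamma = FALSE)
hw_fit <- as.numeric(hw[["fitted"]][, "xhat"])   # one-step-ahead forecasts for t = 2..n
max_diff <- max(abs(xhat[-1] - hw_fit))
```

If the assumption about the start value holds, max_diff is numerically zero, which confirms that the fitted values are exactly the recursion written above.<br />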
<br />
<br />
[[Image:SMHS_Fig19_TimeSeries.png|500px]]<br />
<br />
*Comparing two models: we can compare ‘fit’ and forecast ability. <br />
<br />
RCODE:<br />
obs <- exp50$\$$x[-1]<br />
fit <- exp50$\$$fitted[,1]<br />
r1 <- obs-fit<br />
plot(r1)<br />
par(mfrow=c(1,2))<br />
plot(r1)<br />
acf(r1)<br />
<br />
[[Image:SMHS_Fig20_TimeSeries.png|500px]]<br />
<br />
*Regression based (nonparametric) smoothing: (1) kernel: extension of a two-sided MA. Use a bandwidth (number of terms) plus a kernel function to smooth a set of values, ''ksmooth''; (2) local linear regression (loess): fit local polynomial regressions and join them together, ''loess''; (3) fitting cubic splines: extension of polynomial regression, partition the data and fit a separate piecewise regression to each section, smoothing them together where they join, ''smooth.spline''; (4) many more, ''lowess'', ''supsmu''.<br />
::•Kernel smoother: define bandwidth; choose a kernel; use robust regression to get smooth estimate; bigger the bandwidth, smoother the result<br />
::•Loess: define the window width; choose a weight function; use polynomial regression and weighted least squares to get smoothest estimate; the bigger span, the smoother the result.<br />
::•Splines: divide data into intervals; in each interval, fit a polynomial regression of order p such that the splines are continuous at knots; use a parameter (spar in R) to smooth the splines; the bigger the spar, the smoother the result.<br />
::•Issues with regression based (nonparametric) smoothing: (1) the size of the span, S, has an important effect on the curve; (2) a span that is too small (meaning that insufficient data fall within the window) produces a curve characterized by a lot of noise, and results in a large variance; (3) if the span is too large, the regression will be oversmoothed, and thus the local polynomial may not fit the data well. This may result in loss of important information, and thus the fit will have large bias; (4) we want the smallest span that provides a smooth structure.<br />
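<br />
The three smoothers named above can be compared on one simulated series, as in the sketch below. The signal, noise level, bandwidth, span and spar values are illustrative assumptions; in practice they would be tuned to the data.<br />

```r
set.seed(4)                                            # assumed seed
tt <- 1:100
y <- sin(2 * pi * tt / 50) + rnorm(100, sd = 0.4)      # illustrative signal plus noise

ks <- ksmooth(tt, y, kernel = "normal", bandwidth = 10)   # kernel smoother
lo <- loess(y ~ tt, span = 0.3)                           # local polynomial regression
ss <- smooth.spline(tt, y, spar = 0.8)                    # cubic smoothing spline

plot(tt, y, col = "grey", main = "Nonparametric smoothers")
lines(ks, col = 2)                 # red: kernel smoother
lines(tt, fitted(lo), col = 3)     # green: loess
lines(ss, col = 4)                 # blue: smoothing spline
legend("topright", c("ksmooth", "loess", "smooth.spline"), col = 2:4, lty = 1)
```

Increasing bandwidth, span or spar gives a smoother curve in each case, at the risk of oversmoothing as discussed in the list above.<br />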
<br />
*A regression model for trend: given a time series $y_{t}, t=1,2,\cdots,T$, fit a regression model for trend. If a plot of the data suggests a linear trend, we fit a trend line, say $y_{t}=\beta_{1}+\beta_{2} t + \varepsilon_{t}$. <br />
::•Assumptions: variance of error is constant across observations; errors are independent / uncorrelated across observations; the regressor variables and error are independent; we may also assume probability distribution of error is normal.<br />
<br />
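RCODE:<br />
fit <- lm(x~z) ## x is the series and z the time index (placeholders; define them first)<br />
summary(fit)$\$$coeff ## estimated coefficients with their standard errors<br />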
plot.ts(fit$\$$resid) # plot the residuals of the fitted regression model<br />
acf2(fit$\$$resid) # the ACF and PACF of the residuals of the fitted regression model<br />
qqnorm(fit$\$$resid)<br />
qqline(fit$\$$resid)<br />
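<br />
A self-contained version of the trend regression can be sketched on simulated data as follows. The intercept 10, slope 0.05 and noise level are assumptions chosen for illustration, and the astsa function acf2 is replaced by the base function acf so that no extra package is needed.<br />

```r
set.seed(5)                                    # assumed seed
tt <- 1:120
y <- 10 + 0.05 * tt + rnorm(120, sd = 1)       # simulated linear trend plus white noise

fit <- lm(y ~ tt)
b <- coef(fit)                                 # estimates of beta_1 (intercept) and beta_2 (slope)
res <- resid(fit)

par(mfrow = c(2, 2))
plot.ts(res, main = "Residuals")               # look for remaining structure or changing variance
acf(res, main = "ACF of residuals")            # check that the errors are uncorrelated
qqnorm(res)                                    # check normality of the errors
qqline(res)
```

With errors that really are white noise, the residual ACF should stay mostly inside the dashed bounds and the points in the normal Q-Q plot should lie close to the line.<br />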
<br />
<br />
*Analyzing patterns in data and modeling: seasonality<br />
**Exploratory analysis<br />
<br />
RCODE:<br />
data=c( 118,132,129,121,135,148,148,136,119,104,118,115,126,141,135,125,<br />
149,170,170,158,133,114,140,145,150,178,163,172,178,199,199,184,162,146,<br />
166,171,180,193,181,183,218,230,242,209,191,172,194,196,196,236,235,<br />
229,243,264,272,237,211,180,201,204,188,235,227,234,264,302,293,259,229,<br />
203,229,242,233,267,269,270,315,364,347,312,274,237,278,284,277,317,313,<br />
318,374,413,405,255,306,271,306,300)<br />
air <- ts(data,start=c(1949,1),frequency=12)<br />
y <- log(air)<br />
par(mfrow=c(1,2))<br />
plot(air)<br />
plot(y)<br />
<br />
<br />
[[Image:SMHS_Fig21_TimeSeries.png|500px]]<br />
<br />
<br />
## from the charts above, taking log helps a little bit with the increasing variability. <br />
<br />
monthplot(y)<br />
<br />
<br />
[[Image:SMHS_Fig22_TimeSeries.png|500px]]<br />
<br />
<br />
acf2(y)<br />
<br />
[[Image:SMHS_Fig23_TimeSeries.png|500px]]<br />
<br />
*Smoothing: the Holt-Winters method smooths a series and obtains three parts: level, trend and seasonal. The method requires smoothing constants, which are either chosen by the user or set to the values that minimize the one-step-ahead forecast error $\sum_{t=1}^{T} \left(\hat{x}_{t} - x_{t} \right )^{2}$. At a given time point, smoothed value = (level+h*trend)+seasonal index (additive); smoothed value = (level+h*trend)*seasonal index (multiplicative). Additive: $\hat{x}_{t+h}=a_{t}+h*b_{t}+s_{t-p+1+(h-1)\bmod{p}}$; Multiplicative: $\hat{x}_{t+h}=(a_{t}+h*b_{t})* s_{t-p+1+(h-1)\bmod{p}}$, where $p$ is the seasonal period.<br />
<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
sa <- HoltWinters(air,seasonal=c('additive'))<br />
psa <- predict(sa,60)<br />
plot(sa,psa,main='Holt-Winters filtering, airline passengers, HW additive')<br />
sm <- HoltWinters(air,seasonal=c('multiplicative'))<br />
psam <- predict(sm,60)<br />
plot(sm,psam,main='Holt-Winters filtering, airline passengers, HW multiplicative')<br />
<br />
[[Image:SMHS_Fig24_TimeSeries.png|500px]]<br />
<br />
## seasonal decomposition using loess, log(airpassengers)<br />
<br />
[[Image:SMHS_Fig25_TimeSeries.png|500px]]<br />
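<br />
A decomposition of this kind can be obtained with the base R function stl, as in the sketch below. It uses the built-in AirPassengers series as a stand-in for the air vector typed in above, and the choice s.window = "periodic" is an assumption; the exact call used to produce the figure is not given in the lecture.<br />

```r
y <- log(AirPassengers)                  # built-in monthly series from the datasets package
dec <- stl(y, s.window = "periodic")     # seasonal-trend decomposition using loess
comp <- dec[["time.series"]]             # columns: seasonal, trend, remainder
plot(dec, main = "STL decomposition of log(AirPassengers)")

# The three components add back to the original series
recon_err <- max(abs(rowSums(comp) - as.numeric(y)))
```

The plot shows the data, the seasonal component, the trend and the remainder in four stacked panels, and recon_err confirms that the components sum to the logged series.<br />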
<br />
*Regression: estimation, diagnostics, interpretation. Defining dummy variables to capture the seasonal effect in regression. If there is a variable with k categories, there will be k-1 dummies if the intercept is included in the regression (exclude one group); there will be k dummies if the intercept is not included in the regression; if the intercept is included in the regression, the coefficient of a dummy variable is always the difference of means between that group and the excluded group. <br />
<br />
RCODE:<br />
t <- time(y)<br />
Q <- factor(rep(1:12,8))<br />
h <- lm(y~time(y)+Q) # regression with intercept, drops first category = January<br />
model.matrix(h)<br />
h1 <- lm(y~0+time(y)+Q) # regression without intercept<br />
model.matrix(h1)<br />
x1 <- ts(y[1:72],start=c(1949,1),frequency=12)<br />
t1 <- t[1:72]<br />
Q1 <- Q[1:72]<br />
contrasts(Q1)=contr.treatment(12,base=10)<br />
h2 <- lm(x1~t1+Q1) # regression with intercept, drops 10th category = October<br />
model.matrix(h2)<br />
reg <- lm(x1~t1+Q1)<br />
par(mfrow=c(2,2))<br />
plot.ts(reg$\$$resid)<br />
qqnorm(reg$\$$resid)<br />
qqline(reg$\$$resid)<br />
acf(reg$\$$resid)<br />
pacf(reg$\$$resid)<br />
<br />
[[Image:SMHS_Fig26_TimeSeries.png|500px]]<br />
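<br />
The interpretation of dummy variable coefficients can be checked on a small example. The sketch below uses a hypothetical three-level factor with simulated responses and no other regressors; when a regressor such as time is also included, as in the models above, the coefficients become differences adjusted for that regressor.<br />

```r
set.seed(7)                                            # assumed seed
g <- factor(rep(c("A", "B", "C"), each = 20))          # hypothetical factor with k = 3 levels
y <- c(rnorm(20, 5), rnorm(20, 7), rnorm(20, 4))       # hypothetical responses

with_int <- lm(y ~ g)        # intercept plus k - 1 = 2 dummies; group A is the excluded group
no_int   <- lm(y ~ 0 + g)    # no intercept, k = 3 dummies; each coefficient is a group mean

means <- tapply(y, g, mean)
# Coefficient of the B dummy minus the difference of the group means of B and A
diff_B <- coef(with_int)[["gB"]] - (means[["B"]] - means[["A"]])
# Largest gap between the no-intercept coefficients and the group means
diff_means <- max(abs(coef(no_int) - means))
```

Both diff_B and diff_means are numerically zero, which matches the statement that, with an intercept, a dummy coefficient is the difference of means between its group and the excluded group.<br />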
<br />
## forecasts RCODE<br />
newdata=data.frame(t1=t[73:96], Q1=Q[73:96]) <br />
preg=predict(reg,newdata,se.fit=TRUE) <br />
seup=preg$\$$fit+2*preg$\$$se.fit<br />
seup <br />
1 2 3 4 5 6 7 8 9 <br />
5.577144 5.722867 5.679936 5.667979 5.780376 5.881420 5.889829 5.782526 5.655445 <br />
10 11 12 13 14 15 16 17 18 <br />
5.525933 5.661158 5.681042 5.716577 5.862300 5.819369 5.807412 5.919809 6.020853 <br />
19 20 21 22 23 24 <br />
6.029262 5.921959 5.794878 5.665366 5.800591 5.820475 <br />
selow=preg$\$$fit-2*preg$\$$se.fit<br />
selow<br />
1 2 3 4 5 6 7 8 9 <br />
5.479442 5.625165 5.582234 5.570277 5.682674 5.783718 5.792127 5.684824 5.557743 <br />
10 11 12 13 14 15 16 17 18 <br />
5.428232 5.563456 5.583340 5.610927 5.756650 5.713720 5.701763 5.814160 5.915203 <br />
19 20 21 22 23 24 <br />
5.923613 5.816309 5.689228 5.559717 5.694941 5.714825<br />
<br />
x1r=y[73:96]<br />
t1r=t[73:96]<br />
plot(t1r,x1r, col='red',lwd=1:2,main='Actual vs Forecast', type='l')<br />
lines(t1r,preg$\$$fit, col='green', lwd=1:2)<br />
lines(t1r,seup, col='blue', lwd=1:2)<br />
lines(t1r,selow, col='blue', lwd=1:2)<br />
## red refers to the actual values; green to the forecast; blue to the forecast plus or minus two standard errors<br />
<br />
[[Image:SMHS_Fig27_TimeSeries.png|500px]]<br />
<br />
## Holt Winters smoothing: forecasts<br />
sa=HoltWinters(x1, seasonal = c("additive"))<br />
psa=predict(sa,24)<br />
plot(t1r,x1r, col="red",type="l",main = "Holt- Winters filtering, airline passengers, HW additive")<br />
lines(t1r, psa, col="green")<br />
<br />
## red represents the actual, green: forecast<br />
<br />
[[Image:SMHS_Fig28_TimeSeries.png|500px]]<br />
<br />
sqrt(sum((x1r-psa)^2)) ##HW<br />
[1] 0.3988259<br />
sqrt(sum((x1r-preg$\$$fit)^2)) ##reg<br />
[1] 0.3807665<br />
<br />
==Applications==<br />
<br />
1)[http://www.statsoft.com/Textbook/Time-Series-Analysis This article] presents a comprehensive introduction to the field of time series analysis. It discusses the common patterns found in time series data and introduces some commonly used techniques for handling and analyzing time series. <br />
<br />
2)[http://www.sciencedirect.com/science/article/pii/0304393282900125 This article] investigates whether macroeconomic time series are better characterized as stationary fluctuations around a deterministic trend or as non-stationary processes that have no tendency to return to a deterministic path. Using long historical time series for the U.S. we are unable to reject the hypothesis that these series are non-stationary stochastic processes with no tendency to return to a trend line. Based on these findings and an unobserved components model for output that decomposes fluctuations into a secular or growth component and a cyclical component we infer that shocks to the former, which we associate with real disturbances, contribute substantially to the variation in observed output. We conclude that macroeconomic models that focus on monetary disturbances as a source of purely transitory fluctuations may never be successful in explaining a large fraction of output variation and that stochastic variation due to real factors is an essential element of any model of macroeconomic fluctuations.<br />
<br />
<br />
==Software== <br />
See the CODE examples given in this lecture.<br />
<br />
==Problems==<br />
<br />
1) Consider a signal-plus-noise model of the general form $x_{t}=s_{t}+w_{t}$, where $w_{t}$ is Gaussian white noise with $\sigma ^{2} _{w} = 1$. Simulate and plot n=200 observations from each of the following two models.<br />
<br />
(a) $x_{t} = s_{t}+w_{t},$ for $t=1,2,\cdots,200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/20 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(b) $x_{t}=s_{t}+w_{t},$ for $t=1,2,\cdots, 200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/200 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(c) Compare the signal modulators (a) $\exp \left \{-t/20 \right \}$ and (b) $\exp \left \{-t/200 \right \}$, for $t=1,2, \cdots,100.$<br />
<br />
<br />
2) (a) Generate $n=100$ observations from the autoregression $x_{t}=-0.9 x_{t-2} +w_{t},$ with $\sigma_{w}=1,$ using the method described in Example 1.10, page 13 of the textbook. Next, apply the moving average filter $v_{t}=\left(x_{t}+x_{t-1}+x_{t-2}+x_{t-3} \right )/4$ to $x_{t},$ the data you generated. Now, plot $x_{t}$ as a line and superimpose $v_{t}$ as a dashed line. Comment on the behavior of $x_{t}$ and how applying the moving average filter changes that behavior. [Hint: use v=filter(x,rep(1/4,4),sides=1) for the filter].<br />
<br />
(b) Repeat (a) but with $x_{t}=cos \left( 2 \pi t/4 \right).$<br />
<br />
(c) Repeat (b) but with added $N(0,1)$ noise, $x_{t}=cos \left(2 \pi t/4 \right) +w_{t}.$<br />
<br />
(d) Compare and contrast (a)–(c).<br />
<br />
3) For the two series, $x_{t}$, in Problem 1 (a) and (b):<br />
<br />
(a) Compute and plot the mean functions $\mu _{x} (t)$ for $t=1,2,\cdots,200.$<br />
<br />
(b) Calculate the autocovariance functions, $\gamma _{x} (s,t)$, for $s,t=1,2, \cdots, 200.$<br />
<br />
4) Consider the time series $x_{t} = \beta_{1} + \beta_{2} t + w_{t},$ where $\beta _{1}$ and $\beta_{2}$ are known constants and $w_{t}$ is a white noise process with variance $\sigma ^{2} _{w}.$<br />
<br />
(a) Determine whether $x_{t}$ is stationary.<br />
<br />
(b) Show that the process $y_{t}=x_{t} - x_{t-1}$ is stationary.<br />
<br />
(c) Show that the mean of the moving average $v_{t}= \frac {1}{2q+1} \sum_{j=-q}^{q} x_{t-j},$ is $\beta _{1} + \beta _{2} t,$ and give a simplified expression for the autocovariance function. <br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries&diff=14464SMHS TimeSeries2014-10-20T15:40:54Z<p>Clgalla: /* Problems */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Time Series Analysis ==<br />
<br />
===Overview===<br />
Time series data is a sequence of data points measured at successive points in time, spaced at uniform intervals. Time series analysis is used in a wide variety of studies, such as monitoring industrial processes and tracking business metrics. In this lecture, we present a general introduction to time series data and introduce some of the most commonly used techniques in the rich and rapidly growing field of time series modeling and analysis, which aim to extract meaningful statistics and other characteristics of the data. We also illustrate the application of time series analysis techniques with examples in R. <br />
<br />
===Motivation===<br />
Economic data like daily share price, monthly sales, annual income or physical data like daily temperature, ECG readings are all examples of time series data. Time series is just an ordered sequence of values of a variable measured at equally spaced time intervals. So, what would be the effective ways to measure time series data? How can we extract information from time series data and make inference afterwards? <br />
<br />
===Theory===<br />
Time series: a sequence of data points measured at successive points in time, spaced at uniform intervals.<br />
*Components: (1) trend component: long-run increase or decrease over time (overall upward or downward movement), typically refers to data taken over a long period of time; the trend can be upward or downward, linear or non-linear. (2) seasonal component: short-term regular wave-like patterns, usually refers to data observed within one year, which may be measured monthly or quarterly. (3) cyclical component: long-term wave-like patterns; regularly occur but may vary in length; often measured peak to peak or trough to trough. (4) irregular component: unpredictable, random, ‘residual’ fluctuation, which may be due to random variations of nature or accidents or unusual events; ‘noise’ in the time series.<br />
*Probabilistic models: The components still add or multiply together across time.<br />
[[Image:SMHS Fig 1 Times Series Analysis.png|400px]]<br />
*Simple Time Series Models: <br />
**Take $T_{t}$ as the trend component, $S_{t}$ as the seasonal component, $C_{t}$ as the cyclical component and $I_{t}$ as the irregular or random component. Then we have an additive model as: $x_{t}=T_{t}+S_{t}+C_{t}+I_{t};$ the multiplicative model says $x_{t}=T_{t}*S_{t}*C_{t}*I_{t};$ sometimes, we take logs on both sides of the multiplicative model to make it additive, $\log x_{t}=\log\left ( T_{t}*S_{t}*C_{t}*I_{t} \right )$, which can be written as $x_{t}{}'=T_{t}{}'+S_{t}{}'+C_{t}{}'+I_{t}{}'$, where a prime denotes the logarithm of the corresponding component.<br />
**Most time series models are written in terms of an (unobserved) white noise process, which is often assumed to be Gaussian: $x_{t}=w_{t}$, where $w_{t}\sim WN\left ( 0,\sigma ^{2} \right )$, that is, $E\left ( w_{t} \right )=0, Var\left(w_{t} \right )=\sigma ^{2}$. Examples of probabilistic models include: (1) Autoregressive $x_{t}=0.6x_{t-1}+w_{t};$ (2) Moving average model $x_{t}= \frac{1}{3}\left(w_{t}+w_{t-1}+w_{t-2}\right).$<br />
**Fitting time series models: a time series model generates a process whose pattern can then be matched in some way to the observed data; since perfect matches are impossible, it is possible that more than one model will be appropriate for a set of data; to decide which model is appropriate: patterns suggest choices, assess within sample adequacy (diagnostics, tests), outside sample adequacy (forecast evaluation), simulation from suggested model and compare with observed data; next, turn to some theoretical aspects like how to characterize a time series, and then investigate some special processes. <br />
**Characterizing time series (the mean and covariance of time series): suppose the data are $x_{1},x_{2},\cdots ,x_{t};$ for a regularly spaced time series, $x_{1}$ is observed at time $t_{0}$, $x_{2}$ is observed at $t_{0}+\Delta$, and $x_{t}$ is observed at $t_{0}+\left( t-1\right) \Delta$; the expected value of $x_{t}$ is $\mu _{t}=E\left [ x_{t} \right ]$; the covariance function is $\gamma \left(s,t \right )=cov\left(x_{s},x_{t} \right )$. Note that we don’t distinguish in notation between random variables and their realizations. Properly, $x_{1},x_{2},\cdots ,x_{t}$ is a realization of some process $X_{1},X_{2},\cdots ,X_{t}$, and so $\mu _{t}=E\left [ X_{t} \right ]$, etc.<br />
**Characterizing time series data: weak stationarity: usually we have only one realization of the time series, so there are not enough data to estimate all the means and covariances without further assumptions. A time series is (weakly) stationary if its mean is constant and its covariance depends only on the separation in time. In this case $\mu_{t}=\mu$ and $\gamma\left(s,t\right)=\gamma\left(t-s\right)$. Stationary processes are much easier to handle statistically. Real time series are usually non-stationary, but can often be transformed to give a stationary process.<br />
**Estimation of $\gamma(h)$ (Estimating the autocovariance function): (1) for a stationary process: $\gamma(h)=\gamma(s,s+h)=cov(x_{s},x_{s+h})$ for any s.$\ \gamma (h)$ is called the autocovariance function because it measures the covariance between lags of the process $x_{t}$; (2) we observe T-h pairs separated at lag h namely $\left(x_{1},x_{h+1}\right),\cdots,\left(x_{T-h},x_{T}\right)$; the sample autocovariance function is $\hat{\gamma}(h)=\frac{1}{T}\sum_{t=1}^{T-h}\left(x_{t}-\bar{X} \right )\left(x_{t+h}-\bar{X} \right )$, note that we divide by ${T}$ although there are ${T-h}$ terms in the sum. The autocorrelation function ${ACF}$ is $\rho (h)=\frac{\gamma(h)}{\gamma(0)}$.<br />
**Properties of $\gamma(h)$ and $\rho(h)$: for $x_{t}$ stationary, $\gamma(h)=E\left[(x_{t}-\mu)(x_{t+h}-\mu)\right]$, where $\gamma(h)$ is an even function, that is $\gamma(h)=\gamma(-h)$; also $\left | \gamma(h) \right |\leq \gamma(0)$, that is $\left|\rho(h)\right|\leq1$ for $h=\pm 1,\pm 2,\cdots$. The autocorrelation matrix, $P(h)$, is positive semi-definite, which means that the autocorrelations cannot be chosen arbitrarily but must have some relations among themselves. <br />
**To study a set of time series data: given that we only have one realization, we need to match plot of data from proposed model with data (we can see broad/main patterns); match acf, pacf of proposed model with data (we can see if data are stationary uncorrelated, stationary autocorrelated or non-stationary); determine if the proposed model is appropriate; obtain forecasts from proposed model; compare across models that seem appropriate. <br />
**Large sample distribution of the $ACF$ for a $WN$ series: if $x_{t}$ is $WN$ then, for n large, the sample $ACF$, $\hat{\rho}_{x}(h), h=1,2,\cdots,H$, where H is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation given by $\sigma_{\hat{\rho}_{x}(h)}=\frac{1}{\sqrt{n}}$. A rule of thumb for assessing whether sample autocorrelations resemble those for a $WN$ series is to check whether approximately 95% of the sample autocorrelations are within the interval $0\pm \frac{2}{\sqrt{n}}$. <br />
**White noise, or stationary uncorrelated process: we say that a time series is white noise if it is weakly stationary with $E\left[w_{t} \right ]=0$ and with $ACF$ $\rho(h)=\begin{cases} 1 & \text{ if } h=0 \\ 0 & \text{ if } h\neq 0 \end{cases}$, i.e. white noise consists of a sequence of uncorrelated, zero mean random variables with common variance $\sigma^{2}$.<br />
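The sample autocovariance formula and the $\pm 2/\sqrt{n}$ rule of thumb can be checked numerically. The following is a minimal sketch on simulated white noise; the helper name sample_acov, the seed and the sample size are illustrative assumptions, not part of the lecture.<br />

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 500
w <- rnorm(n)                     # simulated Gaussian white noise

# Sample autocovariance at lag h, dividing by n (not n - h) as in the formula above.
# sample_acov is a hypothetical helper written for this illustration.
sample_acov <- function(x, h) {
  n <- length(x)
  xbar <- mean(x)
  sum((x[1:(n - h)] - xbar) * (x[(1 + h):n] - xbar)) / n
}

H <- 20
gamma_hat <- sapply(0:H, function(h) sample_acov(w, h))
rho_hat <- gamma_hat / gamma_hat[1]     # sample ACF; rho_hat[1] corresponds to lag 0

# Compare with the built-in estimator (lags 0..H)
rho_builtin <- as.numeric(acf(w, lag.max = H, plot = FALSE)[["acf"]])
max_diff <- max(abs(rho_hat - rho_builtin))

# Share of lags 1..H that fall inside the white-noise band 0 +/- 2/sqrt(n)
inside <- mean(abs(rho_hat[-1]) <= 2 / sqrt(n))
```

For a white noise series, inside is expected to be close to 0.95; a much smaller share suggests that the series is autocorrelated.<br />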
<br />
*Example of a simple moving average model in R: $v_{t}=\frac{1}{3} (w_{t-1}+w_{t}+w_{t+1})$. <br />
<br />
R CODE: <br />
w <- ts(rnorm(150)) ## generate 150 data from the standard normal distribution<br />
v <- ts(filter(w, sides=2, rep(1/3, 3))) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot(w,main='white noise')<br />
plot(v,main='moving average')<br />
<br />
Alternatively, the same plots can be produced without creating ts objects first:<br />
<br />
w <- rnorm(150) ## generate 150 data from the standard normal distribution<br />
v <- filter(w, sides=2, rep(1/3, 3)) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot.ts(w,main='white noise')<br />
plot.ts(v,main='moving average')<br />
<br />
[[Image:SMHS Fig2 Timeseries Analysis.png|500px]]<br />
<br />
## sums based on WN processes<br />
'''ts.plot(w,v,lty=2:1,col=1:2,lwd=1:2)'''<br />
<br />
[[Image:SMHS Fig3 TimeSeries Analysis.png|500px]]<br />
<br />
The ACF and the sample ACF of some stationary processes<br />
*Compare white noise, $x(t)=w(t)$, with the moving average $x(t)=w(t)+\frac{1}{3} \left(w(t-1)+w(t-2)+w(t-3) \right)$, using barplots of their theoretical ACFs:<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(0,lag.max=10),main='white noise=w(t)')<br />
barplot(ARMAacf(ma=c(1/3,1/3,1/3),lag.max=10),main='acfx(t)=w(t)+1/3w(t-1)+1/3w(t-2)+1/3w(t-3)')<br />
<br />
[[Image:SMHS_Fig4_TimeSeries_Analysis.png|500px]]<br />
<br />
*Theoretical acf of some processes: the autoregressive processes $x_{t}=0.9x_{t-1}+w_{t}$ and $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}.$<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(c(0.9),lag.max=30),main='acf x(t)=0.9x(t-1)+w(t)')<br />
barplot(ARMAacf(c(1,-0.9),lag.max=30),main='acf x(t)=x(t-1)-0.9x(t-2)+w(t)')<br />
<br />
<br />
[[Image:SMHA_Fig5_TimeSeries_Analysis.png|500px]]<br />
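For the first-order autoregression $x_{t}=\phi x_{t-1}+w_{t}$ with $|\phi|<1$, the theoretical ACF has the closed form $\rho(h)=\phi^{h}$ for $h\geq 0$. The sketch below compares this formula with ARMAacf and also checks that the second-order process plotted above is stationary; the variable names are illustrative.<br />

```r
phi <- 0.9
lags <- 0:30
rho_formula <- phi^lags                                      # closed form for an AR(1)
rho_armaacf <- as.numeric(ARMAacf(ar = phi, lag.max = 30))   # lags 0..30
max_diff <- max(abs(rho_formula - rho_armaacf))

# AR(2): x(t) = x(t-1) - 0.9 x(t-2) + w(t).
# It is stationary (causal) if all roots of 1 - z + 0.9 z^2 lie outside the unit circle.
root_moduli <- Mod(polyroot(c(1, -1, 0.9)))
```

The roots of the second-order polynomial are complex, which is why its ACF oscillates while decaying, in contrast to the smooth geometric decay of the first-order process.<br />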
<br />
*Recognizing patterns in probabilistic data <br />
**Example 1: compare white noise, $x_{t}=w_{t}$, with the autoregressive process $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$<br />
<br />
RCODE:<br />
w1 <- rnorm(200) ## generate 200 data from the standard normal distribution<br />
x <- filter(w1, filter=c(1,-0.9),method='recursive')[-(1:50)]<br />
w2 <- w1[-(1:50)]<br />
xt <- ts(x)<br />
par(mfrow=c(2,2))<br />
plot.ts(w2,main='white noise')<br />
acf(w2,lag.max=25)<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig6_TimeSeries_Analysis.png|500px]]<br />
<br />
*There is almost no autocorrelation in the white noise series, while there is strong autocorrelation in the second series.<br />
<br />
Example 2: compare the autoregressive process $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$ with the moving average $v_{t}=\frac{1}{3}\left(w_{t-1}+w_{t}+w_{t+1} \right)$<br />
<br />
RCODE:<br />
v <- filter(w1, sides=2, rep(1/3, 3))<br />
vt <- ts(v)[2:199]<br />
par(mfrow=c(2,2))<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
plot.ts(vt,main='moving average')<br />
acf(vt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig7_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
*ACF, PACF and some examples of data in R<br />
<br />
RCODE:<br />
data <- c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30,33,29,27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27,39,23)<br />
par(mfrow=c(1,2))<br />
plot(data)<br />
qqnorm(data)<br />
qqline(data)<br />
<br />
[[Image:SMHS_Fig8_TimeSeries_Analysis.png|500px]]<br />
<br />
data1<-data[c(-31,-55)]<br />
par(mfrow=c(1,2))<br />
plot.ts(data1)<br />
qqnorm(data1)<br />
qqline(data1)<br />
<br />
[[Image:SMHS_Fig9_TimeSeries_Analysis.png|500px]]<br />
<br />
library(astsa)<br />
acf2(data1)<br />
<br />
*Examining the data: <br />
Diagnostics of outliers<br />
<br />
RCODE:<br />
dat<- <br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30, 33,29, 27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27, 39,23)<br />
plot.ts(dat)<br />
<br />
<br />
[[Image:SMHS_Fig10_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
The time series plot makes it easy to see that this data series contains outliers.<br />
<br />
# what if we use a boxplot here?<br />
boxplot(dat)<br />
<br />
[[Image:SMHS_Fig11_TimeSeries_Analysis.png|500px]]<br />
<br />
identify(rep(1,length(dat)),dat) ## identify outliers by clicking on screen<br />
## the outliers identified are observations 31 and 55:<br />
dat[31]<br />
[1] -44<br />
dat[55]<br />
[1] -2<br />
datr <- dat[c(-31,-55)] ## this is the dat with outlier removed<br />
qqnorm(datr)<br />
qqline(datr)<br />
<br />
<br />
[[Image:SMHS_Fig12_TimeSeries.png|500px]]<br />
<br />
<br />
## this plot shows that the distribution of the new data series (right) is closer to normal compared to the original data (left).<br />
## The effect of outliers can be seen by comparing the ACF and PACF of the two data series:<br />
library(astsa)<br />
acf2(dat)<br />
acf2(datr)<br />
[[Image:SMHS_Fig13_TimeSeries.png|800px]]<br />
<br />
*Diagnostics of the error: consider the model $x_{t}=\mu+w_{t}$ and check whether the mean of the probability distribution of the error is 0; whether the variance of the error is constant across time; whether the errors are independent and uncorrelated; and whether the probability distribution of the error is normal. <br />
:::Residual plot for functional form<br />
<br />
[[Image:SMHS_Fig14_TimeSeries.png|500px]]<br />
<br />
:::Residual plot for equal variance<br />
<br />
[[Image:SMHS_Fig15_TimeSeries.png|500px]]<br />
<br />
:::Check if residuals are correlated by plotting $e_{t}$ vs. $e_{t-1}$: there are three possible cases: positive autocorrelation, negative autocorrelation and no autocorrelation. <br />
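A lag-1 residual plot can be sketched as follows. The "residuals" here are simulated with positive autocorrelation purely for illustration (the AR coefficient 0.7 and the seed are assumptions); in practice e would hold the residuals of the fitted model.<br />

```r
set.seed(2)                                                   # arbitrary seed
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = 300))   # simulated, positively autocorrelated
n <- length(e)

# Correlation between e(t) and e(t-1)
lag1_cor <- cor(e[-1], e[-n])

# Scatter plot of e(t) against e(t-1); an upward-sloping cloud indicates positive
# autocorrelation, a downward-sloping cloud negative autocorrelation, and a
# shapeless cloud no autocorrelation
plot(e[-n], e[-1], xlab = "e(t-1)", ylab = "e(t)", main = "Lag-1 residual plot")
abline(lm(e[-1] ~ e[-n]), col = 2)
```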
<br />
*Smoothing: a series consists of some structure plus random variation, so we identify the structure by averaging out the noise. <br />
::Moving averages:<br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma9<-filter(datr,sides=2,rep(1/9,9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma9,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma9<br />
<br />
[[Image:SMHS_Fig16_TimeSeries.png|500px]]<br />
<br />
*The wider the moving average window (the more terms that are averaged), the smoother the resulting time series plot. <br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma31<br />
<br />
[[Image:SMHS_Fig17_TimeSeries.png|500px]]<br />
<br />
Two sided and one sided filters with same coefficients seem to be similar in their smoothing effect. <br />
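To make the difference between the two-sided and one-sided filters concrete, the following minimal sketch applies both to a short made-up vector. It calls stats::filter explicitly, since other packages (for example dplyr) define a function with the same name.<br />

```r
x <- c(1, 2, 3, 4, 5)    # made-up values, for illustration only

# sides = 2: centred average, (x[t-1] + x[t] + x[t+1]) / 3
two_sided <- as.numeric(stats::filter(x, rep(1/3, 3), sides = 2))

# sides = 1: trailing average, (x[t] + x[t-1] + x[t-2]) / 3
one_sided <- as.numeric(stats::filter(x, rep(1/3, 3), sides = 1))

two_sided   # NA at both ends; 2, 3, 4 in the middle (up to rounding)
one_sided   # NA for the first two points; then 2, 3, 4 (up to rounding)
```

The one-sided filter uses only current and past values, so its output lags the two-sided output by one time step; this matters when the smoothed series is used for forecasting.<br />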
<br />
## the effect of equal vs. unequal weights? <br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
mawt<-filter(datr,sides=1,c(1/9,1/9,7/9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma31<br />
par(new=T)<br />
plot.ts(mawt,ylim=c(15,50),col=3,ylab='data') ## green line indicates mawt<br />
<br />
[[Image:SMHS_Fig18_TimeSeries.png|500px]]<br />
<br />
The smoothing effect of filter with equal weights seems to be bigger than the filter with unequal weights.<br />
<br />
*Simple Exponential Smoothing (SES): the premise is that the most recent observations might have the highest predictive value: $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$, that is, new forecast = $\alpha \times$ (actual value) + $(1-\alpha) \times$ (previous forecast). This method is appropriate for series with no trend or seasonality.<br />
<br />
*For series with a trend, we can use the Holt-Winters exponential smoothing additive model $\hat{x}_{t+h}=a_{t}+h*b_{t}$, where $a_{t}$ is the smoothed level and $b_{t}$ is the smoothed trend. In general, we smooth the level or permanent component, the trend component and the seasonal component.<br />
<br />
RCODE:<br />
HoltWinters(series,beta=FALSE,gamma=FALSE) #simple exponential smoothing<br />
HoltWinters(series,gamma=FALSE) # HoltWinters trend<br />
HoltWinters(series,seasonal= 'additive') # HoltWinters additive trend+seasonal<br />
HoltWinters(series,seasonal= 'multiplicative') # HoltWinters multiplicative trend+seasonal<br />
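The simple exponential smoothing recursion can also be written out by hand. This is a minimal sketch: the helper ses_forecast is hypothetical, and starting the recursion at the first observation is an assumption (it is the default that HoltWinters uses when beta and gamma are switched off).<br />

```r
# One-step-ahead SES forecasts: xhat[t] = alpha * x[t-1] + (1 - alpha) * xhat[t-1]
# ses_forecast is a hypothetical helper written for this illustration.
ses_forecast <- function(x, alpha) {
  xhat <- numeric(length(x))
  xhat[1] <- NA            # no forecast is available for the first observation
  xhat[2] <- x[1]          # assumption: start the recursion at the first observation
  for (t in 3:length(x)) {
    xhat[t] <- alpha * x[t - 1] + (1 - alpha) * xhat[t - 1]
  }
  xhat
}

x <- c(28, 22, 36, 26, 28, 28, 26, 24, 32, 30)   # first ten values of the series used above
xhat <- ses_forecast(x, alpha = 0.5)

# The built-in fit with the same alpha should give the same forecasts for t = 2, ..., 10
hw <- HoltWinters(x, alpha = 0.5, beta = FALSE, gamma = FALSE)
hw_diff <- max(abs(xhat[-1] - as.numeric(fitted(hw)[, 1])))
```

With alpha = 0.5 the first forecasts are 28, 25 and 30.5: each one is the average of the previous observation and the previous forecast.<br />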
<br />
<br />
<br />
Example: RCODE: (effect of different values of alpha on datr SES)<br />
par(mfrow=c(3,1))<br />
exp05=HoltWinters(datr,alpha=0.05,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p05=predict(exp05,20) <br />
plot(exp05,p05, main = 'Holt- Winters filtering, alpha=0.05')<br />
exp50=HoltWinters(datr, alpha=0.5,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p50=predict(exp50,20)<br />
plot(exp50,p50, main = 'Holt-Winters filtering, alpha=0.5')<br />
exp95=HoltWinters(datr, alpha=0.95,beta=FALSE,gamma= FALSE) #simple exponential smoothing <br />
p95=predict(exp95,20)<br />
plot(exp95,p95,main = "Holt-Winters filtering, alpha=0.95")<br />
<br />
<br />
[[Image:SMHS_Fig19_TimeSeries.png|500px]]<br />
<br />
*Comparing two models: we can compare ‘fit’ and forecast ability. <br />
<br />
RCODE:<br />
obs <- exp50$\$$x[-1]<br />
fit <- exp50$\$$fitted[,1]<br />
r1 <- obs-fit<br />
plot(r1)<br />
par(mfrow=c(1,2))<br />
plot(r1)<br />
acf(r1)<br />
<br />
[[Image:SMHS_Fig20_TimeSeries.png|500px]]<br />
<br />
*Regression based (nonparametric) smoothing: (1) kernel: extension of a two-sided MA. Use a bandwidth (number of terms) plus a kernel function to smooth a set of values, ''ksmooth''; (2) local linear regression (loess): fit local polynomial regressions and join them together, ''loess''; (3) fitting cubic splines: extension of polynomial regression, partition the data and fit a separate piecewise regression to each section, smoothing them together where they join, ''smooth.spline''; (4) many more, ''lowess'', ''supsmu''.<br />
::•Kernel smoother: define bandwidth; choose a kernel; use robust regression to get smooth estimate; bigger the bandwidth, smoother the result<br />
::•Loess: define the window width; choose a weight function; use polynomial regression and weighted least squares to get smoothest estimate; the bigger span, the smoother the result.<br />
::•Splines: divide data into intervals; in each interval, fit a polynomial regression of order p such that the splines are continuous at knots; use a parameter (spar in R) to smooth the splines; the bigger the spar, the smoother the result.<br />
::•Issues with regression based (nonparametric) smoothing: (1) the size of the span S has an important effect on the curve; (2) a span that is too small (meaning that insufficient data fall within the window) produces a noisy curve with large variance; (3) if the span is too large, the regression will be oversmoothed, and thus the local polynomial may not fit the data well. This may result in loss of important information, and thus the fit will have large bias; (4) we want the smallest span that provides a smooth structure.<br />
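The three smoothers listed above can be compared on one series. The sketch below uses simulated data (a sine wave plus noise); the bandwidth, span and spar values are arbitrary choices for illustration, not recommendations.<br />

```r
set.seed(3)                                          # arbitrary seed
tt <- 1:100
x <- sin(2 * pi * tt / 50) + rnorm(100, sd = 0.3)    # simulated structure plus noise

k_fit  <- ksmooth(tt, x, kernel = "normal", bandwidth = 10)   # kernel smoother
lo_fit <- loess(x ~ tt, span = 0.3)                           # local polynomial regression
sp_fit <- smooth.spline(tt, x, spar = 0.8)                    # cubic smoothing spline

plot(tt, x, col = "grey", xlab = "time", ylab = "x", main = "Nonparametric smoothers")
lines(k_fit, col = 2)                    # ksmooth returns a list with components x and y
lines(tt, fitted(lo_fit), col = 3)
lines(sp_fit, col = 4)
legend("topright", legend = c("ksmooth", "loess", "smooth.spline"), col = 2:4, lty = 1)
```

Increasing bandwidth, span or spar gives a smoother curve in each case, at the cost of larger bias.<br />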
<br />
*A regression model for trend: given that $y_{t}, t=1,2,\cdots,T$ is a time series, fit a regression model for trend. If the plot of the data suggests a linear trend, we fit a trend line, say $y_{t}=\beta_{1}+\beta_{2} t + \varepsilon_{t}$. <br />
::•Assumptions: variance of error is constant across observations; errors are independent / uncorrelated across observations; the regressor variables and error are independent; we may also assume probability distribution of error is normal.<br />
<br />
RCODE<br />
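fit <- lm(x~z) # x is the observed series, z is the regressor (e.g. the time index)<br />
summary(fit) # coefficient estimates, standard errors and overall fit<br />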
plot.ts(fit$\$$resid) # plot the residuals of the fitted regression model<br />
acf2(fit$\$$resid) # the ACF and PACF of the residuals of the fitted regression model<br />
qqnorm(fit$\$$resid)<br />
qqline(fit$\$$resid)<br />
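Since x and z are not defined in the code above, here is a self-contained sketch of the same workflow on a simulated series with a known linear trend. The coefficients 2 and 0.5, the sample size and the seed are assumptions made for illustration.<br />

```r
set.seed(4)                           # arbitrary seed
n <- 120
tt <- 1:n                             # time index used as the regressor
y <- 2 + 0.5 * tt + rnorm(n)          # simulated series: beta1 = 2, beta2 = 0.5

trend_fit <- lm(y ~ tt)
beta_hat <- coef(trend_fit)           # estimated intercept and slope
res <- residuals(trend_fit)

# Residual diagnostics: pattern over time, autocorrelation, normality
par(mfrow = c(1, 3))
plot.ts(res, main = "residuals")
acf(res, main = "ACF of residuals")
qqnorm(res)
qqline(res)
```

Because the simulated errors are independent and normal, the residual plots should show no pattern; with real data, structure in these plots points to a violated assumption.<br />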
<br />
<br />
*Analyzing patterns in data and modeling: seasonality<br />
**Exploratory analysis<br />
<br />
RCODE:<br />
data=c( 118,132,129,121,135,148,148,136,119,104,118,115,126,141,135,125,<br />
149,170,170,158,133,114,140,145,150,178,163,172,178,199,199,184,162,146,<br />
166,171,180,193,181,183,218,230,242,209,191,172,194,196,196,236,235,<br />
229,243,264,272,237,211,180,201,204,188,235,227,234,264,302,293,259,229,<br />
203,229,242,233,267,269,270,315,364,347,312,274,237,278,284,277,317,313,<br />
318,374,413,405,255,306,271,306,300)<br />
air <- ts(data,start=c(1949,1),frequency=12)<br />
y <- log(air)<br />
par(mfrow=c(1,2))<br />
plot(air)<br />
plot(y)<br />
<br />
<br />
[[Image:SMHS_Fig21_TimeSeries.png|500px]]<br />
<br />
<br />
## from the charts above, taking log helps a little bit with the increasing variability. <br />
<br />
monthplot(y)<br />
<br />
<br />
[[Image:SMHS_Fig22_TimeSeries.png|500px]]<br />
<br />
<br />
acf2(y)<br />
<br />
[[Image:SMHS_Fig23_TimeSeries.png|500px]]<br />
<br />
*Smoothing: the Holt-Winters method smooths and obtains three parts of a series: level, trend and seasonal. The method requires smoothing constants: your choice or those which minimize the sum of squared one-step-ahead forecast errors, $\sum_{t=1}^{T} \left(\hat{x}_{t} - x_{t} \right )^{2}$. So at a given time point, smoothed value = (level+h*trend)+seasonal index (additive); smoothed value = (level+h*trend)*seasonal index (multiplicative). For a seasonal period $p$: Additive: $\hat{x}_{t+h}=a_{t}+h*b_{t}+s_{t-p+1+(h-1)\mod{p}}$; Multiplicative: $\hat{x}_{t+h}=(a_{t}+h*b_{t})* s_{t-p+1+(h-1)\mod{p}}.$<br />
<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
sa <- HoltWinters(air,seasonal=c('additive'))<br />
psa <- predict(sa,60)<br />
plot(sa,psa,main='Holt-Winters filtering, airline passengers, HW additive')<br />
sm <- HoltWinters(air,seasonal=c('multiplicative'))<br />
psam <- predict(sm,60)<br />
plot(sm,psam,main='Holt-Winters filtering, airline passengers, HW multiplicative')<br />
<br />
[[Image:SMHS_Fig24_TimeSeries.png|500px]]<br />
<br />
## seasonal decomposition using loess, log(airpassengers)<br />
<br />
[[Image:SMHS_Fig25_TimeSeries.png|500px]]<br />
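No code is given for the loess-based decomposition. A minimal sketch using stl follows; it assumes the built-in AirPassengers dataset (monthly, 1949 to 1960), which is close to but not identical to the air series defined above, and a periodic seasonal window.<br />

```r
y_air <- log(AirPassengers)                    # built-in monthly series (assumption: not the 'air' object above)
dec <- stl(y_air, s.window = "periodic")       # seasonal-trend decomposition using loess
plot(dec, main = "STL decomposition of log(AirPassengers)")

parts <- dec[["time.series"]]                  # columns: seasonal, trend, remainder
recon_err <- max(abs(rowSums(parts) - as.numeric(y_air)))   # components add back to the data
```

The decomposition is additive on the log scale, which corresponds to a multiplicative decomposition of the original passenger counts.<br />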
<br />
*Regression: estimation, diagnostics, interpretation. Defining dummy variables to capture the seasonal effect in regression. If there is a variable with k categories, there will be k-1 dummies if the intercept is included in the regression (exclude one group); there will be k dummies if the intercept is not included in the regression; if the intercept is included in the regression, the coefficient of a dummy variable is always the difference of means between that group and the excluded group. <br />
<br />
RCODE:<br />
t <- time(y)<br />
Q <- factor(rep(1:12,8))<br />
h <- lm(y~time(y)+Q) # regression with intercept, drops first category = January<br />
model.matrix(h)<br />
h1 <- lm(y~0+time(y)+Q) # regression without intercept<br />
model.matrix(h1)<br />
x1 <- ts(y[1:72],start=c(1949,1),frequency=12)<br />
t1 <- t[1:72]<br />
Q1 <- Q[1:72]<br />
contrasts(Q1)=contr.treatment(12,base=10)<br />
h2 <- lm(x1~t1+Q1) # regression with intercept, drops 10th category = October<br />
model.matrix(h2)<br />
reg <- lm(x1~t1+Q1)<br />
par(mfrow=c(2,2))<br />
plot.ts(reg$\$$resid)<br />
qqnorm(reg$\$$resid)<br />
qqline(reg$\$$resid)<br />
acf(reg$\$$resid)<br />
pacf(reg$\$$resid)<br />
<br />
[[Image:SMHS_Fig26_TimeSeries.png|500px]]<br />
<br />
## forecasts RCODE<br />
newdata=data.frame(t1=t[73:96], Q1=Q[73:96]) <br />
preg=predict(reg,newdata,se.fit=TRUE) <br />
seup=preg$\$$fit+2*preg$\$$se.fit<br />
seup <br />
1 2 3 4 5 6 7 8 9 <br />
5.577144 5.722867 5.679936 5.667979 5.780376 5.881420 5.889829 5.782526 5.655445 <br />
10 11 12 13 14 15 16 17 18 <br />
5.525933 5.661158 5.681042 5.716577 5.862300 5.819369 5.807412 5.919809 6.020853 <br />
19 20 21 22 23 24 <br />
6.029262 5.921959 5.794878 5.665366 5.800591 5.820475 <br />
selow=preg$\$$fit-2*preg$\$$se.fit<br />
selow<br />
1 2 3 4 5 6 7 8 9 <br />
5.479442 5.625165 5.582234 5.570277 5.682674 5.783718 5.792127 5.684824 5.557743 <br />
10 11 12 13 14 15 16 17 18 <br />
5.428232 5.563456 5.583340 5.610927 5.756650 5.713720 5.701763 5.814160 5.915203 <br />
19 20 21 22 23 24 <br />
5.923613 5.816309 5.689228 5.559717 5.694941 5.714825<br />
<br />
x1r=y[73:96]<br />
t1r=t[73:96]<br />
plot(t1r,x1r, col='red',lwd=1:2,main='Actual vs Forecast', type='l')<br />
lines(t1r,preg$\$$fit, col='green', lwd=1:2)<br />
lines(t1r,seup, col='blue', lwd=1:2)<br />
lines(t1r,selow, col='blue', lwd=1:2)<br />
## red refers to the actual values; green refers to the forecast; blue refers to the forecast plus/minus two standard errors<br />
<br />
[[Image:SMHS_Fig27_TimeSeries.png|500px]]<br />
<br />
## Holt Winters smoothing: forecasts<br />
sa=HoltWinters(x1, seasonal = c("additive"))<br />
psa=predict(sa,24)<br />
plot(t1r,x1r, col="red",type="l",main = "Holt- Winters filtering, airline passengers, HW additive")<br />
lines(t1r, psa, col="green")<br />
<br />
## red represents the actual, green: forecast<br />
<br />
[[Image:SMHS_Fig28_TimeSeries.png|500px]]<br />
<br />
sqrt(sum((x1r-psa)^2)) ##HW<br />
[1] 0.3988259<br />
sqrt(sum((x1r-preg$\$$fit)^2)) ##reg<br />
[1] 0.3807665<br />
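The two numbers above are root sums of squared forecast errors, so the smaller value (here, the regression model) indicates the better forecast over the hold-out period. The calculation can be wrapped in small helpers; the function names rsse and rmse and the toy values below are hypothetical and used only for illustration.<br />

```r
# Root sum of squared errors, as computed above for the two models
rsse <- function(actual, forecast) sqrt(sum((actual - forecast)^2))

# Root mean squared error: the same idea on a per-observation scale,
# which makes hold-out periods of different lengths comparable
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))

actual   <- c(5.0, 5.2, 5.1, 5.3)   # made-up values, for illustration only
forecast <- c(5.1, 5.1, 5.2, 5.2)

rsse(actual, forecast)   # approximately 0.2
rmse(actual, forecast)   # approximately 0.1
```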
<br />
==Applications==<br />
<br />
1)[http://www.statsoft.com/Textbook/Time-Series-Analysis This article] presents a comprehensive introduction to the field of time series analysis. It discusses the common patterns found in time series data and introduces some commonly used techniques for handling and analyzing time series. <br />
<br />
2)[http://www.sciencedirect.com/science/article/pii/0304393282900125 This article] investigates whether macroeconomic time series are better characterized as stationary fluctuations around a deterministic trend or as non-stationary processes that have no tendency to return to a deterministic path. Using long historical time series for the U.S. we are unable to reject the hypothesis that these series are non-stationary stochastic processes with no tendency to return to a trend line. Based on these findings and an unobserved components model for output that decomposes fluctuations into a secular or growth component and a cyclical component we infer that shocks to the former, which we associate with real disturbances, contribute substantially to the variation in observed output. We conclude that macroeconomic models that focus on monetary disturbances as a source of purely transitory fluctuations may never be successful in explaining a large fraction of output variation and that stochastic variation due to real factors is an essential element of any model of macroeconomic fluctuations.<br />
<br />
<br />
==Software== <br />
See the CODE examples given in this lecture.<br />
<br />
==Problems==<br />
<br />
1) Consider a signal-plus-noise model of the general form $x_{t}=s_{t}+w_{t}$, where $w_{t}$ is Gaussian white noise with $\sigma ^{2} _{w} = 1$. Simulate and plot n=200 observations from each of the following two models.<br />
<br />
(a) $x_{t} = s_{t}+w_{t},$ for $t=1,2,\cdots,200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\,\exp \left \{ -(t-100)/20 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(b) $x_{t}=s_{t}+w_{t},$ for $t=1,2,\cdots, 200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\,\exp \left \{ -(t-100)/200 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(c) Compare the signal modulators (a) $\exp \left \{-t/20 \right \}$ and (b) $\exp \left \{-t/200 \right \}$, for $t=1,2, \cdots,100.$<br />
<br />
<br />
2) (a) Generate $n=100$ observations from the autoregression $x_{t}=-0.9 x_{t-2} +w_{t}$, with $\sigma _{w}=1,$ using the method described in Example 1.10, page 13 given in the textbook. Next, apply the moving average filter $v_{t}=\left(x_{t}+x_{t-1}+x_{t-2}+x_{t-3} \right )/4$ to $x_{t},$ the data you generated. Now, plot $x_{t}$ as a line and superimpose $v_{t}$ as a dashed line. Comment on the behavior of $x_{t}$ and how applying the moving average filter changes that behavior. [Hint: use ''v=filter(x,rep(1/4,4),sides=1)'' for the filter].<br />
<br />
(b) Repeat (a) but with $x_{t}=cos \left( 2 \pi t/4 \right).$<br />
<br />
(c) Repeat (b) but with added $N(0,1)$ noise, $x_{t}=cos \left(2 \pi t/4 \right) +w_{t}.$<br />
<br />
(d) Compare and contrast (a)-(c).<br />
<br />
3) For the two series, $x_{t}$ in 6.1 (a) and (b):<br />
<br />
(a) Compute and plot the mean functions $\mu _{x} (t)$ for $t=1,2,\cdots,200.$<br />
<br />
(b) Calculate the autocovariance functions, $\gamma _{x} (s,t),$ for $s,t=1,2, \cdots, 200.$<br />
<br />
4) Consider the time series $x_{t} = \beta _{1} + \beta _{2} t + w_{t},$ where $\beta _{1}$ and $\beta_{2}$ are known constants and $w_{t}$ is a white noise process with variance $\sigma ^{2} _{w}.$<br />
<br />
(a) Determine whether $x_{t}$ is stationary.<br />
<br />
(b) Show that the process $y_{t}=x_{t} - x_{t-1}$ is stationary.<br />
<br />
(c) Show that the mean of the moving average $v_{t}= \frac {1}{2q+1} \sum_{j=-q}^{q} x_{t-j},$ is $\beta _{1} + \beta _{2} t,$ and give a simplified expression for the autocovariance function. <br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries&diff=14463SMHS TimeSeries2014-10-20T15:39:45Z<p>Clgalla: /* Problems */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Time Series Analysis ==<br />
<br />
===Overview===<br />
Time series data is a sequence of data points measured at successive points in time, spaced at uniform intervals. Time series analysis is commonly used in a variety of studies, such as monitoring industrial processes and tracking business metrics. In this lecture, we present a general introduction to time series data and introduce some of the most commonly used techniques in the rich and rapidly growing field of time series modeling and analysis for extracting meaningful statistics and other characteristics of the data. We also illustrate the application of time series analysis techniques with examples in R. <br />
<br />
===Motivation===<br />
Economic data like daily share price, monthly sales, annual income or physical data like daily temperature, ECG readings are all examples of time series data. Time series is just an ordered sequence of values of a variable measured at equally spaced time intervals. So, what would be the effective ways to measure time series data? How can we extract information from time series data and make inference afterwards? <br />
<br />
===Theory===<br />
Time series: a sequence of data points measured at successive points in time, spaced at uniform intervals.<br />
*Components: (1) trend component: long-run increase or decrease over time (overall upward or downward movement), typically refers to data taken over a long period of time; the trend can be upward or downward, linear or non-linear. (2) seasonal component: short-term regular wave-like patterns, usually refers to data observed within one year, which may be measured monthly or quarterly. (3) cyclical component: long-term wave-like patterns; regularly occur but may vary in length; often measured peak to peak or trough to trough. (4) irregular component: unpredictable, random, ‘residual’ fluctuation, which may be due to random variations of nature or accidents or unusual events; ‘noise’ in the time series.<br />
*Probabilistic models: The components still add or multiply together across time.<br />
[[Image:SMHS Fig 1 Times Series Analysis.png|400px]]<br />
*Simple Time Series Models: <br />
**Take $T_{t}$ as the trend component, $S_{t}$ as the seasonal component, $C_{t}$ as the cyclical component and $I_{t}$ as the irregular or random component. Then we have an additive model as: $x_{t}=T_{t}+S_{t}+C_{t}+I_{t};$ the multiplicative model says $x_{t}=T_{t}*S_{t}*C_{t}*I_{t};$ sometimes, we take logs on both sides of the multiplicative model to make it additive, $logx_{t}=log\left ( T_{t}*S_{t}*C_{t}*I_{t} \right )$, which can be further noted as $x_{t}{}'=T_{t}{}'+S_{t}{}'+C_{t}{}'+I_{t}{}'$.<br />
**Most time series models are written in terms of an (unobserved) white noise process, which is often assumed to be Gaussian: $x_{t}=w_{t}$, where $w_{t}\sim WN\left ( 0,\sigma ^{2} \right )$, that is, $E\left ( w_{t} \right )=0, Var\left(w_{t} \right )=\sigma ^{2}$. Examples of probabilistic models include: (1) Autoregressive model $x_{t}=0.6x_{t-1}+w_{t};$ (2) Moving average model $x_{t}= \frac{1}{3}\left(w_{t}+w_{t-1}+w_{t-2}\right).$<br />
**Fitting time series models: a time series model generates a process whose pattern can then be matched in some way to the observed data; since perfect matches are impossible, it is possible that more than one model will be appropriate for a set of data; to decide which model is appropriate: patterns suggest choices, assess within sample adequacy (diagnostics, tests), outside sample adequacy (forecast evaluation), simulation from suggested model and compare with observed data; next, turn to some theoretical aspects like how to characterize a time series, and then investigate some special processes. <br />
**Characterizing time series (the mean and covariance of time series): suppose the data are $x_{1},x_{2},\cdots ,x_{t},$ note for a regularly spaced time series, $x_{1}$ is observed at time $t_{0}$, $x_{2}$ is observed at $t_{0}+\Delta, x_{t}$ is observed at $t_{0}+\left( t-1\right) \Delta$; the expected value of $x_{t}$ is $\mu _{t}=E\left [ x_{t} \right ]$; the covariance function is $\gamma \left(s,t \right )=cov\left(x_{s},x_{t} \right )$. Note that, we don’t distinguish between random variables and their realizations. Properly, $x_{1},x_{2},\cdots ,x_{t}$ is a realization of some process $X_{1},X_{2},\cdots ,X_{t}$ and so $\mu _{t}=E\left [ x_{t} \right ]$ etc.<br />
**Characterizing time series data: weak stationarity: usually we have only one realization of the time series, so there are not enough data to estimate all the means and covariances without further assumptions. A time series is (weakly) stationary if its mean is constant and its covariance depends only on the separation in time. In this case $\mu_{t}=\mu$ and $\gamma\left(s,t\right)=\gamma\left(t-s\right)$. Stationary processes are much easier to handle statistically. Real time series are usually non-stationary, but can often be transformed to give a stationary process.<br />
**Estimation of $\gamma(h)$ (Estimating the autocovariance function): (1) for a stationary process: $\gamma(h)=\gamma(s,s+h)=cov(x_{s},x_{s+h})$ for any s.$\ \gamma (h)$ is called the autocovariance function because it measures the covariance between lags of the process $x_{t}$; (2) we observe T-h pairs separated at lag h namely $\left(x_{1},x_{h+1}\right),\cdots,\left(x_{T-h},x_{T}\right)$; the sample autocovariance function is $\hat{\gamma}(h)=\frac{1}{T}\sum_{t=1}^{T-h}\left(x_{t}-\bar{X} \right )\left(x_{t+h}-\bar{X} \right )$, note that we divide by ${T}$ although there are ${T-h}$ terms in the sum. The autocorrelation function ${ACF}$ is $\rho (h)=\frac{\gamma(h)}{\gamma(0)}$.<br />
**Properties of $\gamma(h)$ and $\rho(h)$: for $x_{t}$ stationary, $\gamma(h)=E\left[(x_{t}-\mu)(x_{t+h}-\mu)\right]$, where $\gamma(h)$ is an even function, that is $\gamma(h)=\gamma(-h)$; also $\left | \gamma(h) \right |\leq \gamma(0)$, that is $\left|\rho(h)\right|\leq1$ for $h=\pm 1,\pm 2,\cdots$. The autocorrelation matrix, $P(h)$, is positive semi-definite, which means that the autocorrelations cannot be chosen arbitrarily but must have some relations among themselves. <br />
**To study a set of time series data: given that we only have one realization, we need to match plot of data from proposed model with data (we can see broad/main patterns); match acf, pacf of proposed model with data (we can see if data are stationary uncorrelated, stationary autocorrelated or non-stationary); determine if the proposed model is appropriate; obtain forecasts from proposed model; compare across models that seem appropriate. <br />
**Large sample distribution of the $ACF$ for a $WN$ series: if $x_{t}$ is $WN$ then, for n large, the sample $ACF$, $\hat{\rho}_{x}(h), h=1,2,\cdots,H$, where H is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation given by $\sigma_{\hat{\rho}_{x}(h)}=\frac{1}{\sqrt{n}}$. A rule of thumb for assessing whether sample autocorrelations resemble those for a $WN$ series is to check whether approximately 95% of the sample autocorrelations are within the interval $0\pm \frac{2}{\sqrt{n}}$. <br />
**White noise, or stationary uncorrelated process: we say that a time series is white noise if it is weakly stationary with $E\left[w_{t} \right ]=0$ and with $ACF$ $\rho(h)=\begin{cases} 1 & \text{ if } h=0 \\ 0 & \text{ if } h\neq 0 \end{cases}$, i.e. white noise consists of a sequence of uncorrelated, zero mean random variables with common variance $\sigma^{2}$.<br />
<br />
*Example of a simple moving average model in R: $v_{t}=\frac{1}{3} (w_{t-1}+w_{t}+w_{t+1})$. <br />
<br />
R CODE: <br />
w <- ts(rnorm(150)) ## generate 150 data from the standard normal distribution<br />
v <- ts(filter(w, sides=2, rep(1/3, 3))) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot(w,main='white noise')<br />
plot(v,main='moving average')<br />
<br />
or can use code:<br />
<br />
w <- rnorm(150) ## generate 150 data from the standard normal distribution<br />
v <- filter(w, sides=2, rep(1/3, 3)) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot.ts(w,main='white noise')<br />
plot.ts(v,main='moving average')<br />
<br />
[[Image:SMHS Fig2 Timeseries Analysis.png|500px]]<br />
<br />
## sums based on WN processes<br />
'''ts.plot(w,v,lty=2:1,col=1:2,lwd=1:2)'''<br />
<br />
[[Image:SMHS Fig3 TimeSeries Analysis.png|500px]]<br />
<br />
The ACF and the sample ACF of some stationary processes<br />
*White noise, $x(t)=w(t)$, and the moving average $x(t)=w(t)+\frac{1}{3} \left(w(t-1)+w(t-2)+w(t-3) \right)$: plot bar plots of their theoretical ACFs:<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(0,lag.max=10),main='white noise=w(t)')<br />
barplot(ARMAacf(ma=c(1/3,1/3,1/3),lag.max=10),main='acfx(t)=w(t)+1/3w(t-1)+1/3w(t-2)+1/3w(t-3)')<br />
<br />
[[Image:SMHS_Fig4_TimeSeries_Analysis.png|500px]]<br />
<br />
*Theoretical acf of some processes: autoregressive processes $x_{t}=0.9x_{t-1}+w_{t}$ and $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}.$<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(c(0.9),lag.max=30),main='acf x(t)=0.9x(t-1)+w(t)')<br />
barplot(ARMAacf(c(1,-0.9),lag.max=30),main='acf x(t)=x(t-1)-0.9x(t-2)+w(t)')<br />
<br />
<br />
[[Image:SMHA_Fig5_TimeSeries_Analysis.png|500px]]<br />
<br />
*Recognizing patterns in probabilistic data <br />
**Example 1: compare white noise, $x_{t}=w_{t}$, with the autoregressive process $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$<br />
<br />
RCODE:<br />
w1 <- rnorm(200) ## generate 200 data from the standard normal distribution<br />
x <- filter(w1, filter=c(1,-0.9),method='recursive')[-(1:50)]<br />
w2 <- w1[-(1:50)]<br />
xt <- ts(x)<br />
par(mfrow=c(2,2))<br />
plot.ts(w2,main='white noise')<br />
acf(w2,lag.max=25)<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig6_TimeSeries_Analysis.png|500px]]<br />
<br />
*There is almost no autocorrelation in the white noise series, while the autoregressive series shows strong autocorrelation.<br />
<br />
Example 2: AR & MA, $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$ and $v_{t}=\frac{1}{3}\left(w_{t-1}+w_{t}+w_{t+1} \right)$<br />
<br />
RCODE:<br />
v <- filter(w1, sides=2, rep(1/3, 3))<br />
vt <- ts(v)[2:199]<br />
par(mfrow=c(2,2))<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
plot.ts(vt,main='moving average')<br />
acf(vt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig7_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
*ACF, PACF and some examples of data in R<br />
<br />
RCODE:<br />
data<-<br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30,33,29,27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27,39,23)<br />
par(mfrow=c(1,2))<br />
plot(data)<br />
qqnorm(data)<br />
qqline(data)<br />
<br />
[[Image:SMHS_Fig8_TimeSeries_Analysis.png|500px]]<br />
<br />
data1<-data[c(-31,-55)]<br />
par(mfrow=c(1,2))<br />
plot.ts(data1)<br />
qqnorm(data1)<br />
qqline(data1)<br />
<br />
[[Image:SMHS_Fig9_TimeSeries_Analysis.png|500px]]<br />
<br />
library(astsa)<br />
acf2(data1)<br />
<br />
*Examining the data: <br />
Diagnostics of outliers<br />
<br />
RCODE:<br />
dat<- <br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30, 33,29, 27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27, 39,23)<br />
plot.ts(dat)<br />
<br />
<br />
[[Image:SMHS_Fig10_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
The plot clearly shows outliers in this data series.<br />
<br />
# what if we use a boxplot here?<br />
boxplot(dat)<br />
<br />
[[Image:SMHS_Fig11_TimeSeries_Analysis.png|500px]]<br />
<br />
identify(rep(1,length(dat)),dat) ## identify outliers by clicking on screen<br />
## the outliers identified are observations 31 and 55:<br />
dat[31]<br />
[1] -44<br />
dat[55]<br />
[1] -2<br />
datr <- dat[c(-31,-55)] ## dat with the outliers removed<br />
qqnorm(datr)<br />
qqline(datr)<br />
<br />
<br />
[[Image:SMHS_Fig12_TimeSeries.png|500px]]<br />
<br />
<br />
## this plot shows that the distribution of the new data series (right) is closer to normal compared to the original data (left).<br />
## The effect of outliers by comparing the ACF and PACF of the data series:<br />
library(astsa)<br />
acf2(dat)<br />
acf2(datr)<br />
[[Image:SMHS_Fig13_TimeSeries.png|800px]]<br />
<br />
*Diagnostics of the error: consider the model $x_{t}=\mu+w_{t}$; check whether the mean of the probability distribution of the error is 0; whether the variance of the error is constant across time; whether the errors are independent and uncorrelated; and whether the probability distribution of the error is normal. <br />
:::Residual plot for functional form<br />
<br />
[[Image:SMHS_Fig14_TimeSeries.png|500px]]<br />
<br />
:::Residual plot for equal variance<br />
<br />
[[Image:SMHS_Fig15_TimeSeries.png|500px]]<br />
<br />
:::Check if residuals are correlated by plotting $e_{t}$ vs. $e_{t-1}$: there are three possible cases – positive autocorrelation, negative autocorrelation and no autocorrelation. <br />
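A sketch of this lag-1 check, assuming simulated stand-in residuals with positive autocorrelation (an AR(1) series with coefficient 0.8, chosen only for illustration; real residuals would come from a fitted model):<br />

```r
set.seed(2)                      # arbitrary seed
n <- 500
# stand-in "residuals": an AR(1) series, so consecutive values are related
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
e_t    <- e[-1]                  # e(t)   for t = 2, ..., n
e_lag1 <- e[-n]                  # e(t-1) for t = 2, ..., n
plot(e_lag1, e_t, xlab = 'e(t-1)', ylab = 'e(t)', main = 'lag-1 residual plot')
# sign of the lag-1 correlation: > 0 positive, < 0 negative, near 0 none
r1 <- cor(e_lag1, e_t)
print(r1)
```

An upward-sloping cloud indicates positive autocorrelation, a downward-sloping cloud negative autocorrelation, and a shapeless cloud no autocorrelation.<br />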
<br />
*Smoothing: a series consists of some structure plus random variation, so we identify the structure by averaging out the noise. <br />
::Moving averages:<br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma9<-filter(datr,sides=2,rep(1/9,9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma9,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma9<br />
<br />
[[Image:SMHS_Fig16_TimeSeries.png|500px]]<br />
<br />
*The wider the moving average window, the smoother the time series plot of the data series. <br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma31<br />
<br />
[[Image:SMHS_Fig17_TimeSeries.png|500px]]<br />
<br />
Two sided and one sided filters with same coefficients seem to be similar in their smoothing effect. <br />
<br />
## the effect of equal vs. unequal weights? <br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
mawt<-filter(datr,sides=1,c(1/9,1/9,7/9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma31<br />
par(new=T)<br />
plot.ts(mawt,ylim=c(15,50),col=3,ylab='data') ## green line indicates mawt<br />
<br />
[[Image:SMHS_Fig18_TimeSeries.png|500px]]<br />
<br />
The smoothing effect of filter with equal weights seems to be bigger than the filter with unequal weights.<br />
<br />
*Simple Exponential Smoothing (SES): premise – the most recent observations might have the highest predictive value, $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$, that is, new forecast = $\alpha \times \text{actual value} + (1-\alpha) \times \text{previous forecast}$. This method is appropriate for series with no trend or seasonality.<br />
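The SES recursion can be written out by hand and compared with ''HoltWinters''. This is a minimal sketch on simulated data; the series, the seed and alpha=0.5 are illustrative assumptions, and the starting value (first forecast = first observation) is assumed to match the default used by ''HoltWinters'' for simple exponential smoothing.<br />

```r
set.seed(3)                      # arbitrary seed
x <- 20 + rnorm(60)              # simulated series without trend or seasonality
alpha <- 0.5                     # assumed smoothing constant
n <- length(x)
xhat <- rep(NA_real_, n)         # no forecast exists for the first observation
xhat[2] <- x[1]                  # assumed start: first forecast = first observation
for (t in 3:n) {
  # new forecast = alpha * actual value + (1 - alpha) * previous forecast
  xhat[t] <- alpha * x[t - 1] + (1 - alpha) * xhat[t - 1]
}
hw <- HoltWinters(x, alpha = alpha, beta = FALSE, gamma = FALSE)
# the first column of the fitted values holds the one-step-ahead forecasts
max_diff <- max(abs(xhat[-1] - as.numeric(hw[['fitted']][, 1])))
print(max_diff)
```

A difference at the level of rounding error indicates that the hand-written recursion reproduces the built-in smoother.<br />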
<br />
*For series with trend, we can use the Holt-Winters exponential smoothing additive model $\hat{x}_{t+h}=a_{t}+h*b_{t}.$ We smooth the level or permanent component, the trend component and the seasonal component.<br />
<br />
RCODE:<br />
HoltWinters(series,beta=FALSE,gamma=FALSE) #simple exponential smoothing<br />
HoltWinters(series,gamma=FALSE) # HoltWinters trend<br />
HoltWinters(series,seasonal= 'additive') # HoltWinters additive trend+seasonal<br />
HoltWinters(series,seasonal= 'multiplicative') # HoltWinters multiplicative trend+seasonal<br />
<br />
<br />
<br />
Example: RCODE: (effect of different values of alpha on datr SES)<br />
par(mfrow=c(3,1))<br />
exp05=HoltWinters(datr,alpha=0.05,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p05=predict(exp05,20) <br />
plot(exp05,p05, main = 'Holt- Winters filtering, alpha=0.05')<br />
exp50=HoltWinters(datr, alpha=0.5,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p50=predict(exp50,20)<br />
plot(exp50,p50, main = 'Holt-Winters filtering, alpha=0.5')<br />
exp95=HoltWinters(datr, alpha=0.95,beta=FALSE,gamma= FALSE) #simple exponential smoothing <br />
p95=predict(exp95,20)<br />
plot(exp95,p95,main = "Holt-Winters filtering, alpha=0.95")<br />
<br />
<br />
[[Image:SMHS_Fig19_TimeSeries.png|500px]]<br />
<br />
*Comparing two models: we can compare ‘fit’ and forecast ability. <br />
<br />
RCODE:<br />
obs <- exp50$\$$x[-1]<br />
fit <- exp50$\$$fitted[,1]<br />
r1 <- obs-fit<br />
plot(r1)<br />
par(mfrow=c(1,2))<br />
plot(r1)<br />
acf(r1)<br />
<br />
[[Image:SMHS_Fig20_TimeSeries.png|500px]]<br />
<br />
*Regression based (nonparametric) smoothing: (1) kernel: extension of a two-sided MA. Use a bandwidth (number of terms) plus a kernel function to smooth a set of values, ''ksmooth''; (2) local linear regression (loess): fit local polynomial regressions and join them together, ''loess''; (3) fitting cubic splines: extension of polynomial regression, partition data and fit separate piecewise regression to each section, smooth them together where they join, ''smooth.spline''; (4) many more, ''lowess'', ''supsmu''.<br />
::•Kernel smoother: define bandwidth; choose a kernel; use robust regression to get a smooth estimate; the bigger the bandwidth, the smoother the result.<br />
::•Loess: define the window width; choose a weight function; use polynomial regression and weighted least squares to get a smooth estimate; the bigger the span, the smoother the result.<br />
::•Splines: divide data into intervals; in each interval, fit a polynomial regression of order p such that the splines are continuous at knots; use a parameter (spar in R) to smooth the splines; the bigger the spar, the smoother the result.<br />
::•Issues with regression based (nonparametric) smoothing: (1) the size of the span has an important effect on the curve; (2) a span that is too small (meaning that insufficient data fall within the window) produces a curve characterized by a lot of noise, which results in a large variance; (3) if the span is too large, the regression will be oversmoothed, and thus the local polynomial may not fit the data well. This may result in loss of important information, and thus the fit will have large bias; (4) we want the smallest span that provides a smooth structure.<br />
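The three smoothers can be compared side by side. The sketch below uses simulated data (a sine wave plus noise); the bandwidth, span and spar values are arbitrary choices for illustration, not recommendations.<br />

```r
set.seed(4)                      # arbitrary seed
x <- 1:200
y <- sin(2 * pi * x / 100) + rnorm(200, sd = 0.3)   # smooth signal plus noise
ks <- ksmooth(x, y, kernel = 'normal', bandwidth = 20, x.points = x)
lo <- loess(y ~ x, span = 0.3)
sp_rough  <- smooth.spline(x, y, spar = 0.3)   # small spar: wigglier fit
sp_smooth <- smooth.spline(x, y, spar = 0.9)   # large spar: smoother fit
plot(x, y, col = 'grey', main = 'kernel, loess and spline smoothers')
lines(ks, col = 2)                      # red: kernel smoother
lines(x, fitted(lo), col = 3)           # green: loess
lines(sp_smooth, col = 4)               # blue: smoothing spline
# roughness measured as the sum of squared second differences of the fit
roughness <- function(f) sum(diff(f, differences = 2)^2)
r_rough  <- roughness(predict(sp_rough, x)[['y']])
r_smooth <- roughness(predict(sp_smooth, x)[['y']])
print(c(rough = r_rough, smooth = r_smooth))
```

The roughness measure is a hypothetical helper introduced here only to make "the bigger the spar, the smoother the result" visible as a number.<br />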
<br />
*A regression model for trend: given a time series $y_{t}, t=1,2,\cdots,T$, fit a regression model for trend. If a plot of the data suggests a linear trend, we fit a trend line, say $y_{t}=\beta_{1}+\beta_{2} t + \varepsilon_{t}$. <br />
::•Assumptions: variance of error is constant across observations; errors are independent / uncorrelated across observations; the regressor variables and error are independent; we may also assume probability distribution of error is normal.<br />
<br />
RCODE<br />
fit <- lm(x~z)<br />
summary(fit) # coefficient estimates, standard errors and fit statistics<br />
plot.ts(fit$\$$resid) # plot the residuals of the fitted regression model<br />
acf2(fit$\$$resid) # the ACF and PACF of the residuals of the fitted regression model<br />
qqnorm(fit$\$$resid)<br />
qqline(fit$\$$resid)<br />
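Since x and z are not defined in the snippet above, the following sketch simulates a series with a linear trend and runs the same diagnostics. The intercept 2, slope 0.5 and noise level are made-up values, and base R's ''acf'' and ''pacf'' are used in place of ''acf2'' from astsa.<br />

```r
set.seed(5)                                # arbitrary seed
tt <- 1:100                                # time index t = 1, ..., T
yt <- 2 + 0.5 * tt + rnorm(100)            # assumed trend line plus white noise
trend_fit <- lm(yt ~ tt)
print(coef(trend_fit))                     # estimates of beta1 and beta2
res <- resid(trend_fit)
par(mfrow = c(2, 2))
plot.ts(res, main = 'residuals')           # look for leftover structure
acf(res)                                   # residual autocorrelation
pacf(res)                                  # residual partial autocorrelation
qqnorm(res); qqline(res)                   # normality check
```

If the trend model is adequate, the residual plot shows no pattern and the ACF and PACF stay within their bounds.<br />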
<br />
<br />
*Analyzing patterns in data and modeling: seasonality<br />
**Exploratory analysis<br />
<br />
RCODE:<br />
data=c( 118,132,129,121,135,148,148,136,119,104,118,115,126,141,135,125,<br />
149,170,170,158,133,114,140,145,150,178,163,172,178,199,199,184,162,146,<br />
166,171,180,193,181,183,218,230,242,209,191,172,194,196,196,236,235,<br />
229,243,264,272,237,211,180,201,204,188,235,227,234,264,302,293,259,229,<br />
203,229,242,233,267,269,270,315,364,347,312,274,237,278,284,277,317,313,<br />
318,374,413,405,255,306,271,306,300)<br />
air <- ts(data,start=c(1949,1),frequency=12)<br />
y <- log(air)<br />
par(mfrow=c(1,2))<br />
plot(air)<br />
plot(y)<br />
<br />
<br />
[[Image:SMHS_Fig21_TimeSeries.png|500px]]<br />
<br />
<br />
## from the charts above, taking log helps a little bit with the increasing variability. <br />
<br />
monthplot(y)<br />
<br />
<br />
[[Image:SMHS_Fig22_TimeSeries.png|500px]]<br />
<br />
<br />
acf2(y)<br />
<br />
[[Image:SMHS_Fig23_TimeSeries.png|500px]]<br />
<br />
*Smoothing: the Holt-Winters method smooths and obtains three parts of a series: level, trend and seasonal. The method requires smoothing constants: either your own choice or those which minimize the sum of squared one-step-ahead forecast errors, $\sum_{t=1}^{T} \left(\hat{x}_{t} - x_{t} \right )^{2}$. So at a given time point, smoothed value = (level+h*trend)+seasonal index (additive); smoothed value = (level+h*trend)*seasonal index (multiplicative). Additive: $\hat{x}_{t+h}=a_{t}+h*b_{t}+s_{t-p+1+(h-1)\bmod{p}}$; Multiplicative: $\hat{x}_{t+h}=(a_{t}+h*b_{t})* s_{t-p+1+(h-1)\bmod{p}}$, where $p$ is the period of the seasonality, so the seasonal index is always taken from the most recent full season.<br />
<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
sa <- HoltWinters(air,seasonal=c('additive'))<br />
psa <- predict(sa,60)<br />
plot(sa,psa,main='Holt-Winters filtering, airline passengers, HW additive')<br />
sm <- HoltWinters(air,seasonal=c('multiplicative'))<br />
psam <- predict(sm,60)<br />
plot(sm,psam,main='Holt-Winters filtering, airline passengers, HW multiplicative')<br />
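The additive forecast formula can be checked against ''predict''. This sketch assumes the built-in AirPassengers series (log scale) rather than the typed-in vector above, and rebuilds the forecasts from the fitted coefficients a, b and s1, ..., sp.<br />

```r
y_log <- log(AirPassengers)                 # built-in monthly series, frequency 12
hw_add <- HoltWinters(y_log, seasonal = 'additive')
p <- frequency(y_log)                       # period of the seasonality
cf <- hw_add[['coefficients']]              # named vector: a, b, s1, ..., sp
h <- 1:24                                   # forecast horizons
# forecast = level + h * trend + seasonal index of the matching month;
# 1 + (h - 1) %% p cycles through the last p seasonal estimates
manual <- cf[['a']] + h * cf[['b']] + cf[paste0('s', 1 + (h - 1) %% p)]
builtin <- as.numeric(predict(hw_add, n.ahead = 24))
max_diff <- max(abs(builtin - as.numeric(manual)))
print(max_diff)
```

The modulo term only decides which of the p stored seasonal indices applies to horizon h.<br />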
<br />
[[Image:SMHS_Fig24_TimeSeries.png|500px]]<br />
<br />
## seasonal decomposition using loess, log(airpassengers)<br />
<br />
[[Image:SMHS_Fig25_TimeSeries.png|500px]]<br />
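The code that produced this decomposition is not shown. One way to obtain a loess-based seasonal decomposition is ''stl'' from base R; the sketch assumes the built-in AirPassengers series and a periodic seasonal window, which may differ from the settings used for the figure.<br />

```r
y_log <- log(AirPassengers)                 # built-in series, log scale
dec <- stl(y_log, s.window = 'periodic')    # seasonal, trend and remainder
plot(dec, main = 'STL decomposition of log(AirPassengers)')
parts <- dec[['time.series']]               # columns: seasonal, trend, remainder
# the three components add back up to the original series
recon_err <- max(abs(rowSums(parts) - as.numeric(y_log)))
print(recon_err)
```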
<br />
*Regression: estimation, diagnostics, interpretation. Defining dummy variables to capture the seasonal effect in regression. If there is a variable with k categories, there will be k-1 dummies if the intercept is included in the regression (exclude one group); there will be k dummies if the intercept is not included in the regression; if the intercept is included in the regression, the coefficient of a dummy variable is always the difference of means between that group and the excluded group. <br />
<br />
RCODE:<br />
t <- time(y)<br />
Q <- factor(rep(1:12,8))<br />
h <- lm(y~time(y)+Q) # regression with intercept, drops first category = January<br />
model.matrix(h)<br />
h1 <- lm(y~0+time(y)+Q) # regression without intercept<br />
model.matrix(h1)<br />
x1 <- ts(y[1:72],start=c(1949,1),frequency=12)<br />
t1 <- t[1:72]<br />
Q1 <- Q[1:72]<br />
contrasts(Q1)=contr.treatment(12,base=10)<br />
h2 <- lm(x1~t1+Q1) # regression with intercept, drops 10th category = October<br />
model.matrix(h2)<br />
reg <- lm(x1~t1+Q1)<br />
par(mfrow=c(2,2))<br />
plot.ts(reg$\$$resid)<br />
qqnorm(reg$\$$resid)<br />
qqline(reg$\$$resid)<br />
acf(reg$\$$resid)<br />
pacf(reg$\$$resid)<br />
<br />
[[Image:SMHS_Fig26_TimeSeries.png|500px]]<br />
<br />
## forecasts RCODE<br />
newdata=data.frame(t1=t[73:96], Q1=Q[73:96]) <br />
preg=predict(reg,newdata,se.fit=TRUE) <br />
seup=preg$\$$fit+2*preg$\$$se.fit<br />
seup <br />
1 2 3 4 5 6 7 8 9 <br />
5.577144 5.722867 5.679936 5.667979 5.780376 5.881420 5.889829 5.782526 5.655445 <br />
10 11 12 13 14 15 16 17 18 <br />
5.525933 5.661158 5.681042 5.716577 5.862300 5.819369 5.807412 5.919809 6.020853 <br />
19 20 21 22 23 24 <br />
6.029262 5.921959 5.794878 5.665366 5.800591 5.820475 <br />
selow=preg$\$$fit-2*preg$\$$se.fit<br />
selow<br />
1 2 3 4 5 6 7 8 9 <br />
5.479442 5.625165 5.582234 5.570277 5.682674 5.783718 5.792127 5.684824 5.557743 <br />
10 11 12 13 14 15 16 17 18 <br />
5.428232 5.563456 5.583340 5.610927 5.756650 5.713720 5.701763 5.814160 5.915203 <br />
19 20 21 22 23 24 <br />
5.923613 5.816309 5.689228 5.559717 5.694941 5.714825<br />
<br />
x1r=y[73:96]<br />
t1r=t[73:96]<br />
plot(t1r,x1r, col='red',lwd=1:2,main='Actual vs Forecast', type='l')<br />
lines(t1r,preg$\$$fit, col='green', lwd=1:2)<br />
lines(t1r,seup, col='blue', lwd=1:2)<br />
lines(t1r,selow, col='blue', lwd=1:2)<br />
## red refers to actual; green refers to forecast; blue refers to forecast ± 2 standard errors<br />
<br />
[[Image:SMHS_Fig27_TimeSeries.png|500px]]<br />
<br />
## Holt Winters smoothing: forecasts<br />
sa=HoltWinters(x1, seasonal = c("additive"))<br />
psa=predict(sa,24)<br />
plot(t1r,x1r, col="red",type="l",main = "Holt- Winters filtering, airline passengers, HW additive")<br />
lines(t1r, psa, col="green")<br />
<br />
## red represents the actual, green: forecast<br />
<br />
[[Image:SMHS_Fig28_TimeSeries.png|500px]]<br />
<br />
sqrt(sum((x1r-psa)^2)) ##HW<br />
[1] 0.3988259<br />
sqrt(sum((x1r-preg$\$$fit)^2)) ##reg<br />
[1] 0.3807665<br />
<br />
==Applications==<br />
<br />
1)[http://www.statsoft.com/Textbook/Time-Series-Analysis This article] presents a comprehensive introduction to the field of time series analysis. It discusses the common patterns found in time series data and introduces some commonly used techniques for handling and analyzing time series. <br />
<br />
2)[http://www.sciencedirect.com/science/article/pii/0304393282900125 This article] investigates whether macroeconomic time series are better characterized as stationary fluctuations around a deterministic trend or as non-stationary processes that have no tendency to return to a deterministic path. Using long historical time series for the U.S. we are unable to reject the hypothesis that these series are non-stationary stochastic processes with no tendency to return to a trend line. Based on these findings and an unobserved components model for output that decomposes fluctuations into a secular or growth component and a cyclical component we infer that shocks to the former, which we associate with real disturbances, contribute substantially to the variation in observed output. We conclude that macroeconomic models that focus on monetary disturbances as a source of purely transitory fluctuations may never be successful in explaining a large fraction of output variation and that stochastic variation due to real factors is an essential element of any model of macroeconomic fluctuations.<br />
<br />
<br />
==Software== <br />
See the CODE examples given in this lecture.<br />
<br />
==Problems==<br />
<br />
1) Consider a signal-plus-noise model of the general form $x_{t}=s_{t}+w_{t}$, where $w_{t}$ is Gaussian white noise with $\sigma ^{2} _{w} = 1$. Simulate and plot n=200 observations from each of the following two models.<br />
<br />
(a) $x_{t} = s_{t}+w_{t},$ for $t=1,2,\cdots,200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/20 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(b) $x_{t}=s_{t}+w_{t},$ for $t=1,2,\cdots, 200,$ where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/200 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}.$<br />
<br />
(c) Compare the signal modulators (a) $\exp \left \{-t/20 \right \}$ and (b) $\exp \left \{-t/200 \right \}$, for $t=1,2, \cdots,100.$<br />
<br />
<br />
2) (a) Generate $n=100$ observations from the autoregression $x_{t}=-0.9 x_{t-2} +w_{t}$, with $\sigma_{w}=1,$ using the method described in Example 1.10, page 13 given in the textbook. Next, apply the moving average filter $v_{t}=\left(x_{t}+x_{t-1}+x_{t-2}+x_{t-3} \right )/4$ to $x_{t}$, the data you generated. Now, plot $x_{t}$ as a line and superimpose $v_{t}$ as a dashed line. Comment on the behavior of $x_{t}$ and how applying the moving average filter changes that behavior. [Hint: use ''v=filter(x,rep(1/4,4),sides=1)'' for the filter].<br />
<br />
(b) Repeat (a) but with $x_{t}=\cos \left( 2 \pi t/4 \right).$<br />
<br />
(c) Repeat (b) but with added $N(0,1)$ noise, $x_{t}=\cos \left(2 \pi t/4 \right) +w_{t}.$<br />
<br />
(d) Compare and contrast (a) – (c).<br />
<br />
3) For the two series, $x_{t}$, in Problem 1 (a) and (b):<br />
<br />
(a) Compute and plot the mean functions $\mu _{x} (t)$ for $t=1,2,\cdots,200.$<br />
<br />
(b) Calculate the autocovariance functions, $\gamma _{x} (s,t)$, for $s,t=1,2, \cdots, 200.$<br />
<br />
4) Consider the time series $x_{t} = \beta_{1} + \beta_{2} t + w_{t},$ where $\beta _{1}$ and $\beta_{2}$ are known constants and $w_{t}$ is a white noise process with variance $\sigma ^{2} _{w}.$<br />
<br />
(a) Determine whether $x_{t}$ is stationary.<br />
<br />
(b) Show that the process $y_{t}=x_{t} - x_{t-1}$ is stationary.<br />
<br />
(c) Show that the mean of the moving average $v_{t}= \frac {1}{2q+1} \sum_{j=-q}^{q} x_{t-j},$ is $\beta _{1} + \beta _{2} t,$ and give a simplified expression for the autocovariance function. <br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries&diff=14462SMHS TimeSeries2014-10-20T15:30:12Z<p>Clgalla: /* ## not sure about the mod here?! */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Time Series Analysis ==<br />
<br />
===Overview===<br />
Time series data is a sequence of data points measured at successive points in time, spaced at uniform intervals. Time series analysis is commonly used in a variety of studies, such as monitoring industrial processes and tracking business metrics. In this lecture, we present a general introduction to time series data and introduce some of the most commonly used techniques in the rich and rapidly growing field of time series modeling and analysis for extracting meaningful statistics and other characteristics of the data. We also illustrate the application of time series analysis techniques with examples in R. <br />
<br />
===Motivation===<br />
Economic data like daily share price, monthly sales, annual income or physical data like daily temperature, ECG readings are all examples of time series data. Time series is just an ordered sequence of values of a variable measured at equally spaced time intervals. So, what would be the effective ways to measure time series data? How can we extract information from time series data and make inference afterwards? <br />
<br />
===Theory===<br />
Time series: a sequence of data points measured at successive points in time, spaced at uniform intervals.<br />
*Components: (1) trend component: long-run increase or decrease over time (overall upward or downward movement), typically refers to data taken over a long period of time; the trend can be upward or downward, linear or non-linear. (2) seasonal component: short-term regular wave-like patterns, usually refers to data observed within one year, which may be measured monthly or quarterly. (3) cyclical component: long-term wave-like patterns; regularly occur but may vary in length; often measured peak to peak or trough to trough. (4) irregular component: unpredictable, random, ‘residual’ fluctuation, which may be due to random variations of nature or accidents or unusual events; ‘noise’ in the time series.<br />
*Probabilistic models: The components still add or multiply together across time.<br />
[[Image:SMHS Fig 1 Times Series Analysis.png|400px]]<br />
*Simple Time Series Models: <br />
**Take $T_{t}$ as the trend component, $S_{t}$ as the seasonal component, $C_{t}$ as the cyclical component and $I_{t}$ as the irregular or random component. Then we have an additive model as: $x_{t}=T_{t}+S_{t}+C_{t}+I_{t};$ the multiplicative model says $x_{t}=T_{t}*S_{t}*C_{t}*I_{t};$ sometimes, we take logs on both sides of the multiplicative model to make it additive, $\log x_{t}=\log\left ( T_{t}*S_{t}*C_{t}*I_{t} \right )$, which can be rewritten as $x_{t}{}'=T_{t}{}'+S_{t}{}'+C_{t}{}'+I_{t}{}'$, where the primes denote the logged components.<br />
**Most time series models are written in terms of an (unobserved) white noise process, which is often assumed to be Gaussian: $x_{t}=w_{t}$, where $w_{t}\sim WN\left ( 0,\sigma ^{2} \right )$, that is, $E\left ( w_{t} \right )=0, Var\left(w_{t} \right )=\sigma ^{2}$. Examples of probabilistic models include: (1) Autoregressive model $x_{t}=0.6x_{t-1}+w_{t};$ (2) Moving average model $x_{t}= \frac{1}{3}\left(w_{t}+w_{t-1}+w_{t-2}\right).$<br />
**Fitting time series models: a time series model generates a process whose pattern can then be matched in some way to the observed data; since perfect matches are impossible, it is possible that more than one model will be appropriate for a set of data; to decide which model is appropriate: patterns suggest choices, assess within sample adequacy (diagnostics, tests), outside sample adequacy (forecast evaluation), simulation from suggested model and compare with observed data; next, turn to some theoretical aspects like how to characterize a time series, and then investigate some special processes. <br />
**Characterizing time series (the mean and covariance of time series): suppose the data are $x_{1},x_{2},\cdots ,x_{t};$ note that for a regularly spaced time series, $x_{1}$ is observed at time $t_{0}$, $x_{2}$ is observed at $t_{0}+\Delta$, and $x_{t}$ is observed at $t_{0}+\left( t-1\right) \Delta$; the expected value of $x_{t}$ is $\mu _{t}=E\left [ x_{t} \right ]$; the covariance function is $\gamma \left(s,t \right )=cov\left(x_{s},x_{t} \right )$. Note that we don’t distinguish between random variables and their realizations. Properly, $x_{1},x_{2},\cdots ,x_{t}$ is a realization of some process $X_{1},X_{2},\cdots ,X_{t}$ and so $\mu _{t}=E\left [ X_{t} \right ]$ etc.<br />
**Characterizing time series data: weak stationarity: usually we have only one realization of the time series, so there are not enough data to estimate all the means and covariances. Estimation becomes feasible if the mean is constant and the covariance depends only on the separation in time; a time series with these properties is (weakly) stationary. In this case $\mu_{t}=\mu$ and $\gamma\left(s,t\right)=\gamma\left(t-s\right)$. Stationary processes are much easier to handle statistically. Real time series are usually non-stationary, but can often be transformed to give a stationary process.<br />
**Estimation of $\gamma(h)$ (Estimating the autocovariance function): (1) for a stationary process: $\gamma(h)=\gamma(s,s+h)=cov(x_{s},x_{s+h})$ for any s.$\ \gamma (h)$ is called the autocovariance function because it measures the covariance between lags of the process $x_{t}$; (2) we observe T-h pairs separated at lag h namely $\left(x_{1},x_{h+1}\right),\cdots,\left(x_{T-h},x_{T}\right)$; the sample autocovariance function is $\hat{\gamma}(h)=\frac{1}{T}\sum_{t=1}^{T-h}\left(x_{t}-\bar{X} \right )\left(x_{t+h}-\bar{X} \right )$, note that we divide by ${T}$ although there are ${T-h}$ terms in the sum. The autocorrelation function ${ACF}$ is $\rho (h)=\frac{\gamma(h)}{\gamma(0)}$.<br />
**$\rho(h)$ properties: for $x_{t}$ stationary, $\gamma(h)=E\left[(x_{t}-\mu)(x_{t+h}-\mu)\right]$, where $\gamma(h)$ is an even function, that is $\gamma(h)=\gamma(-h)$, and $\left | \gamma(h) \right |\leq \gamma(0)$, that is $\left|\rho(h)\right|\leq1, h=\pm 1,\pm 2,\cdots$. The autocorrelation matrix, $P(h)$, is positive semi-definite, which means that the autocorrelations cannot be chosen arbitrarily but must satisfy certain relations among themselves. <br />
**To study a set of time series data: given that we only have one realization, we need to match plot of data from proposed model with data (we can see broad/main patterns); match acf, pacf of proposed model with data (we can see if data are stationary uncorrelated, stationary autocorrelated or non-stationary); determine if the proposed model is appropriate; obtain forecasts from proposed model; compare across models that seem appropriate. <br />
**Large sample distribution of the $ACF$ for a $WN$ series: if $x_{t}$ is $WN$ then, for large n, the sample $ACF$, $\hat{\rho}_{x}(h), h=1,2,\cdots,H$, where H is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation $\sigma_{\hat{\rho}_{x}(h)}=\frac{1}{\sqrt{n}}$. A rule of thumb for assessing whether sample autocorrelations resemble those of a $WN$ series is to check whether approximately 95% of the sample autocorrelations fall within the interval $0\pm \frac{2}{\sqrt{n}}$. <br />
**White noise, or stationary uncorrelated process: we say that a time series is white noise if it is weakly stationary with $E\left[w_{t} \right ]=0$ and $ACF$ $\rho(h)=\begin{cases} 1 & \text{ if } h=0 \\ 0 & \text{ if } h\neq 0 \end{cases}$, i.e. white noise consists of a sequence of uncorrelated, zero mean random variables with common variance $\sigma^{2}$.<br />
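The sample autocovariance formula given in this list (dividing by T rather than T-h) can be computed by hand and compared with ''acf''. This is a minimal sketch on a simulated AR(1) series; the coefficient 0.6, the length 200 and the helper name gamma_hat are assumptions made for illustration.<br />

```r
set.seed(6)                      # arbitrary seed
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))
# sample autocovariance at lag h: T - h products, divided by T
gamma_hat <- function(x, h) {
  n <- length(x)
  xbar <- mean(x)
  sum((x[1:(n - h)] - xbar) * (x[(1 + h):n] - xbar)) / n
}
manual <- sapply(0:10, function(h) gamma_hat(x, h))
builtin <- as.numeric(acf(x, lag.max = 10, type = 'covariance',
                          plot = FALSE)[['acf']])
rho_manual <- manual / manual[1]           # ACF: rho(h) = gamma(h) / gamma(0)
print(max(abs(manual - builtin)))          # agreement with the built-in estimate
print(round(rho_manual[1:4], 3))           # rho(0), ..., rho(3)
```

Dividing by T keeps the estimated autocovariance sequence positive semi-definite, which is why $\left|\hat{\rho}(h)\right|\leq 1$ holds for the sample values as well.<br />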
<br />
*Example of a simple moving average model in R: $v_{t}=\frac{1}{3} (w_{t-1}+w_{t}+w_{t+1})$. <br />
<br />
R CODE: <br />
w <- ts(rnorm(150)) ## generate 150 data from the standard normal distribution<br />
v <- ts(filter(w, sides=2, rep(1/3, 3))) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot(w,main='white noise')<br />
plot(v,main='moving average')<br />
<br />
or can use code:<br />
<br />
w <- rnorm(150) ## generate 150 data from the standard normal distribution<br />
v <- filter(w, sides=2, rep(1/3, 3)) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot.ts(w,main='white noise')<br />
plot.ts(v,main='moving average')<br />
<br />
[[Image:SMHS Fig2 Timeseries Analysis.png|500px]]<br />
<br />
## sums based on WN processes<br />
'''ts.plot(w,v,lty=2:1,col=1:2,lwd=1:2)'''<br />
<br />
[[Image:SMHS Fig3 TimeSeries Analysis.png|500px]]<br />
<br />
The ACF and the sample ACF of some stationary processes<br />
*White noise, $x(t)=w(t)$, and the moving average $x(t)=w(t)+\frac{1}{3} \left(w(t-1)+w(t-2)+w(t-3) \right)$: plot bar plots of their theoretical ACFs:<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(0,lag.max=10),main='white noise=w(t)')<br />
barplot(ARMAacf(ma=c(1/3,1/3,1/3),lag.max=10),main='acfx(t)=w(t)+1/3w(t-1)+1/3w(t-2)+1/3w(t-3)')<br />
<br />
[[Image:SMHS_Fig4_TimeSeries_Analysis.png|500px]]<br />
<br />
*Theoretical acf of some processes: autoregressive processes $x_{t}=0.9x_{t-1}+w_{t}$ and $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}.$<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(c(0.9),lag.max=30),main='acf x(t)=0.9x(t-1)+w(t)')<br />
barplot(ARMAacf(c(1,-0.9),lag.max=30),main='acf x(t)=x(t-1)-0.9x(t-2)+w(t)')<br />
<br />
<br />
[[Image:SMHA_Fig5_TimeSeries_Analysis.png|500px]]<br />
<br />
*Recognizing patterns in probabilistic data <br />
**Example 1: compare white noise, $x_{t}=w_{t}$, with the autoregressive process $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$<br />
<br />
RCODE:<br />
w1 <- rnorm(200) ## generate 200 data from the standard normal distribution<br />
x <- filter(w1, filter=c(1,-0.9),method='recursive')[-(1:50)]<br />
w2 <- w1[-(1:50)]<br />
xt <- ts(x)<br />
par(mfrow=c(2,2))<br />
plot.ts(w2,main='white noise')<br />
acf(w2,lag.max=25)<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig6_TimeSeries_Analysis.png|500px]]<br />
<br />
*There is almost no autocorrelation in the white noise series, while the autoregressive series shows strong autocorrelation.<br />
<br />
Example 2: AR & MA, $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$ and $v_{t}=\frac{1}{3}\left(w_{t-1}+w_{t}+w_{t+1} \right)$<br />
<br />
RCODE:<br />
v <- filter(w1, sides=2, rep(1/3, 3))<br />
vt <- ts(v)[2:199]<br />
par(mfrow=c(2,2))<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
plot.ts(vt,main='moving average')<br />
acf(vt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig7_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
*ACF, PACF and some examples of data in R<br />
<br />
RCODE:<br />
data<-<br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30,33,29,27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27,39,23)<br />
par(mfrow=c(1,2))<br />
plot(data)<br />
qqnorm(data)<br />
qqline(data)<br />
<br />
[[Image:SMHS_Fig8_TimeSeries_Analysis.png|500px]]<br />
<br />
data1<-data[c(-31,-55)]<br />
par(mfrow=c(1,2))<br />
plot.ts(data1)<br />
qqnorm(data1)<br />
qqline(data1)<br />
<br />
[[Image:SMHS_Fig9_TimeSeries_Analysis.png|500px]]<br />
<br />
library(astsa)<br />
acf2(data1)<br />
<br />
*Examining the data: <br />
Diagnostics of outliers<br />
<br />
RCODE:<br />
dat<- <br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30, 33,29, 27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27, 39,23)<br />
plot.ts(dat)<br />
<br />
<br />
[[Image:SMHS_Fig10_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
The time series plot makes it easy to see that this data series contains outliers.<br />
<br />
# what if we use a boxplot here?<br />
boxplot(dat)<br />
<br />
[[Image:SMHS_Fig11_TimeSeries_Analysis.png|500px]]<br />
<br />
identify(rep(1,length(dat)),dat) ## identify outliers by clicking on the boxplot<br />
## the outliers identified are observations 31 and 55:<br />
dat[31]<br />
[1] -44<br />
dat[55]<br />
[1] -2<br />
datr <- dat[c(-31,-55)] ## this is dat with the two outliers removed<br />
qqnorm(datr)<br />
qqline(datr)<br />
<br />
<br />
[[Image:SMHS_Fig12_TimeSeries.png|500px]]<br />
<br />
<br />
## this plot shows that the distribution of the new data series (right) is closer to normal than that of the original data (left).<br />
## The effect of the outliers can be seen by comparing the ACF and PACF of the two data series:<br />
library(astsa)<br />
acf2(dat)<br />
acf2(datr)<br />
[[Image:SMHS_Fig13_TimeSeries.png|800px]]<br />
<br />
*Diagnostics of the error: consider the model $x_{t}=\mu+w_{t}$. Check whether the mean of the error distribution is 0, whether the error variance is constant across time, whether the errors are independent and uncorrelated, and whether the error distribution is normal.<br />
:::Residual plot for functional form<br />
<br />
[[Image:SMHS_Fig14_TimeSeries.png|500px]]<br />
<br />
:::Residual plot for equal variance<br />
<br />
[[Image:SMHS_Fig15_TimeSeries.png|500px]]<br />
<br />
:::Check if residuals are correlated by plotting $e_{t}$ vs. $e_{t-1}$: there are three possible cases – positive autocorrelation, negative autocorrelation and no autocorrelation.<br />
<br />
*Smoothing: a series consists of some structure plus random variation, so we average out the noise to identify the structure.<br />
::Moving averages:<br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma9<-filter(datr,sides=2,rep(1/9,9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma9,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma9<br />
<br />
[[Image:SMHS_Fig16_TimeSeries.png|500px]]<br />
<br />
*The wider the moving-average window, the smoother the time series plot of the data series.<br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma31<br />
<br />
[[Image:SMHS_Fig17_TimeSeries.png|500px]]<br />
<br />
Two-sided and one-sided filters with the same coefficients have a similar smoothing effect; with three equal weights, the one-sided smooth is the two-sided smooth shifted one time step later.<br />
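<br />
The relationship between the two filters can be verified numerically. This is a small sketch on a simulated series (the series and variable names are assumptions for illustration, not the data used above); it relies only on ''filter'' from the stats package.<br />

```r
set.seed(1)
x <- rnorm(20)                                   # hypothetical series
two <- stats::filter(x, rep(1/3, 3), sides = 2)  # centered: (x[t-1] + x[t] + x[t+1]) / 3
one <- stats::filter(x, rep(1/3, 3), sides = 1)  # trailing: (x[t] + x[t-1] + x[t-2]) / 3
n <- length(x)

# The trailing average at time t equals the centered average at time t-1
shift_ok <- all.equal(as.numeric(one[3:n]), as.numeric(two[2:(n - 1)]))
print(shift_ok)
```

The one-sided filter uses only current and past values, so it is the version that can be applied in real time; the price is the one-step delay.<br />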
<br />
## the effect of equal vs. unequal weights? <br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
mawt<-filter(datr,sides=1,c(1/9,1/9,7/9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma31<br />
par(new=T)<br />
plot.ts(mawt,ylim=c(15,50),col=3,ylab='data') ## green line indicates mawt<br />
<br />
[[Image:SMHS_Fig18_TimeSeries.png|500px]]<br />
<br />
The filter with equal weights appears to smooth more than the filter with unequal weights.<br />
<br />
*Simple Exponential Smoothing (SES): the premise is that the most recent observations might have the highest predictive value, $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$; that is, new forecast = $\alpha \times$ actual value $+ (1-\alpha) \times$ previous forecast. This method is appropriate for series with no trend or seasonality.<br />
<br />
*For series with trend, we can use the Holt-Winters exponential smoothing additive model $\hat{x}_{t+h}=a_{t}+h*b_{t}.$ In general, Holt-Winters smooths the level (or permanent) component, the trend component and the seasonal component.<br />
<br />
RCODE:<br />
HoltWinters(series,beta=FALSE,gamma=FALSE) #simple exponential smoothing<br />
HoltWinters(series,gamma=FALSE) # HoltWinters trend<br />
HoltWinters(series,seasonal= 'additive') # HoltWinters additive trend+seasonal<br />
HoltWinters(series,seasonal= 'multiplicative') # HoltWinters multiplicative trend+seasonal<br />
<br />
<br />
<br />
Example: RCODE: (effect of different values of alpha on datr SES)<br />
par(mfrow=c(3,1))<br />
exp05=HoltWinters(datr,alpha=0.05,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p05=predict(exp05,20) <br />
plot(exp05,p05, main = 'Holt-Winters filtering, alpha=0.05')<br />
exp50=HoltWinters(datr, alpha=0.5,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p50=predict(exp50,20)<br />
plot(exp50,p50, main = 'Holt-Winters filtering, alpha=0.5')<br />
exp95=HoltWinters(datr, alpha=0.95,beta=FALSE,gamma= FALSE) #simple exponential smoothing <br />
p95=predict(exp95,20)<br />
plot(exp95,p95,main = "Holt-Winters filtering, alpha=0.95")<br />
<br />
<br />
[[Image:SMHS_Fig19_TimeSeries.png|500px]]<br />
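<br />
To see what the SES fits above are doing, the recursion $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$ can be written out directly. The sketch below uses a simulated series and assumes the first observation is used as the starting forecast, which is our understanding of how ''HoltWinters'' initialises the level when beta and gamma are switched off.<br />

```r
set.seed(2)
x <- 25 + rnorm(40)        # hypothetical series without trend or seasonality
alpha <- 0.5

# Manual recursion, initialised with the first observation as the first forecast
xhat <- rep(NA_real_, length(x))
xhat[2] <- x[1]
for (t in 3:length(x)) {
  xhat[t] <- alpha * x[t - 1] + (1 - alpha) * xhat[t - 1]
}

# Same smoother via HoltWinters(); its fitted values start at t = 2
hw <- HoltWinters(x, alpha = alpha, beta = FALSE, gamma = FALSE)
hw_fit <- as.numeric(hw$fitted[, "xhat"])
print(all.equal(xhat[-1], hw_fit))
```

A small alpha gives a forecast that reacts slowly to new observations, while an alpha near 1 makes the forecast track the most recent value.<br />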
<br />
*Comparing two models: we can compare ‘fit’ and forecast ability. <br />
<br />
RCODE:<br />
obs <- exp50$\$$x[-1]<br />
fit <- exp50$\$$fitted[,1]<br />
r1 <- obs-fit<br />
plot(r1)<br />
par(mfrow=c(1,2))<br />
plot(r1)<br />
acf(r1)<br />
<br />
[[Image:SMHS_Fig20_TimeSeries.png|500px]]<br />
<br />
*Regression based (nonparametric) smoothing: (1) kernel: an extension of a two-sided MA that uses a bandwidth (number of terms) plus a kernel function to smooth a set of values, ''ksmooth''; (2) local linear regression (loess): fit local polynomial regressions and join them together, ''loess''; (3) cubic splines: an extension of polynomial regression that partitions the data, fits a separate piecewise regression to each section, and smooths the pieces together where they join, ''smooth.spline''; (4) many more, such as ''lowess'' and ''supsmu''.<br />
::•Kernel smoother: define the bandwidth; choose a kernel; use robust regression to get a smooth estimate; the bigger the bandwidth, the smoother the result.<br />
::•Loess: define the window width (span); choose a weight function; use polynomial regression and weighted least squares to get a smooth estimate; the bigger the span, the smoother the result.<br />
::•Splines: divide data into intervals; in each interval, fit a polynomial regression of order p such that the splines are continuous at knots; use a parameter (spar in R) to smooth the splines; the bigger the spar, the smoother the result.<br />
::•Issues with regression based (nonparametric) smoothing: (1) the size of the span, $S$, has an important effect on the curve; (2) a span that is too small (meaning that insufficient data fall within the window) produces a noisy curve with large variance; (3) a span that is too large oversmooths, so the local polynomial may not fit the data well; this may lose important information, and the fit will have large bias; (4) we want the smallest span that provides a smooth structure.<br />
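<br />
The three smoothers described above can be compared on a toy series. This is only a sketch: the simulated data, the particular bandwidth, span and spar values, and the roughness measure (sum of squared second differences) are all assumptions chosen for illustration.<br />

```r
set.seed(3)
tt <- 1:100
y <- sin(2 * pi * tt / 50) + rnorm(100, sd = 0.3)   # hypothetical noisy series

# Roughness measure: sum of squared second differences (smaller = smoother)
rough <- function(v) sum(diff(v, differences = 2)^2)

# Each smoother is run with a small and a large smoothing parameter
k_small <- ksmooth(tt, y, kernel = "normal", bandwidth = 3)$y
k_large <- ksmooth(tt, y, kernel = "normal", bandwidth = 15)$y
l_small <- fitted(loess(y ~ tt, span = 0.2))
l_large <- fitted(loess(y ~ tt, span = 0.8))
s_small <- smooth.spline(tt, y, spar = 0.3)$y
s_large <- smooth.spline(tt, y, spar = 1.0)$y

smoother_with_more <- c(kernel = rough(k_small) > rough(k_large),
                        loess  = rough(l_small) > rough(l_large),
                        spline = rough(s_small) > rough(s_large))
print(smoother_with_more)
```

In each case the larger smoothing parameter should give the less rough curve, which is the bias-variance trade-off listed under the issues above.<br />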
<br />
*A regression model for trend: given a time series $y_{t}, t=1,2,\cdots,T,$ fit a regression model for trend. If the plot of the series suggests a linear trend, we fit a trend line, say $y_{t}=\beta_{1}+\beta_{2} t + \varepsilon_{t}$.<br />
::•Assumptions: variance of error is constant across observations; errors are independent / uncorrelated across observations; the regressor variables and error are independent; we may also assume probability distribution of error is normal.<br />
<br />
RCODE:<br />
fit <- lm(x~z)<br />
summary(fit) # coefficient estimates, standard errors and tests<br />
plot.ts(fit$\$$resid) # plot the residuals of the fitted regression model<br />
acf2(fit$\$$resid) # the ACF and PACF of the residuals of the fitted regression model<br />
qqnorm(fit$\$$resid)<br />
qqline(fit$\$$resid)<br />
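<br />
Since x and z are not defined in the listing above, here is a self-contained sketch of the same workflow on simulated data. The intercept 2, slope 0.5 and unit noise variance are made-up values for illustration.<br />

```r
set.seed(4)
tt <- 1:100
y <- 2 + 0.5 * tt + rnorm(100)      # hypothetical series with a linear trend

trend_fit <- lm(y ~ tt)             # y_t = beta_1 + beta_2 * t + error
print(coef(trend_fit))

res <- resid(trend_fit)
# Quick numerical diagnostics: residual mean and lag-1 autocorrelation
ac1 <- acf(res, plot = FALSE)$acf[2]
print(c(mean_resid = mean(res), lag1_acf = ac1))
```

For independent errors the lag-1 autocorrelation of the residuals should be close to zero; a large value would suggest that the independence assumption is violated.<br />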
<br />
<br />
*Analyzing patterns in data and modeling: seasonality<br />
**Exploratory analysis<br />
<br />
RCODE:<br />
data=c( 118,132,129,121,135,148,148,136,119,104,118,115,126,141,135,125,<br />
149,170,170,158,133,114,140,145,150,178,163,172,178,199,199,184,162,146,<br />
166,171,180,193,181,183,218,230,242,209,191,172,194,196,196,236,235,<br />
229,243,264,272,237,211,180,201,204,188,235,227,234,264,302,293,259,229,<br />
203,229,242,233,267,269,270,315,364,347,312,274,237,278,284,277,317,313,<br />
318,374,413,405,255,306,271,306,300)<br />
air <- ts(data,start=c(1949,1),frequency=12)<br />
y <- log(air)<br />
par(mfrow=c(1,2))<br />
plot(air)<br />
plot(y)<br />
<br />
<br />
[[Image:SMHS_Fig21_TimeSeries.png|500px]]<br />
<br />
<br />
## from the charts above, taking log helps a little bit with the increasing variability. <br />
<br />
monthplot(y)<br />
<br />
<br />
[[Image:SMHS_Fig22_TimeSeries.png|500px]]<br />
<br />
<br />
acf2(y)<br />
<br />
[[Image:SMHS_Fig23_TimeSeries.png|500px]]<br />
<br />
*Smoothing: the Holt-Winters method smooths a series and separates it into three parts: level, trend and seasonal. The method requires smoothing constants, either chosen by the user or chosen to minimize the one-step-ahead forecast error, $\sum_{t=1}^{T} \left(\hat{x}_{t} - x_{t} \right )^{2}$, where $t=1,2,\cdots,T$. At a given time point, smoothed value = (level+h*trend)+seasonal index (additive); smoothed value = (level+h*trend)*seasonal index (multiplicative). With seasonal period $p$: additive, $\hat{x}_{t+h}=a_{t}+h*b_{t}+s_{t-p+1+(h-1) \bmod p}$; multiplicative, $\hat{x}_{t+h}=(a_{t}+h*b_{t})* s_{t-p+1+(h-1) \bmod p}.$<br />
<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
sa <- HoltWinters(air,seasonal=c('additive'))<br />
psa <- predict(sa,60)<br />
plot(sa,psa,main='Holt-Winters filtering, airline passengers, HW additive')<br />
sm <- HoltWinters(air,seasonal=c('multiplicative'))<br />
psam <- predict(sm,60)<br />
plot(sm,psam,main='Holt-Winters filtering, airline passengers, HW multiplicative')<br />
<br />
[[Image:SMHS_Fig24_TimeSeries.png|500px]]<br />
<br />
## seasonal decomposition using loess, log(airpassengers)<br />
<br />
[[Image:SMHS_Fig25_TimeSeries.png|500px]]<br />
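<br />
The code behind the decomposition figure is not shown above. One plausible way to produce such a figure is ''stl'' from the stats package; the sketch below substitutes R's built-in AirPassengers series (1949-1960) for the air series typed in earlier, and the choice s.window = "periodic" is an assumption.<br />

```r
y_full <- log(AirPassengers)               # built-in monthly series, used here as a stand-in
dec <- stl(y_full, s.window = "periodic")  # loess-based seasonal-trend decomposition
print(head(dec$time.series))

# The seasonal, trend and remainder components add back up to the series
recon <- rowSums(dec$time.series)
print(all.equal(as.numeric(recon), as.numeric(y_full)))
# plot(dec) draws the four-panel figure: data, seasonal, trend, remainder
```

Working on the log scale turns the multiplicative seasonal pattern into an approximately additive one, which is what this decomposition assumes.<br />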
<br />
*Regression: estimation, diagnostics, interpretation. Defining dummy variables to capture the seasonal effect in regression. If there is a variable with k categories, there will be k-1 dummies if the intercept is included in the regression (exclude one group); there will be k dummies if the intercept is not included in the regression; if the intercept is included in the regression, the coefficient of a dummy variable is always the difference of means between that group and the excluded group. <br />
<br />
RCODE:<br />
t <- time(y)<br />
Q <- factor(rep(1:12,8))<br />
h <- lm(y~time(y)+Q) # regression with intercept, drops first category = January<br />
model.matrix(h)<br />
h1 <- lm(y~0+time(y)+Q) # regression without intercept<br />
model.matrix(h1)<br />
x1 <- ts(y[1:72],start=c(1949,1),frequency=12)<br />
t1 <- t[1:72]<br />
Q1 <- Q[1:72]<br />
contrasts(Q1)=contr.treatment(12,base=10)<br />
h2 <- lm(x1~t1+Q1) # regression with intercept, drops 10th category = October<br />
model.matrix(h2)<br />
reg <- lm(x1~t1+Q1)<br />
par(mfrow=c(2,2))<br />
plot.ts(reg$\$$resid)<br />
qqnorm(reg$\$$resid)<br />
qqline(reg$\$$resid)<br />
acf(reg$\$$resid)<br />
pacf(reg$\$$resid)<br />
<br />
[[Image:SMHS_Fig26_TimeSeries.png|500px]]<br />
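<br />
The dummy-variable rules stated before the listing (k-1 dummies with an intercept, k without) can be seen on a tiny example. The factor and response values below are invented for illustration and contain no time trend, so the coefficients reduce exactly to differences of group means.<br />

```r
season <- factor(rep(1:4, times = 3))   # hypothetical factor with k = 4 categories
yy <- c(10, 12, 15, 11, 11, 13, 16, 12, 9, 14, 17, 10)

X_int   <- model.matrix(~ season)       # intercept plus k-1 = 3 dummies
X_noint <- model.matrix(~ 0 + season)   # k = 4 dummies, no intercept
print(colnames(X_int))
print(colnames(X_noint))

# With an intercept, each dummy coefficient is that group's mean minus the
# mean of the excluded (first) group
fit_int <- lm(yy ~ season)
means <- tapply(yy, season, mean)
print(rbind(coef = coef(fit_int)[-1], diff = means[-1] - means[1]))
```

Changing the excluded category, as done with contr.treatment(12,base=10) in the listing above, changes the reference group for these differences but not the fitted values.<br />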
<br />
## forecasts RCODE<br />
newdata=data.frame(t1=t[73:96], Q1=Q[73:96]) <br />
preg=predict(reg,newdata,se.fit=TRUE) <br />
seup=preg$\$$fit+2*preg$\$$se.fit<br />
seup <br />
1 2 3 4 5 6 7 8 9 <br />
5.577144 5.722867 5.679936 5.667979 5.780376 5.881420 5.889829 5.782526 5.655445 <br />
10 11 12 13 14 15 16 17 18 <br />
5.525933 5.661158 5.681042 5.716577 5.862300 5.819369 5.807412 5.919809 6.020853 <br />
19 20 21 22 23 24 <br />
6.029262 5.921959 5.794878 5.665366 5.800591 5.820475 <br />
selow=preg$\$$fit-2*preg$\$$se.fit<br />
selow<br />
1 2 3 4 5 6 7 8 9 <br />
5.479442 5.625165 5.582234 5.570277 5.682674 5.783718 5.792127 5.684824 5.557743 <br />
10 11 12 13 14 15 16 17 18 <br />
5.428232 5.563456 5.583340 5.610927 5.756650 5.713720 5.701763 5.814160 5.915203 <br />
19 20 21 22 23 24 <br />
5.923613 5.816309 5.689228 5.559717 5.694941 5.714825<br />
<br />
x1r=y[73:96]<br />
t1r=t[73:96]<br />
plot(t1r,x1r, col='red',lwd=1:2,main='Actual vs Forecast', type='l')<br />
lines(t1r,preg$\$$fit, col='green', lwd=1:2)<br />
lines(t1r,seup, col='blue', lwd=1:2)<br />
lines(t1r,selow, col='blue', lwd=1:2)<br />
## red: actual values; green: forecast; blue: forecast plus and minus 2 standard errors<br />
<br />
[[Image:SMHS_Fig27_TimeSeries.png|500px]]<br />
<br />
## Holt Winters smoothing: forecasts<br />
sa=HoltWinters(x1, seasonal = c("additive"))<br />
psa=predict(sa,24)<br />
plot(t1r,x1r, col="red",type="l",main = "Holt- Winters filtering, airline passengers, HW additive")<br />
lines(t1r, psa, col="green")<br />
<br />
## red represents the actual, green: forecast<br />
<br />
[[Image:SMHS_Fig28_TimeSeries.png|500px]]<br />
<br />
sqrt(sum((x1r-psa)^2)) ##HW<br />
[1] 0.3988259<br />
sqrt(sum((x1r-preg$\$$fit)^2)) ##reg<br />
[1] 0.3807665<br />
<br />
==Applications==<br />
<br />
1) [http://www.statsoft.com/Textbook/Time-Series-Analysis This article] presents a comprehensive introduction to the field of time series analysis. It discusses the common patterns found in time series data and introduces commonly used techniques for handling and analyzing time series.<br />
<br />
2)[http://www.sciencedirect.com/science/article/pii/0304393282900125 This article] investigates whether macroeconomic time series are better characterized as stationary fluctuations around a deterministic trend or as non-stationary processes that have no tendency to return to a deterministic path. Using long historical time series for the U.S. we are unable to reject the hypothesis that these series are non-stationary stochastic processes with no tendency to return to a trend line. Based on these findings and an unobserved components model for output that decomposes fluctuations into a secular or growth component and a cyclical component we infer that shocks to the former, which we associate with real disturbances, contribute substantially to the variation in observed output. We conclude that macroeconomic models that focus on monetary disturbances as a source of purely transitory fluctuations may never be successful in explaining a large fraction of output variation and that stochastic variation due to real factors is an essential element of any model of macroeconomic fluctuations.<br />
<br />
<br />
==Software== <br />
See the CODE examples given in this lecture.<br />
<br />
==Problems==<br />
<br />
1) Consider a signal-plus-noise model of the general form $x_{t}=s_{t}+w_{t}$, where $w_{t}$ is Gaussian white noise with $\sigma ^{2} _{w} = 1$. Simulate and plot $n=200$ observations from each of the following two models.<br />
(a) $x_{t} = s_{t}+w_{t}$, for $t=1,2,\cdots,200$, where $s_{t}=\begin{cases} 0 & \text{if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/20 \right \} \cos\left ( 2\pi t/4 \right ) & \text{if } t=101,102,\cdots,200 \end{cases}.$<br />
(b) $x_{t}=s_{t}+w_{t}$, for $t=1,2,\cdots, 200$, where $s_{t}=\begin{cases} 0 & \text{if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/200 \right \} \cos\left ( 2\pi t/4 \right ) & \text{if } t=101,102,\cdots,200 \end{cases}.$<br />
(c) Compare the signal modulators (a) $\exp \left \{-t/20 \right \}$ and (b) $\exp \left \{-t/200 \right \}$, for $t=1,2, \cdots,100$.<br />
<br />
2) (a) Generate $n=100$ observations from the autoregression $x_{t}=-0.9 x_{t-2} +w_{t}$, with $\sigma_{w}=1$, using the method described in Example 1.10, page 13 of the textbook. Next, apply the moving average filter $v_{t}=\left(x_{t}+x_{t-1}+x_{t-2}+x_{t-3} \right )/4$ to $x_{t}$, the data you generated. Now, plot $x_{t}$ as a line and superimpose $v_{t}$ as a dashed line. Comment on the behavior of $x_{t}$ and how applying the moving average filter changes that behavior. [Hint: use v=filter(x,rep(1/4,4),sides=1) for the filter].<br />
(b) Repeat (a) but with $x_{t}=\cos \left( 2 \pi t/4 \right)$.<br />
(c) Repeat (b) but with added $N(0,1)$ noise, $x_{t}=\cos \left(2 \pi t/4 \right) +w_{t}$.<br />
(d) Compare and contrast (a) – (c).<br />
<br />
3) For the two series $x_{t}$ in Problem 1 (a) and (b):<br />
(a) Compute and plot the mean functions $\mu _{x} (t)$ for $t=1,2,\cdots,200$.<br />
(b) Calculate the autocovariance functions, $\gamma _{x} (s,t)$, for $s,t=1,2, \cdots, 200$.<br />
<br />
4) Consider the time series $x_{t} = \beta_{1} + \beta_{2} t + w_{t}$, where $\beta _{1}$ and $\beta_{2}$ are known constants and $w_{t}$ is a white noise process with variance $\sigma ^{2} _{w}$.<br />
(a) Determine whether $x_{t}$ is stationary.<br />
(b) Show that the process $y_{t}=x_{t} - x_{t-1}$ is stationary.<br />
(c) Show that the mean of the moving average $v_{t}= \frac {1}{2q+1} \sum_{j=-q}^{q} x_{t-j}$ is $\beta _{1} + \beta _{2} t$, and give a simplified expression for the autocovariance function.<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_Fig15_TimeSeries.png&diff=14461File:SMHS Fig15 TimeSeries.png2014-10-20T15:18:57Z<p>Clgalla: Clgalla uploaded a new version of &quot;File:SMHS Fig15 TimeSeries.png&quot;</p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData&diff=14439SMHS LongitudinalData2014-10-17T16:37:32Z<p>Clgalla: /* Scientific Methods for Health Sciences - Longitudinal Data */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Longitudinal Data ==<br />
<br />
===Overview===<br />
Longitudinal data are data collected from a large population over a given time period, where the same subjects are measured at multiple points in time. They are widely used in statistical and financial studies. In this section, we introduce the concept of longitudinal data.<br />
<br />
===Motivation===<br />
Data measured at multiple points in time on the same subject are commonly used in various studies. For example, consider a dataset containing students’ standard test scores in four successive years (shown in the table below). How do we categorize data like this?<br />
<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|Student Name|| Grade 1 (2010) Raw Score||Grade 2 (2011) Raw Score||Grade 3 (2012) Raw Score||Grade 4 (2013) Raw Score<br />
|-<br />
|Tom ||339|| 350|| 361|| 366<br />
|-<br />
|Mike|| 332|| 343|| 350|| 351<br />
|-<br />
|Vivian||360 ||380 ||400|| 420<br />
|}<br />
</center><br />
<br />
===Theory===<br />
<br />
'''1) Longitudinal data'''<br />
Longitudinal data is data collected from a large population over a given time period where the same subjects are measured at multiple points in time. <br />
*The primary advantage of longitudinal datasets is that they can measure change. For example, for the dataset listed above, we can estimate the effect of various factors on improvement in students’ achievement, or we can estimate the overall effectiveness of individual teachers by examining the performance of successive classes of students they teach.<br />
*Longitudinal data extend into both the past and the present, which enables us to evaluate the effect of a specific policy by comparing student performance before and after the policy was introduced.<br />
*Longitudinal data also allow us to use sophisticated analytic strategies to measure the impact of various policies with reasonable precision.<br />
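<br />
As a small illustration of measuring change, the test scores from the Motivation table can be entered as a subject-by-year matrix and differenced within each student. The object names below are illustrative; the numbers are those in the table.<br />

```r
# Rows are subjects, columns are the four measurement occasions
scores <- rbind(Tom    = c(339, 350, 361, 366),
                Mike   = c(332, 343, 350, 351),
                Vivian = c(360, 380, 400, 420))
colnames(scores) <- c("2010", "2011", "2012", "2013")

gains <- t(apply(scores, 1, diff))                  # year-over-year change within each student
total_gain <- scores[, "2013"] - scores[, "2010"]   # change over the whole period
print(gains)
print(total_gain)
```

A cross-sectional snapshot of any single year would show only that Vivian scores highest; the within-student differences additionally show that her scores grow by 20 points every year while Mike's gains shrink.<br />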
<br />
'''2) Longitudinal study'''<br />
A longitudinal study is an investigation where participant outcomes and possibly treatments or exposures are collected at multiple follow-up times. It generally yields multiple (repeated) measurements on each subject, which are correlated within subjects and thus require special statistical techniques for analysis and inference. In a longitudinal study, we may also be interested in the time until a key clinical event such as death, in which case survival analysis is generally applied.<br />
*Longitudinal studies play an important role in fields such as epidemiology, clinical research and biology. They are used to characterize normal growth and aging, to assess the effect of risk factors on human health, and to evaluate the effectiveness of treatments, and they require considerable effort.<br />
*Benefits of longitudinal studies: (1) incident events are recorded: a prospective longitudinal study measures the new occurrence of disease; (2) prospective ascertainment of exposure: data recorded at multiple follow-up visits may reduce recall bias; (3) measurement of individual change in outcomes; (4) separation of time effects: cohort, period, age; (5) control for cohort effects: in a longitudinal study, the cohort under study is fixed and thus changes in time are not confounded by cohort differences.<br />
*Challenges of longitudinal studies: (1) participant follow-up: there is a risk of bias due to incomplete follow-up of study participants; (2) analysis of correlated data: if intra-subject correlation of response measurements is ignored, then inferences such as statistical tests or confidence intervals can be grossly invalid; (3) time-varying covariates: the direction of causality can be complicated by ‘feedback’ between the outcome and the exposure.<br />
<br />
'''3) Notations'''<br />
We use $Y_{ij}$ to denote the outcome measured on subject $i$ at time $t_{ij},$ where $i=1, \cdots, N$ indexes subjects and $j = 1, \cdots, n$ indexes observations within a subject. Here the measurement times follow a common set of follow-up times, $t_{ij} = t_{j}$. We use $X_{ij}$ to denote covariates associated with observation $Y_{ij}$. Common covariates in a longitudinal study include the time, $t_{ij}$, and person-level characteristics such as age and treatment assignment. In many cases, scientific studies focus on the mean response as a function of covariates such as treatment and time; we can also make statistical inference on the within-person correlation of observations. Define $\rho_{jk} = corr(Y_{ij}, Y_{ik}),$ the within-subject correlation between observations at times $t_{j}$ and $t_{k}.$<br />
<br />
'''4) Exploratory data analysis''' <br />
This is used to discover patterns of systematic variation across groups, as well as aspects of random variation among individual patients.<br />
*Group means: if we want to measure the average response over time, summary statistics such as means and standard deviations can reveal whether different groups change in a similar or different fashion.<br />
*Variation among individuals: a single variance parameter can be used to summarize uncertainty or variability in a response measurement. In longitudinal data, the ‘distance’ between measurements on different subjects is usually expected to be greater than the distance between repeated measurements taken on the same subject. Hence, although the total variance can be written as $\sigma^{2} =\frac{1}{2} E[(Y_{ij}-Y_{i'j})^{2}]$, assuming $E(Y_{ij}) = E(Y_{i'j}) = \mu,$ the expected variation for two measurements taken on the same person at times $t_{j}$ and $t_{k}$ may not equal the total variation $\sigma^{2}$, since the measurements are correlated: $\sigma^{2}(1-\rho_{jk})=\frac{1}{2} E[(Y_{ij}-Y_{ik})^{2}]$, assuming $E(Y_{ij}) = E(Y_{ik}) = \mu.$ When $\rho_{jk} > 0,$ between-subject variation is greater than within-subject variation.<br />
*Characterizing correlation and covariance: characterizing the correlation is useful for understanding components of variation and for identifying a variance or correlation model for regression methods such as GEE (generalized estimating equations) or mixed-effects models. The covariance matrix can be written in terms of the variances $\sigma_{j}^2$ and the correlations $\rho_{jk}$: $cov(Y_{i})=\begin{bmatrix} \sigma_{1}^2 & \sigma_{1}\sigma_{2}\rho_{12} &\cdots &\sigma_{1}\sigma_{n}\rho_{1n} \\ \sigma_{2}\sigma_{1}\rho_{21}&\sigma_{2}^2 &\cdots &\sigma_{2}\sigma_{n}\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\ \sigma_{n}\sigma_{1}\rho_{n1}&\sigma_{n}\sigma_{2}\rho_{n2} &\cdots &\sigma_{n}^2 \end{bmatrix}$; the correlation matrix is $corr(Y_{i})=\begin{bmatrix} 1 & \rho_{12} &\cdots &\rho_{1n} \\\rho_{21}&1 & \cdots&\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\\rho_{n1}&\rho_{n2} &\cdots &1 \end{bmatrix}.$<br />
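<br />
The link between the covariance matrix, the correlation matrix and the within-subject variation can be checked numerically. The sketch below assumes, purely for illustration, $n=4$ visits with a common standard deviation of 2 and an exchangeable correlation of 0.5.<br />

```r
sigma <- c(2, 2, 2, 2)    # hypothetical standard deviations at n = 4 visits
rho <- 0.5                # hypothetical exchangeable within-subject correlation

R <- matrix(rho, 4, 4)
diag(R) <- 1                              # correlation matrix
S <- diag(sigma) %*% R %*% diag(sigma)    # cov(Y_i): entries sigma_j * sigma_k * rho_jk

# Recover the correlation matrix from the covariance matrix
print(all.equal(cov2cor(S), R))

# Half the expected squared difference between visits j = 1 and k = 2,
# assuming a common mean: (S_jj + S_kk - 2 * S_jk) / 2 = sigma^2 * (1 - rho)
half_sq_diff <- 0.5 * (S[1, 1] + S[2, 2] - 2 * S[1, 2])
print(c(within = half_sq_diff, total = sigma[1]^2))
```

With these values the within-subject quantity is 2 against a total variance of 4, matching $\sigma^{2}(1-\rho_{jk})$ with $\rho_{jk}=0.5$.<br />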
<br />
===Applications===<br />
<br />
1) [http://biomet.oxfordjournals.org/content/73/1/13.short This article] proposed an extension of generalized linear models to the analysis of longitudinal data. This paper introduced a class of estimating equations that give consistent estimates of the regression parameters and of their variance under mild assumptions about the time dependence. The estimating equations were derived without specifying the joint distribution of a subject's observations, yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory was presented for the general class of estimators. Specific cases assuming independence, m-dependence and exchangeable correlation structures for each subject's observations were discussed. Efficiency of the proposed estimators in two simple situations was considered.<br />
<br />
2) [http://onlinelibrary.wiley.com/doi/10.1002/sim.4780111406/abstract;jsessionid=0538E29F4FDDD9D0DD3D672621073EA7.f03t03?deniedAccessCustomisedMessage=&userIsAuthenticated=false This article] reviewed statistical methods for the analysis of discrete and continuous longitudinal data. The relative merits of longitudinal and cross-sectional studies were discussed. Three approaches, marginal, transition and random effects models, were presented with emphasis on the distinct interpretations of their coefficients in the discrete data case. This paper reviewed generalized estimating equations for inferences about marginal models. The ideas were illustrated with analyses of a 2 × 2 crossover trial with binary responses and a randomized longitudinal study with a count outcome.<br />
<br />
<br />
===Software=== <br />
In R: the ''longitudinal'' package.<br />
<br />
===Problems===<br />
install.packages('longitudinal')<br />
library(longitudinal)<br />
data(tcell)<br />
is.longitudinal(tcell.34)<br />
[1] TRUE<br />
summary(tcell.34)<br />
Longitudinal data:<br />
58 variables measured at 10 different time points<br />
Total number of measurements per variable: 340 <br />
Repeated measurements: yes <br />
<br />
To obtain the measurement design call 'get.time.repeats()'.<br />
plot(tcell.10,1:9) ## plot the first 9 time series of the data<br />
<br />
<br />
<br />
[[Image:SMHS__Longtitud_Fig1_.png|500px]]<br />
<br />
<br />
dim(tcell.34) ## dataset with 34 repeats<br />
[1] 340 58<br />
get.time.repeats(tcell.34)<br />
$\$$time<br />
[1] 0 2 4 6 8 18 24 32 48 72<br />
<br />
$\$$repeats<br />
[1] 34 34 34 34 34 34 34 34 34 34<br />
<br />
is.equally.spaced(tcell.34)<br />
[1] FALSE<br />
is.regularly.sampled(tcell.34)<br />
[1] TRUE<br />
has.repeated.measurements(tcell.34)<br />
[1] TRUE<br />
condense.longitudinal(tcell.34,1:2,mean) # compute the mean value at each time point for the first two genes<br />
RB1 CCNG1<br />
[1,] 17.41394 16.48101<br />
[2,] 17.62637 16.34122<br />
[3,] 17.89343 16.26661<br />
[4,] 17.37539 15.91950<br />
[5,] 17.57670 16.25621<br />
[6,] 17.85805 16.26411<br />
[7,] 17.76270 16.24127<br />
[8,] 17.22543 16.52049<br />
[9,] 16.86306 16.14295<br />
[10,] 17.09348 16.58913<br />
<br />
<br />
===References===<br />
[http://en.wikipedia.org/wiki/Longitudinal_study Longitudinal Study Wikipedia]<br />
<br />
[http://faculty.washington.edu/heagerty/Courses/VA-longitudinal/private/LDAchapter.pdf Longitudinal Data Analysis]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData&diff=14432SMHS LongitudinalData2014-10-17T15:39:02Z<p>Clgalla: /* Scientific Methods for Health Sciences - Longitudinal Data */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Longitudinal Data ==<br />
<br />
===Overview===<br />
Longitudinal data are data collected from a large population over a given time period, where the same subjects are measured at multiple points in time. They are widely used in statistical and financial studies. In this section, we introduce the concept of longitudinal data.<br />
<br />
===Motivation===<br />
Data measured at multiple points in time on the same subject are commonly used in various studies. For example, consider a dataset containing students’ standard test scores in four successive years (shown in the table below). How do we categorize data like this?<br />
<br />
<br />
<center><br />
{| class="wikitable" style="text-align:center; width:35%" border="1"<br />
|-<br />
|Student Name|| Grade 1 (2010) Raw Score||Grade 2 (2011) Raw Score||Grade 3 (2012) Raw Score||Grade 4 (2013) Raw Score<br />
|-<br />
|Tom ||339|| 350|| 361|| 366<br />
|-<br />
|Mike|| 332|| 343|| 350|| 351<br />
|-<br />
|Vivian||360 ||380 ||400|| 420<br />
|}<br />
</center><br />
<br />
===Theory===<br />
<br />
3.1) Longitudinal data: data collected from a large population over a given time period where the same subjects are measured at multiple points in time. <br />
The primary advantage of longitudinal datasets is that they can measure change. For example, for the dataset listed above, we can estimate the effect of various factors on improvement in students’ achievement, or we can estimate the overall effectiveness of individual teachers by examining the performance of successive classes of students they teach.<br />
Longitudinal data extend into both the past and the present, which enables us to evaluate the effect of a specific policy by comparing student performance before and after the policy was introduced.<br />
Longitudinal data also allow us to use sophisticated analytic strategies to measure the impact of various policies with reasonable precision.<br />
<br />
3.2) Longitudinal study: an investigation where participant outcomes and possibly treatments or exposures are collected at multiple follow-up times. It generally yields multiple (repeated) measurements on each subject, which are correlated within subjects and thus require special statistical techniques for analysis and inference. In a longitudinal study, we may also be interested in the time until a key clinical event such as death, in which case survival analysis is generally applied.<br />
Longitudinal studies play an important role in fields such as epidemiology, clinical research and biology. They are used to characterize normal growth and aging, to assess the effect of risk factors on human health, and to evaluate the effectiveness of treatments, and they require considerable effort.<br />
Benefits of longitudinal studies: (1) incident events are recorded: a prospective longitudinal study measures the new occurrence of disease; (2) prospective ascertainment of exposure: data recorded at multiple follow-up visits may reduce recall bias; (3) measurement of individual change in outcomes; (4) separation of time effects: cohort, period, age; (5) control for cohort effects: in a longitudinal study, the cohort under study is fixed and thus changes in time are not confounded by cohort differences.<br />
Challenges of longitudinal studies: (1) participant follow-up: there is a risk of bias due to incomplete follow-up of study participants; (2) analysis of correlated data: if intra-subject correlation of response measurements is ignored, then inferences such as statistical tests or confidence intervals can be grossly invalid; (3) time-varying covariates: the direction of causality can be complicated by ‘feedback’ between the outcome and the exposure.<br />
<br />
3.3) Notation: use Y_{ij} to denote the outcome measured on subject i at time t_{ij}, where i = 1, \cdots, N indexes subjects and j = 1, \cdots, n indexes observations within a subject. We assume the measurement times follow a common set of follow-up times, t_{ij} = t_{j}. Use X_{ij} to denote the covariates associated with observation Y_{ij}. Common covariates in a longitudinal study include the time, t_{ij}, and person-level characteristics such as age and treatment assignment. In many cases, scientific studies focus on the mean response as a function of covariates such as treatment and time, but we can also make statistical inferences about the within-person correlation of observations. Define \rho_{jk} = corr(Y_{ij}, Y_{ik}), the within-subject correlation between observations at times t_{j} and t_{k}. <br />
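<br />
To connect this notation to data, the sketch below rearranges the test-score table into "long" format, with one row per (subject i, observation j) pair. It uses base R's reshape(); the column names (score.2010, j, and so on) are illustrative assumptions.<br />

```r
# Wide format: one row per subject i, one column per measurement time t_j.
wide <- data.frame(
  student = c("Tom", "Mike", "Vivian"),
  score.2010 = c(339, 332, 360),
  score.2011 = c(350, 343, 380),
  score.2012 = c(361, 350, 400),
  score.2013 = c(366, 351, 420)
)

# Long format: one row per observation Y_ij, with the time t_j as a covariate.
long <- reshape(
  wide,
  direction = "long",
  varying = c("score.2010", "score.2011", "score.2012", "score.2013"),
  v.names = "score",
  timevar = "year",
  times = 2010:2013,
  idvar = "student"
)

# Sort by subject and time, and add the within-subject observation index j.
long <- long[order(long$student, long$year), ]
rownames(long) <- NULL
long$j <- match(long$year, 2010:2013)

print(long)
```

Here N = 3 subjects and n = 4 observations per subject, so the long table has 12 rows. Most longitudinal modeling functions expect data in this layout.<br />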
<br />
3.4) Exploratory data analysis: to discover patterns of systematic variation across groups, as well as aspects of random variation among individual patients.<br />
Group means: if we want to measure the average response over time, summary statistics such as group means and standard deviations at each time point can reveal whether different groups change in a similar or different fashion. <br />
Variation among individuals: a single variance parameter can be used to summarize uncertainty or variability in a response measurement. In longitudinal data, the ‘distance’ between measurements on different subjects is usually expected to be greater than the distance between repeated measurements taken on the same subject. The total variance can be written as \sigma^{2} = \frac{1}{2} E[(Y_{ij}-Y_{i'j})^{2}] for two different (independent) subjects i and i', assuming E(Y_{ij}) = E(Y_{i'j}) = \mu. The expected variation between two measurements taken on the same person at times t_{j} and t_{k} may not equal the total variation \sigma^{2}, because the measurements are correlated: \sigma^{2}(1-\rho_{jk}) = \frac{1}{2} E[(Y_{ij}-Y_{ik})^{2}], assuming E(Y_{ij}) = E(Y_{ik}) = \mu. When \rho_{jk} > 0, between-subject variation is greater than within-subject variation.<br />
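<br />
The two expressions above follow from expanding the squared difference. A short derivation, assuming a common mean \mu and a common variance \sigma^{2} at every time point, and independence between different subjects:<br />

```latex
% Same subject i, two times t_j and t_k (correlated measurements):
E\left[(Y_{ij}-Y_{ik})^{2}\right]
  = \mathrm{Var}(Y_{ij}) + \mathrm{Var}(Y_{ik}) - 2\,\mathrm{Cov}(Y_{ij},Y_{ik})
  = \sigma^{2} + \sigma^{2} - 2\rho_{jk}\sigma^{2}
  = 2\sigma^{2}(1-\rho_{jk})

% Different subjects i and i', same time t_j (independent, so the covariance is zero):
E\left[(Y_{ij}-Y_{i'j})^{2}\right]
  = \mathrm{Var}(Y_{ij}) + \mathrm{Var}(Y_{i'j})
  = 2\sigma^{2}
```

Dividing each line by 2 gives the within-subject quantity \sigma^{2}(1-\rho_{jk}) and the total variance \sigma^{2}, respectively.<br />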
Characterizing correlation and covariance: characterizing the correlation is useful for understanding components of variation and for identifying a variance or correlation model for regression methods such as GEE (generalized estimating equations) or mixed-effects models. The covariance matrix can be written in terms of the variances \sigma_{j}^2 and the correlations \rho_{jk}: cov(Y_{i})=\begin{bmatrix} \sigma_{1}^2 & \sigma_{1}\sigma_{2}\rho_{12} &\cdots &\sigma_{1}\sigma_{n}\rho_{1n} \\ \sigma_{2}\sigma_{1}\rho_{21}&\sigma_{2}^2 &\cdots &\sigma_{2}\sigma_{n}\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\ \sigma_{n}\sigma_{1}\rho_{n1}&\sigma_{n}\sigma_{2}\rho_{n2} &\cdots &\sigma_{n}^2 \end{bmatrix}; the corresponding correlation matrix is: corr(Y_{i})=\begin{bmatrix} 1 & \rho_{12} &\cdots &\rho_{1n} \\\rho_{21}&1 & \cdots&\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\\rho_{n1}&\rho_{n2} &\cdots &1 \end{bmatrix}.<br />
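<br />
The relationship between the two matrices can be sketched in base R. The standard deviations and the exchangeable correlation value below are made-up illustrative numbers, not estimates from any dataset.<br />

```r
# Hypothetical standard deviations at n = 4 measurement times (illustrative values).
sigma <- c(10, 12, 15, 20)

# Hypothetical exchangeable correlation: every pair of times has rho_jk = 0.6.
rho <- 0.6
R <- matrix(rho, nrow = 4, ncol = 4)
diag(R) <- 1

# cov(Y_i) has entries sigma_j * sigma_k * rho_jk, i.e. D R D with D = diag(sigma).
D <- diag(sigma)
Sigma <- D %*% R %*% D

# Converting the covariance matrix back recovers the correlation matrix.
R_back <- cov2cor(Sigma)

print(Sigma)
print(R_back)
```

The diagonal of Sigma holds the variances \sigma_{j}^2, and cov2cor() rescales Sigma by the standard deviations to return the correlation matrix.<br />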
<br />
4) Applications<br />
<br />
4.1) This article (http://biomet.oxfordjournals.org/content/73/1/13.short) proposed an extension of generalized linear models to the analysis of longitudinal data. The paper introduced a class of estimating equations that give consistent estimates of the regression parameters and of their variance under mild assumptions about the time dependence. The estimating equations were derived without specifying the joint distribution of a subject's observations, yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory was presented for the general class of estimators. Specific cases assuming independence, m-dependence and exchangeable correlation structures for each subject's observations were discussed. The efficiency of the proposed estimators in two simple situations was considered. <br />
<br />
4.2) This article (http://onlinelibrary.wiley.com/doi/10.1002/sim.4780111406/abstract;jsessionid=0538E29F4FDDD9D0DD3D672621073EA7.f03t03?deniedAccessCustomisedMessage=&userIsAuthenticated=false) reviewed statistical methods for the analysis of discrete and continuous longitudinal data. The relative merits of longitudinal and cross-sectional studies were discussed. Three approaches, marginal, transition and random effects models, were presented with emphasis on the distinct interpretations of their coefficients in the discrete data case. This paper reviewed generalized estimating equations for inferences about marginal models. The ideas were illustrated with analyses of a 2 × 2 crossover trial with binary responses and a randomized longitudinal study with a count outcome.<br />
<br />
<br />
5) Software <br />
In R: the ‘longitudinal’ package.<br />
<br />
6) Problems<br />
install.packages('longitudinal')<br />
library(longitudinal)<br />
data(tcell)<br />
is.longitudinal(tcell.34)<br />
TRUE<br />
summary(tcell.34)<br />
Longitudinal data:<br />
58 variables measured at 10 different time points<br />
Total number of measurements per variable: 340 <br />
Repeated measurements: yes <br />
<br />
To obtain the measurement design call 'get.time.repeats()'.<br />
plot(tcell.10,1:9) ## plot the first 9 time series of the data<br />
<br />
<br />
<br />
[[Image:SMHS__Longtitud_Fig1_.png|500px]]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_Longtitud_Fig1_.png&diff=14431File:SMHS Longtitud Fig1 .png2014-10-17T15:38:13Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData&diff=14430SMHS LongitudinalData2014-10-17T15:37:23Z<p>Clgalla: /* Scientific Methods for Health Sciences - Longitudinal Data */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Longitudinal Data ==<br />
<br />
===Overview===<br />
Longitudinal data are data collected from a large population over a given time period, where the same subjects are measured at multiple points in time. They are widely used in statistical and financial studies. In this section, we introduce the concept of longitudinal data. <br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_Surveys&diff=14429SMHS Surveys2014-10-17T13:53:45Z<p>Clgalla: /* Scientific Methods for Health Sciences - Surveys */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Surveys ==<br />
<br />
===Overview===<br />
Survey methodology studies the sampling of individual units from a population and the survey data collection techniques, such as questionnaires, used to improve the number and accuracy of responses to surveys. The ultimate goal is to make statistical inferences about the population studied, which of course depend strongly on the survey questions asked. Commonly used survey types include polls, public health surveys, market research surveys and censuses. Surveys provide important information for public information and research, and are widely applied in fields such as marketing, the health professions and sociology. In this lecture, we present a general introduction to surveys and illustrate various survey methods with examples.<br />
<br />
===Motivation===<br />
Surveys may be one of the most commonly used methods for data collection. The questions used in a survey are of significant importance in collecting enough data to make statistical inferences about the population. There are various ways of collecting data in surveys; each has its strengths and weaknesses and is suited to different kinds of data. So, how do surveys work? And how do we conduct a good survey?<br />
<br />
===Theory===<br />
<br />
'''1) Surveys:'''<br />
A survey consists of at least one sample (or the full population in the case of a census), a method of data collection, and individual questions whose answers become data that can be analyzed statistically. <br />
*A single survey may focus on different types of topics such as preferences, opinions, behavior or factual information depending on the purpose of the study. <br />
*Given that a survey is based on a sample of the population, the success of the research depends largely on the representativeness of the sample with respect to the target population. <br />
*Survey methodology aims to identify principles about the sample design, data collection instruments, statistical adjustment of data, data processing, and final data analysis that can create systematic and random survey errors. Survey errors are sometimes analyzed in connection with survey cost: the trade-off is framed either as improving quality within cost constraints or, alternatively, as reducing costs for a fixed level of quality. <br />
<br />
'''2) Survey methodology topics'''<br />
The most important challenges of a survey method include making decisions on how to: (1) identify and select potential sample members; (2) contact sampled individuals and collect data from those who are hard to reach or reluctant to respond to the questions; (3) evaluate and test questions; (4) select the mode for posting questions and collecting responses; (5) train and supervise interviewers if they are involved; (6) check data files for accuracy and internal consistency; (7) adjust survey estimates to correct for identified errors.<br />
*Selecting samples: there are two main types of survey samples: probability samples and non-probability samples. Stratified sampling is a method of probability sampling in which the sub-populations within the population are identified and included in the selected sample in a balanced way.<br />
*Modes of data collection: the choice between various modes of administering a survey can be influenced by several factors including costs, coverage of the target population, flexibility of asking questions, respondents’ willingness to participate and response accuracy. Mode effect created by different methods can change the way the respondents answer. The most commonly used modes of administration can be summarized into some main categories including telephone, mail, online surveys, personal in-home surveys, personal mall or street intercept survey and hybrids of the above. <br />
**Telephone: use of interviewers encourages sample persons to respond, leading to higher response rates; interviewers can increase comprehension of questions by answering respondents’ questions; fairly cost efficient, depending on local call charge structure; good for large national sampling frames; some potential for interviewer bias; cannot be used for non-audio information; the three main types of telephone surveys are traditional telephone interviews, computer assisted telephone dialing and computer assisted telephone interviewing (CATI).<br />
**Online surveys: (1) Web surveys are faster, simpler, and cheaper. However, lower costs are not so straightforward in practice, as they are strongly interconnected to errors. Because response rate comparisons to other survey modes are usually not favorable for online surveys, efforts to achieve a higher response rate (e.g., with traditional solicitation methods) may substantially increase costs. (2) The entire data collection period is significantly shortened, as all data can be collected and processed in little more than a month. (3) Interaction between the respondent and the questionnaire is more dynamic compared to e-mail or paper surveys. Online surveys are also less intrusive, and they suffer less from social desirability effects. (4) Complex skip patterns can be implemented in ways that are mostly invisible to the respondent. (5) Pop-up instructions can be provided for individual questions to provide help with questions exactly where assistance is required. (6) Questions with long lists of answer choices can be used to provide immediate coding of answers to certain questions that are usually asked in an open-ended fashion in paper questionnaires. (7) Online surveys can be tailored to the situation (e.g., respondents may be allowed to save a partially completed form, the questionnaire may be preloaded with already available information, etc.). Online questionnaires may be improved by applying usability testing, where usability is measured with reference to the speed with which a task can be performed, the frequency of errors and user satisfaction with the interface.<br />
**Mail: the questionnaires may be handed to the respondents or mailed to them, but in all cases they are returned to the researcher via mail. The advantages are that the cost of a mail survey is very low and there is no interviewer bias; however, there may be long delays, and mail surveys are not suitable for issues that may require clarification. Response rates can be improved by using mail panels, offering monetary incentives and upgrading the class of mail through which the surveys are sent. <br />
**Face-to-face: suitable for locations where telephone or mail services are not well developed; potential for interviewer bias; results can be manipulated by completing the survey multiple times to skew them.<br />
**Mixed-mode surveys: with the introduction of computers to the survey process, survey mode now includes combinations of different approaches (mixed-mode designs). Some commonly used methods include computer-assisted personal interviewing (CAPI), audio computer assisted self-interviewing (audio CASI), computer-assisted telephone interviewing (CATI) and interactive voice response (IVR).<br />
*Cross-sectional and longitudinal surveys: the former involve a single questionnaire administered to each sample member, and the latter repeatedly collect information from the same people over time. Longitudinal surveys have considerable analytical advantages but can be challenging to implement successfully. As a result, specialist methods have been developed to select longitudinal samples, to collect data repeatedly, to keep track of sample members over time, to keep respondents motivated to participate, and to process and analyze longitudinal survey data.<br />
*Response formats: two kinds of question formats are used in surveys: open-ended questions, which require respondents to formulate their own answers, and closed-ended questions, which require respondents to pick an answer from a given list of mutually exclusive and exhaustive options. There are four types of response scales for closed-ended questions: dichotomous (two options), nominal-polytomous (more than two unordered options), ordinal-polytomous (more than two ordered options) and bounded continuous (continuous scaled questions). A respondent’s answer to an open-ended question can be coded into a response scale afterwards or analyzed using more qualitative methods.<br />
*Nonresponse reduction in telephone and face-to-face surveys: (1) Advance letter: sent in advance to inform the sampled respondents about the upcoming survey. It should be personalized but not overdone; (2) Training: interviewers should be thoroughly trained in how to ask respondents questions, how to work with computers and how to schedule callbacks to respondents who were not reached; (3) Short introduction: the interviewer should always start with a short introduction giving his or her name, the institute he or she works for, the length of the interview and the goal of the interview; (4) Respondent-friendly survey questionnaire: the questions asked must be clear, non-offensive and easy to respond to for the subjects under study.<br />
*Interviewer effects: survey responses may be affected by physical characteristics of the interviewer, including race, gender, and relative body weight (BMI). These interviewer characteristics are particularly influential when the questions are related to the interviewer trait. While interviewer effects have been investigated mainly for face-to-face surveys, they have also been shown to exist for interview modes with no visual contact, such as telephone surveys, and in video-enhanced web surveys. <br />
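<br />
As a minimal sketch of the stratified sampling idea described under "Selecting samples", the base R code below draws the same fraction from each stratum of a hypothetical sampling frame. The frame, the stratum labels and the 10% sampling fraction are all made-up assumptions for illustration.<br />

```r
set.seed(1)

# Hypothetical sampling frame: 100 units in three strata of unequal size.
frame <- data.frame(
  id = 1:100,
  stratum = rep(c("urban", "suburban", "rural"), times = c(50, 30, 20))
)

# Proportional allocation: sample the same fraction within every stratum.
sampling_fraction <- 0.1
by_stratum <- split(frame, frame$stratum)
sampled <- do.call(rbind, lapply(by_stratum, function(s) {
  n_s <- round(nrow(s) * sampling_fraction)  # sample size for this stratum
  s[sample(nrow(s), n_s), ]                  # simple random sample within it
}))

# Each stratum is represented in proportion to its size in the frame.
print(table(sampled$stratum))
```

Because every stratum contributes in proportion to its size, small sub-populations cannot be missed by chance, as they could be under simple random sampling of the whole frame.<br />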
<br />
3) Simple examples of surveys can be viewed here:<br />
<br />
[http://www.socr.umich.edu/html/SOCR_Survey.html SOCR survey]<br />
<br />
[http://www.esurveyspro.com/Survey.aspx?id=79ba4c38-b7d0-4530-aa00-12fdd32b6609 E-Survey/Surveys Pro]<br />
<br />
[http://socr.ucla.edu/docs/surveys/SOCR_Survey_VisualIllusions_2010.html SOCR Survey Visual Illusions]<br />
<br />
===Applications===<br />
<br />
1) [http://www.sciencedirect.com/science/article/pii/S0895435697001261 This article] characterized response rates for mail surveys published in medical journals, determined how the response rate varies among subjects who are typical targets of mail surveys, and evaluated the contribution of several techniques used by investigators to enhance response rates. Methods. One hundred seventy-eight manuscripts published in 1991, representing 321 distinct mail surveys, were abstracted to determine response rates and survey techniques. In a follow-up mail survey, 113 authors of these manuscripts provided supplementary information. Results. The mean response rate among mail surveys published in medical journals is approximately 60%. However, response rates vary according to the subjects studied and the techniques used. Published surveys of physicians have a mean response rate of only 54%, and those of non-physicians have a mean response rate of 68%. In addition, multivariable models suggest that written reminders provided with a copy of the instrument and telephone reminders are each associated with response rates about 13% higher than surveys that do not use these techniques. Other techniques, such as anonymity and financial incentives, are not associated with higher response rates. Conclusions. Although several mail survey techniques are associated with higher response rates, response rates to published mail surveys tend to be moderate. However, a survey's response rate is at best an indirect indication of the extent of non-respondent bias. Investigators, journal editors, and readers should devote more attention to assessments of bias, and less to specific response rate thresholds.<br />
<br />
2) [http://www.bmj.com/content/320/7237/745 This article] surveyed operating theatre and intensive care unit staff about their attitudes concerning error, stress, and teamwork, and compared these attitudes with those of airline cockpit crew. The study used cross-sectional surveys in urban teaching and non-teaching hospitals in the United States, Israel, Germany, Switzerland and Italy. It included 1033 doctors, nurses, fellows and residents working in operating theatres and intensive care units, and over 30000 cockpit crew members, and it measured perceptions of error, stress and teamwork. The study concluded that medical staff reported that error is important but difficult to discuss and not handled well in their hospital. Barriers to discussing error are more important since medical staff seem to deny the effect of stress and fatigue on performance. Further problems include differing perceptions of teamwork among team members and reluctance of senior theatre staff to accept input from junior members.<br />
<br />
3) [http://link.springer.com/article/10.1023/A:1025054610557 This article] provided a review of epidemiological studies of pervasive developmental disorders (PDD), updating a previously published article. The design and sample characteristics of 32 surveys published between 1966 and 2001 are described. Recent surveys suggest that the rate for all forms of PDDs is around 30/10,000, but more recent surveys suggest that the estimate might be as high as 60/10,000. The rate for Asperger disorder is not well established, and a conservative figure is 2.5/10,000. Childhood disintegrative disorder is extremely rare, with a pooled estimate across studies of 0.2/10,000. A detailed discussion of the possible interpretations of trends over time in prevalence rates is provided. There is evidence that changes in case definition and improved awareness explain much of the upward trend of rates in recent decades. However, available epidemiological surveys do not provide an adequate test of the hypothesis of a changing incidence of PDDs.<br />
<br />
===Software=== <br />
[http://www.keysurvey.com/?gclid=CjwKEAjw9eyeBRCqxc_b-LD8kTESJADsBMxS4usDhK8STe_svcEgzSPjw9dk99zGcaujAq5waWPrrxoCsRzw_wcB World App Key Survey]<br />
<br />
[http://www.qualtrics.com/research-suite/ Qualtrics]<br />
<br />
===Problems===<br />
Suppose you want to study the effectiveness of a newly released drug for headaches. Can you design a short survey (5 or 6 questions) for a group of patients with headaches who have been using this drug for the past three months? Which survey mode would you choose here, and why? (Open question.)<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Surveys}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_HLM&diff=14428SMHS HLM2014-10-17T13:45:13Z<p>Clgalla: /* Software */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Hierarchical Linear Models (HLM) ==<br />
===Overview===<br />
Hierarchical linear models (also called multilevel models) are statistical models whose parameters vary at more than one level. They can be regarded as a generalization of the linear model and are widely applied across fields, especially in research designs where data for participants are organized at more than one level. In this section, we present a general introduction to hierarchical linear models, compare them with the classical ANOVA approach, and illustrate their application with examples.<br />
<br />
===Motivation===<br />
How can we deal with cases where the data have multiple levels? What are the advantages of the hierarchical linear model over the classical approach? How does the hierarchical linear model work?<br />
<br />
===Theory===<br />
'''1) Hierarchical linear models'''<br />
These are statistical models whose parameters vary at more than one level. They combine the advantages of the mixed-model ANOVA, with its flexible modeling of fixed and random effects, and regression, with its advantages in dealing with unbalanced data and predictors that are discrete or continuous. <br />
*Data for such models are hierarchically structured with first-level units nested within second-level units, second-level units nested within third-level units and so on. Parameters for such models may be viewed as having a hierarchical linear structure and be viewed as varying across level-two units as a function of level-two characteristics. Higher levels may be added without limits, though to date no published applications involve more than three levels.<br />
*Compared to the classical experimental design models: the random factors are nested and never crossed; however, fixed factors can be crossed with random factors, and random factors may be nested within fixed factors. The data may be unbalanced at any level, and continuous predictors can be defined at any level. Both discrete and continuous predictors can be specified as having random effects, which are allowed to covary.<br />
*The units of analysis are usually individuals who are nested within aggregate units at the higher level. The lowest level of data in the model is usually an individual, but repeated measurements of individuals may also be examined. The hierarchical linear model provides an alternative type of analysis for univariate or multivariate analysis of repeated measures. Individual differences in growth curves may be examined. Multilevel models can be used as an alternative to ANCOVA, where scores on the dependent variable are adjusted for covariates, and the model can analyze these experiments without assuming the homogeneity of regression slopes that ANCOVA requires.<br />
<br />
'''2) Two-level hierarchical linear model'''<br />
Hierarchical linear models can be used on data with many levels, though two-level models are the most common. The dependent variable must be examined at the lowest level of analysis. Take the example of a study design involving students nested within schools. The level-one model specifies how student-level predictors relate to the student-level outcome, while at level two, each of the regression coefficients defined in the level-one model becomes an outcome to be predicted by school-level characteristics.<br />
*The level-one model (the student level), the outcome $y_{ij}$ for student $i$ in school $j (i = 1, 2, \cdots, n_{j}; j = 1, 2, \cdots, J),$ varies as a function of student characteristics, $X_{qij}, q = 1, \cdots, Q,$ and a random error $r_{ij}$ according to the linear regression model $y_{ij}=\beta_{0j}+\sum\beta_{qj}X_{qij} + r_{ij}, r_{ij} \sim N(0,\sigma^2),$ where $\beta_{0j}$ is the intercept and each $\beta_{qj}, q = 1, \cdots, Q,$ is a regression coefficient indicating the strength of association between $X_{qij},$ and the outcome within school $j$. The error of prediction of $y_{ij}$ by the $X$’s is $r_{ij}$, which is assumed normally distributed and, for simplicity, homoscedastic. <br />
*The level-two model: each regression coefficient, $\beta_{qj}, q = 0, 1, \cdots, Q$ defined by the level-one model, becomes an outcome variable to be predicted by school-level characteristics, $W_{sj}, s = 1, \cdots, S$, according to the regression model: $\beta_{qj}=\Theta_{q0}+\sum\Theta_{qs}W_{sj}+\mu_{qj}$, where $\Theta_{q0}$ is an intercept, each $\Theta_{qs}, s = 1, 2, \cdots, S$, is a regression slope specifying the strength of association between each $W_{sj}$ and the outcome $\beta_{qj}$; and the random effects are assumed sampled from a (Q+1)-variate normal distribution, where each $\mu_{qj}, q=0,1, \cdots, Q,$ has a mean of $0$ and variance $\tau_{qq}$, and the covariance between $\mu_{qj}$ and $\mu_{q'j}$ is $\tau_{qq'}$. There are a considerable number of options in modeling each $\beta_{qj}, q=0,1,\cdots, Q$. If every $W_{sj}$ is assumed to have no effect, the regression coefficients $\Theta_{qs}, s=1,2,\cdots, S,$ are set to zero. If the random effect $\mu_{qj}$ is also constrained to zero, then $\beta_{qj} = \Theta_{q0}$, i.e., $\beta_{qj}$ is fixed across all schools.<br />
*The one-way analysis of variance: <br />
**Classical approach of one-way ANOVA model: $y_{ij} = \mu + \alpha_{j} + r_{ij}, r_{ij} \sim N(0,\sigma^2)$, where $y_{ij}$ is the observation for subject $i$ assigned to level $j$ of the independent variable; $\mu$ is the grand mean; $\alpha_{j}$ is the effect associated with level $j$; and $r_{ij}$ is assumed normally distributed with mean $0$ and homogeneous variance $\sigma^2$. <br />
**Hierarchical linear model: start from $y_{ij}=\beta_{0j}+\sum\beta_{qj}X_{qij}+r_{ij}, r_{ij}\sim N(0,\sigma^2),$ and set every regression coefficient except the intercept to zero, which gives the level-one model: $y_{ij}=\beta_{0j}+r_{ij}, r_{ij} \sim N(0,\sigma^2)$. Here $\sigma^2$ is the within-group variance and $\beta_{0j}$ is the group mean, the only parameter that needs to be predicted at level two. The model for that parameter is similarly simplified so that all regression coefficients except the intercept are set to zero: $\beta_{0j}=\Theta_{00}+\mu_{0j}, \mu_{0j}\sim N(0,\tau^2)$. Here, $\Theta_{00}$ is the grand mean and $\mu_{0j}$ is the effect associated with level $j$. In the random effects model, the effect is typically assumed normally distributed with a mean of zero; in the fixed effects model, each $\mu_{0j}$ is a fixed constant. Substituting the level-two model into the level-one model gives the single model $y_{ij}=\Theta_{00}+\mu_{0j}+r_{ij}$, where $r_{ij} \sim N(0,\sigma^2)$ and $\mu_{0j} \sim N(0,\tau^2)$. This is clearly the one-way random effects ANOVA. <br />
**Hypothesis testing: a simple test of the null hypothesis of no group effects, i.e., $H_{0}: \tau^2=0$, is given by the statistic $H=\sum_{j} \hat{P}_{j}(\bar{y}_{\cdot j}-\bar{y}_{\cdot \cdot})^2$, where $\hat{P}_{j}=\frac{n_{j}}{\hat{\sigma}^2}$ is the estimated precision of the $j^{th}$ group mean. $H$ has a large-sample chi-square distribution with $J-1$ degrees of freedom under the null hypothesis. With balanced data, the sum of precision-weighted squared differences reduces to $H=\hat{P}\sum_{j}(\bar{y}_{\cdot j}-\bar{y}_{\cdot \cdot})^2 = (J-1)\frac{MS_{b}} {MS_{w}}$, revealing clearly that $\frac{H}{J-1}$ is the usual $F$ statistic for testing group differences in ANOVA. <br />
*The two-factor nested design: hierarchical linear models generalize maximum likelihood estimation to the case of unbalanced data, and covariates measured at each level can be discrete or continuous and can have either fixed or random effects. <br />
**The classical approach: $y_{ijk}=\mu+\alpha_{k}+\pi_{j(k)}+r_{ijk}, \pi_{j(k)} \sim N(0,\tau^2), r_{ijk} \sim N(0,\sigma^2)$, where $y_{ijk}$ is the outcome for subject $i$ nested within level $j$ of the random factor, which is, in turn, nested within level $k$ of the fixed factor $(i=1, \cdots, n_{jk}; j=1, \cdots, J_{k}; k=1,\cdots, K); \mu$ is the grand mean; $\alpha_{k}$ is the effect associated with the $k^{th}$ level of the fixed factor; $\pi_{j(k)}$ is the effect associated with the $j^{th}$ level of the random factor within the $k^{th}$ level of the fixed factor; and $r_{ijk}$ is the random (within cell) error. In the case of balanced data ($n_{jk}=n$ for every level of the random factor), the standard analysis of variance method and the method of restricted maximum likelihood coincide.<br />
**Analysis by means of a hierarchical linear model: as in the case of the one-way ANOVA, first set every regression coefficient except the intercept to zero, $y_{ij}=\beta_{0j} + r_{ij}, r_{ij} \sim N(0,\sigma^2)$. The level-two (between class) model is a regression model in which the class mean $\beta_{0j}$ is the outcome and the predictor is a contrast between the two treatments. The level-two model: $\beta_{0j}=\Theta_{00} + \Theta_{01}W_{j} + \mu_{0j}, \mu_{0j} \sim N(0,\tau^2)$, has $W_{j}=\frac{1}{2}$ for classes experiencing instructional method $1$ and $W_{j} =-\frac{1}{2}$ for classes experiencing instructional method $2$. Hence, the correspondences between the hierarchical model and the ANOVA model are $\Theta_{00} = \mu, \Theta_{01} = \alpha_{1} - \alpha_{2}, \mu_{0j} = \pi_{j(k)}$. The single model in this case would be $y_{ij} = \Theta_{00} + \Theta_{01}W_{j} + \mu_{0j} + r_{ij}$. With $K$ methods, $K-1$ contrasts must be included to represent the variation among the methods in order to duplicate the ANOVA results. <br />
*The two-factor crossed design (with replications within cells): with the mixed cases with one factor fixed and the other random. <br />
**The classical approach for the mixed model for the two-factor crossed design: $y_{ijk}=\mu+\alpha_{k}+\pi_{j}+(\alpha \pi)_{jk}+r_{ijk}, \pi_{j} \sim N(0,\tau^2), (\alpha \pi)_{jk} \sim N(0, \delta^2), r_{ijk} \sim N(0, \sigma^2).$ <br />
**Analysis by means of a hierarchical linear model: in the two-factor mixed crossed model, the fixed factor is specified in the level-one model with (K-1) X’s. With K=2, only one $X$ is specified, so the level-one model becomes $y_{ij} = \beta_{0j} + \beta_{1j}X_{1ij} + r_{ij}, r_{ij} \sim N(0, \sigma^2)$, where $y_{ij}$ is the outcome for the subject $i$ having tutor $j$, $\beta_{0j}$ is the mean for the $j^{th}$ tutor, and $\beta_{1j}$ is the contrast between the practice and no-practice conditions within tutor $j$. $X_{1ij}=1$ for subjects of tutor $j$ having practice and $-1$ for those having no practice, and $r_{ij}$ is the within cell error. To replicate the results of the ANOVA, the level-two model is formulated to allow these coefficients to vary randomly across tutors: $\beta_{0j} = \Theta_{00}+\mu_{0j},$ and $\beta_{1j} = \Theta_{10} + \mu_{1j},$ where $\Theta_{00}$ is the grand mean, $\mu_{0j}$ is the unique effect of tutor $j$ on the mean level of the outcome, $\Theta_{10}$ is the average value of the treatment contrast, and $\mu_{1j}$ is the unique effect of tutor $j$ on that contrast. The correspondences between the hierarchical model and the ANOVA model are $\Theta_{00} = \mu, \Theta_{10} = \frac{\alpha_{2}-\alpha_{1}}{2}, \mu_{0j}=\pi_{j}, \mu_{1j} = \frac{(\alpha \pi)_{j2}-(\alpha \pi)_{j1}}{2}, \tau_{00} = \tau^2, \tau_{11} = \frac{\delta^2}{2}$. In this case we obtain the single model: $y_{ij} = \Theta_{00} + \Theta_{10}X_{1ij} + \mu_{0j} + \mu_{1j}X_{1ij} + r_{ij}$. <br />
*Randomized block (and repeated measures) design: involve mixed models having both fixed and random effects. The blocks will typically be viewed as having random levels, and within blocks, there will commonly be a fixed effects design. The fixed effects may represent experimental treatment levels, or in longitudinal studies, involve polynomial trends. <br />
**Classical approach for a randomized block design where there are no between-block factors: $y_{ij} = \mu + \alpha_{i} + \pi_{j} + e_{ij}$, where $\pi_{j} \sim N(0,\tau^2)$, and $e_{ij} \sim N(0,\sigma^2)$. $\pi_{j}$ is the effect of block $j$ and $e_{ij}$ is the error, which has two components $(\alpha \pi)_{ij}$ and $r_{ij}.$ <br />
**Analysis by means of a hierarchical linear model for the randomized block design: the specification is similar to that for the two-factor crossed design discussed above. The difference is that in the randomized block design there is no replication within cells, so the model needs to be simplified. According to the level-one (within block) model, the outcome depends on polynomial trend components plus error: $y_{ij} = \beta_{0j} + \beta_{1j}(LIN)_{ij} + \beta_{2j}(QUAD)_{ij} + \beta_{3j}(CUBE)_{ij} + r_{ij}, r_{ij} \sim N(0,\sigma^2)$. $y_{ij}$ is the outcome for the subject $i$ in block $j$; $\beta_{0j}$ is the mean for the block $j$; $(LIN)_{ij}$ assigns the linear contrast values $(-1.5, -0.5, 0.5, 1.5)$ to durations (1,2,3,4) respectively; $(QUAD)_{ij}$ assigns the quadratic contrast values (0.5, -0.5, -0.5, 0.5); $(CUBE)_{ij}$ assigns the cubic contrast values (-0.5, 1.5, -1.5, 0.5); $\beta_{1j}, \beta_{2j}$ and $\beta_{3j}$ are the linear, quadratic, and cubic regression parameters, respectively, and $r_{ij}$ is the within cell error. With four observations per block and four regression coefficients ($\beta$’s) in the level-one model, no degrees of freedom remain to estimate the within cell error. Assuming that the trends don’t vary across blocks, we can treat the trend parameters as fixed, yielding the level-two model: $\beta_{0j} = \Theta_{00} + \mu_{0j}, \mu_{0j} \sim N(0,\tau^2)$; $\beta_{1j}=\Theta_{10}$; $\beta_{2j} = \Theta_{20}$; $\beta_{3j} = \Theta_{30}$, where $\Theta_{00}$ is the grand mean, $\mu_{0j}$ is the unique effect of block $j$ assumed normally distributed with mean zero and variance $\tau^2$. The coefficients, $\beta_{1j}, \beta_{2j}$ and $\beta_{3j}$ are constrained to be invariant across blocks. 
The correspondences between the hierarchical model and the ANOVA model are $\Theta_{00}=\mu, \Theta_{10} = (-1.5\alpha_{1} - 0.5\alpha_{2} + 0.5\alpha_{3} + 1.5\alpha_{4}); \Theta_{20} = (0.5\alpha_{1} - 0.5\alpha_{2} - 0.5\alpha_{3} + 0.5\alpha_{4}); \Theta_{30} = (-0.5\alpha_{1} + 1.5\alpha_{2} - 1.5\alpha_{3} + 0.5\alpha_{4})$, and $\mu_{0j} = \pi_{j}$. In this case, we can combine the equations above and yield the single model: $y_{ij} = \Theta_{00} + \Theta_{10}(LIN)_{ij} + \Theta_{20}(QUAD)_{ij} + \Theta_{30}(CUBE)_{ij} + \mu_{0j} + r_{ij}.$ This model assumes the variance-covariance matrix of repeated measures is compound symmetric: $Var(Y_{ij}) = \tau^2 + \sigma^2; Cov(Y_{ij}, Y_{i'j}) = \tau^2.$<br />
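The two-level logic described above (level-one coefficients that become level-two outcomes) can be illustrated with a rough two-step sketch in base R. This is not how a hierarchical linear model is actually estimated, since the levels are fitted simultaneously by maximum likelihood; it only makes the structure concrete. The data are simulated, and all names and parameter values (40 schools, 25 students, $\Theta_{00}=50$, $\Theta_{01}=4$, $\Theta_{10}=2$) are assumptions chosen for illustration.

```r
set.seed(1)
J <- 40; n <- 25                         # hypothetical: 40 schools, 25 students each
school <- rep(seq_len(J), each = n)
W <- rep(c(-0.5, 0.5), length.out = J)   # school-level predictor W_j (contrast coded)

# Level two: beta_0j = Theta_00 + Theta_01 * W_j + mu_0j ; beta_1j = Theta_10 + mu_1j
beta0 <- 50 + 4 * W + rnorm(J, sd = 2)
beta1 <- 2 + rnorm(J, sd = 0.5)

# Level one: y_ij = beta_0j + beta_1j * X_ij + r_ij
X <- rnorm(J * n)
y <- beta0[school] + beta1[school] * X + rnorm(J * n, sd = 1)

# Step 1: a separate regression within each school estimates beta_0j and beta_1j
coefs <- t(sapply(split(data.frame(y, X), school),
                  function(d) coef(lm(y ~ X, data = d))))
colnames(coefs) <- c("b0", "b1")

# Step 2: the estimated coefficients become outcomes predicted by W_j
level2 <- lm(coefs[, "b0"] ~ W)
theta <- c(Theta00 = unname(coef(level2)[1]),
           Theta01 = unname(coef(level2)[2]),
           Theta10 = mean(coefs[, "b1"]))
print(round(theta, 2))
```

The printed estimates should land near the simulated values; a simultaneous fit (for example with lmer from the lme4 package used in the Problems section) would additionally estimate $\tau_{00}$, $\tau_{11}$ and $\sigma^2$.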
<br />
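The balanced-data identity from the hypothesis-testing item, $\frac{H}{J-1}=F$, can be checked numerically. The sketch below uses simulated data (6 groups of 5, an assumption for illustration) and base R only; it computes $H$ from the precision-weighted squared deviations and compares it with the $F$ statistic reported by a one-way ANOVA.

```r
set.seed(2)
J <- 6; n <- 5                                   # balanced: 6 groups, 5 observations each
g <- factor(rep(seq_len(J), each = n))
y <- 10 + rep(rnorm(J, sd = 2), each = n) + rnorm(J * n)

group_means <- tapply(y, g, mean)
grand_mean  <- mean(y)
sigma2_hat  <- sum((y - ave(y, g))^2) / (J * (n - 1))   # pooled within-group variance (MS_w)

P_hat <- n / sigma2_hat                          # estimated precision of each group mean
H <- sum(P_hat * (group_means - grand_mean)^2)

F_stat <- anova(lm(y ~ g))[["F value"]][1]
print(c(H = H, F_from_H = H / (J - 1), F_anova = F_stat))
pchisq(H, df = J - 1, lower.tail = FALSE)        # large-sample p-value for H0: tau^2 = 0
```

The second and third printed values agree, which is the correspondence between the hierarchical test and the classical ANOVA $F$ test in the balanced case.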
===Applications===<br />
<br />
[http://jom.sagepub.com/content/23/6/723.short This article] presented an overview of the logic and rationale of hierarchical linear models. Due to the inherently hierarchical nature of organizations, data collected in organizations consist of nested entities. More specifically, individuals are nested in work groups, work groups are nested in departments, departments are nested in organizations, and organizations are nested in environments. Hierarchical linear models provide a conceptual and statistical mechanism for investigating and drawing conclusions regarding the influence of phenomena at different levels of analysis. This introductory paper: (a) discusses the logic and rationale of hierarchical linear models, (b) presents a conceptual description of the estimation strategy, and (c) using a hypothetical set of research questions, provides an overview of a typical series of multi-level models that might be investigated.<br />
<br />
[http://jeb.sagepub.com/content/13/2/85.short This article] provided a review of the educational application of hierarchical linear models. The search for appropriate statistical methods for hierarchical, multilevel data has been a prominent theme in educational statistics over the past 15 years. As a result of this search, an important class of models, termed hierarchical linear models by this review, has emerged. In the paradigmatic application of such models, observations within each group (e.g., classroom or school) vary as a function of group-level or “microparameters.” However, these microparameters vary randomly across the population of groups as a function of “macroparameters.” Research interest has focused on estimation of both micro- and macroparameters. This paper reviews estimation theory and application of such models. Also, the logic of these methods is extended beyond the paradigmatic case to include research domains as diverse as panel studies, meta-analysis, and classical test theory. Microparameters to be estimated may be as diverse as means, proportions, variances, linear regression coefficients, and logit linear regression coefficients. Estimation theory is reviewed from Bayes and empirical Bayes viewpoints and the examples considered involve data sets with two levels of hierarchy.<br />
<br />
===Software===<br />
[http://faculty.smu.edu/kyler/training/AERA_overheads.pdf Training AERA Overheads]<br />
<br />
[http://www.r-tutor.com/gpu-computing/rbayes/rhierlmc GPU Computing]<br />
<br />
[http://www.r-bloggers.com/hierarchical-linear-models-and-lmer/ Hierarchical-Linear-Models]<br />
<br />
[http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-mixed-models.pdf Fox-Companion Mixed Models]<br />
<br />
===Problems===<br />
<br />
Example: using data Dyestuff<br />
library(lme4)<br />
str(Dyestuff)<br />
<br />
'data.frame': 30 obs. of 2 variables:<br />
$\$$ Batch: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 2 2 2 2 2 ...<br />
$\$$ Yield: num 1545 1440 1440 1520 1580 ...<br />
<br />
summary(Dyestuff)<br />
<br />
Batch Yield <br />
A:5 Min. :1440 <br />
B:5 1st Qu.:1469 <br />
C:5 Median :1530 <br />
D:5 Mean :1528 <br />
E:5 3rd Qu.:1575 <br />
<br />
plot(Dyestuff)<br />
<br />
<br />
[[Image:SMHS_Fig1_Hierarchical_Linear_Models.png|500px]]<br />
<br />
<br />
model <- lmer(Yield ~ 1+(1|Batch),Dyestuff)<br />
model<br />
Linear mixed model fit by REML ['lmerMod']<br />
Formula: Yield ~ 1 + (1 | Batch) <br />
::Data: Dyestuff <br />
REML criterion at convergence: 319.6543 <br />
Random effects:<br />
Groups Name Std.Dev.<br />
Batch (Intercept) 42.00 <br />
Residual 49.51 <br />
Number of obs: 30, groups: Batch, 6<br />
Fixed Effects:<br />
(Intercept) <br />
1527<br />
<br />
summary(model)<br />
Linear mixed model fit by REML ['lmerMod']<br />
Formula: Yield ~ 1 + (1 | Batch) <br />
Data: Dyestuff <br />
<br />
REML criterion at convergence: 319.6543 <br />
<br />
Random effects:<br />
Groups Name Variance Std.Dev.<br />
Batch (Intercept) 1764 42.00 <br />
Residual 2451 49.51 <br />
Number of obs: 30, groups: Batch, 6<br />
<br />
Fixed effects:<br />
Estimate Std. Error t value<br />
(Intercept) 1527.50 19.38 78.8<br />
<br />
fitted(model)<br />
[1] 1509.893 1509.893 1509.893 1509.893 1509.893 1527.891 1527.891 1527.891 1527.891<br />
1527.891 1556.062 1556.062<br />
[13] 1556.062 1556.062 1556.062 1504.415 1504.415 1504.415 1504.415 1504.415 1584.233<br />
1584.233 1584.233 1584.233<br />
[25] 1584.233 1482.505 1482.505 1482.505 1482.505 1482.505<br />
<br />
<br />
model.up <- update(model,REML=FALSE) ## refit by maximum likelihood (ML). For this balanced one-way classification the fixed-effect estimate matches the REML fit, but the ML estimate of the Batch standard deviation is smaller (37.26 versus 42.00).<br />
model.up<br />
<br />
Linear mixed model fit by maximum likelihood ['lmerMod']<br />
Formula: Yield ~ 1 + (1 | Batch) <br />
Data: Dyestuff <br />
AIC BIC logLik deviance <br />
333.3271 337.5307 -163.6635 327.3271 <br />
Random effects:<br />
Groups Name Std.Dev.<br />
Batch (Intercept) 37.26 <br />
Residual 49.51 <br />
Number of obs: 30, groups: Batch, 6<br />
Fixed Effects:<br />
(Intercept) <br />
1527 <br />
<br />
<br />
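The numbers in the two fits above are connected by simple arithmetic for a balanced one-way design. The sketch below takes the rounded REML variance components from the printed output and reproduces the standard error of the intercept and the ML batch standard deviation; the intraclass correlation is an extra summary that the output does not print. The formulas are the standard balanced one-way random-effects results, not something taken from the lme4 internals.

```r
# Values copied from the REML output above (rounded as printed)
tau2   <- 1764     # between-batch variance
sigma2 <- 2451     # residual (within-batch) variance
J <- 6; n <- 5     # 6 batches, 5 observations per batch

icc  <- tau2 / (tau2 + sigma2)            # share of total variance due to batches
ms_b <- n * tau2 + sigma2                 # between-batch mean square implied by the REML fit
se_intercept <- sqrt(ms_b / (n * J))      # standard error of the grand mean

# ML replaces MS_b by (J - 1) / J * MS_b, so its batch variance is smaller
tau2_ml <- ((J - 1) / J * ms_b - sigma2) / n

print(round(c(icc = icc, se = se_intercept, sd_batch_ml = sqrt(tau2_ml)), 2))
# icc 0.42, se 19.38, sd_batch_ml 37.26
```

The last two values match the `Std. Error` of the intercept in `summary(model)` and the `Batch` standard deviation in `model.up`, which shows why ML and REML agree on the fixed effect but not on the variance component.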
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Multilevel_model Multilevel Model Wikipedia]<br />
<br />
[http://www.unt.edu/rss/class/Jon/MiscDocs/Raudenbush_1993.pdf Hierarchical Linear Models and Experimental Design]<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_HLM}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries&diff=14427SMHS TimeSeries2014-10-17T13:41:02Z<p>Clgalla: /* ## not sure about the mod here?! */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Time Series Analysis ==<br />
<br />
===Overview===<br />
Time series data are a sequence of data points measured at successive points in time, spaced at uniform intervals. Time series analysis is commonly used in a variety of studies, such as monitoring industrial processes and tracking business metrics. In this lecture, we present a general introduction to time series data and introduce some of the most commonly used techniques in the rich and rapidly growing field of time series modeling and analysis for extracting meaningful statistics and other characteristics of the data. We also illustrate the application of time series analysis techniques with examples in R. <br />
<br />
===Motivation===<br />
Economic data like daily share price, monthly sales, annual income or physical data like daily temperature, ECG readings are all examples of time series data. Time series is just an ordered sequence of values of a variable measured at equally spaced time intervals. So, what would be the effective ways to measure time series data? How can we extract information from time series data and make inference afterwards? <br />
<br />
===Theory===<br />
Time series: a sequence of data points measured at successive points in time, spaced at uniform intervals.<br />
*Components: (1) trend component: long-run increase or decrease over time (overall upward or downward movement), typically refers to data taken over a long period of time; the trend can be upward or downward, linear or non-linear. (2) seasonal component: short-term regular wave-like patterns, usually refers to data observed within one year, which may be measured monthly or quarterly. (3) cyclical component: long-term wave-like patterns; regularly occur but may vary in length; often measured peak to peak or trough to trough. (4) irregular component: unpredictable, random, ‘residual’ fluctuation, which may be due to random variations of nature or accidents or unusual events; ‘noise’ in the time series.<br />
*Probabilistic models: The components still add or multiply together across time.<br />
[[Image:SMHS Fig 1 Times Series Analysis.png|800px]]<br />
*Simple Time Series Models: <br />
**Take $T_{t}$ as the trend component, $S_{t}$ as the seasonal component, $C_{t}$ as the cyclical component and $I_{t}$ as the irregular or random component. Then we have an additive model as: $x_{t}=T_{t}+S_{t}+C_{t}+I_{t};$ the multiplicative model says $x_{t}=T_{t}*S_{t}*C_{t}*I_{t};$ sometimes, we take logs on both sides of the multiplicative model to make it additive, $\log x_{t}=\log\left ( T_{t}*S_{t}*C_{t}*I_{t} \right )$, which can be further noted as $x_{t}{}'=T_{t}{}'+S_{t}{}'+C_{t}{}'+I_{t}{}'$.<br />
**Most time series models are written in terms of an (unobserved) white noise process, which is often assumed to be Gaussian: $x_{t}=w_{t}$, where $w_{t}\sim WN\left ( 0,\sigma^{2} \right )$, that is, $E\left ( w_{t} \right )=0, Var\left(w_{t} \right )=\sigma ^{2}$. Examples of probabilistic models include: (1) the autoregressive model $x_{t}=0.6x_{t-1}+w_{t};$ (2) the moving average model $x_{t}= \frac{1}{3}\left(w_{t}+w_{t-1}+w_{t-2}\right).$<br />
**Fitting time series models: a time series model generates a process whose pattern can then be matched in some way to the observed data. Since perfect matches are impossible, more than one model may be appropriate for a set of data. To decide which model is appropriate: let patterns in the data suggest candidates; assess within-sample adequacy (diagnostics, tests) and out-of-sample adequacy (forecast evaluation); and simulate from the suggested model and compare with the observed data. Next, we turn to some theoretical aspects, such as how to characterize a time series, and then investigate some special processes. <br />
**Characterizing time series (the mean and covariance of time series): suppose the data are $x_{1},x_{2},\cdots ,x_{t},$ note for a regularly spaced time series, $x_{1}$ is observed at time $t_{0}$, $x_{2}$ is observed at $t_{0}+\Delta$, and $x_{t}$ is observed at $t_{0}+\left( t-1\right) \Delta$; the expected value of $x_{t}$ is $\mu _{t}=E\left [ x_{t} \right ]$; the covariance function is $\gamma \left(s,t \right )=cov\left(x_{s},x_{t} \right )$. Note that we don’t distinguish between random variables and their realizations. Properly, $x_{1},x_{2},\cdots ,x_{t}$ is a realization of some process $X_{1},X_{2},\cdots ,X_{t}$ and so $\mu _{t}=E\left [ X_{t} \right ]$ etc.<br />
**Characterizing time series data: weak stationarity: usually we have only one realization of the time series, so there are not enough data to estimate all the means and covariances. Stationarity reduces the number of quantities to estimate: if the mean is constant and the covariance depends only on the separation in time, then the time series is (weakly) stationary. In this case $\mu_{t}=\mu$ and $\gamma\left(s,t\right)=\gamma\left(t-s\right)$. Stationary processes are much easier to handle statistically. Real time series are usually non-stationary, but can often be transformed to give a stationary process.<br />
**Estimation of $\gamma(h)$ (estimating the autocovariance function): (1) for a stationary process: $\gamma(h)=\gamma(s,s+h)=cov(x_{s},x_{s+h})$ for any $s$. $\gamma (h)$ is called the autocovariance function because it measures the covariance between lags of the process $x_{t}$; (2) we observe $T-h$ pairs separated at lag $h$, namely $\left(x_{1},x_{h+1}\right),\cdots,\left(x_{T-h},x_{T}\right)$; the sample autocovariance function is $\hat{\gamma}(h)=\frac{1}{T}\sum_{t=1}^{T-h}\left(x_{t}-\bar{X} \right )\left(x_{t+h}-\bar{X} \right )$; note that we divide by $T$ although there are $T-h$ terms in the sum. The autocorrelation function (ACF) is $\rho (h)=\frac{\gamma(h)}{\gamma(0)}$.<br />
**$\rho(h)$ properties: for $x_{t}$ stationary, $\gamma(h)=E\left[(x_{t}-\mu)(x_{t+h}-\mu)\right]$, where $\gamma(h)$ is an even function, that is $\gamma(h)=\gamma(-h)$; $\left | \gamma(h) \right |\leq \gamma(0)$, that is $\left|\rho(h)\right|\leq1, h=\pm 1,\pm 2,\cdots$. The autocorrelation matrix, $P(h)$, is positive semi-definite, which means that the autocorrelations cannot be chosen arbitrarily but must have some relations among themselves. <br />
**To study a set of time series data: given that we only have one realization, we need to match plot of data from proposed model with data (we can see broad/main patterns); match acf, pacf of proposed model with data (we can see if data are stationary uncorrelated, stationary autocorrelated or non-stationary); determine if the proposed model is appropriate; obtain forecasts from proposed model; compare across models that seem appropriate. <br />
**Large sample distribution of the $ACF$ for a $WN$ series: if $x_{t}$ is $WN$ then for $n$ large, the sample $ACF$, $\hat{\rho}_{x}(h), h=1,2,\cdots,H$, where $H$ is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation given by $\sigma_{\hat{\rho}_{x}(h)}=\frac{1}{\sqrt{n}}$. A rule of thumb for assessing whether sample autocorrelations resemble those for a $WN$ series is by checking if approximately 95% of the sample autocorrelations are within the interval $0\pm \frac{2}{\sqrt{n}}$. <br />
**White noise, or stationary uncorrelated process: we say that a time series is white noise if it is weakly stationary with $E\left[w_{t} \right ]=0$ and with $ACF$ $\rho(h)=\begin{cases} 1 & \text{ if } h=0 \\ 0 & \text{ if } h\neq 0 \end{cases}$, i.e. white noise consists of a sequence of uncorrelated, zero mean random variables with common variance $\sigma^{2}$.<br />
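The sample autocovariance formula and the white-noise rule of thumb above can be written out directly. The sketch below is a minimal illustration on simulated white noise (the series length of 200 and the helper name `sample_acov` are assumptions); it uses $n$ where the formula above writes $T$, and compares the hand-computed values with R's built-in `acf()`, which uses the same divisor.

```r
set.seed(3)
n <- 200
x <- rnorm(n)                                    # simulated white noise

# gamma_hat(h) = (1/n) * sum_{t=1}^{n-h} (x_t - xbar)(x_{t+h} - xbar)
sample_acov <- function(x, h) {
  n <- length(x); xbar <- mean(x)
  sum((x[1:(n - h)] - xbar) * (x[(1 + h):n] - xbar)) / n
}

H <- 20
gamma_hat <- sapply(0:H, function(h) sample_acov(x, h))
rho_hat   <- gamma_hat / gamma_hat[1]            # rho_hat(h) = gamma_hat(h) / gamma_hat(0)

# Agreement with the built-in estimator
builtin <- acf(x, lag.max = H, type = "covariance", plot = FALSE)$acf[, 1, 1]
print(all.equal(gamma_hat, builtin))

# Share of lags 1..H inside the white-noise band 0 +/- 2 / sqrt(n)
print(mean(abs(rho_hat[-1]) <= 2 / sqrt(n)))
```

For a white noise series the last value should be close to 0.95; a much smaller share suggests that the series is autocorrelated.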
<br />
*Example of a simple moving average model in R: $v_{t}=\frac{1}{3} (w_{t-1}+w_{t}+w_{t+1})$. <br />
<br />
R CODE: <br />
w <- ts(rnorm(150)) ## generate 150 data from the standard normal distribution<br />
v <- ts(filter(w, sides=2, rep(1/3, 3))) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot(w,main='white noise')<br />
plot(v,main='moving average')<br />
<br />
or can use code:<br />
<br />
w <- rnorm(150) ## generate 150 data from the standard normal distribution<br />
v <- filter(w, sides=2, rep(1/3, 3)) ## moving average model<br />
par(mfrow=c(2,1))<br />
plot.ts(w,main='white noise')<br />
plot.ts(v,main='moving average')<br />
<br />
[[Image:SMHS Fig2 Timeseries Analysis.png|500px]]<br />
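To make explicit what the `sides` argument of `filter()` does in the code above, the following sketch compares the filter output with averages computed by hand. Writing `stats::filter` is a precaution (an assumption about the reader's session) because other packages, such as dplyr, define their own `filter`.

```r
set.seed(4)
w <- rnorm(150)
v_two <- stats::filter(w, rep(1/3, 3), sides = 2)   # centred:  (w[t-1] + w[t] + w[t+1]) / 3
v_one <- stats::filter(w, rep(1/3, 3), sides = 1)   # trailing: (w[t] + w[t-1] + w[t-2]) / 3

t0 <- 10
print(c(v_two[t0], mean(w[(t0 - 1):(t0 + 1)])))     # the two values agree
print(c(v_one[t0], mean(w[(t0 - 2):t0])))           # the two values agree

print(which(is.na(as.numeric(v_two))))   # 1 150: centred filter is undefined at both ends
print(which(is.na(as.numeric(v_one))))   # 1 2: trailing filter is undefined at the first two points
```

The centred version matches the model $v_{t}=\frac{1}{3}(w_{t-1}+w_{t}+w_{t+1})$ stated above, while the trailing version uses only current and past values.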
<br />
## sums based on WN processes<br />
'''ts.plot(w,v,lty=2:1,col=1:2,lwd=1:2)'''<br />
<br />
[[Image:SMHS Fig3 TimeSeries Analysis.png|500px]]<br />
<br />
The ACF and the sample ACF of some stationary processes<br />
*White noise $x(t)=w(t)$ and the moving-average process $x(t)=w(t)+\frac{1}{3} \left(w(t-1)+w(t-2)+w(t-3) \right)$: plot the theoretical ACF of each as a bar plot:<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(0,lag.max=10),main='white noise=w(t)')<br />
barplot(ARMAacf(ma=c(1/3,1/3,1/3),lag.max=10),main='acfx(t)=w(t)+1/3w(t-1)+1/3w(t-2)+1/3w(t-3)')<br />
<br />
[[Image:SMHS_Fig4_TimeSeries_Analysis.png|500px]]<br />
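The bars produced by `ARMAacf()` for the moving-average process can be reproduced from the standard formula $\rho(h)=\frac{\sum_{j}\theta_{j}\theta_{j+h}}{\sum_{j}\theta_{j}^{2}}$ for $h\leq q$ and $\rho(h)=0$ beyond lag $q$, with $\theta_{0}=1$. The helper name `ma_acf` below is hypothetical and the sketch is only meant to show where the values come from.

```r
theta <- c(1, 1/3, 1/3, 1/3)    # coefficients on w(t), w(t-1), w(t-2), w(t-3)

# rho(h) = sum_j theta_j * theta_{j+h} / sum_j theta_j^2 for h <= q, and 0 beyond lag q
ma_acf <- function(theta, h) {
  q <- length(theta) - 1
  if (h > q) return(0)
  sum(theta[1:(q + 1 - h)] * theta[(1 + h):(q + 1)]) / sum(theta^2)
}

by_hand <- sapply(0:6, function(h) ma_acf(theta, h))
builtin <- unname(ARMAacf(ma = rep(1/3, 3), lag.max = 6))
print(rbind(by_hand, builtin))   # 1, 5/12, 1/3, 1/4, then zeros from lag 4 onward
```

The ACF cuts off after lag 3, which is the signature of a moving-average process of order 3.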
<br />
*Theoretical acf of some processes: Autoregressive process, $x_{t}=0.9x_{t-1}+w_{t}.$<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
barplot(ARMAacf(c(0.9),lag.max=30),main='acf x(t)=0.9x(t-1)+w(t)')<br />
barplot(ARMAacf(c(1,-0.9),lag.max=30),main='acf x(t)=x(t-1)-0.9x(t-2)+w(t)')<br />
<br />
<br />
[[Image:SMHA_Fig5_TimeSeries_Analysis.png|500px]]<br />
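Two facts sit behind the shapes of these bar plots, and both can be checked in a few lines. For the AR(1) process the theoretical ACF is $\rho(h)=0.9^{h}$. For the AR(2) process $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$, stationarity requires the roots of $1-z+0.9z^{2}$ to lie outside the unit circle, and complex roots produce the damped oscillation seen in the ACF. This is a sketch using base R only.

```r
# AR(1): the theoretical ACF decays geometrically, rho(h) = phi^h
phi <- 0.9
print(all.equal(unname(ARMAacf(ar = phi, lag.max = 10)), phi^(0:10)))

# AR(2): x(t) = x(t-1) - 0.9 x(t-2) + w(t); roots of 1 - z + 0.9 z^2
roots <- polyroot(c(1, -1, 0.9))
print(Mod(roots))                      # both moduli equal sqrt(1 / 0.9), just above 1
print(2 * pi / abs(Arg(roots[1])))     # period (in lags) of the damped oscillation
```

Because the moduli are only slightly larger than 1, the oscillation in the AR(2) ACF dies out slowly.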
<br />
*Recognizing patterns in probabilistic data <br />
**Example 1: compare white noise and an autoregressive process ($x_{t}=w_{t}$ and $x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$)<br />
<br />
RCODE:<br />
w1 <- rnorm(200) ## generate 200 data from the standard normal distribution<br />
x <- filter(w1, filter=c(1,-0.9),method='recursive')[-(1:50)]<br />
w2 <- w1[-(1:50)]<br />
xt <- ts(x)<br />
par(mfrow=c(2,2))<br />
plot.ts(w2,main='white noise')<br />
acf(w2,lag.max=25)<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig6_TimeSeries_Analysis.png|500px]]<br />
<br />
*The white noise series shows almost no autocorrelation, while the autoregressive series is strongly autocorrelated.<br />
<br />
Example 2: AR and MA ($x_{t}=x_{t-1}-0.9x_{t-2}+w_{t}$ and $v_{t}=\frac{1}{3}\left(w_{t-1}+w_{t}+w_{t+1} \right)$)<br />
<br />
RCODE:<br />
v <- filter(w1, sides=2, rep(1/3, 3))<br />
vt <- ts(v)[2:199]<br />
par(mfrow=c(2,2))<br />
plot.ts(xt,main='autoregression')<br />
acf(xt,lag.max=25)<br />
plot.ts(vt,main='moving average')<br />
acf(vt,lag.max=25)<br />
<br />
[[Image:SMHS_Fig7_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
*ACF, PACF and some examples of data in R<br />
<br />
RCODE:<br />
data <- c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30,33,29,27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27,39,23)<br />
par(mfrow=c(1,2))<br />
plot(data)<br />
qqnorm(data)<br />
qqline(data)<br />
<br />
[[Image:SMHS_Fig8_TimeSeries_Analysis.png|500px]]<br />
<br />
data1<-data[c(-31,-55)]<br />
par(mfrow=c(1,2))<br />
plot.ts(data1)<br />
qqnorm(data1)<br />
qqline(data1)<br />
<br />
[[Image:SMHS_Fig9_TimeSeries_Analysis.png|500px]]<br />
<br />
library(astsa)<br />
acf2(data1)<br />
<br />
*Examining the data: <br />
Diagnostics of outliers<br />
<br />
RCODE:<br />
dat<- <br />
c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,<br />
25,26,26,25,-44,23,21,30, 33,29, 27,29,28,22,26,27,16,31,29,36,32,28,40,19,<br />
37,23,32,29,-2,24,25,27,24,16,29,20,28,27, 39,23)<br />
plot.ts(dat)<br />
<br />
<br />
[[Image:SMHS_Fig10_TimeSeries_Analysis.png|500px]]<br />
<br />
<br />
The plot makes it easy to see that this data series contains outliers.<br />
<br />
# what if we use a boxplot here?<br />
boxplot(dat)<br />
<br />
[[Image:SMHS_Fig11_TimeSeries_Analysis.png|500px]]<br />
<br />
identify(rep(1,length(dat)),dat) ## identify outliers by clicking on them in the plot<br />
## the outliers identified are observations 31 and 55:<br />
dat[31]<br />
[1] -44<br />
dat[55]<br />
[1] -2<br />
datr <- dat[c(-31,-55)] ## this is the dat with outlier removed<br />
qqnorm(datr)<br />
qqline(datr)<br />
<br />
<br />
[[Image:SMHS_Fig12_TimeSeries.png|500px]]<br />
<br />
<br />
## this plot shows that the distribution of the new data series (right) is closer to normal compared to the original data (left).<br />
## The effect of the outliers can be seen by comparing the ACF and PACF of the two data series:<br />
library(astsa)<br />
acf2(dat)<br />
acf2(datr)<br />
[[Image:SMHS_Fig13_TimeSeries.png|800px]]<br />
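<br />
The interactive identify() step above needs a graphics window. As a non-interactive alternative, the sketch below flags candidate outliers with the boxplot rule (points more than 1.5 times the interquartile range beyond the hinges). The names out_values and out_index are illustrative assumptions, and the rule may flag points other than the two removed above, so the result should be checked against the plots.<br />

```r
# Same series as in the RCODE above
dat <- c(28,22,36,26,28,28,26,24,32,30,27,24,33,21,36,32,31,25,24,25,28,36,27,32,34,30,
         25,26,26,25,-44,23,21,30,33,29,27,29,28,22,26,27,16,31,29,36,32,28,40,19,
         37,23,32,29,-2,24,25,27,24,16,29,20,28,27,39,23)

# boxplot.stats() returns, in its "out" component, the values lying
# beyond the whiskers of the boxplot
out_values <- boxplot.stats(dat)[["out"]]

# Positions of the flagged values in the original series
out_index <- which(dat %in% out_values)

out_values
out_index
```

Both values removed by hand, dat[31] and dat[55], are among the flagged points.<br />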
<br />
*Diagnostics of the error: consider the model $x_{t}=\mu+w_{t}$. Check whether the mean of the probability distribution of the error is 0; whether the variance of the error is constant across time; whether the errors are independent and uncorrelated; and whether the probability distribution of the error is normal. <br />
:::Residual plot for functional form<br />
<br />
[[Image:SMHS_Fig14_TimeSeries.png|500px]]<br />
<br />
:::Residual plot for equal variance<br />
<br />
[[Image:SMHS_Fig15_TimeSeries.png|500px]]<br />
<br />
:::Check if residuals are correlated by plotting $e_{t}$ vs. $e_{t-1}$: there are three possible cases – positive autocorrelation, negative autocorrelation and no autocorrelation. <br />
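<br />
The error checks listed above can be sketched numerically. The example below simulates a series from $x_{t}=\mu+w_{t}$ (the values $\mu=10$, $n=200$ and the seed are assumptions made for illustration) and computes the lag-1 correlation between $e_{t}$ and $e_{t-1}$, a rough constant-variance check, and a Ljung-Box test.<br />

```r
set.seed(123)
n <- 200
x <- 10 + rnorm(n)        # x_t = mu + w_t with mu = 10 and Gaussian white noise

e <- x - mean(x)          # residuals after estimating mu by the sample mean
mean_e <- mean(e)         # zero up to rounding, by construction

# Correlation between e_t and e_{t-1}: near 0 suggests no autocorrelation,
# clearly positive or negative values suggest the other two cases
r1 <- cor(e[-n], e[-1])

# Rough constant-variance check: compare the two halves of the series
half_var_ratio <- var(e[1:100]) / var(e[101:200])

# Ljung-Box test of the null hypothesis of no autocorrelation up to lag 10
lb_pvalue <- Box.test(e, lag = 10, type = "Ljung-Box")[["p.value"]]

c(mean = mean_e, lag1 = r1, var_ratio = half_var_ratio, LjungBox_p = lb_pvalue)
```

A scatterplot of e[-1] against e[-n] shows the same lag-1 relationship graphically.<br />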
<br />
*Smoothing: a series consists of some structure plus random variation, so smoothing identifies the structure by averaging out the noise. <br />
::Moving averages:<br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma9<-filter(datr,sides=2,rep(1/9,9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma9,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma9<br />
<br />
[[Image:SMHS_Fig16_TimeSeries.png|500px]]<br />
<br />
*Apparently, the wider the moving-average window, the smoother the time series plot of the data series. <br />
ma3<-filter(datr,sides=2,rep(1/3,3))<br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma3,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma3<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=3,ylab='data') ## green line indicates ma31<br />
<br />
[[Image:SMHS_Fig17_TimeSeries.png|500px]]<br />
<br />
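Two-sided and one-sided filters with the same coefficients seem to be similar in their smoothing effect; the one-sided filter uses only current and past values, so its curve is shifted in time relative to the two-sided one. <br />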
<br />
## the effect of equal vs. unequal weights? <br />
ma31<-filter(datr,sides=1,rep(1/3,3))<br />
mawt<-filter(datr,sides=1,c(1/9,1/9,7/9))<br />
plot.ts(datr,ylim=c(15,50),ylab='data')<br />
par(new=T)<br />
plot.ts(ma31,ylim=c(15,50),col=2,ylab='data') ## red line indicates ma31<br />
par(new=T)<br />
plot.ts(mawt,ylim=c(15,50),col=3,ylab='data') ## green line indicates mawt<br />
<br />
[[Image:SMHS_Fig18_TimeSeries.png|500px]]<br />
<br />
The smoothing effect of filter with equal weights seems to be bigger than the filter with unequal weights.<br />
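<br />
To make the filter arguments used above concrete, here is a small sketch on a short toy vector (the vector x is an assumption chosen so the arithmetic can be checked by hand). It shows how sides=2 centers the window, how sides=1 uses only current and past values, and which observation each coefficient multiplies.<br />

```r
x <- c(2, 4, 6, 8, 10, 12)
w <- rep(1/3, 3)

# sides = 2: the window is centered, y[i] = (x[i-1] + x[i] + x[i+1]) / 3
two_sided <- stats::filter(x, w, sides = 2)   # NA 4 6 8 10 NA

# sides = 1: the window covers the current and the two previous values,
# y[i] = (x[i] + x[i-1] + x[i-2]) / 3
one_sided <- stats::filter(x, w, sides = 1)   # NA NA 4 6 8 10

# Coefficients are applied in reverse time order: the first coefficient
# multiplies x[i], the last one multiplies the oldest value x[i-2]
unequal <- stats::filter(x, c(1/9, 1/9, 7/9), sides = 1)
unequal[3]                                    # (1/9)*6 + (1/9)*4 + (7/9)*2 = 24/9
```

Note that with sides=1 the weight vector c(1/9,1/9,7/9) places the largest weight on the oldest of the three observations, which is worth keeping in mind when interpreting the unequal-weights plot.<br />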
<br />
*Simple Exponential Smoothing (SES): premise – the most recent observations might have the highest predictive value: $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$, that is, new forecast = $\alpha \times$ (actual value) $+ (1-\alpha) \times$ (previous forecast). This method is appropriate for series with no trend or seasonality.<br />
<br />
*For series with trend, we can use the Holt-Winters exponential smoothing additive model $\hat{x}_{t+h}=a_{t}+h \, b_{t}.$ In general, we smooth the level (permanent) component, the trend component and the seasonal component.<br />
<br />
RCODE:<br />
HoltWinters(series,beta=FALSE,gamma=FALSE) #simple exponential smoothing<br />
HoltWinters(series,gamma=FALSE) # HoltWinters trend<br />
HoltWinters(series,seasonal= 'additive') # HoltWinters additive trend+seasonal<br />
HoltWinters(series,seasonal= 'multiplicative') # HoltWinters multiplicative trend+seasonal<br />
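<br />
The SES recursion $\hat{x}_{t} = \alpha x_{t-1} + (1- \alpha) \hat{x}_{t-1}$ can be checked by hand against HoltWinters. The sketch below uses a simulated series and $\alpha=0.5$ (both are assumptions for illustration) and starts the recursion at $\hat{x}_{2}=x_{1}$, which is the start value HoltWinters uses for simple exponential smoothing.<br />

```r
# Simulated series (an assumption for illustration; any numeric series works)
set.seed(1)
x <- ts(25 + rnorm(50, sd = 5))
alpha <- 0.5
n <- length(x)

# SES recursion, started with xhat[2] = x[1]
xhat <- rep(NA_real_, n)
xhat[2] <- x[1]
for (i in 3:n) {
  xhat[i] <- alpha * x[i - 1] + (1 - alpha) * xhat[i - 1]
}

# Same smoother via HoltWinters with the trend and seasonal parts switched off
hw <- HoltWinters(x, alpha = alpha, beta = FALSE, gamma = FALSE)
hw_fit <- as.numeric(hw[["fitted"]][, "xhat"])

# Largest absolute difference between the two sets of one-step forecasts
max(abs(xhat[-1] - hw_fit))
```

The difference is zero up to rounding, so the fitted values reported by HoltWinters are exactly the one-step-ahead SES forecasts.<br />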
<br />
<br />
<br />
Example: RCODE: (effect of different values of alpha on datr SES)<br />
par(mfrow=c(3,1))<br />
exp05=HoltWinters(datr,alpha=0.05,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p05=predict(exp05,20) <br />
plot(exp05,p05, main = 'Holt- Winters filtering, alpha=0.05')<br />
exp50=HoltWinters(datr, alpha=0.5,beta=FALSE,gamma=FALSE) #simple exponential smoothing <br />
p50=predict(exp50,20)<br />
plot(exp50,p50, main = 'Holt-Winters filtering, alpha=0.5')<br />
exp95=HoltWinters(datr, alpha=0.95,beta=FALSE,gamma= FALSE) #simple exponential smoothing <br />
p95=predict(exp95,20)<br />
plot(exp95,p95,main = "Holt-Winters filtering, alpha=0.95")<br />
<br />
<br />
[[Image:SMHS_Fig19_TimeSeries.png|500px]]<br />
<br />
*Comparing two models: we can compare ‘fit’ and forecast ability. <br />
<br />
RCODE:<br />
obs <- exp50$\$$x[-1]<br />
fit <- exp50$\$$fitted[,1]<br />
r1 <- obs-fit<br />
plot(r1)<br />
par(mfrow=c(1,2))<br />
plot(r1)<br />
acf(r1)<br />
<br />
[[Image:SMHS_Fig20_TimeSeries.png|500px]]<br />
<br />
*Regression based (nonparametric) smoothing: (1) kernel: extension of a two-sided MA. Use a bandwidth (number of terms) plus a kernel function to smooth a set of values, ''ksmooth''; (2) local linear regression (loess): fit local polynomial regressions and join them together, ''loess''; (3) fitting cubic splines: extension of polynomial regression, partition the data and fit a separate piecewise regression to each section, then smooth the pieces together where they join, ''smooth.spline''; (4) many more, ''lowess'', ''supsmu''.<br />
::•Kernel smoother: define bandwidth; choose a kernel; use robust regression to get smooth estimate; bigger the bandwidth, smoother the result<br />
::•Loess: define the window width; choose a weight function; use polynomial regression and weighted least squares to get smoothest estimate; the bigger span, the smoother the result.<br />
::•Splines: divide data into intervals; in each interval, fit a polynomial regression of order p such that the splines are continuous at knots; use a parameter (spar in R) to smooth the splines; the bigger the spar, the smoother the result.<br />
::•Issues with regression based (nonparametric) smoothing: (1) the size of the span S has an important effect on the curve; (2) a span that is too small (meaning that insufficient data fall within the window) produces a curve characterized by a lot of noise, that is, a fit with large variance; (3) if the span is too large, the regression will be oversmoothed, and the local polynomial may not fit the data well. This may result in loss of important information, so the fit will have large bias; (4) we want the smallest span that provides a smooth structure.<br />
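<br />
The three smoothers named above can be compared on one series. The sketch below uses a simulated noisy sine wave; the series, the tuning values (bandwidth, span, spar) and the helper roughness() are all assumptions made for illustration, not recommended settings.<br />

```r
set.seed(2014)
tt <- 1:100
y <- sin(2 * pi * tt / 50) + rnorm(100, sd = 0.5)   # structure plus noise

# Kernel smoother with a narrow and a wide bandwidth
k_narrow <- ksmooth(tt, y, kernel = "normal", bandwidth = 3,  x.points = tt)[["y"]]
k_wide   <- ksmooth(tt, y, kernel = "normal", bandwidth = 30, x.points = tt)[["y"]]

# Local polynomial regression; span is the fraction of points in each window
lo <- fitted(loess(y ~ tt, span = 0.3))

# Cubic smoothing spline; a larger spar gives a smoother curve
sp <- smooth.spline(tt, y, spar = 0.8)[["y"]]

# Illustrative roughness measure: sum of squared second differences
roughness <- function(z) sum(diff(z, differences = 2)^2)

c(raw = roughness(y), kernel_narrow = roughness(k_narrow),
  kernel_wide = roughness(k_wide), loess = roughness(lo), spline = roughness(sp))
```

A smaller roughness value means a smoother curve; the wide-bandwidth kernel fit is much smoother than the narrow one, which illustrates the variance versus bias trade-off described in the issues list.<br />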
<br />
*A regression model for trend: given a time series $y_{t},\ t=1,2,\cdots,T$, fit a regression model for trend. If a plot of the series suggests a linear trend, we fit a trend line, say $y_{t}=\beta_{1}+\beta_{2} t + \varepsilon_{t}$. <br />
::•Assumptions: variance of error is constant across observations; errors are independent / uncorrelated across observations; the regressor variables and error are independent; we may also assume probability distribution of error is normal.<br />
<br />
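RCODE:<br />
fit <- lm(x~z) # x: the series; z: the regressor (for a trend line, the time index)<br />
summary(fit) # coefficient estimates, standard errors and fit statistics<br />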
plot.ts(fit$\$$resid) # plot the residuals of the fitted regression model<br />
acf2(fit$\$$resid) # the ACF and PACF of the residuals of the fitted regression model<br />
qqnorm(fit$\$$resid)<br />
qqline(fit$\$$resid)<br />
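<br />
A self-contained version of the trend regression is sketched below on simulated data. The true coefficients ($\beta_{1}=2$, $\beta_{2}=0.5$), the noise level and the seed are assumptions chosen so that the estimates can be compared with known values.<br />

```r
set.seed(7)
tt <- 1:100
y <- 2 + 0.5 * tt + rnorm(100, sd = 3)   # linear trend plus white noise

fit <- lm(y ~ tt)
coef(fit)                                # estimates of beta_1 and beta_2

e <- resid(fit)

# Lag-1 sample autocorrelation of the residuals (element 1 is lag 0)
lag1 <- acf(e, plot = FALSE)[["acf"]][2]

# Shapiro-Wilk test of the normality assumption for the errors
sw_pvalue <- shapiro.test(e)[["p.value"]]

c(lag1_acf = lag1, shapiro_p = sw_pvalue)
```

With real data, the residual plots in the RCODE above remain the primary diagnostic; the numeric summaries here only complement them.<br />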
<br />
<br />
*Analyzing patterns in data and modeling: seasonality<br />
**Exploratory analysis<br />
<br />
RCODE:<br />
data=c( 118,132,129,121,135,148,148,136,119,104,118,115,126,141,135,125,<br />
149,170,170,158,133,114,140,145,150,178,163,172,178,199,199,184,162,146,<br />
166,171,180,193,181,183,218,230,242,209,191,172,194,196,196,236,235,<br />
229,243,264,272,237,211,180,201,204,188,235,227,234,264,302,293,259,229,<br />
203,229,242,233,267,269,270,315,364,347,312,274,237,278,284,277,317,313,<br />
318,374,413,405,255,306,271,306,300)<br />
air <- ts(data,start=c(1949,1),frequency=12)<br />
y <- log(air)<br />
par(mfrow=c(1,2))<br />
plot(air)<br />
plot(y)<br />
<br />
<br />
[[Image:SMHS_Fig21_TimeSeries.png|500px]]<br />
<br />
<br />
## from the charts above, taking log helps a little bit with the increasing variability. <br />
<br />
monthplot(y)<br />
<br />
<br />
[[Image:SMHS_Fig22_TimeSeries.png|500px]]<br />
<br />
<br />
acf2(y)<br />
<br />
[[Image:SMHS_Fig23_TimeSeries.png|500px]]<br />
<br />
*Smoothing: the Holt-Winters method smooths and obtains three parts of a series: level, trend and seasonal. The method requires smoothing constants: either your own choice or those which minimize the sum of squared one-step-ahead forecast errors, $\sum_{t=1}^{T} \left(\hat{x}_{t} - x_{t} \right )^{2}$. At a given time point, smoothed value = (level+h*trend)+seasonal index (additive); smoothed value = (level+h*trend)*seasonal index (multiplicative). Additive: $\hat{x}_{t+h}=a_{t}+h \, b_{t}+s_{t-p+1+(h-1) \bmod p}$; Multiplicative: $\hat{x}_{t+h}=(a_{t}+h \, b_{t}) \, s_{t-p+1+(h-1) \bmod p}$, where $p$ is the seasonal period.<br />
<br />
<br />
RCODE:<br />
par(mfrow=c(2,1))<br />
sa <- HoltWinters(air,seasonal=c('additive'))<br />
psa <- predict(sa,60)<br />
plot(sa,psa,main='Holt-Winters filtering, airline passengers, HW additive')<br />
sm <- HoltWinters(air,seasonal=c('multiplicative'))<br />
psam <- predict(sm,60)<br />
plot(sm,psam,main='Holt-Winters filtering, airline passengers, HW multiplicative')<br />
<br />
[[Image:SMHS_Fig24_TimeSeries.png|500px]]<br />
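<br />
The additive forecast equation can be verified against predict(). The sketch below uses R's built-in AirPassengers series on the log scale (an assumption: it is not the air vector typed in above). It rebuilds the forecasts from the stored coefficients a, b and s1, ..., s12, where the seasonal index cycles as $1+(h-1) \bmod p$ over the last $p$ seasonal values.<br />

```r
y <- log(AirPassengers)                  # built-in monthly series, 1949-1960
hw <- HoltWinters(y, seasonal = "additive")

co <- hw[["coefficients"]]               # named vector: a, b, s1, ..., s12
p <- frequency(y)                        # seasonal period, 12 for monthly data
h <- 1:24                                # forecast horizons

# xhat[t+h] = a[t] + h * b[t] + seasonal term, cycling through s1..s12
by_hand <- co[["a"]] + h * co[["b"]] + unname(co[paste0("s", 1 + (h - 1) %% p)])

from_predict <- as.numeric(predict(hw, n.ahead = 24))

# Largest absolute difference between the two sets of forecasts
max(abs(by_hand - from_predict))
```

The two sets of forecasts agree up to rounding, which confirms how the modulus in the seasonal subscript is used.<br />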
<br />
## seasonal decomposition using loess, log(airpassengers)<br />
<br />
[[Image:SMHS_Fig25_TimeSeries.png|500px]]<br />
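<br />
No code accompanies the loess decomposition figure. One way such a decomposition can be produced is with stl(), sketched below on the built-in AirPassengers series (an assumption; the figure above may have been produced with different data or settings).<br />

```r
y <- log(AirPassengers)

# Seasonal-trend decomposition using loess; "periodic" forces the
# seasonal component to be identical across years
dec <- stl(y, s.window = "periodic")

parts <- dec[["time.series"]]            # columns: seasonal, trend, remainder
colnames(parts)

# The three components add back up to the original (log) series
recombined <- rowSums(parts)
max(abs(recombined - as.numeric(y)))
```

Calling plot(dec) draws the data, seasonal, trend and remainder panels in one figure.<br />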
<br />
*Regression: estimation, diagnostics, interpretation. Defining dummy variables to capture the seasonal effect in regression. If there is a variable with k categories, there will be k-1 dummies if the intercept is included in the regression (exclude one group); there will be k dummies if the intercept is not included in the regression; if the intercept is included in the regression, the coefficient of a dummy variable is always the difference of means between that group and the excluded group. <br />
<br />
RCODE:<br />
t <- time(y)<br />
Q <- factor(rep(1:12,8))<br />
h <- lm(y~time(y)+Q) # regression with intercept, drops first category = January<br />
model.matrix(h)<br />
h1 <- lm(y~0+time(y)+Q) # regression without intercept<br />
model.matrix(h1)<br />
x1 <- ts(y[1:72],start=c(1949,1),frequency=12)<br />
t1 <- t[1:72]<br />
Q1 <- Q[1:72]<br />
contrasts(Q1)=contr.treatment(12,base=10)<br />
h2 <- lm(x1~t1+Q1) # regression with intercept, drops 10th category = October<br />
model.matrix(h2)<br />
reg <- lm(x1~t1+Q1)<br />
par(mfrow=c(2,2))<br />
plot.ts(reg$\$$resid)<br />
qqnorm(reg$\$$resid)<br />
qqline(reg$\$$resid)<br />
acf(reg$\$$resid)<br />
pacf(reg$\$$resid)<br />
<br />
[[Image:SMHS_Fig26_TimeSeries.png|500px]]<br />
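<br />
The statements about dummy variables can be checked on a toy data set. In the sketch below the factor g (k = 3 groups) and the responses y are made-up values chosen so the group means are easy to verify by hand.<br />

```r
g <- factor(rep(c("A", "B", "C"), each = 4))
y <- c(10, 12, 11, 13,  20, 19, 21, 22,  30, 31, 29, 32)

with_int <- lm(y ~ g)        # intercept + k-1 = 2 dummies (group A excluded)
no_int   <- lm(y ~ 0 + g)    # no intercept, k = 3 dummies

colnames(model.matrix(with_int))   # "(Intercept)" "gB" "gC"
colnames(model.matrix(no_int))     # "gA" "gB" "gC"

group_means <- tapply(y, g, mean)  # A = 11.5, B = 20.5, C = 30.5

coef(with_int)   # 11.5 (mean of A), 9 (B minus A), 19 (C minus A)
coef(no_int)     # 11.5, 20.5, 30.5 (the three group means)
```

When other regressors such as time are in the model, the dummy coefficients are differences between groups adjusted for those regressors rather than raw differences of means.<br />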
<br />
## forecasts RCODE<br />
newdata=data.frame(t1=t[73:96], Q1=Q[73:96]) <br />
preg=predict(reg,newdata,se.fit=TRUE) <br />
seup=preg$\$$fit+2*preg$\$$se.fit<br />
seup <br />
1 2 3 4 5 6 7 8 9 <br />
5.577144 5.722867 5.679936 5.667979 5.780376 5.881420 5.889829 5.782526 5.655445 <br />
10 11 12 13 14 15 16 17 18 <br />
5.525933 5.661158 5.681042 5.716577 5.862300 5.819369 5.807412 5.919809 6.020853 <br />
19 20 21 22 23 24 <br />
6.029262 5.921959 5.794878 5.665366 5.800591 5.820475 <br />
selow=preg$\$$fit-2*preg$\$$se.fit<br />
selow<br />
1 2 3 4 5 6 7 8 9 <br />
5.479442 5.625165 5.582234 5.570277 5.682674 5.783718 5.792127 5.684824 5.557743 <br />
10 11 12 13 14 15 16 17 18 <br />
5.428232 5.563456 5.583340 5.610927 5.756650 5.713720 5.701763 5.814160 5.915203 <br />
19 20 21 22 23 24 <br />
5.923613 5.816309 5.689228 5.559717 5.694941 5.714825<br />
<br />
x1r=y[73:96]<br />
t1r=t[73:96]<br />
plot(t1r,x1r, col='red',lwd=1:2,main='Actual vs Forecast', type='l')<br />
lines(t1r,preg$\$$fit, col='green', lwd=1:2)<br />
lines(t1r,seup, col='blue', lwd=1:2)<br />
lines(t1r,selow, col='blue', lwd=1:2)<br />
## red: actual values; green: forecast; blue: forecast plus and minus 2 standard errors<br />
<br />
[[Image:SMHS_Fig27_TimeSeries.png|500px]]<br />
<br />
## Holt Winters smoothing: forecasts<br />
sa=HoltWinters(x1, seasonal = c("additive"))<br />
psa=predict(sa,24)<br />
plot(t1r,x1r, col="red",type="l",main = "Holt-Winters filtering, airline passengers, HW additive")<br />
lines(t1r, psa, col="green")<br />
<br />
## red represents the actual, green: forecast<br />
<br />
[[Image:SMHS_Fig28_TimeSeries.png|500px]]<br />
<br />
sqrt(sum((x1r-psa)^2)) ## HW: root of the sum of squared forecast errors<br />
[1] 0.3988259<br />
sqrt(sum((x1r-preg$\$$fit)^2)) ##reg<br />
[1] 0.3807665<br />
<br />
==Applications==<br />
<br />
1)[http://www.statsoft.com/Textbook/Time-Series-Analysis This article] presents a comprehensive introduction to the field of time series analysis. It discusses the common patterns found in time series data and introduces some commonly used techniques for handling and analyzing time series. <br />
<br />
2)[http://www.sciencedirect.com/science/article/pii/0304393282900125 This article] investigates whether macroeconomic time series are better characterized as stationary fluctuations around a deterministic trend or as non-stationary processes that have no tendency to return to a deterministic path. Using long historical time series for the U.S. we are unable to reject the hypothesis that these series are non-stationary stochastic processes with no tendency to return to a trend line. Based on these findings and an unobserved components model for output that decomposes fluctuations into a secular or growth component and a cyclical component we infer that shocks to the former, which we associate with real disturbances, contribute substantially to the variation in observed output. We conclude that macroeconomic models that focus on monetary disturbances as a source of purely transitory fluctuations may never be successful in explaining a large fraction of output variation and that stochastic variation due to real factors is an essential element of any model of macroeconomic fluctuations.<br />
<br />
<br />
==Software== <br />
See the CODE examples given in this lecture.<br />
<br />
==Problems==<br />
<br />
1) Consider a signal-plus-noise model of the general form $x_{t}=s_{t}+w_{t}$, where $w_{t}$ is Gaussian white noise with $\sigma ^{2} _{w} = 1$. Simulate and plot $n=200$ observations from each of the following two models.<br />
(a) $x_{t} = s_{t}+w_{t}$, for $t=1,2,\cdots,200$, where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/20 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}$.<br />
(b) $x_{t}=s_{t}+w_{t}$, for $t=1,2,\cdots, 200$, where $s_{t}=\begin{cases} 0 & \text{ if } t=1,2,\cdots,100 \\ 10\exp \left \{ -(t-100)/200 \right \} \cos\left ( 2\pi t/4 \right ) & \text{ if } t=101,102,\cdots,200 \end{cases}$.<br />
(c) Compare the signal modulators (a) $\exp \left \{-t/20 \right \}$ and (b) $\exp \left \{-t/200 \right \}$, for $t=1,2, \cdots,100$.<br />
<br />
2) (a) Generate $n=100$ observations from the autoregression $x_{t}=-0.9 x_{t-2} +w_{t}$, with $\sigma_{w}=1$, using the method described in Example 1.10, page 13 of the textbook. Next, apply the moving average filter $v_{t}=\left(x_{t}+x_{t-1}+x_{t-2}+x_{t-3} \right )/4$ to $x_{t}$, the data you generated. Now, plot $x_{t}$ as a line and superimpose $v_{t}$ as a dashed line. Comment on the behavior of $x_{t}$ and how applying the moving average filter changes that behavior. [Hint: use v=filter(x,rep(1/4,4),sides=1) for the filter].<br />
(b) Repeat (a) but with $x_{t}=\cos \left( 2 \pi t/4 \right)$.<br />
(c) Repeat (b) but with added $N(0,1)$ noise, $x_{t}=\cos \left(2 \pi t/4 \right) +w_{t}$.<br />
(d) Compare and contrast (a) – (c).<br />
<br />
3) For the two series $x_{t}$ in Problem 1 (a) and (b):<br />
(a) Compute and plot the mean functions $\mu _{x} (t)$ for $t=1,2,\cdots,200$.<br />
(b) Calculate the autocovariance functions, $\gamma _{x} (s,t)$, for $s,t=1,2, \cdots, 200$.<br />
<br />
4) Consider the time series $x_{t} = \beta_{1} + \beta_{2} t + w_{t}$, where $\beta _{1}$ and $\beta_{2}$ are known constants and $w_{t}$ is a white noise process with variance $\sigma ^{2} _{w}$.<br />
(a) Determine whether $x_{t}$ is stationary. <br />
(b) Show that the process $y_{t}=x_{t} - x_{t-1}$ is stationary.<br />
(c) Show that the mean of the moving average $v_{t}= \frac {1}{2q+1} \sum_{j=-q}^{q} x_{t-j}$ is $\beta _{1} + \beta _{2} t$, and give a simplified expression for the autocovariance function. <br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_TimeSeries}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_GEE&diff=14426SMHS GEE2014-10-17T12:56:52Z<p>Clgalla: /* Software */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Generalized Estimating Equations (GEE) Models ==<br />
===Overview===<br />
The generalized estimating equation (GEE) approach is a commonly used method for estimating the parameters of a generalized linear model with a possibly unknown correlation between outcomes. It provides a general approach for analyzing discrete and continuous responses with marginal models and works as a popular alternative to maximum likelihood estimation (MLE). In this section, we present a general introduction to the GEE method and illustrate its application with examples. The geeglm function of the R package geepack is discussed in the attached paper reference.<br />
<br />
===Motivation===<br />
There is no natural specification of, or convenient way to deal with, the joint multivariate distribution of $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}'$ when the responses are discrete. This is where the GEE method, which is based on the concept of estimating equations, comes in. It provides a general approach for analyzing discrete and continuous responses with marginal models. So, how does the GEE model work?<br />
<br />
===Theory===<br />
<br />
'''1) GEE'''<br />
<br />
GEE is a commonly used method for estimating the parameters of a generalized linear model with a possibly unknown correlation between outcomes. Under mild regularity conditions, parameter estimates from the GEE are consistent even when the covariance structure is misspecified. The focus of GEE is on estimating the average response over the population rather than the regression parameters that would enable prediction of the effect of changing one or more covariates on a given individual. <br />
*GEEs belong to the class of semi-parametric regression techniques since they rely on specification of only the first two moments and are a popular alternative to the likelihood-based generalized linear mixed model. <br />
*They are commonly used in large epidemiological studies, especially multi-site cohort studies because they can handle many types of unmeasured dependence between outcomes.<br />
<br />
'''2) Marginal models'''<br />
<br />
The vector of observations from subject $i$ is $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}',$ for $i=1,2, \cdots, N.$ Consider a marginal model that has a mean function, which is a function of the predictors; a known variance function (for discrete data, the mean determines the variance); possibly a scale parameter for overdispersion; and a working correlation matrix. <br />
*More specifically, the marginal mean of the response is $E[Y_{ij}]=\mu_{ij};$ the mean depends on the explanatory variables $X_{ij}$ through a known link function: $g(\mu_{ij})=\eta_{ij} ={X_{ij}}' \beta.$<br />
*The marginal variance of $Y_{ij}$ depends on the marginal mean according to $Var[Y_{ij}]=v(\mu_{ij}) \phi,$ where the variance function $v(\mu_{ij})$ is known and $\phi =1$ is fixed for Poisson and Bernoulli responses. The parameter $\phi$ may have to be estimated for the normal model or for overdispersed data. The correlation between $Y_{ij}$ and $Y_{ik}$ is a function of an extra parameter, $\alpha,$ and may also depend on the means $\mu_{ij}$ and $\mu_{ik}.$<br />
*The main idea is to generalize the usual univariate likelihood equations by introducing the covariance matrix of the vector of responses, $Y_{i}$. For linear models, weighted least squares (WLS) or generalized least squares (GLS) can be considered a special case of this ‘estimating equations’ approach; for non-linear models, this approach is called ‘generalized estimating equations’ (GEE).<br />
<br />
'''3) GEE computation:''' <br />
<br />
*Consider normal linear regression $y_{i}={x_{i}}' \beta + \epsilon_{i}, \epsilon_{i} \sim N(0, \sigma^{2}).$ The normal density for $N$ observations is $(2 \pi \sigma^{2})^{-N/2}$ times the exponential of $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ up to an additive constant, this sum is the log likelihood function. <br />
*Likelihood equations, normal regression: find the maximum likelihood estimate as the value $\hat{\beta}$ of $\beta$ that maximizes the log likelihood $l(\beta)= \sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ the maximum of $l(\beta)$ is found by differentiating $l(\beta)$ with respect to $\beta,$ setting the derivative equal to zero and then solving $\frac{dl(\beta)}{d \beta} = 0.$<br />
*Regression likelihood equations: the log likelihood looks like $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ differentiating with respect to $\beta$ and setting equal to zero gives $\sum_{i=1}^{N}x_{i}\sigma^{-2}(y_{i}-{x_{i}}'\beta) = 0.$ This is the likelihood equation. Solving for $\beta$ gives us $\hat{\beta}=(\sum_{i} x_{i}{x_{i}}')^{-1}(\sum_{i} x_{i}y_{i}).$ <br />
*Normal longitudinal model: generalize to multivariate regression (longitudinal models). Suppose the continuous data vector $Y_{i}$ is $n \times 1$; the predictor matrix $X_{i}$ has $n$ rows and $p$ columns; and the coefficient vector $\beta$ is $p \times 1$. The model is $Y_{i}=X_{i}\beta+\epsilon_{i},$ the residual distribution is $\epsilon_{i} \sim N(0, V_{i}),$ where the variance matrix is written $V_{i}$ to distinguish it from the summation sign $\sum.$ The covariance matrix $V_{i}$ depends on an unknown parameter $\theta.$<br />
*Multivariate likelihood: the multivariate normal log likelihood is $\sum_{i=1}^{N} -0.5{(Y_{i}-X_{i}\beta)}'V_{i}^{-1}(Y_{i}-X_{i}\beta);$ it looks like the normal regression log likelihood as before: the vector $Y_{i}$ has replaced the scalar $y_{i}$; the vector $X_{i}\beta$ has replaced the scalar ${x_{i}}'\beta;$ and the matrix $V_{i}^{-1}$ has replaced $\sigma^{-2}.$<br />
*Multivariate observation likelihood: differentiate the log likelihood and set it to zero; this gives the likelihood equation $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0,$ which is similar to the linear regression likelihood equation; the solution (supposing we know $V_{i}$) is $\hat{\beta}=(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}X_{i})^{-1}(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}Y_{i}),$ which is the weighted least squares estimator of $\beta.$<br />
*Interpretation of the likelihood equations $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0$: the residual $(Y_{i}-X_{i}\beta)$ is $(Y_{i}-\mu_{i}),$ the observation vector $Y_{i}$ minus its mean $\mu_{i}=X_{i}\beta;$ $V_{i}^{-1}$ is the inverse of the covariance matrix; and $X_{i}$ can be thought of as the derivative of $\mu_{i}$ with respect to the regression parameter $\beta,$ written as $d\mu_{i}/d\beta.$<br />
*Multivariate generalized linear models equations: generalize to generalized linear models (GLMs) with correlated data. The result is no longer a likelihood equation; instead we use a weighted least squares equation, in which each term of the previous likelihood equation is replaced with its generalized linear model equivalent, plus something to adjust for correlations. The $Y_{i}$ vector remains the same; the mean vector is $\mu_{i}$ rather than $X_{i}\beta;$ $X_{i}$ is replaced with the derivative $D_{i} = d\mu_{i}/d\beta;$ for GLMs $D_{i}$ has components that look like $D_{1}^{-1}X_{i},$ where the matrix $D_{1}$ is diagonal with elements $v(\mu_{ij}),$ the variance functions of the means; thus $D_{i}$ is a weighted predictor matrix; the overdispersion parameter $\phi$ is not involved at this stage. The $V_{i}$'s are constructed to be similar to the covariance matrix $Var[Y_{i}]$, but not necessarily equal to it; the GEEs are $\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}(Y_{i}-\mu_{i}(\beta))=0,$ which we solve for $\beta$ to obtain the GEE estimate $\hat{\beta}_{GEE}.$ <br />
*Constructing the working covariance matrix: $V_{i} \approx Var[Y_{i}];$ separate $V_{i}$ into a variance part and a correlation part. The variance matrix is a diagonal matrix $A_{i}$ whose diagonal elements $\phi v(\mu_{ij})$ are the variances of the observations $Y_{ij}.$ The working correlation matrix is a function of the unknown parameter $\alpha$ and need not be the true underlying correlation matrix $Corr[Y_{i}].$ The matrix $V_{i}$ is known as a working covariance matrix: it is close to $Var[Y_{i}],$ but is not assumed to be exactly correct. <br />
*Variances of the GEE estimates: the working covariance is a postulated covariance matrix of $Y_{i}-\mu_{i};$ it is not the final, true or modeled covariance matrix, but rather a working matrix used to create the estimates $\hat{\beta}_{GEE}.$ If the model for the data is correct, then the covariance matrix of the estimate is $B^{-1}=Var[\hat{\beta}_{GEE}]=(\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}D_{i})^{-1};$ this is the same as the proc mixed covariance model $(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}X_{i})^{-1},$ but with $D_{i}=d\mu_{i}/d\beta$ replacing $X_{i}.$ If the working correlation matrix is incorrect, we modify the covariance matrix $B^{-1}$ to $Var[\hat{\beta}_{GEE}]=B^{-1}MB^{-1},$ which typically increases the variance estimate to pay for errors in $Corr[Y_{i}].$ Here $B^{-1}$ is the bread and $M$ is the meat in the $B^{-1}MB^{-1}$ sandwich, which gives this estimator the nickname ‘sandwich estimator’. <br />
*The middle: $M=\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}Cov[Y_{i}]V_{i}^{-1}D_{i},$ where $D_{i}=d\mu_{i}/d\beta$ is estimated and $V_{i}$ is the working correlation matrix combined with the variance function. Since we do not know $Cov[Y_{i}],$ we replace it with $(Y_{i}-\hat{\mu}_{i})(Y_{i}-\hat{\mu}_{i})',$ which has the right expectation. $B^{-1}MB^{-1}$ is also known as the Huber sandwich estimator, and the standard errors derived from it are called robust standard errors.<br />
*Advantages of GEE: the estimated regression coefficient $\hat{\beta}_{GEE}$ is asymptotically correct if the underlying regression mean model is correct, even if the assumed correlation model is incorrect. Valid standard errors are obtained with the sandwich covariance estimator, and the methodology is easy to use.<br />
*Disadvantages of GEE: the assumptions can correspond to a mathematically impossible model, that is some combinations of correlations and variances are mathematically impossible, no matter how the data is generated; the sandwich estimator inflates standard errors; inflated SE’s are a serious cost of not modeling the covariance correctly; GEE does not fully specify a statistical model and covers instead with asymptotic statements; full assumptions are hidden.<br />
*GEE does not allow or make individual subject predictions. The GEE methodology is best suited to balanced longitudinal designs where the number of subjects $N$ is large and the number $n$ of repeated measures is small; GEE is not good for highly unbalanced data sets.<br />
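<br />
The estimating equation and the sandwich variance above can be sketched in base R for the simplest case: an identity link (so $D_{i}=X_{i}$) with a working independence matrix $V_{i}=I$. The simulated data, the variable names and the random-intercept mechanism that induces within-subject correlation are all assumptions for illustration; this is not how geepack computes its estimates internally.<br />

```r
set.seed(42)
N <- 200                                   # subjects
n <- 4                                     # repeated measures per subject
p <- 2                                     # regression parameters

id   <- rep(seq_len(N), each = n)
time <- rep(seq_len(n), times = N)
b    <- rep(rnorm(N), each = n)            # shared subject effect -> correlation
y    <- 1 + 0.5 * time + b + rnorm(N * n)
X    <- cbind(intercept = 1, time = time)

V_inv <- diag(n)                           # working independence: V_i = I

# Bread B = sum_i X_i' V_i^{-1} X_i and score part U = sum_i X_i' V_i^{-1} Y_i
B <- matrix(0, p, p)
U <- matrix(0, p, 1)
for (i in seq_len(N)) {
  rows <- which(id == i)
  Xi <- X[rows, , drop = FALSE]
  B <- B + t(Xi) %*% V_inv %*% Xi
  U <- U + t(Xi) %*% V_inv %*% y[rows]
}
beta_hat <- solve(B, U)                    # solves the estimating equation

# Meat M = sum_i X_i' V_i^{-1} r_i r_i' V_i^{-1} X_i, with r_i = Y_i - X_i beta_hat
M <- matrix(0, p, p)
for (i in seq_len(N)) {
  rows <- which(id == i)
  Xi <- X[rows, , drop = FALSE]
  ri <- y[rows] - Xi %*% beta_hat
  M <- M + t(Xi) %*% V_inv %*% ri %*% t(ri) %*% V_inv %*% Xi
}

B_inv <- solve(B)
robust_var <- B_inv %*% M %*% B_inv        # sandwich estimator
robust_se  <- sqrt(diag(robust_var))

cbind(estimate = as.numeric(beta_hat), robust_se = robust_se)
```

With working independence and an identity link the point estimates coincide with ordinary least squares; only the standard errors change, because the meat $M$ picks up the within-subject correlation that the working matrix ignores.<br />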
<br />
<br />
===Applications===<br />
<br />
1) [http://www.jstor.org/discover/10.2307/2531734?uid=3739728&uid=2&uid=4&uid=3739256&sid=21104055415031 This article] discussed extensions of generalized linear models for the analysis of longitudinal data. Two approaches are considered: subject-specific (SS) models in which heterogeneity in regression parameters is explicitly modeled; and population-average (PA) models in which the aggregate response for the population is the focus. We use a generalized estimating equation approach to fit both classes of models for discrete and continuous outcomes. When the subject-specific parameters are assumed to follow a Gaussian distribution, simple relationships between the PA and SS parameters are available. The methods are illustrated with an analysis of data on mother’s smoking and children’s respiratory disease.<br />
<br />
2) [http://wiki.socr.umich.edu/index.php/SOCR_EduMaterials_Activities_NegativeBinomial This article] discusses generalized estimating equations for correlated binary data using the odds ratio as a measure of association. Moment methods for analyzing repeated binary responses had been proposed before, with estimation of the parameters associated with the expected value of an individual's vector of binary responses as well as the correlations between pairs of binary responses. Because the odds ratio has many desirable properties, and some investigators may find the odds ratio easier to interpret, the paper discusses modeling the association between binary responses at pairs of times with the odds ratio and modifies the estimating equations of Prentice to estimate the odds ratios. In simulations, the parameter estimates for the logistic regression model for the marginal probabilities appear slightly more efficient when using the odds ratio parameterization.<br />
<br />
3) [http://orm.sagepub.com/content/7/2/127.short?rss=1&ssource=mfr This article] talked about the generalized estimating equation (GEE) for longitudinal data analysis. The GEE approach of Zeger and Liang facilitates analysis of data collected in longitudinal, nested, or repeated measures designs. GEEs use the generalized linear model to estimate more efficient and unbiased regression parameters relative to ordinary least squares regression in part because they permit specification of a working correlation matrix that accounts for the form of within-subject correlation of responses on dependent variables of many different distributions, including normal, binomial, and Poisson. The author briefly explains the theory behind GEEs and their beneficial statistical properties and limitations and compares GEEs to suboptimal approaches for analyzing longitudinal data through use of two examples. The first demonstration applies GEEs to the analysis of data from a longitudinal lab study with a counted response variable; the second demonstration applies GEEs to analysis of data with a normally distributed response variable from subjects nested within branch offices of an organization.<br />
<br />
<br />
===Software=== <br />
[http://www.jstatsoft.org/v15/i02/paper The geeglm function] in the R package geepack fits GEE models, for example:<br />
geeglm(outcome ~ baseline + center + sex + treat + age + I(age^2), data = respiratory, id = interaction(center, id), family = binomial, corstr = "exchangeable")<br />
<br />
===Problems===<br />
<br />
Example 1:<br />
library(geepack) # install this package first<br />
data(dietox)<br />
summary(dietox)<br />
Weight Feed Time Pig <br />
Min. : 15.00 Min. : 3.30 Min. : 1.000 Min. :4601 <br />
1st Qu.: 38.30 1st Qu.: 32.80 1st Qu.: 3.000 1st Qu.:4857 <br />
Median : 59.20 Median : 74.50 Median : 6.000 Median :5866 <br />
Mean : 60.73 Mean : 80.73 Mean : 6.481 Mean :6238 <br />
3rd Qu.: 81.20 3rd Qu.:123.00 3rd Qu.: 9.000 3rd Qu.:8050 <br />
Max. :117.00 Max. :224.50 Max. :12.000 Max. :8442 <br />
NA's :72 <br />
Evit Cu Litter <br />
Min. :1.000 Min. :1.000 Min. : 1.00 <br />
1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 5.00 <br />
Median :2.000 Median :2.000 Median :11.00 <br />
Mean :2.027 Mean :2.015 Mean :12.14 <br />
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:20.00 <br />
Max. :3.000 Max. :3.000 Max. :24.00 <br />
attach(dietox)<br />
Cu <- as.factor(dietox$Cu)<br />
mf <- formula(Weight~Cu*(Time+I(Time^2)+I(Time^3)))<br />
gee <- geeglm(mf,data=dietox,id=Pig,family=poisson('identity'),corstr='ar1') ## The geeglm function fits generalized estimating equations using the ‘geese.fit’ function of the ‘geepack’ package for doing the actual computations. <br />
gee<br />
Call:<br />
geeglm(formula = mf, family = poisson("identity"), data = dietox, <br />
id = Pig, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Cu Time I(Time^2) I(Time^3) Cu:Time <br />
22.025614857 0.013857443 2.177350061 0.683063921 -0.027808001 0.424728590 <br />
Cu:I(Time^2) Cu:I(Time^3) <br />
-0.047363387 0.001313779 <br />
<br />
Degrees of Freedom: 861 Total (i.e. Null); 853 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 0.7877976<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0.9576663 <br />
Number of clusters: 72 Maximum cluster size: 12 <br />
<br />
aov(gee)<br />
<br />
Call:<br />
aov(formula = gee)<br />
Terms:<br />
Cu Time I(Time^2) I(Time^3) Cu:Time Cu:I(Time^2)<br />
Sum of Squares 521.2 492381.6 1006.6 297.1 0.0 34.1<br />
Deg. of Freedom 1 1 1 1 1 1<br />
Cu:I(Time^3) Residuals<br />
Sum of Squares 1.4 42350.3<br />
Deg. of Freedom 1 853<br />
<br />
Residual standard error: 7.046179<br />
Estimated effects may be unbalanced<br />
<br />
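Note that aov() fits an ordinary least-squares model, so the table above does not use the robust GEE standard errors. A minimal sketch of an alternative, assuming the summary and anova methods that geepack provides for geeglm fits (the object name fit is only illustrative):<br />
library(geepack)<br />
data(dietox)<br />
mf <- formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))<br />
fit <- geeglm(mf, data = dietox, id = Pig, family = poisson("identity"), corstr = "ar1")<br />
summary(fit) # coefficient table with robust (sandwich) standard errors and Wald tests<br />
anova(fit) # sequential Wald tests for the model terms<br />
<br />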
Example 2 (use dataset of Normal and Schizophrenia Neuroimaging study of Children at http://wiki.socr.umich.edu/index.php/SOCR_Data_Oct2009_ID_NI)<br />
▪ Subject: Subject identifier<br />
▪ Age: Subject Age<br />
▪ DX: Subject diagnosis (Normals=1; Schizophrenia=2)<br />
▪ Sex: Subject gender (Male=1; female=2)<br />
▪ FS_IQ: Subject Intelligence Quotient (IQ)<br />
▪ TBV: Total Brain Volume (mm<sup>3</sup>)<br />
▪ GMV: Total Gray Matter Volume (mm<sup>3</sup>)<br />
▪ WMV: Total White Matter Volume (mm<sup>3</sup>)<br />
▪ CSF: Total Cerebrospinal Fluid (mm<sup>3</sup>)<br />
▪ Background: Background volume (mm<sup>3</sup>)<br />
▪ L_superior_frontal_gyrus, R_superior_frontal_gyrus, ..., brainstem: 56 regional cortical and subcortical volumes (region anatomical names are encoded in the column heading).<br />
<br />
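neuro <- read.csv('M:/neuro.csv',header=TRUE) # adjust the path to where the data file was saved<br />
summary(neuro)<br />
mf2 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3+TBV+GMV+WMV+CSF+Background) # in an R formula, Age^2 and Age^3 without I() reduce to Age, so the fit below has no polynomial Age terms<br />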
gee2 <- geeglm(mf2,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1')<br />
gee2<br />
Call:<br />
geeglm(formula = mf2, family = poisson("identity"), data = neuro, <br />
id = Subject, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Sex Age TBV GMV WMV CSF Background <br />
-3.093394e+02 2.255625e+01 3.532712e+00 3.392570e+00 -3.392482e+00 -3.392484e+00 -3.392693e+00 3.144914e-05 <br />
Sex:Age <br />
-1.526318e+00 <br />
<br />
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 2.538312<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0 <br />
<br />
Number of clusters: 63 Maximum cluster size: 1 <br />
<br />
aov(gee2)<br />
<br />
Call:<br />
aov(formula = gee2)<br />
<br />
Terms:<br />
Sex Age TBV GMV WMV CSF Background Sex:Age Residuals<br />
Sum of Squares 497.829 740.825 1040.712 25.337 3131.065 205.501 384.453 366.533 15459.175<br />
Deg. of Freedom 1 1 1 1 1 1 1 1 54<br />
<br />
Residual standard error: 16.91984<br />
Estimated effects may be unbalanced<br />
<br />
mf3 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3) # as above, Age^2 and Age^3 without I() reduce to Age<br />
gee3 <- geeglm(mf3,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1')<br />
gee3<br />
<br />
Call:<br />
geeglm(formula = mf3, family = poisson("identity"), data = neuro, <br />
id = Subject, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Sex Age Sex:Age <br />
81.3791778 1.9781743 1.7571481 -0.5086719 <br />
<br />
Degrees of Freedom: 63 Total (i.e. Null); 59 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 3.364596<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0 <br />
<br />
Number of clusters: 63 Maximum cluster size: 1 <br />
<br />
aov(gee3)<br />
Call:<br />
aov(formula = gee3)<br />
<br />
Terms:<br />
Sex Age Sex:Age Residuals<br />
Sum of Squares 497.829 740.825 29.199 20583.576<br />
Deg. of Freedom 1 1 1 59<br />
<br />
Residual standard error: 18.67817<br />
Estimated effects may be unbalanced<br />
<br />
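The fits above contain no Age^2 or Age^3 coefficients because the ^ operator inside an R formula denotes crossing of terms rather than arithmetic powers. A short base-R sketch (y and x are placeholder names, not variables from the dataset) shows the difference that wrapping the powers in I() makes:<br />
attr(terms(y ~ x + x^2 + x^3), "term.labels") # "x"<br />
attr(terms(y ~ x + I(x^2) + I(x^3)), "term.labels") # "x" "I(x^2)" "I(x^3)"<br />
In addition, the output reports a maximum cluster size of 1, so each subject contributes a single observation and there is no within-subject correlation for the working correlation structure to model.<br />
<br />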
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Generalized_estimating_equation GEE Wikipedia]<br />
<br />
[https://onlinecourses.science.psu.edu/stat504/node/180 Introductions to GEE]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_GEE}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_GEE&diff=14425SMHS GEE2014-10-17T12:53:14Z<p>Clgalla: /* Scientific Methods for Health Sciences - Generalized Estimating Equations (GEE) Models */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Generalized Estimating Equations (GEE) Models ==<br />
===Overview===<br />
Generalized estimating equations (GEE) are a commonly used method for estimating the parameters of a generalized linear model when there is a possibly unknown correlation between outcomes. GEE provides a general approach for analyzing discrete and continuous responses with marginal models and is a popular alternative to maximum likelihood estimation (MLE). In this section, we present a general introduction to the GEE method and illustrate its application with examples. The geeglm function from the R package geepack is discussed in the attached paper reference.<br />
<br />
===Motivation===<br />
There is no natural specification of, or convenient way to work with, the joint multivariate distribution of $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}'$ when the responses are discrete. The GEE method, which is based on the concept of estimating equations, was proposed for this situation. It provides a general approach for analyzing discrete and continuous responses with marginal models. So, how does the GEE model work?<br />
<br />
===Theory===<br />
<br />
'''1) GEE'''<br />
<br />
GEE is a commonly used method for estimating the parameters of a generalized linear model when there is a possibly unknown correlation between outcomes. Under mild regularity conditions, parameter estimates from GEE are consistent even when the covariance structure is misspecified. The focus of GEE is on estimating the average response over the population rather than the regression parameters that would enable prediction of the effect of changing one or more covariates on a given individual. <br />
*GEEs belong to the class of semi-parametric regression techniques since they rely on specification of only the first two moments and are a popular alternative to the likelihood-based generalized linear mixed model. <br />
*They are commonly used in large epidemiological studies, especially multi-site cohort studies because they can handle many types of unmeasured dependence between outcomes.<br />
<br />
'''2) Marginal models'''<br />
<br />
The vector of observations from subject $i$ is $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}',$ for $i=1,2, \cdots, N.$ Consider a marginal model that has: a mean function, which is a function of the predictors; a known variance function (for discrete data, the mean determines the variance); possibly a scale parameter for overdispersion; and a working correlation matrix. <br />
*More specifically, the marginal mean of the response is $E[Y_{ij}]=\mu_{ij},$ and the mean depends on the explanatory variables $X_{ij}$ through a known link function: $g(\mu_{ij})=\eta_{ij} ={X_{ij}}' \beta.$<br />
*The marginal variance of $Y_{ij}$ depends on the marginal mean according to $Var[Y_{ij}]=v(\mu_{ij}) \phi,$ where the variance function $v(\mu_{ij})$ is known and $\phi =1$ is fixed for Poisson and Bernoulli responses. The parameter $\phi$ may have to be estimated for the normal model or for overdispersed data. The correlation between $Y_{ij}$ and $Y_{ik}$ is a function of an extra parameter, $\alpha,$ and may also depend on the means $\mu_{ij}$ and $\mu_{ik}.$<br />
*The main idea is to generalize the usual univariate likelihood equations by introducing the covariance matrix of the vector of responses, $Y_{i}$. For linear models, weighted least squares (WLS) or generalized least squares (GLS) can be considered a special case of this 'estimating equations' approach; for non-linear models, this approach is called 'generalized estimating equations' (GEE).<br />
<br />
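As an illustration of these components (a sketch of one standard choice, not a specification taken from the text above), a marginal model for binary responses with a logit link and an exchangeable working correlation can be written as:<br />
$g(\mu_{ij})=\log\left(\frac{\mu_{ij}}{1-\mu_{ij}}\right)={X_{ij}}'\beta, \quad v(\mu_{ij})=\mu_{ij}(1-\mu_{ij}), \quad \phi=1, \quad Corr[Y_{ij},Y_{ik}]=\alpha \text{ for } j \neq k.$<br />
Here a single parameter $\alpha$ describes the association between any two responses from the same subject.<br />
<br />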
'''3) GEE computation:''' <br />
<br />
*Consider normal linear regression $y_{i}={x_{i}}' \beta + \epsilon_{i}, \epsilon_{i} \sim N(0, \sigma^{2}).$ The joint normal density for $N$ observations is $(2 \pi \sigma^{2})^{-N/2}$ times the exponential of $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ up to an additive constant, this sum is the log likelihood function. <br />
*Likelihood equations, normal regression: find the maximum likelihood estimate, that is, the value $\hat{\beta}$ of $\beta$ that maximizes the log likelihood $l(\beta)= \sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ the maximum of $l(\beta)$ is found by differentiating $l(\beta)$ with respect to $\beta,$ setting the derivative equal to zero, and solving $\frac{dl(\beta)}{d \beta} = 0.$<br />
*Regression likelihood equations: the log likelihood looks like $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ differentiating with respect to $\beta$ and setting equal to zero gives $\sum_{i=1}^{N}x_{i}\sigma^{-2}(y_{i}-{x_{i}}'\beta) = 0.$ This is the likelihood equation. Solving for $\beta$ gives $\hat{\beta}=(\sum_{i} x_{i}{x_{i}}')^{-1}(\sum_{i} x_{i}y_{i}).$ <br />
*Normal longitudinal model: generalize to multivariate regression (longitudinal models). Suppose the continuous data vector $Y_{i}$ is $n \times 1$; the predictor matrix $X_{i}$ has $n$ rows and $p$ columns; and the coefficient vector $\beta$ is $p \times 1$. The model is $Y_{i}=X_{i}\beta+\epsilon_{i},$ with residual distribution $\epsilon_{i} \sim N(0, V_{i});$ the covariance matrix is written $V_{i}$ to distinguish it from the summation sign $\sum.$ The covariance matrix $V_{i}$ depends on an unknown parameter $\theta.$<br />
*Multivariate likelihood: the multivariate normal log likelihood is $\sum_{i=1}^{N} -0.5{(Y_{i}-X_{i}\beta)}'V_{i}^{-1}(Y_{i}-X_{i}\beta);$ it looks like the normal regression likelihood as before, where the vector $Y_{i}$ has replaced the scalar $y_{i}$; the vector $X_{i}\beta$ has replaced the scalar ${x_{i}}'\beta;$ and the matrix $V_{i}^{-1}$ has replaced $\sigma^{-2}.$<br />
*Multivariate observation likelihood: differentiating the log likelihood and setting it to zero gives the likelihood equation $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0,$ which is similar to the linear regression likelihood equation; the solution (supposing we know $V_{i}$) is $\hat{\beta}=(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}X_{i})^{-1}(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}Y_{i}),$ which is the weighted least squares estimator of $\beta.$<br />
*Interpretation of the likelihood equations: $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0;$ the residual $(Y_{i}-X_{i}\beta)$ is $(Y_{i}-\mu_{i}),$ the observation vector $Y_{i}$ minus its mean $\mu_{i}=X_{i}\beta;$ the $V_{i}^{-1}$ is the inverse of the covariance matrix; the $X_{i}$ can be thought of as the derivative of $\mu_{i}$ with respect to the regression parameter $\beta,$ written as $d\mu_{i}/d\beta.$<br />
*Multivariate generalized linear models equations: generalize to generalized linear models (GLMs) with correlated data. This is no longer a likelihood equation; instead we use a weighted least squares equation, in which each term in the previous likelihood equation is replaced with its generalized linear model equivalent and something is included to adjust for correlations. The $Y_{i}$ vector remains the same; the mean vector is $\mu_{i}$ rather than $X_{i}\beta;$ $X_{i}$ is replaced with the derivative $D_{i} = d\mu_{i}/d\beta;$ for GLMs $D_{i}$ has components that look like $D_{1}^{-1}X_{i},$ where the matrix $D_{1}$ is diagonal and has elements $v(\mu_{ij}),$ the variance functions of the means; thus $D_{i}$ is a weighted predictor matrix; the overdispersion parameter $\phi$ is not involved at this stage. The $V_{i}$'s are constructed to be similar to the covariance matrix $Var[Y_{i}],$ but not necessarily equal to it; the GEEs are $\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}(Y_{i}-\mu_{i}(\beta))=0,$ which are solved for $\beta$ to give the GEE estimate, written $\hat{\beta}_{GEE}.$ <br />
*Constructing the working covariance matrix: $V_{i} \approx Var[Y_{i}];$ separate $V_{i}$ into two parts, a variance matrix and a correlation matrix; the variance matrix is a diagonal matrix $A_{i}$ whose diagonal elements $\phi v(\mu_{ij})$ are the variances of the observations $Y_{ij}.$ The correlation matrix $Corr(Y_{i})$ is a function of the unknown parameter $\alpha$; the matrix $V_{i}$ is known as a working covariance matrix; the working correlation is not assumed to equal the true underlying correlation matrix $Corr[Y_{i}],$ and $V_{i}$ is intended to be close to $Var[Y_{i}]$ but is not assumed to be exactly correct. <br />
*Variances of the GEE estimates: the working covariance is a postulated covariance matrix of $Y_{i}-\mu_{i};$ it is not the final, true, or modeled covariance matrix, but rather a working matrix used to create the estimates $\hat{\beta}_{GEE}.$ If the model for the data is correct, then the covariance matrix of the estimate is $B^{-1}=Var[\hat{\beta}_{GEE}]=(\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}D_{i})^{-1};$ this is the same as the proc mixed covariance model $(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}X_{i})^{-1},$ but with $D_{i}=d\mu_{i}/d\beta$ replacing $X_{i}.$ If the working correlation matrix is assumed to be incorrect, the covariance matrix $B^{-1}$ is modified to $Var[\hat{\beta}_{GEE}]=B^{-1}MB^{-1},$ which increases the variance estimate, inflating $Var[\hat{\beta}_{GEE}]$ to pay for errors in $Corr[Y_{i}];$ $B^{-1}$ is the bread and $M$ is the meat in the $B^{-1}MB^{-1}$ sandwich, which gives this estimator the nickname 'sandwich estimator'. <br />
*The middle: $M=\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}Cov[Y_{i}]V_{i}^{-1}D_{i},$ where $D_{i}=d\mu_{i}/d\beta$ is estimated and $V_{i}$ is the working correlation matrix combined with the variance function. Since we do not know $Cov[Y_{i}],$ we replace it with $(Y_{i}-\hat{\mu}_{i})(Y_{i}-\hat{\mu}_{i})',$ which has the right expectation; $B^{-1}MB^{-1}$ is also known as the Huber sandwich estimator, and the resulting standard errors are called robust standard errors.<br />
*Advantages of GEE: the estimated regression coefficient $\hat{\beta}_{GEE}$ is asymptotically correct if the underlying regression mean model is correct, even if the assumed correlation model is incorrect. Valid standard errors are obtained with the sandwich covariance estimator, and the methodology is easy to use.<br />
*Disadvantages of GEE: the assumptions can correspond to a mathematically impossible model, that is, some combinations of correlations and variances are mathematically impossible, no matter how the data are generated; the sandwich estimator inflates standard errors; inflated SEs are a serious cost of not modeling the covariance correctly; GEE does not fully specify a statistical model and relies instead on asymptotic statements; the full assumptions are hidden.<br />
*GEE does not allow or make individual subject predictions; the GEE methodology is best suited to balanced longitudinal designs in which the number of subjects $N$ is large and the number $n$ of repeated measures is small; GEE is not good for highly unbalanced data sets.<br />
<br />
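The sandwich construction described above can be sketched by hand in base R for the simplest case: an identity link with an independence working correlation, where the GEE estimate coincides with ordinary least squares and $\phi$ cancels from $B^{-1}MB^{-1}$. The simulated data and all object names below (N, n, id, time, X, bread, meat) are illustrative assumptions, not part of the original examples:<br />
set.seed(1)<br />
N <- 50; n <- 4 # N subjects, n repeated measures each<br />
id <- rep(seq_len(N), each = n)<br />
time <- rep(seq_len(n), times = N)<br />
b <- rep(rnorm(N), each = n) # shared subject effect induces within-subject correlation<br />
y <- 1 + 0.5 * time + b + rnorm(N * n)<br />
X <- cbind(1, time)<br />
beta <- solve(crossprod(X), crossprod(X, y)) # solves sum of X_i'(Y_i - X_i beta) = 0<br />
r <- as.vector(y - X %*% beta) # residuals Y_i - mu_i, stacked over subjects<br />
bread <- solve(crossprod(X)) # inverse of sum of X_i'X_i<br />
meat <- matrix(0, ncol(X), ncol(X))<br />
for (i in seq_len(N)) {<br />
rows <- id == i<br />
s <- crossprod(X[rows, , drop = FALSE], r[rows]) # X_i'(Y_i - mu_i), a p x 1 matrix<br />
meat <- meat + s %*% t(s) # accumulates X_i' r_i r_i' X_i<br />
}<br />
V_robust <- bread %*% meat %*% bread # sandwich estimate of Var[beta]<br />
sqrt(diag(V_robust)) # robust standard errors<br />
With real data one would normally rely on a fitted GEE object (for example from geeglm) rather than this hand computation; the sketch is only meant to make the bread and meat terms concrete.<br />
<br />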
<br />
===Applications===<br />
<br />
1) [http://www.jstor.org/discover/10.2307/2531734?uid=3739728&uid=2&uid=4&uid=3739256&sid=21104055415031 This article] discussed extensions of generalized linear models for the analysis of longitudinal data. Two approaches are considered: subject-specific (SS) models in which heterogeneity in regression parameters is explicitly modeled; and population-average (PA) models in which the aggregate response for the population is the focus. We use a generalized estimating equation approach to fit both classes of models for discrete and continuous outcomes. When the subject-specific parameters are assumed to follow a Gaussian distribution, simple relationships between the PA and SS parameters are available. The methods are illustrated with an analysis of data on mother’s smoking and children’s respiratory disease.<br />
<br />
2) [http://wiki.socr.umich.edu/index.php/SOCR_EduMaterials_Activities_NegativeBinomial This article] discusses generalized estimating equations for correlated binary data using the odds ratio as a measure of association. Moment methods for analyzing repeated binary responses had been proposed previously, with estimation of the parameters associated with the expected value of an individual's vector of binary responses as well as the correlations between pairs of binary responses. Because the odds ratio has many desirable properties, and some investigators may find it easier to interpret, the paper discusses modeling the association between binary responses at pairs of times with the odds ratio and modifies the estimating equations of Prentice to estimate the odds ratios. In simulations, the parameter estimates for the logistic regression model for the marginal probabilities appear slightly more efficient when using the odds ratio parameterization.<br />
<br />
3) [http://orm.sagepub.com/content/7/2/127.short?rss=1&ssource=mfr This article] discusses the generalized estimating equation (GEE) approach to longitudinal data analysis. The GEE approach of Zeger and Liang facilitates analysis of data collected in longitudinal, nested, or repeated measures designs. GEEs use the generalized linear model to estimate more efficient and unbiased regression parameters relative to ordinary least squares regression, in part because they permit specification of a working correlation matrix that accounts for the form of within-subject correlation of responses on dependent variables of many different distributions, including normal, binomial, and Poisson. The author briefly explains the theory behind GEEs and their beneficial statistical properties and limitations, and compares GEEs to suboptimal approaches for analyzing longitudinal data through two examples. The first demonstration applies GEEs to the analysis of data from a longitudinal lab study with a counted response variable; the second applies GEEs to the analysis of data with a normally distributed response variable from subjects nested within branch offices of an organization.<br />
<br />
<br />
===Software=== <br />
[http://www.jstatsoft.org/v15/i02/paper The geeglm function] in the R package geepack fits GEE models, for example:<br />
geeglm(outcome ~ baseline + center + sex + treat + age + I(age^2), data = respiratory, id = interaction(center, id), family = binomial, corstr = "exchangeable")<br />
<br />
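The corstr argument selects the working correlation structure. A minimal sketch, assuming the dietox data shipped with geepack and a deliberately simple mean model with a gaussian family (both choices are illustrative, not taken from the paper), compares the coefficient estimates under three structures:<br />
library(geepack)<br />
data(dietox)<br />
fit_ind <- geeglm(Weight ~ Time, data = dietox, id = Pig, family = gaussian, corstr = "independence")<br />
fit_exc <- geeglm(Weight ~ Time, data = dietox, id = Pig, family = gaussian, corstr = "exchangeable")<br />
fit_ar1 <- geeglm(Weight ~ Time, data = dietox, id = Pig, family = gaussian, corstr = "ar1")<br />
round(cbind(independence = coef(fit_ind), exchangeable = coef(fit_exc), ar1 = coef(fit_ar1)), 3) # one column of estimates per working correlation<br />
If the mean model is adequate, the estimates should be broadly similar across structures, since consistency of the GEE estimate does not depend on the working correlation being correct.<br />
<br />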
===Problems===<br />
<br />
Example 1:<br />
library(geepack) # install this package first<br />
data(dietox)<br />
summary(dietox)<br />
Weight Feed Time Pig <br />
Min. : 15.00 Min. : 3.30 Min. : 1.000 Min. :4601 <br />
1st Qu.: 38.30 1st Qu.: 32.80 1st Qu.: 3.000 1st Qu.:4857 <br />
Median : 59.20 Median : 74.50 Median : 6.000 Median :5866 <br />
Mean : 60.73 Mean : 80.73 Mean : 6.481 Mean :6238 <br />
3rd Qu.: 81.20 3rd Qu.:123.00 3rd Qu.: 9.000 3rd Qu.:8050 <br />
Max. :117.00 Max. :224.50 Max. :12.000 Max. :8442 <br />
NA's :72 <br />
Evit Cu Litter <br />
Min. :1.000 Min. :1.000 Min. : 1.00 <br />
1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 5.00 <br />
Median :2.000 Median :2.000 Median :11.00 <br />
Mean :2.027 Mean :2.015 Mean :12.14 <br />
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:20.00 <br />
Max. :3.000 Max. :3.000 Max. :24.00 <br />
attach(dietox)<br />
Cu <- as.factor(dietox$Cu)<br />
mf <- formula(Weight~Cu*(Time+I(Time^2)+I(Time^3)))<br />
gee <- geeglm(mf,data=dietox,id=Pig,family=poisson('identity'),corstr='ar1') ## The geeglm function fits generalized estimating equations using the ‘geese.fit’ function of the ‘geepack’ package for doing the actual computations. <br />
gee<br />
Call:<br />
geeglm(formula = mf, family = poisson("identity"), data = dietox, <br />
id = Pig, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Cu Time I(Time^2) I(Time^3) Cu:Time <br />
22.025614857 0.013857443 2.177350061 0.683063921 -0.027808001 0.424728590 <br />
Cu:I(Time^2) Cu:I(Time^3) <br />
-0.047363387 0.001313779 <br />
<br />
Degrees of Freedom: 861 Total (i.e. Null); 853 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 0.7877976<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0.9576663 <br />
Number of clusters: 72 Maximum cluster size: 12 <br />
<br />
aov(gee)<br />
<br />
Call:<br />
aov(formula = gee)<br />
Terms:<br />
Cu Time I(Time^2) I(Time^3) Cu:Time Cu:I(Time^2)<br />
Sum of Squares 521.2 492381.6 1006.6 297.1 0.0 34.1<br />
Deg. of Freedom 1 1 1 1 1 1<br />
Cu:I(Time^3) Residuals<br />
Sum of Squares 1.4 42350.3<br />
Deg. of Freedom 1 853<br />
<br />
Residual standard error: 7.046179<br />
Estimated effects may be unbalanced<br />
<br />
Example 2 (use dataset of Normal and Schizophrenia Neuroimaging study of Children at http://wiki.socr.umich.edu/index.php/SOCR_Data_Oct2009_ID_NI)<br />
▪ Subject: Subject identifier<br />
▪ Age: Subject Age<br />
▪ DX: Subject diagnosis (Normals=1; Schizophrenia=2)<br />
▪ Sex: Subject gender (Male=1; female=2)<br />
▪ FS_IQ: Subject Intelligence Quotient (IQ)<br />
▪ TBV: Total Brain Volume (mm<sup>3</sup>)<br />
▪ GMV: Total Gray Matter Volume (mm<sup>3</sup>)<br />
▪ WMV: Total White Matter Volume (mm<sup>3</sup>)<br />
▪ CSF: Total Cerebrospinal Fluid (mm<sup>3</sup>)<br />
▪ Background: Background volume (mm<sup>3</sup>)<br />
▪ L_superior_frontal_gyrus, R_superior_frontal_gyrus, ..., brainstem: 56 regional cortical and subcortical volumes (region anatomical names are encoded in the column heading).<br />
<br />
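neuro <- read.csv('M:/neuro.csv',header=TRUE) # adjust the path to where the data file was saved<br />
summary(neuro)<br />
mf2 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3+TBV+GMV+WMV+CSF+Background) # in an R formula, Age^2 and Age^3 without I() reduce to Age, so the fit below has no polynomial Age terms<br />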
gee2 <- geeglm(mf2,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1')<br />
gee2<br />
Call:<br />
geeglm(formula = mf2, family = poisson("identity"), data = neuro, <br />
id = Subject, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Sex Age TBV GMV WMV CSF Background <br />
-3.093394e+02 2.255625e+01 3.532712e+00 3.392570e+00 -3.392482e+00 -3.392484e+00 -3.392693e+00 3.144914e-05 <br />
Sex:Age <br />
-1.526318e+00 <br />
<br />
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 2.538312<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0 <br />
<br />
Number of clusters: 63 Maximum cluster size: 1 <br />
<br />
aov(gee2)<br />
<br />
Call:<br />
aov(formula = gee2)<br />
<br />
Terms:<br />
Sex Age TBV GMV WMV CSF Background Sex:Age Residuals<br />
Sum of Squares 497.829 740.825 1040.712 25.337 3131.065 205.501 384.453 366.533 15459.175<br />
Deg. of Freedom 1 1 1 1 1 1 1 1 54<br />
<br />
Residual standard error: 16.91984<br />
Estimated effects may be unbalanced<br />
<br />
mf3 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3) # as above, Age^2 and Age^3 without I() reduce to Age<br />
gee3 <- geeglm(mf3,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1')<br />
gee3<br />
<br />
Call:<br />
geeglm(formula = mf3, family = poisson("identity"), data = neuro, <br />
id = Subject, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Sex Age Sex:Age <br />
81.3791778 1.9781743 1.7571481 -0.5086719 <br />
<br />
Degrees of Freedom: 63 Total (i.e. Null); 59 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 3.364596<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0 <br />
<br />
Number of clusters: 63 Maximum cluster size: 1 <br />
<br />
aov(gee3)<br />
Call:<br />
aov(formula = gee3)<br />
<br />
Terms:<br />
Sex Age Sex:Age Residuals<br />
Sum of Squares 497.829 740.825 29.199 20583.576<br />
Deg. of Freedom 1 1 1 59<br />
<br />
Residual standard error: 18.67817<br />
Estimated effects may be unbalanced<br />
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Generalized_estimating_equation GEE Wikipedia]<br />
<br />
[https://onlinecourses.science.psu.edu/stat504/node/180 Introductions to GEE]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_GEE}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_GEE&diff=14424SMHS GEE2014-10-17T12:48:25Z<p>Clgalla: /* Scientific Methods for Health Sciences - Generalized Estimating Equations (GEE) Models */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Generalized Estimating Equations (GEE) Models ==<br />
===Overview===<br />
Generalized estimating equations (GEE) are a commonly used method for estimating the parameters of a generalized linear model when there is a possibly unknown correlation between outcomes. GEE provides a general approach for analyzing discrete and continuous responses with marginal models and is a popular alternative to maximum likelihood estimation (MLE). In this section, we present a general introduction to the GEE method and illustrate its application with examples. The geeglm function from the R package geepack is discussed in the attached paper reference.<br />
<br />
===Motivation===<br />
There is no natural specification of, or convenient way to work with, the joint multivariate distribution of $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}'$ when the responses are discrete. The GEE method, which is based on the concept of estimating equations, was proposed for this situation. It provides a general approach for analyzing discrete and continuous responses with marginal models. So, how does the GEE model work?<br />
<br />
===Theory===<br />
<br />
'''1) GEE'''<br />
<br />
GEE is a commonly used method for estimating the parameters of a generalized linear model when there is a possibly unknown correlation between outcomes. Under mild regularity conditions, parameter estimates from GEE are consistent even when the covariance structure is misspecified. The focus of GEE is on estimating the average response over the population rather than the regression parameters that would enable prediction of the effect of changing one or more covariates on a given individual. <br />
*GEEs belong to the class of semi-parametric regression techniques since they rely on specification of only the first two moments and are a popular alternative to the likelihood-based generalized linear mixed model. <br />
*They are commonly used in large epidemiological studies, especially multi-site cohort studies because they can handle many types of unmeasured dependence between outcomes.<br />
<br />
'''2) Marginal models'''<br />
<br />
The vector of observations from subject $i$ is $Y_{i} = {(Y_{i1}, Y_{i2}, \cdots, Y_{in})}',$ for $i=1,2, \cdots, N.$ Consider a marginal model that has: a mean function, which is a function of the predictors; a known variance function (for discrete data, the mean determines the variance); possibly a scale parameter for overdispersion; and a working correlation matrix. <br />
*More specifically, the marginal mean of the response is $E[Y_{ij}]=\mu_{ij},$ and the mean depends on the explanatory variables $X_{ij}$ through a known link function: $g(\mu_{ij})=\eta_{ij} ={X_{ij}}' \beta.$<br />
*The marginal variance of $Y_{ij}$ depends on the marginal mean according to $Var[Y_{ij}]=v(\mu_{ij}) \phi,$ where the variance function $v(\mu_{ij})$ is known and $\phi =1$ is fixed for Poisson and Bernoulli responses. The parameter $\phi$ may have to be estimated for the normal model or for overdispersed data. The correlation between $Y_{ij}$ and $Y_{ik}$ is a function of an extra parameter, $\alpha,$ and may also depend on the means $\mu_{ij}$ and $\mu_{ik}.$<br />
*The main idea is to generalize the usual univariate likelihood equations by introducing the covariance matrix of the vector of responses, $Y_{i}$. For linear models, weighted least squares (WLS) or generalized least squares (GLS) can be considered a special case of this 'estimating equations' approach; for non-linear models, this approach is called 'generalized estimating equations' (GEE).<br />
<br />
'''3) GEE computation:''' <br />
<br />
*Consider normal linear regression $y_{i}={x_{i}}’ \beta + \epsilon_{i}, \epsilon_{i} \sim N(0, \sigma^{2}).$ The normal density for $N$ observations looks like $(2* \pi * \sigma^{2})^{-N/2}$ times the exponential of $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta),$ which is also referred to as the log likelihood function. <br />
*Likelihood equations, normal regression: find the maximum likelihood estimates by finding the value $\hat{\beta}$ of $\beta$ that maximizes the likelihood $l(\beta)= \sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ maximum of a function $l(\beta)$ is found by differentiating the function $l(\beta)$ with respect to $\beta,$ setting this equal to zero and then solving $\frac{dl(\beta)}{d \beta} = 0.$<br />
*Regression likelihood equations: the log likelihood looks like $\sum_{i=1}^{N} -0.5(y_{i}- {x_{i}}' \beta) \sigma^{-2} (y_{i} - {x_{i}}' \beta);$ differentiating with respect to $\beta$ and setting equal to zero gives $\sum_{i=1}^{N}{x_{i}}'\sigma^{-2}(y_{i}-{x_{i}}'\beta) = 0.$ This is the likelihood equation. Solving for $\beta$ gives us $\hat{\beta}=(\sum_{i} x_{i}{x_{i}}')^{-1}(\sum_{i} x_{i}y_{i}).$ <br />
*Normal longitudinal model: generalize to multivariate regression (longitudinal models); suppose continuous data $Y_{i}$, $n \times 1$; predictor matrix $X_{i}$, $n$ rows by $p$ columns; coefficient vector $\beta$, $p \times 1$. The model is $Y_{i}=X_{i}\beta+\epsilon_{i},$ the residual distribution is $\epsilon_{i} \sim N(0, V_{i}),$ and the variance matrix is written $V_{i}$ to distinguish it from the summation sign $\sum.$ The covariance matrix $V_{i}$ depends on an unknown parameter $\theta.$<br />
*Multivariate likelihood: the multivariate normal log likelihood is $\sum_{i=1}^{N} -0.5{(Y_{i}-X_{i}\beta)}'V_{i}^{-1}(Y_{i}-X_{i}\beta);$ it looks like the normal regression likelihood above; vector $Y_{i}$ has replaced scalar $y_{i}$; vector $X_{i}\beta$ has replaced scalar ${x_{i}}'\beta;$ and matrix $V_{i}^{-1}$ has replaced $\sigma^{-2}.$<br />
*Multivariate observation likelihood: differentiate the log likelihood and set it to zero; this gives the likelihood equation $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0,$ which is similar to the linear regression likelihood equation; the solution (supposing we know $V_{i}$) is $\hat{\beta}=(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}X_{i})^{-1}(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}Y_{i}),$ the weighted least squares estimator of $\beta.$<br />
*Interpretation of the likelihood equations: $\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}(Y_{i}-X_{i}\beta)=0;$ the residual $(Y_{i}-X_{i}\beta)$ is $(Y_{i}-\mu_{i}),$ the observation vector $Y_{i}$ minus its mean $\mu_{i}=X_{i}\beta;$ the $V_{i}^{-1}$ is the inverse of the covariance matrix; the $X_{i}$ can be thought of as the derivative of $\mu_{i}$ with respect to the regression parameter $\beta,$ written as $d\mu_{i}/d\beta.$<br />
*Multivariate generalized linear model equations: generalize to generalized linear models (GLMs) with correlated data. The result is no longer a likelihood equation; instead we use a weighted least squares type of estimating equation, in which each term of the previous likelihood equation is replaced with its generalized linear model equivalent and something is included to adjust for correlations. The $Y_{i}$ vector remains the same; the mean vector is $\mu_{i}$ rather than $X_{i}\beta;$ replace $X_{i}$ with the derivative $D_{i} = d\mu_{i}/d\beta;$ for GLMs $D_{i}$ has components that look like $D_{1}^{-1}X_{i},$ where the matrix $D_{1}$ is diagonal and has elements $v(\mu_{ij}),$ the variance functions of the means; thus $D_{i}$ is a weighted predictor matrix; the overdispersion parameter $\phi$ is not involved at this stage. The $V_{i}$'s are constructed to be similar to the covariance matrix $Var[Y_{i}]$, but not necessarily equal to it; the GEEs are $\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}(Y_{i}-\mu_{i}(\beta))=0;$ solving for $\beta$ gives the GEE estimate, written $\hat{\beta}_{GEE}.$ <br />
*Constructing the working covariance matrix: $V_{i} \approx Var[Y_{i}];$ separate $V_{i}$ into two parts, a variance matrix and a correlation matrix; the variance matrix is a diagonal matrix $A_{i}$ whose diagonal elements $\phi v(\mu_{ij})$ are the variances of the observations $Y_{ij}.$ The working correlation matrix $R_{i}(\alpha),$ which stands in for $Corr(Y_{i}),$ is a function of an unknown parameter $\alpha$; the two parts combine as $V_{i}=A_{i}^{1/2}R_{i}(\alpha)A_{i}^{1/2}.$ The matrix $V_{i}$ is known as a working covariance matrix; it is not the true underlying covariance matrix $Var[Y_{i}]$; $V_{i}$ is intended to be close to $Var[Y_{i}],$ but is not assumed to be exactly correct. <br />
*Variances of the GEE estimates: the working covariance is a postulated covariance matrix for $Y_{i}-\mu_{i};$ it is not the final, true, or modeled covariance matrix, but rather a working matrix used to create the estimates $\hat{\beta}_{GEE}.$ If the working covariance model is correct, then the covariance matrix of the estimate is $B^{-1}=Var[\hat{\beta}_{GEE}]=(\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}D_{i})^{-1};$ this is the same as the proc mixed covariance model $(\sum_{i=1}^{N}{X_{i}}'V_{i}^{-1}X_{i})^{-1},$ but with $D_{i}=d\mu_{i}/d\beta$ replacing $X_{i}.$ If the working correlation matrix may be incorrect, modify the covariance matrix $B^{-1}$ to $Var[\hat{\beta}_{GEE}]=B^{-1}MB^{-1};$ this typically increases the variance estimate, inflating $Var[\hat{\beta}_{GEE}]$ to pay for errors in the assumed $Corr[Y_{i}];$ $B^{-1}$ is the bread and $M$ is the meat in the $B^{-1}MB^{-1}$ sandwich, which gives this estimator the nickname ‘sandwich estimator’. <br />
*The middle: $M=\sum_{i=1}^{N}{D_{i}}'V_{i}^{-1}Cov[Y_{i}]V_{i}^{-1}D_{i},$ where $D_{i}=d\mu_{i}/d\beta$ is evaluated at the estimate and $V_{i}$ is the working correlation matrix scaled by the variance function. Since we do not know $Cov[Y_{i}],$ we replace it with $(Y_{i}-\hat{\mu}_{i})(Y_{i}-\hat{\mu}_{i})',$ which has the right expectation; $B^{-1}MB^{-1}$ is also known as the Huber sandwich estimator, and the resulting standard errors are called robust standard errors.<br />
*Advantages of GEE: the estimated regression coefficient $\hat{\beta}_{GEE}$ is asymptotically correct if the underlying regression mean model is correct, even if the assumed correlation model is incorrect. Valid standard errors are obtained with the sandwich covariance estimator, and the methodology is easy to use.<br />
*Disadvantages of GEE: the assumptions can correspond to a mathematically impossible model, that is some combinations of correlations and variances are mathematically impossible, no matter how the data is generated; the sandwich estimator inflates standard errors; inflated SE’s are a serious cost of not modeling the covariance correctly; GEE does not fully specify a statistical model and covers instead with asymptotic statements; full assumptions are hidden.<br />
*GEE does not make individual subject predictions; the GEE methodology is best suited to balanced longitudinal designs where the number $N$ of subjects (observation vectors) is large and the number $n$ of repeated measures is small; GEE is not good for highly unbalanced data sets.<br />
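
The closed-form solution of the regression likelihood equation above can be checked numerically. The following is a minimal sketch in base R on simulated data; the sample size, coefficients and variable names are illustrative assumptions, not part of the original notes.

```r
# Sketch: closed-form least squares estimator on simulated data (all values hypothetical)
set.seed(1)
N <- 200
X <- cbind(1, rnorm(N), rnorm(N))          # row i of X is x_i'
beta <- c(2, -1, 0.5)                      # "true" coefficients used for the simulation
y <- drop(X %*% beta + rnorm(N, sd = 1.5))

# beta_hat = (sum_i x_i x_i')^{-1} (sum_i x_i y_i) = (X'X)^{-1} X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Left-hand side of the likelihood equation, sum_i x_i (y_i - x_i' beta_hat);
# sigma^{-2} is a common factor and drops out
score <- crossprod(X, y - X %*% beta_hat)

# Reference fit with lm(); "- 1" because X already contains the intercept column
fit <- lm(y ~ X - 1)
all.equal(drop(beta_hat), unname(coef(fit)))
max(abs(score))
```

The score evaluated at the estimate should be zero up to rounding error, and the hand-computed estimate should agree with the one returned by lm().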
<br />
<br />
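
The sandwich construction $B^{-1}MB^{-1}$ described above can also be computed by hand. The sketch below assumes the simplest case, an identity link (so $D_{i}=X_{i}$) and an independence working covariance ($V_{i}=I$), on simulated clustered data; the data-generating choices and all object names are hypothetical.

```r
# Sketch: sandwich (robust) covariance by hand, identity link, independence working covariance
set.seed(2)
N <- 50; n <- 4                                # N subjects, n repeated measures each
id   <- rep(seq_len(N), each = n)
time <- rep(seq_len(n), times = N)
b    <- rep(rnorm(N), each = n)                # shared subject effect -> within-subject correlation
y    <- 1 + 0.5 * time + b + rnorm(N * n)
X    <- cbind(1, time)

# With V_i = I the estimating equation reduces to ordinary least squares
beta_hat <- solve(crossprod(X), crossprod(X, y))
res <- drop(y - X %*% beta_hat)

B <- crossprod(X)                              # bread: sum_i D_i' V_i^{-1} D_i
M <- matrix(0, ncol(X), ncol(X))               # meat:  sum_i D_i' V_i^{-1} r_i r_i' V_i^{-1} D_i
for (s in unique(id)) {
  Xi <- X[id == s, , drop = FALSE]
  ri <- res[id == s]
  M <- M + crossprod(Xi, ri) %*% crossprod(ri, Xi)
}

V_sandwich <- solve(B) %*% M %*% solve(B)      # B^{-1} M B^{-1}
V_naive    <- solve(B) * sum(res^2) / (length(y) - ncol(X))   # model-based, ignores clustering

cbind(robust_se = sqrt(diag(V_sandwich)), naive_se = sqrt(diag(V_naive)))
```

Comparing the two columns shows how much the standard errors change once within-subject correlation is taken into account; the direction and size of the change depend on the data and on the covariate.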
===Applications===<br />
<br />
1) [http://www.jstor.org/discover/10.2307/2531734?uid=3739728&uid=2&uid=4&uid=3739256&sid=21104055415031 This article] discussed extensions of generalized linear models for the analysis of longitudinal data. Two approaches are considered: subject-specific (SS) models in which heterogeneity in regression parameters is explicitly modeled; and population-average (PA) models in which the aggregate response for the population is the focus. We use a generalized estimating equation approach to fit both classes of models for discrete and continuous outcomes. When the subject-specific parameters are assumed to follow a Gaussian distribution, simple relationships between the PA and SS parameters are available. The methods are illustrated with an analysis of data on mother’s smoking and children’s respiratory disease.<br />
<br />
2) [http://wiki.socr.umich.edu/index.php/SOCR_EduMaterials_Activities_NegativeBinomial This article] discusses generalized estimating equations for correlated binary data using the odds ratio as a measure of association. Moment methods for analyzing repeated binary responses have been proposed before, in which estimation of the parameters is associated with the expected value of an individual's vector of binary responses as well as the correlations between pairs of binary responses. Because the odds ratio has many desirable properties, and some investigators may find the odds ratio easier to interpret, the paper discusses modeling the association between binary responses at pairs of times with the odds ratio and modifies the estimating equations of Prentice to estimate the odds ratios. In simulations, the parameter estimates for the logistic regression model for the marginal probabilities appear slightly more efficient when using the odds ratio parameterization.<br />
<br />
3) [http://orm.sagepub.com/content/7/2/127.short?rss=1&ssource=mfr This article] talked about the generalized estimating equation (GEE) for longitudinal data analysis. The GEE approach of Zeger and Liang facilitates analysis of data collected in longitudinal, nested, or repeated measures designs. GEEs use the generalized linear model to estimate more efficient and unbiased regression parameters relative to ordinary least squares regression in part because they permit specification of a working correlation matrix that accounts for the form of within-subject correlation of responses on dependent variables of many different distributions, including normal, binomial, and Poisson. The author briefly explains the theory behind GEEs and their beneficial statistical properties and limitations and compares GEEs to suboptimal approaches for analyzing longitudinal data through use of two examples. The first demonstration applies GEEs to the analysis of data from a longitudinal lab study with a counted response variable; the second demonstration applies GEEs to analysis of data with a normally distributed response variable from subjects nested within branch offices of an organization.<br />
<br />
<br />
===Software=== <br />
[http://www.jstatsoft.org/v15/i02/paper The geeglm function] from the R package geepack, for example: geeglm(outcome ~ baseline + center + sex + treat + age + I(age^2), data = respiratory, id = interaction(center, id), family = binomial, corstr = "exchangeable")<br />
<br />
===Problems===<br />
<br />
Example 1:<br />
library(geepack) # install this package first<br />
data(dietox)<br />
summary(dietox)<br />
Weight Feed Time Pig <br />
Min. : 15.00 Min. : 3.30 Min. : 1.000 Min. :4601 <br />
1st Qu.: 38.30 1st Qu.: 32.80 1st Qu.: 3.000 1st Qu.:4857 <br />
Median : 59.20 Median : 74.50 Median : 6.000 Median :5866 <br />
Mean : 60.73 Mean : 80.73 Mean : 6.481 Mean :6238 <br />
3rd Qu.: 81.20 3rd Qu.:123.00 3rd Qu.: 9.000 3rd Qu.:8050 <br />
Max. :117.00 Max. :224.50 Max. :12.000 Max. :8442 <br />
NA's :72 <br />
Evit Cu Litter <br />
Min. :1.000 Min. :1.000 Min. : 1.00 <br />
1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 5.00 <br />
Median :2.000 Median :2.000 Median :11.00 <br />
Mean :2.027 Mean :2.015 Mean :12.14 <br />
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:20.00 <br />
Max. :3.000 Max. :3.000 Max. :24.00 <br />
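attach(dietox)
Cu <- as.factor(dietox$Cu) # note: geeglm below is called with data=dietox, so it uses the numeric Cu column of dietox (one Cu slope in the output), not this factor copy
mf <- formula(Weight~Cu*(Time+I(Time^2)+I(Time^3))) # cubic growth curve in Time, with Cu interacting with every Time term
gee <- geeglm(mf,data=dietox,id=Pig,family=poisson('identity'),corstr='ar1') ## The geeglm function fits generalized estimating equations using the ‘geese.fit’ function of the ‘geepack’ package for doing the actual computations. 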
gee<br />
Call:<br />
geeglm(formula = mf, family = poisson("identity"), data = dietox, <br />
id = Pig, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Cu Time I(Time^2) I(Time^3) Cu:Time <br />
22.025614857 0.013857443 2.177350061 0.683063921 -0.027808001 0.424728590 <br />
Cu:I(Time^2) Cu:I(Time^3) <br />
-0.047363387 0.001313779 <br />
<br />
Degrees of Freedom: 861 Total (i.e. Null); 853 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 0.7877976<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0.9576663 <br />
Number of clusters: 72 Maximum cluster size: 12 <br />
<br />
aov(gee)<br />
<br />
Call:<br />
aov(formula = gee)<br />
Terms:<br />
Cu Time I(Time^2) I(Time^3) Cu:Time Cu:I(Time^2)<br />
Sum of Squares 521.2 492381.6 1006.6 297.1 0.0 34.1<br />
Deg. of Freedom 1 1 1 1 1 1<br />
Cu:I(Time^3) Residuals<br />
Sum of Squares 1.4 42350.3<br />
Deg. of Freedom 1 853<br />
<br />
Residual standard error: 7.046179<br />
Estimated effects may be unbalanced<br />
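
As a variation on Example 1, the sketch below fits the same dietox data with a Gaussian family and two different working correlation structures, to illustrate that the coefficient estimates are usually similar while the standard errors reported by summary() are the robust (sandwich) ones. The simplified formula Weight ~ Time and the choice of family are assumptions made for illustration only; they are not the model used in the example above.

```r
# Sketch: comparing working correlation structures with geepack (simplified model, illustrative only)
library(geepack)   # install this package first
data(dietox)

# geeglm expects the rows of each cluster (pig) to be contiguous and in time order
dietox <- dietox[order(dietox$Pig, dietox$Time), ]

fit_ind <- geeglm(Weight ~ Time, id = Pig, data = dietox,
                  family = gaussian, corstr = "independence")
fit_ar1 <- geeglm(Weight ~ Time, id = Pig, data = dietox,
                  family = gaussian, corstr = "ar1")

# Point estimates under the two working correlation structures
cbind(independence = coef(fit_ind), ar1 = coef(fit_ar1))

# Coefficient table with robust standard errors and Wald tests
coef(summary(fit_ar1))
```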
<br />
Example 2 (uses the dataset of the Normal and Schizophrenia Neuroimaging study of Children, available at http://wiki.socr.umich.edu/index.php/SOCR_Data_Oct2009_ID_NI)
* Subject: Subject identifier
* Age: Subject Age
* DX: Subject diagnosis (Normals=1; Schizophrenia=2)
* Sex: Subject gender (Male=1; Female=2)
* FS_IQ: Subject Intelligence Quotient (IQ)
* TBV: Total Brain Volume (mm<sup>3</sup>)
* GMV: Total Gray Matter Volume (mm<sup>3</sup>)
* WMV: Total White Matter Volume (mm<sup>3</sup>)
* CSF: Total Cerebrospinal Fluid (mm<sup>3</sup>)
* Background: Background volume (mm<sup>3</sup>)
* L_superior_frontal_gyrus, R_superior_frontal_gyrus, ..., brainstem: 56 regional cortical and subcortical volumes (region anatomical names are encoded in the column heading).
<br />
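neuro <- read.csv('M:/neuro.csv',header=TRUE) # adjust the path to where the downloaded data file is saved
summary(neuro)
mf2 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3+TBV+GMV+WMV+CSF+Background) # note: inside a formula Age^2 and Age^3 reduce to Age; polynomial terms need I(Age^2), I(Age^3), which is why no such terms appear in the output
gee2 <- geeglm(mf2,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1') # each Subject has a single row (maximum cluster size 1 in the output), so no within-subject correlation can be estimated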
gee2<br />
Call:<br />
geeglm(formula = mf2, family = poisson("identity"), data = neuro, <br />
id = Subject, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Sex Age TBV GMV WMV CSF Background <br />
-3.093394e+02 2.255625e+01 3.532712e+00 3.392570e+00 -3.392482e+00 -3.392484e+00 -3.392693e+00 3.144914e-05 <br />
Sex:Age <br />
-1.526318e+00 <br />
<br />
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 2.538312<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0 <br />
<br />
Number of clusters: 63 Maximum cluster size: 1 <br />
<br />
aov(gee2)<br />
<br />
Call:<br />
aov(formula = gee2)<br />
<br />
Terms:<br />
Sex Age TBV GMV WMV CSF Background Sex:Age Residuals<br />
Sum of Squares 497.829 740.825 1040.712 25.337 3131.065 205.501 384.453 366.533 15459.175<br />
Deg. of Freedom 1 1 1 1 1 1 1 1 54<br />
<br />
Residual standard error: 16.91984<br />
Estimated effects may be unbalanced<br />
<br />
mf3 <- formula(FS_IQ ~ Sex+Age*Sex+Age^2+Age^3) # as in mf2, Age^2 and Age^3 reduce to Age inside a formula
gee3 <- geeglm(mf3,data=neuro,id=Subject,family=poisson('identity'),corstr='ar1')<br />
gee3<br />
<br />
Call:<br />
geeglm(formula = mf3, family = poisson("identity"), data = neuro, <br />
id = Subject, corstr = "ar1")<br />
<br />
Coefficients:<br />
(Intercept) Sex Age Sex:Age <br />
81.3791778 1.9781743 1.7571481 -0.5086719 <br />
<br />
Degrees of Freedom: 63 Total (i.e. Null); 59 Residual<br />
<br />
Scale Link: identity<br />
Estimated Scale Parameters: [1] 3.364596<br />
<br />
Correlation: Structure = ar1 Link = identity <br />
Estimated Correlation Parameters:<br />
alpha <br />
0 <br />
<br />
Number of clusters: 63 Maximum cluster size: 1 <br />
<br />
aov(gee3)<br />
Call:<br />
aov(formula = gee3)<br />
<br />
Terms:<br />
Sex Age Sex:Age Residuals<br />
Sum of Squares 497.829 740.825 29.199 20583.576<br />
Deg. of Freedom 1 1 1 59<br />
<br />
Residual standard error: 18.67817<br />
Estimated effects may be unbalanced<br />
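
The coefficient tables in Example 2 contain no quadratic or cubic Age terms. This follows from how R formulas treat the ^ operator, as the base R sketch below illustrates; the object names f_plain and f_poly are hypothetical and no data are needed.

```r
# Sketch: how '^' is interpreted inside an R formula (no data required)
f_plain <- FS_IQ ~ Sex + Age*Sex + Age^2 + Age^3        # formula as written in Example 2
f_poly  <- FS_IQ ~ Sex + Age*Sex + I(Age^2) + I(Age^3)  # polynomial terms protected by I()

# In a formula, Age^2 means "Age crossed with itself", which is just Age
labels_plain <- attr(terms(f_plain), "term.labels")
labels_poly  <- attr(terms(f_poly),  "term.labels")

labels_plain   # main effects and the interaction only
labels_poly    # additionally contains I(Age^2) and I(Age^3)
```

Wrapping the powers in I() is the usual way to request arithmetic squaring and cubing; refitting with such a formula would give different output from the one shown above.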
<br />
===References===<br />
<br />
[http://en.wikipedia.org/wiki/Generalized_estimating_equation GEE Wikipedia]<br />
<br />
[https://onlinecourses.science.psu.edu/stat504/node/180 Introductions to GEE]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_GEE}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_Surveys&diff=14423SMHS Surveys2014-10-16T20:57:13Z<p>Clgalla: /* Software */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Surveys ==<br />
<br />
===Overview===<br />
Survey methodology studies the sampling of individual units from a population and the data collection techniques, such as questionnaire construction, used to improve the number and accuracy of responses to surveys. The ultimate goal is to make statistical inferences about the population studied, which of course depends strongly on the survey questions asked. Commonly used survey types include polls, public health surveys, market research surveys, and censuses. Surveys provide important information for the public and for research, and are widely applied in a variety of fields such as marketing, health care, and sociology. In this lecture, we present a general introduction to surveys and illustrate various survey methods with examples.<br />
<br />
===Motivation===<br />
Surveys may be one of the most commonly used methods for data collection. The questions used in a survey are of significant importance in collecting enough data to make statistical inferences about the population. There are various ways of collecting data in surveys; each has its strengths and weaknesses and suits different kinds of data. So, how do surveys work? And how do we conduct a good survey?<br />
<br />
===Theory===<br />
<br />
'''1) Surveys:'''<br />
A survey consists of at least one sample (or the full population in the case of a census), a method of data collection, and individual questions that become data, which can be further analyzed statistically. <br />
*A single survey may focus on different types of topics such as preferences, opinions, behavior or factual information depending on the purpose of the study. <br />
*Given that a survey is based on a sample of the population, the success of the research depends largely on the representativeness of the sample with respect to the target population. <br />
*Survey methodology aims to identify principles about the sample design, data collection instruments, statistical adjustment of data, data processing, and final data analysis that can create systematic and random survey errors. Survey errors are sometimes analyzed in connection with survey cost: the trade-off is framed either as improving quality within cost constraints or, alternatively, as reducing costs for a fixed level of quality. <br />
<br />
'''2) Survey methodology topics'''<br />
The most important challenges of a survey method include making decisions on how to: (1) identify and select potential sample members; (2) contact sampled individuals and collect data from those who are hard to reach or reluctant to respond to the questions; (3) evaluate and test questions; (4) select the mode for posing questions and collecting responses; (5) train and supervise interviewers if they are involved; (6) check data files for accuracy and internal consistency; (7) adjust survey estimates to correct for identified errors.<br />
*Selecting samples: there are two main types of survey samples: probability samples and non-probability samples. Stratified sampling is a method of probability sampling in which the sub-populations (strata) within the population are identified and each is included in the selected sample in a balanced way.<br />
*Modes of data collection: the choice between various modes of administering a survey can be influenced by several factors including costs, coverage of the target population, flexibility of asking questions, respondents’ willingness to participate and response accuracy. The mode effect means that different modes can change the way the respondents answer. The most commonly used modes of administration can be summarized into some main categories including telephone, mail, online surveys, personal in-home surveys, personal mall or street intercept surveys and hybrids of the above. <br />
**Telephone: use of interviewers encourages sample persons to respond, leading to higher response rates; interviewers can increase comprehension of questions by answering respondents’ questions; fairly cost efficient, depending on local call charge structure; good for large national sampling frames; some potential for interviewer bias; cannot be used for non-audio information; the three main types of telephone survey are traditional telephone interviews, computer assisted telephone dialing and computer assisted telephone interviewing (CATI).<br />
**Online surveys: (1) Web surveys are faster, simpler, and cheaper. However, lower costs are not so straightforward in practice, as they are strongly interconnected to errors. Because response rate comparisons to other survey modes are usually not favorable for online surveys, efforts to achieve a higher response rate (e.g., with traditional solicitation methods) may substantially increase costs. (2) The entire data collection period is significantly shortened, as all data can be collected and processed in little more than a month. (3) Interaction between the respondent and the questionnaire is more dynamic compared to e-mail or paper surveys. Online surveys are also less intrusive, and they suffer less from social desirability effects. (4) Complex skip patterns can be implemented in ways that are mostly invisible to the respondent. (5) Pop-up instructions can be provided for individual questions to provide help with questions exactly where assistance is required. (6) Questions with long lists of answer choices can be used to provide immediate coding of answers to certain questions that are usually asked in an open-ended fashion in paper questionnaires. (7) Online surveys can be tailored to the situation (e.g., respondents may be allowed to save a partially completed form, the questionnaire may be preloaded with already available information, etc.). Online questionnaires may be improved by applying usability testing, where usability is measured with reference to the speed with which a task can be performed, the frequency of errors and user satisfaction with the interface.<br />
**Mail: the questionnaires may be handed to the respondents or mailed to them, but in all cases they are returned to the researcher via mail; the advantages are that the cost of a mail survey is very low and there is no interviewer bias; however, there may be long delays, and mail surveys are not suitable for issues that may require clarification. Response rates can be improved by using mail panels, offering monetary incentives, and improving the class of mail through which the surveys are sent. <br />
**Face-to-face: suitable for locations where telephone or mail are not developed; potential for interviewer bias; easy to manipulate by completing multiple times to skew results.<br />
**Mixed-mode surveys: with the introduction of computers to the survey process, survey mode now includes combinations of different approaches (mixed-mode designs). Some commonly used methods include computer-assisted personal interviewing (CAPI), audio computer assisted self-interviewing (audio CASI), computer-assisted telephone interviewing (CATI) and interactive voice response (IVR).<br />
*Cross-sectional and longitudinal surveys: the former involve a single questionnaire administered to each sample member, and the latter are surveys that repeatedly collect information from the same people over time. Longitudinal surveys are usually considered to have analytical advantages but can be challenging to implement successfully. As a result, specialist methods have been developed to select longitudinal samples, to collect data repeatedly, to keep track of sample members over time, to keep respondents motivated to participate, and to process and analyze longitudinal survey data.<br />
*Response formats: there are two kinds of question formats used in surveys: open-ended questions, which require the respondents to formulate their own answers, and closed-ended questions, which require the respondent to pick an answer from a given list of mutually exclusive and exhaustive options. There are four types of response scales for closed-ended questions: dichotomous (two options), nominal-polytomous (more than two unordered options), ordinal-polytomous (more than two ordered options) and bounded continuous (continuous scaled questions). A respondent’s answer to an open-ended question can be coded into a response scale afterwards or analyzed using more qualitative methods.<br />
*Nonresponse reduction in telephone and face-to-face surveys: (1) Advance letter: sent in advance to inform the sampled respondents about the upcoming survey. It should be personalized but not overdone; (2) Training: the interviewers are thoroughly trained in how to ask respondents questions, how to work with computers and how to schedule callbacks to respondents not reached; (3) Short introduction: the interviewer should always start with a short introduction giving his or her name, the institute he or she works for, the length of the interview and the goal of the interview; (4) Respondent-friendly survey questionnaire: the questions asked must be clear, non-offensive and easy to respond to for the subjects under study.<br />
*Interviewer effects: survey responses may be affected by physical characteristics of the interviewer, including race, gender, and relative body weight (BMI). These characteristics of the interviewer are particularly influential when the questions are related to the interviewer trait. While interviewer effects have been investigated mainly for face-to-face surveys, they have also been shown to exist for interview modes with no visual contact, such as telephone surveys, and in video-enhanced web surveys. <br />
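
Stratified sampling, described under "Selecting samples" above, can be sketched in base R as follows. The population, the three strata and the 10% sampling fraction are hypothetical values chosen only for illustration.

```r
# Sketch: proportional stratified sampling in base R (hypothetical population)
set.seed(3)
pop <- data.frame(
  id      = 1:1000,
  stratum = sample(c("urban", "suburban", "rural"), 1000,
                   replace = TRUE, prob = c(0.5, 0.3, 0.2))
)

frac <- 0.10   # sampling fraction applied within every stratum

# Split row indices by stratum, then draw a simple random sample inside each stratum
rows_by_stratum <- split(seq_len(nrow(pop)), pop$stratum)
idx <- unlist(lapply(rows_by_stratum, function(rows) {
  rows[sample.int(length(rows), size = round(frac * length(rows)))]
}))
samp <- pop[idx, ]

# Each stratum is represented in (nearly) the same proportion as in the population
table(pop$stratum)
table(samp$stratum)
```

Sampling within strata guarantees that every sub-population appears in the sample in proportion to its size, which a simple random sample of the whole population only achieves on average.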
<br />
'''3) Simple examples of surveys can be viewed here:'''<br />
<br />
http://www.socr.umich.edu/html/SOCR_Survey.html.<br />
<br />
http://www.esurveyspro.com/Survey.aspx?id=79ba4c38-b7d0-4530-aa00-12fdd32b6609 <br />
<br />
http://socr.ucla.edu/docs/surveys/SOCR_Survey_VisualIllusions_2010.html <br />
<br />
===Applications===<br />
<br />
1) [http://www.sciencedirect.com/science/article/pii/S0895435697001261 This article] characterized response rates for mail surveys published in medical journals, determined how the response rate varies among subjects who are typical targets of mail surveys, and evaluated the contribution of several techniques used by investigators to enhance response rates. Methods: one hundred seventy-eight manuscripts published in 1991, representing 321 distinct mail surveys, were abstracted to determine response rates and survey techniques. In a follow-up mail survey, 113 authors of these manuscripts provided supplementary information. Results: the mean response rate among mail surveys published in medical journals is approximately 60%. However, response rates vary according to subject studied and techniques used. Published surveys of physicians have a mean response rate of only 54%, and those of non-physicians have a mean response rate of 68%. In addition, multivariable models suggest that written reminders provided with a copy of the instrument and telephone reminders are each associated with response rates about 13% higher than surveys that do not use these techniques. Other techniques, such as anonymity and financial incentives, are not associated with higher response rates. Conclusions: although several mail survey techniques are associated with higher response rates, response rates to published mail surveys tend to be moderate. However, a survey's response rate is at best an indirect indication of the extent of non-respondent bias. Investigators, journal editors, and readers should devote more attention to assessments of bias, and less to specific response rate thresholds.<br />
<br />
2) [http://www.bmj.com/content/320/7237/745 This article] surveyed operating theatre and intensive care unit staff about attitudes concerning error, stress, and teamwork and compared these attitudes with those of airline cockpit crew. The study used cross-sectional surveys involving urban teaching and non-teaching hospitals in the United States, Israel, Germany, Switzerland and Italy; it included 1033 doctors, nurses, fellows and residents working in operating theatres and intensive care units and over 30000 cockpit crew members, and measured perceptions of error, stress and teamwork. The study concluded that medical staff reported that error is important but difficult to discuss and not handled well in their hospital. Barriers to discussing error are more important since medical staff seem to deny the effect of stress and fatigue on performance. Further problems include differing perceptions of teamwork among team members and reluctance of senior theatre staff to accept input from junior members.<br />
<br />
3) [http://link.springer.com/article/10.1023/A:1025054610557 This article] provided a review of epidemiological studies of pervasive developmental disorders (PDD), which updates a previously published article. The design and sample characteristics of 32 surveys published between 1966 and 2001 are described. Recent surveys suggest that the rate for all forms of PDDs is around 30/10,000, but more recent surveys suggest that the estimate might be as high as 60/10,000. The rate for Asperger disorder is not well established, and a conservative figure is 2.5/10,000. Childhood disintegrative disorder is extremely rare, with a pooled estimate across studies of 0.2/10,000. A detailed discussion of the possible interpretations of trends over time in prevalence rates is provided. There is evidence that changes in case definition and improved awareness explain much of the upward trend of rates in recent decades. However, available epidemiological surveys do not provide an adequate test of the hypothesis of a changing incidence of PDDs.<br />
<br />
===Software=== <br />
[http://www.keysurvey.com/?gclid=CjwKEAjw9eyeBRCqxc_b-LD8kTESJADsBMxS4usDhK8STe_svcEgzSPjw9dk99zGcaujAq5waWPrrxoCsRzw_wcB World App Key Survey]<br />
<br />
[http://www.qualtrics.com/research-suite/ Qualtrics]<br />
<br />
===Problems===<br />
Suppose you want to study the effectiveness of a newly released drug for headache. Can you come up with a short survey (5 or 6 questions) for a group of patients with headache who have been using this drug for the past three months? Which survey mode would you choose here, and why? (open question)<br />
<br />
<br />
===References===<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Surveys}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_Surveys&diff=14422SMHS Surveys2014-10-16T20:53:36Z<p>Clgalla: /* References */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Surveys ==<br />
<br />
===Overview===<br />
Survey methodology studies on the sampling of individual units from population then apply survey data collection techniques such as questionnaires to improve the number and accuracy of responses to surveys. The ultimate goal is to make statistical inferences about the population studied, which would of course, depends strongly on the survey questions provided. The commonly used survey methods include polls, public health surveys, market research surveys, censuses and so on. Surveys provide important information for public information and research fields and are widely applied in varieties of fields such as marketing, health professionals, sociology and so on. In this lecture, we are going to present a general introduction to surveys and various methods used in surveys will be illustrated with examples.<br />
<br />
===Motivation===<br />
Surveys may be one of the most commonly used methods for data collection. The questions used in the survey are of significant importance in collecting enough data to make statistical inference of the population. There are various ways of collecting data in surveys and they all have their strengths and weakness and are applied to different kinds of data. So, how do surveys work? And how to perform a good survey?<br />
<br />
===Theory===<br />
<br />
'''1) Surveys:'''<br />
Surveys are made of at least one sample (or the full population in the case of a census), a method of data collection and individual questions that become data, which can be further analyzed statistically. <br />
*A single survey may focus on different types of topics such as preferences, opinions, behavior or factual information depending on the purpose of the study. <br />
*Given that survey is based on one sample of the population, the success of research depends largely on the representativeness of the sample with respect to the target population. <br />
*Surveys aim to identify principles about the sample design, data collection instruments, statistical adjustment of data, data processing, and final data analysis that can be used to create systematic and random survey errors, which can sometimes be analyzed in connection with survey cost. The cost constraints are sometimes framed as improving quality within cost constraints, or alternatively, reducing costs for a fixed level of quality. <br />
<br />
'''2) Survey methodology topics'''<br />
The most important challenges of a survey method include making decisions on how to: (1) identify and select potential sample members; (2) contact sampled individuals and collect data from those who are hard to reach or reluctant to respond to the questions; (3) evaluate and test questions; (4) select the mode for posting questions and collecting responses; (5) train and supervise interviewers if they are involved; (6) check data files for accuracy and internal consistency; (7) adjust survey estimates to correct for identified errors.<br />
*Selecting samples: there are two main types of survey samples: probability samples and non-probability samples. Stratified sampling is a method of probability sampling in which the sub-populations (strata) within the population are identified and each is represented in the selected sample in a balanced way.<br />
*Modes of data collection: the choice between various modes of administering a survey can be influenced by several factors including costs, coverage of the target population, flexibility of asking questions, respondents’ willingness to participate and response accuracy. Mode effect created by different methods can change the way the respondents answer. The most commonly used modes of administration can be summarized into some main categories including telephone, mail, online surveys, personal in-home surveys, personal mall or street intercept survey and hybrids of the above. <br />
**Telephone: use of interviewers encourages sample persons to respond, leading to higher response rates; interviewers can increase comprehension of questions by answering respondents’ questions; fairly cost efficient, depending on local call charge structure; good for large national sampling frames; some potential for interviewer bias; cannot be used for non-audio information; three main types of telephone include traditional telephone interviews, computer assisted telephone dialing and computer assisted telephone interviewing (CATI).<br />
**Online surveys: (1) Web surveys are faster, simpler, and cheaper. However, lower costs are not so straightforward in practice, as they are strongly interconnected to errors. Because response rate comparisons to other survey modes are usually not favorable for online surveys, efforts to achieve a higher response rate (e.g., with traditional solicitation methods) may substantially increase costs. (2) The entire data collection period is significantly shortened, as all data can be collected and processed in little more than a month. (3) Interaction between the respondent and the questionnaire is more dynamic compared to e-mail or paper surveys. Online surveys are also less intrusive, and they suffer less from social desirability effects. (4) Complex skip patterns can be implemented in ways that are mostly invisible to the respondent. (5) Pop-up instructions can be provided for individual questions to provide help with questions exactly where assistance is required. (6) Questions with long lists of answer choices can be used to provide immediate coding of answers to certain questions that are usually asked in an open-ended fashion in paper questionnaires. (7) Online surveys can be tailored to the situation (e.g., respondents may be allowed to save a partially completed form, the questionnaire may be preloaded with already available information, etc.). Online questionnaires may be improved by applying usability testing, where usability is measured with reference to the speed with which a task can be performed, the frequency of errors and user satisfaction with the interface.<br />
**Mail: the questionnaires may be handed to the respondents or mailed to them, but in all cases they are returned to the researcher via mail. The advantages are that the cost of a mail survey is very low and there is no interviewer bias; however, there may be long delays, and mail surveys are not suitable for issues that may require clarification. Response rates can be improved by using mail panels, monetary incentives, and a higher class of mail for sending the surveys. <br />
**Face-to-face: suitable for locations where telephone or mail are not developed; potential for interviewer bias; easy to manipulate by completing multiple times to skew results.<br />
**Mixed-mode surveys: with the introduction of computers to the survey process, survey mode now includes combinations of different approaches (mixed-mode designs). Some commonly used methods include computer-assisted personal interviewing (CAPI), audio computer assisted self-interviewing (audio CASI), computer-assisted telephone interviewing (CATI) and interactive voice response (IVR).<br />
*Cross-sectional and longitudinal surveys: the former involve a single questionnaire administered to each sample member, while the latter repeatedly collect information from the same people over time. Longitudinal surveys are usually considered to have analytical advantages but can be challenging to implement successfully. As a result, specialist methods have been developed to select longitudinal samples, to collect data repeatedly, to keep track of sample members over time, to keep respondents motivated to participate, and to process and analyze longitudinal survey data.<br />
*Response formats: two kinds of question formats are used in surveys: open-ended questions, which require the respondents to formulate their own answers, and closed-ended questions, which require the respondent to pick an answer from a given list of mutually exclusive and exhaustive options. There are four types of response scales for closed-ended questions: dichotomous (two options), nominal-polytomous (more than two unordered options), ordinal-polytomous (more than two ordered options) and bounded continuous (continuous scaled questions). A respondent’s answer to an open-ended question can be coded into a response scale afterwards or analyzed using more qualitative methods.<br />
*Nonresponse reduction in telephone and face-to-face surveys: (1) Advance letter: sent in advance to inform the sampled respondents about the upcoming survey. It should be personalized but not overdone; (2) Training: the interviewers should be thoroughly trained in how to ask respondents questions, how to work with computers, and how to schedule callbacks to respondents who were not reached; (3) Short introduction: the interviewer should always start with a short introduction giving his or her name, the institute he or she works for, the length of the interview and the goal of the interview; (4) Respondent-friendly survey questionnaire: the questions asked must be clear, non-offensive and easy to respond to for the subjects under study.<br />
*Interviewer effects: survey responses may be affected by physical characteristics of the interviewer, including race, gender, and relative body weight (BMI). These characteristics of the interviewer are particularly influential when the questions relate to the interviewer trait. While interviewer effects have been investigated mainly for face-to-face surveys, they have also been shown to exist for interview modes with no visual contact, such as telephone surveys, and in video-enhanced web surveys. <br />
<br />
3) Simple examples of survey can be viewed here:<br />
<br />
http://www.socr.umich.edu/html/SOCR_Survey.html.<br />
<br />
http://www.esurveyspro.com/Survey.aspx?id=79ba4c38-b7d0-4530-aa00-12fdd32b6609 <br />
<br />
http://socr.ucla.edu/docs/surveys/SOCR_Survey_VisualIllusions_2010.html <br />
<br />
===Applications===<br />
<br />
1) [http://www.sciencedirect.com/science/article/pii/S0895435697001261 This article] characterizes response rates for mail surveys published in medical journals and determines how the response rate among subjects who are typical targets of mail surveys varies, by evaluating the contribution of several techniques used by investigators to enhance response rates. Methods. One hundred seventy-eight manuscripts published in 1991, representing 321 distinct mail surveys, were abstracted to determine response rates and survey techniques. In a follow-up mail survey, 113 authors of these manuscripts provided supplementary information. Results. The mean response rate among mail surveys published in medical journals is approximately 60%. However, response rates vary according to subject studied and techniques used. Published surveys of physicians have a mean response rate of only 54%, and those of non-physicians have a mean response rate of 68%. In addition, multivariable models suggest that written reminders provided with a copy of the instrument and telephone reminders are each associated with response rates about 13% higher than surveys that do not use these techniques. Other techniques, such as anonymity and financial incentives, are not associated with higher response rates. Conclusions. Although several mail survey techniques are associated with higher response rates, response rates to published mail surveys tend to be moderate. However, a survey's response rate is at best an indirect indication of the extent of non-respondent bias. Investigators, journal editors, and readers should devote more attention to assessments of bias, and less to specific response rate thresholds.<br />
<br />
2) [http://www.bmj.com/content/320/7237/745 This article] surveys operating theatre and intensive care unit staff about attitudes concerning error, stress, and teamwork and compares these attitudes with those of airline cockpit crew. The study used cross-sectional surveys of urban teaching and non-teaching hospitals in the United States, Israel, Germany, Switzerland and Italy; it included 1033 doctors, nurses, fellows and residents working in operating theatres and intensive care units and over 30000 cockpit crew members, and measured perceptions of error, stress and teamwork. The study concluded that medical staff reported that error is important but difficult to discuss and not handled well in their hospital. Barriers to discussing error are more important since medical staff seem to deny the effect of stress and fatigue on performance. Further problems include differing perceptions of teamwork among team members and reluctance of senior theatre staff to accept input from junior members.<br />
<br />
3) [http://link.springer.com/article/10.1023/A:1025054610557 This article] provides a review of epidemiological studies of pervasive developmental disorders (PDD), which updates a previously published article. The design and sample characteristics of 32 surveys published between 1966 and 2001 are described. Recent surveys suggest that the rate for all forms of PDDs is around 30/10,000, but more recent surveys suggest that the estimate might be as high as 60/10,000. The rate for Asperger disorder is not well established, and a conservative figure is 2.5/10,000. Childhood disintegrative disorder is extremely rare with a pooled estimate across studies of 0.2/10,000. A detailed discussion of the possible interpretations of trends over time in prevalence rates is provided. There is evidence that changes in case definition and improved awareness explain much of the upward trend of rates in recent decades. However, available epidemiological surveys do not provide an adequate test of the hypothesis of a changing incidence of PDDs.<br />
<br />
===Software=== <br />
http://www.keysurvey.com/?gclid=CjwKEAjw9eyeBRCqxc_b-LD8kTESJADsBMxS4usDhK8STe_svcEgzSPjw9dk99zGcaujAq5waWPrrxoCsRzw_wcB<br />
<br />
http://www.qualtrics.com/research-suite/ <br />
<br />
===Problems===<br />
Suppose you want to study the effectiveness of a newly released headache drug. Can you come up with a short survey (5 or 6 questions) for a group of patients with headache who have been using this drug for the past three months? What kind of survey mode would you choose here and why? (open question)<br />
<br />
<br />
===References===<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Surveys}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_Surveys&diff=14421SMHS Surveys2014-10-16T20:50:04Z<p>Clgalla: /* Scientific Methods for Health Sciences - Surveys */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Surveys ==<br />
<br />
===Overview===<br />
Survey methodology studies the sampling of individual units from a population and the associated survey data collection techniques, such as questionnaire design and methods for improving the number and accuracy of responses to surveys. The ultimate goal is to make statistical inferences about the population studied, which of course depend strongly on the survey questions asked. Commonly used surveys include polls, public health surveys, market research surveys, censuses and so on. Surveys provide important information for the public and for research, and are widely applied in fields such as marketing, the health professions and sociology. In this lecture, we present a general introduction to surveys and illustrate various survey methods with examples.<br />
<br />
===Motivation===<br />
Surveys are among the most commonly used methods of data collection. The wording and design of the survey questions largely determine whether the collected data support valid statistical inference about the population. There are various ways of collecting survey data; each has its own strengths and weaknesses and suits different kinds of data. So, how do surveys work, and how do we conduct a good survey?<br />
<br />
===Theory===<br />
<br />
'''1) Surveys:'''<br />
Surveys are made of at least one sample (or the full population in the case of a census), a method of data collection and individual questions that become data, which can be further analyzed statistically. <br />
*A single survey may focus on different types of topics such as preferences, opinions, behavior or factual information depending on the purpose of the study. <br />
*Given that a survey is based on a sample of the population, the success of the research depends largely on how well the sample represents the target population. <br />
*Survey methodology aims to identify principles of sample design, data collection instruments, statistical adjustment of data, data processing, and final data analysis that may create systematic and random survey errors. Survey errors are sometimes analyzed in connection with survey cost: the trade-off is framed either as improving quality within cost constraints or, alternatively, as reducing costs for a fixed level of quality. <br />
<br />
'''2) Survey methodology topics'''<br />
The most important challenges of a survey method include making decisions on how to: (1) identify and select potential sample members; (2) contact sampled individuals and collect data from those who are hard to reach or reluctant to respond to the questions; (3) evaluate and test questions; (4) select the mode for posting questions and collecting responses; (5) train and supervise interviewers if they are involved; (6) check data files for accuracy and internal consistency; (7) adjust survey estimates to correct for identified errors.<br />
*Selecting samples: there are two main types of survey samples: probability samples and non-probability samples. Stratified sampling is a method of probability sampling in which the sub-populations (strata) within the population are identified and each is represented in the selected sample in a balanced way.<br />
*Modes of data collection: the choice between various modes of administering a survey can be influenced by several factors including costs, coverage of the target population, flexibility of asking questions, respondents’ willingness to participate and response accuracy. Mode effect created by different methods can change the way the respondents answer. The most commonly used modes of administration can be summarized into some main categories including telephone, mail, online surveys, personal in-home surveys, personal mall or street intercept survey and hybrids of the above. <br />
**Telephone: use of interviewers encourages sample persons to respond, leading to higher response rates; interviewers can increase comprehension of questions by answering respondents’ questions; fairly cost efficient, depending on local call charge structure; good for large national sampling frames; some potential for interviewer bias; cannot be used for non-audio information; three main types of telephone include traditional telephone interviews, computer assisted telephone dialing and computer assisted telephone interviewing (CATI).<br />
**Online surveys: (1) Web surveys are faster, simpler, and cheaper. However, lower costs are not so straightforward in practice, as they are strongly interconnected to errors. Because response rate comparisons to other survey modes are usually not favorable for online surveys, efforts to achieve a higher response rate (e.g., with traditional solicitation methods) may substantially increase costs. (2) The entire data collection period is significantly shortened, as all data can be collected and processed in little more than a month. (3) Interaction between the respondent and the questionnaire is more dynamic compared to e-mail or paper surveys. Online surveys are also less intrusive, and they suffer less from social desirability effects. (4) Complex skip patterns can be implemented in ways that are mostly invisible to the respondent. (5) Pop-up instructions can be provided for individual questions to provide help with questions exactly where assistance is required. (6) Questions with long lists of answer choices can be used to provide immediate coding of answers to certain questions that are usually asked in an open-ended fashion in paper questionnaires. (7) Online surveys can be tailored to the situation (e.g., respondents may be allowed to save a partially completed form, the questionnaire may be preloaded with already available information, etc.). Online questionnaires may be improved by applying usability testing, where usability is measured with reference to the speed with which a task can be performed, the frequency of errors and user satisfaction with the interface.<br />
**Mail: the questionnaires may be handed to the respondents or mailed to them, but in all cases they are returned to the researcher via mail. The advantages are that the cost of a mail survey is very low and there is no interviewer bias; however, there may be long delays, and mail surveys are not suitable for issues that may require clarification. Response rates can be improved by using mail panels, monetary incentives, and a higher class of mail for sending the surveys. <br />
**Face-to-face: suitable for locations where telephone or mail are not developed; potential for interviewer bias; easy to manipulate by completing multiple times to skew results.<br />
**Mixed-mode surveys: with the introduction of computers to the survey process, survey mode now includes combinations of different approaches (mixed-mode designs). Some commonly used methods include computer-assisted personal interviewing (CAPI), audio computer assisted self-interviewing (audio CASI), computer-assisted telephone interviewing (CATI) and interactive voice response (IVR).<br />
*Cross-sectional and longitudinal surveys: the former involve a single questionnaire administered to each sample member, while the latter repeatedly collect information from the same people over time. Longitudinal surveys are usually considered to have analytical advantages but can be challenging to implement successfully. As a result, specialist methods have been developed to select longitudinal samples, to collect data repeatedly, to keep track of sample members over time, to keep respondents motivated to participate, and to process and analyze longitudinal survey data.<br />
*Response formats: two kinds of question formats are used in surveys: open-ended questions, which require the respondents to formulate their own answers, and closed-ended questions, which require the respondent to pick an answer from a given list of mutually exclusive and exhaustive options. There are four types of response scales for closed-ended questions: dichotomous (two options), nominal-polytomous (more than two unordered options), ordinal-polytomous (more than two ordered options) and bounded continuous (continuous scaled questions). A respondent’s answer to an open-ended question can be coded into a response scale afterwards or analyzed using more qualitative methods.<br />
*Nonresponse reduction in telephone and face-to-face surveys: (1) Advance letter: sent in advance to inform the sampled respondents about the upcoming survey. It should be personalized but not overdone; (2) Training: the interviewers should be thoroughly trained in how to ask respondents questions, how to work with computers, and how to schedule callbacks to respondents who were not reached; (3) Short introduction: the interviewer should always start with a short introduction giving his or her name, the institute he or she works for, the length of the interview and the goal of the interview; (4) Respondent-friendly survey questionnaire: the questions asked must be clear, non-offensive and easy to respond to for the subjects under study.<br />
*Interviewer effects: survey responses may be affected by physical characteristics of the interviewer, including race, gender, and relative body weight (BMI). These characteristics of the interviewer are particularly influential when the questions relate to the interviewer trait. While interviewer effects have been investigated mainly for face-to-face surveys, they have also been shown to exist for interview modes with no visual contact, such as telephone surveys, and in video-enhanced web surveys. <br />
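
The stratified sampling idea mentioned under "Selecting samples" can be sketched in base R. This is only an illustrative sketch: the sampling frame, the three age strata, the overall sample size and the use of proportional allocation are all assumptions made for the example, not part of any particular survey.<br />

```r
# Hypothetical sampling frame: 1000 patients in three age strata (made-up data)
set.seed(1)
frame <- data.frame(
  id      = 1:1000,
  stratum = rep(c("18-39", "40-64", "65+"), times = c(500, 300, 200))
)

n_total <- 100  # desired overall sample size (assumption)

# Proportional allocation: each stratum contributes in proportion to its size
stratum_sizes <- table(frame$stratum)
n_per_stratum <- setNames(as.integer(round(n_total * stratum_sizes / nrow(frame))),
                          names(stratum_sizes))

# Simple random sample without replacement inside every stratum
sampled_ids <- unlist(lapply(names(n_per_stratum), function(s) {
  ids <- frame$id[frame$stratum == s]
  sample(ids, size = n_per_stratum[[s]])
}))

stratified_sample <- frame[frame$id %in% sampled_ids, ]
table(stratified_sample$stratum)
```

With proportional allocation every stratum appears in the sample in the same proportion as in the frame; other allocations (for example, oversampling a small stratum) would change only the computation of <code>n_per_stratum</code>.<br />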
<br />
3) Simple examples of survey can be viewed here:<br />
<br />
http://www.socr.umich.edu/html/SOCR_Survey.html.<br />
<br />
http://www.esurveyspro.com/Survey.aspx?id=79ba4c38-b7d0-4530-aa00-12fdd32b6609 <br />
<br />
http://socr.ucla.edu/docs/surveys/SOCR_Survey_VisualIllusions_2010.html <br />
<br />
===Applications===<br />
<br />
1) [http://www.sciencedirect.com/science/article/pii/S0895435697001261 This article] characterizes response rates for mail surveys published in medical journals and determines how the response rate among subjects who are typical targets of mail surveys varies, by evaluating the contribution of several techniques used by investigators to enhance response rates. Methods. One hundred seventy-eight manuscripts published in 1991, representing 321 distinct mail surveys, were abstracted to determine response rates and survey techniques. In a follow-up mail survey, 113 authors of these manuscripts provided supplementary information. Results. The mean response rate among mail surveys published in medical journals is approximately 60%. However, response rates vary according to subject studied and techniques used. Published surveys of physicians have a mean response rate of only 54%, and those of non-physicians have a mean response rate of 68%. In addition, multivariable models suggest that written reminders provided with a copy of the instrument and telephone reminders are each associated with response rates about 13% higher than surveys that do not use these techniques. Other techniques, such as anonymity and financial incentives, are not associated with higher response rates. Conclusions. Although several mail survey techniques are associated with higher response rates, response rates to published mail surveys tend to be moderate. However, a survey's response rate is at best an indirect indication of the extent of non-respondent bias. Investigators, journal editors, and readers should devote more attention to assessments of bias, and less to specific response rate thresholds.<br />
<br />
2) [http://www.bmj.com/content/320/7237/745 This article] surveys operating theatre and intensive care unit staff about attitudes concerning error, stress, and teamwork and compares these attitudes with those of airline cockpit crew. The study used cross-sectional surveys of urban teaching and non-teaching hospitals in the United States, Israel, Germany, Switzerland and Italy; it included 1033 doctors, nurses, fellows and residents working in operating theatres and intensive care units and over 30000 cockpit crew members, and measured perceptions of error, stress and teamwork. The study concluded that medical staff reported that error is important but difficult to discuss and not handled well in their hospital. Barriers to discussing error are more important since medical staff seem to deny the effect of stress and fatigue on performance. Further problems include differing perceptions of teamwork among team members and reluctance of senior theatre staff to accept input from junior members.<br />
<br />
3) [http://link.springer.com/article/10.1023/A:1025054610557 This article] provides a review of epidemiological studies of pervasive developmental disorders (PDD), which updates a previously published article. The design and sample characteristics of 32 surveys published between 1966 and 2001 are described. Recent surveys suggest that the rate for all forms of PDDs is around 30/10,000, but more recent surveys suggest that the estimate might be as high as 60/10,000. The rate for Asperger disorder is not well established, and a conservative figure is 2.5/10,000. Childhood disintegrative disorder is extremely rare with a pooled estimate across studies of 0.2/10,000. A detailed discussion of the possible interpretations of trends over time in prevalence rates is provided. There is evidence that changes in case definition and improved awareness explain much of the upward trend of rates in recent decades. However, available epidemiological surveys do not provide an adequate test of the hypothesis of a changing incidence of PDDs.<br />
<br />
===Software=== <br />
http://www.keysurvey.com/?gclid=CjwKEAjw9eyeBRCqxc_b-LD8kTESJADsBMxS4usDhK8STe_svcEgzSPjw9dk99zGcaujAq5waWPrrxoCsRzw_wcB<br />
<br />
http://www.qualtrics.com/research-suite/ <br />
<br />
===Problems===<br />
Suppose you want to study the effectiveness of a newly released headache drug. Can you come up with a short survey (5 or 6 questions) for a group of patients with headache who have been using this drug for the past three months? What kind of survey mode would you choose here and why? (open question)<br />
<br />
<br />
===References===<br />
<br />
http://mirlyn.lib.umich.edu/Record/004199238 <br />
<br />
http://mirlyn.lib.umich.edu/Record/004232056 <br />
<br />
http://mirlyn.lib.umich.edu/Record/004133572<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_Surveys}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14420SMHS MixtureModeling2014-10-16T20:42:52Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
===Overview=== <br />
A mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that the observed data set identify the sub-population to which each individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of a mixture model, various types of mixture models, the estimation of their parameters, and their application in studies. The implementation of mixture modeling in R packages is also discussed in the attached articles.<br />
<br />
===Motivation===<br />
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the sub-populations. Mixture models, in turn, are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. This is not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
===Theory===<br />
'''1) Structure of mixture model:''' a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, where the $\lambda_{k}$ are the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for $f_{k}$ generate different parametric mixture models, such as Gaussian, Binomial or Poisson mixtures: the components may all be Gaussian with different parameters, or all Poisson with different means. The model can then be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). When $K=n$, the number of observations, we are back to kernel density estimation. <br />
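
The two-stage description above (draw the hidden label $Z$, then draw $X$ from component $f_{Z}$) can be illustrated with a short base R sketch. The mixing weights, means and standard deviations below are arbitrary illustrative values, and <code>dmix</code> is a hypothetical helper name introduced only for this example.<br />

```r
set.seed(42)
lambda <- c(0.6, 0.4)   # mixing weights (illustrative values, sum to 1)
mu     <- c(0, 5)       # component means (illustrative)
sigma  <- c(1, 2)       # component standard deviations (illustrative)
n      <- 5000

# Stage 1: draw the hidden component label Z with probabilities lambda
z <- sample(1:2, size = n, replace = TRUE, prob = lambda)
# Stage 2: draw X given Z = k from the k-th Gaussian component
x <- rnorm(n, mean = mu[z], sd = sigma[z])

# Mixture density f(x) = sum_k lambda_k * f(x; theta_k)
dmix <- function(x) {
  lambda[1] * dnorm(x, mu[1], sigma[1]) + lambda[2] * dnorm(x, mu[2], sigma[2])
}

# Sanity check: a valid mixture density integrates to 1
total_mass <- integrate(dmix, lower = -Inf, upper = Inf)$value
```

Note that only <code>x</code> would be observed in practice; the labels <code>z</code> are exactly the sub-population identities that the mixture model does not require.<br />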
<br />
'''2) Estimating parametric mixture models:''' assume independent samples, so that the likelihood of the observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$. <br />
<br />
<br />
We take the logarithm to turn multiplication into addition: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$. <br />
<br />
<br />
We then take the derivative of this with respect to one parameter, say $\theta_{j}$: $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$. <br />
<br />
<br />
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of $x_{i}$ depends on the cluster, being $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$ <br />
<br />
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K} \lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$<br />
*EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
*Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of the parameters, or it could even be a non-parametric density estimator.<br />
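
The EM steps listed above can be sketched for the special case of a two-component Gaussian mixture. This is a minimal base R illustration, not the algorithm as implemented in any package: the function name <code>em_two_gaussians</code>, the quartile-based starting values, the stopping tolerance and the simulated data are all assumptions made for the example.<br />

```r
# Minimal EM for a two-component Gaussian mixture (illustrative sketch only)
em_two_gaussians <- function(x, max_iter = 200, tol = 1e-6) {
  # (1) Crude starting guesses (assumption): quartiles as means, pooled sd
  lambda <- c(0.5, 0.5)
  mu     <- as.numeric(quantile(x, c(0.25, 0.75)))
  sigma  <- rep(sd(x), 2)
  loglik_old <- -Inf

  for (iter in seq_len(max_iter)) {
    # E-step: w[i, j] = lambda_j f(x_i; theta_j) / sum_k lambda_k f(x_i; theta_k)
    dens <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
                  lambda[2] * dnorm(x, mu[2], sigma[2]))
    w <- dens / rowSums(dens)

    # Log-likelihood at the parameters used in this E-step
    loglik <- sum(log(rowSums(dens)))

    # M-step: weighted maximum-likelihood updates
    n_k    <- colSums(w)
    lambda <- n_k / length(x)
    mu     <- colSums(w * x) / n_k
    sigma  <- sqrt(c(sum(w[, 1] * (x - mu[1])^2),
                     sum(w[, 2] * (x - mu[2])^2)) / n_k)

    # (2) Stop when the log-likelihood no longer changes very much
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  # (3) Return parameter estimates and cluster probabilities
  list(lambda = lambda, mu = mu, sigma = sigma, loglik = loglik, weights = w)
}

# Usage on simulated data with known components (made-up values)
set.seed(7)
x_sim <- c(rnorm(600, mean = 0, sd = 1), rnorm(400, mean = 6, sd = 1.5))
fit <- em_two_gaussians(x_sim)
fit$lambda; fit$mu; fit$sigma
```

Each row of <code>fit$weights</code> holds the estimated conditional probabilities $p(Z=j | X=x_{i}; \theta)$ for one observation, so the rows sum to 1. Like any EM fit, the result can depend on the starting values.<br />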
<br />
'''3) Computational examples in R:''' Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$lambda[component.number] *<br />
dnorm(x,mean=mixture$mu[component.number],<br />
sd=mixture$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component){<br />
lambda[component]*pnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.<br />
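<br />
A calibration plot can also be summarized by a single number: the largest vertical gap between the model CDF and the empirical CDF over the distinct data values (a Kolmogorov-Smirnov-style distance). The sketch below is an illustration on simulated data, not part of the original analysis; max.cdf.gap is a hypothetical helper name, and the exponential data are an assumption chosen so that the correct model is known.<br />

```r
# Largest vertical gap between a model CDF and the empirical CDF.
# max.cdf.gap is a hypothetical helper name used only for this illustration.
max.cdf.gap <- function(x, model.cdf) {
  distinct.x <- sort(unique(x))
  max(abs(model.cdf(distinct.x) - ecdf(x)(distinct.x)))
}

# Simulated skewed data for which the true model is known
set.seed(42)
x <- rexp(5000, rate = 1/30)
# Well-calibrated model: the true exponential CDF
gap.right <- max.cdf.gap(x, function(q) pexp(q, rate = 1/30))
# Poorly calibrated model: a single Gaussian matched by mean and sd
gap.wrong <- max.cdf.gap(x, function(q) pnorm(q, mean(x), sd(x)))
c(gap.right = gap.right, gap.wrong = gap.wrong)
```

For a mixture fit, the same helper could be called with function(q) pnormmix(q, mixture=snoq.k2) as the model CDF. A small gap corresponds to points lying close to the main diagonal of the calibration plot.<br />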
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu <- mean(snoq[train]) # MLE of mean<br />
sigma <- sd(snoq[train])*sqrt((n-1)/n) # MLE of standard deviation<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models of different sizes, fit to a random half of the data for training and evaluated on the other half for testing.<br />
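<br />
The held-out log-likelihoods can be compared directly to pick a number of components. The sketch below copies the values printed in the output above, so it describes that particular random split only. A different split would give different numbers, and treating gaps of about one log-likelihood unit as within split-to-split noise is an assumption rather than a tested claim.<br />

```r
# Held-out log-likelihoods copied from the printed output (k = 1, ..., 10)
loglikes <- c(-17647.93, -16336.12, -15796.02, -15554.33, -15398.04,
              -15337.47, -15297.61, -15285.60, -15286.75, -15288.88)
# Number of components with the highest held-out log-likelihood
best.k <- which.max(loglikes)
# How far each candidate falls short of the best one
gap.to.best <- max(loglikes) - loglikes
best.k
round(gap.to.best[8:10], 2)
```

For this split the maximum is at eight components, with nine and ten components trailing by only a few log-likelihood units, so the curve is essentially flat from eight onward. The analysis that follows uses nine components.<br />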
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
With the nine-component Gaussian mixture.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if you've already run the Figure 3 code<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$mu),ylim=range(snoq.k9$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
<br />
points(x=snoq.k9$mu,y=snoq.k9$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the nine-component Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
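<br />
The probability a Gaussian mixture assigns to negative precipitation can be computed exactly, since the mixture CDF at zero is the weighted sum of the component CDFs at zero. The sketch below applies this to the two-component fit, using the parameter values printed in its summary earlier (the nine-component estimates are not listed on this page, so they are not used here).<br />

```r
# Parameters copied from the summary(snoq.k2) output shown earlier
lambda <- c(0.557524, 0.442476)
mu     <- c(10.266172, 60.009468)
sigma  <- c(8.510244, 44.997240)
# P(X < 0) under the mixture is the weighted sum of component CDFs at zero
p.negative <- sum(lambda * pnorm(0, mean = mu, sd = sigma))
p.negative
```

For the two-component fit this comes to roughly 0.1, a substantial amount of probability on impossible values, which is one more indication that two Gaussians describe these data poorly. With a fitted mixtools object, the same quantity could be obtained as pnormmix(0, mixture=snoq.k9).<br />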
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# The histograms below are automatically produced as a side-effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
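<br />
The same contour plot can be produced without the double loop, and without mvtnorm, by writing out the bivariate Gaussian density and evaluating it over the grid with outer(). This is a sketch under the same mean and covariance as above; dbvnorm.grid is a hypothetical helper written for this illustration, not a function from any package.<br />

```r
# Bivariate Gaussian density on a grid using base R only (no mvtnorm).
# dbvnorm.grid is a hypothetical helper written for this illustration.
dbvnorm.grid <- function(x.points, y.points, mu, Sigma) {
  Sigma.inv <- solve(Sigma)
  norm.const <- 1 / (2 * pi * sqrt(det(Sigma)))
  dens <- function(x, y) {
    dx <- x - mu[1]
    dy <- y - mu[2]
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu), written out element-wise
    q <- Sigma.inv[1, 1] * dx^2 + 2 * Sigma.inv[1, 2] * dx * dy +
      Sigma.inv[2, 2] * dy^2
    norm.const * exp(-q / 2)
  }
  # outer() evaluates dens at every (x, y) pair of the grid in one call
  outer(x.points, y.points, dens)
}

x.points <- seq(-3, 3, length.out = 100)
y.points <- x.points
z <- dbvnorm.grid(x.points, y.points, mu = c(1, 1),
                  Sigma = matrix(c(2, 1, 1, 1), nrow = 2))
contour(x.points, y.points, z)
```

The density peaks at the mean with height $1/(2\pi\sqrt{|\Sigma|})$, which equals $1/(2\pi)$ here because the determinant of the covariance matrix is 1.<br />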
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture This article] demonstrates the use of mixture modeling and expectation maximization (EM) for the problem of 2D point cluster segmentation. It illustrates ways to use EM and mixture modeling to obtain cluster classifications of points in 2D, using the SOCR Charts activity and the SOCR Modeler, with specific examples.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes. It allows computing probabilities and critical values for these mixture distributions and enables inference on such complicated processes.<br />
<br />
3) [http://www.sciencedirect.com/science/article/pii/S0167947301000469 This article] presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4) [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899 This article] is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
===Software===<br />
[http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf mixtools vignette]<br />
<br />
[http://www.stat.washington.edu/mclust/ mclust]<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R R code for examples in Chapter 20 (see references)]<br />
<br />
===Problems===<br />
<br />
1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if given the actual Gaussian parameters as the initial guess, and how does this change if it is given other initial parameters?<br />
<br />
4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
===References===<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf Chapter 20. Mixture Models]<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14419SMHS MixtureModeling2014-10-16T20:41:27Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
===Overview=== <br />
A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that the observed data set identify the subpopulation to which each individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of mixture models, their main types, the estimation of their parameters, and their application in studies. The implementation of mixture modeling in R packages is also discussed in the attached articles.<br />
<br />
===Motivation===<br />
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the subpopulations. We use mixture models to make statistical inferences about the properties of the subpopulations given only observations on the pooled population, without subpopulation identity information. Mixture models are not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
===Theory===<br />
'''1) Structure of mixture model:''' a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, with the $\lambda_{k}$ being the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for $f_{k}$ generate different parametric mixture models, such as Gaussian, Binomial or Poisson mixtures. The components may all be Gaussian with different parameters, or all Poisson with different means. The model can then be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If $K=n$, the number of observations, we get back to kernel density estimation. <br />
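<br />
A quick numerical check of this definition: because the mixing weights sum to one, the mixture density integrates to one, and its mean is the weighted average of the component means. The sketch below uses a two-component Gaussian mixture; all parameter values are illustrative assumptions, not estimates from any data set.<br />

```r
# Two-component Gaussian mixture density; all parameter values are illustrative
lambda <- c(0.3, 0.7)
mu     <- c(-2, 3)
sigma  <- c(1, 2)
dmix <- function(x) {
  lambda[1] * dnorm(x, mu[1], sigma[1]) + lambda[2] * dnorm(x, mu[2], sigma[2])
}
# Total probability mass and mean of the mixture, by numerical integration
total.mass <- integrate(dmix, -Inf, Inf)$value
mix.mean <- integrate(function(x) x * dmix(x), -Inf, Inf)$value
c(total.mass = total.mass, mix.mean = mix.mean,
  weighted.mean = sum(lambda * mu))
```

The numerically integrated mean should agree with sum(lambda * mu), which is 1.5 for these values, up to integration error.<br />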
<br />
'''2) Estimating parametric mixture models:''' assume independent samples, so that the joint density (the likelihood) of the observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$. <br />
<br />
<br />
We take the logarithm to turn multiplication into addition: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$, <br />
<br />
<br />
we take the derivative of this with respect to one parameter, say $\theta_{j}$: $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$. <br />
<br />
<br />
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is like doing a weighted likelihood maximization, where the weight of $x_{i}$ depends on the cluster, being $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$ <br />
<br />
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K} \lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$<br />
*EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
*Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of parameters, or it could even be a non-parametric density estimator.<br />
<br />
'''3) Computational examples in R:''' Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$lambda[component.number] *<br />
dnorm(x,mean=mixture$mu[component.number],<br />
sd=mixture$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component){<br />
lambda[component]*pnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.<br />
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu <- mean(snoq[train]) # MLE of mean<br />
sigma <- sd(snoq[train])*sqrt((n-1)/n) # MLE of standard deviation<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models of different sizes, fit to a random half of the data for training and evaluated on the other half for testing.<br />
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
With the nine-component Gaussian mixture.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if you've already run the Figure 3 code<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$mu),ylim=range(snoq.k9$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
<br />
points(x=snoq.k9$mu,y=snoq.k9$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the nine-component Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# The histograms below are automatically produced as a side-effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture This article] demonstrates the use of mixture modeling and expectation maximization (EM) for the problem of 2D point cluster segmentation. It illustrates ways to use EM and mixture modeling to obtain cluster classifications of points in 2D, using the SOCR Charts activity and the SOCR Modeler, with specific examples.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes. It allows computing probabilities and critical values for these mixture distributions and enables inference on such complicated processes.<br />
<br />
3) [http://www.sciencedirect.com/science/article/pii/S0167947301000469 This article] presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4) [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899 This article] is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
===Software===<br />
[http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf Mixtool Vignettes]<br />
<br />
[http://www.stat.washington.edu/mclust/ mclust]<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R R code for examples in Chapter 20 (see references)]<br />
<br />
===Problems===<br />
<br />
1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if you give it the actual Gaussian parameters as the initial guess, and how does this change if it is given other initial parameters?<br />
<br />
4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
===References===<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf Chapter 20. Mixture Models]<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14418SMHS MixtureModeling2014-10-16T20:39:17Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
===Overview=== <br />
A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that the observed data set identify the subpopulation to which each individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of a mixture model, various types of mixture models, the estimation of their parameters, and applications of mixture models in studies. The implementation of mixture modeling in R packages is also discussed in the attached articles.<br />
<br />
===Motivation===<br />
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the subpopulations. We use mixture models to make statistical inferences about the properties of the subpopulations given only observations on the pooled population, without subpopulation identity information. Mixture models are not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
===Theory===<br />
'''1) Structure of mixture model:''' a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, with the $\lambda_{k}$ being the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for $f_{k}$ generate different parametric mixture models, such as Gaussian, Binomial, or Poisson mixtures. The components may all be Gaussian with different parameters, or all Poisson with different means. The model can be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If $K=n$, the number of observations, we are back to kernel density estimation. <br />
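<br />
As a minimal sketch of this structure, the R code below evaluates a two-component Gaussian mixture density and draws samples by first drawing the latent label $Z$ and then $X|Z$. The weights, means, and standard deviations are hypothetical values chosen only for illustration, and the helper name dmix is not part of any package.<br />

```r
# Hypothetical mixture parameters, chosen only for illustration
lambda <- c(0.3, 0.7)   # mixing weights, sum to 1
mu     <- c(0, 4)       # component means
sigma  <- c(1, 2)       # component standard deviations

# Mixture density f(x) = sum_k lambda_k * dnorm(x; mu_k, sigma_k)
dmix <- function(x) {
  comps <- sapply(seq_along(lambda),
                  function(k) lambda[k] * dnorm(x, mu[k], sigma[k]))
  # sapply returns a matrix for vector x and a plain vector for scalar x
  if (is.matrix(comps)) rowSums(comps) else sum(comps)
}

# A valid density should integrate to (approximately) 1
total.mass <- integrate(dmix, lower = -Inf, upper = Inf)$value

# Sampling through the latent variable: Z ~ Mult(lambda), then X | Z ~ f_Z
set.seed(1)
n <- 1000
z <- sample(seq_along(lambda), size = n, replace = TRUE, prob = lambda)
x <- rnorm(n, mean = mu[z], sd = sigma[z])
```

Comparing a histogram or kernel density estimate of x with dmix gives a quick visual check that the sampler and the density agree.<br />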
<br />
'''2) Estimating parametric mixture models:''' assume independent samples, so that the likelihood of the observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$. <br />
<br />
<br />
We take the logarithm to turn multiplication into addition: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$. <br />
<br />
<br />
Taking the derivative of this with respect to one parameter, say $\theta_{j}$, gives $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$. <br />
<br />
<br />
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of $x_{i}$ depends on the cluster, being $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$ <br />
<br />
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K} \lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$<br />
*EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
*Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of the parameters, or it could even be a non-parametric density estimator.<br />
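<br />
The EM steps listed above can be sketched in base R for a two-component Gaussian mixture. This is an illustrative sketch on simulated data, not the implementation used by mixtools; the simulated parameters, the starting values, the iteration cap, and the convergence tolerance are all assumptions made for the example.<br />

```r
set.seed(42)
# Simulated data from two hypothetical Gaussian components
x <- c(rnorm(300, mean = 0, sd = 1), rnorm(700, mean = 5, sd = 1.5))

# (1) Initial guesses for mixing weights and component parameters
lambda <- c(0.5, 0.5)
mu     <- c(min(x), max(x))
sigma  <- rep(sd(x), 2)

loglik.old <- -Inf
for (iter in 1:500) {
  # E-step: w[i, j] = lambda_j f(x_i; theta_j) / sum_k lambda_k f(x_i; theta_k)
  joint <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
                 lambda[2] * dnorm(x, mu[2], sigma[2]))
  w <- joint / rowSums(joint)

  # Log-likelihood at the parameters used in this E-step
  loglik <- sum(log(rowSums(joint)))

  # M-step: weighted maximum-likelihood estimates
  n.j    <- colSums(w)                       # effective component sizes
  lambda <- n.j / length(x)
  mu     <- colSums(w * x) / n.j
  sigma  <- sqrt(colSums(w * outer(x, mu, "-")^2) / n.j)

  # (2) Stop when the log-likelihood no longer changes much
  if (abs(loglik - loglik.old) < 1e-8) break
  loglik.old <- loglik
}

# (3) Final estimates; each row of w holds the cluster probabilities of one point
round(rbind(lambda, mu, sigma), 3)
```

Because EM only finds a local maximum of the likelihood, different starting values can lead to different estimates, so in practice it is common to rerun the algorithm from several initial guesses.<br />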
<br />
'''3) Computational examples in R:''' Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$lambda[component.number] *<br />
dnorm(x,mean=mixture$mu[component.number],<br />
sd=mixture$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component) {<br />
lambda[component]*pnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.<br />
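<br />
The same comparison can also be summarized numerically. The sketch below assumes simulated data from a known two-component mixture (all parameter values and the helper name pmix are hypothetical) and reports the largest vertical gap between the theoretical and empirical CDFs; a well-calibrated model should give a small gap.<br />

```r
set.seed(7)
# Hypothetical "true" mixture used to simulate data
lambda <- c(0.4, 0.6)
mu     <- c(0, 3)
sigma  <- c(1, 0.5)
z <- sample(1:2, size = 2000, replace = TRUE, prob = lambda)
x <- rnorm(2000, mean = mu[z], sd = sigma[z])

# Theoretical CDF of the mixture: weighted sum of component CDFs
pmix <- function(q) {
  lambda[1] * pnorm(q, mu[1], sigma[1]) + lambda[2] * pnorm(q, mu[2], sigma[2])
}

# Evaluate both CDFs at each distinct data value, as in the calibration plot
distinct.x <- sort(unique(x))
tcdfs <- pmix(distinct.x)
ecdfs <- ecdf(x)(distinct.x)

# Largest absolute deviation from the main diagonal of the calibration plot
max.gap <- max(abs(tcdfs - ecdfs))
```

Here the data really come from the model, so max.gap reflects sampling noise only; a fitted model that is badly miscalibrated would typically show a much larger value.<br />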
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu <- mean(snoq[train]) # MLE of mean<br />
sigma <- sd(snoq[train])*sqrt((length(train)-1)/length(train)) # MLE of standard deviation<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models of different sizes, fit to a random half of the data for training and evaluated on the other half of the data for testing.<br />
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
With the nine-component Gaussian mixture.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if you've already made them<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$mu),ylim=range(snoq.k9$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
<br />
points(x=snoq.k9$mu,y=snoq.k9$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the 9-mode Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# The histograms below are automatically produced as a side-effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp() (using the patch from [http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R bootcomp.R]). The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
<br />
===Applications===<br />
<br />
1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture This article] demonstrates an activity in which mixture modeling and expectation maximization (EM) are applied to the problem of 2D point cluster segmentation. It illustrates ways to use EM and mixture modeling to obtain cluster classification of points in 2D using the SOCR Charts activity and SOCR Modeler, with specific examples.<br />
<br />
2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes. It allows computing probabilities and critical values for these mixture distributions and enables inference on such complicated processes.<br />
<br />
3) [http://www.sciencedirect.com/science/article/pii/S0167947301000469 This article] presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4) [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899 This article] is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
===Software===<br />
[http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf Mixtool Vignettes]<br />
<br />
[http://www.stat.washington.edu/mclust/ mclust]<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R R code for examples in Lecture 20]<br />
<br />
===Problems===<br />
<br />
1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if you give it the actual Gaussian parameters as the initial guess, and how does this change if it is given other initial parameters?<br />
<br />
4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
===References===<br />
<br />
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]<br />
<br />
<br />
[http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf Chapter 20. Mixture Models]<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14417SMHS MixtureModeling2014-10-16T17:48:22Z<p>Clgalla: </p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
===Overview=== <br />
A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that the observed data set identify the subpopulation to which each individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of a mixture model, various types of mixture models, the estimation of their parameters, and applications of mixture models in studies. The implementation of mixture modeling in R packages is also discussed in the attached articles.<br />
<br />
===Motivation===<br />
A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the subpopulations. We use mixture models to make statistical inferences about the properties of the subpopulations given only observations on the pooled population, without subpopulation identity information. Mixture models are not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
===Theory===<br />
'''1) Structure of mixture model:''' a distribution $f$ is a mixture of $K$ component distributions $f_{1}, f_{2}, \cdots, f_{K}$ if $f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x)$, with the $\lambda_{k}$ being the mixing weights, $\lambda_{k} > 0, \sum_{k}\lambda_{k} = 1$. Equivalently, $Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}$, where the discrete random variable $Z$ indicates which component $X$ is drawn from. Different parametric families for $f_{k}$ generate different parametric mixture models, such as Gaussian, Binomial, or Poisson mixtures. The components may all be Gaussian with different parameters, or all Poisson with different means. The model can be expressed as $f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k})$, and the parameter vector of the mixture model is $\theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K})$. When $K=1$, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If $K=n$, the number of observations, we are back to kernel density estimation. <br />
<br />
'''2) Estimating parametric mixture models:''' assume independent samples, so that the likelihood of the observations $x_{1}, x_{2}, \cdots, x_{n}$ is $\prod_{i=1}^{n}f(x_{i};\theta)$. <br />
<br />
<br />
We take the logarithm to turn multiplication into addition: $l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k})$. <br />
<br />
<br />
Taking the derivative of this with respect to one parameter, say $\theta_{j}$, gives $\frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}$. <br />
<br />
<br />
If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be $\sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}.$ Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of $x_{i}$ depends on the cluster, being $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.$ <br />
<br />
Remember that $\lambda_{j}$ is the probability that the hidden class variable $Z$ is $j$, so the numerator in the weights is the joint probability of getting $Z=j$ and $X=x_{i}$. The denominator is the marginal probability of getting $X=x_{i}$, so the ratio is the conditional probability of $Z=j$ given $X=x_{i}$: $w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K} \lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).$<br />
*EM algorithm: (1) start with guesses about the mixture components $\theta_{1}, \cdots, \theta_{K}$ and the mixing weights $\lambda_{1}, \cdots, \lambda_{K}$; (2) until nothing changes very much: using the current parameter guesses, calculate the weights $w_{ij}$ (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
*Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of the parameters, or it could even be a non-parametric density estimator.<br />
<br />
'''3) Computational examples in R:''' Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$lambda[component.number] *<br />
dnorm(x,mean=mixture$mu[component.number],<br />
sd=mixture$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component) {<br />
lambda[component]*pnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have $\leq x$ precipitation on the horizontal axis, versus the actual fraction of days $\leq x$.<br />
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu<-mean(snoq[train]) # MLE of mean<br />
sigma <- sd(snoq[train])*sqrt((n-1)/n) # MLE of standard deviation<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models with different numbers of components, fit to a random half of the data (training) and evaluated on the other half (testing).<br />
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
The histogram and kernel density estimate as before, now with the components of the nine-component Gaussian mixture overlaid.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if you've already run the Figure 3 code<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$mu),ylim=range(snoq.k9$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
points(x=snoq.k9$mu,y=snoq.k9$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the 9-mode Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",<br />
maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# The histograms are produced automatically as a side effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
<br />
4) Applications<br />
<br />
4.1) This article (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture) demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. It illustrates how to use EM and mixture modeling to classify points in 2D into clusters, using the SOCR charts activity and the SOCR modeler, with specific examples.<br />
<br />
4.2) This article (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1) presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes; it allows computing probabilities and critical values for these mixture distributions and enables inference on such complicated processes.<br />
<br />
4.3) This article (http://www.sciencedirect.com/science/article/pii/S0167947301000469) presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4.4) This article (http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899) is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
5) Software <br />
http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf <br />
http://www.stat.washington.edu/mclust/ <br />
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R <br />
<br />
6) Problems<br />
<br />
6.1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
6.2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
6.3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if you give it the actual Gaussian parameters as the initial guess, and how does this change if it is given other initial parameters?<br />
<br />
6.4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
7) References<br />
http://mirlyn.lib.umich.edu/Record/004199238 <br />
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf <br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling&diff=14416SMHS MixtureModeling2014-10-16T17:09:05Z<p>Clgalla: /* Scientific Methods for Health Sciences - Mixture Modeling */</p>
<hr />
<div>==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==<br />
<br />
<br />
1) Overview: A mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that the observed data identify the sub-population to which each individual observation belongs. In this section, we present a general introduction to mixture modeling: the structure of a mixture model, various types of mixture models, the estimation of their parameters, and applications of mixture models in studies. The implementation of mixture modeling in R is also discussed in the attached articles.<br />
<br />
2) Motivation: A mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions concern deriving the properties of the overall population from those of the sub-populations. Mixture models work in the opposite direction: we use them to make statistical inference about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?<br />
<br />
3) Theory<br />
<br />
3.1) Structure of mixture model: a distribution f is a mixture of K component distributions f_{1}, f_{2}, \cdots, f_{K} if f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x), with the \lambda_{k} being the mixing weights, \lambda_{k} > 0, \sum_{k}\lambda_{k} = 1. Equivalently, Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}) and X|Z \sim f_{Z}, where the discrete random variable Z indicates which component X is drawn from. Different parametric families for f_{k} generate different parametric mixture models, such as Gaussian, Binomial or Poisson mixtures: the components may all be Gaussian with different parameters, or all Poisson with different means. The model can then be expressed as f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k}), and the parameter vector of the mixture model is \theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K}). When K=1, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If K=n, the number of observations, we are back to kernel density estimation.<br />
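The definition f(x) = \sum_{k} \lambda_{k}f_{k}(x) can be checked numerically. The following is a minimal sketch in base R for a two-component Gaussian mixture; the weights, means and standard deviations are illustrative assumptions, not estimates from any data set, and dmix is a hypothetical helper name.<br />

```r
# Density of a finite mixture: f(x) = sum_k lambda_k * f_k(x).
# Illustrative values only; they are not estimated from any data set.
lambda <- c(0.3, 0.7)   # mixing weights, positive and summing to 1
mu     <- c(0, 4)       # component means
sigma  <- c(1, 2)       # component standard deviations

dmix <- function(x) {
  # One weighted component density per column, one row per element of x
  parts <- sapply(seq_along(lambda), function(k) {
    lambda[k] * dnorm(x, mean = mu[k], sd = sigma[k])
  })
  # matrix() keeps the row/column layout even when x has length 1
  rowSums(matrix(parts, nrow = length(x)))
}

# A valid density integrates to 1
total <- integrate(dmix, lower = -Inf, upper = Inf)$value
```

Because the weights sum to one and each component is itself a density, total should be numerically indistinguishable from 1.<br />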
<br />
3.2) Estimating parametric mixture models: assume independent samples, so that the joint density of the observations x_{1}, x_{2}, \cdots, x_{n} is \prod_{i=1}^{n}f(x_{i};\theta). Taking the logarithm turns multiplication into addition: l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k}). Taking the derivative with respect to one parameter, say \theta_{j}, gives \frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}. If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be \sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}. Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of x_{i} depends on the cluster j: w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.<br />
<br />
Remember that \lambda_{j} is the probability that the hidden class variable Z is j, so the numerator in the weights is the joint probability of getting Z=j and X=x_{i}. The denominator is the marginal probability of getting X=x_{i}, so the ratio is the conditional probability of Z=j given X=x_{i}: w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).<br />
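To make the interpretation of w_{ij} as a conditional probability concrete, here is a small sketch in base R that computes the weights for three observations under a two-component Gaussian mixture. All parameter values and data points are illustrative assumptions.<br />

```r
# Posterior class probabilities w_ij = P(Z = j | X = x_i) for a Gaussian mixture.
# Parameter values and data points are illustrative, not fitted to any data.
lambda <- c(0.5, 0.5)
mu     <- c(0, 5)
sigma  <- c(1, 1)
x      <- c(-1, 2.5, 6)

# Numerators: joint density of Z = j and X = x_i
# (rows: observations, columns: components)
joint <- sapply(seq_along(lambda),
                function(j) lambda[j] * dnorm(x, mean = mu[j], sd = sigma[j]))
# Denominators: marginal density of X = x_i, i.e. the row sums
w <- joint / rowSums(joint)
```

Each row of w sums to one. The point x = 2.5 lies exactly halfway between two equally weighted components with equal spread, so it is split evenly between them, while the points -1 and 6 are assigned almost entirely to the first and second component respectively.<br />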
EM algorithm: (1) start with guesses about the mixture components \theta_{1}, \cdots, \theta_{K} and the mixing weights \lambda_{1}, \cdots, \lambda_{K}; (2) until nothing changes very much: using the current parameter guesses, calculate the weights w_{ij} (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities. <br />
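The three steps above can be sketched for univariate Gaussian components as follows. This is a minimal illustration in base R under stated assumptions, not the implementation used by mixtools: the function name em.normalmix, the stopping rule (change in log-likelihood below tol), and the simulated data are all hypothetical choices made for this example.<br />

```r
# Minimal EM for a K-component univariate Gaussian mixture (illustrative sketch).
em.normalmix <- function(x, lambda, mu, sigma, maxit = 500, tol = 1e-8) {
  loglik.old <- -Inf
  for (iter in 1:maxit) {
    # E-step: weights w[i, j] = P(Z = j | X = x[i]) under the current parameters
    joint <- sapply(seq_along(lambda),
                    function(j) lambda[j] * dnorm(x, mu[j], sigma[j]))
    marginal <- rowSums(joint)
    w <- joint / marginal
    # Log-likelihood at the current parameters, used as the stopping rule
    loglik <- sum(log(marginal))
    if (loglik - loglik.old < tol) break
    loglik.old <- loglik
    # M-step: weighted maximum-likelihood updates for Gaussian components
    n.j    <- colSums(w)
    lambda <- n.j / length(x)
    mu     <- colSums(w * x) / n.j
    sigma  <- sqrt(colSums(w * outer(x, mu, "-")^2) / n.j)
  }
  list(lambda = lambda, mu = mu, sigma = sigma, loglik = loglik, posterior = w)
}

# Simulated data from a known two-component mixture (assumed values)
set.seed(1)
z <- sample(1:2, size = 2000, replace = TRUE, prob = c(0.4, 0.6))
x <- rnorm(2000, mean = c(0, 6)[z], sd = c(1, 1.5)[z])
fit <- em.normalmix(x, lambda = c(0.5, 0.5), mu = c(-1, 4), sigma = c(2, 2))
```

With well-separated components and a reasonable starting guess, the estimates in fit should land close to the simulating values; poorly chosen starting values can lead EM to a different local maximum.<br />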
Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of the parameters, or it could even be a non-parametric density estimator.<br />
<br />
3.3) Computational examples in R: Snoqualmie Falls Revisited (analyzed using the mixtools package in R)<br />
RCODE:<br />
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)<br />
snoqualmie.vector <- na.omit(unlist(snoqualmie))<br />
snoq <- snoqualmie.vector[snoqualmie.vector > 0]<br />
summary(snoq)<br />
Min. 1st Qu. Median Mean 3rd Qu. Max. <br />
1.00 6.00 19.00 32.28 44.00 463.00<br />
<br />
# Two-component Gaussian mixture<br />
library(mixtools)<br />
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)<br />
summary(snoq.k2)<br />
summary of normalmixEM object:<br />
comp 1 comp 2<br />
lambda 0.557524 0.442476<br />
mu 10.266172 60.009468<br />
sigma 8.510244 44.997240<br />
loglik at estimate: -32681.21<br />
<br />
# Function to add Gaussian mixture components, vertically scaled, to the<br />
# current plot<br />
# Presumes the mixture object has the structure used by mixtools<br />
plot.normal.components <- function(mixture,component.number,...) {<br />
curve(mixture$lambda[component.number] *<br />
dnorm(x,mean=mixture$mu[component.number],<br />
sd=mixture$sigma[component.number]), add=TRUE, ...)<br />
}<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig1.png|500px]]<br />
<br />
Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
<br />
sapply(1:2,plot.normal.components,mixture=snoq.k2)<br />
<br />
<br />
<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig2.png|500px]]<br />
<br />
As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.<br />
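Precipitation cannot be negative, but Gaussian components extend over the whole real line. A quick check, sketched here in base R with the estimates copied from the summary(snoq.k2) output shown earlier, computes how much probability the two-component fit places below zero. The variable name p.negative is an assumption introduced for this example.<br />

```r
# Estimates copied from the summary(snoq.k2) output shown earlier
lambda <- c(0.557524, 0.442476)
mu     <- c(10.266172, 60.009468)
sigma  <- c(8.510244, 44.997240)

# P(X < 0) under the mixture: weighted sum of the component CDFs at zero
p.negative <- sum(lambda * pnorm(0, mean = mu, sd = sigma))
```

For these estimates p.negative comes out at roughly a tenth, which is one sign that two Gaussian components do not describe the data well.<br />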
<br />
# Function to calculate the cumulative distribution function of a Gaussian<br />
# mixture model<br />
# Presumes the mixture object has the structure used by mixtools<br />
# Doesn't implement some of the usual options for CDF functions in R, like<br />
# returning the log probability, or the upper tail probability<br />
pnormmix <- function(x,mixture) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
pnorm.from.mix <- function(x,component) {<br />
lambda[component]*pnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
pnorms <- sapply(1:k,pnorm.from.mix,x=x)<br />
return(rowSums(pnorms))<br />
}<br />
<br />
#### Figure 3<br />
# Distinct values in the data<br />
distinct.snoq <- sort(unique(snoq))<br />
# Theoretical CDF evaluated at each distinct value<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)<br />
# Empirical CDF evaluated at each distinct value<br />
# ecdf(snoq) returns an object which is a _function_, suitable for application<br />
# to new vectors<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
# Plot them against each other<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
# Main diagonal for visual reference<br />
abline(0,1)<br />
<br />
<br />
[[Image:SMHS_MixtureModel_Fig3.png|500px]]<br />
<br />
Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have \leq x precipitation on the horizontal axis, versus the actual fraction of days \leq x.<br />
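A calibration plot can be summarized by the largest vertical distance between the plotted points and the diagonal. The sketch below, in base R, does this for data simulated from a known two-component mixture, so that the theoretical CDF is the true one; the parameters, sample size and variable names are assumptions made for illustration and are unrelated to the Snoqualmie Falls fit.<br />

```r
set.seed(42)
# Simulate from a known two-component Gaussian mixture (illustrative parameters)
lambda <- c(0.6, 0.4)
mu     <- c(10, 60)
sigma  <- c(8, 30)
z <- sample(1:2, size = 5000, replace = TRUE, prob = lambda)
x <- rnorm(5000, mean = mu[z], sd = sigma[z])

# Theoretical mixture CDF and empirical CDF at the distinct data values
distinct.x <- sort(unique(x))
tcdf.x <- sapply(distinct.x, function(v) sum(lambda * pnorm(v, mu, sigma)))
ecdf.x <- ecdf(x)(distinct.x)

# Largest vertical distance from the diagonal of the calibration plot
max.gap <- max(abs(tcdf.x - ecdf.x))
```

When the model is correct, as it is by construction here, max.gap is small and shrinks as the sample grows; a large gap for a fitted model indicates poor calibration. This quantity is closely related to the Kolmogorov-Smirnov statistic.<br />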
<br />
# Probability density function for a Gaussian mixture<br />
# Presumes the mixture object has the structure used by mixtools<br />
dnormalmix <- function(x,mixture,log=FALSE) {<br />
lambda <- mixture$lambda<br />
k <- length(lambda)<br />
# Calculate share of likelihood for all data for one component<br />
like.component <- function(x,component) {<br />
lambda[component]*dnorm(x,mean=mixture$mu[component],<br />
sd=mixture$sigma[component])<br />
}<br />
# Create array with likelihood shares from all components over all data<br />
likes <- sapply(1:k,like.component,x=x)<br />
# Add up contributions from components<br />
d <- rowSums(likes)<br />
if (log) {<br />
d <- log(d)<br />
}<br />
return(d)<br />
}<br />
<br />
# Log likelihood function for a Gaussian mixture, potentially on new data<br />
loglike.normalmix <- function(x,mixture) {<br />
loglike <- dnormalmix(x,mixture,log=TRUE)<br />
return(sum(loglike))<br />
}<br />
loglike.normalmix(snoq,mixture=snoq.k2)<br />
[1] -32681.21<br />
# Evaluate various numbers of Gaussian components by data-set splitting<br />
# (i.e., very crude cross-validation)<br />
n <- length(snoq)<br />
data.points <- 1:n<br />
data.points <- sample(data.points) # Permute randomly<br />
train <- data.points[1:floor(n/2)] # First random half is training<br />
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing<br />
candidate.component.numbers <- 2:10<br />
loglikes <- vector(length=1+length(candidate.component.numbers))<br />
# k=1 needs special handling<br />
mu<-mean(snoq[train]) # MLE of mean<br />
sigma <- sd(snoq[train])*sqrt((n-1)/n) # MLE of standard deviation<br />
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))<br />
for (k in candidate.component.numbers) {<br />
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)<br />
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)<br />
}<br />
loglikes<br />
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61<br />
[8] -15285.60 -15286.75 -15288.88<br />
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")<br />
<br />
[[Image:SMHS_MixtureModel_Fig4.png|500px]]<br />
<br />
Log-likelihoods of mixture models with different numbers of components, fit to a random half of the data (training) and evaluated on the other half (testing).<br />
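Reading the best number of components off the held-out log-likelihoods takes one line. The sketch below copies the values printed in the loglikes output above; best.k and gap are hypothetical names introduced here.<br />

```r
# Held-out log-likelihoods copied from the output above, for k = 1, ..., 10
loglikes <- c(-17647.93, -16336.12, -15796.02, -15554.33, -15398.04,
              -15337.47, -15297.61, -15285.60, -15286.75, -15288.88)

# Number of components with the highest log-likelihood on the testing half
best.k <- which.max(loglikes)

# Shortfall of every candidate relative to the best one
gap <- max(loglikes) - loglikes
```

For the split shown, the maximum is at k = 8, with k = 9 only about 1.15 log-likelihood units behind. Because the training and testing halves are chosen at random, a different split can move the maximum between such closely matched candidates.<br />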
<br />
snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)<br />
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,<br />
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")<br />
lines(density(snoq),lty=2)<br />
sapply(1:9,plot.normal.components,mixture=snoq.k9)<br />
<br />
[[Image:SMHS_MixtureModel_Fig5.png|500px]]<br />
<br />
The histogram and kernel density estimate as before, now with the components of the nine-component Gaussian mixture overlaid.<br />
<br />
# Assignments for distinct.snoq and ecdfs are redundant if you've already run the Figure 3 code<br />
distinct.snoq <- sort(unique(snoq))<br />
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)<br />
ecdfs <- ecdf(snoq)(distinct.snoq)<br />
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),<br />
ylim=c(0,1))<br />
abline(0,1)<br />
<br />
[[Image:SMHS_MixtureModel_Fig6.png|500px]]<br />
<br />
Calibration plot for the nine-component Gaussian mixture.<br />
<br />
plot(0,xlim=range(snoq.k9$mu),ylim=range(snoq.k9$sigma),type="n",<br />
xlab="Component mean", ylab="Component standard deviation")<br />
points(x=snoq.k9$mu,y=snoq.k9$sigma,pch=as.character(1:9),<br />
cex=sqrt(0.5+5*snoq.k9$lambda))<br />
<br />
[[Image:SMHS_MixtureModel_Fig7.png|500px]]<br />
<br />
Characteristics of the components of the 9-mode Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.<br />
<br />
plot(density(snoq),lty=2,ylim=c(0,0.04),<br />
main=paste("Comparison of density estimates\n",<br />
"Kernel vs. Gaussian mixture"),<br />
xlab="Precipitation (1/100 inch)")<br />
curve(dnormalmix(x,snoq.k9),add=TRUE)<br />
<br />
[[Image:SMHS_MixtureModel_Fig8.png|500px]]<br />
<br />
Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.<br />
<br />
# Do the classes of the Gaussian mixture make sense as annual weather patterns?<br />
# Most probable class for each day:<br />
day.classes <- apply(snoq.k9$posterior,1,which.max)<br />
# Make a copy of the original, structured data set to edit<br />
snoqualmie.classes <- snoqualmie<br />
# Figure out which days had precipitation<br />
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))<br />
# Replace actual precipitation amounts with classes<br />
snoqualmie.classes[wet.days] <- day.classes<br />
# Problem: the number of the classes doesn't correspond to e.g. amount of<br />
# precipitation expected. Solution: label by expected precipitation, not by<br />
# class number.<br />
snoqualmie.classes[wet.days] <- snoq.k9$mu[day.classes]<br />
<br />
plot(0,xlim=c(1,366),ylim=range(snoq.k9$mu),type="n",xaxt="n",<br />
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")<br />
axis(1,at=1+(0:11)*30)<br />
for (year in 1:nrow(snoqualmie.classes)) {<br />
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)<br />
}<br />
<br />
[[Image:SMHS_MixtureModel_Fig9.png|500px]]<br />
<br />
Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.<br />
<br />
# Next line is currently (5 April 2011) used to invoke a bug-patch kindly<br />
# provided by Dr. Derek Young; the patch will be incorporated in the next<br />
# update to mixtools, so should not be needed after April 2011<br />
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")<br />
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",<br />
maxit=400,epsilon=1e-2)<br />
# Running this takes about 5 minutes<br />
# The histograms are produced automatically as a side effect of running boot.comp()<br />
<br />
[[Image:SMHS_MixtureModel_Fig10.png|500px]]<br />
<br />
Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihood.<br />
<br />
library(mvtnorm)<br />
x.points <- seq(-3,3,length.out=100)<br />
y.points <- x.points<br />
z <- matrix(0,nrow=100,ncol=100)<br />
mu <- c(1,1)<br />
sigma <- matrix(c(2,1,1,1),nrow=2)<br />
for (i in 1:100) {<br />
for (j in 1:100) {<br />
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)<br />
}<br />
}<br />
contour(x.points,y.points,z)<br />
<br />
[[Image:SMHS_MixtureModel_Fig11.png|500px]]<br />
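The double loop over the grid can be replaced by a single vectorized computation. The following sketch uses only base R and evaluates the bivariate normal density directly from its formula via mahalanobis(); it is an alternative written for illustration, not the code inside mvtnorm, and the names grid, d2 and dens are assumptions.<br />

```r
x.points <- seq(-3, 3, length.out = 100)
y.points <- x.points
mu <- c(1, 1)
sigma <- matrix(c(2, 1, 1, 1), nrow = 2)

# All grid points as rows; expand.grid varies its first argument fastest
grid <- as.matrix(expand.grid(x = x.points, y = y.points))

# Bivariate normal density: exp(-d2 / 2) / (2 * pi * sqrt(det(sigma))),
# where d2 is the squared Mahalanobis distance from mu
d2 <- mahalanobis(grid, center = mu, cov = sigma)
dens <- exp(-d2 / 2) / (2 * pi * sqrt(det(sigma)))

# Fill column by column so that z[i, j] belongs to (x.points[i], y.points[j])
z <- matrix(dens, nrow = length(x.points))
contour(x.points, y.points, z)
```

The resulting matrix z has the same layout as the one filled by the nested loops, so it can be passed to contour() in the same way.<br />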
<br />
4) Applications<br />
<br />
4.1) This article (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture) demonstrates mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. It illustrates how to use EM and mixture modeling to classify points in 2D into clusters, using the SOCR charts activity and the SOCR modeler, with specific examples.<br />
<br />
4.2) This article (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1) presents the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes; it allows computing probabilities and critical values for these mixture distributions and enables inference on such complicated processes.<br />
<br />
4.3) This article (http://www.sciencedirect.com/science/article/pii/S0167947301000469) presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.<br />
<br />
4.4) This article (http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899) is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.<br />
<br />
5) Software <br />
http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf <br />
http://www.stat.washington.edu/mclust/ <br />
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R <br />
<br />
6) Problems<br />
<br />
6.1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.<br />
<br />
6.2) Work through the E-step and M-step for a mixture of two Poisson distributions.<br />
<br />
6.3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components if you give it the actual Gaussian parameters as the initial guess, and how does this change if it is given other initial parameters?<br />
<br />
6.4) Write a function to fit a mixture of exponential distributions using the EM algorithm. <br />
<br />
7) References<br />
http://mirlyn.lib.umich.edu/Record/004199238 <br />
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf <br />
<br />
<hr><br />
* SOCR Home page: http://www.socr.umich.edu<br />
<br />
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}</div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_MixtureModel_Fig11.png&diff=14415File:SMHS MixtureModel Fig11.png2014-10-16T17:00:09Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_MixtureModel_Fig10.png&diff=14414File:SMHS MixtureModel Fig10.png2014-10-16T16:59:55Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_MixtureModel_Fig9.png&diff=14413File:SMHS MixtureModel Fig9.png2014-10-16T16:59:41Z<p>Clgalla: </p>
<hr />
<div></div>Clgallahttps://wiki.socr.umich.edu/index.php?title=File:SMHS_MixtureModel_Fig8.png&diff=14412File:SMHS MixtureModel Fig8.png2014-10-16T16:59:24Z<p>Clgalla: </p>
<hr />
<div></div>Clgalla