==[[SMHS| Scientific Methods for Health Sciences]] - Model Fitting and Model Quality (KS-test) ==

===Overview===
The Kolmogorov-Smirnov test (K-S test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions. It can be used to compare one sample with a reference probability distribution (the one-sample K-S test) or to compare two samples (the two-sample K-S test), in which case it determines whether two datasets differ significantly. In this lecture, we present a general introduction to the K-S test and illustrate its application with examples. The implementation of the K-S test in the statistical package R is also discussed with specific examples.

===Motivation===
When we want to test whether two datasets are equal, the first idea that comes to mind is usually the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. First, the control and treatment groups may not differ in mean but only in some other way: for two datasets with the same mean and very different variation, the t-test cannot see the difference. Second, the treatment and control groups may be smallish datasets (say, fewer than 20 observations) that differ in mean, but a substantially non-normal distribution masks the difference: for two datasets drawn from lognormal distributions that differ substantially in mean, the t-test can also fail. In all of these situations, the K-S test is a useful alternative. How does the K-S test work?

===Theory===
'''1) K-S test:''' a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S statistic quantifies a distance between the empirical distribution function of the sample and the cdf of the reference distribution, or between the empirical distribution functions of two samples. The null hypothesis is that the sample is drawn from the reference distribution (one-sample case) or that the samples are drawn from the same distribution (two-sample case).
*The K-S test is sensitive to differences in both the location and the shape of the empirical cdfs of the samples and is widely used as a nonparametric method for comparing two samples.
*The empirical distribution function $F_{n}$ for $n$ i.i.d. observations $X_{i}$ is $F_{n}(x)=\frac{1}{n} \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals $1$ when $X_{i} \leq x$ and $0$ otherwise.
*The K-S statistic for a given cdf $F(x)$ is $D_{n}=\sup_{x}|F_{n}(x)-F(x)|,$ where $\sup_{x}$ is the supremum of the set of distances. By the Glivenko-Cantelli theorem, if the sample comes from the distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely as $n$ goes to infinity.
*Kolmogorov distribution: the distribution of the random variable $K=\sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The cdf of $K$ is given by $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} / (8x^{2})}.$ Under the null hypothesis that the sample comes from the hypothesized distribution $F(x)$, $\sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} \sup_{t} |B(F(t))|$ in distribution. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$
*The goodness-of-fit test or the K-S test is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.
*Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{i}$, the critical values determined in this way are invalid. Modifications to the test statistic and critical values have been proposed for this case.
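The definitions above can be checked numerically. The following is a minimal sketch in R, assuming a simulated standard normal sample (the seed, the sample size and the helper name <code>kolm.cdf</code> are illustrative choices, not part of the lecture). It computes $D_{n}$ directly from the empirical distribution function, compares it with the statistic reported by <code>ks.test</code>, and recovers the critical value $K_{\alpha}$ for $\alpha=0.05$ from the series for the Kolmogorov cdf.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x  <- rnorm(50)                   # assumed sample; H0: X ~ N(0, 1)
n  <- length(x)
Fx <- pnorm(sort(x))              # reference cdf at the ordered observations

# The supremum is attained at a jump of F_n, so check both sides of each step
D.n <- max(seq_len(n) / n - Fx, Fx - (seq_len(n) - 1) / n)
D.R <- unname(ks.test(x, "pnorm")$statistic)

# Kolmogorov cdf from the alternating series, truncated at 'terms' terms
kolm.cdf <- function(x, terms = 100) {
  k <- seq_len(terms)
  1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
}

# K_alpha solves Pr(K <= K_alpha) = 1 - alpha
K.05 <- uniroot(function(q) kolm.cdf(q) - 0.95, c(0.5, 3), tol = 1e-8)$root

c(D.hand = D.n, D.ks.test = D.R, K.05 = K.05)
sqrt(n) * D.n > K.05              # TRUE would reject H0 at the 5% level
```

The hand-computed statistic and the one from <code>ks.test</code> should agree, and $K_{0.05}$ comes out close to $1.36$.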

'''2) Two-sample K-S test:''' tests whether two underlying one-dimensional probability distributions differ. The K-S statistic is $D_{n,{n}'}=\sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second sample respectively. The null hypothesis is rejected at level $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|$\alpha$|| 0.1|| 0.05|| 0.025|| 0.01|| 0.005 ||0.001
|-
|c($\alpha$)|| 1.22|| 1.36|| 1.48|| 1.63|| 1.73|| 1.95
|}
</center>
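The tabulated constants agree with the large-sample approximation $c(\alpha)=\sqrt{-\frac{1}{2}\ln(\alpha/2)}$, obtained from the leading term of the Kolmogorov series, $2e^{-2c^{2}}=\alpha$. This closed form is not stated in the lecture and is used here as an assumption. A short R sketch that reproduces the table and turns it into critical values for two samples of assumed sizes $n={n}'=20$:

```r
alpha   <- c(0.1, 0.05, 0.025, 0.01, 0.005, 0.001)
c.alpha <- sqrt(-0.5 * log(alpha / 2))   # solves 2 * exp(-2 * c^2) = alpha
round(c.alpha, 2)                        # compare with the table row c(alpha)

n1 <- 20; n2 <- 20                       # assumed sample sizes
D.crit <- c.alpha * sqrt((n1 + n2) / (n1 * n2))
names(D.crit) <- alpha
round(D.crit, 3)                         # reject H0 at level alpha if D exceeds this
```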

Note: the two-sample test checks whether the two data samples come from the same distribution; it does not specify what that common distribution is. While the one-sample K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution of $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistic $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$
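The confidence band described in this note can be sketched in R as follows. This is an illustration only: it uses the control data from the example in this lecture and approximates $D_{\alpha}$ by the asymptotic value $K_{\alpha}/\sqrt{n}$ with $K_{0.05}\approx 1.36$, which is rough for a sample of size 20.

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
n   <- length(control)
Fn  <- ecdf(control)
D.a <- 1.36 / sqrt(n)              # asymptotic approximation (assumption)

xs    <- sort(control)
lower <- pmax(Fn(xs) - D.a, 0)     # clip the band to [0, 1]
upper <- pmin(Fn(xs) + D.a, 1)

plot(Fn, verticals = TRUE, main = "ECDF with approximate 95% band")
lines(xs, lower, type = "s", lty = 2)
lines(xs, upper, type = "s", lty = 2)
```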

'''3) Illustration of how the K-S test works with an example:''' consider the two datasets control $=\{1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38\}$ and treatment $=\{2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19\}$.
*Descriptive statistics: these reduce the list of all the data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the maximum is 50.57, the minimum is 0.08 and the standard deviation is 11.165. The mean lies far above the median, so this dataset is clearly not normally distributed.
*Sort the control data: control sorted $=\{0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57\}.$ Evidently no data lie strictly below $0.08$, $5\%$ ($1/20$) of the data are strictly smaller than $0.10$, $10\%$ of the data are strictly smaller than $0.15$, $15\%$ of the data are strictly smaller than $0.17$, and so on. There are $17$ data points smaller than $\pi$, so the cumulative fraction of the data smaller than $\pi$ is $17/20 = 0.85.$

RCODE:
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15,
0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11,
27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)
summary(control)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.080 0.315 0.600 3.607 1.415 50.570
sd(control)
## [1] 11.16464
*Plot the empirical cumulative fraction of the control data. From the chart, we can see that the majority of the data fall in a small fraction of the plot at the top left, which is a clear sign that the control dataset does not follow a normal distribution.
library(stats)
ecdf(control)
plot(ecdf(control),verticals=TRUE,main='Cumulative Fraction Plot')
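The cumulative fractions quoted for the sorted control data can be reproduced directly. A small sketch, with the control data redefined so that the snippet runs on its own:

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
n <- length(control)

sum(control < 0.08) / n   # 0    : nothing lies strictly below the minimum
sum(control < 0.10) / n   # 0.05 : 1 of 20
sum(control < pi) / n     # 0.85 : 17 of 20

Fn <- ecdf(control)       # ecdf() counts values <= x rather than < x
Fn(pi)                    # same value here, since no observation equals pi
```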

[[Image:SMHS_Fig_1_Model_Fitting.png|500px]]

*Now plot the control group using a log scale, which gives more space to the small x values. The plot now splits the data points evenly into two halves around the median (half above, half below), which is a little below 1 on the original scale.

[[Image:SMHS_Fig_2_Model_Fitting.png|500px]]

log.con <- log(control)
plot(ecdf(log.con),verticals=TRUE, main='Cumulative Fraction Plot of Log(Control)')

*Now, plot the cumulative fraction of both the treatment and the control group on the same graph.

[[Image:SMHS_Fig_3_Model_Fitting.png|500px]]

From the chart, we can see that both datasets span much of the same range of values, but for most x values the cumulative fraction of the treatment group (red) is less than that of the control group (blue). The K-S test uses the maximum vertical deviation between the two curves as its statistic D. In this case, the maximum deviation occurs near x=1, where D=0.45: the fraction of the treatment group that is less than 1 is 0.2 (4 out of the 20 values) and that of the control group is 0.65 (13 out of the 20 values), so the maximum difference in the cumulative fraction is D=0.45.
Note that the value of D is not affected by monotone changes of scale such as taking logs, unlike the t-statistic. Hence, the K-S test is a robust test that depends only on the relative distribution of the data.

log.treat <- log(treatment)
plot(ecdf(log.con),verticals=TRUE, main='Cumulative Fraction Plot on Log Scale',
col.points='bisque',col.hor='blue',col.vert='blue',xlim=c(-3,5))
par(new=TRUE)
plot(ecdf(log.treat),verticals=TRUE,
col.points='bisque',col.hor='red',col.vert='red',main='',xlim=c(-3,5))
con.p <- ecdf(log.con)
treat.p <- ecdf(log.treat)
con.p(0)-treat.p(0) ### D=0.45
(ecdf(control))(1)-(ecdf(treatment))(1) ## D=0.45 same as using the log-scale
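The whole comparison can also be run with <code>ks.test</code> from the <code>stats</code> package. The sketch below redefines both datasets so that it is self-contained. It computes $D$ by hand over the pooled sample, checks it against <code>ks.test</code>, confirms that taking logs leaves $D$ unchanged, and compares $D$ with the large-sample critical value $c(0.05)\sqrt{(n+{n}')/(n{n}')}$, using $c(0.05)=1.36$ from the table in the two-sample section.

```r
control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20,
             0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11,
               27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)

# The largest gap between two step functions occurs at an observed value
pooled <- sort(c(control, treatment))
D.hand <- max(abs(ecdf(control)(pooled) - ecdf(treatment)(pooled)))

res     <- ks.test(control, treatment)
res.log <- ks.test(log(control), log(treatment))   # same ranks, same D

n1 <- length(control); n2 <- length(treatment)
D.crit <- 1.36 * sqrt((n1 + n2) / (n1 * n2))       # approximate 5% critical value

c(D.hand = D.hand, D.ks = unname(res$statistic),
  D.log = unname(res.log$statistic), D.crit = D.crit)
res$p.value
```

With $D=0.45$ above the critical value of about $0.43$, the null hypothesis of a common distribution is rejected at the 5% level under this approximation.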

'''4) Using the Percentile Plot:''' we are used to observing and comparing continuous curves, so we may want something similar to the cumulative fraction plot but without the odd steps. Percentiles provide this. Consider the dataset $\{-0.45, 1.11, 0.48, -0.82, -1.26\}.$ Sorted from small to large, it is $\{-1.26, -0.82, -0.45, 0.48, 1.11\}.$ The median is $-0.45$, which is the 50th percentile. To calculate the percentile of a point, denote its position in the sorted dataset as $r$ and divide by the number of points plus one: percentile $= r/(N+1)$. This gives the set of (data, percentile) pairs $\{(-1.26, 0.167), (-0.82, 0.333), (-0.45, 0.5), (0.48, 0.667), (1.11, 0.833)\}$. Connecting adjacent points with straight lines yields a collection of connected line segments called an ''ogive''.

RCODE:
data <- c(-0.45, 1.11, 0.48, -0.82, -1.26)
sort.data <- sort(data)
percentile <- seq_along(sort.data)/(length(sort.data)+1)  # r/(N+1): 0.167, 0.333, 0.5, 0.667, 0.833
plot(ecdf(data),verticals=TRUE,xlim=c(-1.5,1.5),ylim=c(0,1), xlab='Data', ylab='', main='Cumulative Fraction Plot vs. Percentile Plot')
par(new=TRUE)
plot(sort.data,percentile,type='o',xlim=c(-1.5,1.5),ylim=c(0,1),col=2,xlab='', ylab='', main='')

[[Image:SMHS_Fig_4_Model_Fitting.png|500px]]

Reasons to prefer the percentile plot to the cumulative fraction plot: the percentile plot is a better estimate of the distribution function, and percentiles allow us to use 'probability graph paper', plots with specially scaled axis divisions. A probability scale on the y-axis lets us see how normal the data are: normally distributed data show a straight line on probability paper, while log-normal data show a straight line on probability-log scaled axes.
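In R, <code>qqnorm</code> plays the role of normal probability paper. The sketch below uses simulated log-normal data (the seed, sample size and parameters are arbitrary assumptions) and takes the correlation between theoretical and sample quantiles as a rough numerical measure of how straight each plot is.

```r
set.seed(2)                                   # arbitrary seed
x <- rlnorm(200, meanlog = 0, sdlog = 1)      # assumed log-normal sample

op <- par(mfrow = c(1, 2))
raw.qq <- qqnorm(x, main = "Raw data")        # curved: data are not normal
qqline(x)
log.qq <- qqnorm(log(x), main = "Log-transformed data")  # close to a line
qqline(log(x))
par(op)

# Correlation of the plotted points; values near 1 indicate a straight line
c(raw = cor(raw.qq$x, raw.qq$y), log = cor(log.qq$x, log.qq$y))
```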

'''5) The K-S statistic in more than one dimension:''' the K-S test statistic must be modified to accommodate multivariate data. The modification is not straightforward, because the maximum difference between two joint cdfs is not generally the same as the maximum difference between any of the complementary distribution functions. Instead, the maximum difference depends on which of $Pr(X<x \wedge Y<y)$, $Pr(X<x \wedge Y>y)$, or either of the other two possible arrangements is used. One approach is to compare the cdfs of the two samples under all possible orderings and take the largest difference as the K-S statistic. In $d$ dimensions, there are $2^{d}-1$ such orderings. The critical values for the test statistic can be obtained by simulation, but they depend on the dependence structure of the joint distribution.

===Applications===

1) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_KolmogorovSmirnoff This article] presents the SOCR analysis example for the Kolmogorov-Smirnoff test, which compares how distinct two samples are. It gives a general introduction to the K-S test and illustrates its application with the Oats example and a control-treatment test in the SOCR program.

2) [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1 This article] presents a SOCR activity that demonstrates random sampling and the fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and a description of the data, followed by exploratory data analysis and model fitting in the SOCR environment.

3) [http://jhc.sagepub.com/content/25/7/935.short This article] presents the Kolmogorov-Smirnov statistical test for the analysis of histograms and discusses the test for both the two-sample case (comparing fn1(X) to fn2(X)) and the one-sample case (comparing fn1(X) to f(X)). The specific algorithmic steps are presented through an example with data from an experiment discussed elsewhere in the same issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level.

4) [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4310069&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4310069 This article] developed digital techniques for detecting changes between two Landsat images. A data matrix containing 16 × 16 picture elements was used for this purpose. The Landsat imagery data was corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites. It successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.

===Software===
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ks.test.html

http://astrostatistics.psu.edu/su07/R/html/stats/html/ks.test.html

The R code is given with the examples in this lecture.

===Problems===

1) Consider the following dataset, control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}, use the K-S test to test the equality of the two samples. Include all necessary steps and plots.

2) Revise the example given in the lecture with a different control group: control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, run the K-S test again and state clearly your conclusions with necessary plots.

3) Two types of trees are in bloom in a wood, and we want to study whether bees prefer one tree to the other. We collect data by using a stopwatch to time how long a bee stays near a particular tree. We begin timing when the bee touches the tree and stop timing when the bee is more than a meter from the tree. (As a result, all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset size is not a problem for the K-S test.) Apply the K-S test to this example. The data are given below:
T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5}
T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7}
Note that this example is based on data distributed according to the Cauchy distribution, a particularly abnormal case. The plots do not look particularly abnormal; however, the large number of outliers is a tip-off of a non-normal distribution.

===References===

[http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test K-S Test Wikipedia]

[http://www.physics.csbsju.edu/stats/KS-test.html K-S Test]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ModelFitting}}


SMHS LongitudinalData

2014-10-17T16:37:32Z

Clgalla: /* Scientific Methods for Health Sciences - Longitudinal Data */

==[[SMHS| Scientific Methods for Health Sciences]] - Longitudinal Data ==

===Overview===
Longitudinal data are data collected from a large population over a given time period in which the same subjects are measured at multiple points in time. They are widely used in statistical and financial studies. In this section, we introduce the concept of longitudinal data.

===Motivation===
Data measured at multiple points in time on the same subject are commonly used in various studies. For example, consider a dataset containing students' standard test scores in four successive years (shown in the table below). How do we categorize data like this?

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Student Name|| Grade 1 (2010) Raw Score||Grade 2 (2011) Raw Score||Grade 3 (2012) Raw Score||Grade 4 (2013) Raw Score
|-
|Tom ||339|| 350|| 361|| 366
|-
|Mike|| 332|| 343|| 350|| 351
|-
|Vivian||360 ||380 ||400|| 420
|}
</center>

===Theory===

'''1) Longitudinal data'''
Longitudinal data are data collected from a large population over a given time period in which the same subjects are measured at multiple points in time.
*The primary advantage of longitudinal datasets is that they can measure change. Hence, for the dataset listed above, we can estimate the effect of various factors on improvement in students' achievement, or we can estimate the overall effectiveness of individual teachers by examining the performance of successive classes of students they teach.
*Longitudinal data extend into both the past and the present, which enables us to evaluate the effect of a specific policy by comparing student performance before and after the policy was introduced.
*Longitudinal data also allow us to use sophisticated analytic strategies to measure the impact of various policies with reasonable precision.

'''2) Longitudinal study'''
A longitudinal study is an investigation where participant outcomes and possibly treatments or exposures are collected at multiple follow-up times. It generally yields multiple (repeated) measurements on each subject, which are correlated within subjects and thus require special statistical techniques for analysis and inference. In a longitudinal study, we may also be interested in measuring the time until a key clinical event such as death, in which case survival analysis is generally applied.
*Longitudinal studies play an important role in fields such as epidemiology, clinical research and biology. They are used to characterize normal growth and aging, to assess the effect of risk factors on human health, and to evaluate the effectiveness of treatments, and they require considerable effort.
*Benefits of longitudinal studies: (1) incident events are recorded: a prospective longitudinal study measures the new occurrence of disease; (2) prospective ascertainment of exposure: data recorded at multiple follow-up visits can reduce recall bias; (3) measurement of individual change in outcomes; (4) separation of time effects: cohort, period, age; (5) control for cohort effects: in a longitudinal study, the cohort under study is fixed and thus changes in time are not confounded by cohort differences.
*Challenges of longitudinal studies: (1) participant follow-up: there is a risk of bias due to incomplete follow-up of study participants; (2) analysis of correlated data: if the intra-subject correlation of response measurements is ignored, then inferences such as statistical tests or confidence intervals can be grossly invalid; (3) time-varying covariates: the direction of causality can be complicated by 'feedback' between the outcome and the exposure.

'''3) Notations'''
We use $Y_{ij}$ to denote the outcome measured on subject $i$ at time $t_{ij},$ where $i=1, \cdots, N$ indexes subjects and $j = 1, \cdots, n$ indexes observations within a subject. Here the measurement times are assumed to follow a common set of follow-up times, $t_{ij} = t_{j}$. We use $X_{ij}$ to denote the covariates associated with observation $Y_{ij}$. Common covariates in a longitudinal study include the time, $t_{ij}$, and person-level characteristics such as age and treatment assignment. In many cases, scientific interest focuses on the mean response as a function of covariates such as treatment and time, but we can also make statistical inference on the within-person correlation of observations. Define $\rho_{jk} = corr(Y_{ij}, Y_{ik}),$ the within-subject correlation between observations at times $t_{j}$ and $t_{k}.$
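To make the notation concrete, the sketch below stores the test scores from the Motivation table as an $N \times n$ matrix with entries $Y_{ij}$ and also as a 'long' data frame with one row per (subject, time) pair. The object and column names (<code>Y</code>, <code>long</code>, <code>subject</code>, <code>time</code>, <code>score</code>) are illustrative assumptions.

```r
# Rows are subjects i = 1..N, columns are common follow-up times t_j
Y <- matrix(c(339, 350, 361, 366,
              332, 343, 350, 351,
              360, 380, 400, 420),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Tom", "Mike", "Vivian"),
                            as.character(2010:2013)))
N   <- nrow(Y)                     # number of subjects
n   <- ncol(Y)                     # observations per subject
t.j <- 2010:2013                   # t_ij = t_j for every subject

# Long format: one row per observation Y_ij, with time as a covariate
long <- data.frame(subject = rep(rownames(Y), each = n),
                   time    = rep(t.j, times = N),
                   score   = as.vector(t(Y)))

# Individual change from first to last measurement
change <- Y[, n] - Y[, 1]
change
```

The long layout is the form most regression tools for repeated measurements expect, while the wide matrix is convenient for computing quantities such as $\rho_{jk}$ column by column.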

'''4) Exploratory data analysis'''
This is used to discover patterns of systematic variation across groups, as well as aspects of random variation among individual patients.
*Group means: if we are looking to measure the average response over time, statistical measures like means and standard deviations can reveal whether different groups change in a similar or different fashion.
*Variation among individuals: a single variance parameter can be used to summarize uncertainty or variability in a response measurement. In longitudinal data, the 'distance' between measurements on different subjects is usually expected to be greater than the distance between repeated measurements taken on the same subject. Hence, although the total variance can be written as $\sigma^{2} =\frac{1}{2} E[(Y_{ij}-Y_{i'j})^{2}]$ assuming $E(Y_{ij}) = E(Y_{i'j}) = \mu,$ the expected variation for two measurements taken on the same person at times $t_{j}$ and $t_{k}$ may not equal the total variation $\sigma^{2}$, since the measurements are correlated: $\sigma^{2}(1-\rho_{jk})=\frac{1}{2} E[(Y_{ij}-Y_{ik})^{2}]$ assuming $E(Y_{ij}) = E(Y_{ik}) = \mu.$ When $\rho_{jk} > 0,$ between-subject variation is greater than within-subject variation.
*Characterizing correlation and covariance: characterizing the correlation is useful for understanding components of variation and for identifying a variance or correlation model for regression methods such as GEE (generalized estimating equations) or mixed-effects models. The covariance matrix can be written in terms of the variances $\sigma_{j}^2$ and the correlations $\rho_{jk}$: $cov(Y_{i})=\begin{bmatrix} \sigma_{1}^2 & \sigma_{1}\sigma_{2}\rho_{12} &\cdots &\sigma_{1}\sigma_{n}\rho_{1n} \\ \sigma_{2}\sigma_{1}\rho_{21}&\sigma_{2}^2 &\cdots &\sigma_{2}\sigma_{n}\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\ \sigma_{n}\sigma_{1}\rho_{n1}&\sigma_{n}\sigma_{2}\rho_{n2} &\cdots &\sigma_{n}^2 \end{bmatrix}$; the corresponding correlation matrix is $corr(Y_{i})=\begin{bmatrix} 1 & \rho_{12} &\cdots &\rho_{1n} \\\rho_{21}&1 & \cdots&\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\\rho_{n1}&\rho_{n2} &\cdots &1 \end{bmatrix}.$
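These relations can be illustrated with simulated data. The sketch below assumes a simple random-intercept model, $Y_{ij}=\mu+b_{i}+e_{ij}$ with $var(b_{i})=4$ and $var(e_{ij})=1$, so that $\sigma^{2}=5$ and $\rho_{jk}=0.8$ for all $j \neq k$. The model, seed and sample sizes are assumptions chosen for illustration only.

```r
set.seed(3)                                # arbitrary seed
N <- 2000; n <- 4                          # subjects and visits (assumed)
b <- rnorm(N, sd = 2)                      # subject effect, variance 4
Y <- 10 + b + matrix(rnorm(N * n), N, n)   # Y[i, j]; error variance 1

S <- cov(Y)                                # estimates cov(Y_i)
R <- cor(Y)                                # estimates corr(Y_i)
s <- sqrt(diag(S))                         # sigma_j

# cov(Y_i) has entries sigma_j * sigma_k * rho_jk
all.equal(S, diag(s) %*% R %*% diag(s), check.attributes = FALSE)

# Half the mean squared difference, within and between subjects
within  <- 0.5 * mean((Y[, 1] - Y[, 2])^2)     # about sigma^2 * (1 - rho) = 1
between <- 0.5 * mean((Y[-1, 1] - Y[-N, 1])^2) # about sigma^2 = 5
c(within = within, between = between)
```

Because $\rho_{jk}>0$ in this model, the between-subject value comes out well above the within-subject value.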

===Applications===

1) [http://biomet.oxfordjournals.org/content/73/1/13.short This article] proposed an extension of generalized linear models to the analysis of longitudinal data. This paper introduced a class of estimating equations that give consistent estimates of the regression parameters and of their variance under mild assumptions about the time dependence. The estimating equations were derived without specifying the joint distribution of a subject's observations, yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory was presented for the general class of estimators. Specific cases that assume independence, m-dependence and exchangeable correlation structures for each subject's observations were discussed. The efficiency of the proposed estimators in two simple situations was considered.

2) [http://onlinelibrary.wiley.com/doi/10.1002/sim.4780111406/abstract;jsessionid=0538E29F4FDDD9D0DD3D672621073EA7.f03t03?deniedAccessCustomisedMessage=&userIsAuthenticated=false This article] reviewed statistical methods for the analysis of discrete and continuous longitudinal data. The relative merits of longitudinal and cross-sectional studies were discussed. Three approaches, marginal, transition and random effects models, were presented with emphasis on the distinct interpretations of their coefficients in the discrete data case. This paper reviewed generalized estimating equations for inferences about marginal models. The ideas were illustrated with analyses of a 2 × 2 crossover trial with binary responses and a randomized longitudinal study with a count outcome.

===Software===
In R: the 'longitudinal' package.

===Problems===
install.packages('longitudinal')
library(longitudinal)
data(tcell)
is.longitudinal(tcell.34)
TRUE
summary(tcell.34)
Longitudinal data:
58 variables measured at 10 different time points
Total number of measurements per variable: 340
Repeated measurements: yes

To obtain the measurement design call 'get.time.repeats()'.
plot(tcell.10,1:9) ## plot the first 9 time series of the data

[[Image:SMHS__Longtitud_Fig1_.png|500px]]

dim(tcell.34) ## dataset with 34 repeats
[1] 340 58
get.time.repeats(tcell.34)
$time
[1] 0 2 4 6 8 18 24 32 48 72

$repeats
[1] 34 34 34 34 34 34 34 34 34 34

is.equally.spaced(tcell.34)
[1] FALSE
is.regularly.sampled(tcell.34)
[1] TRUE
has.repeated.measurements(tcell.34)
[1] TRUE
condense.longitudinal(tcell.34,1:2,mean) # compute the mean value at each time point for the first two genes
RB1 CCNG1
[1,] 17.41394 16.48101
[2,] 17.62637 16.34122
[3,] 17.89343 16.26661
[4,] 17.37539 15.91950
[5,] 17.57670 16.25621
[6,] 17.85805 16.26411
[7,] 17.76270 16.24127
[8,] 17.22543 16.52049
[9,] 16.86306 16.14295
[10,] 17.09348 16.58913


===References===
[http://en.wikipedia.org/wiki/Longitudinal_study Longitudinal Study Wikipedia]

[http://faculty.washington.edu/heagerty/Courses/VA-longitudinal/private/LDAchapter.pdf Longitudinal Data Analysis]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData}}

3) Theory

3.1) Longitudinal data: data collected from a large population over a given time period where the same subjects are measured at multiple points in time.
The primary advantage of longitudinal dataset is that they can measure change. Hence, we can estimate. For example, for the dataset listed above, we can estimate the effect of various factors on improvement in students’ achievement. Or, we can estimate the overall effectiveness of individual teachers by examining the performance of successive classes of students they teach.
The longitudinal data extend into both the past and the present, which enable us to evaluate the effect of a specific policy by looking at student performance before or after the policy was introduced.
Longitudinal data also allows us to use sophisticated analytic strategies to measure the impact of various policies with reasonable precision.

3.2) Longitudinal study: an investigation where participant outcomes and possibly treatments or exposures are collected at multiple follow-up times. It generally yields multiple (repeated) measurements on each subject, which are correlated within subjects and thus require special statistical techniques for analysis and inference. In longitudinal study, we may also be interested in measuring the time until a key clinical event such as time to death, where survival analysis is generally applied for analysis.
It plays an important role in various studies like epidemiology, clinical research and biological studies. Longitudinal studies are used to characterize normal growth and aging to assess the effect of risk factors on human health to evaluate the effectiveness of treatments and involve lots of efforts.
Benefits of longitudinal studies: (1) incident events are recorded, a prospective longitudinal study measures the new occurrence of disease; (2) prospective ascertainment of exposure, data recorded at multiple follow-up visits may recall bias; (3) measurement of individual change in outcomes; (4) separation of time effects: cohort, period, age; (5) control for cohort effects, in a longitudinal study, the cohort under study is fixed and thus changes in time are not confounded by cohort differences.
Challenges of longitudinal studies: (1) participant follow-up, there is risk of bias due to incomplete follow-up of study participants; (2) analysis of correlated data, if intra-subject correlation of response measurements is ignored, then inferences such as statistical tests or confidence interval can be grossly invalid; (3) time-varying covariate, the direction of causality can be complicated by ‘feedback’ between the outcome and the exposure.

3.3) Notations: use Y_{ij} to denote the outcome measured on subject i at time t_{ij}, where i =1, \cdots, N is index for subject, and j = 1, \cdots, n is index for observations within a subject. The measurement time follows a common set of follow-up times t_{ij} = t_{j}. Use X_{ij} to denote covariates associated with observations Y_{ij}. Common covariates in a longitudinal study include the time, t_{ij}, and person-level characteristics such as age, treatment assignments, and etc. In many cases, the scientific studies focus on mean responses as a function of covariates such as treatment and time, we can also make statistical inference on within-person correlation of observation. Define \rho_{jk} = corr(Y_{ij}, Y_{ik}), the within-subject correlation between observations at time t_{j} and t_{k}.

3.4) Exploratory data analysis: to discover patterns of systematic variation across groups, as well as aspects of random variation among individual patients.
Group means: if we want to measure the average response over time, summary statistics such as group means and standard deviations can reveal whether different groups change in a similar or different fashion.
Variation among individuals: a single variance parameter can be used to summarize the uncertainty or variability in a response measurement. In longitudinal data, the ‘distance’ between measurements on different subjects is usually expected to be greater than the distance between repeated measurements taken on the same subject. The total variance can be written as \sigma^{2} =1/2 E[(Y_{ij}-Y_{i’j})^{2}], assuming E(Y_{ij}) = E(Y_{i’j}) = \mu, for two different subjects i and i’. However, the expected variation for two measurements taken on the same person at times t_{j} and t_{k} may not equal the total variation \sigma^{2}, since the measurements are correlated: \sigma^{2}(1-\rho_{jk})=1/2 E[(Y_{ij}-Y_{ik})^{2}], assuming E(Y_{ij}) = E(Y_{ik}) = \mu. When \rho_{jk} > 0, between-subject variation is greater than within-subject variation.
Characterizing correlation and covariance: characterizing the correlation is useful for understanding components of variation and for identifying a variance or correlation model for regression methods such as GEE (generalized estimating equations) or mixed-effects models. The covariance matrix can be written in terms of the variances \sigma_{j}^2 and the correlations \rho_{jk}: cov(Y_{i})=\begin{bmatrix} \sigma_{1}^2 & \sigma_{1}\sigma_{2}\rho_{12} &\cdots &\sigma_{1}\sigma_{n}\rho_{1n} \\ \sigma_{2}\sigma_{1}\rho_{21}&\sigma_{2}^2 &\cdots &\sigma_{2}\sigma_{n}\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\ \sigma_{n}\sigma_{1}\rho_{n1}&\sigma_{n}\sigma_{2}\rho_{n2} &\cdots &\sigma_{n}^2 \end{bmatrix}; the correlation matrix is: corr(Y_{i})=\begin{bmatrix} 1 & \rho_{12} &\cdots &\rho_{1n} \\\rho_{21}&1 & \cdots&\rho_{2n} \\ \vdots &\vdots & \ddots &\vdots \\\rho_{n1}&\rho_{n2} &\cdots &1 \end{bmatrix}.
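
The quantities in 3.4 can be explored numerically. The following is a minimal sketch in base R, assuming a simple random-intercept model with exchangeable correlation (every \rho_{jk} equal to a single \rho); the model, the parameter values and all variable names are illustrative assumptions, not taken from a particular study.

```r
# Sketch: simulated longitudinal data with exchangeable within-subject correlation.
# All parameter values below are hypothetical, chosen only for illustration.
set.seed(1)
N     <- 2000   # number of subjects
n     <- 4      # number of observations per subject
mu    <- 10     # common mean
sigma <- 2      # total standard deviation
rho   <- 0.6    # within-subject correlation

# Y_ij = mu + b_i + e_ij, with Var(b_i) = rho * sigma^2 and
# Var(e_ij) = (1 - rho) * sigma^2, so Var(Y_ij) = sigma^2 and corr(Y_ij, Y_ik) = rho
b <- rnorm(N, mean = 0, sd = sigma * sqrt(rho))
e <- matrix(rnorm(N * n, mean = 0, sd = sigma * sqrt(1 - rho)), nrow = N)
Y <- mu + b + e   # N x n matrix; b is recycled down columns, so row i receives b[i]

# Empirical covariance and correlation matrices across the n time points
cov.hat  <- cov(Y)
corr.hat <- cor(Y)

# Half the mean squared difference between different subjects at the same time
# (subject i paired with subject i + 1): estimates sigma^2
between <- 0.5 * mean((Y[-N, 1] - Y[-1, 1])^2)

# Half the mean squared difference within a subject at two times:
# estimates sigma^2 * (1 - rho)
within <- 0.5 * mean((Y[, 1] - Y[, 2])^2)

round(corr.hat, 2)
c(between = between, within = within)
```

Under these assumptions the off-diagonal entries of corr.hat should be close to \rho, and the within-subject quantity should be smaller than the between-subject one, as stated in the text for \rho_{jk} > 0.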

4) Applications

4.1) This article (http://biomet.oxfordjournals.org/content/73/1/13.short) proposed an extension of generalized linear models to the analysis of longitudinal data. This paper introduced a class of estimating equations that give consistent estimates of the regression parameters and of their variance under mild assumptions about the time dependence. The estimating equations were derived without specifying the joint distribution of a subject's observations, yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory was presented for the general class of estimators. Specific cases that assume independence, m-dependence and exchangeable correlation structures for the observations from each subject were discussed. The efficiency of the proposed estimators in two simple situations was considered.

4.2) This article (http://onlinelibrary.wiley.com/doi/10.1002/sim.4780111406/abstract;jsessionid=0538E29F4FDDD9D0DD3D672621073EA7.f03t03?deniedAccessCustomisedMessage=&userIsAuthenticated=false) reviewed statistical methods for the analysis of discrete and continuous longitudinal data. The relative merits of longitudinal and cross-sectional studies were discussed. Three approaches, marginal, transition and random effects models, were presented with emphasis on the distinct interpretations of their coefficients in the discrete data case. This paper reviewed generalized estimating equations for inferences about marginal models. The ideas were illustrated with analyses of a 2 × 2 crossover trial with binary responses and a randomized longitudinal study with a count outcome.

5) Software
In R: the ‘longitudinal’ package

6) Problems
install.packages('longitudinal')
library(longitudinal)
data(tcell)
is.longitudinal(tcell.34)
TRUE
summary(tcell.34)
Longitudinal data:
58 variables measured at 10 different time points
Total number of measurements per variable: 340
Repeated measurements: yes

To obtain the measurement design call 'get.time.repeats()'.
plot(tcell.10,1:9) ## plot the first 9 time series of the data

[[Image:SMHS__Longtitud_Fig1_.png|500px]]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_LongitudinalData}}

2014-10-16T17:09:05Z

Clgalla: /* Scientific Methods for Health Sciences - Mixture Modeling */

==[[SMHS| Scientific Methods for Health Sciences]] - Mixture Modeling ==

1) Overview: a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that the observed data set identify the sub-population to which each individual observation belongs. In this section, we present a general introduction to mixture modeling, the structure of a mixture model, various types of mixture models, the estimation of parameters in mixture models and the application of mixture models in studies. The implementation of mixture modeling in R packages is also discussed in the attached articles.

2) Motivation: a mixture distribution represents the probability distribution of observations in the overall population. Problems associated with mixture distributions relate to deriving the properties of the overall population from those of the sub-populations. We use mixture models to make statistical inference about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture models are not the same as models for compositional data, whose components are constrained to sum to a constant value (1, 100%, etc.). What is the structure of a mixture model, and how can we estimate its parameters?

3) Theory

3.1) Structure of mixture model: a distribution f is a mixture of K component distributions f_{1}, f_{2}, \cdots, f_{K} if f(x) = \sum_{k=1}^{K} \lambda_{k}f_{k}(x), with the \lambda_{k} being the mixing weights, \lambda_{k} > 0, \sum_{k}\lambda_{k} = 1. Equivalently, Z \sim Mult(\lambda_{1}, \cdots, \lambda_{K}), X|Z \sim f_{Z}, where the discrete random variable Z indicates the component from which X is drawn. Different parametric families for f_{k} generate different parametric mixture models, such as Gaussian, Binomial or Poisson mixtures. The components may, for example, all be Gaussian with different parameters, or all Poisson with different means. The model can then be expressed as f(x) = \sum_{k=1}^{K} \lambda_{k}f(x;\theta_{k}), and the parameter vector of the mixture model is \theta = (\lambda_{1}, \cdots, \lambda_{K}, \theta_{1}, \cdots, \theta_{K}). When K=1, we get a simple parametric distribution of the usual sort, and density estimation reduces to estimating the parameters by maximum likelihood (ML). If K=n, the number of observations, we are back to kernel density estimation.
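
The two equivalent views of a mixture (a weighted sum of densities, and a latent label Z followed by a draw from the selected component) can be sketched in base R. The two-component Gaussian parameters and the helper name dmix below are hypothetical, chosen only to illustrate the definition.

```r
# Sketch: a two-component Gaussian mixture with hypothetical parameters
lambda <- c(0.3, 0.7)   # mixing weights, positive and summing to 1
mu     <- c(0, 4)       # component means
sigma  <- c(1, 2)       # component standard deviations

# Mixture density f(x) = sum_k lambda_k * f(x; theta_k)
dmix <- function(x, lambda, mu, sigma) {
  dens <- 0
  for (k in seq_along(lambda)) {
    dens <- dens + lambda[k] * dnorm(x, mean = mu[k], sd = sigma[k])
  }
  dens
}

# A valid mixture density integrates to 1
total <- integrate(dmix, -Inf, Inf, lambda = lambda, mu = mu, sigma = sigma)$value

# Generative view: draw the latent label Z ~ Mult(lambda), then X | Z ~ f_Z
set.seed(2)
z <- sample(seq_along(lambda), size = 5000, replace = TRUE, prob = lambda)
x <- rnorm(5000, mean = mu[z], sd = sigma[z])

# The mixture mean is the weighted average of the component means
mix.mean <- sum(lambda * mu)
c(total = total, sample.mean = mean(x), mixture.mean = mix.mean)
```

The sample mean of x should be close to mix.mean, and the share of draws with z equal to 1 should be close to the first mixing weight.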

3.2) Estimating parametric mixture models: assume independent samples, so that the joint density (the likelihood) of the observations x_{1}, x_{2}, \cdots, x_{n} is \prod_{i=1}^{n}f(x_{i};\theta). Taking the logarithm turns multiplication into addition: l(\theta) = \sum_{i=1}^{n} \log f(x_{i};\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \lambda_{k} f(x_{i}; \theta_{k}). Taking the derivative with respect to one parameter, say \theta_{j}, gives \frac{\partial l}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{1}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\lambda_{j}\frac{\partial f(x_{i};\theta_{j})}{\partial \theta_{j}}=\sum_{i=1}^{n}\frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}. If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be \sum_{i=1}^{n}\frac{\partial \log f(x_{i};\theta_{j})}{\partial \theta_{j}}. Maximizing the likelihood for a mixture model is therefore like doing a weighted likelihood maximization, where the weight of x_{i} depends on the cluster j: w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})}.

Remember that \lambda_{j} is the probability that the hidden class variable Z is j, so the numerator in the weights is the joint probability of getting Z=j and X=x_{i}. The denominator is the marginal probability of getting X=x_{i}, so the ratio is the conditional probability of Z=j given X=x_{i}: w_{ij} = \frac{\lambda_{j}f(x_{i};\theta_{j})}{\sum_{k=1}^{K}\lambda_{k}f(x_{i};\theta_{k})} = p(Z=j | X=x_{i}; \theta).
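
As a small numerical illustration of the weights w_{ij}, the sketch below computes them for three data points under a two-component Gaussian mixture. The parameter values and data points are hypothetical; with equal weights and equal standard deviations, a point halfway between the two means should be assigned to each component with probability 1/2.

```r
# Sketch: conditional probabilities w_ij = p(Z = j | X = x_i) for a
# two-component Gaussian mixture with hypothetical parameters
lambda <- c(0.5, 0.5)
mu     <- c(0, 5)
sigma  <- c(1, 1)
x      <- c(-1, 2.5, 6)   # 2.5 is the midpoint between the two means

# Numerators: joint probability of Z = j and X = x_i (one column per component)
joint <- sapply(seq_along(lambda),
                function(j) lambda[j] * dnorm(x, mean = mu[j], sd = sigma[j]))

# Denominator: marginal density of x_i; dividing gives the n x K weight matrix
w <- joint / rowSums(joint)
round(w, 4)
```

Each row of w sums to 1. The first point is assigned almost entirely to component 1, the last almost entirely to component 2, and the midpoint is split evenly.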
EM algorithm: (1) start with guesses about the mixture components \theta_{1}, \cdots, \theta_{K} and the mixing weights \lambda_{1}, \cdots, \lambda_{K}; (2) until nothing changes very much: using the current parameter guesses, calculate the weights w_{ij} (E-step); using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step); (3) return the final parameter estimates (including mixing proportions) and cluster probabilities.
Non-parametric mixture modeling: replace the M-step of EM by some other way of estimating the distribution of each mixture component. This could be a fast and crude estimate of the parameters, or it could even be a non-parametric density estimator.
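
The EM steps listed above can be sketched for the simplest case of a univariate mixture of two Gaussians. This is a minimal illustration in base R, not the implementation used by mixtools; the function name em.two.gaussians, its arguments, the stopping rule and the simulated data are all assumptions made for the example, and it omits safeguards (for instance against a component variance collapsing to zero).

```r
# Sketch: EM for a univariate mixture of two Gaussians (hypothetical helper)
em.two.gaussians <- function(x, lambda, mu, sigma, max.iter = 200, tol = 1e-8) {
  loglik.old <- -Inf
  for (iter in 1:max.iter) {
    # E-step: weights w_ij under the current parameter guesses
    joint <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
                   lambda[2] * dnorm(x, mu[2], sigma[2]))
    marginal <- rowSums(joint)
    w <- joint / marginal
    # Log-likelihood at the parameters used in this E-step
    loglik <- sum(log(marginal))

    # M-step: weighted maximum likelihood estimates for each component
    n.k    <- colSums(w)                 # effective number of points per component
    lambda <- n.k / length(x)            # new mixing weights
    mu     <- colSums(w * x) / n.k       # weighted means
    sigma  <- sqrt(colSums(w * outer(x, mu, "-")^2) / n.k)  # weighted std. deviations

    # Stop when the log-likelihood no longer changes very much
    if (abs(loglik - loglik.old) < tol) break
    loglik.old <- loglik
  }
  list(lambda = lambda, mu = mu, sigma = sigma,
       posterior = w, loglik = loglik, iterations = iter)
}

# Usage on simulated data: 30% from N(0, 1) and 70% from N(6, 1.5^2)
set.seed(3)
x <- c(rnorm(300, mean = 0, sd = 1), rnorm(700, mean = 6, sd = 1.5))
fit <- em.two.gaussians(x, lambda = c(0.5, 0.5), mu = c(-1, 4), sigma = c(2, 2))
fit[c("lambda", "mu", "sigma", "iterations")]
```

With well-separated components such as these, the estimates should land near the simulation values. EM only finds a local maximum of the likelihood, so in harder problems the result can depend on the initial guesses.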

3.3) Computational examples in R: Snoqualmie Falls Revisited (analyzed using the mixtools package in R)
RCODE:
snoqualmie <- read.csv("http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv",header=FALSE)
snoqualmie.vector <- na.omit(unlist(snoqualmie))
snoq <- snoqualmie.vector[snoqualmie.vector > 0]
summary(snoq)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 6.00 19.00 32.28 44.00 463.00

# Two-component Gaussian mixture
library(mixtools)
snoq.k2 <- normalmixEM(snoq,k=2,maxit=100,epsilon=0.01)
summary(snoq.k2)
summary of normalmixEM object:
comp 1 comp 2
lambda 0.557524 0.442476
mu 10.266172 60.009468
sigma 8.510244 44.997240
loglik at estimate: -32681.21

# Function to add Gaussian mixture components, vertically scaled, to the
# current plot
# Presumes the mixture object has the structure used by mixtools
plot.normal.components <- function(mixture,component.number,...) {
curve(mixture$lambda[component.number] *
dnorm(x,mean=mixture$mu[component.number],
sd=mixture$sigma[component.number]), add=TRUE, ...)
}
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")
lines(density(snoq),lty=2)

[[Image:SMHS_MixtureModel_Fig1.png|500px]]

Histogram (grey) for precipitation on wet days in Snoqualmie Falls. The dashed line is a kernel density estimate, which is not completely satisfactory.
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")
lines(density(snoq),lty=2)

sapply(1:2,plot.normal.components,mixture=snoq.k2)

[[Image:SMHS_MixtureModel_Fig2.png|500px]]

As in the previous figure, plus the components of a mixture of two Gaussians, fitted to the data by the EM algorithm (dashed lines). These are scaled by the mixing weights of the components.

# Function to calculate the cumulative distribution function of a Gaussian
# mixture model
# Presumes the mixture object has the structure used by mixtools
# Doesn't implement some of the usual options for CDF functions in R, like
# returning the log probability, or the upper tail probability
pnormmix <- function(x,mixture) {
lambda <- mixture$lambda
k <- length(lambda)
pnorm.from.mix <- function(x,component) {
lambda[component]*pnorm(x,mean=mixture$mu[component],
sd=mixture$sigma[component])
}
pnorms <- sapply(1:k,pnorm.from.mix,x=x)
return(rowSums(pnorms))
}

#### Figure 3
# Distinct values in the data
distinct.snoq <- sort(unique(snoq))
# Theoretical CDF evaluated at each distinct value
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k2)
# Empirical CDF evaluated at each distinct value
# ecdf(snoq) returns an object which is a _function_, suitable for application
# to new vectors
ecdfs <- ecdf(snoq)(distinct.snoq)
# Plot them against each other
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),
ylim=c(0,1))
# Main diagonal for visual reference
abline(0,1)

[[Image:SMHS_MixtureModel_Fig3.png|500px]]

Calibration plot for the two-component Gaussian mixture. For each distinct value of precipitation x, plot the fraction of days predicted by the mixture model to have \leq x precipitation on the horizontal axis, versus the actual fraction of days \leq x.
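
The idea behind the calibration plot can be tried without the precipitation data. The sketch below uses simulated data and a mixture whose parameters are assumed known (a stand-in for a fitted model; all names and values are hypothetical), and summarizes the plot by the largest vertical gap between the theoretical and empirical CDFs at the observed points.

```r
# Sketch: calibration check on simulated data from a known two-component mixture
set.seed(4)
lambda <- c(0.4, 0.6)
mu     <- c(0, 5)
sigma  <- c(1, 2)
z <- sample(1:2, size = 2000, replace = TRUE, prob = lambda)
x <- rnorm(2000, mean = mu[z], sd = sigma[z])

# Theoretical CDF of the mixture: weighted sum of the component CDFs
pmix <- function(q) {
  lambda[1] * pnorm(q, mean = mu[1], sd = sigma[1]) +
    lambda[2] * pnorm(q, mean = mu[2], sd = sigma[2])
}

pts       <- sort(unique(x))   # distinct observed values
tcdf      <- pmix(pts)         # model-predicted fraction <= each value
ecdf.vals <- ecdf(x)(pts)      # observed fraction <= each value

# Largest vertical gap between the two curves at the observed points
max.gap <- max(abs(tcdf - ecdf.vals))
max.gap
```

Calling plot(tcdf, ecdf.vals) followed by abline(0, 1) reproduces the layout of the calibration plot; a well-calibrated model keeps the points close to the diagonal, which corresponds to a small max.gap.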

# Probability density function for a Gaussian mixture
# Presumes the mixture object has the structure used by mixtools
dnormalmix <- function(x,mixture,log=FALSE) {
lambda <- mixture$lambda
k <- length(lambda)
# Calculate share of likelihood for all data for one component
like.component <- function(x,component) {
lambda[component]*dnorm(x,mean=mixture$mu[component],
sd=mixture$sigma[component])
}
# Create array with likelihood shares from all components over all data
likes <- sapply(1:k,like.component,x=x)
# Add up contributions from components
d <- rowSums(likes)
if (log) {
d <- log(d)
}
return(d)
}

# Log likelihood function for a Gaussian mixture, potentially on new data
loglike.normalmix <- function(x,mixture) {
loglike <- dnormalmix(x,mixture,log=TRUE)
return(sum(loglike))
}
loglike.normalmix(snoq,mixture=snoq.k2)
[1] -32681.21
# Evaluate various numbers of Gaussian components by data-set splitting
# (i.e., very crude cross-validation)
n <- length(snoq)
data.points <- 1:n
data.points <- sample(data.points) # Permute randomly
train <- data.points[1:floor(n/2)] # First random half is training
test <- data.points[-(1:floor(n/2))] # 2nd random half is testing
candidate.component.numbers <- 2:10
loglikes <- vector(length=1+length(candidate.component.numbers))
# k=1 needs special handling
mu<-mean(snoq[train]) # MLE of mean
sigma <- sd(snoq[train])*sqrt((length(train)-1)/length(train)) # MLE of standard deviation (uses training-set size)
loglikes[1] <- sum(dnorm(snoq[test],mu,sigma,log=TRUE))
for (k in candidate.component.numbers) {
mixture <- normalmixEM(snoq[train],k=k,maxit=400,epsilon=1e-2)
loglikes[k] <- loglike.normalmix(snoq[test],mixture=mixture)
}
loglikes
[1] -17647.93 -16336.12 -15796.02 -15554.33 -15398.04 -15337.47 -15297.61
[8] -15285.60 -15286.75 -15288.88
plot(x=1:10, y=loglikes,xlab="Number of mixture components", ylab="Log-likelihood on testing data")

[[Image:SMHS_MixtureModel_Fig4.png|500px]]

Log-likelihoods of mixture models of different sizes, fit to a random half of the data for training, and evaluated on the other half of the data for testing.

snoq.k9 <- normalmixEM(snoq,k=9,maxit=400,epsilon=1e-2)
plot(hist(snoq,breaks=101),col="grey",border="grey",freq=FALSE,
xlab="Precipitation (1/100 inch)",main="Precipitation in Snoqualmie Falls")
lines(density(snoq),lty=2)
sapply(1:9,plot.normal.components,mixture=snoq.k9)

[[Image:SMHS_MixtureModel_Fig5.png|500px]]

The same histogram and kernel density estimate, with the components of the nine-component Gaussian mixture added.

# Assignments for distinct.snoq and ecdfs are redundant if you've already run
# the Figure 3 code above
distinct.snoq <- sort(unique(snoq))
tcdfs <- pnormmix(distinct.snoq,mixture=snoq.k9)
ecdfs <- ecdf(snoq)(distinct.snoq)
plot(tcdfs,ecdfs,xlab="Theoretical CDF",ylab="Empirical CDF",xlim=c(0,1),
ylim=c(0,1))
abline(0,1)

[[Image:SMHS_MixtureModel_Fig6.png|500px]]

Calibration plot for the nine-component Gaussian mixture.

plot(0,xlim=range(snoq.k9$mu),ylim=range(snoq.k9$sigma),type="n",
xlab="Component mean", ylab="Component standard deviation")
points(x=snoq.k9$mu,y=snoq.k9$sigma,pch=as.character(1:9),
cex=sqrt(0.5+5*snoq.k9$lambda))

[[Image:SMHS_MixtureModel_Fig7.png|500px]]

Characteristics of the components of the 9-mode Gaussian mixture. The horizontal axis gives the component mean, the vertical axis its standard deviation. The area of the number representing each component is proportional to the component’s mixing weight.

plot(density(snoq),lty=2,ylim=c(0,0.04),
main=paste("Comparison of density estimates\n",
"Kernel vs. Gaussian mixture"),
xlab="Precipitation (1/100 inch)")
curve(dnormalmix(x,snoq.k9),add=TRUE)

[[Image:SMHS_MixtureModel_Fig8.png|500px]]

Dashed line: kernel density estimate. Solid line is the nine-Gaussian mixture. Notice that the mixture, unlike the KDE, gives negligible probability to negative precipitation.

# Do the classes of the Gaussian mixture make sense as annual weather patterns?
# Most probable class for each day:
day.classes <- apply(snoq.k9$posterior,1,which.max)
# Make a copy of the original, structured data set to edit
snoqualmie.classes <- snoqualmie
# Figure out which days had precipitation
wet.days <- (snoqualmie > 0) & !(is.na(snoqualmie))
# Replace actual precipitation amounts with classes
snoqualmie.classes[wet.days] <- day.classes
# Problem: the number of the classes doesn't correspond to e.g. amount of
# precipitation expected. Solution: label by expected precipitation, not by
# class number.
snoqualmie.classes[wet.days] <- snoq.k9$mu[day.classes]

plot(0,xlim=c(1,366),ylim=range(snoq.k9$mu),type="n",xaxt="n",
xlab="Day of year",ylab="Expected precipitation (1/100 inch)")
axis(1,at=1+(0:11)*30)
for (year in 1:nrow(snoqualmie.classes)) {
points(1:366,snoqualmie.classes[year,],pch=16,cex=0.2)
}

[[Image:SMHS_MixtureModel_Fig9.png|500px]]

Plot of days classified according to the nine-component mixture. Horizontal axis: days of the year, numbered from 1 to 366 to handle leap years. Vertical axis: expected amount of precipitation on that day according to the most probable class for the day.

# Next line is currently (5 April 2011) used to invoke a bug-patch kindly
# provided by Dr. Derek Young; the patch will be incorporated in the next
# update to mixtools, so should not be needed after April 2011
source("http://www.stat.cmu.edu/~cshalizi/402/lectures/20-mixture-examples/bootcomp.R")
snoq.boot <- boot.comp(snoq,max.comp=10,mix.type="normalmix",
maxit=400,epsilon=1e-2)
# Running this takes about 5 minutes
# The histograms below are produced automatically as a side effect of running boot.comp()

[[Image:SMHS_MixtureModel_Fig10.png|500px]]

Histograms produced by boot.comp(). The vertical red lines mark the observed difference in log-likelihood.

# Contour plot of a bivariate Gaussian density evaluated on a 100 x 100 grid
library(mvtnorm)
x.points <- seq(-3,3,length.out=100)
y.points <- x.points
z <- matrix(0,nrow=100,ncol=100)
mu <- c(1,1)
sigma <- matrix(c(2,1,1,1),nrow=2)
for (i in 1:100) {
for (j in 1:100) {
z[i,j] <- dmvnorm(c(x.points[i],y.points[j]),mean=mu,sigma=sigma)
}
}
contour(x.points,y.points,z)

[[Image:SMHS_MixtureModel_Fig11.png|500px]]

4) Applications

4.1) This article (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture) demonstrated an activity on mixture modeling and expectation maximization (EM) applied to the problem of 2D point cluster segmentation. It illustrated ways to use EM and mixture modeling to obtain a cluster classification of points in 2D using the SOCR charts activity and the SOCR modeler, with specific examples.

4.2) This article (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1) presented the SOCR activity that demonstrates random sampling and fitting of mixture models to data. The SOCR mixture-model applet demonstrates how unimodal distributions come together as building blocks to form the backbone of many complex processes, allows computing probabilities and critical values for these mixture distributions, and enables inference on such complicated processes.

4.3) This article (http://www.sciencedirect.com/science/article/pii/S0167947301000469) presented a mixture model approach for the analysis of microarray gene expression data. Microarrays have emerged as powerful tools allowing investigators to assess the expression of thousands of genes in different tissues and organisms. Statistical treatment of the resulting data remains a substantial challenge. Investigators using microarray expression studies may wish to answer questions about the statistical significance of differences in expression of any of the genes under study, avoiding false positive and false negative results. This paper developed a sequence of procedures involving finite mixture modeling and bootstrap inference to address these issues in studies involving many thousands of genes and illustrated the use of these techniques with a dataset involving calorically restricted mice.

4.4) This article (http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=979899) is concerned with estimating a probability density function of human skin color, using a finite Gaussian mixture model, whose parameters are estimated through the EM algorithm. Hawkins' statistical test on the normality and homoscedasticity (common covariance matrix) of the estimated Gaussian mixture models is performed and McLachlan's bootstrap method is used to test the number of components in a mixture. Experimental results show that the estimated Gaussian mixture model fits skin images from a large database. Applications of the estimated density function in image and video databases are presented.

5) Software
http://cran.r-project.org/web/packages/mixtools/vignettes/mixtools.pdf
http://www.stat.washington.edu/mclust/
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/mixture-examples.R

6) Problems

6.1) Write a function to simulate from a Gaussian mixture model. Check if it works by comparing a density estimated on its output to the theoretical density.

6.2) Work through the E-step and M-step for a mixture of two Poisson distributions.

6.3) Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K=3 Gaussians. How well does the code assign data points to components when the actual Gaussian parameters are given as the initial guess, and how does this change when other initial parameters are given?

6.4) Write a function to fit a mixture of exponential distributions using the EM algorithm.

7) References
http://mirlyn.lib.umich.edu/Record/004199238
http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_MixtureModeling}}

File:SMHS MixtureModel Fig11.png

2014-10-16T17:00:09Z

Clgalla:

File:SMHS MixtureModel Fig10.png

2014-10-16T16:59:55Z

Clgalla:

File:SMHS MixtureModel Fig9.png

2014-10-16T16:59:41Z

Clgalla:

File:SMHS MixtureModel Fig8.png

2014-10-16T16:59:24Z

Clgalla: