# SMHS ModelFitting

## Scientific Methods for Health Sciences - Model Fitting and Model Quality (KS-test)

### Overview

Kolmogorov-Smirnov Test (K-S test) is a nonparametric test commonly applied in various fields to test on the equality of continuous, one-dimensional probability distribution that can be used to compare one sample with a reference probability distribution, which is commonly referred to one-sample K-S test, or to compare two samples, which is referred to as two-sample K-S test. The K-S test is used to determine if two datasets differ significantly. In this lecture, we are going to present a general introduction to K-S test and illustrate its application with examples. The implementation of K-S test in the statistical package R will also be discussed with specific examples.

### Motivation

When we talk about testing the equality of two dataset, the first idea came into our mind may always be the simple t-test. However, there are situations where it is a mistake to trust the result of a t-test. Consider the situation where the control and treatment groups do not differ in mean, but only in some other way. Consider two dataset with same mean and very different variations, in this situation, the t-test cannot see the difference. Situations where the treatment and control groups are smallish datasets (say less than 20) that differ in mean but substantial non-normal distribution masks the difference. Consider two datasets drawn from lognormal distributions that differ substantially in mean, in this situation, the t-test also fails. Among all those situations, K-S test would be the answer. How does the K-S test work?

### Theory

1) K-S test: a nonparametric test commonly applied in various fields to test on the equality of continuous, one-dimensional probability distribution that can be used to compare one sample with a reference probability distribution or to compare two samples. The K-S test quantifies a distance between the empirical distribution function of the sample and the cdf of reference distribution or between the empirical distribution functions of two samples. The null hypothesis is that the samples are drawn from the same distribution or that the sample is drawn from the reference distribution.

• K-S test is sensitive to difference both in location and shape of the empirical cdf of the samples and is widely used as the nonparametric methods for comparing two samples.
• The empirical distribution function $F_{n}$ for n i.i.d. observations $X_{i}: F_{n} (x)=1/n \sum_{i=1}^{n} I_{X_{i} \leq x},$ where $I_{X_{i} \leq x}$ is the indicator function, which equals to 1 when $X_{i} \leq x$ and equals to $0$ otherwise.
• The K-S statistic for a given $cdf F(x)$ is $D_{n}=sup_{x}|F_{n}(x)-F(x)|,$ where sup{x} is the supremum of the set of distances. By the Gilvenko-Cantelli therorem, if the sample comes from distribution $F(x)$, then $D_{n}$ converges to $0$ almost surely in the limit when $n$ goes to infinity.
• Kolmogorov distribution (the distribution of the random variable): $K=sup_{t\in [0,1]} |B(t)|,$ where $B(t)$ is the Brownian bridge. The $c.d.f$ of $K$ is given: $Pr(K \leq x)=1-2 \sum_{k=1}^{\infty}(-1)^{k-1}e^{-2k^{2} x^{2}}=\frac {\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty}e^{-(2k-1)^{2} \pi^{2} \setminus (8x^{2})}.$ Under the null hypothesis, the sample comes from the hypothesized distribution $F(x), \sqrt{n} D_{n} \overset{n\rightarrow \infty}{\rightarrow} sup_{t} |B(F(t))|$ in distribution, where $B(t)$ is the Brownian bridge. When $F$ is continuous, $\sqrt{n} D_{n}$ under the null hypothesis converges to the Kolmogorov distribution, which does not depend on $F.$
• The goodness-of-fit test or the K-S test is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level $\alpha$ if $\sqrt{n} D_{n} > K_{\alpha},$ where $K_{\alpha}$ is found from $Pr(K \leq K_{\alpha})=1- \alpha.$ The asymptotic power of this test is $1$.
• Test with estimated parameters: if either the form or the parameters of $F(x)$ are determined from the data $X_{j}$ the critical values determined in this way are invalid. Modifications required to the test statistics and critical values have been proposed.

2) Two-sample K-S test: to test whether two underlying one-dimensional probability distribution differ. The K-S statistic is $D_{n,{n}'}=sup_{x} |F_{1,n}(x)-F_{2,{n}'}(x)|$, where $F_{1,n}$ and $F_{2,{n}'}$ are the empirical distribution functions of the first and second sample respectively. The null hypothesis is rejected at least $\alpha$ if $D_{n,{n}'}>c(\alpha)\sqrt{\frac{n+{n}'}{n{n}'}}$. The value of $c(\alpha)$ is given in the following table for each level of $\alpha.$

 $\alpha$ 0.1 0.05 0.025 0.01 0.005 0.001 c($\alpha$) 1.22 1.36 1.48 1.63 1.73 1.95

Note: two-sample test checks whether the two data samples come from the same distribution. This does not specify what that common distribution is. While the K-S test is usually used to test whether a given $F(x)$ is the underlying probability distribution $F_{n}(x)$, the procedure may be inverted to give confidence limits on $F(x)$ itself. If we choose a critical value of the test statistics $D_{\alpha}$ such that $P(D_{n}>D_{\alpha})=\alpha$, then a band of width $\pm D_{\alpha}$ around $F_{n}(x)$ will entirely contain $F(x)$ with probability $1-\alpha.$

3) Illustration on how the K-S test works with example: consider the two datasets control$={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}$, treatment= ${2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11, 27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19}$.

• Descriptive statistics: to reduce the list of all the data items to a few simpler numbers. For the control group, the mean is 3.607, the median is 0.60, the highest is 50.57, the lowest is 0.08 and the standard deviation is 11.165. Obviously, this dataset is not normally distributed.
• Sort the control data: control sorted$={0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38, 0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26, 1.37, 1.55, 1.75, 3.20, 6.98, 50.57},$ evidently no data lies strictly below $0.08, 5% = 1/20$ of the data is smaller than $0.1$, 10% of the data is strictly smaller than $0.15, 15%$ of the data is strictly smaller than $0.17$, $\cdots.$ There are $17$ data points smaller than $\pi$, and we say that the cumulative fraction of the data smaller than $\pi$ is $17/20 (0.85).$
RCODE:
Control <- c(1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15,
0.49, 0.95 , 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38)
Treatment <- c(2.37, 2.16, 14.82, 1.73, 41.04, 0.23, 1.32, 2.91, 39.41, 0.11,
27.44, 4.51, 0.51, 4.50, 0.18, 14.68, 4.66, 1.30, 2.06, 1.19)
summary(control)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.080   0.315   0.600   3.607   1.415  50.570
sd(control)
[1] 11.16464

• Plot the empirical cumulative plot of the control data. From the chart, we can see that the majority of the data fall in the a small fraction of the plot on the top left, so this is a clear sign that the dataset from control group does not follow a normal distribution.
library(stats)
ecdf(control)
plot(ecdf(control),verticals=TRUE,main='Cumulative Fraction Plot')


### Applications

1)This article presents the SOCR analyses example on Kolmogorov-Smirnoff Test, which compares how distinct two values are. It presents a general introduction to the K-S test and illustrates its application with the Oats example and control-treatment test through the SOCR program.

2) This article presents the SOCR activity which demonstrate the random sampling and fitting of mixture models to data. It starts with a general introduction to the mixture-model distribution and data description, followed with the exploratory data analysis and model fitting in the SOCR environment.

3) This article presented on the Kolmogorov-Smirnov statistical test for the analysis of histograms and discussed about the test for both the two-sample case (comparing fn1(X) to fn2 (X)) and the one-sample case (comparing fn1 (X) to f(X)). Presentation of the specific algorithmic steps involved is done through development of an example where the data are from an experiment discussed elsewhere in this issue. It is shown that the two histograms examined come from two different parent populations at the 99.9% confidence level.

4) This article developed the digital techniques for detecting changes between two Landsat images. A data matrix containing 16 Ã 16 picture elements was used for this purpose. The Landsat imagery data was corrected for sensor inconsistencies and varying sun illumination. The Kolmogorov-Smirnov test (K-S test) was performed between the two corrected data matrices. This test is based on the absolute value of the maximum difference (Dmax) between the two cumulative frequency distributions. The limiting distribution of Dmax is known; thus a statistical decision concerning changes can be evaluated for the region. The K-S test was applied to different test sites. It successfully isolated regions of change. The test was found to be relatively independent of slight misregistration.

### Software

RCODE are attached as in the examples given in this lecture.

### Problems

6.1) Consider the following dataset, control={0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72, 0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09}, treatment={-5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43, 7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50}, use the K-S test to test the equality of the two samples. Include all necessary steps and plots.

6.2) Revise the example given in the lecture with a different control group: control={1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08, 0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24, 1.37, 0.17, 6.98, 0.10, 0.94, 0.38}, run the K-S test again and state clearly your conclusions with necessary plots.

6.3) Two types of plants are in bloom in wood and we want to study if bees prefer one tree to the other? We collect data by using a stop-watch to time how long a bee stays near a particular tree. We begin to time when the bee touches the tree; we stop timing when the bee is more than a meter from the tree. (As a result all our times are at least 1 second long: it takes a touch-and-go bee that long to get one meter from the tree.) We wanted to time exactly the same number of bees for each tree, but it started to rain. (Unequal dataset size is not a problem for the KS-test.) Apply the K-S test with this example. The data are given as below: T1={23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4, 21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6, 1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3, 18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3, 39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1, 19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5} T2={16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1, 24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8, 29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1, 27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9, 18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7, 22.6, 39.1, 26.5, 22.7} Note that this example is based on data distributed according to the Cauchy distribution: a particularly abnormal case and the plots do not look particularly abnormal, however the large number of outliers is a tip off of a non-normal distribution.