# SMHS ResamplingSimulation

## Scientific Methods for Health Sciences - Resampling and Simulation

### Overview

In statistics, resampling and simulation are two important concepts with wide applications in research and projects from various fields. Resampling is any of a variety of methods in which the following processes are implemented:

1. Estimating the precision of sample statistics (i.e., medians, percentiles) by using subsets of available data (i.e., jackknifing) or drawing randomly with replacement from a set of data (i.e., bootstrapping)
2. Exchanging labels on data points when performing significance tests (i.e., permutation tests)
3. Validating models by using random subsets (i.e., bootstrapping, cross-validation)

We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. Simulation involves imitating real world processes or systems over time. We usually apply simulation after a model (which represents the key characteristics of the process) is developed. Simulation is widely applied in many contexts such as simulation of technology for performance optimization, testing, and video games. It is often applied when the real system is not accessible or is difficult and/or costly to apply, and it provides us with an easier way to obtain data about the system or test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

### Motivation

Imagine we want to evaluate the quality of a system or process, but data on the process is very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set (e.g., it follows a normal distribution), then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have few data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with necessary background in resampling and simulation.

### Theory

#### Resampling methods

Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined Data Generating Process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

• Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

#### Bootstrapping

Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

• Situations where bootstrapping applies:
1. When the theoretical distribution of a statistic of interest is complicated or unknown
2. When the sample size is insufficient for straightforward statistical inference
3. When power calculations have to be performed, and a small pilot sample is available
• Boostrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.
• The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). Bootstrapping treats the true probability distribution of the original data as being analogous to the empirical distribution of the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.
• Common process:
1. Begin with an observed sample of size $N$.
2. Generate a simulated sample of size $N$ by drawing observations from your observed sample independently and with replacement.
3. Compute and save the statistic of interest.
4. Repeat this process many times (e.g., 1000).
5. Treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.
• Key features of the bootstrap:
• The draws must be independent; that is, each observation in the observed sample must have an equal chance of being selected.
• The simulated sample must be of size $N$ to take full advantage of the information in the sample.
• Resampling must be done with replacement; if not, then every simulated sample of size $N$ would be identical and the same as the original sample.
• Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.
• Types of bootstrap schemes:
1. Case resampling (using the Monte Carlo algorithm)
2. Estimating the distribution of the sample means
3. Regression
4. Bayesian bootstrap
5. Smooth bootstrap
6. Parametric bootstrap
7. Resampling residuals
8. Gaussian process regression bootstrap
9. Wild bootstrap
10. Block bootstrap
• Advantages: Boostrapping is simple and is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution. It is appropriate to control and check the stability of the results.
• Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.

#### Jackknife

The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

• Jackknife estimates of variance asymptotically tend to be the true values almost surely. The jackknife is consistent for sample means, sample variances, etc.
• The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
• It is dependent on the independence of the data. Extensions of the jackknife to allow for dependencies in the data have been proposed.
• Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.
• Limitations: The jackknife is less general than the bootstrap and is therefore used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples, and it does not perform well with small samples because one cannot generate many resamples.

#### Cross-validation

Cross-validation (CV) is a statistical method for validating a predictive model by assessing a statistical model using a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (i.e., training set) and used to predict the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

• Steps:
1. Randomly partition the data into a training set and a validating set.
2. Fit the model to the training set.
3. Take the parameter estimates from that model and use them to calculate a measure of fit on the testing set.
4. Repeat several times and average to reduce variability.
• Types of CV:
• Leave-one-out CV: This is an iterative method in which the number of iterations = sample size, each observation becomes the validating set one time. Steps:
1. Delete observation #1 from the data.
2. Fit the model to observations #2-n.
3. Apply the coefficients form step #2 to observation #1 and calculate the chosen fit measure.
4. 4Delete observation #2 from the data.
5. Fit the model to observations #1 and #3-n.
6. Apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure.
7. Repeat until all observations have been deleted once.
• K-fold cross-validation splits the data into K subsets and each is held out in turn as the validation set, which helps to avoid self-influence. This influence is similar to the way that in regression analysis methods like linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.
• Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when $n$ is small. It is more computationally demanding than calculating in-sample measures, and it is subject to the researcher’s selection of an appropriate fit statistic.

#### Permutation Test

A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is just another form of resampling, but is done without replacement.

• Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.
• Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$, respectively, and that we want to test whether they come from the same distribution at a 5% significance level. $n_A$ and $n_B$ are the sample size for each group, and a permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: that the two groups have identical probability distribution. The test proceeds as follows:
1. The difference in the means between group A and B is calculated.
2. The difference in sample means is calculated and recorded for each possible way of dividing these pooled values into two groups of size $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that group label does not matter.
3. The one-side p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than or equal to $T (obs)$; the two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than or equal to $ABS(T(obs))$;
4. Sort the recorded differences and then observe if $T(obs)$ is contained within the middle 95% of them. If it is not, reject $H_0$ at 5% significance level.

#### Simulation

A common assumption is that the coefficients we are trying to estimate come from a probability distribution. With large enough sample sizes, according to the central limit theorem (CLT), this distribution is multivariate normal. The goal of simulation is to make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients.

• Steps:
1. Choose a quality index (QI), e.g., expected value, predicted probability, odds ratio, first difference, etc.
2. Set a key variable in the model to a theoretically interesting value and the rest to their means or modes.
3. Calculate the QI with each set of simulated coefficients.
4. Set the variable to a new value.
5. Calculate that QI with each set of the simulated coefficients.
6. Repeat as appropriate.
7. Efficiently summarize the distribution of the computed QI at each value of the variable of interest.
• Advantages: Simulation provides more information than a table of regression outputs. It accounts for uncertainty in the QI and is flexible to many different types of models, QIs and variable specifications. After performing it once, it is easy to use and can be much easier than working with analytic solutions.
• Limitations: It relies on the CLT to justify asymptotic normality. In contrast, a fully Bayesian model using MCMC could produce an exact finite-sample distribution and bootstrapping would require no distributional assumptions. It can be computationally intense and large models can produce great uncertainty regarding quantities of interest.

#### Generating random observations using inverse CDF functions

There are alternative methods for generating random observations from a specified probability distribution (e.g., normal, exponential, or gamma distribution). One simple technique uses the inverse CDF of the distribution and random uniform $(0,1)$ data. Suppose the cumulative distribution function (CDF) of a probability distribution from which we are interested in sampling has a closed form analytical expression and is invertable. Then we generate a random sample from that distribution by evaluating the inverse CDF at $u$, where $u \sim U(0,1)$. This is possible since a continuous CDF, $F$, is a one-to-one mapping of the domain of the CDF into the interval $(0,1)$. Therefore, if $U$ is a uniform random variable on $(0,1)$, then $X = F^{–1}(U)$ has the distribution $F$. This reason for this fact is that if $U \sim Uniform[0,1]$, then $P(F^{-1}(U) \leq x)= P(U \leq F(x))$, by applying $F$ to both sides of the inequality, since $F$ is monotonic. Thus, $P(F^{-1}(U) \leq x)= F(x)$, since $P(U \leq u) = u$ for uniform random variables.

• Example: Consider the inverse CDF sampling technique (aka inverse transformation algorithm) for sampling from a standard exponential distribution, $Exp(1)$. The exponential distribution has probability density $f(x) = e^{–x}$, $x ≥ 0$, and CDF: $F(x) = 1 – e^{–x}$. The inverse CDF can be explicitly computed analytically by solving for $x$ in the equation $F(x) = u$, $F^{-1}(u) = –\ln{(1–u)} \sim Exp(1)$.

### Simulation examples

• Sampling with/without replacement in R:
names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
N<-length(names)
sample(names,N,replace=F)
[1] "William" "Kate"    "Mike"    "Ann"     "Jef"     "Tom"     "Rose"
[8] "Jack"    "Tim"     "Alfred"

sample(names,N,replace=T)
[1] "Mike"    "Rose"    "William" "Rose"    "Jef"     "Mike"    "Jack"
[8] "Rose"    "Tom"     "Ann"

• RNG using R
n = 10000  # number of observations
x = rnorm(n)
y = rbinom(n, 1, .25)
unif = runif(n)
y[unif < .2] = NA    # make 20% of the Y's be missing
ds = cbind(y, x)
plot(x,y)

• Bivariate simulation in R
rbivariate <- function(mean.x = 0,sd.x=1,mean.y=10,sd.y=4,r=.50,iter=1000) {
x1 <- rnorm(iter)
x2 <- rnorm(iter)
x <- sqrt(1-r^2)*sd.x*x1 + r*sd.x*x2 + mean.x
y <- sd.y*x2 + mean.y
return(list(x,y))
}

bivariateData <- rbivariate(iter=1000)
mean(bivariateData1)
sd(bivariateData1)
mean(bivariateData2)
sd(bivariateData2)
plot(bivariateData1,bivariateData2)
#finally you can write out these simulation results
write.csv(bivariateData, "/location/file.csv") # to write the data out to a file "file.csv"
write.csv(c(bivariateData1,bivariateData2), row.names=FALSE)      # to write the data directly into the R-shell


### Applications

• This article presents an application of the Central Limit Theorem using the SOCR applet for a demonstration activity. This article described an innovative effort at using technological tools for improving student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It included four experiments to demonstrate the assumptions, meaning and implication of the CLT as well as a hands-on simulation and a number of examples illustrating the theory and application of the CLT.
• This article, entitled "Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models," provides an overview of simple and multiple mediation and explored three approaches that can be used to investigate indirect process. It also presents methods for contrasting two or more mediators within a single model via examples. The paper presents an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction as well as discussing software applications of these methods using SAS, SPSS macros, etc.
• This article presents a resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities and video is also provided for reference.
• This article presents an experiment on sampling distributions using the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT via an experiment. The sampling distribution CLT experiment provides a simulation accessible to the public that demonstrates characteristics of various sample statistics and the CLT and empirically demonstratse that the sample average is unique. The article helps users develop a better understanding of the two topics and apply the topics to various types of activities by explaining concepts such as the native distribution, sample distribution and numerical parameter estimate.

### Problems

1. Go over the examples in article 4.2 and 4.3.
2. Simulate a time-series, $S_t$, representing 356 days of measurement of a cognitive or physiological response of a human in 1 year timeframe. Suppose $S_{t}$ satisfies: $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, where $Z \sim Normal(0,1)$, $S_0=36, \sigma=2$ and $v=0.01$. Can you identify 2 distinct sets of parameters $(S_0, \sigma, v)$ that generate very different kinds of time-series responses?
3. Now suppose you bought a call on this stock with strike price 40. Based on your simulated data, what percentage of days would you profit from exercising the call option? (This is the percentage of days your simulated $S_t$ is greater than 40).