SMHS ResamplingSimulation
Scientific Methods for Health Sciences - Resampling and Simulation
Overview
In statistics, resampling and simulation are two important concepts with wide application in research and projects across many fields. Resampling refers to any of a variety of methods that do one of the following: (1) estimate the precision of sample statistics (medians, percentiles) by using subsets of the available data (jackknifing) or by drawing randomly with replacement from a set of data (bootstrapping); (2) exchange labels on data points when performing significance tests (permutation tests); (3) validate models by using random subsets (bootstrapping, cross-validation). We introduce some common resampling techniques, including bootstrapping, jackknifing, cross-validation and permutation tests. Simulation is the imitation of the operation of a real-world process or system over time. We usually apply simulation after a model that represents the key characteristics of the process has been developed. Simulation is widely applied in many contexts, such as simulating technology for performance optimization, testing and video games. It is often used when the real system is inaccessible, or is hard or costly to operate, and it provides an easier way to obtain data or exercise the system for testing purposes. We present an introduction to simulation, including the basic methods, applications, advantages and limitations.
Motivation
Suppose we want to evaluate the quality of a system or process, but the data are very hard to collect. How can we evaluate it without actually taking samples from the system? If we know the characteristics of the data, say that they follow a normal distribution, then we can easily generate a series of values from a normal distribution and use these to test the system. In fact, we can generate as many data sets as we like and test the system more thoroughly. Consider another case where, instead of knowing the exact characteristics of the data, we only have a few observations from the last few years that follow a certain pattern. Here, we can use these observations to work out the characteristics of the data and generate new data sets from the model we develop. A popular example is the bootstrapping method in the interest rate model. To support learning about resampling and simulation methods, we introduce the fundamental concepts, rules and methodologies commonly applied in these fields, giving students the necessary background in resampling and simulation.
Theory
Resampling methods
Resampling methods use a computer to generate a large number of simulated samples; patterns in these samples are then summarized and analyzed. In resampling methods, however, the simulated samples are drawn from the existing sample of data in hand and not from a theoretically defined data-generating process (DGP). Thus the researcher does not know or control the DGP, but the goal of learning about the DGP remains.
- Principles: the assumption is that there is some population DGP that remains unobserved and that this DGP produced the one sample of data in hand; all information about the population contained in the original sample is also contained in the distribution of the simulated samples. We draw a new ‘sample’ of data that consists of a different mix of the cases in the original sample, and repeat this many times so that we have many new simulated ‘samples’. Equivalently, if the sample in hand is a reasonable representation of the population, then the distribution of parameter estimates produced by running a model on a series of resampled data sets provides a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.
Bootstrapping
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter such as a median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls in the broader class of resampling methods.
- Situations where bootstrapping applies: (1) when the theoretical distribution of a statistic of interest is complicated or unknown; (2) when the sample size is insufficient for straightforward statistical inference; (3) when power calculations have to be performed, and a small pilot sample is available.
- It is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, such as the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.
- The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference about the sample from the resampled data (resample → sample). More formally, the bootstrap works by treating inference of the true probability distribution, given the original data, as being analogous to inference of the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the empirical distribution. If the empirical distribution is a reasonable approximation to the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.
- Common process: (1) begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from the observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (say 1000), (5) treat the distribution of the saved statistics as an estimate of the sampling distribution of that statistic.
- Key features of the bootstrap: the draws must be independent, and each observation in the observed sample must have an equal chance of being selected; the simulated sample must be of size N to take full advantage of the information in the sample; resampling must be done with replacement, because otherwise every simulated sample of size N would be identical to the original sample (and to every other simulated sample); resampling with replacement means that in any given simulated sample, some cases might appear more than once while others do not appear at all.
- Types of bootstrap scheme: (1) case resampling: the Monte Carlo algorithm; (2) estimating the distribution of sample mean; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.
- Advantages: it is simple and straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.
- Limitations: it does not provide general finite-sample guarantees; its apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis, assumptions that would be stated more formally in other approaches.
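The common process listed above can be sketched in a few lines of R. This is a minimal illustration of case resampling, not a definitive implementation: the exponential data, the sample size of 50, the choice of the median as the statistic, and the names `boot_medians`, `boot_se` and `boot_ci` are all assumptions made for the example.

```r
# Nonparametric (case-resampling) bootstrap of the sample median.
# The data below are simulated purely for illustration (an assumption).
set.seed(1)                      # fixed seed so the sketch is reproducible
x <- rexp(50, rate = 1 / 10)     # step (1): hypothetical observed sample
N <- length(x)
B <- 1000                        # number of bootstrap replicates

# Steps (2)-(4): draw N values with replacement, recompute the statistic,
# and repeat B times.
boot_medians <- replicate(B, median(sample(x, N, replace = TRUE)))

# Step (5): use the replicates to approximate the sampling distribution.
boot_se <- sd(boot_medians)                         # bootstrap standard error
boot_ci <- quantile(boot_medians, c(0.025, 0.975))  # percentile 95% interval

print(boot_se)
print(boot_ci)
```

The percentile interval is only one of several bootstrap interval constructions; it is used here because it follows directly from step (5).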
Jackknife
The jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to recompute the statistic systematically, leaving out one or more observations at a time from the sample. From this new set of replicates of the statistic, an estimate of the bias and an estimate of the variance of the statistic can be calculated.
- The jackknife estimate of variance tends asymptotically to the true value almost surely. The jackknife is consistent for sample means, sample variances, and similar statistics.
- Jackknife is not consistent for the sample median. In the case of a unimodal variate the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
- It relies on the independence of the data. Extensions of the jackknife that allow for dependence in the data have been proposed.
- Advantages: good at detecting outliers and influential cases. The sub-sample estimates that differ most from the rest indicate the cases that have the most influence on the estimates in the original full-sample analysis.
- Limitations: The jackknife is less general than the bootstrap, and thus used less frequently; it does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples; it does not perform well in small samples because you don’t end up generating many resamples.
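As a hedged sketch of the leave-one-out jackknife described above, the following R code computes the bias and standard error estimates for the sample mean and flags the most influential case. The simulated normal data and all variable names are assumptions for illustration; the formulas are the standard delete-one jackknife estimators.

```r
# Delete-one jackknife for the sample mean (illustrative data, an assumption).
set.seed(2)
x <- rnorm(30, mean = 5, sd = 2)   # hypothetical sample
n <- length(x)
theta_hat <- mean(x)               # statistic on the full sample

# Recompute the statistic n times, leaving out one observation each time.
theta_loo <- sapply(seq_len(n), function(i) mean(x[-i]))
theta_bar <- mean(theta_loo)

# Standard jackknife estimates of bias and standard error.
jack_bias <- (n - 1) * (theta_bar - theta_hat)
jack_se   <- sqrt((n - 1) / n * sum((theta_loo - theta_bar)^2))

# The case whose removal changes the estimate most is the most influential.
most_influential <- which.max(abs(theta_loo - theta_hat))

print(c(bias = jack_bias, se = jack_se))
print(most_influential)
```

For the sample mean, the jackknife standard error coincides with the usual `sd(x) / sqrt(n)` and the bias estimate is zero up to rounding, which makes the mean a convenient check before applying the same pattern to other smooth statistics.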
Cross-validation
Cross-validation (CV) is a statistical method for validating a predictive model, assessing a statistical model on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.
- Steps: (1) randomly partition the available data into a training set and a testing set, (2) fit the model on the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average to reduce variability.
- Types of CV:
- Leave-one-out CV: an iterative method with number of iterations = sample size, in which each observation serves as the testing set exactly once. Steps: 1) delete observation #1 from the data, 2) fit the model on observations #2-n, 3) apply the coefficients from step 2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model on observations #1 and #3-n, 6) apply the coefficients from step 5 to observation #2 and calculate the chosen fit measure, 7) repeat until every observation has been deleted once.
- K-fold cross-validation: splits the data into K subsets, each of which is held out in turn as the validation set. This avoids self-influence. For comparison, in a regression analysis method such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.
- Limitations of CV: training and testing data must be random samples from the same population; will show biggest differences from in-sample measures when n is small; higher computational demand than calculating in-sample measures; subject to researcher’s selection of an appropriate fit statistic.
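The following R sketch illustrates K-fold cross-validation for a simple linear regression, with leave-one-out CV obtained as the special case K = n. The simulated data, the helper name `cv_mse`, and the choice of mean squared error as the fit measure are assumptions made for this example.

```r
# K-fold cross-validation of a linear model (illustrative data, an assumption).
set.seed(3)
n <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n)      # hypothetical linear relationship

# cv_mse() is a hypothetical helper: it returns the average out-of-sample
# mean squared error across K folds.
cv_mse <- function(data, K) {
  # Randomly assign each row to one of K folds of (nearly) equal size.
  folds <- sample(rep(seq_len(K), length.out = nrow(data)))
  fold_mse <- sapply(seq_len(K), function(k) {
    train <- data[folds != k, ]          # fit on everything except fold k
    test  <- data[folds == k, ]          # hold fold k out for validation
    fit   <- lm(y ~ x, data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(fold_mse)
}

mse_5fold <- cv_mse(dat, K = 5)
mse_loo   <- cv_mse(dat, K = n)          # K = n gives leave-one-out CV
mse_in    <- mean(residuals(lm(y ~ x, data = dat))^2)  # in-sample MSE

print(c(in_sample = mse_in, five_fold = mse_5fold, leave_one_out = mse_loo))
```

Comparing `mse_in` with the cross-validated values shows the self-influence effect: the in-sample measure is optimistic because each observation helped fit the line used to predict it.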
Permutation test
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but one that is done without replacement.
- Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.
- Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. $n_A$ and $n_B$ are the sample sizes of the two groups. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the difference in means between group A and group B is calculated; this is the observed test statistic $T(obs)$, (2) the observations of both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of size $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter, (3) the one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than or equal to $T(obs)$; the two-sided p-value is calculated as the proportion of sampled permutations where the absolute difference was greater than or equal to $|T(obs)|$, (4) equivalently, sort the recorded differences and check whether $T(obs)$ lies within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
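A minimal R sketch of this two-group test follows. Because enumerating every possible split is usually infeasible, it samples random relabelings instead (a Monte Carlo approximation to the exact permutation distribution). The simulated groups, the number of relabelings and all variable names are assumptions for illustration.

```r
# Monte Carlo permutation test for a difference in means.
# Both groups are simulated purely for illustration (an assumption).
set.seed(4)
a <- rnorm(20, mean = 10, sd = 2)    # hypothetical group A
b <- rnorm(25, mean = 11, sd = 2)    # hypothetical group B
pooled <- c(a, b)
n_a <- length(a)

# Step (1): observed difference in means.
t_obs <- mean(a) - mean(b)

# Step (2): relabel at random (sampling WITHOUT replacement) many times.
B <- 5000
t_perm <- replicate(B, {
  idx <- sample(length(pooled), n_a)         # indices relabelled as "A"
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Step (3): p-values as proportions of permuted differences.
p_one_sided <- mean(t_perm >= t_obs)
p_two_sided <- mean(abs(t_perm) >= abs(t_obs))

# Step (4): reject H0 at 5% if t_obs falls outside the middle 95%.
limits <- quantile(t_perm, c(0.025, 0.975))
reject <- t_obs < limits[[1]] || t_obs > limits[[2]]

print(c(one_sided = p_one_sided, two_sided = p_two_sided))
print(reject)
```

Some authors add one to both the numerator and the denominator of the p-value so that it can never be exactly zero; the plain proportion is used here to match the description above.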
Simulation
A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With a large enough sample size, the central limit theorem (CLT) implies that this distribution is approximately multivariate normal.
- Steps: (1) make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients, (2) choose a quantity of interest (QI), such as an expected value, predicted probability, odds ratio or first difference, (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes, (4) calculate the QI with each set of simulated coefficients, (5) set the key variable to a new value, (6) calculate the QI again with each set of simulated coefficients, (7) repeat as appropriate, (8) efficiently summarize the distribution of the computed QI at each value of the key variable.
- Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; flexible to many different types of models, QIs and variable specifications; easy to reuse after doing it once; can be much easier than working with analytic solutions.
- Limitations: relies on the CLT to justify asymptotic normality (a fully Bayesian model using MCMC could produce an exact finite-sample distribution; bootstrapping would require no distributional assumption); computational intensity; large models can produce a lot of uncertainty around the quantity of interest.
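The simulation procedure described above can be sketched in R as follows. Everything specific here is an assumption made for illustration: the simulated logistic-regression data, the choice of the predicted probability as the QI, the three values of the key variable `x1`, and the use of a Cholesky factor to draw from the multivariate normal distribution.

```r
# Simulating a quantity of interest (QI) from estimated coefficients.
# Data and model are hypothetical; only base R and stats are used.
set.seed(5)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.4))
dat$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * dat$x1 + 0.6 * dat$x2))
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

beta_hat <- coef(fit)    # point estimates
V <- vcov(fit)           # estimated covariance matrix of the estimates

# Draw S coefficient vectors from N(beta_hat, V): rows of Z %*% chol(V)
# have covariance V, and sweep() adds the mean to each column.
S <- 1000
Z <- matrix(rnorm(S * length(beta_hat)), nrow = S)
sim_beta <- sweep(Z %*% chol(V), 2, beta_hat, "+")

# QI: predicted probability at several values of x1, with x2 at its mode.
x1_values <- c(-1, 0, 1)
x2_mode <- as.numeric(names(which.max(table(dat$x2))))
qi <- sapply(x1_values, function(v) {
  profile <- c(1, v, x2_mode)               # intercept, x1, x2
  as.vector(plogis(sim_beta %*% profile))   # one QI per simulated draw
})

# Summarize the QI distribution at each x1 value (median and 95% interval).
qi_summary <- apply(qi, 2, quantile, probs = c(0.025, 0.5, 0.975))
colnames(qi_summary) <- paste0("x1=", x1_values)
print(qi_summary)
```

Each column of `qi_summary` describes the uncertainty in the QI at one value of the key variable, which is the extra information a plain table of regression output does not show.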
Applications
- This article presents an application of the Central Limit Theorem (CLT) using the new SOCR applet and demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulation and a number of examples illustrating the theory and application of the CLT.
- This article, titled Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models, provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents, through examples, methods for contrasting two or more mediators within a single model. The paper gives an illustrative example that assesses and contrasts potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and it covers software implementations of these methods, including SAS and SPSS macros.
- This article presents the resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling and randomization based statistical inference, and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities, and a video is also provided for reference.
- This article is an experiment on sampling distributions and the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply them to various types of activities, including the concepts of a native distribution, a sample distribution and a numerical parameter estimate.
Software
Sampling with/without replacement in R:
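> # Ten labels to draw from
> names <- c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N <- length(names)
> # Without replacement: a random permutation, every name appears exactly once
> sample(names, N, replace=FALSE)
 [1] "William" "Kate"    "Mike"    "Ann"     "Jef"     "Tom"     "Rose"
 [8] "Jack"    "Tim"     "Alfred"
> # With replacement: names can repeat and some can be missing (output varies by run)
> sample(names, N, replace=TRUE)
 [1] "Mike"    "Rose"    "William" "Rose"    "Jef"     "Mike"    "Jack"
 [8] "Rose"    "Tom"     "Ann"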
Problems
6.1) Go over the examples in articles 4.2 and 4.3.
6.2) Do the exercise of simulating the stock closing price \(S_t\) on 252 trading days, where \(S_t\) satisfies: \(S_t = S_0 \exp(vt + \sigma\sqrt{t}\,Z)\), \(Z \sim \text{Normal}(0,1)\), with \(S_0 = 36\), \(\sigma = 2\%\), \(v = 0.01\%\).
6.3) Now suppose you bought a call option on this stock with strike price 40. Using your simulated data, on what percentage of days could you profit from exercising the call option? (That is, the percentage of days on which \(S_t\) is greater than 40.)
- SOCR Home page: http://www.socr.umich.edu