SMHS ResamplingSimulation

From SOCR
Revision as of 12:48, 31 August 2014 by Jslavine (talk | contribs) (Jackknife)

Scientific Methods for Health Sciences - Resampling and Simulation

Overview

In statistics, resampling and simulation are two important concepts with wide applications in research across many fields. Resampling refers to any of a variety of methods for: (1) estimating the precision of sample statistics (e.g., medians, percentiles) by using subsets of the available data (jackknifing) or by drawing randomly with replacement from the data (bootstrapping); (2) exchanging labels on data points when performing significance tests (permutation tests); or (3) validating models on random subsets of the data (bootstrapping, cross-validation). We introduce several common resampling techniques, including bootstrapping, jackknifing, cross-validation and permutation tests.

Simulation involves imitating a real-world process or system over time. It is usually applied after a model representing the key characteristics of the process has been developed, and it is widely used in contexts such as performance optimization, testing and video games. Simulation is especially valuable when the real system is inaccessible, or when experimenting on it directly would be difficult or costly, because it provides an easier way to obtain data about the system or to test it. We present an introduction to simulation, including its basic methods, applications, advantages and limitations.

Motivation

Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without collecting new samples? If we know the characteristics of the data, for example that they follow a normal distribution, then we could easily generate a series of normally distributed values and use these to test the system. In fact, we can easily generate a large amount of data and test the system with greater power. Consider another case in which, instead of knowing the exact characteristics of the data, we have a small amount of data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data, develop a model, and then generate a new data set from the model we developed. A popular example is the bootstrapping method used in interest rate modeling. To learn more about resampling and simulation methods, we introduce the fundamental concepts, rules and methodologies commonly applied in these fields, preparing students with the necessary background in resampling and simulation.

Theory

Resampling methods

Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

  • Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

Bootstrapping

Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

  • Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.
  • Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, such as the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.
  • The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.
  • Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.
  • Key features of the bootstrap: The draws must be independent; that is, each observation in the observed sample must have an equal chance of being selected. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would be identical to the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.
  • Types of bootstrap schemes: (1) Case resampling using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.
  • Advantages: Bootstrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it also provides a way to control and check the stability of the results.
  • Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
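The common process above can be sketched in Python (the article's Software examples use R; the data values and seed here are hypothetical). The sketch estimates the standard error and a 95% percentile confidence interval for the sample median:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility

# Step (1): a hypothetical observed sample of size N
sample = np.array([12.1, 9.8, 14.3, 10.2, 11.7, 13.5, 9.4, 12.9, 10.8, 11.2])
n = len(sample)
n_boot = 1000

# Steps (2)-(4): draw N observations with replacement, record the median each time
boot_medians = np.array([
    np.median(rng.choice(sample, size=n, replace=True))
    for _ in range(n_boot)
])

# Step (5): the spread of the bootstrap medians estimates the sampling variability
se_median = boot_medians.std(ddof=1)
ci_lower, ci_upper = np.percentile(boot_medians, [2.5, 97.5])
print(se_median, ci_lower, ci_upper)
```

The same template works for any statistic: replace `np.median` with the estimator of interest.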

Jackknife

The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

  • Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.
  • The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
  • It is dependent on the independence of the data. Extensions of the jackknife to allow for dependences in the data have been proposed.
  • Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.
  • Limitations: The jackknife is less general than the bootstrap and thus, is used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples and it does not perform well with small samples because you cannot generate many resamples.
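A minimal Python sketch of the leave-one-out recomputation, using hypothetical data and the sample mean as the statistic. For the mean, the jackknife bias is exactly zero and the jackknife standard error reduces to the familiar $s/\sqrt{n}$, which makes the output easy to check:

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])  # hypothetical data
n = len(x)
theta_hat = x.mean()  # statistic of interest: the sample mean

# Leave-one-out replicates: recompute the statistic with each observation removed
theta_i = np.array([np.delete(x, i).mean() for i in range(n)])
theta_bar = theta_i.mean()

# Jackknife estimates of bias and standard error
bias = (n - 1) * (theta_bar - theta_hat)
se = np.sqrt((n - 1) / n * np.sum((theta_i - theta_bar) ** 2))
print(bias, se)
```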

Cross-validation

Cross-validation (CV) is a statistical method for validating a predictive model, assessing a statistical model on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

  • Steps: (1) randomly partition the data into a training set and a testing set, (2) fit the model on the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average the results to reduce variability.
  • Types of CV:
    • Leave-one-out CV: an iterative method with number of iterations equal to the sample size; each observation serves as the validation set exactly once. Steps: 1) delete observation #1 from the data, 2) fit the model on observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model on observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until every observation has been deleted once.
    • K-fold cross-validation: splits the data into K subsets, each of which is held out in turn as the validation set. This avoids self-influence: in a regression method such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is, whereas cross-validation predicts the y value for each observation without using that observation.
  • Limitations of CV: training and testing data must be random samples from the same population; will show biggest differences from in-sample measures when n is small; higher computational demand than calculating in-sample measures; subject to researcher’s selection of an appropriate fit statistic.
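The K-fold scheme can be sketched in Python. The data below are simulated from a hypothetical linear model, and simple linear regression is fit on the training folds via np.polyfit; the fit statistic is mean squared prediction error on the held-out fold:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: y depends linearly on x plus noise
x = rng.uniform(0, 10, size=40)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=40)

K = 5
indices = rng.permutation(len(x))       # random partition into K folds
folds = np.array_split(indices, K)

mse_per_fold = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # Fit simple linear regression on the K-1 training folds
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    # Evaluate prediction error on the held-out fold
    pred = slope * x[test_idx] + intercept
    mse_per_fold.append(np.mean((y[test_idx] - pred) ** 2))

cv_mse = np.mean(mse_per_fold)          # average across validation sets
print(cv_mse)
```

Each observation is predicted exactly once, by a model that never saw it during fitting.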

Permutation test

Permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is just another form of resampling but is done without replacement.

  • Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.
  • Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ and sample sizes $n_A$ and $n_B$, respectively, and we want to test, at a 5% significance level, whether the two groups come from the same distribution. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$ that the two groups have identical probability distributions. The test proceeds as follows: (1) calculate the observed difference in means between groups A and B, $T(obs)$; (2) pool the observations, then calculate and record the difference in sample means for each possible way of dividing the pooled values into two groups of size $n_A$ and $n_B$; the set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is the proportion of permutations in which the difference in means was greater than or equal to $T(obs)$, and the two-sided p-value is the proportion in which the absolute difference was greater than or equal to $|T(obs)|$; (4) alternatively, sort the recorded differences and check whether $T(obs)$ lies within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
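The procedure can be sketched in Python. Enumerating every possible division of the pooled values gives the exact permutation distribution; the sketch below instead samples a large number of random relabelings, which approximates it (the group measurements are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
group_a = np.array([23.1, 25.4, 22.8, 26.0, 24.5, 23.9])  # hypothetical measurements
group_b = np.array([21.0, 22.3, 20.5, 23.1, 21.8])

t_obs = group_a.mean() - group_b.mean()   # step (1): observed difference in means
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)    # step (2): rearrange the labels
    diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    if abs(diff) >= abs(t_obs):           # step (3): two-sided comparison
        count += 1

p_value = count / n_perm                  # approximate two-sided p-value
print(t_obs, p_value)
```

Note the resampling is done without replacement: each permutation reuses every pooled observation exactly once.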

Simulation

A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With large enough sample size, according to the CLT this distribution is multivariate normal.

  • Steps: (1) the goal of simulation is to make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients, (2) next, choose a quantity of interest (QI), say an expected value, predicted probability, odds ratio or first difference, (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes, (4) calculate the QI with each set of simulated coefficients, (5) set the key variable to a new value, (6) calculate the QI again with each set of simulated coefficients, (7) repeat as appropriate, and (8) efficiently summarize the distribution of the computed QI at each value of the variable.
  • Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; is flexible across many different types of models, QIs and variable specifications; is easy to reuse after the first application; and can be much easier than working with analytic solutions.
  • Limitations: relies on the CLT to justify asymptotic normality (a fully Bayesian model using MCMC could produce an exact finite-sample distribution; bootstrapping would require no distributional assumption); computationally intensive; large models can produce substantial uncertainty around the quantity of interest.
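The simulation steps above can be sketched in Python. The coefficient estimates and covariance matrix below are hypothetical stand-ins for actual regression output, and the QI is the expected value of the outcome at several values of the key variable:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical regression output: point estimates and their covariance matrix
beta_hat = np.array([0.5, 1.2])          # intercept and slope
vcov = np.array([[0.04, -0.01],
                 [-0.01, 0.09]])

# Step (1): draw many hypothetical coefficient vectors from the multivariate normal
sims = rng.multivariate_normal(beta_hat, vcov, size=1000)

# Steps (3)-(7): compute the expected value of y at several values of x
for x_val in (0.0, 1.0, 2.0):
    qi = sims[:, 0] + sims[:, 1] * x_val      # QI under each simulated draw
    lo, hi = np.percentile(qi, [2.5, 97.5])   # step (8): summarize its distribution
    print(x_val, qi.mean(), lo, hi)
```

The percentile interval at each x value conveys the uncertainty around the QI, which a table of coefficients alone would not.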

Applications

  • This article presents an application of the Central Limit Theorem using the new SOCR applet and a demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method harnesses the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. Four experiments demonstrate the assumptions, meaning and implications of the CLT, together with hands-on simulations and a number of examples illustrating its theory and application.
  • The article titled ‘Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models’ provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model through examples. The paper gives an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, as well as software implementations of these methods, including SAS and SPSS macros.
  • This article presents the resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concepts of resampling- and randomization-based statistical inference, and demonstrate the similarities and differences between parametric and resampling-based statistical inference. The article provides specific steps to implement the activities, and a video is also provided for reference.
  • This article describes an experiment on sampling distributions and the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates the characteristics of various sample statistics and the CLT, and empirically shows the special role of the sample average. The article helps users develop a better understanding of these topics and apply them to various activities involving concepts such as a native distribution, a sampling distribution and a numerical parameter estimate.

Software

  • Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> sample(names,N,replace=F)
[1] "William" "Kate"    "Mike"    "Ann"     "Jef"     "Tom"     "Rose"   
[8] "Jack"    "Tim"     "Alfred"

> sample(names,N,replace=T)
[1] "Mike"    "Rose"    "William" "Rose"    "Jef"     "Mike"    "Jack"   
[8] "Rose"    "Tom"     "Ann"

Problems

  1. Go over the examples in articles 4.2 and 4.3.
  2. Simulate the stock closing price $S_t$ on 252 trading days, where $S_t$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$, $v=0.01\%$.
  3. Now suppose you bought a call option on this stock with strike price 40. Using your simulation data, what percentage of days could you profit from exercising the call option? (That is, the percentage of days on which $S_t$ is greater than 40.)
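One possible approach to problems 2 and 3, sketched in Python. This draws an independent Z for each day, matching the marginal distribution in the formula; simulating a single correlated price path is another reasonable reading of the exercise:

```python
import numpy as np

rng = np.random.default_rng(2014)

s0, sigma, v = 36.0, 0.02, 0.0001   # parameters from the exercise (2% and 0.01% as decimals)
t = np.arange(1, 253)               # 252 trading days
z = rng.normal(0, 1, size=252)      # one standard normal draw per day

s_t = s0 * np.exp(v * t + sigma * np.sqrt(t) * z)

strike = 40.0
pct_profitable = np.mean(s_t > strike) * 100   # percentage of days with S_t > 40
print(pct_profitable)
```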


