SMHS ResamplingSimulation - Revision history

Glenbrau: /* Generating random observations using inverse CDF functions */

2015-03-30T20:55:55Z

‎Generating random observations using inverse CDF functions

Glenbrau: /* Simulation */

2015-03-30T20:20:44Z

‎Simulation

Glenbrau: /* Permutation test */

2015-03-30T20:04:47Z

‎Permutation test

Glenbrau: /* Cross-validation */

2015-03-30T19:56:07Z

‎Cross-validation

Glenbrau: /* Jackknife */

2015-03-30T15:35:22Z

‎Jackknife

Glenbrau: /* Bootstrapping */

2015-03-30T15:26:28Z

‎Bootstrapping

Glenbrau: /* Resampling methods */

2015-03-30T14:26:24Z

‎Resampling methods

Glenbrau: /* Motivation */

2015-03-30T14:20:39Z

‎Motivation

Glenbrau: /* Overview */

2015-03-30T14:19:10Z

‎Overview

Dinov: /* Simulation examples */

2014-09-24T15:41:25Z

‎Simulation examples

@@ Line 125: / Line 125: @@
 ====Generating random observations using inverse CDF functions====
-There are [http://en.wikipedia.org/wiki/Random_number_generation alternative methods for generating random observations] from a specified probability distribution (e.g., normal, exponential, or gamma distribution). One simple one technique uses the inverse CDF of the distribution and random uniform (0,1) data. Suppose the cumulative distribution function (CDF) of a probability distribution we are interested in sampling from, has a closed form analytical expression and is invertable. Then we generate a random sample from that distribution by evaluating the inverse CDF at $u$, where $u \sim U(0,1)$. This is possible as a continuous CDF, $F$, is a one-to-one mapping of the domain of the CDF into the interval $(0,1)$. Therefore, if $U$ is a uniform random variable on $(0,1)$, then $X = F^{–1}(U)$ has the distribution $F$. This reason for this fact is that if $U \sim Uniform[0,1]$, then $P(F^{-1}(U) \leq x)= P(U \leq F(x))$, by applying $F$ to both sides of the inequality, since $F$ is monotonic. Thus, $P(F^{-1}(U) \leq x)= F(x)$, since $P(U \leq u) = u$ for uniform random variables.
+There are [http://en.wikipedia.org/wiki/Random_number_generation alternative methods for generating random observations] from a specified probability distribution (e.g., normal, exponential, or gamma distribution). One simple technique uses the inverse CDF of the distribution and random uniform $(0,1)$ data. Suppose the cumulative distribution function (CDF) of a probability distribution from which we are interested in sampling has a closed-form analytical expression and is invertible. Then we generate a random sample from that distribution by evaluating the inverse CDF at $u$, where $u \sim U(0,1)$. This is possible since a continuous, strictly increasing CDF, $F$, is a one-to-one mapping of the domain of the CDF into the interval $(0,1)$. Therefore, if $U$ is a uniform random variable on $(0,1)$, then $X = F^{-1}(U)$ has the distribution $F$. The reason for this fact is that if $U \sim Uniform[0,1]$, then $P(F^{-1}(U) \leq x)= P(U \leq F(x))$, by applying $F$ to both sides of the inequality, since $F$ is monotonic. Thus, $P(F^{-1}(U) \leq x)= F(x)$, since $P(U \leq u) = u$ for uniform random variables.
 * Example: Consider the inverse CDF sampling technique (aka inverse transformation algorithm) for sampling from a [[AP_Statistics_Curriculum_2007_Exponential|standard exponential distribution]], $Exp(1)$. The exponential distribution has probability density $f(x) = e^{-x}$, $x \geq 0$, and CDF: $F(x) = 1 - e^{-x}$. The inverse CDF can be computed analytically by solving for $x$ in the equation $F(x) = u$, which gives $F^{-1}(u) = -\ln{(1-u)}$. Hence, if $U \sim U(0,1)$, then $-\ln{(1-U)} \sim Exp(1)$.
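The $Exp(1)$ example above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name <code>sample_exp1</code> and the <code>seed</code> argument are assumptions made for this sketch and are not part of the original activity.

```python
import math
import random


def sample_exp1(n, seed=None):
    """Draw n observations from Exp(1) using the inverse CDF F^{-1}(u) = -ln(1 - u)."""
    rng = random.Random(seed)  # hypothetical seed argument, for reproducibility only
    # rng.random() returns u in [0, 1), so 1 - u lies in (0, 1] and the log is defined.
    return [-math.log(1.0 - rng.random()) for _ in range(n)]


sample = sample_exp1(100_000, seed=1)
# The sample mean should be close to 1, the mean of Exp(1).
print(sum(sample) / len(sample))
```

The same pattern applies to any distribution whose inverse CDF has a closed form: draw $u$ from $U(0,1)$ and evaluate $F^{-1}(u)$.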

@@ Line 111: / Line 111: @@
 A common assumption is that the coefficients we are trying to estimate come from a probability distribution. With large enough sample sizes, according to the central limit theorem (CLT), this distribution is multivariate normal.
 The goal of simulation is to make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients.
-*Steps: (1) Choose a quality index (QI), e.g., expected value, predicted probability, odds ratio, first difference, etc.; (2) Set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (3) calculate the QI with each set of simulated coefficients; (4) set the variable to a new value; (5) calculate that QI with each set of the simulated coefficients, (6) repeat as appropriate; (7) efficiently summarize the distribution of the computed QI at each value of the variable of interest.
+*Steps:
 *Advantages: Simulation provides more information than a table of regression outputs. It accounts for uncertainty in the QI and is flexible to many different types of models, QIs and variable specifications. After performing it once, it is easy to use and can be much easier than working with analytic solutions.

@@ Line 97: / Line 97: @@
 *Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when $n$ is small. It is more computationally demanding than calculating in-sample measures, and it is subject to the researcher’s selection of an appropriate fit statistic.
-====Permutation test====
+====Permutation Test====
-Permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is just another form of resampling but is done without replacement.
+A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is just another form of resampling, but is done without replacement.
 *Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.
-*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ respectively and we want to test, at a 5% significance level, whether they come form the same distribution. $n_A$ and $n_B$ are the sample size for each group. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distribution. The test proceeds as follows:
+*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$, respectively, and that we want to test whether they come from the same distribution at a 5% significance level. $n_A$ and $n_B$ are the sample sizes of the two groups, and a permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: that the two groups have identical probability distributions. The test proceeds as follows:
-# the difference in the means between group A and B is calculated,
+# The difference in the means between groups A and B is calculated; this is the observed value of the test statistic, $T(obs)$.
-# difference in sample means is calculated and recorded for each possible way of dividing these pooled values into two groups of size $n_A$ and $n_B$. The set of these calculated differences if the exact distribution of possible differences under the null hypothesis that group label does not matter,
+# The observations of groups A and B are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of size $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter.
-# the one-side p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than or equal to $T (obs)$; the two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than or equal to $ABS(T(obs))$;
+# The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than or equal to $T(obs)$; the two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than or equal to $|T(obs)|$.
-# sort the recorded differences and then observe if $T(obs)$ is contained within the middle 95% of them, if not, reject $H_0$ at 5% significance level.
+# Sort the recorded differences and then observe if $T(obs)$ is contained within the middle 95% of them. If it is not, reject $H_0$ at the 5% significance level.
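The permutation test described above can be sketched as follows. This is a minimal illustration that enumerates every split of the pooled data, which is only feasible for small groups; for larger samples one would typically draw a random subset of permutations instead. The function name <code>permutation_test</code> and the example data are assumptions made for this sketch.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sample permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    # Observed test statistic T(obs).
    t_obs = mean(group_a) - mean(group_b)
    diffs = []
    # Every way of relabelling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diffs.append(mean(a) - mean(b))
    one_sided = sum(d >= t_obs for d in diffs) / len(diffs)
    two_sided = sum(abs(d) >= abs(t_obs) for d in diffs) / len(diffs)
    return t_obs, one_sided, two_sided


# Hypothetical data: two small groups with clearly separated values.
t_obs, p_one, p_two = permutation_test([10, 12, 14], [1, 2, 3])
print(t_obs, p_one, p_two)
```

With three observations per group there are only 20 possible splits, so the smallest attainable one-sided p-value is 1/20.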
 ====Simulation====

@@ Line 77: / Line 77: @@
 Cross-validation (CV) is a statistical method for validating a predictive model by assessing it on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as ''validating sets''; a model is fit to the remaining data (i.e., ''training set'') and used to predict the validating set. Averaging the quality of the predictions across the validating sets yields an overall measure of prediction accuracy.
-*Steps: (1) Randomly partition the data into a training set and a validating set, (2) fit the model to the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average to reduce variability.
+*Steps:
 *'''Types of CV''':
-**''Leave-one-out CV'': This is an iterative method in which the number of iterations = sample size, each observation becomes the validating set one time. Steps: 1) Delete observation #1 from the data, 2) fit the model to observations #2-n, 3) apply the coefficients form step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model to observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
+**''Leave-one-out CV'': This is an iterative method in which the number of iterations equals the sample size, so that each observation serves as the validating set exactly once. Steps:
-**''K-fold cross-validation'' splits the data into K subsets and each is held out in turn as the validation set. This avoids self-influence. This influence is similar to they way in which in regression analysis methods, such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.
+#Delete observation #1 from the data.
-*Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when n is small. It is more computationally demanding than calculating in-sample measures. It is subject to the researcher’s selection of an appropriate fit statistic.
+*''K-fold cross-validation'' splits the data into K subsets and each is held out in turn as the validation set, which helps to avoid self-influence. This influence is similar to the way that in regression analysis methods like linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.
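K-fold and leave-one-out cross-validation for a simple linear regression can be sketched as below. This is only an illustration: the helper names <code>fit_line</code> and <code>k_fold_mse</code>, the choice of mean squared error as the fit measure, and the example data are all assumptions made for this sketch. The folds are formed by position, so the observations are assumed to be in random order already.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def k_fold_mse(xs, ys, k):
    """Mean squared prediction error over k folds; k = len(xs) gives leave-one-out CV."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th observation forms one fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        # Predict each held-out y value without using that observation in the fit.
        for i in sorted(held_out):
            squared_errors.append((ys[i] - (intercept + slope * xs[i])) ** 2)
    return sum(squared_errors) / n


# Hypothetical data for illustration.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 3.5, 4.8, 7.4, 8.9, 11.3]
print(k_fold_mse(xs, ys, k=3))        # 3-fold CV
print(k_fold_mse(xs, ys, k=len(xs)))  # leave-one-out CV
```

Because each prediction is made without the observation being predicted, the cross-validated error is typically larger than the in-sample error, which reflects the self-influence described above.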
 ====Permutation test====

@@ Line 64: / Line 64: @@
 The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.
-*Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.
+*Jackknife estimates of variance converge asymptotically to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.
 *It relies on the independence of the data. Extensions of the jackknife to allow for dependencies in the data have been proposed.
 *Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.
-*Limitations: The jackknife is less general than the bootstrap and thus, is used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples and it does not perform well with small samples because you cannot generate many resamples.
+*Limitations: The jackknife is less general than the bootstrap and is therefore used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples, and it does not perform well with small samples because one cannot generate many resamples.
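The leave-one-out jackknife described above can be sketched as follows, using the standard delete-one formulas for the bias and standard error estimates. The function name <code>jackknife</code> and the example data are assumptions made for this sketch.

```python
import math
from statistics import mean, pvariance


def jackknife(sample, statistic):
    """Delete-one jackknife estimates of the bias and standard error of a statistic."""
    n = len(sample)
    theta_hat = statistic(sample)
    # Recompute the statistic n times, leaving out one observation each time.
    replicates = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    theta_bar = sum(replicates) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    std_error = math.sqrt((n - 1) / n * sum((t - theta_bar) ** 2 for t in replicates))
    return bias, std_error


data = [1, 2, 3, 4, 5]  # hypothetical data for illustration
print(jackknife(data, mean))       # bias and standard error of the sample mean
print(jackknife(data, pvariance))  # the plug-in variance has a non-zero bias estimate
```

For the sample mean, the jackknife standard error coincides with the usual $s/\sqrt{n}$, and subtracting the estimated bias from the plug-in variance recovers the unbiased sample variance.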
 ==== Cross-validation====

@@ Line 22: / Line 22: @@
 Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.
-*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.
+*Situations where bootstrapping applies:
-*Boostrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inferences is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.
+#When the theoretical distribution of a statistic of interest is complicated or unknown
-*The basic idea of bootstrapping is that inference about a population based on sample data (sample &rarr; population) can be modeled by resampling the data and performing inference on this new sample (resample &rarr; sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.
+*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.
-*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.
+*The basic idea of bootstrapping is that inference about a population based on sample data (sample &rarr; population) can be modeled by resampling the data and performing inference on this new sample (resample &rarr; sample). Bootstrapping treats the true probability distribution of the original data as being analogous to the empirical distribution of the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.
-*Key features of the bootstrap: The draws must be independent; that is, each observation in the observed sample must have an equal chance of being selected. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement, if not, then every simulated sample of size N would be identical and the same as the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.
+*Common process:
-*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.
+*Key features of the bootstrap:
-*Advantages: Boostrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is appropriate to control and check the stability of the results.
+*Types of bootstrap schemes:
 *Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
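The common bootstrap process (case resampling with replacement, repeated many times) can be sketched as below. This is a minimal illustration: the function names <code>bootstrap</code> and <code>percentile_interval</code>, the simple percentile interval, and the example data are assumptions made for this sketch, and other interval constructions exist.

```python
import random
from statistics import median, stdev


def bootstrap(sample, statistic, n_resamples=1000, seed=None):
    """Return the statistic computed on n_resamples resamples of size N drawn with replacement."""
    rng = random.Random(seed)  # hypothetical seed argument, for reproducibility only
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_interval(estimates, level=0.95):
    """Simple percentile interval taken from the sorted bootstrap estimates."""
    ordered = sorted(estimates)
    alpha = (1 - level) / 2
    lower = ordered[int(alpha * len(ordered))]
    upper = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lower, upper


# Hypothetical observed sample.
observed = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]
estimates = bootstrap(observed, median, n_resamples=1000, seed=42)
print(stdev(estimates))                # bootstrap estimate of the standard error of the median
print(percentile_interval(estimates))  # approximate 95% interval for the median
```

Each resample has the same size as the observed sample, so some observations appear more than once in a given resample while others do not appear at all.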

@@ Line 15: / Line 15: @@
 ===Theory===
 ==== Resampling methods====
-Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.
+Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined Data Generating Process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.
-*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.
+* Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.
 ====Bootstrapping====

@@ Line 11: / Line 11: @@
 ===Motivation===
-Imagine we want to evaluate the quality of a system or process, but data on the process is very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have few data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model.  We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with necessary background in resampling and simulation.
+Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set (e.g., it follows a normal distribution), then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a few data points from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to provide students with the necessary background in resampling and simulation.
 ===Theory===

@@ Line 2: / Line 2: @@
 ===Overview===
-In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. [http://en.wikipedia.org/wiki/Resampling_(statistics) ''Resampling''] is any of a variety of methods in which the following processes are implemented: (1) estimating the precision of sample statistics (i.e., medians, percentiles) by using subsets of available data (i.e., ''jackknifing'') or drawing randomly with replacement from a set of data (i.e., ''bootstrapping''); (2) exchanging labels on data points when performing significance tests (i.e., ''permutation tests''); or (3) validating models by using random subsets (i.e., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real world processes or systems over time. We usually apply simulation after a model, which represents the key characteristics of the process, is developed. Simulation is widely applied in many contexts such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult and/or costly to apply, and it provides us with an easier way to obtain data about the system or test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.
+In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. [http://en.wikipedia.org/wiki/Resampling_(statistics) ''Resampling''] is any of a variety of methods in which the following processes are implemented:
 ===Motivation===

@@ Line 116: / Line 116: @@
   sd(bivariateData[[2]])
   plot(bivariateData[[1]],bivariateData[[2]])
 * [[SOCR_EduMaterials_Activities_RNG|Random Number Simulation using SOCR]]