Difference between revisions of "SMHS PowerSensitivitySpecificity"
Revision as of 15:34, 10 May 2016
Scientific Methods for Health Sciences - Statistical Power, Sample-Size, Sensitivity and Specificity
Overview:
In statistics, we have many ways to evaluate and choose tests or models. In this lecture, we are going to introduce some commonly used methods to describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. This lecture will present background knowledge of these concepts and illustrate their power and application through examples.
Motivation:
Experiments, models and tests are fundamental to the field of statistics. All researchers are faced with the question of how to choose appropriate models and set up tests. We are interested in studying some of the most commonly used characteristics, including power, effect size, sensitivity and specificity. Focusing on these characteristics will greatly help us to choose an appropriate model and understand the results. We must consider questions such as: What would be a reasonable sample size to balance the trade-off between cost and efficiency? What is the probability that a test will reject a false null hypothesis? What is the test’s ability to correctly accept a true null hypothesis or to correctly reject a false null hypothesis?
Theory
Type I Error, Type II Error and Power
- Type I error: the false positive (Type I) error of rejecting the null hypothesis given that it is actually true; e.g., the purses are detected to contain the radioactive material while they actually do not.
- Type II error: the false negative (Type II) error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., the purses are detected to not contain the radioactive material while they actually do.
- Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
- Test specificity: the ability of a test to correctly accept (fail to reject) a true null hypothesis, $=\frac{TN}{TN+FP}$.
- Test sensitivity: the ability of a test to correctly reject a false null hypothesis, i.e., to detect the condition when the alternative hypothesis is true, $=\frac{TP}{TP+FN}$.
- The table below gives an example of calculating specificity, sensitivity, False Positive Rate $\alpha$, False Negative Rate $\beta$ and power from the four cell proportions TN, FN, FP and TP.
| | | Actual Condition: Absent ($H_0$ is true) | Actual Condition: Present ($H_1$ is true) |
| Test Result | Negative (fail to reject $H_0$) | Condition absent + Negative result = True (accurate) Negative (TN, 0.98505) | Condition present + Negative result = False (invalid) Negative (FN, 0.00025); Type II error (proportional to $\beta$) |
| Test Result | Positive (reject $H_0$) | Condition absent + Positive result = False Positive (FP, 0.00995); Type I error (proportional to $\alpha$) | Condition present + Positive result = True Positive (TP, 0.00475) |
| Test Interpretation | $Power = 1-\frac{FN}{FN+TP}= 0.95$ | Specificity: TN/(TN+FP) = 0.98505/(0.98505+0.00995) = 0.99 | Sensitivity: TP/(TP+FN) = 0.00475/(0.00475+0.00025) = 0.95 |
Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\frac{TP}{TP+FN}$, $\alpha=\frac{FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power $=1-\beta$. Note that both the Type I ($\alpha$) and Type II ($\beta$) errors are proportions in the range [0,1], so they represent error rates. They are listed in the corresponding cells of the table because they are directly proportional to the numerical values of FP and FN, respectively.
Note that the two alternative definitions of power are equivalent:
- power$=1-\beta$, and
- power=sensitivity
This is because power=$1-\beta=1-\frac{FN}{FN+TP}=\frac{FN+TP}{FN+TP} - \frac{FN}{FN+TP}=\frac{TP}{FN+TP}=$ sensitivity.
Sample size
Sample size is the number of observations or replicates included in a statistical sample. It is an important feature of any empirical study that aims to make inference about a population. In complicated studies, there may be several different sample sizes involved: for example, in a survey using stratified sampling, there may be a different sample size for each stratum.
- Factors influencing sample size: the expense of data collection and the need to have sufficient statistical power.
- Ways to choose sample sizes: (1) expedience, e.g., a simple experiment where the sample data are readily available or convenient to collect; even then, the size of the sample is crucial in avoiding wide confidence intervals or risks of errors in statistical hypothesis testing; (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.
- Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations, the increase in precision for larger sample sizes is minimal, or even nonexistent. This can result from the presence of systematic error or strong dependence in the data, or from data that follow a heavy-tailed distribution. Sample size is judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.
- Choose the sample size based on our expectation of other measures.
- Consider the simple experiment of flipping a coin, where the estimator of a proportion is $\hat{p}=\frac{X}{n}$ and $X$ is the number of heads in $n$ flips. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Under this approximation, about $95\%$ of the distribution’s probability lies within 2 standard deviations of the mean. Using the Wald method for the binomial distribution with the most conservative variance bound $p(1-p)\le 0.25$, an interval of the form $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms a (conservative) 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then we set $4\sqrt{\frac{0.25}{n}}=W$ and solve for $n$, which gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=\frac{W}{2}$ is the error bound on the estimate, i.e., the estimate is usually given as within $\pm B$. Hence, if $B=0.1$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.
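The rule $n=\frac{1}{B^2}$ is easy to check numerically. The sketch below is only an illustration: the function name `sample_size_for_proportion` is hypothetical, and it assumes the conservative variance bound 0.25 and the "2 standard deviations" approximation used above.

```python
import math

def sample_size_for_proportion(error_bound):
    """Smallest n such that 2*sqrt(0.25/n) <= error_bound, i.e. n >= 1/B^2."""
    # A tiny tolerance guards against floating-point round-off before rounding up.
    return math.ceil(1.0 / error_bound**2 - 1e-9)

n_ten_percent = sample_size_for_proportion(0.10)   # error bound B = 10%
n_five_percent = sample_size_for_proportion(0.05)  # error bound B = 5%
print(n_ten_percent, n_five_percent)  # 100 400
```

Halving the error bound quadruples the required sample size, because $n$ grows with the inverse square of $B$.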
Proportion
A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $\sigma^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. By the CLT, an approximate 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$. If we wish the confidence interval to be $W$ units wide, we set $\frac{4\sigma}{\sqrt n}=W$ and solve for $n$, which gives $n=\frac{16\sigma^2}{W^2}$.
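As a quick numerical illustration of $n=\frac{16\sigma^2}{W^2}$, the sketch below (with the hypothetical function name `sample_size_for_mean`) also shows that plugging in the worst-case standard deviation of a proportion, $\sigma=0.5$, reproduces the result $n=\frac{4}{W^2}$ for proportions.

```python
import math

def sample_size_for_mean(sigma, width):
    """Smallest n for which the approximate 95% CI (mean +/- 2*sigma/sqrt(n))
    is at most `width` units wide: n >= 16*sigma^2 / width^2."""
    # A tiny tolerance guards against floating-point round-off before rounding up.
    return math.ceil(16.0 * sigma**2 / width**2 - 1e-9)

# Worst-case proportion (sigma = 0.5) with total CI width W = 0.2, i.e. B = 0.1
n_proportion = sample_size_for_mean(sigma=0.5, width=0.2)
# A generic mean with an assumed sigma = 2 and a CI one unit wide
n_generic = sample_size_for_mean(sigma=2.0, width=1.0)
print(n_proportion, n_generic)  # 100 64
```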
Hypothesis tests
Sample size for hypothesis tests: Let $X_i, i=1,2,…,n$ be independent observations taken from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Consider the null hypothesis vs. the alternative hypothesis: $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$, where $\mu^*>0$. We wish to:
- (1) reject $H_0$ with a probability of at least $1-\beta$ when $H_a$ is true,
- (2) reject $H_0$ with probability $\alpha$ when $H_0$ is true.
For (2), we need $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha$, so "reject $H_0$ if the sample average exceeds $\frac{z_\alpha\sigma}{\sqrt n}$" is a decision rule that satisfies (2). Here $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution (this is a one-sided test). For (1), the same rule must reject with probability at least $1-\beta$ when $H_a$ is true, that is, when the sample average comes from a normal distribution with the different mean $\mu^*$.
Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta$. Recall that $\bar{x} \sim N(\mu, \frac{\sigma^2}{n})$, so under $H_a$, $\bar{x} \sim N(\mu^*, \frac{\sigma^2}{n})$ and, standardizing, $\frac{\bar{x} -\mu^*}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1)$. The rejection event $\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}$ is equivalent to $\frac{\bar{x} -\mu^*}{\frac{\sigma}{\sqrt{n}}} > z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}$, which has probability $1-\Phi\left(z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}\right)=\Phi\left(\frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}-z_{\alpha}\right)$, where $\Phi$ is the standard normal cumulative distribution function. Requiring this probability to be at least $1-\beta$ gives $\frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}-z_{\alpha} \ge \Phi^{-1}(1-\beta)$, which can be solved for $\sqrt{n}$ (or $n$). Thus, $n \ge \left( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}}\right)^{2}$.
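The sample-size bound for this one-sided z-test can be evaluated with Python's standard library. This is a minimal sketch, not a general-purpose power calculator: it assumes the power of the test is $\Phi\left(\frac{\mu^*\sqrt{n}}{\sigma}-z_\alpha\right)$, so that $n \ge \left(\frac{(z_\alpha+\Phi^{-1}(1-\beta))\sigma}{\mu^*}\right)^2$. The function name `z_test_sample_size` and the example values ($\mu^*=0.5$, $\sigma=1$) are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_sample_size(mu_star, sigma, alpha=0.05, power=0.80):
    """Smallest n for a one-sided z-test of H0: mu = 0 vs Ha: mu = mu_star > 0
    with known sigma, significance level alpha and target power (1 - beta)."""
    std_normal = NormalDist()                 # standard normal N(0, 1)
    z_alpha = std_normal.inv_cdf(1 - alpha)   # upper alpha percentage point
    z_power = std_normal.inv_cdf(power)       # Phi^{-1}(1 - beta)
    return math.ceil(((z_alpha + z_power) * sigma / mu_star) ** 2)

# Illustrative (made-up) inputs: detect a mean of 0.5 when sigma = 1
n_required = z_test_sample_size(mu_star=0.5, sigma=1.0, alpha=0.05, power=0.80)
print(n_required)  # 25
```

A smaller standardized effect $\frac{\mu^*}{\sigma}$, a smaller $\alpha$, or a higher target power each increase the required $n$.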
Effect size
Effect size is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as the p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. Effect sizes in the variance-explained family, such as $r^2$, $\eta^2$ and $\omega^2$ below, estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.
Other common measures
- Pearson $r$ (correlation): an effect size used when paired quantitative data are available, for instance if one were studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between the two variables.
- Coefficient of determination, $r^2$: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.
- Eta-squared, $\eta^2$: describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to $r^2$. It is a biased estimator of the variance explained by the model in the population. $\eta^2=\frac{SS_{treatment}}{SS_{total}}$.
- Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment} \times MS_{error}}{SS_{total}+MS_{error}}$. Because it is less biased, $\omega^2$ is preferable to $\eta^2$; however, it can be more inconvenient to calculate for complex analyses.
- Cohen’s $f^2$: one of several effect size measures used in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$, where $R^2$ is the squared multiple correlation.
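The variance-explained measures above can be computed directly from the sums of squares of a one-way ANOVA. The sketch below uses a small made-up data set with three groups. The function name `anova_effect_sizes` is hypothetical, and Cohen's $f^2$ is computed with $R^2=\eta^2$, which is an assumption appropriate only for this single-factor setting.

```python
def anova_effect_sizes(groups):
    """Return eta^2, omega^2 and Cohen's f^2 for a one-way ANOVA layout.

    `groups` is a list of lists, one list of observations per treatment group.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)

    # Total and between-group (treatment) sums of squares
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_treatment = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    ss_error = ss_total - ss_treatment

    df_treatment = len(groups) - 1
    df_error = len(all_values) - len(groups)
    ms_error = ss_error / df_error

    eta_sq = ss_treatment / ss_total
    omega_sq = (ss_treatment - df_treatment * ms_error) / (ss_total + ms_error)
    f_sq = eta_sq / (1 - eta_sq)   # uses R^2 = eta^2 (one-way design)
    return {"eta_sq": eta_sq, "omega_sq": omega_sq, "f_sq": f_sq}

# Made-up illustrative data: three treatment groups of three observations each
effects = anova_effect_sizes([[4, 5, 6], [6, 7, 8], [8, 9, 10]])
print(effects)
```

For this toy data set $SS_{treatment}=24$, $SS_{total}=30$ and $MS_{error}=1$, so $\eta^2=0.8$ and $\omega^2=\frac{22}{31}\approx 0.71$, illustrating that $\omega^2$ is the more conservative of the two.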
Power Analysis/Psychometrics
Questions:
• Why are sample sizes important and how do they influence our ability to design experiments tailored to detect effects of a certain magnitude? Recall the first (LLN) and the second (CLT) fundamental laws of probability theory.
• How to assess the reliability and validity of newly introduced psychometric instruments?
Statistical Power Analysis
Power analysis is a technique employed in many experimental designs to determine the relation between the sample size of the study, specific effect sizes, and the ability to detect these effects from sample data given its size and a degree of confidence. In a nutshell, the statistical power is a proxy for the probability of detecting an effect (signal, discrepancy, misalignment, etc.) of a given size, provided we start with a given level of confidence and some sample size constraints. If this probability is very low (e.g., power below 0.5), the study may be under-powered, which may require a redesign or reformulation of the experiment. If the power is very high (e.g., power above 0.999), then we can trade off various parameters in the power calculations (e.g., lower the sample size but still preserve high power to detect effects).
Important components of power calculations:
1. The sample size (either investigator controlled, or estimated)
2. The effect size (must come from outside of the study; we can’t use the same data to estimate it. E.g., prior publications, previous reports of expected effects, etc.)
3. The significance level α = P(Type I error) = probability of finding a (spurious) effect that is not there in reality, but is due to random chance alone
4. The statistical power 1-β = 1 - P(Type II error) = probability of finding an effect that is there. In some specific experimental designs (but not always), given any three of these components, we can determine the fourth. For instance, we can estimate the (minimum) sample size that yields a power of at least, say, 1-β = 0.8.
• Type I error: the false positive (Type I) error of rejecting the null hypothesis given that it is actually true; e.g., the purses are detected to contain the radioactive material while they actually do not.
• Type II error: the false negative (Type II) error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., the purses are detected to not contain the radioactive material while they actually do.
• Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
• Test specificity (ability of a test to correctly accept a true null hypothesis).
• Test sensitivity (ability of a test to correctly reject a false null hypothesis).
The table in the Theory section above gives an example of calculating specificity, sensitivity, False Positive Rate α, False Negative Rate β, and power from the four cell proportions TN, FN, FP and TP.
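To illustrate how fixing three of the four components determines the fourth, the following sketch works with the one-sided z-test with known variance described in the Theory section. This is an assumption made for simplicity, since other designs need other formulas. The standardized effect size is taken to be $\frac{\mu^*}{\sigma}$, and the function names `power_given` and `effect_size_given` are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)

def power_given(n, effect_size, alpha):
    """Power of the one-sided z-test for sample size n, standardized
    effect size (mu*/sigma) and significance level alpha."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(effect_size * math.sqrt(n) - z_alpha)

def effect_size_given(n, alpha, power):
    """Smallest standardized effect size detectable with the given
    sample size, significance level and power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return (z_alpha + STD_NORMAL.inv_cdf(power)) / math.sqrt(n)

# Power grows with sample size for a fixed effect size and alpha
for n in (10, 25, 50, 100):
    print(n, round(power_given(n, effect_size=0.5, alpha=0.05), 3))

# Minimum detectable effect for n = 25, alpha = 0.05, power = 0.8
min_effect = effect_size_given(n=25, alpha=0.05, power=0.80)
print(round(min_effect, 3))
```

In this setting a study with very high power for the anticipated effect could reduce its sample size, while an under-powered one must increase $n$, accept a larger detectable effect, or relax $\alpha$.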
Applications
- This article, titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviewed the importance of sample size in clinical trials and presented a general method from which specific equations are derived for sample size determination and analysis of power for a wide variety of statistical procedures. The paper discussed the method in detail, with illustrations for the t test, tests for proportions, tests for survival time and tests for correlations that commonly occur in clinical trials.
- This article advocates presenting measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrated the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focused on standardized effect size statistics and extensively discussed two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and also because their calculations are essential for meta-analysis. The paper provided potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or heterogeneous variances, and (4) when data are non-independent.
- This article reviewed methods of sample size and power calculation for the most common study designs. It presents two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrates the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.
Software
Problems
Other things being equal, which of the following actions will reduce the power of a hypothesis test?
I. Increasing sample size. II. Increasing significance level. III. Increasing beta, the probability of a Type II error.
- (A) I only
- (B) II only
- (C) III only
- (D) All of the above
- (E) None of the above
Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?
I. The power of the hypothesis test. II. The effect size of the hypothesis test. III. The probability of making a Type II error.
- (A) I only
- (B) II only
- (C) III only
- (D) All of the above
- (E) None of the above
Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
| | | Actual Condition: Absent ($H_0$ is true) | Actual Condition: Present ($H_1$ is true) |
| Test Result | Negative (fail to reject $H_0$) | 0.983 | 0.0025 |
| Test Result | Positive (reject $H_0$) | 0.0085 | 0.0055 |
Suppose we are running a test on a simple experiment where the population standard deviation is $0.06$ and $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a Type I error rate of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?
References
- SOCR Home page: http://www.socr.umich.edu