SMHS PowerSensitivitySpecificity
Scientific Methods for Health Sciences - Statistical Power, Sample-Size, Sensitivity and Specificity
Overview:
In statistics, there are many ways to evaluate and choose a test or model. In this lecture, we introduce some commonly used measures that describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. These measures help us design and interpret statistical tests and experiments. This lecture presents the background of these concepts and illustrates their use through examples.
Motivation:
Experiments, models and tests are fundamental to the field of statistics, and we have all faced the question of how to set up the right test and how to choose a better model. We are interested in some of the most commonly used measures, including power, effect size, sensitivity and specificity, which greatly help in understanding and choosing a model. So, what would be a reasonable sample size to balance the trade-off between cost and efficiency? What is the probability that the test will reject a false null hypothesis? What is the test’s ability to correctly retain a true null hypothesis (specificity) or to correctly reject a false null hypothesis (sensitivity)?
Theory
Type I Error, Type II Error and Power
- Type I error: the false positive error of rejecting the null hypothesis given that it is actually true; e.g., the purses are detected as containing radioactive material when they actually do not.
- Type II error: the false negative error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., the purses are detected as not containing radioactive material when they actually do.
- Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
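- Test specificity: the ability of a test to correctly retain (fail to reject) a true null hypothesis, $=\frac{d}{b+d}$, where $d$ is the number of true negatives and $b$ is the number of false positives.
- Test sensitivity: the ability of a test to correctly reject the null hypothesis when the alternative hypothesis is true, $=\frac{a}{a+c}$, where $a$ is the number of true positives and $c$ is the number of false negatives.
- The table below gives an example of calculating specificity, sensitivity, false positive rate $\alpha$, false negative rate $\beta$ and power, given the values of TN, FN, FP and TP.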
| Test Result | Actual Condition: Absent ($H_0$ is true) | Actual Condition: Present ($H_1$ is true) |
|---|---|---|
| Negative (fail to reject $H_0$) | Condition absent + Negative result = True (accurate) Negative (TN, 0.98505) | Condition present + Negative result = False (invalid) Negative (FN, 0.00025), Type II error ($\beta$) |
| Positive (reject $H_0$) | Condition absent + Positive result = False Positive (FP, 0.00995), Type I error ($\alpha$) | Condition present + Positive result = True Positive (TP, 0.00475) |
| Test Interpretation | Specificity: TN/(TN+FP) = 0.98505/(0.98505 + 0.00995) = 0.99 | Sensitivity: TP/(TP+FN) = 0.00475/(0.00475 + 0.00025) = 0.95; Power $= 1-\beta = 1-\frac{FN}{FN+TP} = 1-0.05 = 0.95$ |
Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\dfrac{TP}{TP+FN}$, $\alpha=\dfrac {FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power$=1-\beta.$
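The formulas above can be checked numerically. The following is a minimal Python sketch, assuming the four cell values from the example table are joint probabilities; the function name `diagnostic_measures` is hypothetical and introduced only for illustration. Power is computed as $1-\beta$ with $\beta=\frac{FN}{FN+TP}$, so it coincides with sensitivity.

```python
def diagnostic_measures(tn, fn, fp, tp):
    """Compute test characteristics from the four cells of a 2x2 table.

    tn, fn, fp, tp may be counts or joint probabilities; only ratios are used.
    """
    specificity = tn / (tn + fp)   # correct negatives among condition-absent cases
    sensitivity = tp / (tp + fn)   # correct positives among condition-present cases
    alpha = fp / (fp + tn)         # false positive rate (Type I error)
    beta = fn / (fn + tp)          # false negative rate (Type II error)
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "alpha": alpha,
        "beta": beta,
        "power": 1 - beta,
    }


# Cell values taken from the example table above.
measures = diagnostic_measures(tn=0.98505, fn=0.00025, fp=0.00995, tp=0.00475)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Note that specificity and $\alpha$ always sum to 1, as do sensitivity and $\beta$, which gives a quick sanity check on any table.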
Sample size
Sample size is the number of observations or replicates included in a statistical sample. It is an important feature of any empirical study that aims to make inferences about a population. In complicated studies, there may be several different sample sizes involved: for example, in a survey that uses stratified sampling, there may be a different sample size for each stratum.
- Factors that influence sample size: the expense of data collection; the need to have sufficient statistical power.
- Ways to choose sample sizes: (1) expedience: consider a simple experiment where the sample data are readily available or convenient to collect, yet the sample size is still crucial in avoiding wide confidence intervals or risks of errors in statistical hypothesis testing; (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.
- Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations, the increase in precision from a larger sample is minimal, or even nonexistent. This can result from the presence of systematic error or strong dependence in the data, or from data that follow a heavy-tailed distribution. Sample size is judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.
- The sample size can be chosen to meet a target for another quantity, such as the width of a confidence interval or the power of a test.
- Consider the simple experiment of flipping a coin, where the estimator of a proportion is $\hat{p}=\frac{X}{n}$ and $X$ is the number of heads out of $n$ flips. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. With this approximation, around $95\%$ of the distribution’s probability lies within 2 standard deviations of the mean. Using the Wald method with the most conservative variance, $p(1-p)\le 0.25$, an interval of the form $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms a 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then $4\sqrt{\frac{0.25}{n}}=W$; solving for $n$ gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=\frac{W}{2}$ is the error bound on the estimate, i.e., the estimate is usually given as within $\pm B$. Hence, if $B=0.1$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.
- A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $\sigma^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. By the CLT, the 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$. If we wish to have a confidence interval $W$ units wide, then $\frac{4\sigma}{\sqrt n}=W$, and solving for $n$ gives $n=\frac{16\sigma^2}{W^2}$.
- Sample size for hypothesis tests: Let $X_i, i=1,2,\dots,n$ be independent observations taken from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Consider the null and alternative hypotheses $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$, with $\mu^*>0$. Suppose we wish to (1) reject $H_0$ with probability at least $1-\beta$ when $H_a$ is true, and (2) reject $H_0$ with probability $\alpha$ when $H_0$ is true. Since $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha$, where $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution, the decision rule "reject $H_0$ if the sample average exceeds $\frac{z_\alpha\sigma}{\sqrt n}$" satisfies (2). For (1), this rejection must happen with probability at least $1-\beta$ when $H_a$ is true, in which case the sample average comes from a normal distribution with mean $\mu^*$ and standard deviation $\frac{\sigma}{\sqrt n}$.
- Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta $. Solving for $n$ gives $n \ge \left( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}}\right)^{2}$, where $\Phi$ is the standard normal cumulative distribution function.
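The sample-size rules in this section can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, the confidence-interval rules use the rounded multiplier 2 (rather than 1.96) as in the text, and the test calculation assumes a one-sided z-test with known $\sigma$ and $\mu^*>0$, using $n \ge \left(\frac{(z_\alpha+\Phi^{-1}(1-\beta))\,\sigma}{\mu^*}\right)^2$, which follows from requiring power of at least $1-\beta$ under $H_a$. The example inputs are made up for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def _ceil(x):
    # Round first to guard against floating-point noise such as 400.0000000001.
    return math.ceil(round(x, 9))


def n_for_proportion(error_bound):
    """Conservative n so that a ~95% Wald interval is within +/- error_bound."""
    return _ceil(1 / error_bound ** 2)


def n_for_mean_ci(sigma, width):
    """n so that the ~95% interval (mean +/- 2*sigma/sqrt(n)) has total width `width`."""
    return _ceil(16 * sigma ** 2 / width ** 2)


def n_for_one_sided_test(mu_star, sigma, alpha=0.05, power=0.80):
    """Smallest n giving at least `power` for H0: mu=0 vs Ha: mu=mu_star (> 0)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # upper alpha point
    z_power = STD_NORMAL.inv_cdf(power)       # Phi^{-1}(1 - beta)
    return _ceil(((z_alpha + z_power) * sigma / mu_star) ** 2)


def achieved_power(n, mu_star, sigma, alpha=0.05):
    """Power of the rule 'reject H0 if the sample mean > z_alpha*sigma/sqrt(n)'."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - mu_star * math.sqrt(n) / sigma)


print(n_for_proportion(0.05))               # error bound of 5%
print(n_for_mean_ci(sigma=1.0, width=0.5))
n = n_for_one_sided_test(mu_star=0.5, sigma=1.0, alpha=0.05, power=0.80)
print(n, achieved_power(n, mu_star=0.5, sigma=1.0))
```

Computing the achieved power for the returned $n$, and for $n-1$, is a simple way to confirm that the result is the smallest sample size that meets the target.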
Effect size
Effect size is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as the p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. Many effect sizes are based on "variance explained": they estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.
Other common measures
- Pearson $r$ (correlation): an effect size used when paired quantitative data are available, for instance when studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between the two variables.
- Coefficient of determination, $ r^2 $: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1 and is always nonnegative. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.
- Eta-squared, $ \eta^2 $, describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to the $ r^2 $. It is a biased estimator of the variance explained by the model in the population. $ \eta^2=\frac{SS_{treatment}} {SS_{total}} $ .
- Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment} \cdot MS_{error}}{SS_{total}+MS_{error}}$. Because it is less biased, $\omega^2$ is preferable to $\eta^2$; however, it can be more inconvenient to calculate for complex analyses.
- Cohen’s $ f^2 $: one of several effect size measures used in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$, where $R^2$ is the squared multiple correlation.
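To make these measures concrete, the sketch below computes Pearson $r$, $\eta^2$, $\omega^2$ and Cohen's $f^2$ in plain Python. It assumes a one-way ANOVA layout, where $R^2$ can be taken equal to $\eta^2$; the function names and the small data sets are made up for illustration and do not come from any study.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def anova_effect_sizes(groups):
    """Eta-squared, omega-squared and Cohen's f^2 for a one-way ANOVA.

    `groups` is a list of lists, one list of observations per treatment.
    """
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = ss_total - ss_treatment
    df_treatment = len(groups) - 1
    df_error = len(values) - len(groups)
    ms_error = ss_error / df_error

    eta_sq = ss_treatment / ss_total
    omega_sq = (ss_treatment - df_treatment * ms_error) / (ss_total + ms_error)
    f_sq = eta_sq / (1 - eta_sq)  # assumes R^2 = eta^2 (one-way design)
    return {"eta_sq": eta_sq, "omega_sq": omega_sq, "f_sq": f_sq}


# Hypothetical data: three treatment groups with three observations each.
sizes = anova_effect_sizes([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in sizes.items():
    print(f"{name}: {value:.4f}")

r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}")
```

In this example $\omega^2$ comes out smaller than $\eta^2$, which illustrates the correction for the upward bias of $\eta^2$.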
Applications
- This article, titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviews the importance of sample size in clinical trials and presents a general method from which specific equations are derived for sample size determination and power analysis for a wide variety of statistical procedures. The paper discusses the method in detail, with illustrations for the t test, tests for proportions, tests of survival time and tests for correlations, all of which commonly occur in clinical trials.
- This article advocates presenting measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrates the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focuses on standardized effect size statistics and extensively discusses two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and because their calculation is essential for meta-analysis. The paper provides potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or variances, and (4) when data are non-independent.
- This article reviews methods of sample size and power calculation for the most commonly used study designs. It presents two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrates the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.
Software
Problems
Other things being equal, which of the following actions will reduce the power of a hypothesis test?
I. Increasing sample size. II. Increasing significance level. III. Increasing beta, the probability of a Type II error.
- (A) I only
- (B) II only
- (C) III only
- (D) All of the above
- (E) None of the above
Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?
I. The power of the hypothesis test. II. The effect size of the hypothesis test. III. The probability of making a Type II error.
- (A) I only
- (B) II only
- (C) III only
- (D) All of the above
- (E) None of the above
Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
| Test Result | Actual Condition: Absent ($H_0$ is true) | Actual Condition: Present ($H_1$ is true) |
|---|---|---|
| Negative (fail to reject $H_0$) | 0.983 | 0.0025 |
| Positive (reject $H_0$) | 0.0085 | 0.0055 |
Suppose we are running a test on a simple experiment where the population standard deviation is $0.06$, with $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a Type I error rate of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?
References
- SOCR Home page: http://www.socr.umich.edu