SMHS CorrectionMultipleTesting
Scientific Methods for Health Sciences - Correction for Multiple Testing
Overview
Multiple testing refers to studies in which several hypotheses are tested simultaneously. This is very common in empirical research, and methods beyond the traditional single-test rules need to be applied to adjust the error rates for the multiple testing problem. Here, we introduce protocols for multiple testing correction, discuss the general problems that multiple testing creates, and present ways to deal with them efficiently, including procedures that control the family-wise error rate (FWER), such as Bonferroni’s and Tukey’s, and procedures that control the false discovery rate (FDR).
Motivation
We have learned how to carry out hypothesis testing and parametric inference using statistical tests. However, multiple testing problems occur when one considers a set of inference questions simultaneously, or makes inferences about a subset of parameters selected based on the observed values. Consider a simple example where we run the same experiment 100 times, independently. If our a priori false-positive (type I error) rate is 0.05, then any one experiment has a 1 in 20 chance of generating a significant result due to chance alone, so among the 100 experiments we would expect about 5 to generate false-positive results. Multiple-testing correction refers to modifying the inference protocol so that the overall false-positive rate remains fixed (or is controlled) despite the fact that we test multiple hypotheses. Examples:
- Suppose a treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone.
- Consider the efficacy of a drug in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes more likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom.
- Suppose we investigate the safety of a drug in terms of the occurrences of different types of side effects. As more types of side effects are considered, it becomes more likely that the new drug will appear to be less safe than existing drugs in terms of at least one side effect.
In each of these examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison. If a single test is performed at the \(\alpha=0.05\) level, there is only a 5% chance of incorrectly rejecting the null hypothesis if the null hypothesis is true. However, for 100 tests where all null hypotheses are true, the expected number of incorrect rejections is 5. If the tests are independent, the probability of at least one incorrect rejection is 99.4%, since \(P(\ge 1)=1-P(0)=1-0.95^{100}\approx 0.9940795\). These errors are called false positives or Type I errors.
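As a quick numerical check of these figures, the following R sketch computes the expected number of false positives and the probability of at least one false positive for several numbers of tests. It assumes independent tests with every null hypothesis true; the variable names and the chosen values of m are illustrative only.

```r
# Assumption: m independent tests, every null hypothesis true,
# each test performed at level alpha.
alpha <- 0.05
m <- c(1, 5, 10, 20, 100)   # illustrative numbers of tests

expected_false_positives <- m * alpha      # E[V] = m * alpha
prob_at_least_one <- 1 - (1 - alpha)^m     # P(V >= 1) = 1 - (1 - alpha)^m

result <- data.frame(m, expected_false_positives, prob_at_least_one)
print(result)
```

The last row corresponds to the 100-test case discussed above.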
So what can we do to adjust for multiple testing? How can we keep the prescribed family-wise error rate of α in an analysis involving more than one comparison? Clearly, the error rate for each individual comparison must be more stringent than α. Multiple testing correction addresses this, and below we introduce some commonly used methods for adjusting for this type of error.
Theory
Family-Wise Error Rate (FWER)
The family-wise error rate is the probability of making at least one type I error among all the hypotheses when performing multiple hypothesis tests. Controlling the FWER exerts more stringent control over false discoveries than false discovery rate controlling procedures do. Suppose we simultaneously test m hypotheses denoted by \(H_1,H_2,…,H_m\) with corresponding p-values \(p_1,p_2,…,p_m\). Let \(I_0\) be the subset of true null hypotheses, of size \(m_0\), and let \(V\) be the number of true null hypotheses that are rejected (false positives). Our aim is to achieve an overall type I error rate of α across this family of tests, where \(FWER=Pr(V≥1)=1-Pr(V=0)\). By ensuring \(FWER≤α\), the probability of making even one type I error in the family is controlled at level α.
| | Null hypothesis is true | Alternative hypothesis is true | Total |
|---|---|---|---|
| Declared significant | V (number of false positives) | S (number of true positives) | R |
| Declared non-significant | U (number of true negatives) | T (number of false negatives) | m-R |
| Total | \(m_0\) (number of true null hypotheses) | \(m-m_0\) (number of true alternatives) | m |
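To make the entries of this table concrete, here is a minimal simulation sketch in R. Everything in it is an assumption chosen for illustration: the number of hypotheses, the number of true nulls, the effect size of 3 for the true alternatives, and the use of unadjusted two-sided z-tests at α = 0.05. In real data the truth is unknown, so V, S, U and T cannot be observed directly.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
m  <- 1000       # total number of hypotheses (illustrative)
m0 <- 900        # number of true null hypotheses (illustrative)
alpha <- 0.05

# TRUE marks a hypothesis whose null is true
null_is_true <- c(rep(TRUE, m0), rep(FALSE, m - m0))

# One z-statistic per hypothesis: mean 0 under the null,
# mean 3 (assumed effect size) under the alternative
z <- rnorm(m, mean = ifelse(null_is_true, 0, 3))
p <- 2 * pnorm(-abs(z))            # two-sided p-values

significant <- p <= alpha          # unadjusted decision rule

V <- sum(significant & null_is_true)          # false positives
S <- sum(significant & !null_is_true)         # true positives
U <- sum(!significant & null_is_true)         # true negatives
T_count <- sum(!significant & !null_is_true)  # false negatives (T in the table)
R <- V + S                                    # total declared significant

counts <- matrix(c(V,  S,       R,
                   U,  T_count, m - R,
                   m0, m - m0,  m),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Declared significant",
                                   "Declared non-significant",
                                   "Total"),
                                 c("Null true", "Alternative true", "Total")))
print(counts)
```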
- A procedure controls the FWER in the weak sense if FWER control at level α is guaranteed only when all null hypotheses are true.
- A procedure controls the FWER in the strong sense if FWER control at level α is guaranteed for any configuration of true and false null hypotheses.
Controlling FWER:
- Bonferroni: rejecting all hypotheses with \(p_i≤α/m\) guarantees \(FWER≤α\), which follows from Boole’s inequality: \(FWER=Pr\left(\bigcup_{i\in I_0}\{p_i≤α/m\}\right) \leq \sum_{i\in I_0}{Pr(p_i≤α/m)} ≤ m_0 α/m ≤ m α/m = α\). This is the simplest method to control the FWER, but it can be (very) conservative if there are a large number of tests and/or the test statistics are positively correlated. It controls the probability of false positives only.
- Tukey’s procedure is only applicable for pairwise comparisons of means. It assumes independence of the observations being tested as well as equal variance across the groups. For each pair, the procedure calculates the studentized range statistic \( \frac{Y_A-Y_B}{SE}\), where \(Y_A\) is the larger of the two means being compared, \(Y_B\) is the smaller one, and SE is the standard error of the data. This statistic is then compared with a critical value of the studentized range distribution.
- The Sidak procedure applies to independent tests, where each hypothesis is tested at level \(α_{SID}=1-(1-α)^{1/m}\). It is more powerful than Bonferroni, but the gain is small.
- Holm’s step-down procedure starts by ordering the p-values from lowest to highest as \(p_{(1)},p_{(2)},…,p_{(m)}\) with corresponding hypotheses \(H_{(1)},H_{(2)},…,H_{(m)}\). Let R be the smallest k such that \(p_{(k)}>α/(m+1-k)\). Reject the null hypotheses \(H_{(1)},…,H_{(R-1)}\) and do not reject \(H_{(R)},…,H_{(m)}\); if \(R=1\), none of the hypotheses are rejected, and if no such k exists, all of them are rejected. This method is uniformly more powerful than Bonferroni’s and, like Bonferroni’s, it places no restriction on the joint distribution of the test statistics.
- Hochberg’s step-up procedure also starts by ordering the p-values from lowest to highest as \(p_{(1)},p_{(2)},…,p_{(m)}\) with corresponding hypotheses \(H_{(1)},H_{(2)},…,H_{(m)}\). For a given α, let \(R\) be the largest k such that \(p_{(k)}≤α/(m+1-k)\). Reject the null hypotheses \(H_{(1)},H_{(2)},…,H_{(R)}\) only, and none of \(H_{(R+1)},…,H_{(m)}\). It is more powerful than Holm’s procedure; however, it is based on the Simes test, so it is valid only under independence (and under some forms of positive dependence).
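The Bonferroni, Sidak, Holm and Hochberg rules above can be sketched in a few lines of R. The p-values below are hypothetical and were picked so that the step-down and step-up rules disagree; the variable names are illustrative, and the final lines cross-check the hand-coded decisions against base R's `p.adjust()`.

```r
alpha <- 0.05
p <- c(0.001, 0.013, 0.014, 0.020, 0.300)   # hypothetical p-values
m <- length(p)

# Per-test significance levels for the single-step procedures
alpha_bonferroni <- alpha / m
alpha_sidak <- 1 - (1 - alpha)^(1 / m)
reject_bonferroni <- p <= alpha_bonferroni
reject_sidak <- p <= alpha_sidak

# Ordered p-values and the step-wise thresholds alpha / (m + 1 - k)
ord <- order(p)
p_sorted <- p[ord]
thresholds <- alpha / (m + 1 - seq_len(m))

# Holm (step-down): find the first k with p_(k) > threshold,
# reject every ordered hypothesis before it
fails <- which(p_sorted > thresholds)
n_reject_holm <- if (length(fails) == 0) m else fails[1] - 1

# Hochberg (step-up): find the largest k with p_(k) <= threshold,
# reject H_(1), ..., H_(k)
passes <- which(p_sorted <= thresholds)
n_reject_hochberg <- if (length(passes) == 0) 0 else max(passes)

# Map the decisions back to the original order of the hypotheses
reject_holm <- logical(m)
reject_holm[ord[seq_len(n_reject_holm)]] <- TRUE
reject_hochberg <- logical(m)
reject_hochberg[ord[seq_len(n_reject_hochberg)]] <- TRUE

# Cross-check against p.adjust(): reject when the adjusted p-value <= alpha
all(reject_holm == (p.adjust(p, "holm") <= alpha))
all(reject_hochberg == (p.adjust(p, "hochberg") <= alpha))
```

With these hypothetical p-values, Bonferroni, Sidak and Holm each reject only the smallest p-value, because the second ordered p-value (0.013) exceeds its Holm threshold of 0.0125. Hochberg rejects the four smallest, because the fourth ordered p-value (0.020) is below its threshold of 0.025.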
FDR (false discovery rate)
The false discovery rate is a criterion used in multiple hypothesis testing to adjust for multiple comparisons. FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses among all rejected hypotheses. Compared to FWER control, FDR control is less stringent: it limits the expected proportion of false discoveries rather than the probability of even one false discovery, and so enjoys greater power at the cost of an increased rate of type I errors.
| | Null hypothesis is true | Alternative hypothesis is true | Total |
|---|---|---|---|
| Declared significant | V (number of false positives) | S (number of true positives) | R |
| Declared non-significant | U (number of true negatives) | T (number of false negatives) | m-R |
| Total | \(m_0\) (number of true null hypotheses) | \(m-m_0\) (number of true alternatives) | m |
Define \(Q\) as the proportion of false discoveries among the discoveries, \(Q=\frac{V}{R}\). The FDR is then defined as \(FDR=Q_e=E[Q]=E\left[\frac{V}{V+S}\right]=E\left[\frac{V}{R}\right]\), where \(\frac{V}{R}\) is defined to be 0 when \(R=0\). Our aim is to keep the FDR below a threshold α (or q). The q-value is defined as the FDR analogue of the p-value: the q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant.
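The difference between \(Pr(V≥1)\) and \(E[Q]\) can be seen in a small Monte Carlo sketch in R. All settings are assumptions made for illustration: 50 independent two-sided z-tests per experiment, 40 true nulls, an effect size of 3 for the alternatives, and unadjusted testing at α = 0.05. The helper `one_run()` is hypothetical and not part of any library.

```r
set.seed(2)       # arbitrary seed, for reproducibility only
n_sim <- 2000     # number of simulated experiments (illustrative)
m  <- 50          # hypotheses per experiment (illustrative)
m0 <- 40          # true null hypotheses per experiment (illustrative)
alpha <- 0.05

null_is_true <- c(rep(TRUE, m0), rep(FALSE, m - m0))

# Simulate one experiment; return V and the false discovery proportion Q
one_run <- function() {
  z <- rnorm(m, mean = ifelse(null_is_true, 0, 3))   # assumed effect size 3
  p <- 2 * pnorm(-abs(z))                            # two-sided p-values
  significant <- p <= alpha                          # unadjusted testing
  V <- sum(significant & null_is_true)
  R <- sum(significant)
  c(V = V, Q = if (R == 0) 0 else V / R)             # Q = 0 when R = 0
}

runs <- replicate(n_sim, one_run())   # 2 x n_sim matrix with rows "V" and "Q"

fwer_hat <- mean(runs["V", ] >= 1)    # Monte Carlo estimate of Pr(V >= 1)
fdr_hat  <- mean(runs["Q", ])         # Monte Carlo estimate of E[Q]
c(FWER = fwer_hat, FDR = fdr_hat)
```

Because \(Q\) can be positive only when \(V≥1\), and never exceeds 1, the estimated FDR is never larger than the estimated FWER. This is why controlling the FDR is the less stringent requirement.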
Procedures for controlling the FDR:
- Suppose we have m null hypotheses \(H_1,H_2,…,H_m\) with corresponding p-values \(p_1,p_2,…,p_m\). We sort these p-values in increasing order and denote them by \(p_{(1)},p_{(2)},…,p_{(m)}\).
- The Benjamini-Hochberg (BH) procedure controls the false discovery rate at level α. For a given α, find the largest k such that \(p_{(k)}≤\frac{k}{m}α\); then reject all \(H_{(i)}\) for \(i=1,…,k\). This method is valid when the m tests are independent, as well as under some forms of dependence, and satisfies \(E(Q)≤\frac{m_0}{m}α≤α\).
- The Benjamini-Hochberg-Yekutieli procedure controls the FDR under arbitrary dependence assumptions. It modifies the BH threshold to \(p_{(k)}≤\frac{k}{m \cdot c(m)}α\). If the tests are independent or positively correlated, we choose \(c(m)=1\), which recovers the BH procedure. Under arbitrary dependence (including negatively correlated tests), we choose \(c(m)=\sum_{i=1}^m{1/i}\), which can be approximated by \(\sum_{i=1}^m{1/i} \approx \ln(m)+ \gamma\), where \(\gamma\) is the Euler-Mascheroni constant.
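The two step-up rules above differ only in the constant \(c(m)\), so both can be sketched with one small R function. The function `step_up_reject()` and the p-values are hypothetical, written for this illustration; the result is cross-checked against the `"BH"` and `"BY"` methods of base R's `p.adjust()`.

```r
alpha <- 0.05
# Hypothetical p-values, deliberately given in unsorted order
p <- c(0.074, 0.001, 0.212, 0.014, 0.060, 0.019, 0.205, 0.012, 0.030, 0.216)
m <- length(p)

# Step-up rule: find the largest k with p_(k) <= k / (m * c_m) * alpha
# and reject H_(1), ..., H_(k). c_m = 1 gives the BH procedure.
step_up_reject <- function(p, alpha, c_m = 1) {
  m <- length(p)
  ord <- order(p)
  thresholds <- seq_len(m) / (m * c_m) * alpha
  passes <- which(p[ord] <= thresholds)
  k <- if (length(passes) == 0) 0 else max(passes)
  reject <- logical(m)
  reject[ord[seq_len(k)]] <- TRUE   # map back to the original order
  reject
}

reject_bh <- step_up_reject(p, alpha)                           # c(m) = 1
reject_by <- step_up_reject(p, alpha, c_m = sum(1 / seq_len(m)))  # harmonic sum

# Cross-check against p.adjust(): reject when the adjusted p-value <= alpha
all(reject_bh == (p.adjust(p, "BH") <= alpha))
all(reject_by == (p.adjust(p, "BY") <= alpha))
```

With these hypothetical values, BH rejects the four smallest p-values. Note that 0.012 is rejected even though it exceeds its own threshold of 0.010, because a larger ordered p-value (0.019 ≤ 0.020) passes. The more conservative BY threshold rejects only the smallest p-value.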
Example
Suppose we have computed a vector of p-values \(p_1,p_2,…,p_n\). Let’s compare the corrections obtained using different strategies:

    # Given a set of p-values, p.adjust() returns p-values adjusted using one of several methods:
    # c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
    > p.adjust(c(0.05,0.05,0.1),"bonferroni")
    [1] 0.15 0.15 0.30
    > p.adjust(c(0.05,0.05,0.1),"fdr")
    [1] 0.075 0.075 0.100
    > p.adjust(c(0.05,0.05,0.1),"holm")
    [1] 0.15 0.15 0.15
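To see all the corrections side by side, one option is to loop over the built-in vector `p.adjust.methods`. This is only a convenience sketch; the variable name `adjusted` is illustrative.

```r
p <- c(0.05, 0.05, 0.1)   # same p-values as in the example above

# One column of adjusted p-values per method listed in p.adjust.methods
adjusted <- sapply(p.adjust.methods, function(method) p.adjust(p, method = method))
round(adjusted, 3)
```

Each row of the resulting matrix corresponds to one of the original p-values, and each column to one adjustment method. The `"none"` column returns the p-values unchanged.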
Applications
- This article presents information on how to use the SOCR analyses library for the purpose of computing the False Discovery Rate (FDR) correction for multiple testing in volumetric and shape-based analyses. It provides the specific procedure to compute FDR using SOCR in multiple testing and illustrates with examples and supplementary information about FDR.
- This article is a comprehensive introduction to multiple testing. It describes the problem of multiple testing more formally and discusses methods that account for the multiplicity issue. In particular, recent developments based on resampling result in an improved ability to reject false hypotheses compared to classical methods such as Bonferroni.
Software
Problems
- Suppose a study is conducted to test a new drug and 10 hypotheses are tested simultaneously. Using the Bonferroni correction, calculate the significance level of each individual test needed to maintain an overall type I error rate of 5%. What is the probability of observing at least one significant result when using this correction?
- Consider a study testing a new cancer drug with three treatments: the new medicine, the old medicine, and the combination of the two. We perform pairwise tests on these three treatments and want to maintain a type I error rate of 5%. Consider Tukey’s correction and describe how you would apply this method here.