Difference between revisions of "Scientific Methods for Health Sciences"

 
IV. HS 850: Fundamentals

Multiple Testing Inference
  
1) Overview: Multiple testing refers to situations in which several hypotheses are tested simultaneously. This is very common in empirical research, and additional methods beyond the traditional single-test rules need to be applied in order to adjust for the resulting multiplicity. In this lecture we introduce the basic concepts used in multiple testing, discuss the general problems it raises, and present ways to deal with them efficiently, including the Bonferroni correction, Tukey's procedure, the family-wise error rate (FWER), and the false discovery rate (FDR).
  
2) Motivation: We have learned how to carry out hypothesis testing with statistical tests. However, multiple testing problems occur when one considers a set of statistical inferences simultaneously, or infers on a subset of parameters selected on the basis of the observed values. So what can we do to adjust for multiple testing? How can we keep a prescribed family-wise error rate of α in an analysis involving more than one comparison? Clearly, the error rate for each individual comparison must then be more stringent than α. Multiple testing corrections are the way to go, and we introduce some commonly used methods for adjusting for this type of error.
  
 
3) Theory
  
3.1) Family-Wise Error Rate (FWER): the probability of making at least one type I error among all the hypotheses when performing multiple hypothesis tests. FWER exerts a more stringent control over false discoveries than FDR-controlling procedures. Suppose we simultaneously test m hypotheses, denoted H_1, H_2, …, H_m, with corresponding p-values p_1, p_2, …, p_m. Let I_0 be the index set of the true null hypotheses, with |I_0| = m_0. Our aim is to achieve an overall type I error rate of α from this multiple testing. With V denoting the number of false positives among the R rejections (see the table below), FWER = Pr(V ≥ 1) = 1 − Pr(V = 0). By requiring FWER ≤ α, the probability of making even one type I error in the family is controlled at level α.
  
|class="wikitable" style="text-align:center; width:75%" border="1"
+
A clear demonstration of clinical significance would be to take a group of clients who score, say, beyond +2 SDs of the normative group prior to treatment and move them to within 1 SD from the mean of that group. The research implication of this definition is that you want to select people who are clearly disturbed to be in the clinical outcome study. If the mean of your untreated group is at, say, +1.2 SDs above the mean the change due to treatment probably is not going to be viewed as clinically significant.
| ||Null hypothesis is True|| Alternative hypothesis is True || Total||||
+
Clinical significance is defined by the smallest clinically beneficial and harmful values of the effect. These values are usually equal and opposite in sign. Because there is always a leap of faith in applying the results of a study to your patients (who, after all, were not in the study), perhaps a small improvement in the new therapy is not sufficient to cause you to alter your clinical approach. Note that you would almost certainly not alter your approach if the study results were not statistically significant (i.e. could well have been due to chance). But when is the difference between two therapies large enough for you to alter your practice?
|Declared significant||V (number of false positives)||S(number of true positives)||R
 
Declared non-significant||||
 
|U(number of true negatives)||T(number of false negatives)|| m-R||||
 
|Total||m_0(number of true null hypotheses)||m-m_0(number of true alternatvies)||m||||
 
  
A procedure controls the FWER in the weak sense if FWER control at level α is guaranteed only when all null hypotheses are true.

A procedure controls the FWER in the strong sense if FWER control at level α is guaranteed for any configuration of true and non-true null hypotheses.
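To see why an explicit correction is needed, note that for m independent tests each carried out at level α, the chance of at least one false positive when all nulls are true is 1 − (1 − α)^m, which grows rapidly with m. A minimal R illustration of this inflation (assuming independent tests):

# FWER for m independent tests, each at level alpha = 0.05, all nulls true
alpha <- 0.05
m <- c(1, 5, 10, 20, 100)
round(1 - (1 - alpha)^m, 3)
# [1] 0.050 0.226 0.401 0.642 0.994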
 
  
Controlling FWER:
Bonferroni: rejecting all p_i ≤ α/m controls FWER ≤ α, which is proved through Boole's inequality: FWER = Pr(⋃_{i∈I_0} {p_i ≤ α/m}) ≤ Σ_{i∈I_0} Pr(p_i ≤ α/m) ≤ m_0·(α/m) ≤ m·(α/m) = α. This is the simplest and most conservative method of controlling the FWER, though it can be overly conservative if there are a large number of tests and/or the test statistics are positively correlated. It controls the probability of false positives only.
Tukey's procedure: applicable only for pairwise comparisons. It assumes independence of the observations being tested, as well as equal variation across observations. For each pair, the procedure calculates the standardized range statistic (Y_A − Y_B)/SE, where Y_A is the larger of the two means being compared, Y_B is the smaller one, and SE is the standard error of the data.
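As a usage illustration, Tukey's procedure is available in base R through TukeyHSD() applied to a fitted one-way ANOVA; the three groups below are simulated, hypothetical data:

# Tukey's HSD on a one-way layout with three simulated groups
set.seed(1)
y <- c(rnorm(10, mean = 0), rnorm(10, mean = 0.5), rnorm(10, mean = 1))
g <- factor(rep(c("A", "B", "C"), each = 10))
fit <- aov(y ~ g)
TukeyHSD(fit, conf.level = 0.95)  # pairwise mean differences with adjusted p-values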
The Šidák procedure: works for independent tests, where each hypothesis is tested at level α_SID = 1 − (1 − α)^(1/m). This is a slightly more powerful method than Bonferroni, but the gain is small.
Holm's step-down procedure: starts by ordering the p-values from lowest to highest as p_(1), p_(2), …, p_(m), with corresponding hypotheses H_(1), H_(2), …, H_(m). Let R be the smallest k such that p_(k) > α/(m + 1 − k). Reject the null hypotheses H_(1), …, H_(R−1); if R = 1, none of the hypotheses are rejected. This method is uniformly more powerful than Bonferroni's; it is based on the Bonferroni argument and places no restriction on the joint distribution of the test statistics.
Hochberg's step-up procedure: starts by ordering the p-values from lowest to highest as p_(1), p_(2), …, p_(m), with corresponding hypotheses H_(1), H_(2), …, H_(m). For a given α, let R be the largest k such that p_(k) ≤ α/(m + 1 − k). Reject the null hypotheses H_(1), …, H_(R). It is more powerful than Holm's procedure; however, it is based on the Simes test, so it holds only under independence (and under some forms of positive dependence).
  
3.4) FDR (false discovery rate): a statistical method used in multiple hypothesis testing to adjust for multiple comparisons. It is designed to control the expected proportion of incorrectly rejected null hypotheses. Compared to FWER-controlling procedures, which seek to keep low the probability of even one false discovery, FDR control is less stringent: it bounds only the expected proportion of false discoveries, and thus enjoys greater power at the cost of an increased rate of type I errors.
 
  
Define Q as the proportion of false discoveries among the discoveries, Q = V/R; then the FDR is defined as FDR = Q_e = E[Q] = E[V/(V+S)] = E[V/R], where V/R is defined to be 0 when R = 0. Our aim is to keep the FDR below a threshold α (or q). The q-value is the FDR analogue of the p-value: the q-value of an individual hypothesis test is the minimum FDR at which the test may be called significant.
  
FDR-controlling procedures:
With m null hypotheses H_1, H_2, …, H_m and corresponding p-values p_1, p_2, …, p_m, we order the p-values in increasing order and denote them p_(1), p_(2), …, p_(m).
Benjamini-Hochberg procedure: controls the FDR at level α. For a given α, find the largest k such that p_(k) ≤ (k/m)α; then reject all H_(i) for i = 1, …, k. This method works when the m tests are independent, as well as in some cases of dependence: E(Q) ≤ (m_0/m)α ≤ α.
Benjamini-Hochberg-Yekutieli procedure: controls the FDR under more general dependence assumptions. It modifies the BH procedure to p_(k) ≤ k/(m·c(m)) α: if the tests are independent or positively correlated, choose c(m) = 1; under arbitrary dependence, choose c(m) = Σ_{i=1}^m 1/i, which can be approximated by ln(m) + γ (with γ the Euler–Mascheroni constant).
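For concreteness, here is a minimal R sketch of the BH step-up rule for a hypothetical vector of p-values; the rejections agree with thresholding the adjusted p-values from R's p.adjust(p, "BH"):

bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                          # sort p-values in increasing order
  k <- which(p[ord] <= (1:m) / m * alpha)  # all k with p_(k) <= (k/m) * alpha
  reject <- rep(FALSE, m)
  if (length(k) > 0) reject[ord[1:max(k)]] <- TRUE  # reject H_(1), ..., H_(max k)
  reject
}
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)  # hypothetical
bh_reject(p)                       # TRUE for the rejected hypotheses
which(p.adjust(p, "BH") <= 0.05)   # same set via the BH-adjusted p-values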
Example: suppose we have computed a vector of p-values (p_1, p_2, …, p_n). Let's compare the corrections obtained using different strategies in R:
 
 
# Given a set of p-values, p.adjust() returns p-values adjusted using one of
# several methods: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY",
#                    "fdr", "none")
> p.adjust(c(0.05, 0.05, 0.1), "bonferroni")
[1] 0.15 0.15 0.30

> p.adjust(c(0.05, 0.05, 0.1), "fdr")
[1] 0.075 0.075 0.100

> p.adjust(c(0.05, 0.05, 0.1), "holm")
[1] 0.15 0.15 0.15
 
  
  
 
4) Applications
  
4.1) This article (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysesCommandLineFDR_Correction) presents information on how to use the SOCR analyses library to compute the false discovery rate (FDR) correction for multiple testing in volumetric and shape-based analyses. It provides the specific procedure for computing the FDR with SOCR in multiple testing settings and illustrates it with examples and supplementary information about the FDR.
 
 
4.2) This article (http://home.uchicago.edu/amshaikh/webfiles/palgrave.pdf) is a comprehensive introduction to multiple testing. It describes the problem of multiple testing more formally and discusses methods that account for the multiplicity issue. In particular, recent developments based on resampling result in an improved ability to reject false hypotheses compared to classical methods such as Bonferroni.
 
  
  
 
5) Software
http://bioinformatics.oxfordjournals.org/content/21/12/2921.full
 
http://socr.ucla.edu/htmls/SOCR_Analyses.html
 
 
http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysesCommandLineFDR_Correction
  
 
6) Problems
  
6.1) Suppose research is conducted to test a new drug and there are 10 hypotheses being tested simultaneously. Using the Bonferroni correction, calculate the significance level of each individual test if we want to maintain an overall type I error rate of 5%, and calculate the probability of observing at least one significant result under this correction when all null hypotheses are true.
  
6.2) Consider a study testing a new drug for cancer in which we have three treatments: the new medicine, the old medicine, and the combination of the two. We are doing pairwise tests on these three treatments and want to maintain a type I error rate of 5%. Consider Tukey's correction and describe how you would apply this method here.
  
 
7) References
 
http://mirlyn.lib.umich.edu/Record/004199238

http://mirlyn.lib.umich.edu/Record/004232056

http://mirlyn.lib.umich.edu/Record/004133572
  
 
===[[SMHS_CorrectionMultipleTesting | Correction for Multiple Testing]]===


SOCR Wiki: Scientific Methods for Health Sciences

Scientific Methods for Health Sciences EBook


Electronic book (EBook) on Scientific Methods for Health Sciences (coming up ...)

Preface

The Scientific Methods for Health Sciences (SMHS) EBook is designed to support a 4-course training of scientific methods for graduate students in the health sciences.

Format

Follow the instructions in this page to expand, revise or improve the materials in this EBook.

Learning and Instructional Usage

This section describes the means of traversing, searching, discovering and utilizing the SMHS EBook resources in both formal and informal learning settings.

Copyrights

The SMHS EBook is a freely and openly accessible electronic book developed by SOCR and the general community.

Chapter I: Fundamentals

Exploratory Data Analysis, Plots and Charts

Review of data types, exploratory data analyses and graphical representation of information.

Ubiquitous Variation

There are many ways to quantify variability, which is present in all natural processes.

Parametric Inference

Foundations of parametric (model-based) statistical inference.

Probability Theory

Random variables, stochastic processes, and events are the core concepts necessary to define likelihoods of certain outcomes or results to be observed. We define event manipulations and present the fundamental principles of probability theory including conditional probability, total and Bayesian probability laws, and various combinatorial ideas.

Odds Ratio/Relative Risk

The relative risk, RR, (a measure of dependence comparing two probabilities in terms of their ratio) and the odds ratio, OR, (the ratio of the odds of an event in two groups, where the odds are the fraction of a probability and its complement) are widely applicable in many healthcare studies.

Probability Distributions

Probability distributions are mathematical models for processes that we observe in nature. Although there are different types of distributions, they have common features and properties that make them useful in various scientific applications.

Resampling and Simulation

Resampling is a technique for estimation of sample statistics (e.g., medians, percentiles) by using subsets of available data or by randomly drawing replacement data. Simulation is a computational technique that imitates what happens in a real-world process or system over time, without waiting for it to happen by chance.

Design of Experiments

Design of experiments (DOE) is a technique for systematic and rigorous problem solving that applies data collection principles to ensure the generation of valid, supportable and reproducible conclusions.

Intro to Epidemiology

Epidemiology is the study of the distribution and determinants of disease frequency in human populations. This section presents the basic epidemiology concepts. More advanced epidemiological methodologies are discussed in the next chapter.

Experiments vs. Observational Studies

Experimental and observational studies have different characteristics and are useful in complementary investigations of association and causality.

Estimation

Estimation is a method of using sample data to approximate the values of specific population parameters of interest like population mean, variability or 97th percentile. Estimated parameters are expected to be interpretable, accurate and optimal, in some form.

Hypothesis Testing

Hypothesis testing is a quantitative decision-making technique for examining the characteristics (e.g., centrality, span) of populations or processes based on observed experimental data.

Statistical Power, Sensitivity and Specificity

The fundamental concepts of type I (false-positive) and type II (false-negative) errors lead to the important study-specific notions of statistical power, sample size, effect size, sensitivity and specificity.

Data Management

All modern data-driven scientific inquiries demand deep understanding of tabular, ASCII, binary, streaming, and cloud data management, processing and interpretation.

Bias and Precision

Bias and precision are two important and complementary characteristics of estimated parameters that quantify the accuracy and variability of approximated quantities.

Association and Causality

An association is a relationship between two, or more, measured quantities that renders them statistically dependent so that the occurrence of one does affect the probability of the other. A causal relation is a specific type of association between an event (the cause) and a second event (the effect) that is considered to be a consequence of the first event.

Rate-of-change

Rate of change is a technical indicator describing the rate in which one quantity changes in relation to another quantity.

Clinical vs. Statistical Significance

Statistical significance addresses the question of whether or not the results of a statistical test meet an accepted quantitative criterion, whereas clinical significance answers the question of whether the observed difference between two treatments (e.g., new and old therapy) found in a study is large enough to alter clinical practice.

IV. HS 850: Fundamentals

Clinical vs. Statistical Significance

1) Overview: Statistical significance is related to the question of whether or not the results of a statistical test meet an accepted criterion. The criterion can be arbitrary, and the same statistical test may give different results based on different criteria of significance. Usually, statistical significance is expressed in terms of a probability (the p-value, which is the probability of obtaining a test statistic result at least as extreme as the one actually observed, assuming the null hypothesis is true). Clinical significance addresses whether the difference between a new and an old therapy found in a study is large enough to alter practice. This section presents a general introduction to statistical significance, covering the important concepts of tests for statistical significance and measurements of the significance of tests, as well as the application of statistical tests in clinical settings and a comparison between clinical and statistical significance.

2) Motivation: Significance is one of the most commonly used measurements in statistical tests across various fields. However, many researchers and students misinterpret statistical significance and non-significance, and few know exactly what the p-value indicates, even though it in some sense defines statistical significance. So the questions would be: how can we define statistical significance? Are there other ways to define statistical significance besides the p-value? What is missing when we make inferences about clinical vs. statistical significance? This lecture aims to help students develop a thorough understanding of clinical and statistical significance.

3) Theory

3.1) Statistical significance: the low probability at which an observed effect would have occurred by chance. It is an integral part of statistical hypothesis testing, where it plays a vital role in deciding whether a null hypothesis can be rejected. The criterion level is typically p < 0.05, which is chosen to minimize the possibility of a type I error: finding a significant difference when one does not exist. It does not protect us from type II error, which is defined as the failure to find a difference when the difference does exist.

Statistical significance involves important factors like (1) the magnitude of the effect; (2) the sample size; (3) the reliability of the effect (i.e., whether the treatment is equally effective for all participants); (4) the reliability of the measurement instrument.

Problems with the p-value and statistical significance: (1) failure to reject the null hypothesis doesn't mean we accept the null; (2) in many cases the true effects in real life are never exactly zero, and things can be disproved only in pure mathematics, not in real life; (3) it is not logical to assume that effects are zero until disproved; (4) the significance level is arbitrary.

p-value: the probability of obtaining a test statistic result at least as extreme as the one actually observed when the null hypothesis is true. It is used in the context of null hypothesis testing to quantify the idea of statistical significance of evidence. A researcher will often reject the null when the p-value turns out to be less than a predetermined significance level, say 0.05. If the p-value is very small, usually less than or equal to a previously chosen threshold (the significance level), it suggests that the observed data are inconsistent with the assumption that the null hypothesis is true, and thus that hypothesis must be rejected. The smaller the p-value, the larger the significance, because it indicates that the hypothesis under consideration may not adequately explain the observation.

Definition of the p-value: Pr(X ≥ x | H_0) for a right-tail event; Pr(X ≤ x | H_0) for a left-tail event; 2·min(Pr(X ≥ x | H_0), Pr(X ≤ x | H_0)) for a double-tail event.

The hypothesis H_0 is rejected if any of these probabilities is less than or equal to a small, fixed, but arbitrarily predefined threshold α (the level of significance), which depends only on the consensus of the research community the investigator is working in. α = Pr(reject H_0 | H_0 is true) = Pr(p ≤ α).

Interpretation of the p-value: p ≤ 0.01, very strong presumption against the null hypothesis; 0.01 < p ≤ 0.05, strong presumption against the null hypothesis; 0.05 < p ≤ 0.1, low presumption against the null hypothesis; p > 0.1, no presumption against the null hypothesis.

Criticisms of the p-value: (1) the p-value does not in itself allow reasoning about the probabilities of hypotheses, which requires multiple hypotheses or a range of hypotheses with a prior distribution of likelihoods between them; (2) it refers only to a single hypothesis (the null) and does not make reference to, or allow conclusions about, any other hypotheses, such as the alternative hypothesis; (3) the criterion is based on an arbitrary choice of level; (4) the p-value is incompatible with the likelihood principle, and it depends on the experimental design, or equivalently on the test statistic in question; (5) it is only an informal measure of evidence against the null hypothesis.

Several common misunderstandings about p-values: (1) it is not the probability that the null hypothesis is true, nor the probability that the alternative hypothesis is false; it is not concerned with either of them; (2) it is not the probability that a finding is merely due to chance; (3) it is not the probability of falsely rejecting the null hypothesis; (4) it is not the probability that replicating the experiment would yield the same conclusion; (5) the significance level is not determined by the p-value; (6) the p-value does not indicate the size or importance of the observed effect.
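As a simple illustration of the three tail definitions, consider a hypothetical binomial test statistic, X ~ Binomial(n = 20, p = 0.5) under H_0, with observed value x = 15; in R:

n <- 20; x <- 15                                       # hypothetical data
p_right <- pbinom(x - 1, n, 0.5, lower.tail = FALSE)   # Pr(X >= x | H_0)
p_left  <- pbinom(x, n, 0.5)                           # Pr(X <= x | H_0)
p_two   <- 2 * min(p_right, p_left)                    # double-tail event
c(right = p_right, left = p_left, two.sided = p_two)   # ~0.021, ~0.994, ~0.041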

3.2) Clinical significance: in medicine and psychology, clinical significance is the practical importance of a treatment effect: whether it has a real, genuine, noticeable effect on daily life. It yields information on whether a treatment is effective enough to change a patient's diagnostic label and, in clinical treatment studies, answers the question of whether the treatment is effective enough to return the patient to the normal range. It is also a consideration when interpreting the results of a psychological assessment of an individual. Frequently, there will be a difference of scores that is statistically significant, i.e., unlikely to have occurred purely by chance.

A clear demonstration of clinical significance would be to take a group of clients who score, say, beyond +2 SDs from the mean of the normative group prior to treatment and move them to within 1 SD of that mean. The research implication of this definition is that you want to select people who are clearly disturbed for the clinical outcome study: if the mean of your untreated group is at, say, +1.2 SDs above the mean, the change due to treatment is probably not going to be viewed as clinically significant.

Clinical significance is defined by the smallest clinically beneficial and harmful values of the effect; these values are usually equal and opposite in sign. Because there is always a leap of faith in applying the results of a study to your patients (who, after all, were not in the study), perhaps a small improvement with the new therapy is not sufficient to cause you to alter your clinical approach. Note that you would almost certainly not alter your approach if the study results were not statistically significant (i.e., they could well have been due to chance). But when is the difference between two therapies large enough for you to alter your practice?


Statistics cannot fully answer this question. It is one of clinical judgment, considering the magnitude of the benefit of each treatment, the respective side-effect profiles of the two treatments, their relative costs, your comfort with prescribing a new therapy, the patient's preferences, and so on. But we can provide different ways of illustrating the benefit of treatments, for instance in terms of the number needed to treat. If a study is very large, its result may be statistically significant (unlikely to be due to chance), and yet the deviation from the null hypothesis may be too small to be of any clinical interest. Conversely, a result may not be statistically significant because the study was so small (or "underpowered"), yet the difference may be large and potentially important from a clinical point of view; you would then be wise to do another, perhaps larger, study. The smallest clinically beneficial and harmful values help define the probabilities that the true effect is clinically beneficial, trivial, or harmful (p_beneficial, p_trivial, p_harmful), and these probabilities make an effect easier to assess and to publish.

Ways to calculate clinical significance (a short sketch of the first method follows this list):

Jacobson-Truax: a common method of calculating clinical significance. It involves calculating a Reliable Change Index (RCI), which equals the difference between a participant's pre-test and post-test scores, divided by the standard error of the difference.

Gulliksen-Lord-Novick: similar to Jacobson-Truax, except that it takes regression to the mean into account. It is done by subtracting the pre-test and post-test scores from a population mean and dividing by the standard deviation of the population.

Edwards-Nunnally: a more stringent alternative to the Jacobson-Truax method for calculating clinical significance. Reliability scores are used to bring the pre-test scores closer to the mean, and then a confidence interval is developed for this adjusted pre-test score.

Hageman-Arrindell: involves indices of group change and of individual change. The reliability of change indicates whether a patient has improved, stayed the same, or deteriorated. A second index, the clinical significance of change, indicates four categories similar to those used by Jacobson-Truax: deteriorated, not reliably changed, improved but not recovered, and recovered.

Hierarchical Linear Modeling (HLM): involves growth curve analysis instead of pre-test/post-test comparisons, so three data points are needed from each patient instead of only two (pre-test and post-test).
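As promised above, here is a minimal R sketch of the Jacobson-Truax RCI, with hypothetical scores; it assumes the standard error of the difference is derived from the pre-test standard deviation and the instrument's test-retest reliability:

# Jacobson-Truax Reliable Change Index (RCI), hypothetical inputs
rci <- function(pre, post, sd_pre, reliability) {
  se_measurement <- sd_pre * sqrt(1 - reliability)  # standard error of measurement
  se_diff <- sqrt(2 * se_measurement^2)             # standard error of the difference
  (post - pre) / se_diff
}
rci(pre = 47, post = 32, sd_pre = 7.5, reliability = 0.88)  # about -4.1
# |RCI| > 1.96 is conventionally taken to indicate reliable change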


3.3) One example illustrating the use of a spreadsheet and the clinical importance of p = 0.2:

{| class="wikitable" style="text-align:center" border="1"
|-
| rowspan="2" | p value || rowspan="2" | Value of statistic || rowspan="2" | Confidence level (%) || rowspan="2" | Degrees of freedom || colspan="2" | Confidence limits || colspan="2" | Threshold for clinical chances
|-
| lower || upper || positive || negative
|-
| 0.03 || 1.5 || 90 || 18 || 0.4 || 2.6 || 1 || -1
|-
| 0.2 || 2.4 || 90 || 18 || -0.7 || 5.5 || 1 || -1
|}

{| class="wikitable" style="text-align:center" border="1"
|-
| colspan="2" | Clinically positive || colspan="2" | Clinically trivial || colspan="2" | Clinically negative
|-
| prob (%) || odds || prob (%) || odds || prob (%) || odds
|-
| 78 || 3:1 || 22 || 1:3 || 0 || 1:2071
|-
| colspan="2" | Likely, probable || colspan="2" | Unlikely, probably not || colspan="2" | (Almost certainly) not
|-
| 78 || 3:1 || 19 || 1:4 || 4 || 1:25
|-
| colspan="2" | Likely, probable || colspan="2" | Unlikely, probably not || colspan="2" | Very unlikely
|}
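The chances in this table can be reproduced approximately with a short R sketch, assuming the effect estimate follows a t distribution, backing the standard error out of the 90% confidence limits, and taking ±1 as the smallest clinically important value:

clinical_chances <- function(effect, lower, upper, df, threshold = 1) {
  se <- (upper - lower) / (2 * qt(0.95, df))   # SE recovered from the 90% limits
  p_pos <- pt((effect - threshold) / se, df)   # Pr(true effect > +threshold)
  p_neg <- pt((-threshold - effect) / se, df)  # Pr(true effect < -threshold)
  round(100 * c(positive = p_pos, trivial = 1 - p_pos - p_neg, negative = p_neg))
}
clinical_chances(1.5, 0.4, 2.6, df = 18)    # compare with the p = 0.03 row above
clinical_chances(2.4, -0.7, 5.5, df = 18)   # compare with the p = 0.2 row above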

When reporting the research, one needs to: show the observed magnitude of the effect; attend to precision of estimation by showing 90% confidence limits for the true value; show the p-value when necessary; attend to clinical, practical, or mechanistic significance by stating the smallest worthwhile value and showing the probabilities that the true effect is beneficial, trivial, and/or harmful; and make a qualitative statement about the clinical or practical significance of the effect, using terms like likely and unlikely.

One example would be a clinically trivial, statistically significant, and publishable outcome, a rare case that can arise from a large sample size and is usually misinterpreted as a worthwhile effect: (1) the observed effect of the treatment is 1.1 units (90% likely limits 0.4 to 1.8 units, p = 0.007); (2) the chances that the true effect is practically beneficial/trivial/harmful are 1/99/0%.

4) Applications

4.1) This article (http://archpsyc.jamanetwork.com/article.aspx?articleid=206036), titled Revised Prevalence Estimates of Mental Disorders in the United States, uses responses to questions on life interference, on telling a professional about symptoms, and on using medication for symptoms to ascertain the prevalence of clinically significant mental disorders in each survey. It made a revised national prevalence estimate by selecting the lower estimate of the two surveys for each diagnostic category, accounting for comorbidity, and combining categories. It concluded that establishing the clinical significance of disorders in the community is crucial for estimating treatment need, and that more work should be done in defining and operationalizing clinical significance, and in characterizing the utility of clinically significant symptoms in determining treatment need even when some criteria of the disorder are not met.

4.2) This article (http://jama.jamanetwork.com/article.aspx?articleid=187180) evaluates whether the time to completion and the time to publication of randomized phase 2 and phase 3 trials are affected by the statistical significance of the results, and describes the natural history of such trials, using a prospective cohort of randomized efficacy trials conducted by two trialist groups from 1986 to 1996. It concluded that, among randomized efficacy trials, there is a time lag in the publication of negative findings that occurs mostly after the completion of trial follow-up.

5) Software

http://graphpad.com/quickcalcs/PValue1.cfm

http://www.surveysystem.com/sscalc.htm

http://vassarstats.net/vsclin.html

6) Problems

6.1) Suppose we roll a pair of dice once and assume a null hypothesis that the dice are fair. The test statistic is the sum of the rolled numbers, and the test is one-tailed. Suppose we observe that both dice show 6, which yields a test statistic of 12. The p-value of this outcome is about 0.028 (1/36, as this is the highest test statistic out of 6×6 = 36 possible outcomes). If the researcher assumed a significance level of 0.05, what would be the conclusion from this experiment? What would be a potential problem with this experiment that could undermine that conclusion?

6.2) Suppose a researcher flips a coin some arbitrary number of times (n) and assumes a null hypothesis that the coin is fair. The test statistic is the total number of heads. Suppose the researcher observes heads on each flip, yielding a test statistic of n and a (two-sided) p-value of 2/2^n. If the coin was flipped only 5 times, the p-value would be 2/32 = 0.0625, which is not significant at the 0.05 level. But if the coin was flipped 10 times, the p-value would be 2/1024 ≈ 0.002, which is significant at the 0.05 level. What would be the problem here?

6.3) Suppose a researcher flips a coin two times and assumes a null hypothesis that the coin is unfair: it has two heads and no tails. The test statistic is the total number of heads (one-tailed). The researcher observes one head and one tail (HT), yielding a test statistic of 1 and a p-value of 0. In this case the data are inconsistent with the hypothesis: for a two-headed coin, a tail can never come up. The outcome is not simply unlikely under the null hypothesis but in fact impossible, and the null hypothesis can be definitively rejected as false. In practice such experiments almost never occur, as all data that could be observed would be possible under the null hypothesis (albeit unlikely). What if the null hypothesis were instead that the coin comes up heads 99% of the time (with the setup otherwise the same)?

7) References

http://mirlyn.lib.umich.edu/Record/004199238

http://mirlyn.lib.umich.edu/Record/004232056

http://mirlyn.lib.umich.edu/Record/004133572


Answers:

6.1) We would deem this result significant and would reject the hypothesis that the dice are fair. However, a single roll provides a very weak basis (that is, insufficient data) for drawing a meaningful conclusion about the dice. This illustrates the danger of blindly applying p-values without considering the experimental design.

6.2) In both cases the data suggest that the null hypothesis is false (that is, the coin is not fair somehow), but changing the sample size changes the p-value and hence the significance decision. In the first case the sample size is not large enough to allow the null hypothesis to be rejected at the 0.05 level of significance (in fact, the p-value can never fall below 0.05 with only 5 flips). This demonstrates that in interpreting p-values one must also know the sample size, which complicates the analysis.

6.3) The p-value would instead be approximately 0.02 (0.0199). In this case the null hypothesis could not definitely be ruled out (the outcome is unlikely under the null hypothesis, but not impossible), yet the null hypothesis would be rejected at the 0.05 level of significance, and in fact at the 0.02 level, since the outcome is less than 2% likely under the null hypothesis.
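These numbers are easy to verify in R:

1/36                              # 6.1: p-value for a sum of 12, ~0.028
2 / 2^c(5, 10)                    # 6.2: 0.0625 for n = 5; ~0.002 for n = 10
pbinom(1, size = 2, prob = 0.99)  # 6.3: Pr(at most 1 head | 99% heads) = 0.0199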

Correction for Multiple Testing

Multiple testing refers to analytical protocols involving testing of several (typically more than two) hypotheses. Multiple testing studies require correction for the type I error (false-positive) rate, which can be done using Bonferroni's method, Tukey's procedure, family-wise error rate (FWER) control, or false discovery rate (FDR) control.


Chapter II: Applied Inference

Epidemiology

Correlation and Regression (ρ and slope inference, 1-2 samples)

ROC Curve

ANOVA

Non-parametric inference

Instrument Performance Evaluation: Cronbach's α

Measurement Reliability and Validity

Survival Analysis

Decision Theory

CLT/LLNs – limiting results and misconceptions

Association Tests

Bayesian Inference

PCA/ICA/Factor Analysis

Point/Interval Estimation (CI) – MoM, MLE

Study/Research Critiques

Common mistakes and misconceptions in using probability and statistics, identifying potential assumption violations, and avoiding them

Chapter III: Linear Modeling

Multiple Linear Regression (MLR)

Generalized Linear Modeling (GLM)

Analysis of Covariance (ANCOVA)

First, see the ANOVA section above.

Multivariate Analysis of Variance (MANOVA)

Multivariate Analysis of Covariance (MANCOVA)

Repeated measures Analysis of Variance (rANOVA)

Partial Correlation

Time Series Analysis

Fixed, Randomized and Mixed Effect Models

Hierarchical Linear Models (HLM)

Multi-Model Inference

Mixture Modeling

Surveys

Longitudinal Data

Generalized Estimating Equations (GEE) Models

Model Fitting and Model Quality (KS-test)

Chapter IV: Special Topics

Scientific Visualization

PCOR/CER methods Heterogeneity of Treatment Effects

Big-Data/Big-Science

Missing data

Genotype-Environment-Phenotype associations

Medical imaging

Data Networks

Adaptive Clinical Trials

Databases/registries

Meta-analyses

Causality/Causal Inference, SEM

Classification methods

Time-series analysis

Scientific Validation

Geographic Information Systems (GIS)

Rasch measurement model/analysis

MCMC sampling for Bayesian inference

Network Analysis



