SMHS PowerSensitivitySpecificity
Contents
- 1 Scientific Methods for Health Sciences - Statistical Power, Sample-Size, Sensitivity and Specificity
- 1.1 Overview:
- 1.2 Motivation:
- 1.3 Theory
- 1.4 Power Analysis/Psychometrics
- 1.5 Statistical Power Analysis
- 1.6 Some R Practice Problems
- 1.7 Creating Power or Sample Size Plots
- 1.8 Psychometrics
- 1.9 Reliability measures:
- 1.10 Intra-class correlation (ICC)
- 1.11 Interpretation
- 1.12 Examples
- 1.13 Gazi University (Turkey) Student Evaluation Data
- 1.14 Applications
- 1.15 Software
- 1.16 Problems
- 1.17 References
Scientific Methods for Health Sciences - Statistical Power, Sample-Size, Sensitivity and Specificity
Overview:
In statistics, we have many ways to evaluate and choose tests or models. In this lecture, we are going to introduce some commonly used methods to describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. This lecture will present background knowledge of these concepts and illustrate their power and application through examples.
Motivation:
Experiments, models and tests are fundamental to the field of statistics. All researchers are faced with the question of how to choose appropriate models and set up tests. We are interested in studying some of the most commonly used characteristics, including power, effect size, sensitivity and specificity. Focusing on these characteristics will greatly help us to choose an appropriate model and understand the results. We must consider questions such as: What would be a reasonable sample size to balance the trade-off between cost and efficiency? What is the probability that a test will reject a false null hypothesis? What is the test's ability to correctly accept a true null hypothesis (specificity) or to correctly reject a false null hypothesis (sensitivity)?
Theory
Type I Error, Type II Error and Power
- Type I error: the false positive (Type I) error of rejecting the null hypothesis given that it is actually true; e.g., the purses are flagged as containing radioactive material when they actually do not.
- Type II error: the false negative (Type II) error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., the purses are flagged as not containing radioactive material when they actually do.
- Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
- Test specificity (ability of a test to correctly accept a true null hypothesis) $=\frac{d}{b+d}$.
- Test sensitivity (ability of a test to correctly reject a false null hypothesis) $=\frac{a}{a+c}$. Here $a$, $b$, $c$ and $d$ denote the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), respectively.
- The table below gives an example of calculating specificity, sensitivity, false positive rate $\alpha$, false negative rate $\beta$ and power from the values of TN, FN, FP and TP.
Test result | Actual condition absent ($H_0$ is true) | Actual condition present ($H_1$ is true) |
Negative (fail to reject $H_0$) | Condition absent + Negative result = True (accurate) Negative (TN, 0.98505) | Condition present + Negative result = False (invalid) Negative (FN, 0.00025), Type II error (proportional to $\beta$) |
Positive (reject $H_0$) | Condition absent + Positive result = False Positive (FP, 0.00995), Type I error ($\alpha$) | Condition present + Positive result = True Positive (TP, 0.00475) |
Test interpretation | Specificity: TN/(TN+FP) = 0.98505/(0.98505+0.00995) = 0.99 | Sensitivity: TP/(TP+FN) = 0.00475/(0.00475+0.00025) = 0.95; $Power = 1-\frac{FN}{FN+TP}= 0.95$ |
Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\dfrac{TP}{TP+FN}$, $\alpha=\dfrac {FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power$=1-\beta.$ Note that both (Type I ($\alpha$) and Type II ($\beta$)) errors are proportions in the range [0,1], so they represent error-rates. The reason they are listed in the corresponding cells is that they are directly proportionate to the numerical values of the FP and FN, respectively.
Note that the two alternative definitions of power are equivalent:
- power$=1-\beta$, and
- power=sensitivity
This is because power=$1-\beta=1-\frac{FN}{FN+TP}=\frac{FN+TP}{FN+TP} - \frac{FN}{FN+TP}=\frac{TP}{FN+TP}=$ sensitivity.
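As a quick numerical check, the quantities in the table can be recomputed directly from the four cell probabilities. This is a minimal base-R sketch; the variable names are chosen here for illustration only.

```r
# Cell probabilities copied from the table above
TN <- 0.98505; FN <- 0.00025
FP <- 0.00995; TP <- 0.00475

specificity <- TN / (TN + FP)   # P(negative result | condition absent)
sensitivity <- TP / (TP + FN)   # P(positive result | condition present)
alpha <- FP / (FP + TN)         # false-positive (Type I) rate
beta  <- FN / (FN + TP)         # false-negative (Type II) rate
power <- 1 - beta               # should coincide with sensitivity

round(c(specificity = specificity, sensitivity = sensitivity,
        alpha = alpha, beta = beta, power = power), 4)
```

Note that the specificity and $\alpha$ sum to 1, just as the sensitivity (power) and $\beta$ do.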
Sample size
Sample size is the number of observations or replicates included in a statistical sample. It is an important feature of any empirical study that aims to make inferences about a population. In complicated studies, there may be several different sample sizes involved: for example, in a survey using stratified sampling, there may be a different sample size for each stratum.
- Factors influencing sample size: the expense of data collection and the need to have sufficient statistical power.
- Ways to choose sample sizes: (1) expedience. Consider a simple experiment where the sample data is readily available or convenient to collect, yet the size of sample is crucial in avoiding wide confidence intervals or risks of errors in statistical hypothesis testing. (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.
- Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations, the increase in precision from a larger sample is minimal, or even nonexistent. This can result from the presence of systematic error or strong dependence in the data, or from data that follow a heavy-tailed distribution. Sample size is judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.
- Choose the sample size based on our expectation of other measures.
- Suppose we flip a coin $n$ times and estimate the proportion of heads by $\hat{p}=\frac{X}{n}$, where $X$ is the number of heads. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Under this approximation, around $95\%$ of the distribution's probability lies within 2 standard deviations of the mean. Using the Wald method with the conservative bound $p(1-p)\le 0.25$, an interval of the form $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms an approximate 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then setting $4\sqrt{\frac{0.25}{n}}=W$ and solving for $n$ gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=\frac{W}{2}$ is the error bound on the estimate, i.e., the estimate is usually given as within $\pm B$. Hence, if $B=0.1$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.
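The rule $n=\frac{1}{B^2}$ is easy to tabulate. The sketch below uses a hypothetical helper, n_for_margin(), which is not part of any package; the small tolerance only guards against floating-point round-off before rounding up.

```r
# Hypothetical helper: smallest n for which the conservative 95% interval
# p_hat +/- 2*sqrt(0.25/n) has half-width at most B
n_for_margin <- function(B) ceiling(1 / B^2 - 1e-9)

n_10 <- n_for_margin(0.10)   # error bound of 10 percentage points
n_05 <- n_for_margin(0.05)   # error bound of 5 percentage points
c(n_10 = n_10, n_05 = n_05)

# Full interval width W = 4*sqrt(0.25/n), which equals 2*B at the chosen n
4 * sqrt(0.25 / n_05)
```

Halving the error bound quadruples the required sample size, which is why very tight margins become expensive quickly.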
Proportion
A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $\sigma^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. By the CLT, an approximate 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$, which is $\frac{4\sigma}{\sqrt n}$ units wide. If we wish to have a confidence interval that is $W$ units wide, then setting $\frac{4\sigma}{\sqrt n}=W$ and solving for $n$ gives $n=\frac{16\sigma^2}{W^2}$.
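The same calculation for a mean can be sketched in base R. The helper n_for_mean_ci() and the values $\sigma=15$, $W=6$ are illustrative assumptions, not values taken from a study.

```r
# Hypothetical helper: n such that x_bar +/- 2*sigma/sqrt(n) is at most W wide
n_for_mean_ci <- function(sigma, W) ceiling(16 * sigma^2 / W^2 - 1e-9)

# Illustrative (assumed) values: sigma = 15, target interval width W = 6
n <- n_for_mean_ci(sigma = 15, W = 6)
n

# Width actually achieved with this n; it should not exceed W
achieved_width <- 4 * 15 / sqrt(n)
achieved_width
```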
Hypothesis tests
Sample size for hypothesis tests: Let $X_i,i=1,2,…,n$ be independent observations taken from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Consider the null hypothesis vs. the alternative hypothesis: $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$, for some $\mu^*>0$. Suppose we wish to:
- (1) reject $H_0$ with a probability of at least $1-\beta$ when $H_a$ is true,
- (2) reject $H_0$ with probability $\alpha$ when $H_0$ is true.
For (2), we need $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha$, so the decision rule "reject $H_0$ if the sample average is more than $\frac{z_\alpha\sigma}{\sqrt n}$" satisfies (2). Here $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution (as this is a one-sided test). For (1), we want the same rejection event to happen with probability at least $1-\beta$ when $H_a$ is true, that is, when the sample average comes from a normal distribution with mean $\mu^*$.
Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta$. Under $H_a$, each $X_i \sim N(\mu^*, \sigma^2)$, so $\bar{x} \sim N(\mu^*, \frac{\sigma^2}{n})$ and the standardized statistic satisfies $\frac{\bar{x} -\mu^*}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1)$. The rejection event $\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}$ is equivalent to $\frac{\bar{x} -\mu^*}{\frac{\sigma}{\sqrt{n}}} > z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}$, which has probability $1-\Phi\left(z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}\right)$, where $\Phi$ is the normal cumulative distribution function. Requiring this right-tail probability to be at least $1-\beta$ gives $z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}} \le \Phi^{-1}(\beta) = -\Phi^{-1}(1-\beta)$. Solving for $n$, we have $n \ge \left( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}}\right)^{2}$.
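The sample-size requirement can be checked numerically. For the one-sided test with rejection rule $\bar x > \frac{z_\alpha\sigma}{\sqrt n}$ and $\mu^*>0$, solving $P(\text{reject } H_0 | H_a) \ge 1-\beta$ for $n$ gives $n \ge \left(\frac{z_\alpha+\Phi^{-1}(1-\beta)}{\mu^*/\sigma}\right)^2$. The sketch below is base R only; both helper functions and the values $\mu^*=0.5$, $\sigma=1$ are illustrative assumptions.

```r
# Hypothetical helper: minimum n for the one-sided z-test of
# H0: mu = 0 vs Ha: mu = mu_star (> 0) with known sigma
n_for_ztest <- function(mu_star, sigma, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha)   # upper alpha point of N(0,1)
  z_power <- qnorm(power)       # Phi^{-1}(1 - beta)
  ceiling(((z_alpha + z_power) / (mu_star / sigma))^2)
}

# Power achieved by the rule "reject H0 if x_bar > z_alpha * sigma / sqrt(n)"
achieved_power <- function(n, mu_star, sigma, alpha = 0.05) {
  pnorm(qnorm(1 - alpha) - mu_star * sqrt(n) / sigma, lower.tail = FALSE)
}

n <- n_for_ztest(mu_star = 0.5, sigma = 1)
n
achieved_power(n,     mu_star = 0.5, sigma = 1)   # meets the 0.80 target
achieved_power(n - 1, mu_star = 0.5, sigma = 1)   # one fewer observation falls short
```

The second call shows that the bound is tight: dropping a single observation pushes the power below the target.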
Effect size
Effect size is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.
Other common measures
- Pearson $r$ (correlation): an effect size used when paired quantitative data are available, for instance if one were studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between two variables.
- Coefficient of determination, $ r^2 $: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1 and is always nonnegative. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.
- Eta-squared, $ \eta^2 $, describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to the $ r^2 $. It is a biased estimator of the variance explained by the model in the population. $ \eta^2=\frac{SS_{treatment}} {SS_{total}} $ .
- Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment}\cdot MS_{error}}{SS_{total}+MS_{error}}$. Because it is less biased, $\omega^2$ is preferable to $\eta^2$; however, it can be more inconvenient to calculate for complex analyses.
- Cohen’s $ f^2 $: one of several effect size measures to use in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$, where $R^2$ is the squared multiple correlation.
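To make the variance-explained measures concrete, the sketch below computes $\eta^2$, $\omega^2$ and $f^2$ from a one-way ANOVA table. It uses R's built-in PlantGrowth data purely as an illustration (this dataset is not part of the lecture), and it treats $\eta^2$ as the $R^2$ of the one-way model when forming $f^2$.

```r
# One-way ANOVA on R's built-in PlantGrowth data (plant weight by treatment group)
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]          # ANOVA table: row 1 = group, row 2 = residuals

ss_treat <- tab[["Sum Sq"]][1]
ss_error <- tab[["Sum Sq"]][2]
df_treat <- tab[["Df"]][1]
ms_error <- tab[["Mean Sq"]][2]
ss_total <- ss_treat + ss_error

eta2   <- ss_treat / ss_total
omega2 <- (ss_treat - df_treat * ms_error) / (ss_total + ms_error)
f2     <- eta2 / (1 - eta2)       # Cohen's f^2, taking R^2 = eta^2 for this design

round(c(eta2 = eta2, omega2 = omega2, f2 = f2), 3)
```

As expected for a less biased estimator, $\omega^2$ comes out smaller than $\eta^2$ on the same data.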
Power Analysis/Psychometrics
Questions:
- Why are sample sizes important and how do they influence our ability to design experiments tailored to detect effects of a certain magnitude? Recall the first (LLN) and the second (CLT) fundamental laws of probability theory^{1}.
- How to assess the reliability and validity of newly introduced psychometric instruments?
^{1}http://wiki.socr.umich.edu/index.php/EBook#Chapter_VI:_Relations_Between_Distributions
Statistical Power Analysis
Power analysis is a technique employed in many experimental designs to determine the relation between the sample size of the study, specific effect sizes, and the ability to detect these effects from sample data given its size and a degree of confidence. In a nutshell, the statistical power is a proxy for the probability of detecting an effect (signal, discrepancy, misalignment, etc.) of a given size, provided we start with a given level of confidence and some sample size constraints. If this probability is very low (e.g., power<0.5), the study may be under-powered, which may require a redesign or reformulation of the experiment. If the power is very high (e.g., power>0.999), then we can trade off various parameters in the power calculations (e.g., lower the sample size while still preserving high power to detect effects).
Important components of power calculations^{2}:
1. The sample size (either investigator controlled, or estimated)
2. The effect size (must come from outside the study; we can’t use the same data to estimate it. E.g., prior publications, previous reports of expected effects, etc.)
3. The significance level, $α$ = P(Type I error) = probability of finding a (spurious) effect that is not there in reality, but is due to random chance alone
4. The statistical power, $1-β$ = 1 - P(Type II error) = probability of finding an effect that is there.
In some specific experimental designs (but not always), given any three of these components, we can determine the fourth. For instance, we can estimate the (minimum) sample size that yields a power of at least, say, $1-β$=0.8.
^{2}http://wiki.socr.umich.edu/index.php/SMHS_PowerSensitivitySpecificity
R power analysis
- The pwr package includes one implementation of power analysis. Some of its core functions include:
Function | Purpose |
pwr.2p.test | Two proportions (equal n) |
pwr.2p2n.test | Two proportions (unequal n) |
pwr.anova.test | Balanced one way ANOVA |
pwr.chisq.test | Chi-square test |
pwr.f2.test | GLM (general linear model) |
pwr.p.test | Proportion (one sample) |
pwr.r.test | Correlation |
pwr.t.test | T-tests (one sample, 2 sample, paired) |
pwr.t2n.test | T-test (two samples with unequal n) |
Providing 3 out of the 4 parameters (effect size, sample size, significance level, and power) will allow the calculation of the remaining component.
- A common value for the significance level is α=0.05, indicating a default false-positive rate of 1:20.
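- In R, exactly one of the four parameters is left as NULL and is solved for from the other three. For example, supply the effect size, sample size and significance level to compute the power. Because sig.level has a non-NULL default (0.05), it must be set explicitly with "sig.level=NULL" when the significance level is the quantity to be computed.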
- In general, identifying the best (most appropriate) effect size could be a huge problem as the power calculations are very sensitive to it.
- Cohen's suggestions provide a first-order approach for determining the effect size; however, previously published research and reproduced findings may yield more reliable estimates. Cohen’s protocol interprets (effect-size) d values of 0.2, 0.5, and 0.8 as small, medium, and large effect sizes.
Experimental Designs
T-tests ^{3}
# install.packages("pwr")
library("pwr")
Equal sample-sizes
# General form: pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired"))
pwr.t.test(n =100 , d = 0.5, sig.level = 0.05, type = "two.sample") # compute power
pwr.t.test(n =100 , power=0.8, sig.level=0.05, type = "two.sample") # compute effect-size
pwr.t.test(d = 0.5, power=0.8, sig.level=0.05, type = "one.sample") # compute sample-size
- n is the sample size,
- d is the effect size,
- type specifies two-sample t-test, one-sample t-test or paired t-test.
^{3}http://wiki.socr.umich.edu/index.php/SMHS_HypothesisTesting
Unequal sample-sizes
- pwr.t2n.test(n1 = , n2 = , d = , sig.level = , power = , alternative = c("two.sided", "less", "greater"))
- alternative indicates a two-tailed or one-tailed test. A two-tailed test is the default.
- Note that “type” is not specified, since pwr.t2n.test always assumes two independent samples.
pwr.t2n.test(n1=100 , n2=20 , d= 0.5, sig.level=0.05, alternative="less")
- n1 and n2 are the sample sizes of the 2 independent cohorts
- For t-tests, the effect size is assessed as:
\begin{equation} d=\frac{|\mu_1-\mu_2|}{\sigma}, \end{equation}
where $\mu_1$ and $\mu_2$ are the group 1 and 2 means, and $\sigma^2$ is the common error variance.
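The effect size $d$ and the resulting sample size can be sketched without any add-on package. The example below uses stats::power.t.test() from base R rather than pwr; calling it with delta = d and sd = 1 expresses the effect on the standardized (Cohen's d) scale. The means and standard deviation are illustrative assumptions.

```r
# Illustrative (assumed) population values
mu1 <- 105; mu2 <- 100; sigma <- 10
d <- abs(mu1 - mu2) / sigma        # Cohen's d; 0.5 is a "medium" effect

# Per-group n for 80% power in a two-sided, two-sample t-test at alpha = 0.05
res <- power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(res$n)      # round up to a whole subject
n_per_group
```

Because the returned n is fractional, it is rounded up so that the realized power is not below the target.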
1-Way ANOVA^{4}
# pwr.anova.test(k = , n = , f = , sig.level = , power = )
pwr.anova.test(k = 5, n = 100, f =0.1 , sig.level =0.05)
pwr.anova.test(k = 5, f =0.1 , sig.level =0.05, power=0.8)
where k is the number of groups and $n$ is the common sample size in each group, and the (ANOVA) effect size (f) is:
\begin{equation} f=\sqrt{\frac{\sum_{i=1}^{k}p_i(\mu_i-\mu)^2}{\sigma^2}}, \end{equation}
where $n_i$ is the number of observations in group $i$, $N$ is the total number of observations, $p_i=\frac{n_i}{N}$, $\mu_i$ is the group $i$ mean, $\mu$ is the overall grand mean, and $\sigma^2$ is the error variance within groups.
- In Cohen’s protocol, effect-size (f) values of 0.1, 0.25, and 0.4 represent small, medium, and large effects.
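The ANOVA effect size $f$ can be computed directly from hypothesized group means. The sketch below assumes a balanced design with five illustrative group means and a common within-group standard deviation (all assumed values), and cross-checks the required per-group sample size with base R's stats::power.anova.test(), which is parameterized by the between-group and within-group variances rather than by $f$.

```r
# Illustrative (assumed) group means, balanced design, common within-group SD
mu    <- c(50, 52, 54, 56, 58)
sigma <- 10
p     <- rep(1 / length(mu), length(mu))   # p_i = n_i / N for equal group sizes

grand_mean <- sum(p * mu)
f <- sqrt(sum(p * (mu - grand_mean)^2) / sigma^2)
f

# Base-R cross-check: between.var is the variance of the group means
# (divisor k - 1), within.var is the within-group error variance
res <- power.anova.test(groups = length(mu), between.var = var(mu),
                        within.var = sigma^2, sig.level = 0.05, power = 0.80)
ceiling(res$n)   # per-group sample size
```

Here $f \approx 0.28$, slightly above Cohen's "medium" benchmark of 0.25.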
^{4}http://wiki.socr.umich.edu/index.php/SMHS_ANOVA
Correlations
# pwr.r.test(n = , r = , sig.level = , power = )
pwr.r.test(n = 40, r = 0.4, sig.level = 0.05)
where n is the sample size and r is the expected correlation (analogue of the effect-size above).
Again, the Cohen’s protocol classifies r values of 0.1, 0.3, and 0.5 as small, medium, and large correlations.
Linear Models^{5}
- For all linear models (e.g., multiple regression) we can use:
# pwr.f2.test(u = , v = , f2 = , sig.level = , power = )
where u and v are the numerator and denominator degrees of freedom, f2 is the effect size measure, and $R^2$ is the population squared multiple correlation.
- When evaluating the impact of a set of predictors on an outcome, we use this f2:
$f^2=\frac{R^2}{1-R^2}$
- Alternatively, when evaluating the impact of one set of predictors (A) above and beyond a second set of predictors or covariates (B), we use this f2:
$f^2=\frac{{R^2}_{AB}-{R^2}_A}{1-{R^2}_{AB}},$
where $R^2_{AB}$ and $R^2_A$ represent the variances accounted for in the population jointly by sets A and B, or solely by set A.
- Cohen’s protocol classifies f2 values of 0.02, 0.15, and 0.35 as small, medium, and large effect sizes.
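Both versions of $f^2$ are simple to compute, and the power of the corresponding F test can be approximated from the noncentral F distribution. The sketch below is base R only; the $R^2$ values, the number of predictors and the sample size are illustrative assumptions, and the noncentrality convention $\lambda=f^2(u+v+1)$ is an assumption of this sketch rather than something stated in the text.

```r
# Illustrative (assumed) values: 3 predictors, population R^2 = 0.13, n = 100
R2 <- 0.13
u  <- 3                 # numerator df = number of predictors tested
n  <- 100
v  <- n - u - 1         # denominator (error) df

f2 <- R2 / (1 - R2)     # Cohen's f^2 for the full set of predictors

# Power of the overall F test at alpha = 0.05 via the noncentral F distribution
lambda <- f2 * (u + v + 1)                 # assumed noncentrality convention
f_crit <- qf(0.95, df1 = u, df2 = v)
power  <- pf(f_crit, df1 = u, df2 = v, ncp = lambda, lower.tail = FALSE)
c(f2 = f2, power = power)

# Incremental effect of set B above and beyond set A (assumed R^2 values)
R2_A <- 0.10; R2_AB <- 0.13
f2_increment <- (R2_AB - R2_A) / (1 - R2_AB)
f2_increment
```

With these values $f^2 \approx 0.15$, a "medium" effect in Cohen's classification, while the incremental $f^2$ is only slightly above the "small" benchmark of 0.02.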
^{5}http://wiki.socr.umich.edu/index.php/SMHS_MLR
Tests of Proportions^{6}
Equal sample-sizes
# pwr.2p.test(h = , n = , sig.level = , power = )
pwr.2p.test(h = 0.4, n = 50, sig.level = 0.05)
where h is the effect size, n is the common sample size in each group, and $p_1$ and $p_2$ are the 2 sample proportions.
\begin{equation} h=2 \arcsin(\sqrt{p_1})-2 \arcsin(\sqrt{p_2}) \end{equation}
Cohen’s protocol declares h values of 0.2, 0.5, and 0.8 as small, medium, and large effect sizes.
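The effect size $h=2\arcsin(\sqrt{p_1})-2\arcsin(\sqrt{p_2})$ can be computed from two hypothesized proportions, and the power of the two-sided test then follows from a normal approximation. The helper cohen_h() and the proportions 0.65 and 0.45 are illustrative assumptions; the power formula below is a normal-approximation sketch, not a call into the pwr package.

```r
# Hypothetical helper: Cohen's h for two proportions
cohen_h <- function(p1, p2) 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

h <- cohen_h(0.65, 0.45)   # illustrative (assumed) proportions
h

# Normal-approximation power of a two-sided test with n subjects per group
n <- 50; alpha <- 0.05
z_crit <- qnorm(1 - alpha / 2)
shift  <- h * sqrt(n / 2)                       # mean of the test statistic under H1
power  <- pnorm(shift - z_crit) + pnorm(-shift - z_crit)
power
```

An effect of $h \approx 0.4$ with 50 subjects per group yields power only slightly above one half, which illustrates how under-powered a seemingly reasonable design can be.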
^{6}http://wiki.socr.umich.edu/index.php/SMHS_HypothesisTesting
Unequal n's
# pwr.2p2n.test(h = , n1 = , n2 = , sig.level = , power = )
pwr.2p2n.test(h = 0.6, n1 = 40, n2 = 30, sig.level = 0.05)
Single proportion test
# pwr.p.test(h = , n = , sig.level = , power = )
pwr.p.test(h = 0.4, sig.level = 0.05, power = 0.8)
- For one or two sample proportion tests, we can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed, or one-tailed test (two-tailed test is default).
Chi-square Tests
# pwr.chisq.test(w = , N = , df = , sig.level = , power = )
pwr.chisq.test(w = 0.5, df = 8, sig.level = 0.05, power = 0.8)
where N is the total sample size, df is the degrees of freedom and w is the effect size:
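\begin{equation} w=\sqrt{\sum_{i=1}^{m}\frac{(p_{0i}-p_{1i})^2}{p_{0i}}}, \end{equation}
where $m$ is the number of cells, and $p_{0i}$ and $p_{1i}$ are the probabilities of cell $i$ under $H_0$ and $H_1$, respectively.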
Cohen’s protocol interprets w values of 0.1, 0.3, and 0.5 as small, medium, and large effect sizes.
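Cohen's effect size for chi-square tests is $w=\sqrt{\sum_i (p_{0i}-p_{1i})^2/p_{0i}}$, where $p_{0i}$ and $p_{1i}$ are the cell probabilities under $H_0$ and $H_1$. A base-R sketch with assumed cell probabilities for a four-category goodness-of-fit test follows; the power is obtained from the noncentral chi-square distribution with noncentrality $N w^2$, which is the convention assumed here.

```r
# Illustrative (assumed) cell probabilities for a 4-category goodness-of-fit test
p0 <- c(0.25, 0.25, 0.25, 0.25)   # under H0
p1 <- c(0.40, 0.30, 0.20, 0.10)   # under H1

w <- sqrt(sum((p0 - p1)^2 / p0))  # Cohen's w
w

# Power at alpha = 0.05 from the noncentral chi-square with ncp = N * w^2
N  <- 100
df <- length(p0) - 1
crit  <- qchisq(0.95, df = df)
power <- pchisq(crit, df = df, ncp = N * w^2, lower.tail = FALSE)
power
```

Here $w \approx 0.45$, between Cohen's "medium" and "large" benchmarks, so a total sample of 100 already gives high power.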
Some R Practice Problems
library(pwr)
# For a one-way ANOVA comparing 5 groups, calculate the
# sample size needed in each group to obtain a power of
# 0.75, when the effect size is moderate (0.22) and a
# significance level of 0.05 is employed.
pwr.anova.test(k=5,f=.22,sig.level=.05,power=.75)
# What is the power of a one-tailed t-test, with a
# significance level of 0.01, 25 people in each group,
# and an effect size equal to 0.75?
pwr.t.test(n=25,d=0.75,sig.level=.01, alternative="greater")
# Using a two-tailed test of proportions, and assuming a
# significance level of 0.02 and a common sample size of
# 20 for each proportion, what effect size can be detected
# with a power of .8?
pwr.2p.test(n=20, sig.level=0.02,power=0.8)
Creating Power or Sample Size Plots
The functions in the pwr package can be used to generate power and sample size graphs.
# Plot sample size curves for detecting correlations of
# various sizes.
library(pwr)
# range of correlations
r <- seq(.1, .5, .01)
nr <- length(r)
# power values
p <- seq(.4,.9,.1)
np <- length(p)
# obtain sample sizes
samsize <- array(numeric(nr*np), dim=c(nr,np))
for (i in 1:np){
  for (j in 1:nr){
    result <- pwr.r.test(n = NULL, r = r[j], sig.level = .05, power = p[i], alternative = "two.sided")
    samsize[j,i] <- ceiling(result$n)
  }
}
# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n", xlab="Correlation Coefficient (r)", ylab="Sample Size (n)" )
# add power curves
for (i in 1:np) {
  lines(r, samsize[,i], type="l", lwd=3, col=colors[i])
}
# add annotation (grid lines, title, legend)
abline(v=0, h=seq(0,yrange[2], 50), lty=2, col="grey")
abline(h=0, v=seq(xrange[1],xrange[2], 0.02), lty=2, col="grey")
title("Sample Size Estimation for Correlation Studies Sig=0.05 (Two-tailed)")
legend("topright", title="Power", as.character(p), fill=colors)
Psychometrics
Psychometric studies rely on social and psychological measurements to objectively assess skills, knowledge (e.g., IQ tests), abilities, attitudes, personality traits (e.g., ADL), etc. Construction and validation of assessment instruments (e.g., questionnaires, tests, and personality tests), item response theory, and intra-class correlation are examples of psychometric techniques.
Psychometric research focuses on 2 complementary directions:
- Design of instruments
- Deployment of procedures and analytic strategies for interpreting the data generated by the instruments
Reliability and validity are core psychometric components. Reliable measures are consistent across time, individuals, and situations, whereas valid measures assess the intended process (instead of tracking a tangential characteristic). Validity implies reliability (reliability is necessary, but not sufficient, for validity), but not vice versa. Cronbach's α is the most commonly used index of reliability, along with the intra-class correlation, which is the ratio of the variance of measurements of a given target to the variance of all targets.
Typical psychometric methods for analyzing variance/covariance (correlations and covariance) matrices include factor analysis (e.g., to determine the underlying dimensions of data), multidimensional scaling (e.g., to find a simple representation for data with a large number of latent dimensions), clustering (e.g., classifying similar and dissimilar objects), structural equation modeling (e.g., strength and direction of predictive relations in latent variables), and path analysis (special case of SEMs). We have previously discussed some of these^{7}.
Reliability measures:
Cronbach's $α$ ^{8}
- Cronbach’s alpha $α$ is a coefficient of internal consistency that is commonly used as an estimate of the reliability of a psychometric test. Internal consistency measures the correlations between different items on the same test. The internal consistency measures whether several items that propose to assess the same general construct produce similar scores. Cronbach’s $α$ is widely used in the social science, nursing, business and other areas.
- Suppose we measure a quantity $X$, which is a sum of $K$ components: $X=Y_1+Y_2+\dots+Y_K$. Then Cronbach’s α is defined as $\alpha=\frac{K}{K-1}\Bigg(1-\frac{\sum_{i=1}^{K}\sigma_{Y_i}^2}{\sigma_X^2}\Bigg),$ where $\sigma_X^2$ is the variance of the observed total test scores, and $\sigma_{Y_i}^2$ is the variance of component $i$ for the current sample.
- When the items are scored 0 or 1, then $\alpha=\frac{K}{K-1}\Bigg(1-\frac{\sum_{i=1}^{K}P_iQ_i}{\sigma_X^2}\Bigg),$ where $P_i$ is the proportion scoring 1 on item $i$ and $Q_i=1-P_i.$
- Alternatively, Cronbach’s $α$ can be defined as $\alpha=\frac{K\bar{c}}{\bar{v}+(K-1)\bar{c}}$, where $\bar{v}$ is the average variance of each component and $\bar{c}$ is the average of all covariances between the components across the current sample of persons.
- The standardized Cronbach’s $α$ can be defined as $\alpha=\frac{K\bar{r}}{1+(K-1)\bar{r}}$, where $\bar{r}$ is the mean of the $\frac{K(K-1)}{2}$ non-redundant correlation coefficients (i.e., the mean of an upper triangular, or lower triangular, correlation matrix).
- Theoretically, $0≤α≤1$, since it is a ratio of two variances.
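The variance form and the covariance form of Cronbach's $α$ are algebraically identical, which can be confirmed numerically. The sketch below uses a small made-up score matrix (6 respondents, $K=4$ items), which is a hypothetical example and not data from this lecture, and also computes the standardized version from the mean inter-item correlation.

```r
# Hypothetical scores: 6 respondents (rows) on K = 4 items (columns)
scores <- matrix(c(4, 5, 4, 5,
                   2, 3, 2, 2,
                   5, 5, 4, 4,
                   3, 3, 3, 4,
                   1, 2, 2, 1,
                   4, 4, 5, 5), ncol = 4, byrow = TRUE)
K <- ncol(scores)

# Variance form: K/(K-1) * (1 - sum of item variances / variance of total score)
alpha_var <- K / (K - 1) * (1 - sum(apply(scores, 2, var)) / var(rowSums(scores)))

# Covariance form: K*c_bar / (v_bar + (K-1)*c_bar)
C     <- cov(scores)
v_bar <- mean(diag(C))            # average item variance
c_bar <- mean(C[upper.tri(C)])    # average inter-item covariance
alpha_cov <- K * c_bar / (v_bar + (K - 1) * c_bar)

# Standardized alpha from the mean inter-item correlation
R     <- cor(scores)
r_bar <- mean(R[upper.tri(R)])
alpha_std <- K * r_bar / (1 + (K - 1) * r_bar)

c(alpha_var = alpha_var, alpha_cov = alpha_cov, alpha_std = alpha_std)
```

The first two values agree exactly; the standardized value differs slightly because it rescales every item to unit variance before averaging.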
Interpretation
Internal consistency is a measure of whether several items that purport to measure the same general construct produce similar scores. It is usually measured with Cronbach’s $α$, which is calculated from the pairwise correlations between items. Internal consistency can take values from negative infinity to 1. It is negative when there is greater within-subject variability than between-subject variability. Only positive values of Cronbach’s $α$ make sense. Cronbach’s $α$ will generally increase as the inter-correlations among the items tested increase.
Cronbach's $α$ | Internal consistency |
$α ≥ 0.9$ | Excellent (High-Stakes testing) |
$0.7 ≤ α < 0.9$ | Good (Low-Stakes testing) |
$0.6 ≤ α < 0.7$ | Acceptable |
$0.5 ≤ α < 0.6$ | Poor |
$α < 0.5 $ | Unacceptable |
Drawbacks of Cronbach’s $α$
- It is dependent not only on the magnitude of the correlations among items, but also on the number of items in the scale. So, a scale can be made to look more homogeneous simply by increasing the number of items, even though the average correlation may remain unchanged;
- If two scales each measuring a distinct aspect are combined to form a long scale, $α$ would probably be high though the merged scale is obviously tapping two different attributes;
- If $α$ is too high, then it may reflect a high level of item redundancy, instead of instrument reliability.
There are other metrics, e.g., Kuder–Richardson Formula 20 (KR-20), Split-Half Reliability, and Standard Error of Measurement (SEM) (http://wiki.socr.umich.edu/index.php/SMHS_Cronbachs)
^{7}https://umich.instructure.com/courses/38100/files/
^{8}http://wiki.socr.umich.edu/index.php/SMHS_Cronbachs
Intra-class correlation (ICC)
- The Intra-class correlation coefficient (ICC) assesses the consistency, or reproducibility, of quantitative measurements made by different observers measuring the same quantity. The ICC is defined as the ratio of between-cluster variance to total variance:
\begin{equation} ICC=\frac{Variance\_Due\_To\_Rated\_Subjects(Patients)}{(Variance\_Due\_To\_Subjects)+(Variance\_Due\_To\_Judges)+(Residual\_Variance)}. \end{equation}
- Ronald Fisher proposed the intra-class correlation. For a dataset consisting of $N$ cases and $K$ test-questions, with values $\{x_{n,1},x_{n,2},x_{n,3},\dots,x_{n,K}\}$ for $n = 1,...,N$:
\begin{equation} ICC=\frac{K}{(K-1)N}\times\frac{\sum_{n=1}^N(\bar{x}_n-\bar{x})^2}{s^2}-\frac{1}{K-1}, \end{equation}
where $\bar{x}_n$ is the sample mean of the $n^{th}$ group, $\bar{x}$ is the overall mean, and $s^2$ is the pooled variance of all $NK$ values.
- Note: The difference between this ICC and the inter-class (Pearson) correlation is that to compute the ICC, the data are pooled to obtain the mean and variance estimates.
Interpretation
Note that $-\frac{1}{K-1}≤ICC≤1.$
ICC | Interpretation |
$-\frac{1}{K-1}≤ICC≤0$ | no interpretation, two members chosen randomly from the same class vary almost as much as any two randomly chosen members of the whole population |
$0≤ICC≤0.2$ | poor agreement |
$0.3≤ICC≤0.4$ | fair agreement |
$0.5≤ICC≤0.6$ | moderate agreement |
$0.7≤ICC≤0.8$ | strong agreement |
$0.8<ICC$ | almost perfect agreement |
Examples
Nursing study of depression
In a study, $K=4$ nurses rated $N=6$ patients on a 10-point depression scale:
PatientID | NurseRater1 | NurseRater2 | NurseRater3 | NurseRater4 |
1 | 9 | 2 | 5 | 8 |
2 | 6 | 1 | 3 | 2 |
3 | 8 | 4 | 6 | 8 |
4 | 7 | 1 | 2 | 6 |
5 | 10 | 5 | 6 | 9 |
6 | 6 | 2 | 4 | 7 |
# install.packages("ICC")
library("ICC")
# load the data “09_NursingStudyDepression_LongData.csv”
dataset <- read.csv('https://umich.instructure.com/files/674384/download?download_frd=1', header = TRUE) # load the long formatted dataset
# remove the first columns (Patient ID number)
dataset <- dataset[,-1]
attach(dataset)
dataset
# Sample-size estimation:
# Compute the N individuals/groups required to estimate the ICC with a
# desired confidence interval, assuming k measures per individual/group and
# an ICC estimation type. Nest() calculates the N individuals/groups necessary
# to obtain a desired confidence interval width w.
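# x is the grouping (ID) column and y holds the measurements, matching the ICCest() call below
Nest(est.type="p", w=0.14, k=4, x=as.factor(Nurse), y=Rating, data=dataset, alpha=0.05)
# General form: Nest(est.type = c("hypothetical", "pilot"), w, ICC = NULL, k = NULL, x = NULL, y = NULL, data = NULL, alpha = 0.05)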
Argument | Description |
est.type | character string of either "hypothetical" indicating usage of the given values of k and ICC or if "pilot" is specified then to calculate these from the dataset provided. Just the first letter may be used |
w | desired width of the confidence interval about the ICC estimate |
ICC | expected intraclass correlation coefficient |
k | number of measurements per individual or group |
x | column name of data indicating the individual or group ID from a pilot study |
y | column name of data indicating the measurements from a pilot study |
data | a data.frame from a pilot experiment |
alpha | the alpha level to use when estimating the confidence interval |
N-est Output | return value is a data.frame with rows representing the values of the specified ICCs and the columns yield the different k values. |
icc <- ICCest(x=as.factor(Nurse), y=Rating, data=dataset, alpha=0.05)
icc$UpperCI - icc$LowerCI # confidence interval width
icc
# Estimate the Intraclass Correlation Coefficient using the variance components from a one-way ANOVA.
# General form: ICCest(x, y, data = NULL, alpha = 0.05, CI.type = c("THD", "Smith"))
Argument | Description |
x | column name indicating individual or group id in the dataframe data |
y | column name indicating measurements in the dataframe data |
data | a dataframe containing x and y |
alpha | the alpha level to use when estimating the confidence interval. Default is 0.05. |
CI.type | the particular confidence interval to estimate. CIs of the type "THD" are based upon the exact confidence limit equation in Searle (1971) and can be used for unbalanced data. CIs of the type "Smith" are based upon the approximate formulas for the standard error of the ICC estimate |
Results
Output | Description |
ICC | intraclass correlation coefficient |
LowerCI | the lower confidence interval limit, where the confidence level is set by alpha |
UpperCI | the upper confidence interval limit, where the confidence level is set by alpha |
N | the total number of individuals or groups used in the analysis |
k | number of measurements per individual or group. In an unbalanced design, k is always less than the mean number of measurements per individual/group and is calculated using the equation in Lessells and Boag (1987) |
varw | the within individual or group variance |
vara | the among individual or group variance |
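The outputs varw and vara above are the variance components of a one-way ANOVA, and the ICC is their ratio vara / (vara + varw). The following is a minimal base-R sketch of that calculation for a balanced design. The data frame `toy` and its columns `id` and `score` are hypothetical, invented purely for illustration, and the sketch assumes the usual one-way formulas rather than reproducing the internals of ICCest().

```r
# Hypothetical balanced toy data: 5 subjects (groups), 3 measurements each.
toy <- data.frame(
  id    = factor(rep(1:5, each = 3)),
  score = c(10, 11, 12,  20, 19, 21,  30, 31, 29,  15, 14, 16,  25, 26, 24)
)

k   <- 3                                # measurements per subject (balanced design)
fit <- aov(score ~ id, data = toy)      # one-way ANOVA with subject as the factor
ms  <- summary(fit)[[1]][["Mean Sq"]]   # c(among-subject MS, within-subject MS)

varw <- ms[2]                           # within-subject variance
vara <- (ms[1] - ms[2]) / k             # among-subject variance component
icc_manual <- vara / (vara + varw)      # ICC as a ratio of variance components
icc_manual
```

In this toy example the subjects differ far more between themselves than within repeated measurements, so the resulting ICC is close to 1.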
Gazi University (Turkey) Student Evaluation Data
R package psy (https://cran.r-project.org/web/packages/psy/psy.pdf)
## cronbach(v1)
## v1 is an n*p matrix or data frame with n subjects and p items.
## This function computes Cronbach’s reliability coefficient alpha.
## This coefficient may be applied to a series of items aggregated in a single score.
## It estimates reliability in the framework of the domain sampling model.
An example to calculate Cronbach’s alpha:
# install.packages("psy")
library(psy)
# 09_TurkiyeCourseEvalData.csv
dataset <- read.csv('https://umich.instructure.com/files/687890/download?download_frd=1', header = TRUE) # load the long formatted dataset
# remove the first five columns
dataset <- dataset[,-c(1:5)]
attach(dataset)
colnames(dataset) # check the column names to confirm which indices hold the question items
cronbach(dataset[,5:32]) ## choose columns 5 to 32 (Q1 to Q28) and calculate the Cronbach’s alpha value
# contrast this to (smaller subsample):
cronbach(dataset[1:10,5:7])
Note: if the column rankings differ (e.g., a high value of Q1 indicates great and a low value indicates poor, but Q2 has these reversed), we need to transform all columns to have the same, consistent interpretation prior to computing the Cronbach’s alpha.
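To illustrate why consistent item direction matters, here is a minimal base-R sketch that computes Cronbach's alpha from its standard definition and compares the value before and after re-coding a reversed item. The function name `cronbach_alpha_manual`, the toy responses and the 1 to 5 scale are assumptions made for illustration only. Unlike cronbach() in the psy package, this sketch assumes complete data and does no missing-value handling.

```r
# Cronbach's alpha from its standard definition:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
cronbach_alpha_manual <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_vars <- apply(items, 2, var)      # variance of each item
  total_var <- var(rowSums(items))       # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Hypothetical 1-5 Likert responses; q3 is deliberately reverse-keyed.
base  <- c(1, 2, 3, 4, 5, 2, 4, 3, 5, 1)
items <- data.frame(q1 = base, q2 = base, q3 = 6 - base)

alpha_raw <- cronbach_alpha_manual(items)        # the reversed item distorts alpha

items_fixed <- items
items_fixed$q3 <- 6 - items_fixed$q3             # re-code: (max + min) - value
alpha_fixed <- cronbach_alpha_manual(items_fixed)

c(raw = alpha_raw, recoded = alpha_fixed)
```

With the reversed item left in place the coefficient is badly distorted (it can even be negative), while after re-coding the three identical items give an alpha of 1.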
## to obtain a 95% (bootstrap-based) confidence interval^{10}:
library(boot)
cronbach.boot <- function(data, x) { cronbach(data[x,])[[3]] }
This selects the computed alpha, which is the 3^{rd} element of the list returned by cronbach() (after the sample size and the number of items).
Next, generate bootstrap replicates of the statistic (in this case Cronbach’s alpha) applied to the dataset.
BCI <- boot(dataset, cronbach.boot, 1000) # 1000 = number of bootstrap replicates
# Print estimate, bias, precision
BCI
Finally, compute the bootstrap confidence intervals (BCI):
quantile(BCI$t, c(0.025,0.975)) # sample quantiles corresponding to the given probabilities.
# The smallest observation corresponds to a probability of 0 and
# the largest to a probability of 1.
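The percentile bootstrap used above can also be sketched in base R without the boot package: resample the rows with replacement, recompute the statistic each time, and take the 2.5% and 97.5% quantiles of the replicates. The simulated items, the helper `alpha_stat` and the number of replicates `B` below are all hypothetical choices for illustration, not part of the course dataset.

```r
set.seed(1)                                   # for reproducibility of the sketch

# Hypothetical data: 3 items sharing a common latent trait plus noise.
n     <- 200
trait <- rnorm(n)
items <- data.frame(q1 = trait + rnorm(n),
                    q2 = trait + rnorm(n),
                    q3 = trait + rnorm(n))

# Statistic to bootstrap: Cronbach's alpha from its standard definition.
alpha_stat <- function(d) {
  k <- ncol(d)
  (k / (k - 1)) * (1 - sum(apply(d, 2, var)) / var(rowSums(d)))
}

B <- 500                                      # number of bootstrap replicates
boot_alpha <- replicate(B, {
  idx <- sample(n, replace = TRUE)            # resample row indices with replacement
  alpha_stat(items[idx, ])
})

ci <- quantile(boot_alpha, c(0.025, 0.975))   # percentile confidence interval
ci
```

The boot() function does the same resampling internally by passing the index vector as the second argument of the user-supplied statistic, which is why cronbach.boot takes (data, x).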
Compute the Intra-class correlation coefficient (ICC)
icc <- icc(dataset[1:1000,]) # use only the first 1000 rows to reduce the computational complexity and get quick output
# The argument "dataset" (in this case the Turkiye course evals data) is an n*p matrix
# or data frame, n subjects times p raters. Missing data are omitted in a listwise way.
icc[[6]] # $icc.consistency: intra-class correlation coefficient, "consistency" version
icc[[7]] # $icc.agreement: intra-class correlation coefficient, "agreement" version
# The "agreement" ICC is the ratio of the subject variance to the sum of the subject
# variance, the rater variance and the residual.
# The "consistency" ICC is the ratio of the subject variance to the sum of the
# subject variance and the residual; it may be of interest when estimating the
# reliability of pre/post variations in measurements
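The two ratios described in the comments above can be written compactly. The symbols $\sigma^2_{s}$ (subject variance), $\sigma^2_{r}$ (rater variance) and $\sigma^2_{e}$ (residual variance) are notation introduced here for illustration; they do not appear in the package output.

```latex
\mathrm{ICC}_{\text{agreement}} = \frac{\sigma^2_{s}}{\sigma^2_{s} + \sigma^2_{r} + \sigma^2_{e}},
\qquad
\mathrm{ICC}_{\text{consistency}} = \frac{\sigma^2_{s}}{\sigma^2_{s} + \sigma^2_{e}}
```

Because the rater variance is non-negative, the agreement version can never exceed the consistency version; the two coincide when the raters have no systematic differences ($\sigma^2_{r} = 0$).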
^{10} http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0019178
Problem: compare the ICCs for class=1, class=2, and class=3
# table(dataset$Class)
dataset <- read.csv('https://umich.instructure.com/files/687890/download?download_frd=1', header = TRUE)
attach(dataset); colnames(dataset)
class1_data <- dataset[which(dataset$Class=='1'),]; head(class1_data); dim(class1_data)
Applications
- This article, titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviewed the importance of sample size in clinical trials and presented a general method from which specific equations are derived for sample size determination and power analysis for a wide variety of statistical procedures. The paper discussed the method in detail, with illustrations for the t test, tests for proportions, tests for survival time and tests for correlations that commonly occur in clinical trials.
- This article advocated presenting measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrated the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focused on standardized effect size statistics and extensively discussed two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and because their calculation is essential for meta-analysis. The paper provided potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or heterogeneous variances, and (4) when data are non-independent.
- This article reviewed methods of sample size and power calculation for the most commonly used study designs. It presented two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrated the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.
Software
Problems
Other things being equal, which of the following actions will reduce the power of a hypothesis test?
I. Increasing sample size. II. Increasing significance level. III. Increasing beta, the probability of a Type II error.
- (A) I only
- (B) II only
- (C) III only
- (D) All of the above
- (E) None of the above
Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?
I. The power of the hypothesis test. II. The effect size of the hypothesis test. III. The probability of making a Type II error.
- (A) I only
- (B) II only
- (C) III only
- (D) All of the above
- (E) None of the above
Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
Test Result | Actual Condition: Absent ($H_0$ is true) | Actual Condition: Present ($H_1$ is true) |
Negative (fail to reject $H_0$) | 0.983 | 0.0025 |
Positive (reject $H_0$) | 0.0085 | 0.0055 |
Suppose we are running a test on a simple experiment where the population standard deviation is $0.06$. $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a type I error rate of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?
References
- SOCR Home page: http://www.socr.umich.edu