# SMHS HypothesisTesting

## Scientific Methods for Health Sciences - Hypothesis Testing

### Overview

Hypothesis testing is a statistical technique for making decisions about populations or processes based on experimental data. It quantifies the probability that chance alone might be responsible for the observed discrepancies between a theoretical model and the empirical observations. This section introduces the fundamental terminology of hypothesis testing, including null and alternative hypotheses, Type I and Type II errors, sensitivity, specificity and statistical power. It then discusses hypothesis tests for a mean, a proportion and a variance under various assumptions, with the aim of giving students enough background to apply hypothesis testing in real data analysis.

Important components of a hypothesis test:

• Decision (significance or no significance)
• Parameter of interest
• Variable of interest
• Population under study
• p-value

### Motivation

In statistical data analysis, we often encounter the problem of making statistical decisions about populations or processes based on experimental data. Hypothesis testing will be the direct answer to questions like:

• How well the findings fit the possibility that chance alone might be responsible for the observed discrepancy between the theoretical model and the empirical observations
• The likelihood of the observed summary statistics, assuming that the data comes from the distribution specified by the null hypothesis
• Whether the data follows the distribution stated in the alternative hypothesis

In fact, one use of hypothesis testing is to decide whether experimental results contain enough information to cast doubt on conventional wisdom.

Consider an example of testing whether a newly produced purse from a factory contains radioactive material. The null hypothesis is that there is no radioactive material in the purse and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects in the purse. Suppose we measure 10 counts per minute. We can then calculate how likely it is to observe 10 counts per minute under the null hypothesis. If it is likely, e.g., if the null hypothesis predicts on average 9 counts per minute, we say the purse is compatible with the null hypothesis. On the other hand, if the null hypothesis predicts, for example, 3 counts per minute, then the observation is not compatible with the null hypothesis and other factors are likely responsible for the increased radioactive counts.
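The purse example can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that counts per minute follow a Poisson distribution whose mean is the background rate predicted by the null hypothesis; the source does not specify a distribution, and the variable names are hypothetical.

```r
# Assumption (not in the source): counts in one minute follow a Poisson
# distribution whose mean is the background rate predicted by the null.
observed <- 10

# P(X >= 10) when the null predicts 9 counts per minute on average
p_null_9 <- ppois(observed - 1, lambda = 9, lower.tail = FALSE)

# P(X >= 10) when the null predicts 3 counts per minute on average
p_null_3 <- ppois(observed - 1, lambda = 3, lower.tail = FALSE)

round(c(null_mean_9 = p_null_9, null_mean_3 = p_null_3), 4)
```

Under this assumed model, 10 or more counts is quite plausible when the null predicts 9 counts per minute (probability roughly 0.41) and very unlikely when it predicts 3 (probability roughly 0.001).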

### Theory

#### Fundamentals of Hypothesis testing (statistical significance testing)

Null and alternative hypotheses: a null hypothesis is a thesis set up to be nullified or refuted in order to support an alternative (research) hypothesis. The null hypothesis is presumed true until statistical evidence, in the form of a hypothesis test, indicates otherwise. In science, the null hypothesis is used to test differences between treatment and control groups, and the assumption at the outset of the experiment is that no difference exists between the two groups for the variable of interest (e.g., population means). The null hypothesis proposes something initially presumed true, and it is rejected only when it becomes evidently false, that is, when a researcher has a certain degree of confidence, usually 95% to 99%, that the data do not support the null hypothesis. In the radioactive purse example above, the null hypothesis is that there is no radioactive material in the purse and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects in the purse. Formulation of the null hypothesis is a vital step in testing statistical significance. Having formulated such a hypothesis, one can establish the probability of observing data at least as extreme as the obtained data, if the null hypothesis is true. That probability is commonly called the p-value (the observed significance level) of the results.

In many scientific experimental designs we predict that a particular factor will produce an effect on our dependent variable — this is our alternative hypothesis. We then consider how often we would expect to observe our experimental results, or results even more extreme, if we were to take many samples from a population where there was no effect (i.e. we test against our null hypothesis). If we find that this happens rarely (up to, say, 5% of the time), we can conclude that our results support our experimental prediction — we reject our null hypothesis.

#### Type I Error, Type II Error and Power

• Type I error: the false positive error of rejecting the null hypothesis when it is actually true; e.g., the purses are flagged as containing radioactive material while they actually do not.
• Type II error: the false negative error of failing to reject the null hypothesis when the alternative hypothesis is actually true; e.g., the purses are flagged as not containing radioactive material while they actually do.
• Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
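These three quantities can be estimated by simulation. The following is a minimal sketch, not part of the original material: the sample size, effect size, seed and number of replications are all hypothetical choices, and a two-sided one-sample t-test of $H_0:\mu=0$ stands in for "the test".

```r
set.seed(1)            # arbitrary seed, only for reproducibility
alpha  <- 0.05
n      <- 30           # hypothetical sample size
n_sims <- 2000         # hypothetical number of simulated experiments

# p-value of a two-sided one-sample t-test of H0: mu = 0
# for one simulated normal sample with the given true mean
p_value <- function(true_mean) {
  t.test(rnorm(n, mean = true_mean, sd = 1), mu = 0)$p.value
}

# H0 true (mu = 0): the rejection rate estimates the Type I error rate
type1_rate <- mean(replicate(n_sims, p_value(0)) < alpha)

# H0 false (mu = 0.5): the rejection rate estimates the power, 1 - beta
power_est <- mean(replicate(n_sims, p_value(0.5)) < alpha)

c(type1 = type1_rate, power = power_est, type2 = 1 - power_est)
```

The estimated Type I error rate should land near the chosen $\alpha$, while the estimated power depends on the assumed effect size and sample size.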

The table below gives an example of calculating specificity, sensitivity, the false positive rate $\alpha$, the false negative rate $\beta$ and power from the proportions of true negatives (TN), false negatives (FN), false positives (FP) and true positives (TP).

| Test Result | Actual Condition: Absent ($H_{0}$ is true) | Actual Condition: Present ($H_{1}$ is true) |
|---|---|---|
| Negative (fail to reject $H_{0}$) | Condition absent + Negative result = True (accurate) Negative (TN, 0.98505) | Condition present + Negative result = False (invalid) Negative (FN, 0.00025), Type II error ($\beta$) |
| Positive (reject $H_{0}$) | Condition absent + Positive result = False Positive (FP, 0.00995), Type I error ($\alpha$) | Condition present + Positive result = True Positive (TP, 0.00475) |
| Test Interpretation: $Power = 1-\beta = 1-0.05 = 0.95$ | Specificity: $\frac{TN}{TN+FP}=\frac{0.98505}{0.98505+0.00995}=0.99$ | Sensitivity: $\frac{TP}{TP+FN}=\frac{0.00475}{0.00475+0.00025}=0.95$ |

Thus, $Specificity = \frac{TN}{TN + FP}$, $Sensitivity= \frac {TP} {TP+FN}$, $\alpha= \frac {FP}{FP+TN}$, $\beta=\frac {FN} {FN+TP}$ and $power=1-\beta$. Note that $\alpha$ and $\beta$ are the false-positive and false-negative rates (which are directly proportional to the FP and FN counts), respectively.
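As a quick check, the same quantities can be computed from the four proportions in the table. This is a minimal sketch; the variable names are chosen here for illustration.

```r
# Proportions taken from the table above
TN <- 0.98505; FN <- 0.00025
FP <- 0.00995; TP <- 0.00475

specificity <- TN / (TN + FP)   # true negative rate
sensitivity <- TP / (TP + FN)   # true positive rate
alpha <- FP / (FP + TN)         # false positive rate (Type I)
beta  <- FN / (FN + TP)         # false negative rate (Type II)
power <- 1 - beta

round(c(specificity = specificity, sensitivity = sensitivity,
        alpha = alpha, beta = beta, power = power), 4)
```

Note that specificity equals $1-\alpha$ and sensitivity equals the power, $1-\beta$.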

#### Testing a claim about a mean with large sample size

Recall the random sample $\{X_1,X_2,\ldots,X_n\}$ of the process, where the population mean is estimated by the sample average $\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$. For a given small significance level, say $\alpha=0.025$, the $(1-\alpha)100\%$ confidence interval for the mean is constructed as $CI(\alpha)$: $\bar x\pm z_\frac{\alpha}{2}E$, where $E$, the standard error of the mean (so that the margin of error is $z_\frac{\alpha}{2}E$), is defined as

$$E = \begin{cases}{\sigma\over\sqrt{n}},& \text{for known } \sigma,\\ {{1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}},& \text{for unknown } \sigma,\end{cases}$$
and $z_{\frac{\alpha}{2}}$ is the critical value for a Standard Normal distribution at $\frac{\alpha}{2}.$
• Hypothesis testing about a mean: large samples
$H_{0}: \mu=\mu_{0}$ (e.g., $\mu_{0}=0$); one sided $H_{1}:\mu>\mu_{0}$ or $\mu<\mu_{0}$; two sided $H_{1}:\mu\neq\mu_{0}$.
Test statistics:
(1) with known variance: $Z_0=\frac{\bar x -\mu_{0}} {\frac{\sigma}{\sqrt n}}$ $\thicksim N (0,1)$
(2) with unknown variance $T_{0}=\frac{\bar x -\mu_{0}} {SE(\bar x)}=\frac{\bar x -\mu_{0}} {{\frac{1} {\sqrt n} \sqrt {\displaystyle \sum_{i=1}^{n} \frac{(x_{i}-\bar x)^{2}}{n-1}}}}$ $\thicksim T_{df=n-1}$
• Example: suppose we test whether the population mean equals 20 at $\alpha=0.05$ using a two-sided alternative, $H_{0}:\mu=20$ vs. $H_{1}:\mu\neq 20$. The sample data are: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11, 17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, and 12. The population variance is not given.
a <- c( 16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11, 17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12)
summary(a)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
4.00    9.00   12.00   14.77   16.75   99.00
sd(a)
[1] 16.53561

$T_{0}=\frac{\bar x -\mu_{0}} {SE(\bar x)}=\frac{\bar x -\mu_{0}} {{\frac{1} {\sqrt {n}} \sqrt {\displaystyle \sum_{i=1}^{n} \frac{(x_{i}-\bar x)^{2}}{n-1}}}}$
From the sample we have $\bar x=14.77$, $s=16.54$ and $n=30$.
$T_{0}=\frac{14.77-20}{{\frac{1} {\sqrt{30}} \sqrt {\displaystyle \sum_{i=1}^{30} \frac{(x_{i}-14.77)^{2}}{30-1}}}}=\frac{14.77-20}{\frac{16.54}{\sqrt{30}}}= -1.733$
$P(T_{df=29}<T_{0}=-1.733)=0.047$, hence the two-sided p-value is $2\times 0.047=0.094$ and we can’t reject the null hypothesis at the $\alpha=0.05$ level of significance.
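The hand calculation can be reproduced in R. The sketch below computes the statistic directly from the formula for $T_0$ and then cross-checks it with the built-in `t.test()` function; the variable names `mu0`, `se`, `t0` and `res` are chosen here for illustration.

```r
a <- c(16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11,
       17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12)
mu0 <- 20
n   <- length(a)

# Test statistic from the formula: (sample mean - mu0) / standard error
se <- sd(a) / sqrt(n)
t0 <- (mean(a) - mu0) / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_two_sided <- 2 * pt(-abs(t0), df = n - 1)
c(t0 = t0, p_value = p_two_sided)

# Cross-check with the built-in one-sample t-test
res <- t.test(a, mu = mu0, alternative = "two.sided")
res
```

The statistic is about $-1.733$ with a two-sided p-value between 0.05 and 0.10, so the null hypothesis is not rejected at $\alpha=0.05$.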

#### Comparing the means of two samples

When comparing the means of two samples, one needs to identify whether the samples are paired or independent. In the paired-samples case, which reduces to a single sample of differences, the paired test should be used. When the two samples are independent, the independent-samples test should be used.
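To illustrate why the distinction matters, here is a minimal sketch using made-up numbers (the `before` and `after` vectors are hypothetical and not from the source). The same data are analyzed once as paired measurements and once as if they came from two independent groups.

```r
# Hypothetical measurements, invented purely for illustration
before <- c(12.1, 11.4, 13.0, 12.7, 10.9, 11.8, 12.5, 13.3)
after  <- c(12.9, 11.9, 13.4, 13.5, 11.2, 12.6, 12.8, 14.1)

# Paired design: the same 8 subjects are measured twice
paired_res <- t.test(after, before, paired = TRUE)

# Independent design: the two vectors are treated as unrelated groups
# (R uses the Welch unequal-variance test by default)
indep_res <- t.test(after, before, paired = FALSE)

# The paired test is equivalent to a one-sample test on the differences
diff_res <- t.test(after - before, mu = 0)

c(paired = paired_res$p.value, independent = indep_res$p.value)
```

For these invented data the paired analysis yields a much smaller p-value, because pairing removes the between-subject variability from the comparison.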

#### Testing a claim about a mean with small sample size

Recall the random sample $\{X_1,X_2,\ldots,X_n\}$ of the process, where the population mean is estimated by the sample average $\bar X_{n}=\frac{1}{n}\sum_{i=1}^{n} X_{i}$. For a given small significance level, say $\alpha=0.025$, the $(1-\alpha)100\%$ confidence interval for the mean is constructed as $CI(\alpha)$: $\bar{x}\pm t_{\{df=n-1,\frac{\alpha}{2}\}} \frac{1}{\sqrt {n}} \sqrt{\sum_{i=1}^{n} {\frac{(x_{i}-\bar x)^{2}}{n-1}}}$, where $E=\frac{1}{\sqrt {n}} \sqrt{\sum_{i=1}^{n} {\frac{(x_{i}-\bar x)^{2}}{n-1}}}$ is the standard error of the mean and $t_{\{df=n-1,\frac{\alpha} {2}\}}$ is the critical value of the T distribution with $df = n-1$ (sample size minus one) at $\frac{\alpha}{2}$.

• Hypothesis testing about a mean: small samples
$H_{0}:\mu=\mu_{0}$ (e.g., $\mu_{0}=0$); one sided $H_{1}:\mu>\mu_{0}$ or $\mu<\mu_{0}$; two sided $H_{1}:\mu\neq\mu_{0}$.
Test statistics:
(1) with known variance: $Z_0=\frac{\bar x -\mu_{0}} {\frac{\sigma}{\sqrt n}}$ $\thicksim N (0,1)$
(2) with unknown variance $T_{0}=\frac{\bar x -\mu_{0}} {SE(\bar x)}=\frac{\bar x -\mu_{0}} {{\frac{1} {\sqrt {n}} \sqrt {\sum_{i=1}^{n} \frac{(x_{i}-\bar x)^{2}}{n-1}}}}$ $\thicksim T_{df=n-1}$
• Example: suppose we test whether the population mean equals 12 at $\alpha=0.01$ using a one-sided alternative, $H_{0}: \mu=12$ vs. $H_{1}:\mu>12$. The sample data are: 16, 9, 14, 11, 17, 12, 99, 18, 13, and 12. The population variance is not given.
From the sample we have $\bar x=22.1$, $s=27.164$ and $n=10$.
$T_{0}=\frac{\bar x -\mu_{0}} {SE(\bar x)}=\frac{\bar x -\mu_{0}} {{\frac{1} {\sqrt n} \sqrt {\sum_{i=1}^{n} \frac{(x_{i}-\bar x)^{2}}{n-1}}}}=\frac{22.1-12}{{\frac{1} {\sqrt{10}} \sqrt {\sum_{i=1}^{10} \frac{(x_{i}-22.1)^{2}}{10-1}}}}= 1.176$
$p\text{-value}=P(T_{df=9}>T_{0}=1.176)=0.13488$, hence we can’t reject the null hypothesis at the $\alpha=0.01$ level of significance.
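This small-sample example can be checked in R. The sketch below uses the ten observations above with $n-1=9$ degrees of freedom; the variable names are chosen here for illustration, and the 95% confidence level for the interval is an assumption made only to show the small-sample interval formula in action.

```r
x   <- c(16, 9, 14, 11, 17, 12, 99, 18, 13, 12)
mu0 <- 12
n   <- length(x)

# Test statistic and one-sided (upper-tail) p-value with df = n - 1
se <- sd(x) / sqrt(n)
t0 <- (mean(x) - mu0) / se
p_upper <- pt(t0, df = n - 1, lower.tail = FALSE)
c(mean = mean(x), sd = sd(x), t0 = t0, p_value = p_upper)

# Cross-check with the built-in one-sided test
res <- t.test(x, mu = mu0, alternative = "greater")

# Two-sided confidence interval from the small-sample formula
# (95% level chosen here as an illustrative assumption)
alpha_ci <- 0.05
ci <- mean(x) + c(-1, 1) * qt(1 - alpha_ci / 2, df = n - 1) * se
ci
```

The p-value is well above $\alpha=0.01$, so the null hypothesis is not rejected. The single large observation (99) inflates the sample standard deviation considerably, which is worth keeping in mind when interpreting the result.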

#### Testing a claim about a proportion

Recall that for large samples, the sampling distribution of the sample proportion $\hat p$ is approximately normal by the CLT, as the sample proportion may be presented as a sample average of Bernoulli random variables. When the sample size is small, the normal approximation is inadequate. To accommodate this, we modify the sample proportion $\hat p$ slightly and obtain the corrected sample proportion $\tilde p$.

$$\hat{p}={y\over n} \longrightarrow \tilde{p}={y+0.5z_{\alpha \over 2}^2 \over n+z_{\alpha \over 2}^2},$$ where $z_\frac{\alpha}{2}$ is the critical value of a standard normal distribution at $\alpha/2$.

The standard error of $\hat{p}$ (and $\tilde{p}$) also needs a slight modification:

$$SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})\over n} \longrightarrow SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})\over n+z_{\alpha \over 2}^2}.$$

• Hypothesis testing about a single sample proportion:
Null Hypothesis: $H_o: p=p_o$ (e.g., $p_o=\frac{1}{2}$), where $p$ is the population proportion of interest.
Alternative Research Hypotheses:
One sided (uni-directional): $H_1: p >p_o$, or $H_1: p<p_o$
Double sided: $H_1: p \not= p_o.$
Test Statistics: $Z_o={\tilde{p} -p_o \over SE_{\tilde{p}}} \sim N(0,1).$
• Example: suppose we are testing the effect of a medicine. 500 patients with early evidence of a disease were randomly recruited and scheduled to take one pill daily for two years. At the end of the two years, only 17 patients had the disease. Use $\alpha=0.05$ to formulate and test a hypothesis about whether the proportion of patients on this treatment who have the disease within 2 years of treatment is $p_0=0.04$.
$\tilde{p} = {17+0.5z_{0.025}^2\over 500+z_{0.025}^2}= {17+1.92\over 500+3.84}=0.038$
$SE_{\tilde{p}}= \sqrt{0.038(1-0.038)\over 500+3.84}=0.0085$
The corresponding test statistic is
$Z_o={\tilde{p} - 0.04 \over SE_{\tilde{p}}}={-0.002 \over 0.0085}=-0.2353$
Since $|Z_o|$ is far below any conventional critical value, the p-value of this test is clearly not significant and we can’t reject the null hypothesis at the $\alpha=0.05$ level of significance.
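The corrected-proportion calculation can be scripted as follows. This is a minimal sketch with illustrative variable names, and it assumes a two-sided alternative, which the example does not state explicitly. Because the code keeps full precision rather than rounding $\tilde p$ to 0.038, its test statistic (about $-0.29$) differs slightly from the hand-rounded value, but the conclusion is the same.

```r
y <- 17; n <- 500      # observed cases and sample size
p0 <- 0.04             # hypothesized proportion
alpha <- 0.05

# Critical value z_{alpha/2} of the standard normal distribution
z <- qnorm(1 - alpha / 2)

# Corrected sample proportion and its standard error
p_tilde  <- (y + 0.5 * z^2) / (n + z^2)
se_tilde <- sqrt(p_tilde * (1 - p_tilde) / (n + z^2))

# Test statistic and two-sided p-value (two-sided alternative is assumed)
z0 <- (p_tilde - p0) / se_tilde
p_value <- 2 * pnorm(-abs(z0))

round(c(p_tilde = p_tilde, se = se_tilde, z0 = z0, p_value = p_value), 4)
```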

#### Testing a claim about variance (or standard deviation)

• Recall that the sample variance $s^2$ is an unbiased point estimate of the population variance $\sigma^2$, and the sample standard deviation $s$ is the corresponding point estimate of $\sigma$. For a random sample from a normal distribution, the scaled sample variance is chi-square distributed under the null hypothesis: $\chi_0^2=\frac{(n-1) s^2}{\sigma_0^2} \sim \chi_{df=n-1}^2.$
$H_0: \sigma^2=\sigma_0^2$ vs. $H_1:\sigma^2\neq\sigma_0^2$. Given that the chi-square distribution is not symmetric, there are two critical values, $\chi_L^2$ and $\chi_R^2$.
• Test statistics: $\chi_0^2 =\frac{(n-1) s^2}{\sigma_0^2} \sim \chi_{df=n-1}^2$
• Example: we have a random sample of 30 objects drawn from a normal distribution with sample variance $s^2=5$. Test at the $\alpha=0.05$ level of significance whether this is consistent with $H_0: \sigma^2=2$.
$\chi_0^2=\frac{(n-1) s^2}{\sigma_0^2} =\frac{29\times 5}{2}=72.5$, $\chi_L^2=16.047$ and $\chi_R^2=45.722$. Since $\chi_0^2 > \chi_R^2$, we reject the null hypothesis at the 5% level of significance. This calculation can be reproduced with the SOCR $\chi_{df=29}^2$ calculator; note the left and right limits of the central 95% region, $(\chi_L^2=16.047, \chi_R^2=45.722)$.
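The critical values and the decision can also be obtained in R with `qchisq()`. This is a minimal sketch with illustrative variable names; the two-sided p-value at the end uses the common "double the smaller tail" convention, which is an assumption not stated in the text.

```r
n <- 30; s2 <- 5        # sample size and sample variance
sigma0_sq <- 2          # variance under the null hypothesis
alpha <- 0.05

# Test statistic: (n - 1) * s^2 / sigma0^2
chi0 <- (n - 1) * s2 / sigma0_sq

# Left and right critical values of the chi-square distribution, df = n - 1
crit <- qchisq(c(alpha / 2, 1 - alpha / 2), df = n - 1)

# Reject H0 if the statistic falls outside the central 95% region
reject <- chi0 < crit[1] || chi0 > crit[2]

# Two-sided p-value: twice the smaller tail probability (assumed convention)
p_value <- 2 * min(pchisq(chi0, df = n - 1),
                   pchisq(chi0, df = n - 1, lower.tail = FALSE))

list(chi0 = chi0, critical = round(crit, 3), reject = reject, p_value = p_value)
```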

### Applications

• This article illustrates an SOCR Analyses example of the one-sample t-test. It presents background information on the one-sample t-test and demonstrates how to carry it out in the SOCR one-sample t-test applet.
• This article, titled Choosing the Right Test, presents a procedure for selecting a statistical test. It starts with getting the hypotheses right and then develops the topic based on the choice and characteristics of the data. It offers a broad sense of what type of test to choose based on the hypothesis and the data. The article is also accompanied by several exercises for students to practice on their own.
• This article, titled The Hypothesis Testing For Difference Of Population Parameters, presents a comprehensive introduction to hypothesis testing for differences of population parameters, covering both background information and applications. It also presents the steps to apply hypothesis testing using SOCR Analyses.

### Problems

USA Today's AD Track examined the effectiveness of the new ads involving the Pets.com Sock Puppet (which is now extinct). In particular, they conducted a nationwide poll of 428 adults who had seen the Pets.com ads and asked for their opinions. They found that 36% of the respondents said they liked the ads. Suppose you increased the sample size for this poll to 1000, but you had the same sample percentage who like the ads (36%). How would this change the p-value of the hypothesis test you want to conduct?

(a) No way to tell
(b) The new p-value would be the same as before
(c) The new p-value would be smaller than before
(d) The new p-value would be larger than before

If we want to estimate the mean difference in scores on a pre-test and post-test for a sample of students, how should we proceed?

(a) We should construct a confidence interval or conduct a hypothesis test
(b) We should collect one sample, two samples, or conduct a paired data procedure
(c) We should calculate a z or a t statistic

The paint used to make lines on roads must reflect enough light to be clearly visible at night. Let mu denote the true average reflectometer reading for a new type of paint under consideration. A test of the null hypothesis that mu = 20 versus the alternative hypothesis that mu > 20 will be based on a random sample of size n from a normal population distribution. In which of the following scenarios is there significant evidence that mu is larger than 20?

(i) n=15, t=3.2, alpha=0.05
(ii) n=9, t=1.8, alpha=0.01
(iii) n=24, t=-0.2, alpha=0.01
(a) (ii) and (iii)
(b) (i)
(c) (iii)
(d) (ii)

We observe the math self-esteem scores from a random sample of 25 female students. How should we determine the probable values of the population mean score for this group?

(a) Test the difference in means between two paired or dependent samples.
(b) Test that a correlation coefficient is not equal to 0 (correlation analysis).
(c) Test the difference between two means (independent samples).
(d) Test for a difference in more than two means (one way ANOVA).
(e) Construct a confidence interval.
(f) Test one mean against a hypothesized constant.
(g) Use a chi-squared test of association.

Food inspectors inspect samples of food products to see if they are safe. This can be thought of as a hypothesis test where H0: the food is safe, and H1: the food is not safe. If you are a consumer, which type of error would be the worst one for the inspector to make, the type I or type II error?

(a) Type I
(b) Type II

A college admissions officer is concerned that their admission criteria might not treat men and women with equal weight. To test this, the college took a random sample of male and female high school seniors from a very large local school district and determined the percent of males and females who would be eligible for admission at the college. Which of the following is a suitable null hypothesis for this test?

(a) p = 0.5
(b) The proportion of all eligible men in the district will not equal the proportion of all eligible women in the district.
(c) The proportion of all eligible men in the school district should be equal to the proportion of all eligible women in the school district.
(d) The proportion of eligible men sampled should equal the proportion of eligible women sampled.

We want to determine if college GPAs differ for male athletes in major sports (e.g., football), minor sports (e.g., swimming), and intramural sports. What statistical method is most likely to be used to answer this question? Assume that all necessary assumptions have been met for using this procedure.

(a) Test one mean against a hypothesized constant
(b) Test the difference in means between two paired or dependent samples
(c) Test for a difference in more than two means (one way ANOVA)
(d) Test that a correlation coefficient is not equal to 0, correlation analysis
(e) Test the difference between two means (independent samples)

Statistics show that the average level of a mother's education for a city of 300,000 people is 14 years with a standard deviation of 1.5 years. A major state university is located in this town. The administrators in this university think that the average level of a mother's education for the freshmen who are admitted to this school is higher than 14 years. The average education level of mothers for a random sample of 100 freshmen who were admitted to this university within the last two years was 14.7 years. We want to test the null at the level of alpha = 0.001. What is the best answer?

(a) We reject the alternative and believe that the level of a mother's education for university freshmen is not higher than the overall population average.
(b) We reject the null at 0.001 and conclude that the average level of a mother's education is higher for university freshmen.
(c) We fail to reject the null and conclude that the level of a mother's education for university freshmen is not higher than the overall population average.
(d) In order to be certain about the conclusion we reach, a larger sample size is needed to increase the power of the test and the margin of error.

The average length of time required to complete a certain aptitude test is claimed to be 80 minutes. A random sample of 25 students yielded an average of 86.5 minutes and a standard deviation of 15.4 minutes. If we assume normality of the population distribution, is there evidence to reject the claim? Choose at least one answer.

(a) Yes, because the observed 86.5 did not happen by chance
(b) Yes, because the t-test statistic is 2.11
(c) Yes, because the observed 86.5 happened by chance
(d) No, because the probability that the null is true is > 0.05

Based on past experience, a bank believes that 4% of the people who receive loans will not make payments on time. The bank has recently approved 300 loans. What is the probability that over 6% of these clients will not make timely payments?

(a) 0.096
(b) 0.038
(c) 0.962
(d) 0.904

Many people sleep in on the weekends to make up for short nights during the work week. The Better Sleep Council reports that 61% of us get more than 7 hours of sleep per night on the weekend. A random sample of 350 adults found that 235 had more than seven hours each night last weekend. At the 0.05 level of significance, does this evidence show that more than 61% of us get seven or more hours of sleep per night on the weekend?