Difference between revisions of "SMHS UbiquitousVariation"

From SOCR
(Theory)
(References)
 
(25 intermediate revisions by 4 users not shown)
Line 2: Line 2:
  
 
===Overview===
 
===Overview===
In real world, variation exists in almost all the data set. The truth is no matter how controlled the environment is in the protocol or the design, virtually any repeated measurement, observation, experiment, trial, or study is bounded to generate data that varies because of intrinsic (internal to the system) or extrinsic (ambient environment) effects. And the extent to which they are unalike, or vary can be noted as variation. Variation is an important concept in statistics and measuring variability is of special importance in statistic inference. And measure of variation, which is namely measures that provided information on the variation, illustrates the extent to which data are dispersed or spread out. We will introduce several basic measures of variation commonly used in statistics: range, variation, standard deviation, sum of squares, Chebyshev’s theorem and empirical rules.
+
In the real world, ''variation'', the extent to which data from the same study differ from each other, exists in almost all data sets. The truth is that no matter how controlled an environment might be in terms of its protocol or design, virtually any repeated measurement, observation, experiment, trial or study is bound to generate data that vary due to ''intrinsic effects'' (which are internal to the system being studied) and ''extrinsic effects'' (which originate from the system's ambient environment). Measuring variability requires measures that provide information on the variation of data and that illustrate the extent to which data are dispersed or spread out. We will introduce several basic measures of variation that are commonly used in statistics, including range, variance, standard deviation, and sum of squares, along with Chebyshev’s theorem and the empirical rule.
  
 
===Motivation===
 
===Motivation===
Variation is of significant importance in statistics and it is ubiquitous in data. Consider the example in UCLA’s study of Alzheimer’s disease which analyzed the data of 31 Mild Cognitive Impairment (MCI) and 34 probable Alzheimer’s disease (AD) patients. The investigators made every attempt to control as many variables as possible. Yet, demographic information they collected from the outcomes of the subjects contained unavoidable variation. The same study found variation in the MMSE cognitive scores even in the same subject. The table below shows the demographic characteristics for the subjects and patients included in this study, where the following notation is used M (male), F (female), W (white), AA (African American), A (Asian).
+
Variation is of significant importance in statistics and is ubiquitous in data. Consider the example in [[SMHS_UbiquitousVariation#References | UCLA’s study of Alzheimer’s disease]], which analyzed the data of 31 Mild Cognitive Impairment (MCI) and 34 probable Alzheimer’s disease (AD) patients. The investigators made every attempt to control as many variables as possible, but the demographic information they collected from their subjects was unavoidably varied. The same study found variation in the MMSE cognitive scores even in the same subject. The table below shows the demographic characteristics for the subjects and patients included in this study, where the following notation is used: M (male), F (female), W (white), AA (African American), A (Asian).
  
 +
<center>
 
{| class="wikitable" style="text-align:center; width:75%" border="1"
 
{| class="wikitable" style="text-align:center; width:75%" border="1"
 
|-
 
|-
| Variable|| Alzhelmer's disease || MCI || Test Statistics || Test Score || P-value
+
| '''Variable''' || '''Alzheimer’s disease''' || '''MCI''' || '''Test statistics''' || '''Test score''' || '''P-value'''
 
|-
 
|-
| Age(years)|| 76.2 (8.3) range 52-89|| 73.7 (7.3) range 57-84|| Student’s T ||t <sub>0</sub> =1.284 || p=0.21
+
| '''Age (years)''' || 76.2 (8.3) range 52–89 || 73.7 (7.4) range 57–84 || Student’s T || $t_o = 1.284$ || ''p=0.21''
 
|-
 
|-
| Gender(M:F)|| 15:19|| 15:16|| Proportion|| z <sub>0</sub>= -0.345 || p=0.733
+
| '''Gender (M:F)''' || 15:19 || 15:16 || Proportion || $z_o = -0.345$ || ''p=0.733''
|-  
+
|-
| Education(years)|| 14.0 (2.1) range 12-19 || 16.23 (2.7) range 12-20 || Wilcoxon rank sum || w <sub>0</sub> =773.0 || p<0.001
+
| '''Education (years)''' || 14.0 (2.1) range 12–19 || 16.23 (2.7) range 12–20 || Wilcoxon rank sum || $w_o = 773.0$ || ''p<0.001''
 
|-
 
|-
| Race(W:AA:A)|| 29:1:4 || 26:2:3 ||x <sup>2</sup>  <sub>(df=2)</sub> || x <sup>2</sup>  <sub>(df=2)</sub> =1.18 || 0.55
+
| '''Race (W:AA:A)'''  || 29:1:4 || 26:2:3 || $\chi_{(df=2)}^2$ || $\chi_{(df=2)}^2=1.18$ || 0.55
 
|-
 
|-
| MMSE|| 20.9 (6.3) range 4-29 || 28.2 (1.6) range 23-30 || Wilcoxon rank sum || w <sub>0</sub> =977.5 || p<0.001
+
| '''MMSE''' || 20.9 (6.3) range 4–29 || 28.2 (1.6) range 23–30 || Wilcoxon rank-sum || $w_o= 977.5$ || ''p<0.001''
 
|}
 
|}
 +
</center>
  
Once we accept that all natural phenomena are inherently variant and there aren’t completely deterministic processes, we need to look for measures of variation that allow us to know the extent to which the data are dispersed. Suppose, for instance, we flip a coin 50 times and get 15 heads and 35 tails. But according to the fundamental probability theory where we assume it’s a fair coin, we should have got 25 heads and 25 tails. So, what happened here? Now, suppose there are 100 students and each one flipped the coin 50 times. So, how would you imagine the results to be?
+
Once we accept that all natural phenomena are inherently variant and that there are no completely deterministic processes, we need to look for measures of variation that allow us to know the extent to which data are dispersed. For example, suppose that we flip a coin 50 times and get 15 heads and 35 tails. According to [http://en.wikipedia.org/wiki/Probability_theory probability theory], we have a 50% chance of getting heads or tails whenever we flip a coin, so one would expect us to end up with 25 heads and 25 tails. What happened here? Now, suppose there are 100 students and that each one flipped the coin 50 times; what would you imagine the results to be like?
  
 
===Theory===
 
===Theory===
 
Measures of variation:
 
Measures of variation:
* '''Range''': range is the simplest measure of variation and it is the difference between the largest value and the smallest. Range = Maximum – Minimum.  
+
* '''Range''': The simplest measure of variation and the difference between the largest value and the smallest; Range = Maximum – Minimum.  
Suppose the pulse rate of Jack varied from 70 to 76 while that of Tom varied from 58 to 79. Here we have Jack has a range of 76 – 70 = 6 and Tom has a range of 79 – 58 = 21. Hence we conclude that Tom has a big variation in pulse rate compared to Jack with the range measure.
+
: Suppose Jack's pulse rate varied from 70 to 76 while Tom's varied from 58 to 79. Jack has a range of 76 – 70 = 6 and Tom has a range of 79 – 58 = 21. Hence we conclude that Tom has a larger variation in pulse rate compared to Jack with the range measure.
 
+
: A similar measure of variation covers (more or less) the middle 50 percent of the data. It is the interquartile range: Q<sub>3</sub> - Q<sub>1</sub>, where Q<sub>1</sub> and Q<sub>3</sub> are the first and third quartiles.
A similar measure of variation covers (more or less) the middle 50 percent. It is the interquartile range:Q <sub>3</sub> - Q <sub>1</sub> where Q <sub>1</sub> and Q <sub>3</sub> are the first and third quarters.
 
  
* '''Variance''': unlike range, which only involves the largest and smallest data, variance involves all the data values.
+
* '''Variance''': Unlike range, which only involves the largest and smallest data, variance involves all the data values.
** Population variance: σ^2=((x-μ)^2 )/N  where μ is the population mean of the data and N is the size of the population.
+
** Population variance: The second central moment, relative to the mean (first moment) $\mu =E(X)$: $ \operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2 \right].$
**Unbiased estimate of the population variance:s^2=(∑(x-x ̅ )^2 )/(n-1), where x ̅ is the sample mean and n is the sample size.
+
**Unbiased estimate of the population variance (sample-variance): $s^2=\frac{\sum {(x_i-\bar{x})^2}}{n-1}$, where $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}$ is the sample mean and $n$ is the sample size.
  
* '''Standard deviation''': It is the square root of variance. Given that the deviations in variance were squared, meaning the units were squared, so to take the square root of the variance gets the unit back the same as the original data values.
+
* '''Standard deviation''' (SD): The square root of variance. Given that the deviations in variance were squared, meaning the units were squared, so to take the square root of the variance gets the unit back the same as the original data values.
**Population variance: σ=√((∑(x-μ)^2 )/N)  where μ is the population mean of the data and N is the size of the population.
+
**Population SD: $\sigma =\sqrt{E(X-\mu)^2}$, where $\mu$ is the population mean of the data.
**Unbiased estimate of the population stand deviation (sample standard deviation):s=√((∑(x-x ̅ )^2 )/(n-1)) where x ̅ is the sample mean and n is the sample size.
+
**Sample standard deviation (the square root of the unbiased sample variance): $s=\sqrt{\frac{\sum {(x_i-\bar{x})^2}}{n-1}}$, where $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}$ is the sample mean and $n$ is the sample size.
  
Consider an example: a biologist found 8, 11, 7, 13, 10, 11, 7 and 9 contaminated mice in 8 groups. Calculate s.
+
:: Consider an example: a biologist found 8, 11, 7, 13, 10, 11, 7 and 9 contaminated mice in 8 different groups. Calculate $s$. The sample average is: $\bar{x} =\frac{8+11+7+13+10+11+7+9}{8}=9.5.$
x ̅=(8+11+7+13+10+11+7+9)/8=9.5
 
  
 +
<center>
 
{| class="wikitable" style="text-align:center; width:75%" border="1"
 
{| class="wikitable" style="text-align:center; width:75%" border="1"
 
|-
 
|-
 
| x || 8 || 11 || 7 || 13 || 10 || 11 || 7 || 9 || Sum
 
| x || 8 || 11 || 7 || 13 || 10 || 11 || 7 || 9 || Sum
 
|-
 
|-
| $\sum {x_i-\bar{x}}$ || -1.5  || 1.5 || -2.5 || 3.5 || 0.5 || 1.5 || -2.5 ||-0.5 || 0
+
| $\sum {(x_i-\bar{x})}$ || -1.5  || 1.5 || -2.5 || 3.5 || 0.5 || 1.5 || -2.5 ||-0.5 || 0
 
|-
 
|-
 
| $\sum {(x_i-\bar{x})^2}$ || 2.25 || 2.25|| 6.25 || 12.25 || 0.25 || 2.25 || 6.25 || 0.25 || 32
 
| $\sum {(x_i-\bar{x})^2}$ || 2.25 || 2.25|| 6.25 || 12.25 || 0.25 || 2.25 || 6.25 || 0.25 || 32
 
|}
 
|}
 +
</center>
  
$s=\sqrt{\frac{\sum {(x_i-\bar{x})^2}}{n-1}} = 32/7 \approx 2.14$
+
:: $s=\sqrt{\frac{\sum {(x_i-\bar{x})^2}}{n-1}} = \sqrt{\frac{32}{7}} \approx 2.14$
  
 
* '''Sum of squares''' (shortcuts)
 
* '''Sum of squares''' (shortcuts)
The sum of the squares of the deviations fro the means is given a shortcut notation and several alternative formulas.
+
: The sum of the squares of the deviations from the mean is given a shortcut notation and several alternative formulas.
SS(x)=s=∑(x-x ̅ )^2  
+
$$SS(x)=\sum_{i} {(x_i - \bar{x})^2}$$
 +
: A little algebraic simplification yields: $SS(x)=\sum_i {x_i^2} - \frac{(\sum_i{x_i})^2}{n}$.
  
A little algebraic simplification returns: SS(x)=∑x^2 -(∑x)^2/n
+
====Chebyshev’s Theorem====
 +
The proportion of the values that fall within $k$ standard deviations of the mean will be at least $1-1/k^2$ where $k$ is the number greater than 1. Chebyshev’s theorem is true for any sample set from any (well-defined) distribution. Formally, if ''X'' is a random variable from a distribution with finite mean ($\mu$) and finite non-zero variance ($\sigma^2$), then for any real number $k > 0$, $\Pr(|X-\mu|\geq k\sigma) \leq \frac{1}{k^2}$.
  
* '''Chebyshev’s Theorem''': The proportion of the values that fall within k standard deviations of the mean will be at least 1-1/k^2 where k is the number greater than 1. The interpretation of x ̅-ks to x ̅+ks would be within k standard deviations. Chebyshev’s theorem is true for any sample set with any distribution.
+
: The useful case is $k \gt 1$. When $k \lt 1$, the right-hand side is greater than one, so the inequality holds trivially because no probability exceeds one. If $k=1$, the inequality likewise says only that the probability is less than or equal to one. For example, using $k=\sqrt{2}$ shows that at least half of the values of the distribution lie in the interval $(\mu - \sqrt{2}\sigma, \mu + \sqrt{2}\sigma)$. Note that this is true for all distributions, whether they are symmetrical or not.
  
* '''Empirical Rule''': This rule only works for bell-shaped (normal) distributions. With this kind of distribution, we have: approximately 68% of the data values fall within one standard deviation of the mean; approximately 95% of the data values fall within two standard deviations of the mean; approximately 99.7% of the data values fall within three standard deviations of the mean.
+
* '''Empirical Rule''': This rule only works for bell-shaped (normal) distributions. With this kind of distribution, approximately 68% of the data values fall within one standard deviation of the mean; approximately 95% fall within two standard deviations of the mean; and approximately 99.7% fall within three standard deviations of the mean.
  
 
=== Applications===
 
=== Applications===
* [http://www.nature.com/ng/journal/v39/n7s/full/ng2042.html This article titled The Population Genetics of Structural Variation], talked about genomic variation in human genome. It summarized recent dramatic advances and illustrated on the diverse mutational origins of chromosomal rearrangements and argued about their complexity necessitates a re-evaluation of existing population genetic methods. It started with an introduction on genomic variants including their biological significance, their basic characteristics leading to the importance of study on structural variation. It then pointed out the improvements in knowledge of structural variation in human genome compared to the current state of studies in structural variation in human genome and ended with two important future challenges in the study of structural variation.
+
* An article titled [http://www.nature.com/ng/journal/v39/n7s/full/ng2042.html The Population Genetics of Structural Variation] discusses variation in the human genome. It summarizes recent dramatic advances and illustrates the diverse mutational origins of chromosomal rearrangements, arguing that their complexity necessitates a re-evaluation of existing population genetic methods. The article begins with an introduction on genomic variants, their biological significance and their basic characteristics, leading to the importance of study on structural variation. It then points out how knowledge of structural variation in the human genome has improved, and it concludes with two important challenges that will be faced by future studies of structural variation.
  
 
===Software ===
 
===Software ===
Line 72: Line 76:
  
 
===Problems===
 
===Problems===
* Let X be a random variable with mean 80 and standard deviation 12. Find the mean and the standard deviation of the following variable: X-20. Choose one answer.
+
* Let $X$ be a random variable with a mean of 80 and a standard deviation of 12. Find the mean and the standard deviation of the following variable: $Y=X-20$. Choose one answer.
 
: (a) Mean = 60, standard deviation = 144
 
: (a) Mean = 60, standard deviation = 144
 
: (b) Mean = 60, standard deviation = 12
 
: (b) Mean = 60, standard deviation = 12
Line 78: Line 82:
 
: (d) Mean = 60, standard deviation = -8
 
: (d) Mean = 60, standard deviation = -8
  
* A physician collected data on 1000 patients to examine their heights. A statistician hired to look at the files noticed the typical height was about 60 inches, but found that one height was 720 inches. This is clearly an outlier. The physician is out of town and can't be contacted, but the statistician would like to have some preliminary descriptions of the data to present when the doctor returns. Which of the following best describes how the statistician should handle this outlier? Choose one answer.
+
* A physician collected data on 1,000 patients to examine their heights. A statistician hired to look at the files noticed the typical height was about 60 inches, but found that one height was 720 inches. This is clearly an outlier. The physician is out of town and can't be contacted, but the statistician would like to have some preliminary descriptions of the data to present when the doctor returns. Which of the following best describes how the statistician should handle this outlier? Choose one answer.
 
: (a) The statistician should publish a paper on the emergence of a new race of giants.
 
: (a) The statistician should publish a paper on the emergence of a new race of giants.
 
: (b) The statistician should keep the data point in; each point is too valuable to drop one.
 
: (b) The statistician should keep the data point in; each point is too valuable to drop one.
Line 85: Line 89:
 
: (e) The statistician should drop the observation from the dataset because we can't analyze the data with it.
 
: (e) The statistician should drop the observation from the dataset because we can't analyze the data with it.
  
* Researchers do a study on the number of cars that a person owns. They think that the distribution of their data might be normal, even though the median is much smaller than the mean. They make a p-plot. What does it look like? Choose one answer.
+
* Researchers do a study on the number of cars that a person owns. They think that the distribution of their data might be normal, even though the median is much smaller than the mean. If they were to make a p-plot, what would it look like? Choose one answer.
: (a) It's not a straight line.
+
: (a) It would not be a straight line.
: (b) It's a bell curve.
+
: (b) It would be a bell curve.
: (c) It's a group of points clustered around the middle of the plot.
+
: (c) It would be a group of points clustered around the middle of the plot.
: (d) It's a straight line.
+
: (d) It would be a straight line.
  
 
* Which of the following parameters is most sensitive to outliers? Choose one answer.
 
* Which of the following parameters is most sensitive to outliers? Choose one answer.
Line 106: Line 110:
 
: (e) The median or (6 + 7)/2 = 6.5
 
: (e) The median or (6 + 7)/2 = 6.5
  
* The following data is collected from website for 121 schools and included these attributes about each institution: name, public or private institution, state, , cost of health insurance, resident tuition, resident fees, resident total expenses, nonresident tuition, nonresident fees, and nonresident total expenses in 2005. So was surprised that medical schools charge no tuition for residents. However, other students pay about $\$20,000$ in fees.
+
* The following data is collected from a website for 121 schools, and it includes these attributes about each institution: ''Name'', ''Public or Private Institution'', ''State'' , ''Cost of Health Insurance'', ''Resident Tuition'', ''Resident Fees'', ''Resident Total Expenses'', ''Nonresident Tuition'', ''Nonresident Fees'', and ''Nonresident Total Expenses in 2005''. Some students were surprised that medical schools charged no tuition for residents, but others had to pay about $\$20,000$ in fees.
  
 +
<center>
 
{| class="wikitable" style="text-align:center; width:75%" border="1"  
 
{| class="wikitable" style="text-align:center; width:75%" border="1"  
 
|-  
 
|-  
 
|  || Min || Q1 || Median || Q3 || Max   
 
|  || Min || Q1 || Median || Q3 || Max   
 
|-  
 
|-  
| Private || $$-6,550 || $$30,729 || $$33,850 || $$36,685 || $$41,360  
+
| Private || -$\$6,550$ || $\$30,729$ || $\$33,850$ || $\$36,685$ || $\$41,360$
 
|-  
 
|-  
| Public || $$0 || $$10,219 || $$16,168 || $$18,800 || $$27,886  
+
| Public || $\$0$ || $\$10,219$ || $\$16,168$ || $\$18,800$ || $\$27,886$
 
|}
 
|}
 +
</center>
  
: On the same scale, use the 5-Number summary to construct two boxplots for the tuition for residents at 73 public and 48 private medical colleges. Use the data and plots to determine which statement about centers is true.
+
: On the same scale, use the 5-Number summary to construct two box plots for the tuition for residents at 73 public and 48 private medical colleges. Use the data and plots to determine which statement about centers is true.
 
: (a) For private medical schools, the mean tuition of residents is greater than the median tuition for residents.
 
: (a) For private medical schools, the mean tuition of residents is greater than the median tuition for residents.
 
: (b) With these data, we cannot determine the relationship between mean and median tuition for residents.
 
: (b) With these data, we cannot determine the relationship between mean and median tuition for residents.
Line 124: Line 130:
  
 
* Suppose that we create a new data set by doubling the highest value in a large data set of positive values. What statement is FALSE about the new data set? Choose one answer.
 
* Suppose that we create a new data set by doubling the highest value in a large data set of positive values. What statement is FALSE about the new data set? Choose one answer.
: (a) The mean increases
+
: (a) The mean increases.
: (b) The standard deviation increases
+
: (b) The standard deviation increases.
: (c) The range increases
+
: (c) The range increases.
: (d) The median and interquartile range both increase
+
: (d) The median and interquartile range both increase.
  
 
* Consider a large data set of positive values and multiply each value by 100. Which of the following statement is true? Choose one answer.
 
* Consider a large data set of positive values and multiply each value by 100. Which of the following statement is true? Choose one answer.
: (a) The mean, median, and standard deviation increase
+
: (a) The mean, median, and standard deviation increase.
: (b) The mean and median increase but the standard deviation is unchanged.
+
: (b) The mean and median increase, but the standard deviation is unchanged.
 
: (c) The standard deviation increases but the mean and median are unchanged.
 
: (c) The standard deviation increases but the mean and median are unchanged.
: (d) The range and interquartile range are unchanged
+
: (d) The range and interquartile range are unchanged.
  
 
===References===
 
===References===
* [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_IntroVar (SOCR EBook) Introduction to Variability]
+
*[http://brain.oxfordjournals.org/content/129/11/2867.full-text.pdf 3D comparison of hippocampal atrophy in amnestic mild cognitive impairment and Alzheimer's disease.]
* [http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]
+
*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_IntroVar (SOCR EBook) Introduction to Variability]
 +
*[https://people.richland.edu/james/lecture/m170/ch03-var.html  Stats: Measures of Variation]
  
  

Latest revision as of 11:55, 19 March 2015

Scientific Methods for Health Sciences - Ubiquitous Nature of Process Variability

Overview

In the real world, variation, the extent to which data from the same study differ from each other, exists in almost all data sets. The truth is that no matter how controlled an environment might be in terms of its protocol or design, virtually any repeated measurement, observation, experiment, trial or study is bound to generate data that vary due to intrinsic effects (which are internal to the system being studied) and extrinsic effects (which originate from the system's ambient environment). Measuring variability requires measures that provide information on the variation of data and that illustrate the extent to which data are dispersed or spread out. We will introduce several basic measures of variation that are commonly used in statistics, including range, variance, standard deviation, and sum of squares, along with Chebyshev’s theorem and the empirical rule.

Motivation

Variation is of significant importance in statistics and is ubiquitous in data. Consider the example in UCLA’s study of Alzheimer’s disease, which analyzed the data of 31 Mild Cognitive Impairment (MCI) and 34 probable Alzheimer’s disease (AD) patients. The investigators made every attempt to control as many variables as possible, but the demographic information they collected from their subjects was unavoidably varied. The same study found variation in the MMSE cognitive scores even in the same subject. The table below shows the demographic characteristics for the subjects and patients included in this study, where the following notation is used: M (male), F (female), W (white), AA (African American), A (Asian).

{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Variable || Alzheimer’s disease || MCI || Test statistics || Test score || P-value
|-
| Age (years) || 76.2 (8.3) range 52–89 || 73.7 (7.4) range 57–84 || Student’s T || $t_o = 1.284$ || p=0.21
|-
| Gender (M:F) || 15:19 || 15:16 || Proportion || $z_o = -0.345$ || p=0.733
|-
| Education (years) || 14.0 (2.1) range 12–19 || 16.23 (2.7) range 12–20 || Wilcoxon rank sum || $w_o = 773.0$ || p<0.001
|-
| Race (W:AA:A) || 29:1:4 || 26:2:3 || $\chi_{(df=2)}^2$ || $\chi_{(df=2)}^2=1.18$ || p=0.55
|-
| MMSE || 20.9 (6.3) range 4–29 || 28.2 (1.6) range 23–30 || Wilcoxon rank sum || $w_o= 977.5$ || p<0.001
|}

Once we accept that all natural phenomena are inherently variable and that there are no completely deterministic processes, we need to look for measures of variation that allow us to know the extent to which data are dispersed. For example, suppose that we flip a coin 50 times and get 15 heads and 35 tails. According to probability theory, a fair coin has a 50% chance of landing heads on each flip, so the expected outcome is 25 heads and 25 tails. What happened here? The expected count is only a long-run average, and any single run of 50 flips will typically deviate from it. Now, suppose there are 100 students and that each one flips the coin 50 times; what would you imagine the results to look like?
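
The coin example can be explored numerically. The following is a minimal Python sketch, assuming a fair coin and independent flips; the function names `binomial_pmf`, `prob_at_most` and `simulate_students` are illustrative choices and do not come from the original text. It computes exact binomial probabilities for particular head counts and simulates 100 students who each flip a coin 50 times.

```python
import math
import random


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k, n, p=0.5):
    """Probability of k or fewer heads in n independent flips."""
    return sum(binomial_pmf(j, n, p) for j in range(k + 1))


def simulate_students(num_students=100, flips=50, seed=0):
    """Return the number of heads observed by each simulated student."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        sum(rng.random() < 0.5 for _ in range(flips))
        for _ in range(num_students)
    ]


if __name__ == "__main__":
    print("P(exactly 25 heads):", round(binomial_pmf(25, 50), 4))
    print("P(15 or fewer heads):", round(prob_at_most(15, 50), 4))
    heads = simulate_students()
    print("Smallest and largest head counts:", min(heads), max(heads))
```

Even the single most likely outcome, exactly 25 heads, occurs only about 11% of the time, so most students should expect a count somewhat different from 25.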

Theory

Measures of variation:

  • Range: The simplest measure of variation, defined as the difference between the largest value and the smallest; Range = Maximum – Minimum.
Suppose Jack's pulse rate varied from 70 to 76 while Tom's varied from 58 to 79. Jack has a range of 76 – 70 = 6 and Tom has a range of 79 – 58 = 21. Hence, by the range measure, we conclude that Tom's pulse rate varies more than Jack's.
A similar measure of variation covers (more or less) the middle 50 percent of the data. It is the interquartile range: $Q_3 - Q_1$, where $Q_1$ and $Q_3$ are the first and third quartiles.
  • Variance: Unlike range, which uses only the largest and smallest values, variance uses all the data values.
    • Population variance: The second central moment, relative to the mean (first moment) $\mu =E(X)$: $ \operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2 \right].$
    • Unbiased estimate of the population variance (sample-variance): $s^2=\frac{\sum {(x_i-\bar{x})^2}}{n-1}$, where $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}$ is the sample mean and $n$ is the sample size.
  • Standard deviation (SD): The square root of the variance. Because the deviations in the variance are squared, the variance is expressed in squared units; taking the square root returns the measure to the units of the original data values.
    • Population SD: $\sigma =\sqrt{\operatorname{E}\left[(X-\mu)^2\right]}$, where $\mu$ is the population mean of the data.
    • Sample standard deviation (the square root of the unbiased sample variance): $s=\sqrt{\frac{\sum {(x_i-\bar{x})^2}}{n-1}}$, where $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}$ is the sample mean and $n$ is the sample size. Note that although $s^2$ is an unbiased estimate of $\sigma^2$, $s$ itself is not an exactly unbiased estimate of $\sigma$.
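Consider an example: a biologist found 8, 11, 7, 13, 10, 11, 7 and 9 contaminated mice in 8 different groups. Calculate $s$. The sample average is: $\bar{x} =\frac{8+11+7+13+10+11+7+9}{8}=9.5.$ The table below lists each deviation from the mean and its square, with the total of each row in the last column.

{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| $x_i$ || 8 || 11 || 7 || 13 || 10 || 11 || 7 || 9 || Sum
|-
| $x_i-\bar{x}$ || -1.5 || 1.5 || -2.5 || 3.5 || 0.5 || 1.5 || -2.5 || -0.5 || 0
|-
| $(x_i-\bar{x})^2$ || 2.25 || 2.25 || 6.25 || 12.25 || 0.25 || 2.25 || 6.25 || 0.25 || 32
|}

$s=\sqrt{\frac{\sum {(x_i-\bar{x})^2}}{n-1}} = \sqrt{\frac{32}{7}} \approx 2.14$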
  • Sum of squares (shortcuts)
The sum of the squares of the deviations from the mean is given a shortcut notation and several alternative formulas.

$$SS(x)=\sum_{i} {(x_i - \bar{x})^2}$$

A little algebraic simplification yields: $SS(x)=\sum_i {x_i^2} - \frac{(\sum_i{x_i})^2}{n}$.
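
The measures above can be checked on the contaminated-mice data. This is a small Python sketch rather than a reference implementation; the variable names are illustrative, and it uses only the standard library. It computes the sum of squares both from the definition and from the shortcut formula, then derives the sample variance and sample standard deviation from it.

```python
import math
import statistics

# Contaminated mice found in each of the 8 groups (data from the example).
counts = [8, 11, 7, 13, 10, 11, 7, 9]
n = len(counts)
mean = sum(counts) / n

# Sum of squares from the definition: squared deviations from the mean.
ss_definition = sum((x - mean) ** 2 for x in counts)

# Sum of squares from the shortcut: sum(x^2) - (sum(x))^2 / n.
ss_shortcut = sum(x * x for x in counts) - sum(counts) ** 2 / n

# Sample variance divides by n - 1; the standard deviation is its square root.
sample_variance = ss_definition / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Range uses only the two extreme values.
data_range = max(counts) - min(counts)

print("mean:", mean)
print("SS (definition):", ss_definition)
print("SS (shortcut):", ss_shortcut)
print("sample variance:", round(sample_variance, 3))
print("sample SD:", round(sample_sd, 3))
print("library SD:", round(statistics.stdev(counts), 3))
print("range:", data_range)
```

Both routes to the sum of squares give 32, and the hand-computed standard deviation agrees with `statistics.stdev`, which also divides by $n-1$.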

Chebyshev’s Theorem

The proportion of the values that fall within $k$ standard deviations of the mean will be at least $1-1/k^2$, where $k$ is any number greater than 1. Chebyshev’s theorem is true for any sample from any (well-defined) distribution. Formally, if X is a random variable from a distribution with finite mean ($\mu$) and finite non-zero variance ($\sigma^2$), then for any real number $k > 0$, $\Pr(|X-\mu|\geq k\sigma) \leq \frac{1}{k^2}$.

The useful case is $k \gt 1$. When $k \lt 1$, the right-hand side is greater than one, so the inequality holds trivially because no probability exceeds one. If $k=1$, the inequality likewise says only that the probability is less than or equal to one. For example, using $k=\sqrt{2}$ shows that at least half of the values of the distribution lie in the interval $(\mu - \sqrt{2}\sigma, \mu + \sqrt{2}\sigma)$. Note that this is true for all distributions, whether they are symmetrical or not.
  • Empirical Rule: This rule only works for bell-shaped (normal) distributions. With this kind of distribution, approximately 68% of the data values fall within one standard deviation of the mean; approximately 95% fall within two standard deviations of the mean; and approximately 99.7% fall within three standard deviations of the mean.
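
To see how Chebyshev’s theorem and the empirical rule relate, the sketch below compares the guaranteed Chebyshev lower bound with the exact proportion for a normal distribution. It is a minimal Python illustration: the function names are assumptions made here, and the normal proportion uses the standard identity $\Pr(|Z| < k) = \operatorname{erf}(k/\sqrt{2})$.

```python
import math


def chebyshev_lower_bound(k):
    """Guaranteed minimum proportion within k standard deviations (needs k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


def normal_proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


if __name__ == "__main__":
    for k in (math.sqrt(2), 2, 3):
        print(
            f"k = {k:.3f}: Chebyshev >= {chebyshev_lower_bound(k):.4f}, "
            f"normal = {normal_proportion_within(k):.4f}"
        )
```

Chebyshev’s bound is much weaker than the empirical rule (at least 75% versus about 95% within two standard deviations) because it must hold for every distribution, not only bell-shaped ones.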

Applications

  • An article titled The Population Genetics of Structural Variation discusses variation in the human genome. It summarizes recent dramatic advances and illustrates the diverse mutational origins of chromosomal rearrangements, arguing that their complexity necessitates a re-evaluation of existing population genetic methods. The article begins with an introduction on genomic variants, their biological significance and their basic characteristics, leading to the importance of study on structural variation. It then points out how knowledge of structural variation in the human genome has improved, and it concludes with two important challenges that will be faced by future studies of structural variation.

Software

Problems

  • Let $X$ be a random variable with a mean of 80 and a standard deviation of 12. Find the mean and the standard deviation of the following variable: $Y=X-20$. Choose one answer.
(a) Mean = 60, standard deviation = 144
(b) Mean = 60, standard deviation = 12
(c) Mean = 80, standard deviation = 12
(d) Mean = 60, standard deviation = -8
  • A physician collected data on 1,000 patients to examine their heights. A statistician hired to look at the files noticed the typical height was about 60 inches, but found that one height was 720 inches. This is clearly an outlier. The physician is out of town and can't be contacted, but the statistician would like to have some preliminary descriptions of the data to present when the doctor returns. Which of the following best describes how the statistician should handle this outlier? Choose one answer.
(a) The statistician should publish a paper on the emergence of a new race of giants.
(b) The statistician should keep the data point in; each point is too valuable to drop one.
(c) The statistician should drop the observation from the analysis because this is clearly a mistake; the person would be 60 feet tall.
(d) The statistician should analyze the data twice, once with and once without this data point, and then compare how the point affects conclusions.
(e) The statistician should drop the observation from the dataset because we can't analyze the data with it.
  • Researchers do a study on the number of cars that a person owns. They think that the distribution of their data might be normal, even though the median is much smaller than the mean. If they were to make a p-plot, what would it look like? Choose one answer.
(a) It would not be a straight line.
(b) It would be a bell curve.
(c) It would be a group of points clustered around the middle of the plot.
(d) It would be a straight line.
  • Which of the following parameters is most sensitive to outliers? Choose one answer.
(a) Standard deviation
(b) Interquartile range
(c) Mode
(d) Median
  • Which value given below is the best representative for the following data?
2, 3, 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 10, 11
Choose one answer.
(a) The weighted average of the two modes or (4*5 + 9*5)/10 = 6.5
(b) No single number could represent this data set
(c) The average of the two modes or (4 + 9) / 2 = 6.5
(d) The mean or (2 + 3 + 4 + … + 10 + 11)/18 = 5.9
(e) The median or (6 + 7)/2 = 6.5
  • The following data were collected from a website for 121 schools and include these attributes about each institution: Name, Public or Private Institution, State, Cost of Health Insurance, Resident Tuition, Resident Fees, Resident Total Expenses, Nonresident Tuition, Nonresident Fees, and Nonresident Total Expenses in 2005. Some students were surprised that medical schools charged no tuition for residents, but others had to pay about $\$20,000$ in fees.
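{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
|  || Min || Q1 || Median || Q3 || Max
|-
| Private || -$\$6,550$ || $\$30,729$ || $\$33,850$ || $\$36,685$ || $\$41,360$
|-
| Public || $\$0$ || $\$10,219$ || $\$16,168$ || $\$18,800$ || $\$27,886$
|}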
On the same scale, use the 5-Number summary to construct two box plots for the tuition for residents at 73 public and 48 private medical colleges. Use the data and plots to determine which statement about centers is true.
(a) For private medical schools, the mean tuition of residents is greater than the median tuition for residents.
(b) With these data, we cannot determine the relationship between mean and median tuition for residents.
(c) For private medical schools, the mean tuition of residents is equal to the median tuition for residents.
(d) For private medical schools, the mean tuition of residents is less than the median tuition for residents.
  • Suppose that we create a new data set by doubling the highest value in a large data set of positive values. What statement is FALSE about the new data set? Choose one answer.
(a) The mean increases.
(b) The standard deviation increases.
(c) The range increases.
(d) The median and interquartile range both increase.
  • Consider a large data set of positive values and multiply each value by 100. Which of the following statements is true? Choose one answer.
(a) The mean, median, and standard deviation increase.
(b) The mean and median increase, but the standard deviation is unchanged.
(c) The standard deviation increases but the mean and median are unchanged.
(d) The range and interquartile range are unchanged.

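References

  • 3D comparison of hippocampal atrophy in amnestic mild cognitive impairment and Alzheimer's disease: http://brain.oxfordjournals.org/content/129/11/2867.full-text.pdf
  • (SOCR EBook) Introduction to Variability: http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_IntroVar
  • Stats: Measures of Variation: https://people.richland.edu/james/lecture/m170/ch03-var.html
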