Difference between revisions of "AP Statistics Curriculum 2007 EDA Var"

From SOCR
m (Text replacement - "{{translate|pageName=http://wiki.stat.ucla.edu/socr/" to ""{{translate|pageName=http://wiki.socr.umich.edu/")
 
(20 intermediate revisions by 3 users not shown)
Line 4: Line 4:
 
There are many measures of (population or sample) variation, e.g., the range, the variance, the standard deviation, mean absolute deviation, etc. These are used to assess the dispersion or spread of the population.
 
There are many measures of (population or sample) variation, e.g., the range, the variance, the standard deviation, mean absolute deviation, etc. These are used to assess the dispersion or spread of the population.
  
Suppose we are interested in the long-jump performance of some students. We can carry an experiment by randomly selecting 8 male statistics students and ask them to perform the standing long jump.  In reality every student participated, but for the ease of calculations below we will focus on these eight students.  The long jumps were as follows:
+
Suppose we are interested in the long-jump performance of some students. We can carry out an experiment by randomly selecting eight male statistics students and asking them to perform the standing long jump.  In reality every student participated, but for ease of calculation below we will focus on these eight students.  The long jumps were as follows:
 
+
<center>
 
{| class="wikitable" style="text-align:center; width:75%" border="1"
 
{| class="wikitable" style="text-align:center; width:75%" border="1"
 
|+Long-Jump (inches) Sample Data
 
|+Long-Jump (inches) Sample Data
 
|-
 
|-
| 74 || 78 || 106 || 80 || 68 || 64 || 60 || 76
+
| 60 || 64 || 68 || 74 || 76 || 78 || 80 || 106
 
|}
 
|}
 +
</center>
  
 
===Range===
 
===Range===
 
The range is the easiest measure of dispersion to calculate, yet, perhaps not the best measure. The '''Range = max - min'''. For example, for the Long Jump data, the range is calculated by:
 
The range is the easiest measure of dispersion to calculate, yet, perhaps not the best measure. The '''Range = max - min'''. For example, for the Long Jump data, the range is calculated by:
<center><math>Range = 106 – 60 = 46</math></center>. Note that the range is only sensitive to the extreme values of a sample and ignores all other information. So, two completely different distributions may have the same range.
+
<center>Range = 106 – 60 = 46.</center>
 +
 
 +
Note that the range is only sensitive to the extreme values of a sample and ignores all other information. So, two completely different distributions may have the same range.
 +
 
 +
===Quartiles and IQR===
 +
The first quartile (<math>Q_1</math>) and the third quartile (<math>Q_3</math>) are defined values that split the dataset into ''bottom-25% vs. top-75%'' and ''bottom-75% vs. top-25%'', respectively. Thus the inter-quartile range (IQR), which is the difference <math>Q_3 - Q_1</math>, represents the central 50% of the data and can be considered as a measure of data dispersion or variation. The wider the IQR, the more variant the data.
 +
 
 +
For example, <math>Q_1=(64+68)/2=66</math>,  <math>Q_3=(78+80)/2=79</math> and <math>IQR=Q_3-Q_1=13</math>, for the Long-Jump data shown above. Thus we expect the middle half of all long jumps (for that population) to be between 66 and 79 inches.
 +
 
 +
===Coefficient of Variation===
 +
For a given process, the coefficient of variation (CV) is defined as the ratio of the standard deviation (<math>\sigma </math>) to the mean (<math>\mu </math>):
 +
:<math>CV = \frac{\sigma}{\mu}.</math>
 +
 
 +
The CV is well-defined only for processes whose first two moments (mean and variance) exist, and it also requires a ''non-zero'' mean (<math>\mu \not= 0</math>). To express the CV as a percentage, multiply this ratio by 100. The sample coefficient of variation is computed mostly for data measured on a ratio scale, where it does not depend on the unit of measurement. For instance, if a set of distances is converted from miles to kilometers ([http://en.wikipedia.org/wiki/Mile 1 mile is approximately 1.6 kilometers]), both the standard deviation and the mean are multiplied by the same factor, so the CV is unchanged. On an interval scale the zero point is arbitrary: recording temperatures in kelvins instead of degrees Celsius adds a constant to every value, which leaves the standard deviation unchanged but changes the mean, and therefore the CV. In general, the CV may not have any meaning for data on an interval scale.
 +
 
 +
The sample coefficient of variation is computed by plugging in the sample estimates of the standard deviation (the sample standard deviation, <math>s</math>) and of the mean (the sample average, <math>\bar{x}</math>). In image processing, the reciprocal of the coefficient of variation, μ/σ, is called the ''signal-to-noise ratio'' (SNR).
 +
 
 +
===Five-number summary===
 +
The five-number summary for a dataset is the 5-tuple <math>\{min, Q_1, Q_2, Q_3, max\}</math>, containing the sample minimum, first-quartile, second-quartile (median), third-quartile, and maximum.
  
 
===Variance and Standard Deviation===
 
===Variance and Standard Deviation===
The logic behind the variance and standard deviation measures is to measure the difference between each observation and the mean (i.e., dispersion). The deviation of the i-th measurement from the mean is defined by <math>(y_i - \overline{y})</math>.
+
The logic behind the variance and standard deviation measures is to measure the difference between each observation and the mean (i.e., dispersion). Suppose we have ''n > 1'' observations, <math>\left \{ y_1, y_2, y_3, ..., y_n \right \}</math>. The deviation of the <math>i^{th}</math> measurement, <math>y_i</math>, from the mean (<math>\overline{y}</math>) is defined by <math>(y_i - \overline{y})</math>.
  
 
Does the average of these deviations seem like a reasonable way to find an average deviation for the sample or the population? No, because the sum of all deviations is trivial:
 
Does the average of these deviations seem like a reasonable way to find an average deviation for the sample or the population? No, because the sum of all deviations is trivial:
Line 23: Line 42:
  
 
To solve this problem we employ different versions of the '''mean absolute deviation''':
 
To solve this problem we employ different versions of the '''mean absolute deviation''':
<center><math>\sum_{i=1}^n{|y_i - \overline{y}|}.</math></center>
+
<center><math>{1 \over n-1}\sum_{i=1}^n{|y_i - \overline{y}|}.</math></center>
  
 
In particular, the '''variance''' is defined as:
 
In particular, the '''variance''' is defined as:
<center><math>\sum_{i=1}^n{|y_i - \overline{y}|^2}.</math></center>
+
<center><math>{1 \over n-1}\sum_{i=1}^n{(y_i - \overline{y})^2}.</math></center>
  
 
And the '''standard deviation''' is defined as:
 
And the '''standard deviation''' is defined as:
<center><math>\sqrt{\sum_{i=1}^n{|y_i - \overline{y}|^2}}.</math></center>
+
<center><math>\sqrt{{1 \over n-1}\sum_{i=1}^n{(y_i - \overline{y})^2}}.</math></center>
 +
 
 +
For the long-jump sample of 8 measurements, the standard deviation is:
 +
<center><math>\sqrt{{1 \over 8-1} \left \{(60-75.75)^2 + (64-75.75)^2 + (68-75.75)^2 + (74-75.75)^2 + (76-75.75)^2 + (78-75.75)^2 + (80-75.75)^2 + (106-75.75)^2 \right \} } = 14.079.</math></center>
  
===Examples===
+
===Activities===
Computer simulations and real observed data.  
+
Try to pair each of the 4 samples whose numerical summaries are reported below with one of the 4 frequency plots below. Explain your answers.
 +
{| class="wikitable" style="text-align:center; width:75%" border="1"
 +
|+Numerical Summaries of Four Samples
 +
|-
 +
| Sample || Mean || Median || StdDev
 +
|-
 +
| A || 4.688 || 5.000 || 1.493
 +
|-
 +
| B || 4.000 || 4.000 || 1.633 
 +
|-
 +
| C || 3.933 || 4.000 || 1.387
 +
|-
 +
| D || 4.000 || 4.000 || 2.075
 +
|}
 +
 
 +
<center>[[Image:SOCR_EBook_Dinov_EDA_012708_Fig10.jpg|500px]]</center>
  
* TBD
+
 
 +
===Notes===
 +
*Some software packages may use <math>{1 \over n}</math>, instead of the <math>{1 \over n-1}</math>, which we used above. Note that for large sample-sizes this difference becomes increasingly smaller. Also, there are theoretical properties of the sample variance, as defined above (e.g., sample-variance is an unbiased estimate of the population-variance!)
 
   
 
   
===Hands-on activities===
+
*Most of the [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] and [http://socr.ucla.edu/htmls/SOCR_Analyses.html SOCR Analyses] compute the variance or standard deviation for the sample. You can see these examples of [[SOCR_EduMaterials_ChartsActivities | Charts Activities]] and [[SOCR_EduMaterials_AnalysesActivities | Analyses Activities]] and you can test these using [[SOCR_012708_ID_Data_HotDogs | hotdogs dataset]].
Step-by-step practice problems.  
 
  
* TBD
+
===[[EBook_Problems_EDA_Var | Problems]]===
  
 
<hr>
 
<hr>
 +
 
===References===
 
===References===
 
* [http://www.stat.ucla.edu/%7Edinov/courses_students.dir/07/Fall/STAT13.1.dir/STAT13_notes.dir/lecture02.pdf Lecture notes on EDA]
 
* [http://www.stat.ucla.edu/%7Edinov/courses_students.dir/07/Fall/STAT13.1.dir/STAT13_notes.dir/lecture02.pdf Lecture notes on EDA]
Line 48: Line 87:
 
* SOCR Home page: http://www.socr.ucla.edu
 
* SOCR Home page: http://www.socr.ucla.edu
  
{{translate|pageName=http://wiki.stat.ucla.edu/socr/index.php?title=AP_Statistics_Curriculum_2007_EDA_Var}}
+
"{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=AP_Statistics_Curriculum_2007_EDA_Var}}

Latest revision as of 12:35, 3 March 2020

General Advanced-Placement (AP) Statistics Curriculum - Measures of Variation

Measures of Variation and Dispersion

There are many measures of (population or sample) variation, e.g., the range, the variance, the standard deviation, mean absolute deviation, etc. These are used to assess the dispersion or spread of the population.

Suppose we are interested in the long-jump performance of some students. We can carry out an experiment by randomly selecting eight male statistics students and asking them to perform the standing long jump. In reality every student participated, but for ease of calculation below we will focus on these eight students. The long jumps were as follows:

Long-Jump (inches) Sample Data
60 64 68 74 76 78 80 106

Range

The range is the easiest measure of dispersion to calculate, though perhaps not the best one. It is defined as Range = max - min. For example, for the Long Jump data, the range is calculated by:

Range = 106 – 60 = 46.

Note that the range is only sensitive to the extreme values of a sample and ignores all other information. So, two completely different distributions may have the same range.
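The point about two different distributions sharing a range can be checked with a short sketch. The second sample, `other`, is a hypothetical dataset made up for illustration; only `jumps` comes from the table above.

```python
# Long-jump sample (inches) from the table above.
jumps = [60, 64, 68, 74, 76, 78, 80, 106]

# Hypothetical sample with the same extremes but a very different shape.
other = [60, 60, 60, 60, 106, 106, 106, 106]


def sample_range(values):
    """Return max - min for a non-empty sequence of numbers."""
    return max(values) - min(values)


print(sample_range(jumps))  # 46
print(sample_range(other))  # 46
```

Both samples report a range of 46 inches even though their values are distributed very differently.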

Quartiles and IQR

The first quartile (\(Q_1\)) and the third quartile (\(Q_3\)) are the values that split the dataset into bottom-25% vs. top-75% and bottom-75% vs. top-25%, respectively. Thus the inter-quartile range (IQR), which is the difference \(Q_3 - Q_1\), spans the central 50% of the data and can be used as a measure of data dispersion or variation. The wider the IQR, the more variable the data.

For example, \(Q_1=(64+68)/2=66\), \(Q_3=(78+80)/2=79\) and \(IQR=Q_3-Q_1=13\), for the Long-Jump data shown above. Thus we expect the middle half of all long jumps (for that population) to be between 66 and 79 inches.
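The calculation above takes the quartiles as the medians of the lower and upper halves of the sorted data. A minimal sketch of that convention follows; the function name `quartiles` is an illustrative choice, and other software may use different interpolation rules and return slightly different quartiles.

```python
from statistics import median

jumps = [60, 64, 68, 74, 76, 78, 80, 106]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd sample size the middle observation is left out of both
    halves; this is one convention among several.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


q1, q3 = quartiles(jumps)
iqr = q3 - q1
print(q1, q3, iqr)  # 66.0 79.0 13.0
```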

Coefficient of Variation

For a given process, the coefficient of variation (CV) is defined as the ratio of the standard deviation (\(\sigma \)) to the mean (\(\mu \)): \[CV = \frac{\sigma}{\mu}.\]

The CV is well-defined only for processes whose first two moments (mean and variance) exist, and it also requires a non-zero mean (\(\mu \not= 0\)). To express the CV as a percentage, multiply this ratio by 100. The sample coefficient of variation is computed mostly for data measured on a ratio scale, where it does not depend on the unit of measurement. For instance, if a set of distances is converted from miles to kilometers (1 mile is approximately 1.6 kilometers), both the standard deviation and the mean are multiplied by the same factor, so the CV is unchanged. On an interval scale the zero point is arbitrary: recording temperatures in kelvins instead of degrees Celsius adds a constant to every value, which leaves the standard deviation unchanged but changes the mean, and therefore the CV. In general, the CV may not have any meaning for data on an interval scale.

The sample coefficient of variation is computed by plugging in the sample estimates of the standard deviation (the sample standard deviation, \(s\)) and of the mean (the sample average, \(\bar{x}\)). In image processing, the reciprocal of the coefficient of variation, \(\mu/\sigma\), is called the signal-to-noise ratio (SNR).
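As an illustration, the sample CV and its reciprocal can be computed for the long-jump data. This is a sketch under stated assumptions: `sample_cv` is an illustrative name, and the shifted sample is hypothetical, built only to show what happens when the zero point of the scale moves, as it may on an interval scale.

```python
from statistics import mean, stdev

jumps = [60, 64, 68, 74, 76, 78, 80, 106]  # long-jump sample, inches


def sample_cv(values):
    """Sample CV: s / x-bar, with s computed using the n-1 divisor."""
    m = mean(values)
    if m == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return stdev(values) / m


cv = sample_cv(jumps)
snr = 1 / cv  # reciprocal of the CV

# Hypothetical shift: add 100 to every value, as if the zero point moved.
shifted = [y + 100 for y in jumps]

print(round(cv, 3), round(100 * cv, 1))  # 0.186 18.6
print(round(snr, 2))                     # 5.38
print(round(sample_cv(shifted), 3))      # 0.08
```

The shift leaves the standard deviation unchanged but raises the mean, so the CV drops from about 0.186 to about 0.08, which is why the CV is hard to interpret when the zero point is arbitrary.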

Five-number summary

The five-number summary for a dataset is the 5-tuple \(\{min, Q_1, Q_2, Q_3, max\}\), containing the sample minimum, first-quartile, second-quartile (median), third-quartile, and maximum.
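A minimal sketch of the five-number summary for the long-jump sample, assuming the same median-of-halves quartile convention used in the IQR example; `five_number_summary` is an illustrative name, not a function from any particular package.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    half = len(data) // 2
    return (
        data[0],
        median(data[:half]),
        median(data),
        median(data[-half:]),
        data[-1],
    )


jumps = [60, 64, 68, 74, 76, 78, 80, 106]
print(five_number_summary(jumps))  # (60, 66.0, 75.0, 79.0, 106)
```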

Variance and Standard Deviation

The logic behind the variance and standard deviation measures is to measure the difference between each observation and the mean (i.e., dispersion). Suppose we have n > 1 observations, \(\left \{ y_1, y_2, y_3, ..., y_n \right \}\). The deviation of the \(i^{th}\) measurement, \(y_i\), from the mean (\(\overline{y}\)) is defined by \((y_i - \overline{y})\).

Does the average of these deviations seem like a reasonable way to find an average deviation for the sample or the population? No, because the positive and negative deviations cancel, so their sum is always zero:

\(\sum_{i=1}^n{(y_i - \overline{y})}=0.\)
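As a brief check of why this sum vanishes, using only the definition \(\overline{y} = {1 \over n}\sum_{i=1}^n y_i\):

\[\sum_{i=1}^n (y_i - \overline{y}) = \sum_{i=1}^n y_i - n\,\overline{y} = n\,\overline{y} - n\,\overline{y} = 0.\]

For the long-jump sample (\(\overline{y} = 75.75\)), the negative deviations \(-15.75, -11.75, -7.75, -1.75\) add to \(-37\) and the positive deviations \(0.25, 2.25, 4.25, 30.25\) add to \(+37\), so they cancel exactly.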

To solve this problem, we use measures built from absolute or squared deviations, which cannot cancel each other. One of them is the mean absolute deviation:

\({1 \over n-1}\sum_{i=1}^n{|y_i - \overline{y}|}.\)
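The formula above can be evaluated for the long-jump sample as follows. This is a sketch: the `ddof` parameter is an illustrative addition that allows comparing the \(n-1\) divisor used here with the plain \(n\) divisor found in other definitions.

```python
jumps = [60, 64, 68, 74, 76, 78, 80, 106]


def mean_abs_deviation(values, ddof=1):
    """Sum of |y_i - y-bar| divided by (n - ddof).

    ddof=1 matches the n-1 divisor in the formula above; ddof=0 gives
    the plain average of the absolute deviations.
    """
    n = len(values)
    if n <= ddof:
        raise ValueError("not enough observations")
    ybar = sum(values) / n
    return sum(abs(y - ybar) for y in values) / (n - ddof)


print(round(mean_abs_deviation(jumps), 3))  # 10.571
print(mean_abs_deviation(jumps, ddof=0))    # 9.25
```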

In particular, the variance is defined as:

\({1 \over n-1}\sum_{i=1}^n{(y_i - \overline{y})^2}.\)

And the standard deviation is defined as:

\(\sqrt{{1 \over n-1}\sum_{i=1}^n{(y_i - \overline{y})^2}}.\)

For the long-jump sample of 8 measurements, the standard deviation is:

\(\sqrt{{1 \over 8-1} \left \{(60-75.75)^2 + (64-75.75)^2 + (68-75.75)^2 + (74-75.75)^2 + (76-75.75)^2 + (78-75.75)^2 + (80-75.75)^2 + (106-75.75)^2 \right \} } = 14.079.\)
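The arithmetic above can be reproduced step by step. The sketch below follows the formula directly and then compares the result with `statistics.stdev` from the Python standard library, which also uses the \(n-1\) divisor.

```python
from math import sqrt
from statistics import stdev

jumps = [60, 64, 68, 74, 76, 78, 80, 106]

n = len(jumps)
ybar = sum(jumps) / n                     # sample mean
ss = sum((y - ybar) ** 2 for y in jumps)  # sum of squared deviations
var = ss / (n - 1)                        # sample variance
sd = sqrt(var)                            # sample standard deviation

print(ybar, ss)                    # 75.75 1387.5
print(round(var, 3), round(sd, 3)) # 198.214 14.079
print(round(stdev(jumps), 3))      # 14.079
```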

Activities

Try to pair each of the 4 samples whose numerical summaries are reported below with one of the 4 frequency plots below. Explain your answers.

Numerical Summaries of Four Samples
Sample   Mean    Median   StdDev
A        4.688   5.000    1.493
B        4.000   4.000    1.633
C        3.933   4.000    1.387
D        4.000   4.000    2.075

[Figure: four frequency plots (SOCR EBook Dinov EDA 012708 Fig10.jpg)]


Notes

  • Some software packages divide by \(n\) instead of by \(n-1\), which we used above. For large sample sizes the difference between the two becomes negligible. The \(n-1\) version has useful theoretical properties; e.g., the sample variance, as defined above, is an unbiased estimate of the population variance.
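To see how much the choice of divisor matters, the sketch below compares the two conventions on the long-jump sample using `stdev` (divisor \(n-1\)) and `pstdev` (divisor \(n\)) from the Python standard library. The helper `divisor_ratio` is an illustrative name for the factor \(\sqrt{(n-1)/n}\) that links the two results.

```python
from math import sqrt
from statistics import pstdev, stdev

jumps = [60, 64, 68, 74, 76, 78, 80, 106]

s_n_minus_1 = stdev(jumps)  # divides by n - 1
s_n = pstdev(jumps)         # divides by n

print(round(s_n_minus_1, 3), round(s_n, 3))  # 14.079 13.17


def divisor_ratio(n):
    """Ratio of the n-divisor SD to the (n-1)-divisor SD for sample size n."""
    return sqrt((n - 1) / n)


# The ratio approaches 1 as the sample size grows.
for n in (8, 100, 10000):
    print(n, round(divisor_ratio(n), 5))
```

For eight observations the two standard deviations differ by roughly 6%, while for 10,000 observations the ratio is within 0.01% of 1.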

Problems

