SOCR - User contributions [en]

AP Statistics Curriculum 2007 GLM MultLin

2014-10-03T22:42:44Z

Jslavine: /* Parameter Estimation in Multilinear Regression */

==[[AP_Statistics_Curriculum_2007 | General Advanced-Placement (AP) Statistics Curriculum]] - Multiple Linear Regression ==

In the [[AP_Statistics_Curriculum_2007#Chapter_X:_Correlation_and_Regression | previous sections]], we saw how to study the relations in bivariate designs. Now we extend that to any finite number of variables (multivariate case).

=== Multiple Linear Regression ===
We are interested in determining the '''linear regression''', as a model, of the relationship between one '''dependent''' variable ''Y'' and many '''independent''' variables <math>X_i</math>, ''i'' = 1, ..., ''p''. The multilinear regression model can be written as

: <math>Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots +\beta_p X_p + \varepsilon</math>, where <math>\varepsilon</math> is the error term.

The coefficient <math>\beta_0</math> is the intercept ("constant" term) and the <math>\beta_i</math> are the respective parameters of the ''p'' independent variables. There are ''p''+1 parameters to be estimated in the multilinear regression.

* Multilinear vs. non-linear regression: This multilinear regression method is "linear" because the relation of the response (the dependent variable <math>Y</math>) to the independent variables is assumed to be a [http://en.wikipedia.org/wiki/Linear_function linear function] of the parameters <math>\beta_i</math>. Note that multilinear regression is a linear modeling technique '''not''' because the graph of <math>Y = \beta_{0}+\beta x </math> is a straight line, '''nor''' because <math>Y</math> is a linear function of the ''X'' variables. Rather, "linear" refers to the fact that <math>Y</math> is a linear function of the parameters (<math>\beta_i</math>), even when it is not a linear function of <math>X</math>. Thus, a model like

: <math>Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon</math> is still a '''linear''' regression model: it is linear in the parameters, with <math>x</math> and <math>x^2</math> treated as two separate regressors, even though the graph of <math>Y</math> against <math>x</math> is not a straight line.
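The point that a quadratic model is still fitted by linear least squares can be illustrated with a short R sketch. The data below are simulated purely for illustration (the coefficients 1, 2 and -0.5 and the noise level are assumptions, not values from this section); the fit uses the standard <code>lm()</code> function and is compared with a direct solution of the normal equations.

```r
# Simulated (hypothetical) data with a curved relationship between x and y
set.seed(1)
x <- seq(-3, 3, length.out = 50)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(length(x), sd = 0.3)

# The model is linear in beta_0, beta_1, beta_2, so ordinary least squares applies
fit <- lm(y ~ x + I(x^2))
beta_hat <- coef(fit)

# Same estimates from the normal equations with design matrix [1, x, x^2]
X <- cbind(1, x, x^2)
beta_ne <- solve(t(X) %*% X, t(X) %*% y)

print(beta_hat)
```

The two sets of estimates agree up to rounding error, because <code>x</code> and <code>x^2</code> simply enter the design matrix as two columns.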

===Parameter Estimation in Multilinear Regression===
A multilinear regression with ''p'' slope coefficients, the regression intercept <math>\beta_0</math>, and ''n'' data points (sample size), with <math>n\geq (p+1) </math>, allows construction of the following vectors and matrix, from which the parameter estimates and their standard errors are computed:

:<math> \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} </math>

or, in '''vector-matrix notation'''

:<math> \ y = \mathbf{X}\cdot\beta + \varepsilon.\, </math>
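Each data point can be given as <math>(\vec x_i, y_i)</math>, <math>i=1,2,\dots,n</math>. For ''n'' = ''p''+1, the parameters can be estimated but their standard errors cannot be calculated, because no residual degrees of freedom remain (''n'' - ''p'' - 1 = 0). For ''n'' less than ''p''+1, the parameters themselves cannot be uniquely estimated.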

* '''Point Estimates''': The estimated values of the parameters <math>\beta_i</math> are given as
:<math>\widehat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T {\vec y}</math>

* '''Residuals''': The residuals, representing the difference between the observations and the model's predictions, are required to analyse the regression and are given by:

:<math>\hat{\vec\varepsilon} = \vec y - \mathbf{X} \hat\beta\,</math>

The standard deviation, <math>\hat \sigma </math>, for the model is determined from

:<math> {\hat \sigma = \sqrt{ \frac {\hat{\vec\varepsilon}^T \hat{\vec\varepsilon}} {n-p-1}} = \sqrt {\frac{{ \vec y^T \vec y - \hat{\vec\beta}^T \mathbf{X}^T \vec y}}{{n - p - 1}}} } </math>

Assuming independent, normally distributed errors with variance <math>\sigma^2</math>, the scaled variance estimate is Chi-square distributed:
:<math>\frac{(n-p-1)\hat\sigma^2}{\sigma^2} \sim \chi_{n-p-1}^2</math>

* '''Interval Estimates''': The <math>100(1-\alpha)\% </math> [[AP_Statistics_Curriculum_2007_Estim_S_Mean#Interval_Estimation_of_a_Population_Mean | confidence interval]] for the parameter, <math>\beta_i </math>, is computed as follows:

:<math> {\widehat \beta_i \pm t_{\frac{\alpha }{2},n - p - 1} \hat \sigma \sqrt {(\mathbf{X}^T \mathbf{X})_{ii}^{ - 1} } } </math>,

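where <math>t_{\frac{\alpha }{2},n - p - 1}</math> is the upper <math>\alpha/2</math> critical value of the [[AP_Statistics_Curriculum_2007_StudentsT | Student's t-distribution]] with <math>n-p-1</math> degrees of freedom and <math> (\mathbf{X}^T \mathbf{X})_{ii}^{ - 1}</math> denotes the <math>i^{th}</math> diagonal entry (row ''i'', column ''i'') of the inverse matrix <math>(\mathbf{X}^T \mathbf{X})^{-1}</math>.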
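The point estimates, residual standard deviation and confidence intervals above can be traced step by step in R. The following is a minimal sketch, assuming R's built-in <code>mtcars</code> data as a stand-in dataset (response <code>mpg</code>, predictors <code>wt</code> and <code>hp</code>, so ''p'' = 2 and ''n'' = 32); the choice of data and variable names is illustrative only and is not part of the SOCR materials.

```r
# Built-in mtcars data: response mpg, predictors wt and hp (p = 2, n = 32)
y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)      # design matrix with a leading column of ones
n <- nrow(X)
p <- ncol(X) - 1

XtX_inv   <- solve(t(X) %*% X)
beta_hat  <- as.vector(XtX_inv %*% t(X) %*% y)   # point estimates
res       <- as.vector(y - X %*% beta_hat)       # residuals
sigma_hat <- sqrt(sum(res^2) / (n - p - 1))      # residual standard deviation

# 95% confidence intervals: beta_i +/- t * sigma_hat * sqrt(i-th diagonal of (X'X)^{-1})
alpha <- 0.05
half  <- qt(1 - alpha / 2, df = n - p - 1) * sigma_hat * sqrt(diag(XtX_inv))
ci    <- cbind(lower = beta_hat - half, upper = beta_hat + half)

# Cross-check against R's built-in least-squares fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(cbind(estimate = beta_hat, ci))
```

The hand-computed values should match <code>coef(fit)</code>, <code>summary(fit)$sigma</code> and <code>confint(fit)</code> up to rounding error.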

The '''regression''' (or explained) '''sum of squares''' ''SSR'' is given by:

:<math> {\mathit{SSR} = \sum {\left( {\hat y_i - \bar y} \right)^2 } = \hat\beta^T \mathbf{X}^T \vec y - \frac{1}{n}\left( { \vec y^T \vec u \vec u^T \vec y} \right)} </math>,

where <math> \bar y = \frac{1}{n} \sum y_i</math> and <math> \vec u </math> is an ''n'' by 1 vector of ones (i.e. each element is 1). Note that the terms <math>y^T u</math> and <math>u^T y</math> are both equivalent to <math>\sum y_i</math>, and so the term <math>\frac{1}{n} y^T u u^T y</math> is equivalent to <math>\frac{1}{n}\left(\sum y_i\right)^2</math>.

The '''error''' (or residual) '''sum of squares''' (''ESS'') is given by the formula below. Note that some texts use the abbreviations differently (e.g., ''ESS'' for the explained and ''RSS'' for the residual sum of squares); here ''SSR'' always denotes the regression part and ''ESS'' the error part.

:<math> {\mathit{ESS} = \sum {\left( {y_i - \hat y_i } \right)^2 } = \vec y^T \vec y - \hat\beta^T \mathbf{X}^T \vec y}. </math>

The '''total sum of squares''' ('''TSS''') is given by

:<math> {\mathit{TSS} = \sum {\left( {y_i - \bar y} \right)^2 } = \vec y^T \vec y - \frac{1}{n}\left( { \vec y^T \vec u \vec u^T \vec y} \right) = \mathit{SSR}+ \mathit{ESS}}. </math>
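The matrix expressions for the three sums of squares can be verified numerically. The sketch below again assumes the built-in <code>mtcars</code> data (an illustrative choice, not from this section) and evaluates each formula literally, including the <math>\frac{1}{n} y^T u u^T y</math> term.

```r
y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)
n <- length(y)
u <- rep(1, n)                                   # vector of ones

beta_hat <- solve(t(X) %*% X, t(X) %*% y)
bXy  <- as.numeric(t(beta_hat) %*% t(X) %*% y)   # beta' X' y
yy   <- sum(y^2)                                 # y' y
yuuy <- as.numeric(t(y) %*% u %*% t(u) %*% y) / n  # (1/n) y' u u' y

SSR <- bXy - yuuy     # regression (explained) sum of squares
ESS <- yy - bXy       # error (residual) sum of squares
TSS <- yy - yuuy      # total sum of squares

y_hat <- as.vector(X %*% beta_hat)
print(c(SSR = SSR, ESS = ESS, TSS = TSS))
```

Each quantity can be compared with its elementwise definition, for example <code>sum((y - y_hat)^2)</code> for ''ESS'', and the identity ''TSS'' = ''SSR'' + ''ESS'' holds by construction.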

===Partial Correlations===
For a given linear model
<center> <math>Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots +\beta_p X_p + \varepsilon</math></center>
the partial correlation between <math>X_1</math> and ''Y'' given a set of ''p''-1 controlling variables <math>Z = \{X_2, X_3, \cdots, X_p\}</math>, denoted by <math>\rho_{YX_1|Z}</math>, is the correlation between the residuals <math>R_X</math> and <math>R_Y</math> resulting from the linear regression of <math>X_1</math> on '''Z''' and that of ''Y'' on '''Z''', respectively. The first-order partial correlation (a single controlling variable) is the difference between a correlation and the product of the removable correlations, divided by the product of the coefficients of alienation (<math>\sqrt{1-\rho^2}</math>) of the removable correlations.

* Partial correlation coefficients for three variables are calculated from the pairwise simple correlations.
: If <math>Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon</math>,
: then the partial correlation between <math>Y</math> and <math>X_2</math>, adjusting for <math>X_1</math> is:
: <math>\rho_{YX_2|X_1} = \frac{\rho_{YX_2} - \rho_{YX_1}\times \rho_{X_2X_1}}{\sqrt{1- \rho_{YX_1}^2}\sqrt{1-\rho_{X_2X_1}^2}}</math>

* In general, the sample partial correlation is
:<math>\hat{\rho}_{XY\cdot\mathbf{Z}}=\frac{N\sum_{i=1}^N r_{X,i}r_{Y,i}-\sum_{i=1}^N r_{X,i}\sum_{i=1}^N r_{Y,i}}
{\sqrt{N\sum_{i=1}^N r_{X,i}^2-\left(\sum_{i=1}^N r_{X,i}\right)^2}~\sqrt{N\sum_{i=1}^N r_{Y,i}^2-\left(\sum_{i=1}^N r_{Y,i}\right)^2}},</math> where the [http://en.wikipedia.org/wiki/Partial_correlation residuals] <math>r_{X,i}</math> and <math>r_{Y,i}</math> are given by:
:: <math>r_{X,i} = x_i - \langle\mathbf{w}_X^*,\mathbf{z}_i \rangle</math>
:: <math>r_{Y,i} = y_i - \langle\mathbf{w}_Y^*,\mathbf{z}_i \rangle</math>,
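:: with <math>x_i</math>, <math>y_i</math> and <math>\mathbf{z}_i</math> denoting the random (IID) samples of some joint probability distribution over X, Y and Z, and <math>\mathbf{w}_X^*</math> and <math>\mathbf{w}_Y^*</math> denoting the least-squares coefficient vectors from regressing X and Y, respectively, on Z.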
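The residual-based definition and the three-variable formula give the same sample value. A minimal R sketch, assuming the built-in <code>mtcars</code> data with ''Y'' = <code>mpg</code>, <math>X_1</math> = <code>wt</code> and <math>X_2</math> = <code>hp</code> (an illustrative choice of variables):

```r
y  <- mtcars$mpg
x1 <- mtcars$wt
x2 <- mtcars$hp

# Residual-based definition: correlate what remains of Y and X2 after regressing each on X1
r_y  <- resid(lm(y  ~ x1))
r_x2 <- resid(lm(x2 ~ x1))
pc_resid <- cor(r_y, r_x2)

# Closed form from the pairwise simple correlations
r_yx2  <- cor(y, x2)
r_yx1  <- cor(y, x1)
r_x2x1 <- cor(x2, x1)
pc_formula <- (r_yx2 - r_yx1 * r_x2x1) /
  (sqrt(1 - r_yx1^2) * sqrt(1 - r_x2x1^2))

print(c(residual_based = pc_resid, formula = pc_formula))
```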

====Computing the partial correlations====
The ''n''th-order partial correlation (|'''Z'''| = ''n'') can be computed from three (''n'' - 1)th-order partial correlations. The ''0''th-order partial correlation <math>\rho_{YX|\empty}</math> is defined to be the regular [http://en.wikipedia.org/wiki/Correlation correlation coefficient] <math>\rho_{YX}</math>.

For any <math>Z_0 \in \mathbf{Z}</math>:
:<math>\rho_{XY| \mathbf{Z} } =
\frac{\rho_{XY| \mathbf{Z}\setminus\{Z_0\}} - \rho_{XZ_0| \mathbf{Z}\setminus\{Z_0\}}\rho_{YZ_0 | \mathbf{Z}\setminus\{Z_0\}}}
{\sqrt{1-\rho_{XZ_0 |\mathbf{Z}\setminus\{Z_0\}}^2} \sqrt{1-\rho_{YZ_0 | \mathbf{Z}\setminus\{Z_0\}}^2}}.</math>

Implementing this computation recursively yields an exponential time [http://en.wikipedia.org/wiki/Computational_complexity_theory complexity].

Note that in the case where Z is a single variable, this reduces to:
:<math>\rho_{XY | Z } =
\frac{\rho_{XY} - \rho_{XZ}\rho_{YZ}}
{\sqrt{1-\rho_{XZ}^2} \sqrt{1-\rho_{YZ}^2}}.</math>
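The recursion can be written directly as a short R function. In this sketch, <code>pcor_rec</code> is a hypothetical helper name (not an existing R function), the variables are referred to by their column positions in a correlation matrix, and the built-in <code>mtcars</code> data is assumed for illustration. The result is compared with the residual-based definition of the partial correlation.

```r
# Recursive partial correlation between columns i and j of correlation matrix R,
# controlling for the columns listed in z (pcor_rec is a hypothetical helper)
pcor_rec <- function(R, i, j, z = integer(0)) {
  if (length(z) == 0) return(R[i, j])        # 0th order: ordinary correlation
  z0   <- z[1]                               # variable Z_0 removed at this step
  rest <- z[-1]                              # remaining controlling variables
  r_ij <- pcor_rec(R, i, j, rest)
  r_i0 <- pcor_rec(R, i, z0, rest)
  r_j0 <- pcor_rec(R, j, z0, rest)
  (r_ij - r_i0 * r_j0) / (sqrt(1 - r_i0^2) * sqrt(1 - r_j0^2))
}

R <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
pc_rec <- pcor_rec(R, 1, 2, c(3, 4))         # mpg vs wt, given hp and disp

# Cross-check with the residual-based definition
res_mpg  <- resid(lm(mpg ~ hp + disp, data = mtcars))
res_wt   <- resid(lm(wt  ~ hp + disp, data = mtcars))
pc_resid <- cor(res_mpg, res_wt)

print(c(recursive = pc_rec, residual_based = pc_resid))
```

Each call spawns three calls of one order lower, which is the source of the exponential growth in running time mentioned above.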

===Categorical Variables in Multiple Regression===

When using categorical variables with more than two levels in multiple regression modeling, we need to make sure the results are correctly interpreted. For categorical variables with more than two levels, a number of separate, dichotomous variables need to be created; this is called “dummy coding”, and the new variables are called “dummy variables”.

Dichotomous categorical predictor variables, variables with two levels, may be directly entered as predictor or predicted variables in a multiple regression model. Their use in multiple regression is a direct extension of their use in simple linear regression. The interpretation of their regression weights depends upon how the variables are coded. If a dichotomous variable is coded as 0 and 1, its regression weight is added to the predicted value of the response variable, Y, for the group coded 1 (raising the prediction if the weight is positive and lowering it if the weight is negative), while the prediction for the group coded 0 is unchanged.

If a dichotomous variable is coded as -1 and 1, then a positive regression weight is subtracted from the prediction for the group coded as -1 and added to the prediction for the group coded as 1. If the regression weight is negative, then addition and subtraction are reversed. Dichotomous variables are also included in hypothesis tests for $R^2$ change like any other quantitative variable.

Adding variables to a linear regression model will never decrease the (unadjusted) $R^2$ value, and will typically increase it. If additional predictor variables are correlated with the predictor variables already in the model, then the combined results may be difficult to predict. In some cases, the combined result will only slightly improve the prediction, whereas in other cases, a much better prediction may be obtained by combining two correlated variables.

If the additional predictor variables are uncorrelated (their correlation is trivial) with the predictor variables already in the model, then the result of adding additional variables to the regression model is easy to predict: the $R^2$ change will be equal to the squared correlation coefficient between the added variable and the predicted variable. In this case it makes no difference in what order the predictor variables are entered into the prediction model. For example, if $X_1$ and $X_2$ were uncorrelated ($\rho_{12} = 0$), with $\rho^2_{1y} = 0.4$ and $\rho^2_{2y} = 0.5$, then $R^2$ for $X_1$ and $X_2$ together would equal $0.4 + 0.5 = 0.9$. The value of the $R^2$ change for $X_2$, given that $X_1$ is included in the model, would be 0.5. The value of the $R^2$ change for $X_2$ given that no variable was in the model would also be 0.5. At whatever stage $X_2$ was entered into the model, the value of the $R^2$ change would always be 0.5. Similarly, the $R^2$ change value for $X_1$ would always be 0.4. Because of this relationship, uncorrelated predictor variables are preferred, when possible.
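The additivity of $R^2$ for uncorrelated predictors can be checked with a small simulation. In this R sketch the predictors are built from a balanced ±1 design so that their sample correlation is exactly zero; the coefficients and noise are arbitrary assumptions chosen only for illustration.

```r
set.seed(2)
# Balanced +/-1 design: x1 and x2 have sample correlation exactly 0
x1 <- rep(c(-1, 1), each = 2, times = 10)
x2 <- rep(c(-1, 1), times = 20)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(40)    # hypothetical response

r2_x1   <- summary(lm(y ~ x1))$r.squared
r2_x2   <- summary(lm(y ~ x2))$r.squared
r2_both <- summary(lm(y ~ x1 + x2))$r.squared

# R^2 change for x2 is the same whether or not x1 is already in the model
print(c(x1_alone = r2_x1, x2_alone = r2_x2,
        both = r2_both, change_for_x2 = r2_both - r2_x1))
```

Here <code>r2_both</code> equals <code>r2_x1 + r2_x2</code> up to rounding error, so the order of entry does not matter.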

Look at the [[SOCR_Data_AD_BiomedBigMetadata| Modeling and Analysis of Clinical, Genetic and Imaging Data of Alzheimer’s Disease dataset]]. It is fairly clear that '''DX_Conversion''' could be directly entered into a regression model predicting '''MMSE''', because it is dichotomous. The problem is how to deal with the two categorical predictor variables with more than two levels (e.g., '''GDTOTAL''').

'''Dummy Coding''' refers to making many dichotomous variables out of one multilevel categorical variable. Because categorical predictor variables cannot be entered directly into a regression model and be meaningfully interpreted, some other method of dealing with information of this type must be developed. In general, a categorical variable with k levels will be transformed into k-1 variables each with two levels. For example, if a categorical variable had 4 levels, then 3 dichotomous (dummy) variables could be constructed that would contain the same information as the single categorical variable. Dichotomous variables have the advantage that they can be directly entered into the regression model. The process of creating dichotomous variables from categorical variables is called dummy coding.

Depending upon how the dichotomous variables are constructed, additional information can be gleaned from the analysis. In addition, careful construction will result in uncorrelated dichotomous variables. These variables have the advantage of simplicity of interpretation and are preferred to correlated predictor variables.

The simplest case of dummy coding is when the categorical variable has three levels and is converted to two dichotomous variables. For example, School in the example data has three levels, 1=Math, 2=Biology, and 3=Engineering. This variable could be dummy coded into two variables, one called Math and one called Biology. If School=1, then Math would be coded with a 1 and Biology with a 0. If School=2, then Math would be coded with a 0 and Biology would be coded with a 1. If School=3, then both Math and Biology would be coded with a 0. The dummy coding is represented below.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|+ Dummy Coded Variables
|-
| colspan=2|School||Math||Biology
|-
| Math||1||1||0
|-
| Biology||2||0||1
|-
| Engineering||3||0||0
|}</center>
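The 0/1 coding in the table corresponds to what R calls treatment contrasts. A minimal sketch, assuming a small made-up vector of School labels (the six observations are hypothetical) with Engineering placed first so that it acts as the reference level:

```r
# Hypothetical School labels; the first factor level (Engineering) is the reference
school <- factor(c("Math", "Biology", "Engineering", "Math", "Engineering", "Biology"),
                 levels = c("Engineering", "Math", "Biology"))

# Treatment (0/1) dummy coding: k = 3 levels give k - 1 = 2 dummy columns plus an intercept
D <- model.matrix(~ school, contrasts.arg = list(school = "contr.treatment"))
print(data.frame(school, Math = D[, "schoolMath"], Biology = D[, "schoolBiology"]))
```

Rows for Engineering have 0 in both dummy columns, matching the last row of the table.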

See the [[SOCR_EduMaterials_AnalysisActivities_MLR |SOCR Regression analysis]] and compare it against [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1|SOCR ANOVA]].

'''Dummy Coding into Independent Variables''': Selecting an appropriate set of dummy codes will result in new variables that are uncorrelated or independent of each other. When the categorical variable has three levels, this can be accomplished by creating a new variable where one level of the categorical variable is assigned the value of -2 and the other two levels are assigned the value of 1. The signs are arbitrary and may be reversed, that is, values of 2 and -1 would work equally well. The second dummy-coded variable assigns the value 0 to the level that was coded as -2 and the values 1 and -1 to the other two levels. In all cases, the codes of each dummy-coded variable sum to zero across the levels.

We can directly interpret each of the new dummy-coded variables, called a contrast, and compare levels coded with a positive number to levels coded with a negative number. Levels coded with a zero are not included in the interpretation.

For example, School in the example data has three levels, 1=Math, 2=Biology, and 3=Engineering. This variable could be dummy coded into two variables, one called Engineering (comparing the Engineering School with the other two Schools) and one called Math_vs_Bio (comparing the Math and Biology Schools). The Engineering contrast would create a variable where all members of the Engineering School would be given a value of -2 and all members of the other two Schools would be given a value of 1. The Math_vs_Bio contrast would assign a value of 0 to members of the Engineering School, 1 divided by the number of members of the Math School to members of the Math School, and -1 divided by the number of members of the Biology School to members of the Biology School. The Math_vs_Bio variable could instead be coded as 1 and -1 for Math and Biology respectively, but with unequal group sizes the recoded variable would no longer be uncorrelated with the first dummy coded variable (Engineering). In most practical applications, it makes little difference whether the variables are correlated or not, so the simpler 1 and -1 coding is generally preferred. The contrasts are summarized in the following table.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|+ Dummy Coded Variables
|-
| colspan=2|School||Engineering||Math_vs_Bio
|-
| Math||1||1||1/N1 = 1/12= 0.0833
|-
| Biology||2||1||-1/N2 = -1/7 = -0.1429
|-
| Engineering||3||-2||0
|}</center>

Note that the correlation coefficient between the two contrasts is zero. The correlation between the Engineering contrast and Salary is -.585 with a squared correlation coefficient of .342. This correlation coefficient has a significance level of .001. The correlation coefficient between the Math_vs_Bio contrast and Salary is -.150 with a squared value of .023.
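The zero correlation between the two contrasts can be confirmed numerically. The R sketch below uses the group sizes from the table (12 for Math, 7 for Biology); the Engineering group size is not stated in the text, so the value 5 is a hypothetical placeholder, and the zero correlation does not depend on it.

```r
n_math <- 12; n_bio <- 7     # group sizes taken from the table
n_eng  <- 5                  # hypothetical: not stated in the text
school <- rep(c("Math", "Biology", "Engineering"), times = c(n_math, n_bio, n_eng))

engineering <- ifelse(school == "Engineering", -2, 1)
math_vs_bio <- ifelse(school == "Math", 1 / n_math,
               ifelse(school == "Biology", -1 / n_bio, 0))
# Simpler +1 / -1 coding of the same comparison
math_vs_bio_simple <- ifelse(school == "Math", 1,
                      ifelse(school == "Biology", -1, 0))

print(cor(engineering, math_vs_bio))          # zero up to rounding error
print(cor(engineering, math_vs_bio_simple))   # nonzero because group sizes differ
```

Scaling the Math and Biology codes by the inverse group sizes is what removes the correlation when the groups are unbalanced.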

Generate the corresponding [[AP_Statistics_Curriculum_2007_ANOVA_1Way#ANOVA_Table|SOCR ANOVA table]].

Show that the significance level is identical to the value when each contrast was entered last into the regression model. In this case the Engineering contrast was significant and the Math_vs_Bio contrast was not. The interpretation of these results would be that the Engineering Department was paid significantly more than the Math and Biology Schools, but that no significant differences in salary were found between the Math and Biology Schools.
If a categorical variable had four levels, three dummy coded contrasts would be necessary to use the categorical variable in a regression analysis. For example, suppose that a researcher at a pain center did a study with 4 groups of four patients each (N is being deliberately kept small). The dependent measure is subjective experience of pain. The 4 groups consisted of 4 different treatment conditions.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
! Group||Treatment
|-
|1||None
|-
|2||Placebo
|-
|3||Psychotherapy
|-
|4||Acupuncture
|}</center>

An independent contrast is a contrast that is not a linear combination of any other set of contrasts. Any set of independent contrasts would work equally well if the end result was the simultaneous test of all three contrasts, as in an ANOVA. One of the many possible examples is presented below.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|+ Dummy Coded Variables
|-
| colspan=2|Group||C1||C2||C3
|-
| None||1||0||0||0
|-
| Placebo||2||1||0||0
|-
| Psychotherapy||3||0||1||0
|-
| Acupuncture||4||0||0||1
|}</center>

Application of this dummy coding in a regression model entering all contrasts in a single block would result in an ANOVA table similar to the one obtained using Means, ANOVA, or General Linear Model.

This solution would not be ideal, however, because considerable information can be gained by setting the contrasts to test specific hypotheses. The levels of the categorical variable generally dictate the structure of the contrasts. In the example study, it makes sense to contrast the two control groups (1 and 2) with the two experimental groups (3 and 4). Any two numbers would work, one assigned to groups 1 and 2 and the other assigned to groups 3 and 4, but it is conventional to have the sum of the contrasts equal to zero. One contrast that meets this criterion would be (-1, -1, 1, 1).
Generally it is easiest to set up contrasts within subgroups of the first contrast. For example, a second contrast might test whether there are differences between the two control groups. This contrast would appear as (1, -1, 0, 0). Similarly, within the experimental treatment groups, a contrast comparing Group 3 with Group 4 might be appropriate: (0, 0, 1, -1). Combined, the contrasts are given in the following table.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|+ Dummy Coded Variables
|-
| colspan=2|Group||C1||C2||C3
|-
| None||1||-1||1||0
|-
| Placebo||2||-1||-1||0
|-
| Psychotherapy||3||1||0||1
|-
| Acupuncture||4||1||0||-1
|}</center>
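A set of planned contrasts for the four treatment groups can be attached to a factor in R and used in a regression. The sketch below assumes a zero-sum version of the control-versus-treatment contrast, (-1, -1, 1, 1), together with the two within-subgroup contrasts; the pain scores are simulated and entirely hypothetical.

```r
# Rows: None, Placebo, Psychotherapy, Acupuncture
C <- cbind(C1 = c(-1, -1, 1, 1),   # controls vs. treatments (zero-sum coding)
           C2 = c(1, -1, 0, 0),    # None vs. Placebo
           C3 = c(0, 0, 1, -1))    # Psychotherapy vs. Acupuncture

print(colSums(C))      # every contrast sums to zero
print(crossprod(C))    # zero off-diagonals: orthogonal under equal group sizes

# Hypothetical pain scores, 4 patients per group (simulated, not from the text)
set.seed(3)
lev   <- c("None", "Placebo", "Psychotherapy", "Acupuncture")
group <- factor(rep(lev, each = 4), levels = lev)
pain  <- rnorm(16, mean = rep(c(7, 6.5, 5, 4.5), each = 4))

contrasts(group) <- C                        # use the planned contrasts
fit_contrast <- lm(pain ~ group)
fit_default  <- lm(pain ~ factor(as.character(group)))   # default 0/1 coding
print(summary(fit_contrast)$coefficients)
```

Both codings span the same model, so the overall fit (for example $R^2$) is identical; only the interpretation of the individual coefficients changes.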

Equal sample sizes are important because the results are much cleaner when the sample sizes are equal. However, equal sample sizes are not common in real applications, even in well-designed experiments. Unequal sample sizes make the effects no longer independent. This implies that, in hypothesis testing, it makes a difference whether an effect is added to the model first, in the middle, or last. The same dummy coding that was applied to equal sample sizes will now be applied to the original data with unequal sample sizes.

===Examples===

We now demonstrate the use of [[SOCR_EduMaterials_AnalysisActivities_MLR | SOCR Multilinear regression applet]] to analyze multivariate data.

====Earthquake Modeling====
This is an example where the relation between variables may not be linear or explanatory. In the [[AP_Statistics_Curriculum_2007_GLM_Regress | simple linear regression case]], we were able to compute by hand some (simple) examples. Such calculations are much more involved in the multilinear regression situations. Thus we demonstrate multilinear regression only using the [http://socr.ucla.edu/htmls/SOCR_Analyses.html SOCR Multiple Regression Analysis Applet].

Use the SOCR California Earthquake dataset to investigate whether Earthquake magnitude (dependent variable) can be predicted by knowing the longitude, latitude, distance and depth of the quake. Clearly, we do not expect these predictors to have a strong effect on the earthquake magnitude, so we expect the coefficient parameters not to be significantly distinct from zero (null hypothesis). SOCR Multilinear regression applet reports this model:

: <math>Magnitude = \beta_0 + \beta_1\times Close+ \beta_2\times Depth+ \beta_3\times Longitude+ \beta_4\times Latitude + \varepsilon.</math>

: <math>Magnitude = 2.320 + 0.001\times Close -0.003\times Depth -0.035\times Longitude -0.028\times Latitude + \varepsilon.</math>

<center>[[Image:SOCR_EBook_Dinov_GLM_MLR_021808_Fig1.jpg|500px]]
[[Image:SOCR_EBook_Dinov_GLM_MLR_021808_Fig2.jpg|500px]]</center>

====Multilinear Regression on Consumer Price Index====
Using the [[SOCR_Data_Dinov_021808_ConsumerPriceIndex | SOCR Consumer Price Index Dataset]] we can explore the relationship between the prices of various products and commodities. For example, regressing '''Gasoline''' on the following three predictor prices: '''Orange Juice''', '''Fuel''' and '''Electricity''' shows that all three are significant explanatory variables (at <math>\alpha=0.05</math>) for the cost of ''Gasoline'' between 1981 and 2006.

: <math>Gasoline = 0.083 -0.190\times Orange +0.793\times Fuel +0.013\times Electricity
</math>

<center>[[Image:SOCR_EBook_Dinov_GLM_MLR_021808_Fig3.jpg|500px]]
[[Image:SOCR_EBook_Dinov_GLM_MLR_021808_Fig4.jpg|500px]]</center>

====2011 Best Jobs in the US====
Repeat the multilinear regression analysis using the [[SOCR_Data_2011_US_JobsRanking| Ranking Dataset of the Best and Worst USA Jobs for 2011]].

<hr>

===[[EBook_Problems_GLM_MultLin|Problems]]===

<hr>
* SOCR Home page: http://www.socr.ucla.edu

{{translate|pageName=http://wiki.stat.ucla.edu/socr/index.php?title=AP_Statistics_Curriculum_2007_GML_MultLin}}

SMHS CLT LLN

2014-10-01T15:11:30Z

Jslavine: /* Law of Large NUmbers (LLN) */

==[[SMHS| Scientific Methods for Health Sciences]] - Limit Theory: Central Limit Theorem and Law of Large Numbers ==

===Overview:===
The two most commonly used theorems in the field of probability, the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT), are commonly referred to as the first and second fundamental laws of probability. The CLT states that, under certain conditions, the arithmetic mean of a sufficiently large number of independent random variables will be approximately normally distributed. The LLN states that when the same experiment is performed a large number of times, the average of the results obtained should be close to the expected value and tends to get closer to it as the number of trials increases. In this section, we introduce these two probability theorems and illustrate their applications with examples. Finally, we show some common misconceptions about the CLT and LLN.

===Motivation:===
Suppose we independently conduct one experiment repeatedly. Assume that we are interested in the relative frequency of the occurrence of one event whose probability of being observed at each experiment is p. The ratio of the observed frequency of that event to the total number of repetitions converges towards p as the number of (identical and independent) experiments increases. This is an informal statement of the Law of Large Numbers (LLN). Another important property that comes with large sample sizes is the CLT. What happens when the experiment is repeated a sufficiently large number of times? Does it matter which distribution each individual outcome follows? What does the CLT tell us in situations like this, and how can we apply this theorem to help solve more complicated research problems?

===Theory===
====Law of Large Numbers (LLN)====
When performing the same experiment a large number of times, the average of the results obtained should be close to the expected value and tends to get closer to the expected value with increasing number of trials.
*It is generally necessary to draw the parallels between the formal LLN statements (in terms of sample averages) and the frequent interpretations of the LLN (in terms of probabilities of various events). Suppose we observe the same process independently multiple times. Assume a binarized (dichotomous) function of the outcome of each trial is of interest (e.g., failure may denote the event that the continuous voltage measure is $< 0.5V$, and the complement, success, that the voltage is $≥ 0.5V$; this is the situation in electronic chips, which binarize electric currents to 0 or 1). Researchers are often interested in the event of observing a success at a given trial or the number of successes in an experiment consisting of multiple trials. Let’s denote $p=P(success)$ at each trial. Then, the ratio of the total number of successes to the number of trials $(n)$ is the average $\bar X_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$, where $X_{i}=0$ if failure and 1 if success. Hence, $\bar X_{n}=\hat p$, the ratio of the observed frequency of that event to the total number of repetitions, estimates the true $p=P(success)$. Therefore, $\hat p$ converges towards $p$ as the number of (identical and independent) trials increases.
*LLN Application: One demonstration of the law of large numbers provides practical algorithms for estimation of transcendental numbers. The two most popular transcendental numbers are $\pi$ and $e$.
*[http://socr.ucla.edu/htmls/SOCR_Experiments.html The SOCR Uniform e-Estimate Experiment] provides the complete details of this simulation. In a nutshell, we can estimate the value of the constant $e$ (the base of the natural logarithm) using random sampling from the Uniform distribution. Suppose $X_{1},X_{2},…,X_{n}$ are drawn from the Uniform distribution on $(0,1)$ and define $U= \min\{n : X_{1}+X_{2}+⋯+X_{n}>1\}$; note that all $X_{i}≥0.$ Then the expected value is $E(U)=e\approx 2.7183.$ Therefore, by the [[SOCR_EduMaterials_Activities_LawOfLargeNumbers|LLN]], taking averages of $\{U_{1},U_{2},…,U_{k}\}$ values, each computed from random samples $X_{1},X_{2},…,X_{n}\sim U(0,1)$ as described above, will provide an increasingly accurate estimate (as $k\rightarrow\infty$) of $e$. The Uniform e-Estimate Experiment provides a hands-on demonstration of how the LLN facilitates stochastic simulation-based estimation of $e$.
*Common misconceptions: (1) If we observe a streak of 10 consecutive heads (when p=0.5, say), the odds of the 11th trial being a head are greater than p. This is, of course, incorrect, as the coin tosses are independent trials (an example of a memoryless process); (2) If we run a large number of coin tosses, the number of heads and the number of tails become more and more equal. This is incorrect, as the LLN only guarantees that the sample proportion of heads will converge to the true population proportion (the p parameter that we selected). In fact, the absolute difference |Heads - Tails| tends to grow as the number of tosses increases.
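The LLN-based estimate of $e$ described above can be reproduced with a few lines of R. This is a minimal sketch of the idea, not the SOCR applet's implementation; <code>draw_U</code> is a hypothetical helper name and the number of replications $k$ is an arbitrary choice.

```r
set.seed(4)

# One realization of U: the smallest n with X_1 + ... + X_n > 1, X_i ~ Uniform(0, 1)
draw_U <- function() {
  total <- 0
  n <- 0
  while (total <= 1) {
    total <- total + runif(1)
    n <- n + 1
  }
  n
}

k <- 20000                          # number of independent replications
U <- replicate(k, draw_U())
e_hat <- mean(U)                    # LLN: the average of the U values approaches E(U) = e
running <- cumsum(U) / seq_along(U) # running averages, to see the convergence

print(c(estimate = e_hat, truth = exp(1)))
```

Plotting <code>running</code> against the replication index shows the averages settling near <code>exp(1)</code> as $k$ grows.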

====Central Limit Theorem (CLT)====
The arithmetic mean of a sufficiently large number of iterates of independent random variables given certain conditions will be approximately normally distributed. It states that the sum of many independent and identically distributed (i.i.d.) random variables will tend to be distributed according to one of a small set of attractor distributions. There are various statements of the central limit theorem, but all of them represent weak-convergence results regarding (mostly) the sums of independent identically-distributed (random) variables.
*Definition of CLT: let $\{X_{1},X_{2},…,X_{n}\}$ be an i.i.d. random sample of size n drawn from a distribution with expected value $\mu$ and finite variance $\sigma^{2}$. The sample average is $\bar{x}_{n}=\frac{X_{1}+X_{2}+⋯+X_{n}}{n}$. By the LLN, the sample averages converge in probability and almost surely to the expected value $\mu$ as $n\rightarrow \infty$. As n gets larger, the distribution of the difference between the sample average $\bar{x}_{n}$ and its limit $\mu$, when multiplied by the factor $\sqrt n$, approaches the normal distribution with mean 0 and variance $\sigma^{2}$: $\sqrt n(\bar{x}_{n}-\mu)\rightarrow N(0,\sigma^{2})$ in distribution. Thus, for large enough n, the distribution of $\bar{x}_{n}$ is close to the normal distribution with mean $\mu$ and variance $\frac{\sigma^{2}}{n}$: $\bar{x}_{n}\approx N(\mu,\frac{\sigma^{2}}{n})$.

* See the [[SOCR_EduMaterials_Activities_GeneralCentralLimitTheorem| Generalized CLT Activity and Applications]].
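The CLT statement can be checked by simulation. The following R sketch assumes Uniform(0,1) observations (mean 0.5, variance 1/12); the sample size and the number of simulated samples are arbitrary illustrative choices.

```r
set.seed(5)
n <- 30       # sample size
m <- 5000     # number of simulated samples

# Each matrix row is one sample of size n; rowMeans gives the m sample averages
xbar <- rowMeans(matrix(runif(m * n, 0, 1), nrow = m))
z <- sqrt(n) * (xbar - 0.5)    # CLT: approximately N(0, 1/12)

print(c(mean_z = mean(z), sd_z = sd(z), sd_theory = sqrt(1 / 12)))

# Share of standardized values inside +/- 1.96, which is about 0.95 for a normal variable
coverage <- mean(abs(z / sqrt(1 / 12)) <= 1.96)
print(coverage)
```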

*Multidimensional CLT: the central limit theorem extends to the case where $X_1,X_2,…,X_n$ are i.i.d. random vectors in $R^k$ with mean vector $μ=E(X_i)$ and covariance matrix $Σ$. Let $\mathbf{X_i}=\begin{bmatrix} X_{i(1)} \\ \vdots \\ X_{i(k)} \end{bmatrix}$ be the $i$-th random vector. The bold in $\mathbf{X_i}$ means that it is a random vector, not a random (univariate) variable. Then the sum of the random vectors will be $\begin{bmatrix} X_{1(1)} \\ \vdots \\ X_{1(k)} \end{bmatrix}+\begin{bmatrix} X_{2(1)} \\ \vdots \\ X_{2(k)} \end{bmatrix}+\cdots+\begin{bmatrix} X_{n(1)} \\ \vdots \\ X_{n(k)} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} \left [ X_{i(1)} \right ] \\ \vdots \\ \sum_{i=1}^{n} \left [ X_{i(k)} \right ] \end{bmatrix} = \sum_{i=1}^{n} \left [ \mathbf{X_i} \right ].$ Also, the average will be $\left (\frac{1}{n}\right)\sum_{i=1}^{n} \left [ \mathbf{X_i} \right ]= \frac{1}{n}\begin{bmatrix} \sum_{i=1}^{n} \left [ X_{i(1)} \right ] \\ \vdots \\ \sum_{i=1}^{n} \left [ X_{i(k)} \right ] \end{bmatrix} = \begin{bmatrix} \bar X_{(1)} \\ \vdots \\ \bar X_{(k)} \end{bmatrix}=\mathbf{\bar X_n}$. Thus, $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left [\mathbf{X_i} - E\left ( X_i\right ) \right ]=\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \left [ \mathbf{X_i} - \mu \right ]=\sqrt{n}\left(\mathbf{\overline{X}}_n - \mu\right) .$

The multivariate central limit theorem implies that
$$\sqrt{n}\left(\mathbf{\overline{X}}_n - \mu\right)\ \stackrel{D}{\rightarrow}\ \mathcal{N}_k(0,\Sigma),$$
where the covariance matrix $Σ$ is equal to
$$\Sigma=\begin{bmatrix}
{Var \left (X_{1(1)} \right)} & {Cov \left (X_{1(1)},X_{1(2)} \right)} & Cov \left (X_{1(1)},X_{1(3)} \right) & \cdots & Cov \left (X_{1(1)},X_{1(k)} \right) \\
{Cov \left (X_{1(2)},X_{1(1)} \right)} & {Var \left (X_{1(2)} \right)} & {Cov \left(X_{1(2)},X_{1(3)} \right)} & \cdots & Cov \left(X_{1(2)},X_{1(k)} \right) \\
Cov \left (X_{1(3)},X_{1(1)} \right) & {Cov \left (X_{1(3)},X_{1(2)} \right)} & Var \left (X_{1(3)} \right) & \cdots & Cov \left (X_{1(3)},X_{1(k)} \right) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
Cov \left (X_{1(k)},X_{1(1)} \right) & Cov \left (X_{1(k)},X_{1(2)} \right) & Cov \left (X_{1(k)},X_{1(3)} \right) & \cdots & Var \left (X_{1(k)} \right) \\
\end{bmatrix}.$$
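
The multivariate statement can be checked empirically. The following R sketch is only an illustration under stated assumptions: the bivariate native process (built from two independent Exponential(1) draws), the sample size and the number of simulated samples are hypothetical choices, not part of the original material. It simulates $\sqrt{n}\left(\mathbf{\overline{X}}_n - \mu\right)$ many times and compares its empirical mean and covariance with $0$ and $Σ$.

```r
# Sketch: empirical check of the multivariate CLT for k = 2.
# Assumption (for illustration only): X_i = (E1, E1 + E2), where E1 and E2
# are independent Exponential(1) variables, a skewed non-normal process.
set.seed(1)
n <- 200      # size of each sample
m <- 2000     # number of simulated samples

mu    <- c(1, 2)                          # E(E1) = 1, E(E1 + E2) = 2
Sigma <- matrix(c(1, 1, 1, 2), nrow = 2)  # Var = (1, 2), Cov = 1

z <- t(replicate(m, {
  e1 <- rexp(n)
  e2 <- rexp(n)
  x  <- cbind(e1, e1 + e2)        # n x 2 matrix, one random vector per row
  sqrt(n) * (colMeans(x) - mu)    # sqrt(n) * (Xbar_n - mu)
}))

colMeans(z)   # should be close to (0, 0)
cov(z)        # should be close to Sigma
```

Scatter plots or marginal histograms of the two columns of `z` can then be compared with the corresponding normal densities, in the same way as the univariate scripts below.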

*The chart below demonstrates the CLT: The sample means are generated using a random number generator, which draws numbers between 1 and 100 from a uniform probability distribution. It illustrates that increasing sample sizes result in the 500 measured sample means being more closely distributed about the population mean (50 in this case). It also compares the observed distributions with the distributions that would be expected for a normalized Gaussian distribution, and shows the chi-squared values that quantify the goodness of the fit (the fit is good if the reduced chi-squared value is less than or approximately equal to one). The input into the normalized Gaussian function is the mean of sample means (~50) and the mean sample standard deviation divided by the square root of the sample size (~28.87/√n), which is called the standard deviation of the mean (since it refers to the spread of sample means).

Use the following R-script to generate the graph below:
par(mfrow=c(4,3))
k = 5 # Sample-size
m <- 200 # Number of Samples

xbarn.5 <- apply(matrix(rnorm(m*k,50,15),nrow=m),1,mean)
hist(xbarn.5,col="blue",xlim=c(0,100),prob=T,xlab="",ylab="",main="Normal(50,15)")
mtext(expression(bar(x)[5]),side=4,line=1)

xbaru.5 <- apply(matrix(runif(m*k,0,1),nrow=m),1,mean)
hist(xbaru.5,col="blue",xlim=c(0,1),prob=T,xlab="",ylab="",main="Uniform(0,1)")
mtext(expression(bar(x)[5]),side=4,line=1)

xbare.5 <- apply(matrix(rlnorm(m*k,1,1),nrow=m),1,mean)
hist(xbare.5,col="blue",xlim=c(0,15),prob=T,xlab="",ylab="",main="Log-Normal(1,1)")
mtext(expression(bar(x)[5]),side=4,line=1)

xbarn.10 <- apply(matrix(rnorm(m*k*2,50,15),nrow=m),1,mean)
hist(xbarn.10,col="blue",xlim=c(0,100),prob=T,xlab="",ylab="",main="")
mtext(expression(bar(x)[10]),side=4,line=1)

xbaru.10 <- apply(matrix(runif(m*k*2,0,1),nrow=m),1,mean)
hist(xbaru.10,col="blue",xlim=c(0,1),prob=T,xlab="",ylab="",main="")
mtext(expression(bar(x)[10]),side=4,line=1)

xbare.10 <- apply(matrix(rlnorm(m* k*2,1,1),nrow=m),1,mean)
hist(xbare.10,col="blue",xlim=c(0,15),prob=T,xlab="",ylab="",main="")
mtext(expression(bar(x)[10]),side=4,line=1)

# k*6 = 30 observations per sample, matching the bar(x)[30] labels
xbarn.30 <- apply(matrix(rnorm(m*k*6,50,15),nrow=m),1,mean)
hist(xbarn.30,col="blue",xlim=c(0,100),prob=T,xlab="",ylab="",main="")
mtext(expression(bar(x)[30]),side=4,line=1)

xbaru.30 <- apply(matrix(runif(m*k*6,0,1),nrow=m),1,mean)
hist(xbaru.30,col="blue",xlim=c(0,1),prob=T,xlab="",ylab="",main="")
mtext(expression(bar(x)[30]),side=4,line=1)

xbare.30 <- apply(matrix(rlnorm(m*k*6,1,1),nrow=m),1,mean)
hist(xbare.30,col="blue",xlim=c(0,15),prob=T,xlab="",ylab="",main="")
mtext(expression(bar(x)[30]),side=4,line=1)

[[Image:SMHS CCT LLN Fig 1.png|600px]]

# Alternative Plots
m <- 2000 # Number of samples
n <- 16 # size of each sample
mu <- 50
sigma <- 15
sigma.xbar <- sigma/sqrt(n)
rnv <- rnorm(m*n,mu,sigma) # m samples of size n
rnvm <- matrix(rnv,nrow=m) # m*n matrix

samplemeans <- apply(rnvm,1,mean) # compute mean across rows of matrix

hist(samplemeans) # plain histogram
hist(samplemeans,prob=T) # density histogram
xs <- seq((mu-4*sigma.xbar),(mu+4*sigma.xbar),length=800)
ys <- dnorm(xs,mu,sigma.xbar)
lines(xs,ys,type="l") # superimpose normal
par(mfrow=c(1,1))

par(col.main="blue",pty="s")
hist(samplemeans,prob=T,col="blue",breaks="scott",
xlab=expression(bar(X)[16]),
main=expression(paste("(X~N(50,15^2): Simulated Sampling Distribution of ", bar(X))))
lines(xs,ys,type="l",lwd=2,col="red") # superimpose normal
Alpha <- round(mean(samplemeans),5)
Beta <- round(sd(samplemeans),5)
text(37,.08,bquote(hat(mu)[bar(X)]==.(Alpha)),pos=4,col="blue")
text(37,.07,bquote(hat(sigma)[bar(X)]==.(Beta)),pos=4,col="blue")
text(55, .08,bquote(mu[bar(X)]==.(mu)),pos=4,col="red")
text(55,.07,bquote(sigma[bar(X)]==.(sigma.xbar)),pos=4,col="red")

[[Image:SMHS_CCT_LLN_Fig_2.png|600px]]

*CLT instructional challenges: We have extensive CLT pedagogical experience based on graduate and undergraduate teaching, interacting with students (and teaching assistants) and evaluating students’ performance in various probability and statistics classes. In our endeavors, we have used a variety of classical (e.g., mathematical formulations), hands-on activities (e.g., beads, sums, Quincunx) and technological approaches (e.g., applets, demonstrations). Our prior efforts have identified the following instructional challenges in teaching the concept of the CLT using purely classical and physical hands-on activities.
**The challenges, some of which may be addressed by employing modern IT-based technologies such as interactive applets and computer activities, include the following questions: What is a native process (native distribution), a sample, a sample distribution, a parameter estimator, a sample-driven numerical parameter (point) estimate or a sampling distribution? What is the relationship between the inference of the CLT and its applications in the real world? How does one improve CLT knowledge retention, which seems to decay over time? Are there paramount characteristics we can demonstrate in the classroom, which may later serve as a foundation for reconstructing the detailed statement of the CLT and improving communication of CLT meaning and results? How does one directly involve and challenge students in thinking about the CLT (in and out of the classroom)?
**Traditional CLT teaching techniques (symbolic mathematics and physical demonstrations) are typically restricted in terms of time and space (e.g., shown once in class) and may have the limitations of involving one native process, studying one population parameter and restricting the scope of the inference (e.g., sample-size constraints).
**Modern IT-based blended instruction approaches address many of these CLT teaching challenges by utilizing the Internet and the available computational power. For example, a Java CLT applet may be evoked repeatedly under different initial conditions (choosing sample-sizes and number of experiments, native process distributions, parameters of interest, etc.). Such tests may be performed from remote locations (e.g., classroom, library, home), and may provide enhanced interactive features (e.g., fitting Normal model to sampling distribution) demonstrated in different experimental modes (e.g., intense computational vs. visual animated sampling). Such features are especially useful for active, visual and deductive learners. Furthermore, interactive demonstrations are thought to significantly enhance the learning process for some student populations.
**Students in probability and statistics classes are generally expected to master difficult concepts that ultimately lead to understanding the basis of data variation, modeling and analysis. For many students relying on procedural manipulations and recipes is natural, perhaps because of their prior experiences with (deterministic) Newtonian sciences. Various statistics-education researchers have experimented with technology to explore novel exploratory data-analysis techniques that emphasize making sense of data via data manipulation, visualization and simulation. Such investigators refer to statistical literacy as the process of acquiring and utilizing intuition for discovering and interpreting trends, proposing solutions and counterexamples to basic problems in probability, as well as understanding statistical data modeling and analysis. Because the concepts of distribution, variation, probability, randomness, modeling and estimation are so ubiquitously used and entangled, instructors frequently forget that these notions should be defined, explained and demonstrated in (most) undergraduate probability and statistics classes. Various sampling and simulation applets and demonstrations are quite useful for this purpose.

===Applications===

[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_LawOfLargeNumbers This article] demonstrates the theory and application of the LLN using SOCR tools. It illustrates the theoretical meaning and practical implications of the LLN and presents the LLN in a variety of situations. It also provides empirical evidence in support of LLN convergence and dispels common LLN misconceptions.

[http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents the CLT using a new SOCR applet and demonstration activity.

Abstract: Modern approaches for information technology based blended education utilize a variety of novel instructional, computational and network resources. Such attempts employ technology to deliver integrated, dynamically linked, interactive content and multi-faceted learning environments, which may facilitate student comprehension and information retention. In this manuscript, we describe one such innovative effort of using technological tools for improving student motivation and learning of the theory, practice and usability of the Central Limit Theorem (CLT) in probability and statistics courses. Our approach is based on harnessing the computational libraries developed by the Statistics Online Computational Resource (SOCR) to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and the power of the CLT. The CLT applet and activity have clear common goals; to provide graphical representation of the CLT, to improve student intuition, and to empirically validate and establish the limits of the CLT. The SOCR CLT activity consists of four experiments that demonstrate the assumptions, meaning and implications of the CLT and ties these to specific hands-on simulations. We include a number of examples illustrating the theory and applications of the CLT. Both the SOCR CLT applet and activity are freely available online to the community to test, validate and extend ([http://www.socr.ucla.edu/htmls/SOCR_Experiments.html Applet], [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_GeneralCentralLimitTheorem Activity]).

[http://link.springer.com/article/10.1007/BF01207515 This article] presents the CLT for quadratic forms in strongly dependent linear variables and its application to the asymptotic normality of Whittle’s estimate. A central limit theorem for quadratic forms in strongly dependent linear (or moving average) variables is proved, generalizing the results of Avram and of Fox and Taqqu for Gaussian variables. The theorem is applied to prove asymptotic normality of Whittle's estimate of the parameter of strongly dependent linear sequences.

[http://www.sciencedirect.com/science/article/pii/0022053185900596 This article] studies the LLN for a continuum of i.i.d. random variables. There are two problems with the common argument that a continuum of independent and identically distributed random variables sums to a nonrandom quantity in “large economies”. First, it may be unintelligible in that it may call for the measure of a non-measurable set. However, there is a probability measure, consistent with the finite-dimensional distributions, which assigns zero measure to the set of realizations having that difficulty. A second difficulty is that the “law of large numbers” may not hold even when there is no measurability problem.

===Software===
* [http://socr.ucla.edu/htmls/exp/LLN_Simple_Experiment.html SOCR LLN Experiment (Java Applet)]
* [http://socr.ucla.edu/htmls/exp/Coin_Toss_LLN_Experiment.html SOCR Coin Toss LLN Experiment (Java Applet)]
* [[SOCR_EduMaterials_Activities_LawOfLargeNumbers#Estimating_.CF.80_using_SOCR_simulation |SOCR LLN Activity]]
* [[SOCR_EduMaterials_Activities_GeneralCentralLimitTheorem| SOCR CLT Activity]]

===Problems===

6.1) Your friend is in Vegas playing Keno, and he has noticed that some numbers have been coming up more frequently than others. He declares that the other numbers were "due" to come up, and puts all of his money on those numbers. Is this a correct assessment?
(a) Yes, the Law of Averages says that the numbers that haven't shown up will now come up more often, because the probabilities will even out in the end.

(b) No, this is a misconception, because random phenomena do not "compensate" for what happened in the past.

(c) No, the game is probably broken, and the other numbers won't be coming up more frequently.

(d) Yes, the more often a certain number doesn't come up, its probability of coming up next turn increases.

6.2) You are flipping a coin, and it has already landed heads seven times in a row. For the next flip, the probability of getting tails will be greater than the probability of getting heads.
(a) TRUE
(b) FALSE

The following hands-on activities help students experiment with the SOCR LLN activity and understand the meaning, ramifications and limitations of the LLN.

6.3) Run the SOCR Coin Toss LLN Experiment twice with stop=100 and p=0.5. This corresponds to flipping a fair coin 100 times and observing the behavior of the proportion of heads across (discrete) time.

What will be different in the outcomes of the 2 experiments?

What properties of the 2 outcomes will be very similar?

If we did this 10 times, what is expected to vary and what may be predicted accurately?
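
For readers without access to the Java applet, the experiment in 6.3 can be approximated in R. This is a minimal sketch, not the SOCR implementation: the helper name `run_experiment`, the variable `n_flips` and the seed are hypothetical choices made for illustration. It runs the 100-flip experiment twice and overlays the two running proportions of heads.

```r
# Sketch: two independent runs of 100 fair-coin flips (stop = 100, p = 0.5).
set.seed(2014)                      # arbitrary seed, for reproducibility only
n_flips <- 100                      # plays the role of "stop" in the applet
p       <- 0.5

# Hypothetical helper: running proportion of heads after each flip
run_experiment <- function(n_flips, p) {
  flips <- rbinom(n_flips, size = 1, prob = p)   # 1 = head, 0 = tail
  cumsum(flips) / seq_len(n_flips)
}

prop1 <- run_experiment(n_flips, p)
prop2 <- run_experiment(n_flips, p)

plot(prop1, type = "l", col = "blue", ylim = c(0, 1),
     xlab = "Number of flips", ylab = "Proportion of heads")
lines(prop2, col = "red")
abline(h = p, lty = 2)              # the long-run value suggested by the LLN
```

Re-running the script without `set.seed` gives a different pair of paths each time, which is one way to explore the questions above.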

6.4) Use the SOCR Uniform e-Estimate Experiment to obtain stochastic estimates of the base of the natural logarithm, $e≈2.7183$.

Try to explain in words, and support your argument with data/results from this simulation, why is the expected value of the variable U (defined above) equal to e, $E(U) = e$.

How does the LLN come into play in this experiment?

How would you proceed in practice if you had to estimate $e^2≈7.389056$?

Similarly, try to estimate $π≈3.141592$ and $π^2≈9.8696044$ using the [[SOCR_EduMaterials_Activities_BuffonNeedleExperiment|SOCR Buffon’s Needle Experiment]].
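
The e-estimate experiment in 6.4 can also be sketched in R. The definition of U used here is an assumption, since U is not defined in this section: U is taken to be the number of independent Uniform(0,1) draws needed for their running sum to exceed 1, a variable whose expected value is known to equal e. The helper name `draw_U` and the number of repetitions are hypothetical.

```r
# Sketch: stochastic estimate of e.
# Assumption: U = smallest n such that U_1 + ... + U_n > 1, U_i ~ Uniform(0,1).
set.seed(1)

# Hypothetical helper: draw one realization of U
draw_U <- function() {
  total <- 0
  count <- 0
  while (total <= 1) {
    total <- total + runif(1)
    count <- count + 1
  }
  count
}

u <- replicate(50000, draw_U())
mean(u)      # by the LLN, the sample mean of U approaches E(U) = e
mean(u)^2    # one possible plug-in estimate of e^2
```

Increasing the number of repetitions shrinks the spread of `mean(u)` around e, which is how the LLN enters this experiment.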

6.5) Run the [[SOCR_EduMaterials_Activities_RouletteExperiment|SOCR Roulette Experiment]] and bet on 1-18 (out of the 38 possible numbers/outcomes). What is the probability of success (p)?

What does the LLN imply about $p$ and repeated runs of this experiment?

Run this experiment 3 times. What is the sample estimate of p ($\hat{p}$)? What is the difference $p-\hat{p}$?

Would this difference change if we ran the experiment 10 or 100 times? How?

In 100 Roulette experiments, what can you say about the difference of the number of successes (outcome in 1-18) and the number of failures? How about the proportion of successes?

6.6) Work through the experiments given in this article to (1) empirically validate that the sample averages of random observations (for most processes) follow a normal distribution; (2) demonstrate that the sample average is special and that other sample statistics, like the median or variance, generally don’t have normal distributions; (3) illustrate that the expectation of the sample average equals the population mean; and (4) show that the variation of the sampling distribution of the mean rapidly decreases as the sample size increases.

'''Answer the following questions:'''

What effects will asymmetry, gaps and continuity of the native distribution have on the applicability of the CLT, or on the asymptotic distribution of various sample statistics?

When can we reasonably expect statistics, other than the sample mean, to have CLT properties?

If a native process has $σ_X = 10$ and we take a sample of size 10, what will be ($σ_{\bar{X}}$)? Does it depend on the shape of the original process? How large should the sample-size be so that $σ_{\bar{X}}=\frac{2}{3} σ_X$?

===References===
http://www.amstat.org/publications/jse/v16n2/dinov.html

http://mirlyn.lib.umich.edu/Record/004199238

http://mirlyn.lib.umich.edu/Record/004232056

http://en.wikipedia.org/wiki/Central_limit_theorem

http://en.wikipedia.org/wiki/Law_of_large_numbers

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_CLT_LLN}}

SMHS ANOVA

2014-09-29T20:34:02Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Variance (ANOVA) ==

===Overview===
Analysis of Variance (ANOVA) is a method that is commonly applied to analyze differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation. It is a widely used statistical technique that provides a statistical test of whether or not the means of several groups are equal; ANOVA can be thought of as a generalized t-test for more than 2 groups. If there are only 2 groups, ANOVA results coincide with the corresponding results of a 2-sample independent t-test. Here, we introduce ANOVAs, both one-way and two-way, and provide examples.

===Motivation===
In the previous two-sample inference, we applied a t-test to compare two independent group means. What if we want to compare more than 2 independent samples? In this case, we will need to decompose the entire variation into components that allow us to analyze the variance of the entire dataset. Suppose 5 varieties of a particular crop are tested for further study. A field was divided into 20 plots, with each variety planted in 4 plots. The measurements are shown in the table below:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A ||B|| C|| D|| E
|-
|26.2|| 29.2|| 29.1 ||21.3|| 20.1
|-
|24.3|| 28.1 ||30.8 ||22.4|| 19.3
|-
|21.8 ||27.3|| 33.9|| 24.3 ||19.9
|-
|28.1|| 31.2|| 32.8|| 21.8|| 22.1
|-
|}
</center>

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A|| 26.2,24.3,21.8,28.1
|-
|B|| 29.2,28.1,27.3,31.2
|-
|C|| 29.1,30.8,33.9,32.8
|-
|D|| 21.3,22.4,24.3,21.8
|-
|E|| 20.1,19.3,19.9,22.1
|-
|}
</center>

Using ANOVA, the data are regarded as random samples from 5 populations. Suppose the population means are denoted as $\mu_{1},\mu_{2},\mu_{3},\mu_{4},$ and $\mu_{5}$, and the population standard deviations are denoted as $\sigma_{1},\sigma_{2},\sigma_{3},\sigma_{4},$ and $\sigma_{5}$. One method would be to apply $\binom{5}{2}=10$ separate t-tests and compare all independent pairs of groups. However, in this case, ANOVA would be much easier and more powerful.
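
As an illustration, the crop data from the table above can be analyzed with a single one-way ANOVA in R. This is a minimal sketch; the data frame name `crop` and the column names `measurement` and `variety` are hypothetical labels chosen here, not names used in the original material.

```r
# Sketch: one-way ANOVA on the 5 crop varieties (4 plots each).
crop <- data.frame(
  measurement = c(26.2, 24.3, 21.8, 28.1,    # variety A
                  29.2, 28.1, 27.3, 31.2,    # variety B
                  29.1, 30.8, 33.9, 32.8,    # variety C
                  21.3, 22.4, 24.3, 21.8,    # variety D
                  20.1, 19.3, 19.9, 22.1),   # variety E
  variety = factor(rep(c("A", "B", "C", "D", "E"), each = 4))
)

fit <- aov(measurement ~ variety, data = crop)
summary(fit)      # a single F-test of H0: all 5 population means are equal

choose(5, 2)      # number of pairwise t-tests the alternative approach needs
```

A single F-test keeps the overall type I error rate at the chosen level, whereas running all pairwise t-tests inflates it unless a multiple-comparison correction is applied.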

===Theory===

====One-way ANOVA====
One-way ANOVA expands our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.

*Notation: $y_{ij}$ is the $j^{th}$ measurement from group $i$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations, $n=n_{1}+n_{2}+⋯+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}} y_{ij}} {n_{i}}$, and the grand mean is $\bar y =\bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}}{n}$.

*Difference between means (i.e., compare each group mean to the grand mean).
**The total variance is calculated as the total sum of squares (SST) divided by the total degrees of freedom (df(total)). $SST=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{..})^{2}$ and $df(total)=n-1$.
**The between-group variation compares each group mean to the grand mean: $SST(between)=\sum_{i=1}^{k} n_{i}(\bar y_{i.}- \bar y_{..})^2$, with degrees of freedom $df(between)=k-1$. The sum of squares due to error (the pooled variation within groups) is $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^{2}$, with degrees of freedom $df(within)=n-k$. With the ANOVA decomposition, we have $\sum_{i=1}^{k} {\sum_{j=1}^{n_{i}} {(y_{ij}- \bar y_{..})^2 }} = \sum_{i=1}^{k} {n_{i} (\bar y_{i.}-\bar y_{..})^2} + \sum_{i=1}^{k} {\sum_{j=1}^{n_i} {(y_{ij}-\bar y_{i.})^2}},$ that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within).$

*Calculations:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/SOCR_Distributions.html P-value]
|-
| Treatment Effect (Between Group) || k-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}</math> || <math>MST(Between)={SST(Between)\over df(Between)}</math> || <math>F_o = {MST(Between)\over MSE(Within)}</math> || <math>P(F_{(df(Between), df(Within))} > F_o)</math>
|-
| Error (Within Group) || n-k || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}</math> || <math>MSE(Within)={SSE(Within)\over df(Within)}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || n-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | ANOVA Activity]]
|}
</center>

* ANOVA hypotheses (general form): $H_{0}:\mu_{1}=\mu_{2}=⋯=\mu_{k}$; $H_{a}:\mu_{i}≠\mu_{j}$ for some $i≠j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared to the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.

*Examples: given the following data from a hands-on study.
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
| || colspan=3|Groups
|-
|Index|| A|| B|| C
|-
|1 || 0|| 1|| 4
|-
|2|| 1|| 0|| 5
|-
|3|| ||2||
|-
|$n_{i}$|| 2|| 3|| 2
|-
| Sum ||1|| 3|| 9
|-
|$\bar y_{i.}$|| 0.5|| 1|| 4.5
|-
|}
</center>

Using this data, we have the following ANOVA table:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/dist/Fisher_Distribution.html P-value]
|-
| Treatment Effect (Between Group) || 3-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}=19.86</math> || <math>{SST(Between)\over df(Between)}={19.86\over 2}</math> || <math>F_o = {MST(Between)\over MSE(Within)}=13.24</math> || <math>P(F_{(df(Between), df(Within))} > F_o)=0.017</math>
|-
| Error (Within Group) || 7-3 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}=3</math> || <math>{SSE(Within)\over df(Within)}={3\over 4}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || 7-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}=22.86</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | Anova Activity]]
|}
</center>

Based on the ANOVA table above, we can reject the null hypothesis at $\alpha=0.05.$
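
The entries of this ANOVA table can be recomputed directly from the sums-of-squares formulas. The R sketch below uses the seven observations from the example; the variable names are hypothetical, and the final `aov` call is included only as a cross-check of the hand computation.

```r
# Sketch: one-way ANOVA decomposition for the example data (groups A, B, C).
y     <- c(0, 1,   1, 0, 2,   4, 5)
group <- factor(c("A", "A", "B", "B", "B", "C", "C"))

n <- length(y)          # total number of observations (7)
k <- nlevels(group)     # number of groups (3)

grand       <- mean(y)                 # grand mean
group_means <- tapply(y, group, mean)  # 0.5, 1, 4.5
group_sizes <- tapply(y, group, length)

ss_between <- sum(group_sizes * (group_means - grand)^2)
ss_within  <- sum((y - ave(y, group))^2)   # ave() repeats each group mean
ss_total   <- sum((y - grand)^2)

f0 <- (ss_between / (k - 1)) / (ss_within / (n - k))
p  <- pf(f0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

round(c(SSB = ss_between, SSE = ss_within, SST = ss_total, F = f0, p = p), 3)

summary(aov(y ~ group))   # cross-check with R's built-in ANOVA
```

The rounded values agree with the table: 19.86 + 3 = 22.86 for the sums of squares, an F-statistic of about 13.24, and a p-value of about 0.017.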

*ANOVA conditions: the analysis is valid if (1) design conditions: each group of observations represents a random sample from its population, and the observations within each group are independent of each other; (2) population conditions: the k population distributions are approximately normal (if the sample sizes are large, the normality condition is less crucial), and the standard deviations of all populations are equal. The equal-spread condition can be slightly relaxed to $0.5≤\frac{\sigma_{i}}{\sigma_{j}}≤2$ for all $i$ and $j$; that is, no population standard deviation is more than twice as large as any other.

====Two-way ANOVA====
Two-way ANOVA decomposes the variance of a dataset into independent (orthogonal) components when we have two grouping factors.
Notation for the two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk},$ for all $1≤i≤a,1≤j≤b$ and $1≤k≤r.$ Here $y_{ijk}$ is the $k^{th}$ measurement (replicate) at level $i$ of factor A and level $j$ of factor B; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group; and $N=a\times b\times r$ is the total number of observations. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at level $i$ of factor A and level $j$ of factor B is $\bar{y}_{ij.}=\frac{\sum_{k=1}^{r} {y_{ijk}}} {r},$ the grand mean is $\bar {y} =\bar{y}_{...} = \frac{\sum_{k=1}^{r} {\sum_{i=1}^{a} {\sum_{j=1}^{b} {y_{ijk}}}}} {N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE$.

*Hypotheses:
**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.
**Factors: factor A and factor B are independent variables in two-way ANOVA.
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be 3*5=15 different treatment groups.
**Main effect: the effect of each independent variable (factor) considered one at a time. The interaction is ignored for this part.
**Interaction effect: the effect that one factor has on the other factor. The degree of freedom is the product of the two degrees of freedom of each factor.

*Calculations:
It is assumed that main effect A has ''a'' levels (and df(A) = a-1), main effect B has ''b'' levels (and df(B) = b-1), ''r'' is the sample size of each treatment, and <math>N = a\times b\times r</math> is the total sample size. Notice the overall degree of freedom is once again one less than the total sample size.

<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.umich.edu/html/dist/ P-value]
|-
| Main Effect A || df(A)=a-1 || <math>SS(A)=r\times b\times\sum_{i=1}^{a}{(\bar{y}_{i,.,.}-\bar{y})^2}</math> || <math>{SS(A)\over df(A)}</math> || <math>F_o = {MS(A)\over MSE}</math> || <math>P(F_{(df(A), df(E))} > F_o)</math>
|-
| Main Effect B || df(B)=b-1 || <math>SS(B)=r\times a\times\sum_{j=1}^{b}{(\bar{y}_{., j,.}-\bar{y})^2}</math> || <math>{SS(B)\over df(B)}</math> || <math>F_o = {MS(B)\over MSE}</math> || <math>P(F_{(df(B), df(E))} > F_o)</math>
|-
| A vs. B Interaction || df(AB)=(a-1)(b-1) || <math>SS(AB)=r\times \sum_{i=1}^{a}{\sum_{j=1}^{b}{((\bar{y}_{i, j,.}-\bar{y}_{i, .,.})-(\bar{y}_{., j,.}-\bar{y}))^2}}</math> || <math>{SS(AB)\over df(AB)}</math> || <math>F_o = {MS(AB)\over MSE}</math> || <math>P(F_{(df(AB), df(E))} > F_o)</math>
|-
| Error || <math>N-a\times b</math> || <math>SSE=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{i, j,.})^2}}}</math> || <math>{SSE\over df(Error)}</math> || ||
|-
| Total || N-1 || <math>SST=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{., .,.})^2}}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2 | ANOVA Activity]]
|}
</center>
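
The two-way decomposition $SST=SS(A)+SS(B)+SS(AB)+SSE$ can be verified numerically for a balanced design. The R sketch below uses simulated data; the design sizes (a = 3, b = 2, r = 4), the simulated effects and all variable names are hypothetical choices for illustration only.

```r
# Sketch: two-way ANOVA sums of squares for a balanced a x b design
# with r replicates per treatment group (simulated, hypothetical data).
set.seed(42)
a <- 3; b <- 2; r <- 4
d <- expand.grid(rep = 1:r, A = factor(1:a), B = factor(1:b))
d$y <- rnorm(nrow(d), mean = 10 + as.integer(d$A) + 2 * as.integer(d$B))

grand <- mean(d$y)             # grand mean
mA    <- ave(d$y, d$A)         # level mean of A, repeated per observation
mB    <- ave(d$y, d$B)         # level mean of B, repeated per observation
mAB   <- ave(d$y, d$A, d$B)    # treatment-group (cell) mean per observation

# Summing over all N observations supplies the factors r*b, r*a and r
ssA  <- sum((mA - grand)^2)
ssB  <- sum((mB - grand)^2)
ssAB <- sum((mAB - mA - mB + grand)^2)
ssE  <- sum((d$y - mAB)^2)
ssT  <- sum((d$y - grand)^2)

all.equal(ssT, ssA + ssB + ssAB + ssE)   # the decomposition holds

summary(aov(y ~ A * B, data = d))        # same sums of squares from aov
```

For unbalanced designs the component sums of squares no longer add up in this simple way, which is one reason the equal-sample-size condition is listed below.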

* Two-way ANOVA is valid if:
:: (1) the population from which the samples were obtained are normally or approximately normally distributed;
:: (2) the samples are independent;
:: (3) the variances of the populations are equal;
:: (4) the groups have the same sample size.

* Example: [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_ANOVA_2Way clinical example of knee pain study]

===Applications===
* [[SOCR_EduMaterials_Activities_BoxAndWhiskerChart|This activity]] presents the Box and Whisker Chart, which is often used in exploratory data analyses. It demonstrates the range, standard deviation, mean and quartiles of the values and is especially useful in comparing statistical data. This article illustrated the implementation of the chart in SOCR with comprehensive introduction. It also included the application of this method in different areas.

* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2|The SOCR Two-Way ANOVA Java Applet]] includes examples of two-way analysis of variance using SOCR tools. It illustrates the application of two-way ANOVA with examples in SOCR. It also extends the two-way ANOVA to software such as R and SAS.

* [[SOCR_Activity_ANOVA_SnailsSexualDimorphism| The SOCR Snails Sexual Dimorphism Activity]] shows an application of ANOVA. This activity recreates part of the design of a classification method for the Cochlostoma septemspirale snail. By observing multiple traits of the shells, the original researchers were able to decide on a series of dimorphisms (difference in forms) between male and female snails. This article presents a comprehensive illustration of the example.

===Software ===
* [http://socr.umich.edu/html/ana/ SOCR Analyses Java Applets]
* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | One-Way ANOVA Activity]]
* R:
# fit a model
# one-way ANOVA with completely randomized design
fit <- aov(y ~ A, data = mydata)
# randomized block design (B as the blocking factor)
fit <- aov(y ~ A + B, data = mydata)
# two-way factorial design
fit <- aov(y ~ A + B + A*B, data = mydata)
# to check out the model fitted with type I ANOVA table
summary(fit)
# type III SS and F test
drop1(fit, ~., test="F")

===Problems===
* Tom was shopping for a ping pong table that could be taken apart quickly and easily. For some reason, the salesman happened to have a table of the assembly times (sec) for the three tables. Using ANOVA, do you think there is a difference in the average time of assembly for the three brands of ping pong tables?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Assembly_time_(sec)||Brand
|-
| 93.0||1
|-
| 67.0||1
|-
| 77.0||1
|-
| 92.0||1
|-
| 97.0||1
|-
| 62.0||1
|-
| 136.0||2
|-
| 120.0||2
|-
| 115.0||2
|-
| 104.0||2
|-
| 115.0||2
|-
| 121.0||2
|-
| 102.0||2
|-
| 130.0||2
|-
| 198.0||3
|-
| 217.0||3
|-
| 209.0||3
|-
| 221.0||3
|-
| 190.0||3
|}
</center>

: (a) We can say that there is no reason to reject the null that the average assembly times are the same
: (b) We should reject the null that the average assembly times are the same

* Based on the data in the previous problem, what is the value for R square:
: (a) 0.342
: (b) 0.143
: (c) 0.832
: (d) 0.943

* Tom is curious to see if two-door vehicles drive faster on average than four-door vehicles. He parks behind a bush so as not to be seen, and records the car type and the speed reading. Here are the results (1 means two-door, and 2 means four-door):
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Speed_(MPH)||Vehicle_Type
|-
| 45||2
|-
| 45||2
|-
| 40||2
|-
| 69||1
|-
| 72||1
|-
| 40||1
|-
| 75||2
|-
| 19||2
|-
| 62||1
|-
| 43||2
|-
| 75||1
|-
| 42||2
|-
| 58||1
|-
| 58||1
|-
| 47||2
|-
| 48||2
|-
| 49||2
|-
| 45||2
|-
| 54||2
|}
</center>

: At the 1% significance level, should we reject the null hypothesis that the average speed is the same for both types of vehicles?
: (a) Yes, we should reject the null hypothesis.
: (b) No, we should not reject the null hypothesis.
: (c) There is not enough information.

* Based on data above, what is the value for R square?
: (a) 0.432
: (b) 0.983
: (c) 0.308
: (d) 0.231

* In a two-way ANOVA test, which of the following is not a typical null hypothesis?
: (a) The population means of the first factor are equal.
: (b) The population means of the first and second factor are equal.
: (c) The population means of the second factor are equal.
: (d) There is no interaction between the two factors.

* Suppose that two factors, A and B, are thought to affect the top speed of a car. We will use two-way ANOVA. Are the population means of factor A equal?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Top_Speed||A||B
|-
| 93.0||1||1
|-
| 136.0||1||2
|-
| 198.0||1||3
|-
| 88.0||2||1
|-
| 148.0||2||2
|-
| 279.0||2||3
|}
</center>
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data above and two-way ANOVA, are the population means of factor B equal?
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data in the table above and two-way ANOVA, is there an interaction effect between the two factors?
: (a) Yes, there is an interaction effect.
: (b) No, there is no interaction effect.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]
* [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANOVA}}

SMHS ANOVA

2014-09-29T20:22:08Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Variance (ANOVA) ==

===Overview===
Analysis of Variance (ANOVA) is a method that is commonly applied when analyzing the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation. It is a widely used statistical technique that provides a statistical test of whether or not the means of several groups are equal; ANOVA can be thought of as a generalized t-test for more than 2 groups. If there are only 2 groups, ANOVA results coincide with the corresponding results of a 2-sample independent t-test. Here, we introduce ANOVAs, both one-way and two-way, and provide examples.

===Motivation===
In the previous section on two-sample inference, we applied a t-test to compare two independent group means. What if we want to compare more than 2 independent samples? In this case, we will need to decompose the entire variation into components that allow us to analyze the variance of the entire dataset. Suppose 5 varieties of a particular crop are tested for further study. A field was divided into 20 plots, with each variety planted in 4 plots. The measurements are shown in the table below:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A ||B|| C|| D|| E
|-
|26.2|| 29.2|| 29.1 ||21.3|| 20.1
|-
|24.3|| 28.1 ||30.8 ||22.4|| 19.3
|-
|21.8 ||27.3|| 33.9|| 24.3 ||19.9
|-
|28.1|| 31.2|| 32.8|| 21.8|| 22.1
|-
|}
</center>

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A|| 26.2,24.3,21.8,28.1
|-
|B|| 29.2,28.1,27.3,31.2
|-
|C|| 29.1,30.8,33.9,32.8
|-
|D|| 21.3,22.4,24.3,21.8
|-
|E|| 20.1,19.3,19.9,22.1
|-
|}
</center>

Using ANOVA, the data are regarded as random samples from 5 populations. Suppose the population means are denoted as $\mu_{1},\mu_{2},\mu_{3},\mu_{4},$ and $\mu_{5}$, and the population standard deviations are denoted as $\sigma_{1},\sigma_{2},\sigma_{3},\sigma_{4},$ and $\sigma_{5}$. One method would be to apply $\binom{5}{2}=10$ separate t-tests and compare all independent pairs of groups. However, each additional test adds another chance of a false rejection, so a single ANOVA F-test of all five means is both easier and more powerful.
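The comparison between many pairwise tests and one overall test can be sketched in R. This is only an illustration: the variable names (<code>yield</code>, <code>variety</code>, <code>crops</code>) are made up for this example, and the family-wise error figure relies on the simplifying assumption that the 10 tests are independent, which pairwise tests on shared groups are not.

```r
# Crop measurements from the table above, four plots per variety
yield <- c(26.2, 24.3, 21.8, 28.1,   # A
           29.2, 28.1, 27.3, 31.2,   # B
           29.1, 30.8, 33.9, 32.8,   # C
           21.3, 22.4, 24.3, 21.8,   # D
           20.1, 19.3, 19.9, 22.1)   # E
variety <- factor(rep(c("A", "B", "C", "D", "E"), each = 4))
crops <- data.frame(yield, variety)

# Number of pairwise comparisons among 5 groups
n_pairs <- choose(5, 2)

# Chance of at least one false rejection over n_pairs tests at level 0.05,
# ASSUMING (for illustration only) that the tests are independent
familywise <- 1 - (1 - 0.05)^n_pairs

# A single overall F-test of equal variety means
fit <- aov(yield ~ variety, data = crops)
print(summary(fit))
```

Under the independence assumption, <code>familywise</code> is about 0.40, far above the nominal 0.05 of a single test.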

===Theory===

====One-way ANOVA====
One-way ANOVA expands our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.

*Notation: $y_{ij}$ is the $j^{th}$ measurement from group $i$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations, $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}} y_{ij}}{n_{i}}$, and the grand mean is $\bar y =\bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}}{n}$.

*Difference between means (i.e., compare each group mean to the grand mean).
**The total variance is calculated as the total sum of squares (SST) divided by the total degrees of freedom (df(total)). $SST=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{..})^{2}$ and $df(total)=n-1$.
**The difference between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k} n_{i}(\bar y_{i.}-\bar y_{..})^{2}$, with degrees of freedom $df(between)=k-1$. The sum of squares due to error (the pooled variation within groups): $SSE=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^{2}$, with degrees of freedom $df(within)=n-k$. The ANOVA decomposition gives $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar y_{..})^2 = \sum_{i=1}^{k} n_{i}(\bar y_{i.}-\bar y_{..})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^2,$ that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within)$.

*Calculations:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/SOCR_Distributions.html P-value]
|-
| Treatment Effect (Between Group) || k-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}</math> || <math>MST(Between)={SST(Between)\over df(Between)}</math> || <math>F_o = {MST(Between)\over MSE(Within)}</math> || <math>P(F_{(df(Between), df(Within))} > F_o)</math>
|-
| Error (Within Group) || n-k || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}</math> || <math>MSE(Within)={SSE(Within)\over df(Within)}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || n-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | ANOVA Activity]]
|}
</center>

* ANOVA hypotheses (general form): $H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}$; $H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared to the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.

*Examples: given the following data from a hands-on study.
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
| || colspan=3|Groups
|-
|Index|| A|| B|| C
|-
|1 || 0|| 1|| 4
|-
|2|| 1|| 0|| 5
|-
|3|| ||2||
|-
|$n_{i}$|| 2|| 3|| 2
|-
|Sum $\sum_{j} y_{ij}$ ||1|| 3|| 9
|-
|$\bar y_{i.}$|| 0.5|| 1|| 4.5
|-
|}
</center>

Using this data, we have the following ANOVA table:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/dist/Fisher_Distribution.html P-value]
|-
| Treatment Effect (Between Group) || 3-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}=19.86</math> || <math>{SST(Between)\over df(Between)}={19.86\over 2}</math> || <math>F_o = {MST(Between)\over MSE(Within)}=13.24</math> || <math>P(F_{(df(Between), df(Within))} > F_o)=0.017</math>
|-
| Error (Within Group) || 7-3 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}=3</math> || <math>{SSE(Within)\over df(Within)}={3\over 4}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || 7-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}=22.86</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | Anova Activity]]
|}
</center>

Based on the ANOVA table above, we can reject the null hypothesis at $\alpha=0.05.$
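The entries of this ANOVA table can be recomputed directly from the sums-of-squares formulas. The R sketch below is one way to do so; the variable names are illustrative and not part of the SOCR materials.

```r
# Data from the example table: groups A, B, C with 2, 3 and 2 observations
y <- c(0, 1, 1, 0, 2, 4, 5)
g <- factor(c("A", "A", "B", "B", "B", "C", "C"))

n <- length(y)                         # total number of observations
k <- nlevels(g)                        # number of groups
grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)      # one mean per group
group_sizes <- tapply(y, g, length)    # n_i

# Sums of squares: between groups, within groups, and total
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - ave(y, g))^2)   # ave() repeats each group mean per row
ss_total   <- sum((y - grand_mean)^2)

# Mean squares, F statistic and upper-tail p-value
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)
f0 <- ms_between / ms_within
p_value <- pf(f0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(round(c(SSB = ss_between, SSE = ss_within, SST = ss_total,
              F = f0, p = p_value), 3))
```

The printed values agree with the table after rounding (19.86, 3, 22.86, 13.24 and 0.017), and <code>ss_between + ss_within</code> equals <code>ss_total</code>.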

*ANOVA conditions: the analysis is valid if (1) design conditions hold: each group of observations is a random sample from its respective population, and the observations within each group are independent of each other; (2) population conditions hold: the k population distributions must be approximately normal (if the sample size is large, the normality condition is less crucial), and the standard deviations of all populations are equal. The equal-spread condition can be slightly relaxed to $0.5\leq\frac{\sigma_{i}}{\sigma_{j}}\leq 2$ for all $i$ and $j$; that is, no population standard deviation is more than twice as large as any other.
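A rough check of the relaxed equal-spread condition can be sketched in R using the small example above. Note the assumption: the population standard deviations are unknown, so the sample standard deviations are used as stand-ins, and with so few observations per group the check is only indicative.

```r
# Example data: groups A, B, C
y <- c(0, 1, 1, 0, 2, 4, 5)
g <- factor(c("A", "A", "B", "B", "B", "C", "C"))

# Sample standard deviation of each group (stand-ins for sigma_i)
group_sd <- tapply(y, g, sd)

# Largest ratio between any two group standard deviations
sd_ratio <- max(group_sd) / min(group_sd)

# Rule of thumb: no standard deviation more than twice any other
rule_ok <- sd_ratio <= 2
print(c(ratio = sd_ratio, ok = rule_ok))
```

Comparing the largest and smallest standard deviations is enough, because every other pairwise ratio lies between that extreme ratio and its reciprocal.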

====Two-way ANOVA====
Two-way ANOVA decomposes the variance of a dataset into independent (orthogonal) components when we have two grouping factors.
Notation: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$ for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the $k^{th}$ measurement at level $i$ of factor A and level $j$ of factor B; $a$ and $b$ are the numbers of levels of factors A and B; $r$ is the number of replications in each treatment group; and $N=a\times b\times r$ is the total number of observations. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at level $i$ of factor A and level $j$ of factor B is $\bar{y}_{ij.}=\frac{\sum_{k=1}^{r} y_{ijk}}{r}$, the grand mean is $\bar{y}=\bar{y}_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}}{N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE$.

*Hypotheses:
**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.
**Factors: factor A and factor B are independent variables in two-way ANOVA.
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be 3*5=15 different treatment groups.
**Main effect: involves the independent variables one at a time. The interaction is ignored for this part.
**Interaction effect: the effect that one factor has on the other factor, that is, how the effect of one factor changes across the levels of the other. Its degrees of freedom are the product of the degrees of freedom of the two factors, $(a-1)(b-1)$.

*Calculations:
It is assumed that main effect A has ''a'' levels (and df(A) = a-1), main effect B has ''b'' levels (and df(B) = b-1), ''r'' is the sample size of each treatment, and <math>N = a\times b\times r</math> is the total sample size. Notice that the overall degrees of freedom are once again one less than the total sample size.

<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.umich.edu/html/dist/ P-value]
|-
| Main Effect A || df(A)=a-1 || <math>SS(A)=r\times b\times\sum_{i=1}^{a}{(\bar{y}_{i,.,.}-\bar{y})^2}</math> || <math>{SS(A)\over df(A)}</math> || <math>F_o = {MS(A)\over MSE}</math> || <math>P(F_{(df(A), df(E))} > F_o)</math>
|-
| Main Effect B || df(B)=b-1 || <math>SS(B)=r\times a\times\sum_{j=1}^{b}{(\bar{y}_{., j,.}-\bar{y})^2}</math> || <math>{SS(B)\over df(B)}</math> || <math>F_o = {MS(B)\over MSE}</math> || <math>P(F_{(df(B), df(E))} > F_o)</math>
|-
| A vs. B Interaction || df(AB)=(a-1)(b-1) || <math>SS(AB)=r\times \sum_{i=1}^{a}{\sum_{j=1}^{b}{((\bar{y}_{i, j,.}-\bar{y}_{i, .,.})-(\bar{y}_{., j,.}-\bar{y}))^2}}</math> || <math>{SS(AB)\over df(AB)}</math> || <math>F_o = {MS(AB)\over MSE}</math> || <math>P(F_{(df(AB), df(E))} > F_o)</math>
|-
| Error || <math>N-a\times b</math> || <math>SSE=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{i, j,.})^2}}}</math> || <math>{SSE\over df(Error)}</math> || ||
|-
| Total || N-1 || <math>SST=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{., .,.})^2}}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2 | ANOVA Activity]]
|}
</center>
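The two-way sums of squares can be checked numerically. The R sketch below uses a small balanced dataset that is entirely hypothetical (it does not come from the SOCR materials), with a = 2, b = 3 and r = 2, and compares the hand-computed sums of squares with those reported by <code>aov</code>.

```r
# HYPOTHETICAL balanced data: 2 levels of A, 3 levels of B, 2 replicates per cell
d <- expand.grid(rep = 1:2, B = factor(1:3), A = factor(1:2))
d$y <- c(10, 12, 14, 15, 20, 22,    # A = 1; B = 1, 1, 2, 2, 3, 3
         11, 13, 18, 19, 29, 31)    # A = 2; B = 1, 1, 2, 2, 3, 3

a <- nlevels(d$A); b <- nlevels(d$B); r <- 2
grand   <- mean(d$y)
mean_A  <- ave(d$y, d$A)          # level mean of A, repeated for every row
mean_B  <- ave(d$y, d$B)          # level mean of B, repeated for every row
mean_AB <- ave(d$y, d$A, d$B)     # cell mean, repeated for every row

# Summing over all N rows supplies the r*b, r*a and r multipliers automatically
ss_A  <- sum((mean_A - grand)^2)
ss_B  <- sum((mean_B - grand)^2)
ss_AB <- sum((mean_AB - mean_A - mean_B + grand)^2)
ss_E  <- sum((d$y - mean_AB)^2)
ss_T  <- sum((d$y - grand)^2)

# F test for main effect A, using df(Error) = N - a*b
df_E <- a * b * r - a * b
f_A  <- (ss_A / (a - 1)) / (ss_E / df_E)
p_A  <- pf(f_A, a - 1, df_E, lower.tail = FALSE)

# Reference table from R's built-in ANOVA (rows: A, B, A:B, Residuals)
tab <- summary(aov(y ~ A * B, data = d))[[1]]
print(tab)
```

For balanced data such as this, the four hand-computed sums of squares should match the <code>Sum Sq</code> column of <code>tab</code> and add up to <code>ss_T</code>.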

* Two-way ANOVA is valid if:
:: (1) the populations from which the samples were obtained are normally or approximately normally distributed;
:: (2) the samples are independent;
:: (3) the variances of the populations are equal;
:: (4) the groups have the same sample size.

* Example: [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_ANOVA_2Way clinical example of knee pain study]

===Applications===
* [[SOCR_EduMaterials_Activities_BoxAndWhiskerChart|This activity]] presents the Box and Whisker Chart, which is often used in exploratory data analyses. The chart displays the range, standard deviation, mean and quartiles of the values and is especially useful for comparing groups of data. The activity introduces the SOCR implementation of the chart and shows applications of the method in different areas.

* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2|The SOCR Two-Way ANOVA Java Applet]] activity includes examples of two-way analysis of variance using SOCR tools. It also shows how to run a two-way ANOVA in software such as R and SAS.

* [[SOCR_Activity_ANOVA_SnailsSexualDimorphism| The SOCR Snails Sexual Dimorphism Activity]] shows an application of ANOVA. This activity recreates part of the design of a classification method for the Cochlostoma septemspirale snail. By observing multiple traits of the shells, the original researchers were able to identify a series of dimorphisms (differences in form) between male and female snails. The activity presents a comprehensive illustration of the example.

===Software ===
* [http://socr.umich.edu/html/ana/ SOCR Analyses Java Applets]
* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | One-Way ANOVA Activity]]
* R:
<pre>
# fit a model (A and B are assumed to be factors in mydata)
# one-way ANOVA with completely randomized design
fit <- aov(y ~ A, data = mydata)
# randomized block design (B as the blocking factor)
fit <- aov(y ~ A + B, data = mydata)
# two-way factorial design (A * B expands to A + B + A:B)
fit <- aov(y ~ A * B, data = mydata)
# display the fitted model as a type I (sequential) ANOVA table
summary(fit)
# type III SS and F test
drop1(fit, ~., test = "F")
</pre>

===Problems===
* Tom was shopping for a ping pong table that could be taken apart quickly and easily. For some reason, the salesman happened to have a table of the assembly times (sec) for the three tables. Using ANOVA, do you think there is a difference in the average time of assembly for the three brands of ping pong tables?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Assembly_time_(sec)||Brand
|-
| 93.0||1
|-
| 67.0||1
|-
| 77.0||1
|-
| 92.0||1
|-
| 97.0||1
|-
| 62.0||1
|-
| 136.0||2
|-
| 120.0||2
|-
| 115.0||2
|-
| 104.0||2
|-
| 115.0||2
|-
| 121.0||2
|-
| 102.0||2
|-
| 130.0||2
|-
| 198.0||3
|-
| 217.0||3
|-
| 209.0||3
|-
| 221.0||3
|-
| 190.0||3
|}
</center>

: (a) We can say that there is no reason to reject the null that the average assembly times are the same
: (b) We should reject the null that the average assembly times are the same

* Based on the data in the previous problem, what is the value of R-squared?
: (a) 0.342
: (b) 0.143
: (c) 0.832
: (d) 0.943

* Tom is curious to see if two-door vehicles drive faster on average than four-door vehicles. He parks behind a bush so as not to be seen, and records the car type and the speed reading. Here are the results (1 means two-door, and 2 means four-door):
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Speed_(MPH)||Vehicle_Type
|-
| 45||2
|-
| 45||2
|-
| 40||2
|-
| 69||1
|-
| 72||1
|-
| 40||1
|-
| 75||2
|-
| 19||2
|-
| 62||1
|-
| 43||2
|-
| 75||1
|-
| 42||2
|-
| 58||1
|-
| 58||1
|-
| 47||2
|-
| 48||2
|-
| 49||2
|-
| 45||2
|-
| 54||2
|}
</center>

: At the 1% significance level, should we reject the null hypothesis that the average speed is the same for both types of vehicles?
: (a) Yes, we should reject the null hypothesis.
: (b) No, we should not reject the null hypothesis.
: (c) There is not enough information.

* Based on the data above, what is the value of R-squared?
: (a) 0.432
: (b) 0.983
: (c) 0.308
: (d) 0.231

* In a two-way ANOVA test, which of the following is not one of the typical null hypotheses?
: (a) The population means of the first factor are equal.
: (b) The population means of the first and second factor are equal.
: (c) The population means of the second factor are equal.
: (d) There is no interaction between the two factors.

* Suppose that two factors, A and B, are thought to affect the top speed of a car. We will use two-way ANOVA. Are the population means of factor A equal?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Top_Speed||A||B
|-
| 93.0||1||1
|-
| 136.0||1||2
|-
| 198.0||1||3
|-
| 88.0||2||1
|-
| 148.0||2||2
|-
| 279.0||2||3
|}
</center>
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data above and two-way ANOVA, are the population means of factor B equal?
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data in the table above and two-way ANOVA, is there an interaction effect between the two factors?
: (a) Yes, there is an interaction effect.
: (b) No, there is no interaction effect.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]
* [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANOVA}}

SMHS ANOVA

2014-09-29T20:19:50Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Variance (ANOVA) ==

===Overview===
Analysis of Variance (ANOVA) is a method that is commonly applied when analyzing the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation. It is a widely used statistical technique that provides a statistical test of whether or not the means of several groups are equal; ANOVA can be thought of as a generalized t-test for more than 2 groups. ANOVA results in the case of 2 groups coincide with the corresponding results of a 2-sample independent t-test. Here, we introduce ANOVAs, both one-way and two-way, and provide examples.

===Motivation===
In the previous section on two-sample inference, we applied a t-test to compare two independent group means. What if we want to compare more than 2 independent samples? In this case, we will need to decompose the entire variation into components that allow us to analyze the variance of the entire dataset. Suppose 5 varieties of a particular crop are tested for further study. A field was divided into 20 plots, with each variety planted in 4 plots. The measurements are shown in the table below:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A ||B|| C|| D|| E
|-
|26.2|| 29.2|| 29.1 ||21.3|| 20.1
|-
|24.3|| 28.1 ||30.8 ||22.4|| 19.3
|-
|21.8 ||27.3|| 33.9|| 24.3 ||19.9
|-
|28.1|| 31.2|| 32.8|| 21.8|| 22.1
|-
|}
</center>

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A|| 26.2,24.3,21.8,28.1
|-
|B|| 29.2,28.1,27.3,31.2
|-
|C|| 29.1,30.8,33.9,32.8
|-
|D|| 21.3,22.4,24.3,21.8
|-
|E|| 20.1,19.3,19.9,22.1
|-
|}
</center>

Using ANOVA, the data are regarded as random samples from 5 populations. Suppose the population means are denoted as $\mu_{1},\mu_{2},\mu_{3},\mu_{4},$ and $\mu_{5}$, and the population standard deviations are denoted as $\sigma_{1},\sigma_{2},\sigma_{3},\sigma_{4},$ and $\sigma_{5}$. One method would be to apply $\binom{5}{2}=10$ separate t-tests and compare all independent pairs of groups. However, each additional test adds another chance of a false rejection, so a single ANOVA F-test of all five means is both easier and more powerful.

===Theory===

====One-way ANOVA====
One-way ANOVA expands our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.

*Notation: $y_{ij}$ is the $j^{th}$ measurement from group $i$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations, $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}} y_{ij}}{n_{i}}$, and the grand mean is $\bar y =\bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}}{n}$.

*Difference between means (i.e., compare each group mean to the grand mean).
**The total variance is calculated as the total sum of squares (SST) divided by the total degrees of freedom (df(total)). $SST=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{..})^{2}$ and $df(total)=n-1$.
**The difference between each group mean and the grand mean: $SST(between)=\sum_{i=1}^{k} n_{i}(\bar y_{i.}-\bar y_{..})^{2}$, with degrees of freedom $df(between)=k-1$. The sum of squares due to error (the pooled variation within groups): $SSE=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^{2}$, with degrees of freedom $df(within)=n-k$. The ANOVA decomposition gives $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar y_{..})^2 = \sum_{i=1}^{k} n_{i}(\bar y_{i.}-\bar y_{..})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^2,$ that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within)$.

*Calculations:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/SOCR_Distributions.html P-value]
|-
| Treatment Effect (Between Group) || k-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}</math> || <math>MST(Between)={SST(Between)\over df(Between)}</math> || <math>F_o = {MST(Between)\over MSE(Within)}</math> || <math>P(F_{(df(Between), df(Within))} > F_o)</math>
|-
| Error (Within Group) || n-k || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}</math> || <math>MSE(Within)={SSE(Within)\over df(Within)}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || n-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | ANOVA Activity]]
|}
</center>

* ANOVA hypotheses (general form): $H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}$; $H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared to the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.

*Examples: given the following data from a hands-on study.
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
| || colspan=3|Groups
|-
|Index|| A|| B|| C
|-
|1 || 0|| 1|| 4
|-
|2|| 1|| 0|| 5
|-
|3|| ||2||
|-
|$n_{i}$|| 2|| 3|| 2
|-
|Sum $\sum_{j} y_{ij}$ ||1|| 3|| 9
|-
|$\bar y_{i.}$|| 0.5|| 1|| 4.5
|-
|}
</center>

Using this data, we have the following ANOVA table:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/dist/Fisher_Distribution.html P-value]
|-
| Treatment Effect (Between Group) || 3-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}=19.86</math> || <math>{SST(Between)\over df(Between)}={19.86\over 2}</math> || <math>F_o = {MST(Between)\over MSE(Within)}=13.24</math> || <math>P(F_{(df(Between), df(Within))} > F_o)=0.017</math>
|-
| Error (Within Group) || 7-3 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}=3</math> || <math>{SSE(Within)\over df(Within)}={3\over 4}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || 7-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}=22.86</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | Anova Activity]]
|}
</center>

Based on the ANOVA table above, we can reject the null hypothesis at $\alpha=0.05.$

*ANOVA conditions: the analysis is valid if (1) design conditions hold: each group of observations is a random sample from its respective population, and the observations within each group are independent of each other; (2) population conditions hold: the k population distributions must be approximately normal (if the sample size is large, the normality condition is less crucial), and the standard deviations of all populations are equal. The equal-spread condition can be slightly relaxed to $0.5\leq\frac{\sigma_{i}}{\sigma_{j}}\leq 2$ for all $i$ and $j$; that is, no population standard deviation is more than twice as large as any other.

====Two-way ANOVA====
Two-way ANOVA decomposes the variance of a dataset into independent (orthogonal) components when we have two grouping factors.
Notation: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$ for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the $k^{th}$ measurement at level $i$ of factor A and level $j$ of factor B; $a$ and $b$ are the numbers of levels of factors A and B; $r$ is the number of replications in each treatment group; and $N=a\times b\times r$ is the total number of observations. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at level $i$ of factor A and level $j$ of factor B is $\bar{y}_{ij.}=\frac{\sum_{k=1}^{r} y_{ijk}}{r}$, the grand mean is $\bar{y}=\bar{y}_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}}{N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE$.

*Hypotheses:
**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.
**Factors: factor A and factor B are independent variables in two-way ANOVA.
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be 3*5=15 different treatment groups.
**Main effect: involves the independent variables one at a time. The interaction is ignored for this part.
**Interaction effect: the effect that one factor has on the other factor, that is, how the effect of one factor changes across the levels of the other. Its degrees of freedom are the product of the degrees of freedom of the two factors, $(a-1)(b-1)$.

*Calculations:
It is assumed that main effect A has ''a'' levels (and df(A) = a-1), main effect B has ''b'' levels (and df(B) = b-1), ''r'' is the sample size of each treatment, and <math>N = a\times b\times r</math> is the total sample size. Notice that the overall degrees of freedom are once again one less than the total sample size.

<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.umich.edu/html/dist/ P-value]
|-
| Main Effect A || df(A)=a-1 || <math>SS(A)=r\times b\times\sum_{i=1}^{a}{(\bar{y}_{i,.,.}-\bar{y})^2}</math> || <math>{SS(A)\over df(A)}</math> || <math>F_o = {MS(A)\over MSE}</math> || <math>P(F_{(df(A), df(E))} > F_o)</math>
|-
| Main Effect B || df(B)=b-1 || <math>SS(B)=r\times a\times\sum_{j=1}^{b}{(\bar{y}_{., j,.}-\bar{y})^2}</math> || <math>{SS(B)\over df(B)}</math> || <math>F_o = {MS(B)\over MSE}</math> || <math>P(F_{(df(B), df(E))} > F_o)</math>
|-
| A vs. B Interaction || df(AB)=(a-1)(b-1) || <math>SS(AB)=r\times \sum_{i=1}^{a}{\sum_{j=1}^{b}{((\bar{y}_{i, j,.}-\bar{y}_{i, .,.})-(\bar{y}_{., j,.}-\bar{y}))^2}}</math> || <math>{SS(AB)\over df(AB)}</math> || <math>F_o = {MS(AB)\over MSE}</math> || <math>P(F_{(df(AB), df(E))} > F_o)</math>
|-
| Error || <math>N-a\times b</math> || <math>SSE=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{i, j,.})^2}}}</math> || <math>{SSE\over df(Error)}</math> || ||
|-
| Total || N-1 || <math>SST=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{., .,.})^2}}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2 | ANOVA Activity]]
|}
</center>

* Two-way ANOVA is valid if:
:: (1) the populations from which the samples were obtained are normally or approximately normally distributed;
:: (2) the samples are independent;
:: (3) the variances of the populations are equal;
:: (4) the groups have the same sample size.

* Example: [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_ANOVA_2Way clinical example of knee pain study]

===Applications===
* [[SOCR_EduMaterials_Activities_BoxAndWhiskerChart|This activity]] presents the Box and Whisker Chart, which is often used in exploratory data analyses. The chart displays the range, standard deviation, mean and quartiles of the values and is especially useful for comparing groups of data. The activity introduces the SOCR implementation of the chart and shows applications of the method in different areas.

* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2|The SOCR Two-Way ANOVA Java Applet]] activity includes examples of two-way analysis of variance using SOCR tools. It also shows how to run a two-way ANOVA in software such as R and SAS.

* [[SOCR_Activity_ANOVA_SnailsSexualDimorphism| The SOCR Snails Sexual Dimorphism Activity]] shows an application of ANOVA. This activity recreates part of the design of a classification method for the Cochlostoma septemspirale snail. By observing multiple traits of the shells, the original researchers were able to identify a series of dimorphisms (differences in form) between male and female snails. The activity presents a comprehensive illustration of the example.

===Software ===
* [http://socr.umich.edu/html/ana/ SOCR Analyses Java Applets]
* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | One-Way ANOVA Activity]]
* R:
<pre>
# fit a model (A and B are assumed to be factors in mydata)
# one-way ANOVA with completely randomized design
fit <- aov(y ~ A, data = mydata)
# randomized block design (B as the blocking factor)
fit <- aov(y ~ A + B, data = mydata)
# two-way factorial design (A * B expands to A + B + A:B)
fit <- aov(y ~ A * B, data = mydata)
# display the fitted model as a type I (sequential) ANOVA table
summary(fit)
# type III SS and F test
drop1(fit, ~., test = "F")
</pre>

===Problems===
* Tom was shopping for a ping pong table that could be taken apart quickly and easily. For some reason, the salesman happened to have a table of the assembly times (sec) for the three tables. Using ANOVA, do you think there is a difference in the average time of assembly for the three brands of ping pong tables?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Assembly_time_(sec)||Brand
|-
| 93.0||1
|-
| 67.0||1
|-
| 77.0||1
|-
| 92.0||1
|-
| 97.0||1
|-
| 62.0||1
|-
| 136.0||2
|-
| 120.0||2
|-
| 115.0||2
|-
| 104.0||2
|-
| 115.0||2
|-
| 121.0||2
|-
| 102.0||2
|-
| 130.0||2
|-
| 198.0||3
|-
| 217.0||3
|-
| 209.0||3
|-
| 221.0||3
|-
| 190.0||3
|}
</center>

: (a) We can say that there is no reason to reject the null that the average assembly times are the same
: (b) We should reject the null that the average assembly times are the same

* Based on the data in the previous problem, what is the value of R-squared?
: (a) 0.342
: (b) 0.143
: (c) 0.832
: (d) 0.943

* Tom is curious to see if two-door vehicles drive faster on average than four-door vehicles. He parks behind a bush so as not to be seen, and records the car type and the speed reading. Here are the results (1 means two-door, and 2 means four-door):
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Speed_(MPH)||Vehicle_Type
|-
| 45||2
|-
| 45||2
|-
| 40||2
|-
| 69||1
|-
| 72||1
|-
| 40||1
|-
| 75||2
|-
| 19||2
|-
| 62||1
|-
| 43||2
|-
| 75||1
|-
| 42||2
|-
| 58||1
|-
| 58||1
|-
| 47||2
|-
| 48||2
|-
| 49||2
|-
| 45||2
|-
| 54||2
|}
</center>

: At the 1% significance level, should we reject the null hypothesis that the average speed is the same for both types of vehicles?
: (a) Yes, we should reject the null hypothesis.
: (b) No, we should not reject the null hypothesis.
: (c) There is not enough information.

* Based on data above, what is the value for R square?
: (a) 0.432
: (b) 0.983
: (c) 0.308
: (d) 0.231

* In a two-way ANOVA test, which of the following is not a typical null hypothesis?
: (a) The population means of the first factor are equal.
: (b) The population means of the first and second factor are equal.
: (c) The population means of the second factor are equal.
: (d) There is no interaction between the two factors.

* Suppose that two factors, A and B, are thought to affect the top speed of a car. We will use a two-way ANOVA. Are the population means of factor A equal?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Top_Speed||A||B
|-
| 93.0||1||1
|-
| 136.0||1||2
|-
| 198.0||1||3
|-
| 88.0||2||1
|-
| 148.0||2||2
|-
| 279.0||2||3
|}
</center>
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data above and a two-way ANOVA, are the population means of factor B equal?
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data from the table above and a two-way ANOVA, is there an interaction effect between the two factors?
: (a) Yes, there is an interaction effect.
: (b) No, there is no interaction effect.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]
* [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANOVA}}

SMHS ANOVA

2014-09-29T19:47:25Z

Jslavine: /* One-way ANOVA */

==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Variance (ANOVA) ==

===Overview===
Analysis of Variance (ANOVA) is a common method applied when analyzing the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation. It is a widely used statistical technique that provides a statistical test of whether or not the means of several groups are equal; ANOVA can be thought of as a generalized t-test for more than 2 groups. ANOVA results in the case of 2 groups coincide with the corresponding results of a 2-sample independent t-test. Here, we introduce ANOVAs, both one-way and two-way, and provide examples.

===Motivation===
In the previous two-sample inference, we applied a t-test to compare two independent group means. What if we want to compare more than 2 independent samples? In this case, we will need to decompose the entire variation into components that allow us to analyze the variance of the entire dataset. Suppose 5 varieties of a particular crop are tested for further study. A field was divided into 20 plots, with each variety planted in 4 plots. The measurements are shown in the table below:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A ||B|| C|| D|| E
|-
|26.2|| 29.2|| 29.1 ||21.3|| 20.1
|-
|24.3|| 28.1 ||30.8 ||22.4|| 19.3
|-
|21.8 ||27.3|| 33.9|| 24.3 ||19.9
|-
|28.1|| 31.2|| 32.8|| 21.8|| 22.1
|-
|}
</center>

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A|| 26.2,24.3,21.8,28.1
|-
|B|| 29.2,28.1,27.3,31.2
|-
|C|| 29.1,30.8,33.9,32.8
|-
|D|| 21.3,22.4,24.3,21.8
|-
|E|| 20.1,19.3,19.9,22.1
|-
|}
</center>

Using ANOVA, the data are regarded as random samples from 5 populations. Suppose the population means are denoted as $\mu_{1},\mu_{2},\mu_{3},\mu_{4},$ and $\mu_{5}$, and the population standard deviations are denoted as $\sigma_{1},\sigma_{2},\sigma_{3},\sigma_{4},$ and $\sigma_{5}$. One method would be to apply $\binom{5}{2}=10$ separate t-tests and compare all independent pairs of groups. However, in this case, ANOVA would be much easier and more powerful.
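As a rough illustration of the single-test alternative, the crop measurements from the table above can be passed to R's <code>aov</code> function. This is only a sketch: the variable names <code>yield</code> and <code>variety</code> are chosen here for convenience and do not appear in the original data.

```r
# Crop measurements from the table above, four plots per variety
yield <- c(26.2, 24.3, 21.8, 28.1,   # variety A
           29.2, 28.1, 27.3, 31.2,   # variety B
           29.1, 30.8, 33.9, 32.8,   # variety C
           21.3, 22.4, 24.3, 21.8,   # variety D
           20.1, 19.3, 19.9, 22.1)   # variety E
variety <- factor(rep(c("A", "B", "C", "D", "E"), each = 4))

n_pairs <- choose(5, 2)              # number of pairwise t-tests that would be needed
fit <- aov(yield ~ variety)          # one overall F-test instead
anova_table <- summary(fit)[[1]]
print(anova_table)
```

The resulting table has one row for the between-variety effect and one for the residual (within-variety) variation, matching the decomposition described in the Theory section.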

===Theory===

====One-way ANOVA====
One-way ANOVA expands our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.

*Notation: $y_{ij}$ is the $j^{th}$ measurement from group $i$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations, $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}} y_{ij}} {n_{i}}$, and the grand mean is $\bar y =\bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}}{n}$.

*Difference between means (i.e., compare each group mean to the grand mean).
**The total variance is calculated as the total sum of squares (SST) divided by the total degrees of freedom (df(total)). $SST=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{..})^{2}$ and $df(total)=n-1$.
**The variation between groups (each group mean compared with the grand mean): $SST(between)=\sum_{i=1}^{k} n_{i}(\bar y_{i.}- \bar y_{..})^2$, with degrees of freedom $df(between)=k-1$. The sum of squares due to error (the pooled variation within groups): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^{2}$, with degrees of freedom $df(within)=n-k$. The ANOVA decomposition gives $\sum_{i=1}^{k} {\sum_{j=1}^{n_{i}} {(y_{ij}- \bar y_{..})^2 }} = \sum_{i=1}^{k} {n_{i} (\bar y_{i.}-\bar y_{..})^2} + \sum_{i=1}^{k} {\sum_{j=1}^{n_i} {(y_{ij}-\bar y_{i.})^2}},$ that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within)$.
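The decomposition can be checked by adding and subtracting the group mean inside the total sum of squares. The sketch below is the standard textbook argument written in the notation of this section, with $\bar y_{i.}$ denoting the mean of group $i$; it is not taken from the original page.

```latex
\begin{aligned}
\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{..})^2
 &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}\bigl[(y_{ij}-\bar y_{i.})+(\bar y_{i.}-\bar y_{..})\bigr]^2 \\
 &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^2
  + \sum_{i=1}^{k} n_i(\bar y_{i.}-\bar y_{..})^2
  + 2\sum_{i=1}^{k}(\bar y_{i.}-\bar y_{..})\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})
\end{aligned}
```

The last term vanishes because the deviations within each group sum to zero, $\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})=0$, which leaves the within-group and between-group sums of squares.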

*Calculations:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/SOCR_Distributions.html P-value]
|-
| Treatment Effect (Between Group) || k-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}</math> || <math>MST(Between)={SST(Between)\over df(Between)}</math> || <math>F_o = {MST(Between)\over MSE(Within)}</math> || <math>P(F_{(df(Between), df(Within))} > F_o)</math>
|-
| Error (Within Group) || n-k || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}</math> || <math>MSE(Within)={SSE(Within)\over df(Within)}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || n-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | ANOVA Activity]]
|}
</center>

* ANOVA hypotheses (general form): $H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}$; $H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared with the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.

*Examples: given the following data from a hands-on study.
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
| || colspan=3|Groups
|-
|Index|| A|| B|| C
|-
|1 || 0|| 1|| 4
|-
|2|| 1|| 0|| 5
|-
|3|| ||2||
|-
|$n_{i}$|| 2|| 3|| 2
|-
| $\sum_{j} y_{ij}$ ||1|| 3|| 9
|-
|$\bar y_{i.}$|| 0.5|| 1|| 4.5
|-
|}
</center>

Using this data, we have the following ANOVA table:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/dist/Fisher_Distribution.html P-value]
|-
| Treatment Effect (Between Group) || 3-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}=19.86</math> || <math>{SST(Between)\over df(Between)}={19.86\over 2}</math> || <math>F_o = {MST(Between)\over MSE(Within)}=13.24</math> || <math>P(F_{(df(Between), df(Within))} > F_o)=0.017</math>
|-
| Error (Within Group) || 7-3 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}=3</math> || <math>{SSE(Within)\over df(Within)}={3\over 4}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || 7-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}=22.86</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | Anova Activity]]
|}
</center>

Based on the ANOVA table above, we can reject the null hypothesis at $\alpha=0.05.$
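The entries of this table can be recomputed directly from the definitions. The following sketch uses the seven observations of the hands-on example; the variable names are illustrative, and the rounded results should agree with the table (19.86, 3, 22.86, 13.24 and 0.017).

```r
# Data from the hands-on example: groups A (2 obs), B (3 obs), C (2 obs)
y <- c(0, 1,  1, 0, 2,  4, 5)
g <- factor(c("A", "A", "B", "B", "B", "C", "C"))

n_i    <- tapply(y, g, length)       # group sizes
ybar_i <- tapply(y, g, mean)         # group means
ybar   <- mean(y)                    # grand mean

ss_between <- sum(n_i * (ybar_i - ybar)^2)   # treatment (between-group) SS
ss_within  <- sum((y - ave(y, g))^2)         # error (within-group) SS
ss_total   <- sum((y - ybar)^2)              # total SS

k <- nlevels(g)
n <- length(y)
f0 <- (ss_between / (k - 1)) / (ss_within / (n - k))
p  <- pf(f0, k - 1, n - k, lower.tail = FALSE)

round(c(SSB = ss_between, SSE = ss_within, SST = ss_total, F = f0, p = p), 3)
```

Here <code>ave(y, g)</code> replaces each observation by its group mean, so the within-group sum of squares is obtained without an explicit loop.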

*ANOVA conditions: the analysis is valid if (1) design conditions: each group of observations is a random sample from its population, and the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions are approximately normal (this condition is less crucial when the sample sizes are large), and the standard deviations of all populations are equal. The equal-spread condition can be relaxed slightly to $0.5\leq\frac{\sigma_{i}}{\sigma_{j}}\leq 2$ for all $i$ and $j$; that is, no population standard deviation is more than twice any other.

====Two-way ANOVA====
Two-way ANOVA decomposes the variance of a dataset into independent (orthogonal) components when we have two grouping factors.
Notation: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk},$ for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the $k^{th}$ measurement at level $i$ of factor A and level $j$ of factor B; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group; and $N=a\times b\times r$ is the total number of observations. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at level $i$ of factor A and level $j$ of factor B is $\bar{y}_{ij.}=\frac{\sum_{k=1}^{r} {y_{ijk}}} {r},$ the grand mean is $\bar {y} =\bar{y}_{...} = \frac{\sum_{k=1}^{r} {\sum_{i=1}^{a} {\sum_{j=1}^{b} {y_{ijk}}}}} {N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE$.

*Hypotheses:
**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.
**Factors: factor A and factor B are independent variables in two-way ANOVA.
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be 3*5=15 different treatment groups.
**Main effect: the effect of one independent variable (factor) considered on its own. The interaction is ignored for this part.
**Interaction effect: the effect that one factor has on the other factor. The degree of freedom is the product of the two degrees of freedom of each factor.

*Calculations:
It is assumed that main effect A has ''a'' levels (and df(A) = a-1), main effect B has ''b'' levels (and df(B) = b-1), ''r'' is the sample size of each treatment, and <math>N = a\times b\times r</math> is the total sample size. Notice that the overall degrees of freedom are once again one less than the total sample size.

<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.umich.edu/html/dist/ P-value]
|-
| Main Effect A || df(A)=a-1 || <math>SS(A)=r\times b\times\sum_{i=1}^{a}{(\bar{y}_{i,.,.}-\bar{y})^2}</math> || <math>{SS(A)\over df(A)}</math> || <math>F_o = {MS(A)\over MSE}</math> || <math>P(F_{(df(A), df(E))} > F_o)</math>
|-
| Main Effect B || df(B)=b-1 || <math>SS(B)=r\times a\times\sum_{j=1}^{b}{(\bar{y}_{., j,.}-\bar{y})^2}</math> || <math>{SS(B)\over df(B)}</math> || <math>F_o = {MS(B)\over MSE}</math> || <math>P(F_{(df(B), df(E))} > F_o)</math>
|-
| A vs.B Interaction || df(AB)=(a-1)(b-1) || <math>SS(AB)=r\times \sum_{i=1}^{a}{\sum_{j=1}^{b}{((\bar{y}_{i, j,.}-\bar{y}_{i, .,.})-(\bar{y}_{., j,.}-\bar{y}))^2}}</math> || <math>{SS(AB)\over df(AB)}</math> || <math>F_o = {MS(AB)\over MSE}</math> || <math>P(F_{(df(AB), df(E))} > F_o)</math>
|-
| Error || <math>N-a\times b</math> || <math>SSE=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{i, j,.})^2}}}</math> || <math>{SSE\over df(Error)}</math> || ||
|-
| Total || N-1 || <math>SST=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{., .,.})^2}}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2 | ANOVA Activity]]
|}
</center>

* Two-way ANOVA is valid if:
:: (1) the populations from which the samples were obtained are normally or approximately normally distributed;
:: (2) the samples are independent;
:: (3) the variances of the populations are equal;
:: (4) the groups have the same sample size.

* Example: [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_ANOVA_2Way clinical example of knee pain study]
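The two-way decomposition $SST(total)=SS(A)+SS(B)+SS(AB)+SSE$ can also be checked numerically. The sketch below assumes a small balanced layout with made-up responses (2 levels of A, 3 levels of B, 2 replicates per cell); none of these numbers come from the original page.

```r
# Hypothetical balanced layout: a = 2, b = 3, r = 2 (rep varies fastest, then A, then B)
d <- expand.grid(rep = 1:2, A = factor(1:2), B = factor(1:3))
d$y <- c(10, 12, 14, 15,  11, 13, 18, 20,  15, 16, 25, 27)   # made-up responses

gm   <- mean(d$y)               # grand mean
m_A  <- ave(d$y, d$A)           # mean of the A level each observation belongs to
m_B  <- ave(d$y, d$B)           # mean of the B level each observation belongs to
m_AB <- ave(d$y, d$A, d$B)      # mean of the treatment cell each observation belongs to

# Summing over all N observations supplies the r*b, r*a and r multipliers automatically
ss_A  <- sum((m_A - gm)^2)
ss_B  <- sum((m_B - gm)^2)
ss_AB <- sum((m_AB - m_A - m_B + gm)^2)
ss_E  <- sum((d$y - m_AB)^2)
ss_T  <- sum((d$y - gm)^2)

c(A = ss_A, B = ss_B, AB = ss_AB, Error = ss_E, Total = ss_T)
all.equal(ss_A + ss_B + ss_AB + ss_E, ss_T)
```

The identity relies on the design being balanced (the same number of replicates in every cell), which is condition (4) above.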

===Applications===
* [[SOCR_EduMaterials_Activities_BoxAndWhiskerChart|This activity]] presents the Box and Whisker Chart, which is often used in exploratory data analysis. The chart displays the range, standard deviation, mean and quartiles of the values and is especially useful for comparing data sets. The activity illustrates the implementation of the chart in SOCR with a comprehensive introduction, and it includes applications of the method in different areas.

* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2|The SOCR Two-Way ANOVA Java Applet]] activity includes examples of two-way analysis of variance using SOCR tools. It illustrates the application of two-way ANOVA with examples in SOCR and also covers two-way ANOVA in software such as R and SAS.

* [[SOCR_Activity_ANOVA_SnailsSexualDimorphism| The SOCR Snails Sexual Dimorphism Activity]] shows an application of ANOVA. This activity recreates part of the design of a classification method for the Cocholotoma septemspirale snail. By observing multiple traits of the shells, the original researchers were able to identify a series of dimorphisms (differences in form) between male and female snails. The activity page presents a comprehensive illustration of the example.

===Software ===
* [http://socr.umich.edu/html/ana/ SOCR Analyses Java Applets]
* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | One-Way ANOVA Activity]]
* R:
<pre>
# fit a model
# one-way ANOVA with a completely randomized design
fit <- aov(y ~ A, data = mydata)
# randomized block design (B as the blocking factor)
fit <- aov(y ~ A + B, data = mydata)
# two-way factorial design (A * B expands to A + B + A:B)
fit <- aov(y ~ A * B, data = mydata)
# display the fitted model as a type I (sequential) ANOVA table
summary(fit)
# type III SS and F test
drop1(fit, ~ ., test = "F")
</pre>

===Problems===
* Tom was shopping for a ping pong table that could be taken apart quickly and easily. For some reason, the salesman happened to have a table of the assembly times (sec) for the three tables. Using ANOVA, do you think there is a difference in the average time of assembly for the three brands of ping pong tables?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Assembly_time_(sec)||Brand
|-
| 93.0||1
|-
| 67.0||1
|-
| 77.0||1
|-
| 92.0||1
|-
| 97.0||1
|-
| 62.0||1
|-
| 136.0||2
|-
| 120.0||2
|-
| 115.0||2
|-
| 104.0||2
|-
| 115.0||2
|-
| 121.0||2
|-
| 102.0||2
|-
| 130.0||2
|-
| 198.0||3
|-
| 217.0||3
|-
| 209.0||3
|-
| 221.0||3
|-
| 190.0||3
|}
</center>

: (a) We can say that there is no reason to reject the null that the average assembly times are the same
: (b) We should reject the null that the average assembly times are the same

* Based on the data in the previous problem, what is the value for R square:
: (a) 0.342
: (b) 0.143
: (c) 0.832
: (d) 0.943

* Tom is curious to see if two-door vehicles drive faster on average than four-door vehicles. He parks behind a bush so as not to be seen, and records the car type and the speed reading. Here are the results (1 means two-door, and 2 means four-door):
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Speed_(MPH)||Vehicle_Type
|-
| 45||2
|-
| 45||2
|-
| 40||2
|-
| 69||1
|-
| 72||1
|-
| 40||1
|-
| 75||2
|-
| 19||2
|-
| 62||1
|-
| 43||2
|-
| 75||1
|-
| 42||2
|-
| 58||1
|-
| 58||1
|-
| 47||2
|-
| 48||2
|-
| 49||2
|-
| 45||2
|-
| 54||2
|}
</center>

: At the 1% significance level, should we reject the null hypothesis that the average speed is the same for both types of vehicles?
: (a) Yes, we should reject the null hypothesis.
: (b) No, we should not reject the null hypothesis.
: (c) There is not enough information.

* Based on data above, what is the value for R square?
: (a) 0.432
: (b) 0.983
: (c) 0.308
: (d) 0.231

* In a two-way ANOVA test, which of the following is not a typical null hypothesis?
: (a) The population means of the first factor are equal.
: (b) The population means of the first and second factor are equal.
: (c) The population means of the second factor are equal.
: (d) There is no interaction between the two factors.

* Suppose that two factors, A and B, are thought to affect the top speed of a car. We will use a two-way ANOVA. Are the population means of factor A equal?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Top_Speed||A||B
|-
| 93.0||1||1
|-
| 136.0||1||2
|-
| 198.0||1||3
|-
| 88.0||2||1
|-
| 148.0||2||2
|-
| 279.0||2||3
|}
</center>
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data above and a two-way ANOVA, are the population means of factor B equal?
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data from the table above and a two-way ANOVA, is there an interaction effect between the two factors?
: (a) Yes, there is an interaction effect.
: (b) No, there is no interaction effect.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]
* [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANOVA}}

SMHS ANOVA

2014-09-29T18:17:10Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Variance (ANOVA) ==

===Overview===
Analysis of Variance (ANOVA) is a common method applied when analyzing the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation. It is a widely used statistical technique that provides a statistical test of whether or not the means of several groups are equal; ANOVA can be thought of as a generalized t-test for more than 2 groups. ANOVA results in the case of 2 groups coincide with the corresponding results of a 2-sample independent t-test. Here, we introduce ANOVAs, both one-way and two-way, and provide examples.

===Motivation===
In the previous two-sample inference, we applied a t-test to compare two independent group means. What if we want to compare more than 2 independent samples? In this case, we will need to decompose the entire variation into components that allow us to analyze the variance of the entire dataset. Suppose 5 varieties of a particular crop are tested for further study. A field was divided into 20 plots, with each variety planted in 4 plots. The measurements are shown in the table below:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A ||B|| C|| D|| E
|-
|26.2|| 29.2|| 29.1 ||21.3|| 20.1
|-
|24.3|| 28.1 ||30.8 ||22.4|| 19.3
|-
|21.8 ||27.3|| 33.9|| 24.3 ||19.9
|-
|28.1|| 31.2|| 32.8|| 21.8|| 22.1
|-
|}
</center>

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A|| 26.2,24.3,21.8,28.1
|-
|B|| 29.2,28.1,27.3,31.2
|-
|C|| 29.1,30.8,33.9,32.8
|-
|D|| 21.3,22.4,24.3,21.8
|-
|E|| 20.1,19.3,19.9,22.1
|-
|}
</center>

Using ANOVA, the data are regarded as random samples from 5 populations. Suppose the population means are denoted as $\mu_{1},\mu_{2},\mu_{3},\mu_{4},$ and $\mu_{5}$, and the population standard deviations are denoted as $\sigma_{1},\sigma_{2},\sigma_{3},\sigma_{4},$ and $\sigma_{5}$. One method would be to apply $\binom{5}{2}=10$ separate t-tests and compare all independent pairs of groups. However, in this case, ANOVA would be much easier and more powerful.
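One reason a single ANOVA is often preferred over many pairwise t-tests is that the chance of at least one false rejection grows with the number of tests. The sketch below illustrates this under the simplifying assumption that the 10 tests are independent, which pairwise comparisons sharing the same groups are not; the numbers are therefore only a rough guide.

```r
alpha   <- 0.05
n_tests <- choose(5, 2)                 # 10 pairwise comparisons among 5 groups

# Chance of at least one false rejection if all 10 null hypotheses were true
# and the tests were independent (an approximation)
fwer <- 1 - (1 - alpha)^n_tests
round(fwer, 3)                          # about 0.401

# A Bonferroni-style per-test level that keeps the overall rate at or below alpha
alpha / n_tests                         # 0.005
```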

===Theory===

====One-way ANOVA====
One-way ANOVA expands our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.

*Notation: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations, $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}} y_{ij}} {n_{i}}$, and the grand mean is $\bar y =\bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}}{n}$.

*Difference between the means (compare each group mean to the grand mean): the total variation is $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{..})^{2}$, with degrees of freedom $df(total)=n-1$; the variation between groups (each group mean compared with the grand mean) is $SST(between)=\sum_{i=1}^{k} n_{i}(\bar y_{i.}- \bar y_{..})^2$, with degrees of freedom $df(between)=k-1$; the sum of squares due to error (the pooled variation within groups) is $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^{2}$, with degrees of freedom $df(within)=n-k$. The ANOVA decomposition gives $\sum_{i=1}^{k} {\sum_{j=1}^{n_{i}} {(y_{ij}- \bar y_{..})^2 }} = \sum_{i=1}^{k} {n_{i} (\bar y_{i.}-\bar y_{..})^2} + \sum_{i=1}^{k} {\sum_{j=1}^{n_i} {(y_{ij}-\bar y_{i.})^2}},$ that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within)$.

*Calculations:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/SOCR_Distributions.html P-value]
|-
| Treatment Effect (Between Group) || k-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}</math> || <math>MST(Between)={SST(Between)\over df(Between)}</math> || <math>F_o = {MST(Between)\over MSE(Within)}</math> || <math>P(F_{(df(Between), df(Within))} > F_o)</math>
|-
| Error (Within Group) || n-k || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}</math> || <math>MSE(Within)={SSE(Within)\over df(Within)}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || n-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | ANOVA Activity]]
|}
</center>

* ANOVA hypotheses (general form): $H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}$; $H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, there is a lot of between-group variation relative to the within-group variation; that is, the discrepancies between the group means are large compared with the variability within the groups (error). A large $F_{0}$ therefore provides strong evidence against $H_{0}$.

*Examples: given the following data from a hands-on study.
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
| || colspan=3|Groups
|-
|Index|| A|| B|| C
|-
|1 || 0|| 1|| 4
|-
|2|| 1|| 0|| 5
|-
|3|| ||2||
|-
|$n_{i}$|| 2|| 3|| 2
|-
| $\sum_{j} y_{ij}$ ||1|| 3|| 9
|-
|$\bar y_{i.}$|| 0.5|| 1|| 4.5
|-
|}
</center>

Using this data, we have the following ANOVA table:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/dist/Fisher_Distribution.html P-value]
|-
| Treatment Effect (Between Group) || 3-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}=19.86</math> || <math>{SST(Between)\over df(Between)}={19.86\over 2}</math> || <math>F_o = {MST(Between)\over MSE(Within)}=13.24</math> || <math>P(F_{(df(Between), df(Within))} > F_o)=0.017</math>
|-
| Error (Within Group) || 7-3 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}=3</math> || <math>{SSE(Within)\over df(Within)}={3\over 4}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || 7-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}=22.86</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | Anova Activity]]
|}
</center>

Based on the ANOVA table above, we can reject the null hypothesis at $\alpha=0.05.$
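The rejection decision can be reproduced from the F distribution with R's <code>pf</code> and <code>qf</code> functions. This sketch takes the observed statistic and degrees of freedom from the table above; the variable names are illustrative.

```r
f0  <- 13.24                 # observed F statistic from the ANOVA table
df1 <- 3 - 1                 # between-group degrees of freedom
df2 <- 7 - 3                 # within-group degrees of freedom

p_value  <- pf(f0, df1, df2, lower.tail = FALSE)   # upper-tail probability
critical <- qf(0.95, df1, df2)                     # rejection threshold at alpha = 0.05

round(c(p_value = p_value, critical = critical), 3)
f0 > critical                # TRUE means H0 is rejected at the 5% level
```

Since the p-value of about 0.017 lies between 0.01 and 0.05, the same data would not lead to rejection at the stricter $\alpha=0.01$ level.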

*ANOVA conditions: the analysis is valid if (1) design conditions: each group of observations is a random sample from its population, and the observations within each group are independent of each other; (2) population conditions: the $k$ population distributions are approximately normal (this condition is less crucial when the sample sizes are large), and the standard deviations of all populations are equal. The equal-spread condition can be relaxed slightly to $0.5\leq\frac{\sigma_{i}}{\sigma_{j}}\leq 2$ for all $i$ and $j$; that is, no population standard deviation is more than twice any other.
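In practice the population standard deviations are unknown, so the ratio condition is usually checked informally with the sample standard deviations. The sketch below applies this to the crop data from the Motivation section; using sample values in place of $\sigma_{i}$ is an assumption of this illustration, not part of the stated condition.

```r
# Crop measurements from the Motivation section, four plots per variety
yield <- c(26.2, 24.3, 21.8, 28.1,   # variety A
           29.2, 28.1, 27.3, 31.2,   # variety B
           29.1, 30.8, 33.9, 32.8,   # variety C
           21.3, 22.4, 24.3, 21.8,   # variety D
           20.1, 19.3, 19.9, 22.1)   # variety E
variety <- factor(rep(c("A", "B", "C", "D", "E"), each = 4))

s     <- tapply(yield, variety, sd)  # sample standard deviation of each group
ratio <- max(s) / min(s)             # largest-to-smallest ratio

round(s, 2)
ratio <= 2                           # TRUE if the rule of thumb is satisfied
```

With only four observations per group the sample standard deviations are quite variable, so a ratio near the boundary should be read with caution rather than as a strict pass or fail.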

====Two-way ANOVA====
Two-way ANOVA decomposes the variance of a dataset into independent (orthogonal) components when we have two grouping factors.
Notation: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk},$ for all $1\leq i\leq a$, $1\leq j\leq b$ and $1\leq k\leq r$. Here $y_{ijk}$ is the $k^{th}$ measurement at level $i$ of factor A and level $j$ of factor B; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group; and $N=a\times b\times r$ is the total number of observations. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the treatment group at level $i$ of factor A and level $j$ of factor B is $\bar{y}_{ij.}=\frac{\sum_{k=1}^{r} {y_{ijk}}} {r},$ the grand mean is $\bar {y} =\bar{y}_{...} = \frac{\sum_{k=1}^{r} {\sum_{i=1}^{a} {\sum_{j=1}^{b} {y_{ijk}}}}} {N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE$.

*Hypotheses:
**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.
**Factors: factor A and factor B are independent variables in two-way ANOVA.
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be 3*5=15 different treatment groups.
**Main effect: the effect of one independent variable (factor) considered on its own. The interaction is ignored for this part.
**Interaction effect: the effect that one factor has on the other factor. The degree of freedom is the product of the two degrees of freedom of each factor.

*Calculations:
It is assumed that main effect A has ''a'' levels (and df(A) = a-1), main effect B has ''b'' levels (and df(B) = b-1), ''r'' is the sample size of each treatment, and <math>N = a\times b\times r</math> is the total sample size. Notice that the overall degrees of freedom are once again one less than the total sample size.

<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.umich.edu/html/dist/ P-value]
|-
| Main Effect A || df(A)=a-1 || <math>SS(A)=r\times b\times\sum_{i=1}^{a}{(\bar{y}_{i,.,.}-\bar{y})^2}</math> || <math>{SS(A)\over df(A)}</math> || <math>F_o = {MS(A)\over MSE}</math> || <math>P(F_{(df(A), df(E))} > F_o)</math>
|-
| Main Effect B || df(B)=b-1 || <math>SS(B)=r\times a\times\sum_{j=1}^{b}{(\bar{y}_{., j,.}-\bar{y})^2}</math> || <math>{SS(B)\over df(B)}</math> || <math>F_o = {MS(B)\over MSE}</math> || <math>P(F_{(df(B), df(E))} > F_o)</math>
|-
| A vs.B Interaction || df(AB)=(a-1)(b-1) || <math>SS(AB)=r\times \sum_{i=1}^{a}{\sum_{j=1}^{b}{((\bar{y}_{i, j,.}-\bar{y}_{i, .,.})-(\bar{y}_{., j,.}-\bar{y}))^2}}</math> || <math>{SS(AB)\over df(AB)}</math> || <math>F_o = {MS(AB)\over MSE}</math> || <math>P(F_{(df(AB), df(E))} > F_o)</math>
|-
| Error || <math>N-a\times b</math> || <math>SSE=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{i, j,.})^2}}}</math> || <math>{SSE\over df(Error)}</math> || ||
|-
| Total || N-1 || <math>SST=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{., .,.})^2}}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2 | ANOVA Activity]]
|}
</center>
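The degrees of freedom in the table add up to one less than the total sample size. A small helper makes this bookkeeping explicit; the function name <code>two_way_df</code> is hypothetical, and the call uses the 3-by-5 treatment layout mentioned earlier with an assumed 2 replicates per treatment.

```r
# Degrees of freedom for a balanced two-way ANOVA with a x b treatments, r replicates each
two_way_df <- function(a, b, r) {
  stopifnot(a >= 2, b >= 2, r >= 2)
  N  <- a * b * r
  df <- c(A = a - 1, B = b - 1, AB = (a - 1) * (b - 1),
          Error = N - a * b, Total = N - 1)
  # The four components must sum to the total degrees of freedom
  stopifnot(sum(df[c("A", "B", "AB", "Error")]) == df[["Total"]])
  df
}

res <- two_way_df(a = 3, b = 5, r = 2)
res
```

The requirement of at least 2 replicates reflects the fact that, with a single observation per treatment, no degrees of freedom remain for the error term once the interaction is included.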

* Two-way ANOVA is valid if:
:: (1) the populations from which the samples were obtained are normally or approximately normally distributed;
:: (2) the samples are independent;
:: (3) the variances of the populations are equal;
:: (4) the groups have the same sample size.

* Example: [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_ANOVA_2Way clinical example of knee pain study]

===Applications===
* [[SOCR_EduMaterials_Activities_BoxAndWhiskerChart|This activity]] presents the Box and Whisker Chart, which is often used in exploratory data analysis. The chart displays the range, standard deviation, mean and quartiles of the values and is especially useful for comparing data sets. The activity illustrates the implementation of the chart in SOCR with a comprehensive introduction, and it includes applications of the method in different areas.

* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2|The SOCR Two-Way ANOVA Java Applet]] activity includes examples of two-way analysis of variance using SOCR tools. It illustrates the application of two-way ANOVA with examples in SOCR and also covers two-way ANOVA in software such as R and SAS.

* [[SOCR_Activity_ANOVA_SnailsSexualDimorphism| The SOCR Snails Sexual Dimorphism Activity]] shows an application of ANOVA. This activity recreates part of the design of a classification method for the Cocholotoma septemspirale snail. By observing multiple traits of the shells, the original researchers were able to identify a series of dimorphisms (differences in form) between male and female snails. The activity page presents a comprehensive illustration of the example.

===Software ===
* [http://socr.umich.edu/html/ana/ SOCR Analyses Java Applets]
* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | One-Way ANOVA Activity]]
* R:
# fit a model
# one-way ANOVA with completely randomized design
fit <- aov(y ~ A, data = mydata)
# randomized block design (B as the blocking factor)
fit <- aov(y ~ A + B, data = mydata)
# two-way factorial design
fit <- aov(y ~ A + B + A:B, data = mydata)  # equivalent to y ~ A * B
# to check out the model fitted with type I ANOVA table
summary(fit)
# type III SS and F test
drop1(fit, ~., test = "F")

===Problems===
* Tom was shopping for a ping pong table that could be taken apart quickly and easily. For some reason, the salesman happened to have a table of the assembly times (sec) for the three tables. Using ANOVA, do you think there is a difference in the average time of assembly for the three brands of ping pong tables?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Assembly_time_(sec)||Brand
|-
| 93.0||1
|-
| 67.0||1
|-
| 77.0||1
|-
| 92.0||1
|-
| 97.0||1
|-
| 62.0||1
|-
| 136.0||2
|-
| 120.0||2
|-
| 115.0||2
|-
| 104.0||2
|-
| 115.0||2
|-
| 121.0||2
|-
| 102.0||2
|-
| 130.0||2
|-
| 198.0||3
|-
| 217.0||3
|-
| 209.0||3
|-
| 221.0||3
|-
| 190.0||3
|}
</center>

: (a) We can say that there is no reason to reject the null that the average assembly times are the same
: (b) We should reject the null that the average assembly times are the same

* Based on the data in the previous problem, what is the value for R square:
: (a) 0.342
: (b) 0.143
: (c) 0.832
: (d) 0.943

* Tom is curious to see if two-door vehicles drive faster on average than four-door vehicles. He parks behind a bush so as not to be seen, and records the car type and the speed reading. Here are the results (1 means two-door, and 2 means four-door):
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Speed_(MPH)||Vehicle_Type
|-
| 45||2
|-
| 45||2
|-
| 40||2
|-
| 69||1
|-
| 72||1
|-
| 40||1
|-
| 75||2
|-
| 19||2
|-
| 62||1
|-
| 43||2
|-
| 75||1
|-
| 42||2
|-
| 58||1
|-
| 58||1
|-
| 47||2
|-
| 48||2
|-
| 49||2
|-
| 45||2
|-
| 54||2
|}
</center>

: At the 1% significance level, should we reject the null hypothesis that the average speed is the same for both types of vehicles?
: (a) Yes, we should reject the null hypothesis.
: (b) No, we should not reject the null hypothesis.
: (c) There is not enough information.

* Based on data above, what is the value for R square?
: (a) 0.432
: (b) 0.983
: (c) 0.308
: (d) 0.231

* In a two-way ANOVA test, which of the following is not one of the typical null hypotheses?
: (a) The population means of the first factor are equal.
: (b) The population means of the first and second factor are equal.
: (c) The population means of the second factor are equal.
: (d) There is no interaction between the two factors.

* Suppose that two factors, A and B, are thought to affect the top speed of a car. We will use two-way ANOVA. Are the population means of factor A equal?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Top_Speed||A||B
|-
| 93.0||1||1
|-
| 136.0||1||2
|-
| 198.0||1||3
|-
| 88.0||2||1
|-
| 148.0||2||2
|-
| 279.0||2||3
|}
</center>
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data above and two-way ANOVA, are the population means of factor B equal?
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data from the table above and two-way ANOVA, is there an interaction effect between the two factors?
: (a) Yes, there is an interaction effect.
: (b) No, there is no interaction effect.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]
* [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANOVA}}

SMHS ANOVA

2014-09-29T18:12:51Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Analysis of Variance (ANOVA) ==

===Overview===
Analysis of Variance (ANOVA) is a common method applied when analyzing the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation. It is a widely used statistical technique that provides a statistical test of whether or not the means of several groups are equal; ANOVA can be thought of as a generalized t-test for more than 2 groups. ANOVA results in the case of 2 groups coincide with the corresponding results of a 2-sample independent t-test. Here, we introduce ANOVAs, both one-way and two-way, and provide examples.

===Motivation===
In the previous two-sample inference, we applied the independent t-test to compare two independent group means. What if we want to compare k (k>2) independent samples? In this case, we need to decompose the entire variation into components, which allows us to analyze the variance of the entire dataset. Suppose 5 varieties of products are tested for further study. A field was divided into 20 plots, with each variety planted in four plots. The measurements are shown in the table below:

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A ||B|| C|| D|| E
|-
|26.2|| 29.2|| 29.1 ||21.3|| 20.1
|-
|24.3|| 28.1 ||30.8 ||22.4|| 19.3
|-
|21.8 ||27.3|| 33.9|| 24.3 ||19.9
|-
|28.1|| 31.2|| 32.8|| 21.8|| 22.1
|-
|}
</center>

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|A|| 26.2,24.3,21.8,28.1
|-
|B|| 29.2,28.1,27.3,31.2
|-
|C|| 29.1,30.8,33.9,32.8
|-
|D|| 21.3,22.4,24.3,21.8
|-
|E|| 20.1,19.3,19.9,22.1
|-
|}
</center>

Using ANOVA, the data are regarded as random samples from k populations. Suppose the population means are denoted by $\mu_{1},\mu_{2},\mu_{3},\mu_{4},\mu_{5}$ and the population standard deviations by $\sigma_{1},\sigma_{2},\sigma_{3},\sigma_{4},\sigma_{5}$. An obvious method is to do $\binom{5}{2}=10$ separate t-tests and compare all pairs of groups. In this case, a single ANOVA F-test is much easier and more powerful, and it avoids the inflated type I error rate that comes with running many separate tests.
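
As a minimal sketch (not part of the original SOCR materials), the five-variety data above can be analysed in R with one F-test instead of ten pairwise t-tests. The variable names <code>yield</code>, <code>variety</code>, <code>fit</code> and <code>tab</code> are illustrative assumptions.

```r
# Measurements for the five varieties (four plots each), copied from the table above
yield <- c(26.2, 24.3, 21.8, 28.1,   # variety A
           29.2, 28.1, 27.3, 31.2,   # variety B
           29.1, 30.8, 33.9, 32.8,   # variety C
           21.3, 22.4, 24.3, 21.8,   # variety D
           20.1, 19.3, 19.9, 22.1)   # variety E
variety <- factor(rep(c("A", "B", "C", "D", "E"), each = 4))

# Number of pairwise t-tests that a single ANOVA F-test replaces
n_pairs <- choose(5, 2)

# One-way ANOVA: a single F-test of H0: all five population means are equal
fit <- aov(yield ~ variety)
tab <- summary(fit)[[1]]   # table with Df, Sum Sq, Mean Sq, F value, Pr(>F)
print(tab)
```

The first row of <code>tab</code> is the between-variety (treatment) line and the second row is the within-variety (error) line, matching the layout of the ANOVA table used in the Theory section.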

===Theory===

====One-way ANOVA====
One-way ANOVA expands our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.

*Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+\cdots+n_{k}$. The group mean for group $i$ is $\bar y_{i.}=\frac{\sum_{j=1}^{n_{i}} y_{ij}}{n_{i}}$, and the grand mean is $\bar y =\bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}}{n}$.

*Difference between the means (compare each group mean to the grand mean):
**Total variation: $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{..})^{2}$, with degrees of freedom $df(total)=n-1$.
**Variation between groups (difference between each group mean and the grand mean): $SST(between)=\sum_{i=1}^{k} n_{i}(\bar y_{i.}-\bar y_{..})^{2}$, with degrees of freedom $df(between)=k-1$.
**Sum of squares due to error (the combined variation within the groups): $SSE(within)=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^{2}$, with degrees of freedom $df(within)=n-k$.
**With the ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}(y_{ij}-\bar y_{..})^2 = \sum_{i=1}^{k} n_{i}(\bar y_{i.}-\bar y_{..})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i.})^2$, that is, $SST(total)=SST(between)+SSE(within)$ and $df(total)=df(between)+df(within)$.

*Calculations:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/SOCR_Distributions.html P-value]
|-
| Treatment Effect (Between Group) || k-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}</math> || <math>MST(Between)={SST(Between)\over df(Between)}</math> || <math>F_o = {MST(Between)\over MSE(Within)}</math> || <math>P(F_{(df(Between), df(Within))} > F_o)</math>
|-
| Error (Within Group) || n-k || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}</math> || <math>MSE(Within)={SSE(Within)\over df(Within)}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || n-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | ANOVA Activity]]
|}
</center>

* ANOVA hypotheses (general form): $H_{0}:\mu_{1}=\mu_{2}=\cdots=\mu_{k}$; $H_{a}:\mu_{i}\neq\mu_{j}$ for some $i\neq j$. The test statistic is $F_{0}=\frac{MST(between)}{MSE(within)}$. If $F_{0}$ is large, then there is a lot of between-group variation relative to the within-group variation, i.e., the discrepancies between the group means are large compared to the variability within the groups (error). Thus, a large $F_{0}$ provides strong evidence against $H_{0}$.

*Examples: given the following data from a hands-on study.
<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
| || colspan=3|Groups
|-
|Index|| A|| B|| C
|-
|1 || 0|| 1|| 4
|-
|2|| 1|| 0|| 5
|-
|3|| ||2||
|-
|$n_{i}$|| 2|| 3|| 2
|-
| Sum $\sum_{j} y_{ij}$ ||1|| 3|| 9
|-
|$\bar y_{i.}$|| 0.5|| 1|| 4.5
|-
|}
</center>

Using this data, we have the following ANOVA table:
<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.ucla.edu/htmls/dist/Fisher_Distribution.html P-value]
|-
| Treatment Effect (Between Group) || 3-1 || <math>\sum_{i=1}^{k}{n_i(\bar{y}_{i,.}-\bar{y})^2}=19.86</math> || <math>{SST(Between)\over df(Between)}={19.86\over 2}</math> || <math>F_o = {MST(Between)\over MSE(Within)}=13.24</math> || <math>P(F_{(df(Between), df(Within))} > F_o)=0.017</math>
|-
| Error (Within Group) || 7-3 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j}-\bar{y}_{i,.})^2}}=3</math> || <math>{SSE(Within)\over df(Within)}={3\over 4}</math> || || [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm F-Distribution Calculator]
|-
| Total || 7-1 || <math>\sum_{i=1}^{k}{\sum_{j=1}^{n_i}{(y_{i,j} - \bar{y})^2}}=22.86</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | Anova Activity]]
|}
</center>

Based on the ANOVA table above, we can reject the null hypothesis at $\alpha=0.05.$
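
The entries of this table can be reproduced step by step in R. This is a sketch under the assumption that the three groups are coded as a factor <code>g</code>; all variable names are illustrative and not part of the SOCR tools.

```r
# Hands-on example data: groups A, B, C with 2, 3 and 2 observations
y <- c(0, 1,  1, 0, 2,  4, 5)
g <- factor(c("A", "A", "B", "B", "B", "C", "C"))

n <- length(y)    # total number of observations
k <- nlevels(g)   # number of groups

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)     # 0.5, 1, 4.5
group_sizes <- tapply(y, g, length)   # 2, 3, 2

# Sum-of-squares decomposition: SST(total) = SST(between) + SSE(within)
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - ave(y, g))^2)  # ave() repeats each group mean per observation
ss_total   <- sum((y - grand_mean)^2)

# F statistic and its upper-tail p-value
f0 <- (ss_between / (k - 1)) / (ss_within / (n - k))
p  <- pf(f0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

round(c(SSB = ss_between, SSE = ss_within, SST = ss_total, F = f0, p = p), 3)
```

The rounded results agree with the table: 19.86, 3, 22.86, $F_o=13.24$ and a p-value of about 0.017.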

*ANOVA conditions: the test is valid if (1) design conditions: each group of observations represents a random sample from its population, and all the observations within each group are independent of each other; (2) population conditions: the k population distributions must be approximately normal (if the sample size is large, the normality condition is less crucial), and the standard deviations of all populations must be equal. The latter condition can be slightly relaxed to $0.5\le\frac{\sigma_{i}}{\sigma_{j}}\le 2$ for all $i$ and $j$, i.e., no population standard deviation is more than twice as large as any other.
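
The relaxed equal-spread condition is stated for population standard deviations, which are unknown in practice. A rough screen, sketched here as an assumption rather than a formal test, is to compare the largest and smallest sample standard deviations; the example reuses the five-variety data from the Motivation section.

```r
# Five-variety data (four plots per variety)
yield <- c(26.2, 24.3, 21.8, 28.1,   # variety A
           29.2, 28.1, 27.3, 31.2,   # variety B
           29.1, 30.8, 33.9, 32.8,   # variety C
           21.3, 22.4, 24.3, 21.8,   # variety D
           20.1, 19.3, 19.9, 22.1)   # variety E
variety <- factor(rep(c("A", "B", "C", "D", "E"), each = 4))

# Sample standard deviation of each group
group_sd <- tapply(yield, variety, sd)

# Rule of thumb: largest SD should be at most twice the smallest SD
sd_ratio     <- max(group_sd) / min(group_sd)
equal_var_ok <- sd_ratio <= 2

print(round(group_sd, 2))
print(sd_ratio)
```

For these data the sample ratio comes out a little above 2. With only four observations per group the sample standard deviations are noisy, so this check is a rough guide and should be read together with the residual plots.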

====Two-way ANOVA====
Two-way ANOVA decomposes the variance of a dataset into independent (orthogonal) components when we have two grouping factors.
Notations first: the two-way model is $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$, for all $1\le i\le a$, $1\le j\le b$ and $1\le k\le r$. Here $y_{ijk}$ is the measurement at A-factor level $i$ and B-factor level $j$ with observation index $k$; $a$ is the number of levels of factor A; $b$ is the number of levels of factor B; $r$ is the number of replications in each treatment group; and $N$ is the total number of observations, $N=a\times b\times r$. $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and the $j^{th}$ level of factor B. The mean of the group at A-factor level $i$ and B-factor level $j$ is $\bar{y}_{ij.}=\frac{\sum_{k=1}^{r} y_{ijk}}{r}$, the grand mean is $\bar{y}=\bar{y}_{...}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}}{N}$, and we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE$.

*Hypotheses:
**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.
**Factors: factor A and factor B are independent variables in two-way ANOVA.
**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be 3*5=15 different treatment groups.
**Main effect: the effect of one independent variable (factor) on the dependent variable, with the factors considered one at a time. The interaction is ignored for this part.
**Interaction effect: the extent to which the effect of one factor depends on the level of the other factor. Its degrees of freedom are the product of the degrees of freedom of the two factors.

*Calculations:
It is assumed that main effect A has ''a'' levels (and df(A) = a-1), main effect B has ''b'' levels (and df(B) = b-1), ''r'' is the sample size of each treatment, and <math>N = a\times b\times r</math> is the total sample size. Notice that the overall degrees of freedom are once again one less than the total sample size.

<center>
{| class="wikitable" style="text-align:center; width:50%" border="1"
|-
| Variance Source || Degrees of Freedom (df) || Sum of Squares (SS) || Mean Sum of Squares (MS) || F-Statistics || [http://socr.umich.edu/html/dist/ P-value]
|-
| Main Effect A || df(A)=a-1 || <math>SS(A)=r\times b\times\sum_{i=1}^{a}{(\bar{y}_{i,.,.}-\bar{y})^2}</math> || <math>{SS(A)\over df(A)}</math> || <math>F_o = {MS(A)\over MSE}</math> || <math>P(F_{(df(A), df(E))} > F_o)</math>
|-
| Main Effect B || df(B)=b-1 || <math>SS(B)=r\times a\times\sum_{j=1}^{b}{(\bar{y}_{., j,.}-\bar{y})^2}</math> || <math>{SS(B)\over df(B)}</math> || <math>F_o = {MS(B)\over MSE}</math> || <math>P(F_{(df(B), df(E))} > F_o)</math>
|-
| A vs.B Interaction || df(AB)=(a-1)(b-1) || <math>SS(AB)=r\times \sum_{i=1}^{a}{\sum_{j=1}^{b}{(\bar{y}_{i, j,.}-\bar{y}_{i, .,.}-\bar{y}_{., j,.}+\bar{y})^2}}</math> || <math>{SS(AB)\over df(AB)}</math> || <math>F_o = {MS(AB)\over MSE}</math> || <math>P(F_{(df(AB), df(E))} > F_o)</math>
|-
| Error || <math>N-a\times b</math> || <math>SSE=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{i, j,.})^2}}}</math> || <math>{SSE\over df(Error)}</math> || ||
|-
| Total || N-1 || <math>SST=\sum_{k=1}^r{\sum_{i=1}^{a}{\sum_{j=1}^{b}{(y_{i, j,k}-\bar{y}_{., .,.})^2}}}</math> || || || [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2 | ANOVA Activity]]
|}
</center>
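
The sums of squares in this table can be checked numerically. The sketch below uses a small hypothetical balanced dataset (2 levels of A, 3 levels of B, 2 replicates per cell; the values are made up for illustration) and compares the hand-computed sums of squares with those reported by <code>aov()</code>.

```r
# Hypothetical balanced design: a = 2, b = 3, r = 2 (values are illustrative only)
A <- factor(rep(1:2, each = 6))
B <- factor(rep(rep(1:3, each = 2), times = 2))
y <- c(93, 95, 136, 130, 198, 204,    # A = 1; B = 1, 1, 2, 2, 3, 3
       88, 92, 148, 150, 279, 265)    # A = 2; B = 1, 1, 2, 2, 3, 3
a <- nlevels(A); b <- nlevels(B); r <- length(y) / (a * b)

grand   <- mean(y)        # grand mean
mean_A  <- ave(y, A)      # A-level mean, repeated for every observation
mean_B  <- ave(y, B)      # B-level mean, repeated for every observation
mean_AB <- ave(y, A, B)   # cell mean, repeated for every observation

# Summing over all N observations supplies the r*b, r*a and r multipliers of the table
ss_A  <- sum((mean_A - grand)^2)
ss_B  <- sum((mean_B - grand)^2)
ss_AB <- sum((mean_AB - mean_A - mean_B + grand)^2)
ss_E  <- sum((y - mean_AB)^2)
ss_T  <- sum((y - grand)^2)

# Reference values from the built-in two-way ANOVA
tab <- summary(aov(y ~ A * B))[[1]]
print(tab)
print(c(SS_A = ss_A, SS_B = ss_B, SS_AB = ss_AB, SSE = ss_E, SST = ss_T))
```

Because the design is balanced, the four components add up to the total sum of squares and match the <code>Sum Sq</code> column of the ANOVA table row by row.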

* Two-way ANOVA is valid if:
:: (1) the populations from which the samples were obtained are normally or approximately normally distributed;
:: (2) the samples are independent;
:: (3) the variances of the populations are equal;
:: (4) the groups have the same sample size.

* Example: [http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_ANOVA_2Way clinical example of knee pain study]

===Applications===
* [[SOCR_EduMaterials_Activities_BoxAndWhiskerChart|This activity]] presents the Box and Whisker Chart, which is often used in exploratory data analyses. It demonstrates the range, standard deviation, mean and quartiles of the values and is especially useful in comparing statistical data. The activity illustrates the implementation of the chart in SOCR with a comprehensive introduction, and it includes applications of this method in different areas.

* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_2|The SOCR Two-Way ANOVA Java Applet]] includes examples of two-way analysis of variance using SOCR tools. It illustrates the application of two-way ANOVA with examples in SOCR, and it extends the two-way ANOVA to software such as R and SAS.

* [[SOCR_Activity_ANOVA_SnailsSexualDimorphism| The SOCR Snails Sexual Dimorphism Activity]] shows an application of ANOVA. This activity recreates part of the design of a classification method for the Cochlostoma septemspirale snail. By observing multiple traits of the shells, the original researchers were able to decide on a series of dimorphisms (differences in form) between male and female snails. The activity presents a comprehensive illustration of the example.

===Software ===
* [http://socr.umich.edu/html/ana/ SOCR Analyses Java Applets]
* [[SOCR_EduMaterials_AnalysisActivities_ANOVA_1 | One-Way ANOVA Activity]]
* R:
# fit a model
# one-way ANOVA with completely randomized design
fit <- aov(y ~ A, data = mydata)
# randomized block design (B as the blocking factor)
fit <- aov(y ~ A + B, data = mydata)
# two-way factorial design
fit <- aov(y ~ A + B + A:B, data = mydata)  # equivalent to y ~ A * B
# to check out the model fitted with type I ANOVA table
summary(fit)
# type III SS and F test
drop1(fit, ~., test = "F")

===Problems===
* Tom was shopping for a ping pong table that could be taken apart quickly and easily. For some reason, the salesman happened to have a table of the assembly times (sec) for the three tables. Using ANOVA, do you think there is a difference in the average time of assembly for the three brands of ping pong tables?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Assembly_time_(sec)||Brand
|-
| 93.0||1
|-
| 67.0||1
|-
| 77.0||1
|-
| 92.0||1
|-
| 97.0||1
|-
| 62.0||1
|-
| 136.0||2
|-
| 120.0||2
|-
| 115.0||2
|-
| 104.0||2
|-
| 115.0||2
|-
| 121.0||2
|-
| 102.0||2
|-
| 130.0||2
|-
| 198.0||3
|-
| 217.0||3
|-
| 209.0||3
|-
| 221.0||3
|-
| 190.0||3
|}
</center>

: (a) We can say that there is no reason to reject the null that the average assembly times are the same
: (b) We should reject the null that the average assembly times are the same

* Based on the data in the previous problem, what is the value for R square:
: (a) 0.342
: (b) 0.143
: (c) 0.832
: (d) 0.943

* Tom is curious to see if two-door vehicles drive faster on average than four-door vehicles. He parks behind a bush so as not to be seen, and records the car type and the speed reading. Here are the results (1 means two-door, and 2 means four-door):
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Speed_(MPH)||Vehicle_Type
|-
| 45||2
|-
| 45||2
|-
| 40||2
|-
| 69||1
|-
| 72||1
|-
| 40||1
|-
| 75||2
|-
| 19||2
|-
| 62||1
|-
| 43||2
|-
| 75||1
|-
| 42||2
|-
| 58||1
|-
| 58||1
|-
| 47||2
|-
| 48||2
|-
| 49||2
|-
| 45||2
|-
| 54||2
|}
</center>

: At the 1% significance level, should we reject the null hypothesis that the average speed is the same for both types of vehicles?
: (a) Yes, we should reject the null hypothesis.
: (b) No, we should not reject the null hypothesis.
: (c) There is not enough information.

* Based on data above, what is the value for R square?
: (a) 0.432
: (b) 0.983
: (c) 0.308
: (d) 0.231

* In a two-way ANOVA test, which of the following is not one of the typical null hypotheses?
: (a) The population means of the first factor are equal.
: (b) The population means of the first and second factor are equal.
: (c) The population means of the second factor are equal.
: (d) There is no interaction between the two factors.

* Suppose that two factors, A and B, are thought to affect the top speed of a car. We will use two-way ANOVA. Are the population means of factor A equal?
<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
! Top_Speed||A||B
|-
| 93.0||1||1
|-
| 136.0||1||2
|-
| 198.0||1||3
|-
| 88.0||2||1
|-
| 148.0||2||2
|-
| 279.0||2||3
|}
</center>
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data above and two-way ANOVA, are the population means of factor B equal?
: (a) Yes, they are equal.
: (b) No, they are not equal.

* Using the data from the table above and two-way ANOVA, is there an interaction effect between the two factors?
: (a) Yes, there is an interaction effect.
: (b) No, there is no interaction effect.

=== References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29 SOCR]
* [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANOVA}}

SMHS SLR

2014-09-22T15:02:27Z

Jslavine: /* Statistical inference on correlation coefficient */

==[[SMHS| Scientific Methods for Health Sciences]] - Correlation and Simple Linear Regression (SLR) ==

===Overview===
Many scientific applications involve the analysis of relationships between two or more variables involved in studying a process of interest. In this section, we will study the correlations between 2 variables and start with simple linear regressions. Consider the simplest of all situations in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the association between the two variables with an appropriate model for the given observations. The first part of this lecture will discuss correlations; we will then elaborate on the use of SLR to assess correlations.

===Motivation===
The analysis of relationships between two or more variables involved in a process of interest is widely applicable. We begin with the simplest of all situations, in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the relationship between these two variables using an appropriate model for the observations (e.g., fitting a straight line to the pairs of (X,Y) data). For example, we measured students' math scores on a final exam, and we want to find out whether there is any association between the final score and their participation rate in the math class. Another potential relationship of interest might be whether there is an association between weight and lung capacity. Simple linear regression is a useful method for addressing these questions, and it is particularly appropriate for assessing associations in simple cases.

===Theory===

*Correlation: The correlation coefficient $\rho$ $(-1\le\rho\le 1)$ is a measure of linear association, or clustering around a line, of bivariate data. The primary relationship between two variables (X,Y) can be summarized by $(\mu_{X},\sigma_{X})$, $(\mu_{Y},\sigma_{Y})$ and the correlation coefficient, denoted by $\rho=\rho(X,Y)$.
**The correlation is defined only if both of the standard deviations are finite and nonzero, and it is bounded by $-1\le\rho\le 1$.
**If $\rho=1$, there is a perfect positive correlation (straight line relationship between the two variables); if $\rho=0$, there is no correlation (random cloud scatter), i.e., no linear relation between X and Y; if $\rho=-1$, there is a perfect negative correlation between the variables.
**$\rho(X,Y)=\frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}=\frac{E((X-\mu_{X})(Y-\mu_{Y}))}{\sigma_{X}\sigma_{Y}}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^{2})-E^{2}(X)}\sqrt{E(Y^{2})-E^{2}(Y)}},$ where E is the expectation operator and cov is the covariance. Here $\mu_{X}=E(X)$, $\sigma_{X}^{2}=E(X^{2})-E^{2}(X)$, and similarly for the second variable, Y, and $cov(X,Y)=E(XY)-E(X)E(Y)$.
**Sample correlation: replace the unknown expectations and standard deviations by the sample means and sample standard deviations. Suppose $\{X_{1},X_{2},\ldots,X_{n}\}$ and $\{Y_{1},Y_{2},\ldots,Y_{n}\}$ are bivariate observations of the same process and $(\mu_{X},\sigma_{X})$, $(\mu_{Y},\sigma_{Y})$ are the means and standard deviations for the X and Y measurements, respectively. Then $\rho(x,y)=\frac{\sum x_{i} y_{i}-n\bar{x}\bar{y}}{(n-1)s_{x} s_{y}}=\frac{n \sum x_{i} y_{i}-\sum x_{i}\sum y_{i}}{\sqrt{n\sum x_{i}^{2} -(\sum x_{i})^{2}}\sqrt{n\sum y_{i}^{2}-(\sum y_{i})^{2}}}$; equivalently, $\rho(x,y)=\frac{\sum(x_{i}-\bar x)(y_{i}-\bar y)}{(n-1)s_{x} s_{y}}=\frac{1}{n-1}\sum\frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}$, where $\bar x$ and $\bar y$ are the sample means of $X$ and $Y$, and $s_{x}$ and $s_{y}$ are the sample standard deviations of $X$ and $Y$.

====Hands-on Example====
Human weight and height: suppose we take only 6 of the over 25,000 observations of human weight and height included in the [[SOCR_Data_Dinov_020108_HeightsWeights| SOCR dataset]].
<center>
{| class="wikitable" style="text-align:center; width:95%" border="1"
|-
|Subject Index|| Height $(x_{i})$ in cm || Weight $(y_{i})$ in kg || $x_{i}-\bar x$ ||$y_{i}-\bar y$ || $(x_{i}-\bar x)^{2}$ || $(y_{i}-\bar y)^{2}$ ||$(x_{i}-\bar x)(y_{i}-\bar y)$
|-
|1||167||60|| 6|| 4.67|| 36|| 21.82|| 28.02
|-
|2|| 170|| 64|| 9|| 8.67 ||81|| 75.17|| 78.03
|-
|3|| 160|| 57|| -1|| 1.67|| 1|| 2.79|| -1.67
|-
|4|| 152|| 46|| -9|| -9.33|| 81|| 87.05 ||83.97
|-
|5|| 157|| 55|| -4|| -0.33|| 16|| 0.11|| 1.32
|-
|6|| 160|| 50|| -1|| -5.33|| 1|| 28.41|| 5.33
|-
|Total||966 ||332 ||0 ||0 ||216|| 215.33||195
|}
</center>

$\bar x=\frac{966}{6}=161,\ \bar y=\frac{332}{6}=55.33,\ s_{x}=\sqrt{\frac{216}{5}}=6.57,\ s_{y}=\sqrt{\frac{215.33}{5}}=6.56.$

$\rho(x,y)=\frac{1}{n-1}\sum\frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}=0.904$
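
The same calculation can be sketched in R, once with the standardized-product formula and once with the built-in <code>cor()</code> function. The variable names <code>height</code> and <code>weight</code> are illustrative.

```r
# Six (height, weight) pairs from the hands-on table
height <- c(167, 170, 160, 152, 157, 160)   # x_i, in cm
weight <- c(60, 64, 57, 46, 55, 50)          # y_i, in kg
n <- length(height)

# Sample means and sample standard deviations
x_bar <- mean(height); s_x <- sd(height)
y_bar <- mean(weight); s_y <- sd(weight)

# Sample correlation as the average product of standardized deviations
r_manual <- sum(((height - x_bar) / s_x) * ((weight - y_bar) / s_y)) / (n - 1)

# Same quantity from the built-in function
r_builtin <- cor(height, weight)

round(c(x_bar = x_bar, y_bar = y_bar, s_x = s_x, s_y = s_y, r = r_manual), 3)
```

Both approaches give a correlation of about 0.904, in agreement with the hand calculation.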

====Slope inference====
We can conduct inference on the linear relationship between two quantitative variables through inference on the slope. The basic idea is to regress the dependent variable on the predictor, assuming they have a linear relationship, which gives the linear model y=α+βx+ε. Here β is the true slope of the linear relationship, α is the intercept of the true line on the y-axis, and ε is the random variation. The slope describes the change in the dependent variable y associated with a unit change in x.

*Conditions for the test of the significance of the slope β: (1) there is evidence of a real linear relationship, which can be checked with the initial scatterplot of y vs. x and the residual plots; (2) the observations must be independent, and the best evidence for this is a random sample; (3) the variation about the line must be constant, that is, the variance of the residuals should be constant, which can be checked with a plot of the residuals; (4) the response variable must be normally distributed about the line, which can be checked with a histogram or normal probability plot of the residuals.
*Formula we use: $t=\frac{b-\beta}{SE_{b}}$, where $b$ is the estimated slope, $\beta$ is the parameter we are testing, and $SE_{b}$ is the standard error of $b$. Under the null hypothesis $\beta=0$ (there is no linear relationship between y and x), the test statistic is $t=\frac{b}{SE_{b}}$, which follows a t distribution with n-2 degrees of freedom.

====R Examples====
=====Body Fat and Age=====
Consider a study conducted to see whether body fat is associated with age. The data include the percentage of body fat and the age of 18 subjects.

<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
|Age|| Percentage of Body Fat
|-
|23||9.5
|-
|23||27.9
|-
|27||7.8
|-
|27|| 17.8
|-
|39 ||31.4
|-
|41|| 25.9
|-
|45 ||27.4
|-
|49|| 25.2
|-
|50 ||31.1
|-
|53 ||34.7
|-
|53 ||42
|-
|54 ||29.1
|-
|56 ||32.5
|-
|57 ||30.3
|-
|58|| 33
|-
|58|| 33.8
|-
|60|| 41.1
|-
|61|| 34.5
|}
</center>

The hypotheses tested are $H_{0}:\beta=0$ vs. $H_{a}:\beta\ne0$; we use a t-test here, given that the data are a random sample from the population.

In R
###
###
## first check the linearity of the relationship using a scatterplot
x <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
y <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)
plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')
cor(x,y)

[1] 0.7920862

[[Image:SMHS SLR Fig 1.png|500px]]

The scatterplot shows a linear relationship between x and y, and the correlation $r=0.7920862$ indicates a strong positive association, which confirms the eyeball test from the scatterplot about the linear relationship between age and percentage of body fat.

Then we fit a simple linear regression of y on x and draw the scatterplot along with the fitted line:

fit <- lm(y~x)

plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')

abline(fit)

[[Image:SMHS SLR Fig 2.png|500px]]

summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-10.2166 -3.3214 -0.8424 1.9466 12.0753

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2209 5.0762 0.635 0.535
x 0.5480 0.1056 5.191 8.93e-05 ***

plot(resid(fit), main='Residual Plot')
abline(h=0) # horizontal reference line at zero

[[Image:SMHS SLR Fig3.png|500px]]

qqnorm(resid(fit)) # check the normality of the residuals

[[Image:SMHS SLR Fig4.png|500px]]

From the residual plot and the QQ plot of the residuals, we can see that the constant variance and normality requirements are met, with no heavy tails, so the regression model is reasonable. From the summary of the regression model, the t-test on the slope has a t value of 5.191 and a p-value of 8.93e-05. We can reject the null hypothesis of no linear relationship and conclude that there is a significant linear relationship between age and percentage of body fat at the 5% level of significance.

The confidence interval for the parameter tested is $b±t^{*} SE_{b}$, where b is the slope of the least squares regression line, $t^{*}$ is the upper $\frac {1-C} {2}$ critical value from the t distribution with n-2 degrees of freedom, and $SE_{b}$ is the standard error of the slope.

The standard error of the slope is 0.1056 and $t^{*}=2.12$ for 16 degrees of freedom, so the 95% CI of the slope is $(0.5480-0.1056\times 2.12,0.5480+0.1056\times 2.12)$, that is, $(0.324,0.772)$. So, we are 95% confident that the true slope lies between 0.324 and 0.772.
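
The test statistic and the interval can be reproduced in R. This is a sketch using the age and body fat data of this example; the variable names <code>age</code> and <code>fat</code> are illustrative, and the code avoids the dollar-sign accessor by using <code>coef()</code> and <code>confint()</code>.

```r
# Age and percentage of body fat for the 18 subjects
age <- c(23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61)
fat <- c(9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
         34.7, 42, 29.1, 32.5, 30.3, 33, 33.8, 41.1, 34.5)
n <- length(age)

fit <- lm(fat ~ age)
coef_tab <- coef(summary(fit))        # matrix: Estimate, Std. Error, t value, Pr(>|t|)

b    <- coef_tab["age", "Estimate"]   # estimated slope
se_b <- coef_tab["age", "Std. Error"] # standard error of the slope

# Test statistic for H0: beta = 0 and its two-sided p-value (n - 2 degrees of freedom)
t_stat  <- b / se_b
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval: b +/- t* SE_b
t_crit     <- qt(0.975, df = n - 2)
ci_manual  <- c(b - t_crit * se_b, b + t_crit * se_b)
ci_builtin <- confint(fit, "age", level = 0.95)

round(c(b = b, SE = se_b, t = t_stat, lower = ci_manual[1], upper = ci_manual[2]), 3)
```

The hand-built interval and the one returned by <code>confint()</code> coincide, since both use the same critical value and standard error.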

=====Baseball Example=====
We study a random sample (size 16) of baseball teams; the data show each team’s batting average and the total number of runs scored for the season.

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Batting average|| Number of runs scored
|-
|0.294|| 968
|-
|0.278|| 938
|-
|0.278 ||925
|-
|0.27|| 887
|-
|0.274 ||825
|-
|0.271|| 810
|-
|0.263|| 807
|-
|0.257 ||798
|-
|0.267 ||793
|-
|0.265 || 792
|-
|0.256|| 764
|-
|0.254|| 752
|-
|0.246|| 740
|-
|0.266|| 738
|-
|0.262||731
|-
|0.251 ||708
|}
</center>

In R:
x <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
y <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)
cor(x,y)
[1] 0.8654923

The correlation between x and y is 0.8655, which is a strong positive correlation. So it is reasonable to assume a linear regression model for the number of runs scored on the batting average.

fit <- lm(y~x)
summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-74.427 -26.596 1.899 38.156 57.062

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) -706.2 234.9 -3.006 0.00943 **
x 5709.2 883.1 6.465 1.49e-05 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.98 on 14 degrees of freedom

Multiple R-squared: 0.7491, Adjusted R-squared: 0.7312

F-statistic: 41.79 on 1 and 14 DF, p-value: 1.486e-05

plot(x,y,main='Scatterplot',xlab='Batting average',ylab='Number of runs')

abline(fit)

[[Image:SMHS_SLR_Fig5.png|500px]]

par(mfrow=c(1,2))

plot(resid(fit), main='Residual Plot')

abline(h=0)

qqnorm(resid(fit))

[[Image:SMHS SLR Fig6.png|500px]]

The estimated value of the slope is 5709.2, with standard error 883.1, t value = 6.465, and p-value 1.49e-05, so we reject the null hypothesis and conclude that there is a significant linear relationship between the batting average and the number of runs. With $t^{*}=2.145$ for 14 degrees of freedom, the 95% CI of the slope is $(5709.2-883.1\times 2.145,5709.2+883.1\times 2.145)$, that is, approximately $(3815,7603)$. So, we are 95% confident that the true slope lies between 3815 and 7603.

You can also use [http://www.socr.ucla.edu/htmls/ana/SimpleRegression_Analysis.html SOCR SLR Analysis Simple Regression] to copy-paste the data in the applet, estimate regression slope and intercept and compute the corresponding statistics and p-values.

Simple Linear Regression Results:

Mean of C1 = 46.33333
Mean of C2 = 28.61111
Regression Line:
C2 = 3.22086 + 0.5479910213243551 C1
Correlation(C1, C2) = .79209
R-Square = .62740
Intercept:
Parameter Estimate: 3.22086
Standard Error: 5.07616
T-Statistics: .63451
P-Value: .53472
Slope:
Parameter Estimate: .54799
Standard Error: .10558
T-Statistics: 5.19053
P-Value: .00009

[[Image:SMHS SLR Fig7.png|600px]]

[[Image:SMHS SLR Fig8.png|600px]]

[[Image:SMHS SLR Fig9.png|600px]]

====Statistical inference on correlation coefficient====
To test $H_{0}:r=\rho$ vs. $H_{a}:r≠\rho$, where $r$ is the correlation between X and Y, we use the test statistic $t_o = \frac{r-\rho}{\sqrt{\frac{1-r^2}{N-2}}}$, which follows a [[AP_Statistics_Curriculum_2007_StudentsT|T distribution]] with $df=N-2$.

Comparing two correlation coefficients: Fisher’s transformation provides a mechanism for comparing two correlation coefficients using the Normal distribution. Suppose we have 2 independent paired samples
${(X_{i},Y_{i})}_{i=1}^{n_{1}}$ and ${(U_{j},V_{j} )}_{j=1}^{n_{2}}$, with $r_{1}=corr(X,Y)$ and $r_{2}=corr(U,V)$, and we are testing $H_{0}: r_{1}=r_{2}$ vs. $H_{a}:r_{1}≠r_{2}$. Fisher’s transformation of a correlation is defined by $\hat{r}=\frac{1}{2}\log_{e}\left|\frac{1+r}{1-r}\right|$. Transforming the two correlation coefficients separately yields $r_{11}=\frac{1}{2}\log_{e}\left|\frac {1+r_{1}}{1-r_{1}}\right|$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right|$. The test statistic is $Z_{0} =\frac {r_{11}-r_{22}} {\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}}$, which is approximately standard Normal under $H_{0}$.

Note that the hypotheses for the single- and two-sample inference are $H_{0}:r=0$ vs. $H_{a}:r≠0$ and $H_{0}:r_{1}-r_{2}=0$ vs. $H_{a}:r_{1}-r_{2}≠0$, respectively. An estimate of the standard deviation of the (Fisher transformed!) correlation is $SD(\hat{r})=\sqrt{\frac{1}{n-3}}$, thus under $H_{0}:r=0$, $\hat{r}\sim N\bigg (0,\sqrt{\frac{1} {n-3}}\bigg )$.
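As an illustration of the single-sample test of zero correlation, the sketch below evaluates the T statistic from this subsection for the baseball example (r = 0.8654923, N = 16). The helper name cor_t_test is hypothetical and is not a base R function; base R offers cor.test() when the raw data are available.

```r
# Hypothetical helper: T statistic and two-sided p-value for H0: correlation = 0,
# computed from a sample correlation r and sample size n
cor_t_test <- function(r, n) {
  t0 <- r * sqrt(n - 2) / sqrt(1 - r^2)   # same as r / sqrt((1 - r^2) / (n - 2))
  p  <- 2 * pt(-abs(t0), df = n - 2)      # two-sided p-value from t(n - 2)
  c(t = t0, df = n - 2, p.value = p)
}

# Baseball example: correlation reported by cor(x, y), 16 teams
res <- cor_t_test(r = 0.8654923, n = 16)
print(res)
```

The resulting statistic coincides (up to rounding) with the t value reported for the slope in the regression summary, since in simple linear regression the test of zero slope and the test of zero correlation are equivalent.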

=====Brain Volume Example=====
The table below lists the brain volumes (responses) and ages (predictors) for 2 cohorts of subjects (Group 1 and Group 2).

<center>
{|class="wikitable" style="text-align:center; width:90%" border="1"
|-
|Group1 ||Age1 ||Volume1||Group2||Age2 ||Volume2
|-
|1|| 58|| 0.269609 ||2|| 59 ||0.27905
|-
|1|| 55|| 0.277243 ||2|| 50 ||0.262916
|-
|1|| 61|| 0.236264|| 2|| 58|| 0.290697
|-
|1|| 70|| 0.218015|| 2|| 58|| 0.269361
|-
|1|| 38|| 0.287205|| 2|| 61|| 0.268247
|-
|1|| 41 ||0.307387 ||2|| 57|| 0.294204
|-
|1|| 40|| 0.271063|| 2|| 50|| 0.292699
|-
|1|| 25 ||0.307688|| 2|| 38|| 0.273969
|-
|1|| 70|| 0.237811|| 2|| 57|| 0.29049
|-
|1|| 49|| 0.293371|| 2|| 64|| 0.286564
|-
|1|| 56|| 0.252592|| 2|| 71|| 0.257386
|-
|1|| 56|| 0.251349|| 2|| 34|| 0.314958
|-
|1 ||40|| 0.29616 ||2|| 53|| 0.298022
|-
|1|| 50|| 0.249596|| 2|| 53|| 0.269229
|-
|1|| 55|| 0.282721|| 2|| 25|| 0.270634
|-
|1 ||69|| 0.247565|| 2|| 61|| 0.266905
|}
</center>

We have two independent groups with $Y=volume1$ (response) and $X=age1$ (predictor), and $V=volume2$ and $U=age2$, where $n_{1}=27$, $n_{2}=27$. We compute the 2 correlation coefficients: $r_{1}=corr(X,Y)=-0.75338$ and $r_{2}=corr(U,V)=-0.49491$. Using Fisher’s transformation we obtain $r_{11}=\frac{1}{2}\log_{e}\left|\frac {1+r_{1}}{1-r_{1}}\right| = -0.980749$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right| = -0.5425423$, so $Z_{0} =\frac {r_{11}-r_{22}} {\sqrt {\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}} = -1.517993$. The corresponding 1-sided p-value is $0.064508$ and the double-sided p-value is $0.129016$.
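The two-sample comparison can be reproduced from the summary statistics alone. This is a minimal sketch, assuming the reported correlations and sample sizes and using the sum of the two variance terms in the denominator, as in the definition of $Z_{0}$; the variable names are illustrative.

```r
# Reported sample correlations and sample sizes for the two cohorts
r1 <- -0.75338; n1 <- 27
r2 <- -0.49491; n2 <- 27

# Fisher transformation: atanh(r) equals 0.5 * log((1 + r) / (1 - r))
z1 <- atanh(r1)
z2 <- atanh(r2)

# Test statistic: difference of transformed correlations over its standard error
z0 <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

# One-sided (lower tail, since z0 is negative) and two-sided p-values
p_one <- pnorm(z0)
p_two <- 2 * pnorm(-abs(z0))

print(c(z1 = z1, z2 = z2, z0 = z0, p_one = p_one, p_two = p_two))
```

With these inputs the p-values agree with the ones quoted above, so at the 5% level the difference between the two correlations is not statistically significant.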

===Simple linear regression (SLR)===
Modeling of the linear relations between two variables using regression analysis.
Either $Y$ is an observed variable and $X$ is specified by the researcher (e.g., $Y$ is hair growth after $X$ months, for individuals at a certain dose level of hair growth cream), or $X$ and $Y$ are both observed variables.
*Estimating the best linear fit: the simple linear regression model $Y=a+bX+\varepsilon $ can be estimated using least squares, which fits the line minimizing the sum of the squared residuals $\hat\varepsilon_{i}=\hat y_{i} -y_{i}$, that is, $\sum_{i=1}^{N} \hat\varepsilon_i^{2}=\sum_{i=1}^{N}(\hat y_{i}-y_{i} )^{2}$, where $y_{i}$ and $\hat y_{i} = a+bx_{i}$ are the observed and predicted values of $Y$ for $x_{i}$.

*Solving for the minimization problem:
: $\hat{b} = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y}) } {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}} {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2} = \rho_{X,Y} \frac {s_y}{s_x},$
: where [[AP_Statistics_Curriculum_2007_GLM_Corr | $\rho_{X,Y}$ is the correlation coefficient]].

: $\hat a=\bar y-\hat b\bar x$.

*Properties of the least squares line: (1) the line goes through the point ($\bar{X},\bar{Y}$); (2) the sum of the residuals is equal to zero; (3) the estimates are unbiased, that is, their expected values are equal to the real slope and intercept values.

*Regression coefficients inference: when the error terms are normally distributed, the estimate of the slope coefficient has a normal distribution with mean equal to $b$ and standard error $SE(\hat b) = s_{\hat b}=\sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}\hat\varepsilon_{i}^{2}} {\sum_{i=1}^{N}(x_{i}-\bar x)^{2}}}$. This lets us construct confidence intervals for the slope and intercept of the linear model. Given that $\frac{\hat b-b}{s_{\hat b}}$ follows a T distribution with $N-2$ degrees of freedom, the confidence interval for $b$ is $[\hat b-s_{\hat b}t(\frac{\alpha}{2},N-2),\hat b+s_{\hat b}t(\frac{\alpha}{2},N-2)]$. The corresponding test for the regression slope coefficient $b$ is analogously computed ($H_{0}:b=b_{0}$ vs. $H_{a}: b≠b_{0}$), and the test statistic is $t_{0}=\frac{\hat b-b_{0}}{s_{\hat b}} \sim T_{\{df=N-2\}}.$
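The closed-form estimates in the list above can be evaluated by hand and compared with lm(). The sketch below uses the baseball x and y vectors from the R example earlier on this page; the names b_hat, a_hat, res, se_b and t0 are illustrative.

```r
# Baseball data: batting average (x) and runs scored (y)
x <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,
       0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
y <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)
n <- length(x)

# Least squares slope and intercept from the closed-form solution
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# Residuals, standard error of the slope, and t statistic for H0: b = 0
res  <- y - (a_hat + b_hat * x)
se_b <- sqrt((sum(res^2) / (n - 2)) / sum((x - mean(x))^2))
t0   <- b_hat / se_b

# Cross-check against the built-in linear model fit
fit <- lm(y ~ x)
print(c(intercept = a_hat, slope = b_hat, se_slope = se_b, t = t0))
print(all.equal(unname(coef(fit)), c(a_hat, b_hat)))
```

The sum of the residuals in res is zero up to floating-point error, which illustrates property (2) of the least squares line.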

====Earthquake Data Example====
* Use the [[SOCR_Data_Dinov_021708_Earthquakes |SOCR Data Earthquakes]] to find the best linear fit between the longitude and the latitude of the California earthquakes since 1900.
* The SOCR Geomap of these earthquakes, [http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Earthquake5Data_GoogleMap.html SOCR Google Map Earthquakes], shows the SLR fit to the earthquake data.

===Applications===
* [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_SLR This article] presents the SLR analysis activity in SOCR analysis. It starts with a general introduction to the SLR model and then illustrates the method in detail with various examples. The article helps readers read SLR results, interpret the slope and intercept, and observe and interpret various data and resulting plots, including scatter plots, normal QQ plots, and diagnostic plots such as the residual-on-fit plot.

* [http://europepmc.org/abstract/MED/3840866 This article], titled Simple Linear Regression In Medical Research, discusses the method of fitting a straight line to data by linear regression and focuses on examples from 36 original articles published in 1978 and 1979. It concludes that investigators need to become better acquainted with residual plots, which give insight into how well the fitted line models the data, and with confidence bounds for regression lines. Statistical computing packages enable investigators to use these techniques easily.

* [http://ww2.coastal.edu/kingw/statistics/R-tutorials/simplelinear.html This article] presents an R tutorial for simple linear regression. It starts with fundamental checks on the data and comments on the patterns found, and then fits a linear regression model of height and weight. It also modifies the regression with Lowess smoothing and discusses the locally weighted scatterplot smoother. This article is a comprehensive study of SLR and correlation in R.

* [http://www.tandfonline.com/doi/abs/10.1080/00401706.1975.10489279 This article], titled The Probability Plot Correlation Coefficient Test For Normality, introduces the normal probability plot correlation coefficient as a test statistic in complete samples for the composite hypothesis of normality. The proposed test statistic is conceptually simple and is readily extendable to testing non-normal distribution hypotheses. The paper includes an empirical power study, which shows that the normal probability plot correlation coefficient compares favorably with seven other normal test statistics.

===Software===
* [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
* [http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm Normal Chi-Squared F Tables]

===Problems===
Example 1: Simple linear correlation and regression in R:
> library(MASS)
> data(cats)
> str(cats)

'data.frame': 144 obs. of 3 variables:
$ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
$ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

> summary(cats)

Sex Bwt Hwt
F:47 Min. :2.000 Min. : 6.30
M:97 1st Qu.:2.300 1st Qu.: 8.95
Median :2.700 Median :10.10
Mean :2.724 Mean :10.63
3rd Qu.:3.025 3rd Qu.:12.12
Max. :3.900 Max. :20.50
[[Image:SMHS SLR Fig10.png|300px]]

A positive correlation between two variables X and Y means that if X increases, this will cause the value of Y to increase.
*(a) This is always true.
*(b) This is sometimes true.
*(c) This is never true.

The correlation between working out and body fat was found to be exactly -1.0. Which of the following would not be true about the corresponding scatterplot?
*(a) The slope of the best line of fit should be -1.0.
*(b) All the points would lie along a perfect straight line, with no deviation at all.
*(c) The best fitting line would have a downhill (negative) slope.
*(d) 100% of the variance in body fat can be predicted from workout.

Suppose that the correlation between working out and body fat was found to be exactly -1.0. Which of the following would NOT be true, about the corresponding scatterplot?
*(a) All points would lie along a straight line, with no deviation at all.
*(b) 100% of the variance in body fat can be predicted from the workout.
*(c) The slope of the linear model is -1.0.
*(d) The best fitting line would have a negative slope.

If the correlation coefficient is 0.80, then:
*(a) The explanatory variable is usually less than the response variable.
*(b) The explanatory variable is usually more than the response variable.
*(c) None of the statements are correct.
*(d) Below-average values of the explanatory variable are more often associated with below-average values of the response variable.
*(e) Below-average values of the explanatory variable are more often associated with above-average values of the response variable.

Two different researchers wanted to study the relationship between math anxiety and taking exams. Researcher A measured anxiety with a scale that had a minimum score of 0 and a maximum score of 20, and a final exam that had a minimum score of 0 and a maximum score of 50. He tested 120 students. Researcher B measured anxiety with a scale that had a minimum of 0 and a maximum of 30, and a final exam that had a minimum score of 0 and a maximum score of 35. He tested 60 students. Researcher A found that the coefficient of correlation between a student's math anxiety and his or her score on the final was -0.60. Researcher B found the correlation between a student's math anxiety and his or her score on the final was -0.30.
*(a) The coefficient of correlation for researcher A is twice as strong as the coefficient of correlation for researcher B.
*(b) Based on the study by researcher A one can conclude that high math anxiety is the reason that a lot of the students do not do well in math.
*(c) Given that the coefficient of correlation shows the association between standardized scores, one can conclude that for researcher A a greater percentage of the students who have above average anxiety are likely to have a below average score on the final.
*(d) Given that the minimum and the maximum values for math and anxiety are so different for the two researchers one cannot compare the coefficient of correlation found by these two researchers.

In the early 1900's when Francis Galton and Karl Pearson measured 1078 pairs of fathers and their grown-up sons, they calculated that the mean height for fathers was 68 inches with deviation of 3 inches. For their sons, the mean height was 69 inches with deviation of 3 inches. (The actual deviations are a bit smaller, but we will work with these values to keep the calculations simple.) The correlation coefficient was 0.50. Use the information to calculate the slope of the linear model that predicts the height of the son from the height of the father.
*(a) 35.00
*(b) 0.50
*(c) The slope cannot be determined without the actual data
*(d) 3/3 = 1.00

Suppose that wildlife researchers monitor the local alligator population by taking aerial photographs on a regular schedule. They determine that the best fitting linear model to predict weight in pounds from the length of the gators in inches is:
Weight = -393 + 5.9*Length, with $r^2$ = 0.836.
Which of the following statements is true?
*(a) A gator that is about 10 inches above average in length is about 59 pounds above the average weight of these gators.
*(b) The correlation between a gator's length and weight is 0.836.
*(c) The correlation between a gator's height and weight cannot be determined without the actual data.
*(d) The correlation between a gator's height and weight is about -0.914.

Which of the following is NOT a property of the LSR Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The average x value and the average y value lie on the LSR Line
*(c) The sum of squared residuals is minimized
*(d) The sum of the residuals = 0

Suppose that the linear model that predicts fat content in grams from the protein of selected items from Burger Queen menu is: Fat = 6.83 + 0.97*Protein. We learn that there are actually 20 grams of fat in the Chucking burger that has 20 grams of protein. Which of the following statements is true?
*(a) The linear model underestimates the actual fat content and produces a residual of -6.23
*(b) The linear model overestimates the fat content and produces a residual of -6.23
*(c) The linear model underestimates the fat content and produces a residual of -6.23
*(d) The linear model overestimates the fat content and produces a residual of 6.2

Which statement describes the principle of "least squares" that we use in determining the best-fit line?
*(a) The best-fit line minimizes the distances between the observed values and the predicted values.
*(b) The best-fit line minimizes the sum of the squared residuals.
*(c) The best-fit line minimizes the sum of the residuals.
*(d) The best-fit line minimizes the sum of the distances between the actual values and the predicted values.

The scores of midterm and final exams for a random sample of Stats 10 students can be summarized as follows:
Mean of midterm score = 36.92; SD of midterm score = 37.79 Mean of final score = 24.71; SD of final score= 25.21 r= 0.978
Choose one answer.
*(a) 23.44
*(b) 0.62
*(c) 25.21
*(d) 35

Which of the following is NOT a property of the Least Squares Regression Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The sum of squared residuals is minimized
*(c) The average x value and the average y value lie on the LSR Line
*(d) The sum of the residuals = 0

Tom and Sue wanted to estimate the average self-esteem score. The true population average for self-esteem score is 20. Tom estimates that average by taking a sample of size n and then constructing a confidence interval. Which of the following is true?
I. The resulting interval will contain 20 II. The 95 percent confidence interval for n = 100 will generally be more narrow than the 95 percent confidence interval for n = 50. III. For n = 100, the 95 percent confidence interval will be wider than the 90 percent confidence interval.
*(a) II only
*(b) III only
*(c) I only
*(d) II and III

A simple random sample of 1000 persons is taken to estimate the percentage of Democrats in a large population. It turns out that 543 of the people in the sample are Democrats. Is the following statement true or false? Explain. (51%, 57.5%) is approximately a 95% confidence interval for the sample percentage of Democrats.
*(a) False, that is the approximate confidence interval for p. There is no confidence interval for the sample proportion.
*(b) True, we did the computations and those are approximately the numbers for the confidence interval for p.
*(c) True, that is the confidence interval for the sample mean.
*(d) False, the confidence interval for the sample proportion should be smaller than that.

Use the linear model to predict the height of a son whose father's height is 6 feet.
*(a) The son's height = 35 + 0.5(6) inches
*(b) The son's height = 35 + 0.5(72) inches
*(c) The "Regression Effect" states that the son will be a bit taller than his father
*(d) Cannot be determined without the data

A statistician wants to predict Z from Y. He finds that r-squared is 5%. Which one of the following conclusions is correct?
*(a) The coefficient of correlation between Y and Z is 0.05
*(b) Y explains 5% of the variance in Z
*(c) Y is a good predictor of Z
*(d) Z is a good predictor of Y

===References===
*[[Probability_and_statistics_EBook#Chapter_X:_Correlation_and_Regression | SOCR Correlation and Regression Chapter]]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_SLR}}

SMHS SLR

2014-09-22T14:54:17Z

Jslavine: /* Statistical inference on correlation coefficient */

==[[SMHS| Scientific Methods for Health Sciences]] - Correlation and Simple Linear Regression (SLR) ==

===Overview===
Many scientific applications involve the analysis of relationships between two or more variables involved in studying a process of interest. In this section, we will study the correlations between 2 variables and start with simple linear regressions. Consider the simplest of all situations in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the association between the two variables with an appropriate model for the given observations. The first part of this lecture will discuss correlations; we will then elaborate on the use of SLR to assess correlations.

===Motivation===
The analysis of relationships between two or more variables involved in a process of interest is widely applicable. We begin with the simplest of all situations, in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the relationship between these two variables using an appropriate model for the observations (e.g., fitting a straight line to the pairs of (X,Y) data). For example, we measure students' math scores on a final exam, and we want to find out whether there is any association between the final score and their participation rate in the math class. Another potential relationship of interest might be whether there is an association between weight and lung capacity. Simple linear regression is a useful method for addressing these questions, and it is particularly appropriate for assessing associations in simple cases.

===Theory===

*Correlation: The correlation coefficient $(-1≤\rho≤1)$ is a measure of linear association or clustering around a line of multivariate data. The primary relationship between two variables (X,Y) can be summarized by $(\mu_{X},\sigma_{X})$,$(\mu_{Y},\sigma_{Y})$ and the correlation coefficient denoted by $\rho$=$\rho(X,Y)$.
**The correlation is defined only if both of the standard deviations are finite and are nonzero and it is bounded by -1≤$\rho$≤1.
**If $\rho$=1, perfect positive correlation (straight line relationship between the two variables); if $\rho$=0, no correlation (random cloud scatter), i.e., no linear relation between X and Y; if $\rho$=-1, a perfect negative correlation between the variables.
**$\rho(X,Y)=\frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}=\frac{E((X-\mu_{X})(Y-\mu_{Y}))}{\sigma_{X}\sigma_{Y}}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^{2})-E^{2}(X)}\sqrt{E(Y^{2})-E^{2}(Y)}},$ where E is the expectation operator and cov is the covariance. Here $\mu_{X}=E(X)$, $\sigma_{X}^{2}=E(X^{2})-E^{2}(X)$, and similarly for the second variable, Y, and $cov(X,Y)=E(XY)-E(X)E(Y)$.
**Sample correlation: replace the unknown expectations and standard deviations by the sample means and sample standard deviations. Suppose $\{X_{1},X_{2},\ldots,X_{n}\}$ and $\{Y_{1},Y_{2},\ldots,Y_{n}\}$ are bivariate observations of the same process and $(\mu_{X},\sigma_{X})$, $(\mu_{Y},\sigma_{Y})$ are the means and standard deviations for the X and Y measurements, respectively. Then $\rho(x,y)=\frac{\sum x_{i} y_{i}-n\bar{x}\bar{y}}{(n-1)s_{x} s_{y}}=\frac{n \sum x_{i} y_{i}-\sum x_{i}\sum y_{i}} {\sqrt{n\sum x_{i}^{2} -(\sum x_{i})^{2}} \sqrt{ n\sum y_{i}^{2}-(\sum y_{i})^{2}}}$, or equivalently $\rho(x,y)=\frac{\sum(x_{i}-\bar x)(y_{i}-\bar y)}{(n-1)s_{x} s_{y}}=\frac{1}{n-1}\sum \frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}$, where $\bar x$ and $\bar y$ are the sample means of $X$ and $Y$, and $s_{x}$ and $s_{y}$ are the sample standard deviations of $X$ and $Y$.

====Hands-on Example====
Human weight and height: suppose we took only 6 of the over 25,000 observations of human weight and height included in the [[SOCR_Data_Dinov_020108_HeightsWeights| SOCR dataset]].
<center>
{| class="wikitable" style="text-align:center; width:95%" border="1"
|-
|Subject Index|| Height $(x_{i})$ in cm || Weight $(y_{i})$ in kg || $x_{i}-\bar x$ ||$y_{i}-\bar y$ || $(x_{i}-\bar x)^{2}$ || $(y_{i}-\bar y)^{2}$ ||$(x_{i}-\bar x)(y_{i}-\bar y)$
|-
|1||167||60|| 6|| 4.67|| 36|| 21.82|| 28.02
|-
|2|| 170|| 64|| 9|| 8.67 ||81|| 75.17|| 78.03
|-
|3|| 160|| 57|| -1|| 1.67|| 1|| 2.79|| -1.67
|-
|4|| 152|| 46|| -9|| -9.33|| 81|| 87.05 ||83.97
|-
|5|| 157|| 55|| -4|| -0.33|| 16|| 0.11|| 1.32
|-
|6|| 160|| 50|| -1|| -5.33|| 1|| 28.41|| 5.33
|-
|Total||966 ||332 ||0 ||0 ||216|| 215.33||195
|}
</center>

$\bar x=\frac{966}{6}=161,\bar y=\frac{332}{6}= 55.33,s_{x}=\sqrt{\frac{216}{5}}=6.57, s_{y}=\sqrt{\frac {215.33}{5}}=6.56.$

$\rho(x,y)=\frac{1}{n-1}\sum \frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}=0.904$
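The hand calculation can be verified in R. This is a minimal sketch using the six height and weight pairs from the table above; the variable names are illustrative, and cor() is the base R function for the sample correlation.

```r
# Six (height, weight) observations from the hands-on table
height <- c(167, 170, 160, 152, 157, 160)   # cm
weight <- c(60, 64, 57, 46, 55, 50)         # kg
n <- length(height)

# Sample correlation as the averaged product of standardized deviations
r_manual <- sum((height - mean(height)) / sd(height) *
                (weight - mean(weight)) / sd(weight)) / (n - 1)

# Built-in sample correlation for comparison
r_builtin <- cor(height, weight)

print(c(mean_x = mean(height), mean_y = mean(weight),
        s_x = sd(height), s_y = sd(weight)))
print(c(manual = r_manual, builtin = r_builtin))
```

Both values should round to 0.904, matching the result of the worked table.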

====Slope inference====
We can conduct inference on the linear relationship between two quantitative variables through inference on the slope. The basic idea is to regress the dependent variable on the predictor, assuming they have a linear relationship described by the model y=α+βx+ε, where β is the true slope of the linear relationship, α is the intercept of the true linear relationship on the y-axis, and ε is the random variation. The slope describes the change in the dependent variable y associated with a unit change in x.

*Conditions for testing the significance of the slope β: (1) there is evidence of a real linear relationship, which can be checked using the residual plots and the initial scatterplot of y vs. x; (2) the observations must be independent, and the best evidence for this is a random sample; (3) the variation about the line must be constant, that is, the variance of the residuals should be constant, which can be checked using the residual plots; (4) the response variable must have a normal distribution centered on the line, which can be checked with a histogram or normal probability plot.
*Formula we use: $t=\frac{b-\beta}{SE_{b}}$, where b is the estimated slope (the statistic), $\beta$ is the parameter being tested, and $SE_{b}$ is the standard error of b. The null hypothesis is $\beta=0$, that is, there is no linear relationship between y and x, so under the null hypothesis the test statistic is $t=\frac {b} {SE_{b}}$.

====R Examples====
=====Body Fat and Age=====
Consider a study conducted to see whether body fat is associated with age. The data include 18 subjects, with the percentage of body fat and the age of each subject.

<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
|Age|| Percentage of Body Fat
|-
|23||9.5
|-
|23||27.9
|-
|27||7.8
|-
|27|| 17.8
|-
|39 ||31.4
|-
|41|| 25.9
|-
|45 ||27.4
|-
|49|| 25.2
|-
|50 ||31.1
|-
|53 ||34.7
|-
|53 ||42
|-
|54 ||29.1
|-
|56 ||32.5
|-
|57 ||30.3
|-
|58|| 33
|-
|58|| 33.8
|-
|60|| 41.1
|-
|61|| 34.5
|}
</center>

The hypotheses tested are $H_{0}:\beta=0$ vs. $H_{a}:\beta\ne0$; we use a t-test here, given that the data are a random sample from the population.

In R
###
###
## first check the linearity of the relationship using a scatterplot
x <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
y <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)
plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')
cor(x,y)

[1] 0.7920862

[[Image:SMHS SLR Fig 1.png|500px]]

The scatterplot shows that there is a linear relationship between x and y, and the strong positive association of $r=0.7920862$ further confirms the eyeball test from the scatterplot about the linear relationship between age and percentage of body fat.

Then we fit a simple linear regression of y on x and draw the scatterplot along with the fitted line:

fit <- lm(y~x)

plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')

abline(fit)

[[Image:SMHS SLR Fig 2.png|500px]]

summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-10.2166 -3.3214 -0.8424 1.9466 12.0753

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2209 5.0762 0.635 0.535
x 0.5480 0.1056 5.191 8.93e-05 ***

plot(resid(fit),main='Residual Plot')
abline(h=0)

[[Image:SMHS SLR Fig3.png|500px]]

qqnorm(resid(fit)) # check the normality of the residuals

[[Image:SMHS SLR Fig4.png|500px]]

From the residual plot and the QQ plot of the residuals we can see that the data meet the constant variance and normality requirements, with no heavy tails, so the regression model is reasonable. From the summary of the regression model, the t-test on the slope has t value 5.191 and p-value 8.93e-05. We can reject the null hypothesis of no linear relationship and conclude that there is a significant linear relationship between age and percentage of body fat at the 5% level of significance.

The confidence interval for the parameter tested is $b±t^{*} SE_{b}$, where b is the slope of the least square regression line, $t^{*}$ is the upper $\frac {1-C} {2}$ critical value from the t distribution with degrees of freedom n-2 and $SE_{b}$ is the standard error of the slope.

The standard error of the slope is 0.1056, so we have the 95% CI of the slope is $(0.5480-0.1056*2.12,0.5480+0.1056*2.12)$, that is $(0.324,0.772)$. So, we are 95% confident that the slope will fall in the range between 0.324 and 0.772.
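The same interval can be obtained without looking up the critical value. The sketch below refits the body fat model from the data listed above and uses the base R functions confint(), qt() and df.residual(); the object names ci and t_crit are illustrative.

```r
# Body fat data: age (x) and percentage of body fat (y)
x <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
y <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,
       29.1,32.5,30.3,33,33.8,41.1,34.5)

fit <- lm(y ~ x)

# Critical value t* for a 95% interval with n - 2 = 16 degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

# 95% confidence interval for the slope, computed by confint()
ci <- confint(fit, "x", level = 0.95)

print(round(t_crit, 3))
print(round(ci, 3))
```

The confint() result is b ± t* SE_b evaluated at full precision, so it may differ from the hand calculation in the last digit.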

=====Baseball Example=====
We study a random sample (size 16) of baseball teams; the data show each team’s batting average and the total number of runs scored for the season.

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Batting average|| Number of runs scored
|-
|0.294|| 968
|-
|0.278|| 938
|-
|0.278 ||925
|-
|0.27|| 887
|-
|0.274 ||825
|-
|0.271|| 810
|-
|0.263|| 807
|-
|0.257 ||798
|-
|0.267 ||793
|-
|0.265 || 792
|-
|0.254|| 764
|-
|0.246|| 740
|-
|0.266|| 738
|-
|0.262||731
|-
|0.251 ||708
|}
</center>

In R:
x <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
y <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)
cor(x,y)
[1] 0.8654923

The correlation between x and y is 0.8655, which is a fairly strong positive correlation. It is therefore reasonable to assume a linear regression model relating the number of runs scored to the batting average.

fit <- lm(y~x)
summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-74.427 -26.596 1.899 38.156 57.062

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) -706.2 234.9 -3.006 0.00943 **
x 5709.2 883.1 6.465 1.49e-05 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.98 on 14 degrees of freedom

Multiple R-squared: 0.7491, Adjusted R-squared: 0.7312

F-statistic: 41.79 on 1 and 14 DF, p-value: 1.486e-05

plot(x,y,main='Scatterplot',xlab='Batting average',ylab='Number of runs')

abline(fit)

[[Image:SMHS_SLR_Fig5.png|500px]]

par(mfrow=c(1,2))

plot(resid(fit),main='Residual Plot')

abline(h=0)

qqnorm(resid(fit))

[[Image:SMHS SLR Fig6.png|500px]]

The estimated value of the slope is 5709.2, with standard error 883.1, t value = 6.465, and p-value 1.49e-05, so we reject the null hypothesis and conclude that there is a significant linear relationship between the batting average and the number of runs. The 95% CI of the slope is $(5709.2-883.1*2.145,5709.2+883.1*2.145)$, that is $(3815.0,7603.4)$. So, we are 95% confident that the true slope lies between 3815.0 and 7603.4.

You can also use [http://www.socr.ucla.edu/htmls/ana/SimpleRegression_Analysis.html SOCR SLR Analysis Simple Regression] to copy-paste the data in the applet, estimate regression slope and intercept and compute the corresponding statistics and p-values.

Simple Linear Regression Results:

Mean of C1 = 46.33333
Mean of C2 = 28.61111
Regression Line:
C2 = 3.22086 + 0.5479910213243551 C1
Correlation(C1, C2) = .79209
R-Square = .62740
Intercept:
Parameter Estimate: 3.22086
Standard Error: 5.07616
T-Statistics: .63451
P-Value: .53472
Slope:
Parameter Estimate: .54799
Standard Error: .10558
T-Statistics: 5.19053
P-Value: .00009

[[Image:SMHS SLR Fig7.png|600px]]

[[Image:SMHS SLR Fig8.png|600px]]

[[Image:SMHS SLR Fig9.png|600px]]

====Statistical inference on correlation coefficient====
To test $H_{0}:r=\rho$ vs. $H_{a}:r≠\rho$, where $r$ is the correlation between X and Y, we use the test statistic $t_o = \frac{r-\rho}{\sqrt{\frac{1-r^2}{N-2}}}$, which follows a [[AP_Statistics_Curriculum_2007_StudentsT|T distribution]] with $df=N-2$.

Comparing two correlation coefficients: Fisher’s transformation provides a mechanism for comparing two correlation coefficients using the Normal distribution. Suppose we have 2 independent paired samples
${(X_{i},Y_{i})}_{i=1}^{n_{1}}$ and ${(U_{j},V_{j} )}_{j=1}^{n_{2}}$, with $r_{1}=corr(X,Y)$ and $r_{2}=corr(U,V)$, and we are testing $H_{0}: r_{1}=r_{2}$ vs. $H_{a}:r_{1}≠r_{2}$. Fisher’s transformation of a correlation is defined by $\hat{r}=\frac{1}{2}\log_{e}\left|\frac{1+r}{1-r}\right|$. Transforming the two correlation coefficients separately yields $r_{11}=\frac{1}{2}\log_{e}\left|\frac {1+r_{1}}{1-r_{1}}\right|$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right|$. The test statistic is $Z_{0} =\frac {r_{11}-r_{22}} {\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}}$, which is approximately standard Normal under $H_{0}$.

Note that the hypotheses for the single- and two-sample inference are $H_{0}:r=0$ vs. $H_{a}:r≠0$ and $H_{0}:r_{1}-r_{2}=0$ vs. $H_{a}:r_{1}-r_{2}≠0$, respectively. An estimate of the standard deviation of the (Fisher transformed!) correlation is $SD(\hat{r})=\sqrt{\frac{1}{n-3}}$, thus under $H_{0}:r=0$, $\hat{r}\sim N\bigg (0,\sqrt{\frac{1} {n-3}}\bigg )$.

=====Brain Volume Example=====
The table below lists the brain volumes (responses) and ages (predictors) for 2 cohorts of subjects (Group 1 and Group 2).

<center>
{|class="wikitable" style="text-align:center; width:90%" border="1"
|-
|Group1 ||Age1 ||Volume1||Group2||Age2 ||Volume2
|-
|1|| 58|| 0.269609 ||2|| 59 ||0.27905
|-
|1|| 55|| 0.277243 ||2|| 50 ||0.262916
|-
|1|| 61|| 0.236264|| 2|| 58|| 0.290697
|-
|1|| 70|| 0.218015|| 2|| 58|| 0.269361
|-
|1|| 38|| 0.287205|| 2|| 61|| 0.268247
|-
|1|| 41 ||0.307387 ||2|| 57|| 0.294204
|-
|1|| 40|| 0.271063|| 2|| 50|| 0.292699
|-
|1|| 25 ||0.307688|| 2|| 38|| 0.273969
|-
|1|| 70|| 0.237811|| 2|| 57|| 0.29049
|-
|1|| 49|| 0.293371|| 2|| 64|| 0.286564
|-
|1|| 56|| 0.252592|| 2|| 71|| 0.257386
|-
|1|| 56|| 0.251349|| 2|| 34|| 0.314958
|-
|1 ||40|| 0.29616 ||2|| 53|| 0.298022
|-
|1|| 50|| 0.249596|| 2|| 53|| 0.269229
|-
|1|| 55|| 0.282721|| 2|| 25|| 0.270634
|-
|1 ||69|| 0.247565|| 2|| 61|| 0.266905
|}
</center>

We have two independent groups with $Y=volume1$ (response) and $X=age1$ (predictor); $V=volume2$ and $U=age2$, where $n_{1}=27$, $n_{2}=27$. We compute the 2 correlation coefficients: $r_{1}=corr(X,Y)=-0.75338$ and $r_{2}=corr(U,V)=-0.49491.$ Using the Fisher’s transformation we obtain: $r_{11}=\frac{1}{2}\log_{e}\left|\frac{1+r_{1}}{1-r_{1}}\right| = -0.980749$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right| = -0.5425423,$ so that $Z_{0}=\frac{r_{11}-r_{22}}{\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}} = -1.517993.$ The corresponding 1-sided p-value = $0.064508$, and the double-sided p-value = $0.129016$.

===Simple linear regression (SLR)===
Modeling of the linear relations between two variables using regression analysis.
Two settings are common: (1) $Y$ is an observed variable and $X$ is specified by the researcher, e.g., $Y$ is hair growth after $X$ months for individuals at a certain dose level of hair growth cream; (2) $X$ and $Y$ are both observed variables.
*Estimating the best linear fit: the simple linear regression model $Y=a+bX+\varepsilon$ can be estimated using least squares, which fits the line minimizing the sum of the squared residuals $\hat\varepsilon_{i}=\hat y_{i}-y_{i}$, that is, $\sum_{i=1}^{N} \hat\varepsilon_{i}^{2}=\sum_{i=1}^{N}(\hat y_{i}-y_{i})^{2}$, where $y_{i}$ and $\hat y_{i} = a+bx_{i}$ are the observed and predicted values of $Y$ for $x_{i}$.

*Solving for the minimization problem:
: $\hat{b} = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y}) } {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}} {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2} = \rho_{X,Y} \frac {s_y}{s_x},$
: where [[AP_Statistics_Curriculum_2007_GLM_Corr | $\rho_{X,Y}$ is the correlation coefficient]].

: $\hat a=\bar y-\hat b\bar x$.

*Properties of the least squares line: (1) the line goes through the point $(\bar{X},\bar{Y})$; (2) the sum of the residuals is equal to zero; (3) the estimates are unbiased, that is, their expected values are equal to the true slope and intercept values.

*Regression coefficients inference: when the error terms are normally distributed, the estimate of the slope coefficient has a normal distribution with mean equal to $b$ and standard error $SE(\hat b)=s_{\hat b}=\sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}\hat\varepsilon_{i}^{2}}{\sum_{i=1}^{N}(x_{i}-\bar x)^{2}}}$. This allows us to construct confidence intervals for the slope and intercept of the linear model. Given that the standardized estimate $\frac{\hat b-b}{s_{\hat b}}$ follows a T distribution with $N-2$ degrees of freedom, the confidence interval for $b$ is $[\hat b-s_{\hat b}t(\frac{\alpha}{2},N-2),\hat b+s_{\hat b}t(\frac{\alpha}{2},N-2)]$. The corresponding test for the regression slope coefficient $b$ is analogously computed ($H_{0}:b=b_{0}$ vs. $H_{a}: b\neq b_{0}$), and the test statistic is $t_{0}=\frac{\hat b-b_{0}}{s_{\hat b}} \sim T_{\{df=N-2\}}.$
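The standard error and confidence interval for the slope can be checked numerically. The following R sketch is only an illustration: it assumes the built-in <code>cars</code> data (speed and stopping distance), which is not part of this section, and compares the hand-computed interval with the one returned by <code>confint()</code>.

```r
# Slope standard error and 95% CI computed from the formulas, using the
# built-in 'cars' data (speed, dist) purely as an illustrative dataset.
x <- cars$speed
y <- cars$dist
N <- length(x)

b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a_hat <- mean(y) - b_hat * mean(x)                                  # intercept
res   <- y - (a_hat + b_hat * x)                                    # residuals

s_b    <- sqrt((sum(res^2) / (N - 2)) / sum((x - mean(x))^2))       # SE of slope
t_crit <- qt(1 - 0.05 / 2, df = N - 2)                              # critical value
ci     <- c(b_hat - t_crit * s_b, b_hat + t_crit * s_b)

fit <- lm(y ~ x)
print(c(slope = b_hat, se = s_b, lower = ci[1], upper = ci[2]))
print(confint(fit)["x", ])   # should agree with the manual interval
```

The residuals are computed here as observed minus predicted; the sign convention does not matter because only their squares enter the standard error.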

====Earthquake Data Example====
* Use the [[SOCR_Data_Dinov_021708_Earthquakes |SOCR Data Earthquakes]] to find the best linear fit between the longitude and the latitude of the California earthquakes since 1900.
* The SOCR Geomap of these earthquakes, [http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Earthquake5Data_GoogleMap.html SOCR Google Map Earthquakes], shows the SLR fit to the earthquake data.

===Applications===
* [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_SLR This article] presents the SLR analysis activity in SOCR analysis. It starts with a general introduction to the SLR model and then illustrates the method in detail with various examples. The article helps readers read SLR results, interpret the slope and intercept, and observe and interpret various data and resulting plots, including scatter plots, normal QQ plots, and diagnostic plots such as the residual-on-fit plot.

* [http://europepmc.org/abstract/MED/3840866 This article], titled Simple Linear Regression In Medical Research, discusses the method of fitting a straight line to data by linear regression and focuses on examples from 36 original articles published in 1978 and 1979. It concludes that investigators need to become better acquainted with residual plots, which give insight into how well the fitted line models the data, and with confidence bounds for regression lines. Statistical computing packages enable investigators to use these techniques easily.

* [http://ww2.coastal.edu/kingw/statistics/R-tutorials/simplelinear.html This article] presents an R tutorial for simple linear regression. It starts with fundamental checks on the data and comments on the patterns found, and then fits a linear regression model with the height and weight. It also modifies the regression with Lowess smoothing and discusses the locally weighted scatterplot smooth. The article is a comprehensive study of SLR and correlation in R.

* [http://www.tandfonline.com/doi/abs/10.1080/00401706.1975.10489279 This article], titled The Probability Plot Correlation Coefficient Test For Normality, introduces the normal probability plot correlation coefficient as a test statistic in complete samples for the composite hypothesis of normality. The proposed test statistic is conceptually simple and is readily extendable to testing non-normal distribution hypotheses. The paper includes an empirical power study which shows that the normal probability plot correlation coefficient compares favorably with seven other normal test statistics.

===Software===
* [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
* [http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm Normal Chi-Squared F Tables]

===Problems===
Example 1: Simple linear correlation and regression in R:
> library(MASS)
> data(cats)
> str(cats)

'data.frame': 144 obs. of 3 variables:
$ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
$ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

> summary(cats)

Sex Bwt Hwt
F:47 Min. :2.000 Min. : 6.30
M:97 1st Qu.:2.300 1st Qu.: 8.95
Median :2.700 Median :10.10
Mean :2.724 Mean :10.63
3rd Qu.:3.025 3rd Qu.:12.12
Max. :3.900 Max. :20.50
[[Image:SMHS SLR Fig10.png|300px]]

A positive correlation between two variables X and Y means that if X increases, this will cause the value of Y to increase.
*(a) This is always true.
*(b) This is sometimes true.
*(c) This is never true.

The correlation between working out and body fat was found to be exactly -1.0. Which of the following would not be true about the corresponding scatterplot?
*(a) The slope of the best line of fit should be -1.0.
*(b) All the points would lie along a perfect straight line, with no deviation at all.
*(c) The best fitting line would have a downhill (negative) slope.
*(d) 100% of the variance in body fat can be predicted from workout.

Suppose that the correlation between working out and body fat was found to be exactly -1.0. Which of the following would NOT be true, about the corresponding scatterplot?
*(a) All points would lie along a straight line, with no deviation at all.
*(b) 100% of the variance in body fat can be predicted from the workout.
*(c) The slope of the linear model is -1.0.
*(d) The best fitting line would have a negative slope.

If the correlation coefficient is 0.80, then:
*(a) The explanatory variable is usually less than the response variable.
*(b) The explanatory variable is usually more than the response variable.
*(c) None of the statements are correct.
*(d) Below-average values of the explanatory variable are more often associated with below-average values of the response variable.
*(e) Below-average values of the explanatory variable are more often associated with above-average values of the response variable.

Two different researchers wanted to study the relationship between math anxiety and taking exams. Researcher A measured anxiety with a scale that had a minimum score of 0 and a maximum score of 20, and a final exam that had a minimum score of 0 and a maximum score of 50. He tested 120 students. Researcher B measured anxiety with a scale that had a minimum of 0 and a maximum of 30, and a final exam that had a minimum score of 0 and a maximum score of 35. He tested 60 students. Researcher A found that the coefficient of correlation between a student's math anxiety and his or her score on the final was -0.60. Researcher B found the correlation between a student's math anxiety and his or her score on the final was -0.30.
*(a) The coefficient of correlation for researcher A is twice as strong as the coefficient of correlation for researcher B.
*(b) Based on the study by researcher A one can conclude that high math anxiety is the reason that a lot of the students do not do well in math.
*(c) Given that the coefficient of correlation shows the association between standardized scores, one can conclude that for researcher A a greater percentage of the students who have above average anxiety are likely to have a below average score on the final.
*(d) Given that the minimum and the maximum values for math and anxiety are so different for the two researchers one cannot compare the coefficient of correlation found by these two researchers.

In the early 1900's when Francis Galton and Karl Pearson measured 1078 pairs of fathers and their grown-up sons, they calculated that the mean height for fathers was 68 inches with a standard deviation of 3 inches. For their sons, the mean height was 69 inches with a standard deviation of 3 inches. (The actual deviations are a bit smaller, but we will work with these values to keep the calculations simple.) The correlation coefficient was 0.50. Use the information to calculate the slope of the linear model that predicts the height of the son from the height of the father.
*(a) 35.00
*(b) 0.50
*(c) The slope cannot be determined without the actual data
*(d) 3/3 = 1.00

Suppose that wildlife researchers monitor the local alligator population by taking aerial photographs on a regular schedule. They determine that the best fitting linear model to predict weight in pounds from the length of the gators in inches is:
Weight = -393 + 5.9*Length, with $r^{2} = 0.836$.
Which of the following statements is true?
*(a) A gator that is about 10 inches above average in length is about 59 pounds above the average weight of these gators.
*(b) The correlation between a gator's length and weight is 0.836.
*(c) The correlation between a gator's height and weight cannot be determined without the actual data.
*(d) The correlation between a gator's height and weight is about -0.914.

Which of the following is NOT a property of the LSR Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The average x value and the average y value lies on the LSR Line
*(c) The sum of squared residuals is minimized
*(d) The sum of the residuals = 0

Suppose that the linear model that predicts fat content in grams from the protein of selected items from Burger Queen menu is: Fat = 6.83 + 0.97*Protein. We learn that there are actually 20 grams of fat in the Chucking burger that has 20 grams of protein. Which of the following statements is true?
*(a) The linear model underestimates the actual fat content and produces a residual of -6.23
*(b) The linear model overestimates the fat content and produces a residual of -6.23
*(c) The linear model underestimates the fat content and produces a residual of -6.23
*(d) The linear model overestimates the fat content and produces a residual of 6.2

Which statement describes the principle of "least squares" that we use in determining the best-fit line?
*(a) The best-fit line minimizes the distances between the observed values and the predicted values.
*(b) The best-fit line minimizes the sum of the squared residuals.
*(c) The best-fit line minimizes the sum of the residuals.
*(d) The best-fit line minimizes the sum of the distances between the actual values and the predicted values.

The scores of midterm and final exams for a random sample of Stats 10 students can be summarized as follows:
Mean of midterm score = 36.92; SD of midterm score = 37.79 Mean of final score = 24.71; SD of final score= 25.21 r= 0.978
Choose one answer.
*(a) 23.44
*(b) 0.62
*(c) 25.21
*(d) 35

Which of the following is NOT a property of the Least Squares Regression Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The sum of squared residuals is minimized
*(c) The average x value and the average y value lie on the LSR Line
*(d) The sum of the residuals = 0

Tom and Sue wanted to estimate the average self-esteem score. The true population average for self esteem score is 20. Tom estimates that average by taking a sample of size n and then constructing a confidence interval. Which of the following is true?
: I. The resulting interval will contain 20.
: II. The 95 percent confidence interval for n = 100 will generally be more narrow than the 95 percent confidence interval for n = 50.
: III. For n = 100, the 95 percent confidence interval will be wider than the 90 percent confidence interval.
*(a) II only
*(b) III only
*(c) I only
*(d) II and III

A simple random sample of 1000 persons is taken to estimate the percentage of Democrats in a large population. It turns out that 543 of the people in the sample are Democrats. Is the following statement true or false? Explain. "(51%, 57.5%) is approximately a 95% confidence interval for the sample percentage of Democrats."
*(a) False, that is the approximate confidence interval for p. There is no confidence interval for the sample proportion.
*(b) True, we did the computations and those are approximately the numbers for the confidence interval for p.
*(c) True, that is the confidence interval for the sample mean.
*(d) False, the confidence interval for the sample proportion should be smaller than that.

Use the linear model to predict the height of a son whose father's height is 6 feet.
*(a) The son's height = 35 + 0.5(6) inches
*(b) The son's height = 35 + 0.5(72) inches
*(c) The "Regression Effect" states that the son will be a bit taller than his father
*(d) Cannot be determined without the data

A statistician wants to predict Z from Y. He finds that r-squared is 5%. Which one of the following conclusions is correct?
*(a) The coefficient of correlation between Y and Z is 0.05
*(b) Y explains 5% of the variance in Z
*(c) Y is a good predictor of Z
*(d) Z is a good predictor of Y

===References===
*[[Probability_and_statistics_EBook#Chapter_X:_Correlation_and_Regression | SOCR Correlation and Regression Chapter]]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_SLR}}

SMHS SLR

2014-09-22T14:21:00Z

Jslavine: /* Theory */

==[[SMHS| Scientific Methods for Health Sciences]] - Correlation and Simple Linear Regression (SLR) ==

===Overview===
Many scientific applications involve the analysis of relationships between two or more variables involved in studying a process of interest. In this section, we will study the correlations between 2 variables and start with simple linear regressions. Consider the simplest of all situations in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the association between the two variables with an appropriate model for the given observations. The first part of this lecture will discuss correlations; we will then elaborate on the use of SLR to assess correlations.

===Motivation===
The analysis of relationships between two or more variables involved in a process of interest is widely applicable. We begin with the simplest of all situations, in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the relationship between these two variables using an appropriate model for the observations (e.g., fitting a straight line to the pairs of (X,Y) data). For example, suppose we measured students' math scores on a final exam and want to find out whether there is any association between the final score and their participation rate in the math class. Another potential relationship of interest might be whether there is an association between weight and lung capacity. Simple linear regression is a useful method for addressing these questions, and it is particularly appropriate for assessing associations in simple cases.

===Theory===

*Correlation: The correlation coefficient $(-1≤\rho≤1)$ is a measure of linear association or clustering around a line of multivariate data. The primary relationship between two variables (X,Y) can be summarized by $(\mu_{X},\sigma_{X})$,$(\mu_{Y},\sigma_{Y})$ and the correlation coefficient denoted by $\rho$=$\rho(X,Y)$.
**The correlation is defined only if both of the standard deviations are finite and are nonzero and it is bounded by -1≤$\rho$≤1.
**If $\rho$=1, perfect positive correlation (straight line relationship between the two variables); if $\rho$=0, no correlation (random cloud scatter), i.e., no linear relation between X and Y; if $\rho$=-1, a perfect negative correlation between the variables.
**$\rho(X,Y)=\frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}=\frac{E((X-\mu_{X})(Y-\mu_{Y}))}{\sigma_{X}\sigma_{Y}}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^{2})-E^{2}(X)}\sqrt{E(Y^{2})-E^{2}(Y)}},$ where E is the expectation operator and cov is the covariance; $\mu_{X}=E(X)$, $\sigma_{X}^{2}=E(X^{2})-E^{2}(X),$ and similarly for the second variable, Y, and $cov(X,Y)=E(XY)-E(X)E(Y)$.
**Sample correlation: replace the unknown expectations and standard deviations by the sample means and sample standard deviations. Suppose $\{X_{1},X_{2},…,X_{n}\}$ and $\{Y_{1},Y_{2},…,Y_{n}\}$ are bivariate observations of the same process and $(\mu_{X},\sigma_{X})$, $(\mu_{Y},\sigma_{Y})$ are the means and standard deviations for the X and Y measurements, respectively. Then $\rho(x,y)=\frac{\sum x_{i} y_{i}-n\bar{x}\bar{y}}{(n-1)s_{x} s_{y}}=\frac{n \sum x_{i} y_{i}-\sum x_{i}\sum y_{i}}{\sqrt{n\sum x_{i}^{2} -(\sum x_{i})^{2}}\sqrt{n\sum y_{i}^{2}-(\sum y_{i})^{2}}}$; equivalently, $\rho(x,y)=\frac{\sum(x_{i}-\bar x)(y_{i}-\bar y)}{(n-1)s_{x} s_{y}}=\frac{1}{n-1}\sum\frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}$, where $\bar x$ and $\bar y$ are the sample means of $X$ and $Y$, and $s_{x}$ and $s_{y}$ are the sample standard deviations of $X$ and $Y$.

====Hands-on Example====
Human weight and height: suppose we took only 6 of the over 25,000 observations of human weight and height included in the [[SOCR_Data_Dinov_020108_HeightsWeights| SOCR dataset]].
<center>
{| class="wikitable" style="text-align:center; width:95%" border="1"
|-
|Subject Index|| Height $(x_{i})$ in cm || Weight $(y_{i})$ in kg || $x_{i}-\bar x$ ||$y_{i}-\bar y$ || $(x_{i}-\bar x)^{2}$ || $(y_{i}-\bar y)^{2}$ ||$(x_{i}-\bar x)(y_{i}-\bar y)$
|-
|1||167||60|| 6|| 4.67|| 36|| 21.82|| 28.02
|-
|2|| 170|| 64|| 9|| 8.67 ||81|| 75.17|| 78.03
|-
|3|| 160|| 57|| -1|| 1.67|| 1|| 2.79|| -1.67
|-
|4|| 152|| 46|| -9|| -9.33|| 81|| 87.05 ||83.97
|-
|5|| 157|| 55|| -4|| -0.33|| 16|| 0.11|| 1.32
|-
|6|| 160|| 50|| -1|| -5.33|| 1|| 28.41|| 5.33
|-
|Total||966 ||332 ||0 ||0 ||216|| 215.33||195
|}
</center>

$\bar x=\frac{966}{6}=161,\ \bar y=\frac{332}{6}=55.33,\ s_{x}=\sqrt{\frac{216}{5}}=6.57,\ s_{y}=\sqrt{\frac{215.33}{5}}=6.56.$

$\rho(x,y)=\frac{1}{n-1}\sum\frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}=0.904$
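The hand calculation can be reproduced with a few lines of R. This is a minimal sketch using the six height and weight values from the table; the variable names are arbitrary.

```r
# Heights (cm) and weights (kg) of the 6 subjects in the table above
x <- c(167, 170, 160, 152, 157, 160)
y <- c(60, 64, 57, 46, 55, 50)
n <- length(x)

sx <- sd(x)   # sample SD (divides by n - 1)
sy <- sd(y)

# Sample correlation as the average product of standardized deviations
r_manual <- sum(((x - mean(x)) / sx) * ((y - mean(y)) / sy)) / (n - 1)

print(round(c(sx = sx, sy = sy, r = r_manual), 3))
print(cor(x, y))   # built-in Pearson correlation for comparison
```

Both routes should give the same value, since <code>cor()</code> computes the Pearson correlation by default.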

====Slope inference====
We can conduct inference on the linear relationship between two quantitative variables through inference on the slope. The basic idea is to regress the dependent variable on the predictor, assuming they have a linear relationship, using the linear model y=α+βx+ε, where β is the true slope of the linear relationship, α is the intercept of the true line on the y-axis, and ε is the random variation. The slope describes the change in the dependent variable y associated with a unit change in x.

*Conditions for the test of significance of the slope β: (1) there must be evidence of a real linear relationship, which can be checked using the initial scatterplot of y vs. x and the residual plots; (2) observations must be independent, and the best evidence for this is a random sample; (3) the variation about the line must be constant, that is, the variance of the residuals should be constant, which can be checked using the residual plots; (4) the response variable must have a normal distribution centered on the line, which can be checked with a histogram or normal probability plot of the residuals.
*Formula we use: $t=\frac{b-\beta}{SE_{b}}$, where b is the estimated slope (the statistic), $\beta$ is the parameter being tested, and $SE_{b}$ is the standard error of b. The null hypothesis is $\beta=0$, that is, there is no linear relationship between y and x, so under the null hypothesis the test statistic is $t=\frac {b} {SE_{b}}$.

====R Examples====
=====Body Fat and Age=====
Consider a study conducted to see if body fat is associated with age. The data include 18 subjects, with the percentage of body fat and the age of each subject.

<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
|Age|| Percentage of Body Fat
|-
|23||9.5
|-
|23||27.9
|-
|27||7.8
|-
|27|| 17.8
|-
|39 ||31.4
|-
|41|| 25.9
|-
|45 ||27.4
|-
|49|| 25.2
|-
|50 ||31.1
|-
|53 ||34.7
|-
|53 ||42
|-
|54 ||29.1
|-
|56 ||32.5
|-
|57 ||30.3
|-
|58|| 33
|-
|58|| 33.8
|-
|60|| 41.1
|-
|61|| 34.5
|}
</center>

The hypotheses tested: $H_{0}:\beta=0$ vs. $H_{a}:\beta\ne0$; we use a t-test here, given that the data are a random sample from the population.

In R:
## first check the linearity of the relationship using a scatterplot
x <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
y <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)
plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')
cor(x,y)

[1] 0.7920862

[[Image:SMHS SLR Fig 1.png|500px]]

The scatterplot shows that there is a linear relationship between x and y, and the strong positive association of $r=0.7920862$ further confirms the eyeball test from the scatterplot about the linear relationship of age and percentage of body fat.

Then we fit a simple linear regression of y on x and draw the scatterplot along with the fitted line:

fit <- lm(y~x)

plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')

abline(fit)

[[Image:SMHS SLR Fig 2.png|500px]]

summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-10.2166 -3.3214 -0.8424 1.9466 12.0753

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2209 5.0762 0.635 0.535
x 0.5480 0.1056 5.191 8.93e-05 ***

plot(fit$resid,main='Residual Plot')
abline(h=0) # horizontal reference line at zero

[[Image:SMHS SLR Fig3.png|500px]]

qqnorm(fit$resid) # check the normality of the residuals

[[Image:SMHS SLR Fig4.png|500px]]

From the residual plot and the QQ plot of the residuals, we can see that the data meet the constant variance and normality requirements, with no heavy tails, and the regression model is reasonable. From the summary of the regression model, the t-test on the slope has a t value of 5.191 and a p-value of 8.93e-05. We can reject the null hypothesis of no linear relationship and conclude that there is a significant linear relationship between age and percentage of body fat at the 5% level of significance.

The confidence interval for the parameter tested is $b±t^{*} SE_{b}$, where b is the slope of the least squares regression line, $t^{*}$ is the upper $\frac {1-C} {2}$ critical value from the t distribution with n-2 degrees of freedom (C is the confidence level), and $SE_{b}$ is the standard error of the slope.

The standard error of the slope is 0.1056 and, with 16 degrees of freedom, $t^{*}=2.12$, so the 95% CI of the slope is $(0.5480-0.1056*2.12,0.5480+0.1056*2.12)$, that is $(0.324,0.772)$. So, we are 95% confident that the true slope lies between 0.324 and 0.772.
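The interval can also be obtained directly in R. The sketch below refits the body fat model, builds the interval from the estimate, standard error, and <code>qt()</code> critical value, and compares it with the built-in <code>confint()</code>; the object names are arbitrary.

```r
# Body fat (%) versus age: 95% confidence interval for the slope
x <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
y <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)
fit <- lm(y ~ x)

b      <- coef(summary(fit))["x", "Estimate"]
se_b   <- coef(summary(fit))["x", "Std. Error"]
t_star <- qt(0.975, df = length(x) - 2)   # upper 2.5% critical value, df = 16

ci_manual <- c(lower = b - t_star * se_b, upper = b + t_star * se_b)
print(round(ci_manual, 3))
print(confint(fit, "x", level = 0.95))    # built-in equivalent
```

Small differences from the hand calculation in the third decimal place can arise because the text rounds the critical value to 2.12.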

=====Baseball Example=====
We are studying a random sample (size 16) of baseball teams; the data show each team’s batting average and the total number of runs scored for the season.

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Batting average|| Number of runs scored
|-
|0.294|| 968
|-
|0.278|| 938
|-
|0.278 ||925
|-
|0.27|| 887
|-
|0.274 ||825
|-
|0.271|| 810
|-
|0.263|| 807
|-
|0.257 ||798
|-
|0.267 ||793
|-
|0.265 || 792
|-
|0.256|| 764
|-
|0.254|| 752
|-
|0.246|| 740
|-
|0.266|| 738
|-
|0.262|| 731
|-
|0.251|| 708
|}
</center>

In R:
x <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
y <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)
cor(x,y)
[1] 0.8654923

The correlation between x and y is 0.8655, which is a fairly strong positive correlation. So it is reasonable to assume a linear regression model for the number of runs scored on the batting average.

fit <- lm(y~x)
summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-74.427 -26.596 1.899 38.156 57.062

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) -706.2 234.9 -3.006 0.00943 **
x 5709.2 883.1 6.465 1.49e-05 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.98 on 14 degrees of freedom

Multiple R-squared: 0.7491, Adjusted R-squared: 0.7312

F-statistic: 41.79 on 1 and 14 DF, p-value: 1.486e-05

plot(x,y,main='Scatterplot',xlab='Batting average',ylab='Number of runs')

abline(fit)

[[Image:SMHS_SLR_Fig5.png|500px]]

par(mfrow=c(1,2))

plot(fit$resid,main='Residual Plot')

abline(h=0) # horizontal reference line at zero

qqnorm(fit$resid)

[[Image:SMHS SLR Fig6.png|500px]]

The estimated value of the slope is 5709.2, with standard error 883.1, t value = 6.465, and p-value 1.49e-05, so we reject the null hypothesis and conclude that there is a significant linear relationship between the batting average and the number of runs. With 14 degrees of freedom, $t^{*}=2.145$, so the 95% CI of the slope is $(5709.2-883.1*2.145,5709.2+883.1*2.145)$, that is $(3815.0,7603.4)$. So, we are 95% confident that the true slope lies between 3815.0 and 7603.4.
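The slope test and interval for the baseball data can be rebuilt from the fitted model. The sketch below takes the estimate and standard error from <code>summary(fit)</code>, recomputes the t statistic and two-sided p-value by hand, and forms the 95% interval; the object names are arbitrary.

```r
# Runs scored versus batting average: slope t-test and 95% CI by hand
x <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
y <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)
fit <- lm(y ~ x)
est <- coef(summary(fit))["x", ]          # estimate, SE, t value, p-value

df     <- fit$df.residual                  # n - 2
t_val  <- est[["Estimate"]] / est[["Std. Error"]]
p_val  <- 2 * pt(-abs(t_val), df)          # two-sided p-value
t_star <- qt(0.975, df)                    # critical value for a 95% interval
ci     <- est[["Estimate"]] + c(-1, 1) * t_star * est[["Std. Error"]]

print(c(t = t_val, p = p_val, df = df))
print(ci)
print(confint(fit, "x"))                   # built-in equivalent
```

Taking the standard error straight from the model summary avoids transcription slips when copying values out of the printed output.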

You can also use [http://www.socr.ucla.edu/htmls/ana/SimpleRegression_Analysis.html SOCR SLR Analysis Simple Regression] to copy-paste the data into the applet, estimate the regression slope and intercept, and compute the corresponding statistics and p-values. For the body fat and age data above, the applet reports:

Simple Linear Regression Results:

Mean of C1 = 46.33333
Mean of C2 = 28.61111
Regression Line:
C2 = 3.22086 + 0.5479910213243551 C1
Correlation(C1, C2) = .79209
R-Square = .62740
Intercept:
Parameter Estimate: 3.22086
Standard Error: 5.07616
T-Statistics: .63451
P-Value: .53472
Slope:
Parameter Estimate: .54799
Standard Error: .10558
T-Statistics: 5.19053
P-Value: .00009

[[Image:SMHS SLR Fig7.png|600px]]

[[Image:SMHS SLR Fig8.png|600px]]

[[Image:SMHS SLR Fig9.png|600px]]

====Statistical inference on correlation coefficient====
Test of $H_{0}:r=\rho$ vs. $H_{a}:r\neq\rho$, where $r$ is the correlation between X and Y and $\rho$ is the hypothesized value: the test statistic is $t_o = \frac{r-\rho}{\sqrt{\frac{1-r^2}{N-2}}}$, which has a [[AP_Statistics_Curriculum_2007_StudentsT|T distribution]] with $df=N-2$.
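For the common case of a hypothesized value of 0, the statistic can be checked against R's <code>cor.test()</code>. The sketch below reuses the body fat and age data from the R example above; it is an illustration, not a prescribed workflow.

```r
# t-test for a zero correlation, by hand and with cor.test()
x <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
y <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)
n <- length(x)

r  <- cor(x, y)
t0 <- (r - 0) / sqrt((1 - r^2) / (n - 2))   # hypothesized value is 0
p  <- 2 * pt(-abs(t0), df = n - 2)           # two-sided p-value

ct <- cor.test(x, y)                         # Pearson test by default
print(c(r = r, t0 = t0, p = p))
print(ct$statistic)
```

In simple linear regression this t value coincides with the t value for the slope, which is why it matches the regression summary for the same data.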

Comparing two correlation coefficients: Fisher’s transformation provides a mechanism for comparing two correlation coefficients using the Normal distribution. Suppose we have 2 independent paired samples $\{(X_{i},Y_{i})\}_{i=1}^{n_{1}}$ and $\{(U_{j},V_{j})\}_{j=1}^{n_{2}}$, with $r_{1}=corr(X,Y)$ and $r_{2}=corr(U,V)$, and we are testing $H_{0}: r_{1}=r_{2}$ vs. $H_{a}:r_{1}\neq r_{2}$. The Fisher’s transformation of a correlation is defined by $\hat{r}=\frac{1}{2}\log_{e}\left|\frac{1+r}{1-r}\right|$. Transforming the two correlation coefficients separately yields $r_{11}=\frac{1}{2}\log_{e}\left|\frac{1+r_{1}}{1-r_{1}}\right|$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right|$, and the test statistic is $Z_{0}=\frac{r_{11}-r_{22}}{\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}}$.

Note that the hypotheses for the single and double sample inference are $H_{0}:r=0$ vs. $H_{a}:r\neq 0$ and $H_{0}:r_{1}-r_{2}=0$ vs. $H_{a}:r_{1}-r_{2}\neq 0$, respectively. An estimate of the standard deviation of the (Fisher transformed!) correlation is $SD(\hat{r})=\sqrt{\frac{1}{n-3}}$; thus, under $H_{0}:r=0$, $\hat{r}\sim N\bigg (0,\sqrt{\frac{1}{n-3}}\bigg )$ approximately.

=====Brain Volume Example=====
The brain volumes (responses) and age (predictors) for 2 cohorts of subjects (Group 1 and Group 2).

<center>
{|class="wikitable" style="text-align:center; width:90%" border="1"
|-
|Group1 ||Age1 ||Volume1||Group2||Age2 ||Volume2
|-
|1|| 58|| 0.269609 ||2|| 59 ||0.27905
|-
|1|| 55|| 0.277243 ||2|| 50 ||0.262916
|-
|1|| 61|| 0.236264|| 2|| 58|| 0.290697
|-
|1|| 70|| 0.218015|| 2|| 58|| 0.269361
|-
|1|| 38|| 0.287205|| 2|| 61|| 0.268247
|-
|1|| 41 ||0.307387 ||2|| 57|| 0.294204
|-
|1|| 40|| 0.271063|| 2|| 50|| 0.292699
|-
|1|| 25 ||0.307688|| 2|| 38|| 0.273969
|-
|1|| 70|| 0.237811|| 2|| 57|| 0.29049
|-
|1|| 49|| 0.293371|| 2|| 64|| 0.286564
|-
|1|| 56|| 0.252592|| 2|| 71|| 0.257386
|-
|1|| 56|| 0.251349|| 2|| 34|| 0.314958
|-
|1 ||40|| 0.29616 ||2|| 53|| 0.298022
|-
|1|| 50|| 0.249596|| 2|| 53|| 0.269229
|-
|1|| 55|| 0.282721|| 2|| 25|| 0.270634
|-
|1 ||69|| 0.247565|| 2|| 61|| 0.266905
|}
</center>

We have two independent groups with $Y=volume1$ (response) and $X=age1$ (predictor); $V=volume2$ and $U=age2$, where $n_{1}=27$, $n_{2}=27$. We compute the 2 correlation coefficients: $r_{1}=corr(X,Y)=-0.75338$ and $r_{2}=corr(U,V)=-0.49491.$ Using the Fisher’s transformation we obtain: $r_{11}=\frac{1}{2}\log_{e}\left|\frac{1+r_{1}}{1-r_{1}}\right| = -0.980749$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right| = -0.5425423,$ so that $Z_{0}=\frac{r_{11}-r_{22}}{\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}} = -1.517993.$ The corresponding 1-sided p-value = $0.064508$, and the double-sided p-value = $0.129016$.
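The comparison can be reproduced from the summary values alone. The R sketch below starts from the rounded correlations and sample sizes quoted in the text (not from the raw table), so its results may differ from the quoted figures in the fourth or fifth decimal place.

```r
# Two-sample comparison of correlations via Fisher's transformation,
# using the rounded summary values quoted in the text
r1 <- -0.75338; n1 <- 27
r2 <- -0.49491; n2 <- 27

z1 <- atanh(r1)   # same as 0.5 * log((1 + r1) / (1 - r1))
z2 <- atanh(r2)

# The variances of the two transformed correlations add in the denominator
z0 <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

p_one <- pnorm(z0)             # lower-tail p-value, since z0 is negative here
p_two <- 2 * pnorm(-abs(z0))   # two-sided p-value

print(c(z1 = z1, z2 = z2, z0 = z0, p_one = p_one, p_two = p_two))
```

The printed p-values can be compared with the 1-sided and double-sided values reported above.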

===Simple linear regression (SLR)===
Modeling of the linear relations between two variables using regression analysis.
Two settings are common: (1) $Y$ is an observed variable and $X$ is specified by the researcher, e.g., $Y$ is hair growth after $X$ months for individuals at a certain dose level of hair growth cream; (2) $X$ and $Y$ are both observed variables.
*Estimating the best linear fit: the simple linear regression model $Y=a+bX+\varepsilon$ can be estimated using least squares, which fits the line minimizing the sum of the squared residuals $\hat\varepsilon_{i}=\hat y_{i}-y_{i}$, that is, $\sum_{i=1}^{N} \hat\varepsilon_{i}^{2}=\sum_{i=1}^{N}(\hat y_{i}-y_{i})^{2}$, where $y_{i}$ and $\hat y_{i} = a+bx_{i}$ are the observed and predicted values of $Y$ for $x_{i}$.

*Solving for the minimization problem:
: $\hat{b} = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y}) } {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}} {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2} = \rho_{X,Y} \frac {s_y}{s_x},$
: where [[AP_Statistics_Curriculum_2007_GLM_Corr | $\rho_{X,Y}$ is the correlation coefficient]].

: $\hat a=\bar y-\hat b\bar x$.

*Properties of the least squares line: (1) the line goes through the point $(\bar{X},\bar{Y})$; (2) the sum of the residuals is equal to zero; (3) the estimates are unbiased, that is, their expected values are equal to the true slope and intercept values.

*Regression coefficients inference: when the error terms are normally distributed, the estimate of the slope coefficient has a normal distribution with mean equal to $b$ and standard error $SE(\hat b)=s_{\hat b}=\sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}\hat\varepsilon_{i}^{2}}{\sum_{i=1}^{N}(x_{i}-\bar x)^{2}}}$. This allows us to construct confidence intervals for the slope and intercept of the linear model. Given that the standardized estimate $\frac{\hat b-b}{s_{\hat b}}$ follows a T distribution with $N-2$ degrees of freedom, the confidence interval for $b$ is $[\hat b-s_{\hat b}t(\frac{\alpha}{2},N-2),\hat b+s_{\hat b}t(\frac{\alpha}{2},N-2)]$. The corresponding test for the regression slope coefficient $b$ is analogously computed ($H_{0}:b=b_{0}$ vs. $H_{a}: b\neq b_{0}$), and the test statistic is $t_{0}=\frac{\hat b-b_{0}}{s_{\hat b}} \sim T_{\{df=N-2\}}.$
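The closed-form estimates and the first two properties of the least squares line can be verified numerically. The sketch below reuses the six height and weight observations from the Hands-on Example; the variable names are arbitrary.

```r
# Closed-form least squares estimates for the 6 height/weight observations
x <- c(167, 170, 160, 152, 157, 160)   # height (cm)
y <- c(60, 64, 57, 46, 55, 50)          # weight (kg)

b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a_hat <- mean(y) - b_hat * mean(x)                                  # intercept
b_alt <- cor(x, y) * sd(y) / sd(x)     # slope as correlation times s_y / s_x

res <- y - (a_hat + b_hat * x)         # residuals of the fitted line

print(c(intercept = a_hat, slope = b_hat, slope_alt = b_alt))
print(coef(lm(y ~ x)))                         # lm() gives the same estimates
print(sum(res))                                # property (2): zero up to rounding
print(a_hat + b_hat * mean(x) - mean(y))       # property (1): zero up to rounding
```

Unbiasedness, property (3), is a statement about repeated sampling and cannot be checked from a single dataset.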

====Earthquake Data Example====
* Use the [[SOCR_Data_Dinov_021708_Earthquakes |SOCR Data Earthquakes]] to find the best linear fit between the longitude and the latitude of the California earthquakes since 1900.
* The SOCR Geomap of these earthquakes, [http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Earthquake5Data_GoogleMap.html SOCR Google Map Earthquakes], shows the SLR fit to the earthquake data.

===Applications===
* [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_SLR This article] presents the SLR analysis activity in SOCR analysis. It starts with a general introduction to the SLR model and then illustrates the method in detail with various examples. The article helps readers read SLR results, interpret the slope and intercept, and interpret various data and resulting plots, including scatter plots, normal QQ plots and diagnostic plots such as the residual-on-fit plot.

* [http://europepmc.org/abstract/MED/3840866 This article], titled Simple Linear Regression in Medical Research, discusses the method of fitting a straight line to data by linear regression and focuses on examples from 36 original articles published in 1978 and 1979. It concludes that investigators need to become better acquainted with residual plots, which give insight into how well the fitted line models the data, and with confidence bounds for regression lines. Statistical computing packages enable investigators to use these techniques easily.

* [http://ww2.coastal.edu/kingw/statistics/R-tutorials/simplelinear.html This article] presents an R tutorial for simple linear regression. It starts with basic checks of the data and comments on the patterns found, and then fits a linear regression model using height and weight. It also augments the regression with Lowess smoothing and discusses the locally weighted scatterplot smoother. The article is a comprehensive study of SLR and correlation in R.

* [http://www.tandfonline.com/doi/abs/10.1080/00401706.1975.10489279 This article], titled The Probability Plot Correlation Coefficient Test for Normality, introduces the normal probability plot correlation coefficient as a test statistic in complete samples for the composite hypothesis of normality. The proposed test statistic is conceptually simple and is readily extendable to testing non-normal distribution hypotheses. The paper includes an empirical power study showing that the normal probability plot correlation coefficient compares favorably with seven other normality test statistics.

===Software===
* [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
* [http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm Normal Chi-Squared F Tables]

===Problems===
Example 1: Simple linear correlation and regression in R:
> library(MASS)
> data(cats)
> str(cats)

'data.frame': 144 obs. of 3 variables:
$ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
$ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

> summary(cats)

Sex Bwt Hwt
F:47 Min. :2.000 Min. : 6.30
M:97 1st Qu.:2.300 1st Qu.: 8.95
Median :2.700 Median :10.10
Mean :2.724 Mean :10.63
3rd Qu.:3.025 3rd Qu.:12.12
Max. :3.900 Max. :20.50
[[Image:SMHS SLR Fig10.png|300]]

A positive correlation between two variables X and Y means that if X increases, this will cause the value of Y to increase.
*(a) This is always true.
*(b) This is sometimes true.
*(c) This is never true.

The correlation between working out and body fat was found to be exactly -1.0. Which of the following would not be true about the corresponding scatterplot?
*(a) The slope of the best line of fit should be -1.0.
*(b) All the points would lie along a perfect straight line, with no deviation at all.
*(c) The best fitting line would have a downhill (negative) slope.
*(d) 100% of the variance in body fat can be predicted from workout.

Suppose that the correlation between working out and body fat was found to be exactly -1.0. Which of the following would NOT be true, about the corresponding scatterplot?
*(a) All points would lie along a straight line, with no deviation at all.
*(b) 100% of the variance in body fat can be predicted from the workout.
*(c) The slope of the linear model is -1.0.
*(d) The best fitting line would have a negative slope.

If the correlation coefficient is 0.80, then:
*(a) The explanatory variable is usually less than the response variable.
*(b) The explanatory variable is usually more than the response variable.
*(c) None of the statements are correct.
*(d) Below-average values of the explanatory variable are more often associated with below-average values of the response variable.
*(e) Below-average values of the explanatory variable are more often associated with above-average values of the response variable.

Two different researchers wanted to study the relationship between math anxiety and taking exams. Researcher A measured anxiety with a scale that had a minimum score of 0 and a maximum score of 20, and a final exam that had a minimum score of 0 and a maximum score of 50. He tested 120 students. Researcher B measured anxiety with a scale that had a minimum of 0 and a maximum of 30, and a final exam that had a minimum score of 0 and a maximum score of 35. He tested 60 students. Researcher A found that the coefficient of correlation between a student's math anxiety and his or her score on the final was -0.60. Researcher B found the correlation between a student's math anxiety and his or her score on the final was -0.30.
*(a) The coefficient of correlation for researcher A is twice as strong as the coefficient of correlation for researcher B.
*(b) Based on the study by researcher A one can conclude that high math anxiety is the reason that a lot of the students do not do well in math.
*(c) Given that the coefficient of correlation shows the association between standardized scores, one can conclude that for researcher A a greater percentage of the students who have above average anxiety are likely to have below average score on the final.
*(d) Given that the minimum and the maximum values for math and anxiety are so different for the two researchers one cannot compare the coefficient of correlation found by these two researchers.

In the early 1900's when Francis Galton and Karl Pearson measured 1078 pairs of fathers and their grown-up sons, they calculated that the mean height for fathers was 68 inches with deviation of 3 inches. For their sons, the mean height was 69 inches with deviation of 3 inches. (The actual deviations are a bit smaller, but we will work with these values to keep the calculations simple.) The correlation coefficient was 0.50. Use the information to calculate the slope of the linear model that predicts the height of the son from the height of the father.
*(a) 35.00
*(b) 0.50
*(c) The slope cannot be determined without the actual data
*(d) 3/3 = 1.00

Suppose that wildlife researchers monitor the local alligator population by taking aerial photographs on a regular schedule. They determine that the best fitting linear model to predict weight in pounds from the length of the gators in inches is:
Weight = -393 + 5.9*Length, with $r^{2}$ = 0.836.
Which of the following statements is true?
*(a) A gator that is about 10 inches above average in length is about 59 pounds above the average weight of these gators.
*(b) The correlation between a gator's length and weight is 0.836.
*(c) The correlation between a gator's height and weight cannot be determined without the actual data.
*(d) The correlation between a gator's height and weight is about -0.914.

Which of the following is NOT a property of the LSR Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The average x value and the average y value lies on the LSR Line
*(c) The sum of squared residuals is minimized
*(d) The sum of the residuals = 0

Suppose that the linear model that predicts fat content in grams from the protein of selected items from Burger Queen menu is: Fat = 6.83 + 0.97*Protein. We learn that there are actually 20 grams of fat in the Chucking burger that has 20 grams of protein. Which of the following statements is true?
*(a) The linear model underestimates the actual fat content and produces a residual of -6.23
*(b) The linear model overestimates the fat content and produces a residual of -6.23
*(c) The linear model underestimates the fat content and produces a residual of -6.23
*(d) The linear model overestimates the fat content and produces a residual of 6.2

Which statement describes the principle of "least squares" that we use in determining the best-fit line?
*(a) The best-fit line minimizes the distances between the observed values and the predicted values.
*(b) The best-fit line minimizes the sum of the squared residuals.
*(c) The best-fit line minimizes the sum of the residuals.
*(d) The best-fit line minimizes the sum of the distances between the actual values and the predicted values.

The scores of midterm and final exams for a random sample of Stats 10 students can be summarized as follows:
Mean of midterm score = 36.92; SD of midterm score = 37.79 Mean of final score = 24.71; SD of final score= 25.21 r= 0.978
Choose one answer.
*(a) 23.44
*(b) 0.62
*(c) 25.21
*(d) 35

Which of the following is NOT a property of the Least Squares Regression Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The sum of squared residuals is minimized
*(c) The average x value and the average y value lie on the LSR Line
*(d) The sum of the residuals = 0

Tom and Sue wanted to estimate the average self-esteem score. The true population average for self esteem score is 20. Tom estimates that average by taking a sample of size n and then constructing a confidence interval. Which of the following is true?
: I. The resulting interval will contain 20.
: II. The 95 percent confidence interval for n = 100 will generally be more narrow than the 95 percent confidence interval for n = 50.
: III. For n = 100, the 95 percent confidence interval will be wider than the 90 percent confidence interval.
*(a) II only
*(b) III only
*(c) I only
*(d) II and III

A simple random sample of 1000 persons is taken to estimate the percentage of Democrats in a large population. It turns out that 543 of the people in the sample are Democrats. Is the following statement true or false? Explain (51%, 57.5%) is approximately a 95% confidence interval for the sample percentage of democrats.
*(a) False, that is the approximate confidence interval for p. There is no confidence interval for the sample proportion.
*(b) True, we did the computations and those are approximately the numbers for the confidence interval for p.
*(c) True, that is the confidence interval for the sample mean.
*(d) False, the confidence interval for the sample proportion should be smaller than that.

Use the linear model to predict the height of a son whose father's height is 6 feet.
*(a) The son's height = 35 + 0.5(6) inches
*(b) The son's height = 35 + 0.5(72) inches
*(c) The "Regression Effect" states that the son will be a bit taller than his father
*(d) Cannot be determined without the data

A statistician wants to predict Z from Y. He finds that r-squared is 5%. Which one of the following conclusions is correct?
*(a) The coefficient of correlation between Y and Z is 0.05
*(b) Y explains 5% of the variance in Z
*(c) Y is a good predictor of Z
*(d) Z is a good predictor of Y

===References===
*[[Probability_and_statistics_EBook#Chapter_X:_Correlation_and_Regression | SOCR Correlation and Regression Chapter]]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_SLR}}

SMHS SLR

2014-09-22T14:20:42Z

Jslavine: /* Scientific Methods for Health Sciences - Correlation and Simple Linear Regression (SLR) */

==[[SMHS| Scientific Methods for Health Sciences]] - Correlation and Simple Linear Regression (SLR) ==

===Overview===
Many scientific applications involve the analysis of relationships between two or more variables involved in studying a process of interest. In this section, we will study the correlations between 2 variables and start with simple linear regressions. Consider the simplest of all situations in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the association between the two variables with an appropriate model for the given observations. The first part of this lecture will discuss correlations; we will then elaborate on the use of SLR to assess correlations.

===Motivation===
The analysis of relationships between two or more variables involved in a process of interest is widely applicable. We begin with the simplest of all situations, in which bivariate data (i.e., X and Y) are measured for a process, and we are interested in determining the relationship between these two variables using an appropriate model for the observations (e.g., fitting a straight line to the pairs of (X,Y) data). For example, we measured students' math scores on a final exam, and we want to find out whether there is any association between the final score and their participation rate in the math class. Another potential relationship of interest might be whether there is an association between weight and lung capacity. Simple linear regression is a useful method for addressing these questions, and it is particularly appropriate for assessing associations in simple cases.

===Theory===

*Correlation: The correlation coefficient $(-1\le\rho\le 1)$ is a measure of linear association, or clustering around a line, of multivariate data. The primary relationship between two variables (X,Y) can be summarized by $(\mu_{X},\sigma_{X})$, $(\mu_{Y},\sigma_{Y})$ and the correlation coefficient, denoted by $\rho=\rho(X,Y)$.
**The correlation is defined only if both of the standard deviations are finite and are nonzero and it is bounded by -1≤$\rho$≤1.
**If $\rho$=1, perfect positive correlation (straight line relationship between the two variables); if $\rho$=0, no correlation (random cloud scatter), i.e., no linear relation between X and Y; if $\rho$=-1, a perfect negative correlation between the variables.
**$\rho(X,Y)=\frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}=\frac{E((X-\mu_{X})(Y-\mu_{Y}))}{\sigma_{X}\sigma_{Y}}=\frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^{2})-E^{2}(X)}\sqrt{E(Y^{2})-E^{2}(Y)}},$ where $E$ is the expectation operator and $cov$ is the covariance; $\mu_{X}=E(X)$, $\sigma_{X}^{2}=E(X^{2})-E^{2}(X)$, and similarly for the second variable, $Y$, and $cov(X,Y)=E(XY)-E(X)E(Y)$.
**Sample correlation: replace the unknown expectations and standard deviations by the sample means and sample standard deviations. Suppose $\{X_{1},X_{2},\ldots,X_{n}\}$ and $\{Y_{1},Y_{2},\ldots,Y_{n}\}$ are bivariate observations of the same process, and $(\mu_{X},\sigma_{X})$, $(\mu_{Y},\sigma_{Y})$ are the means and standard deviations for the $X$ and $Y$ measurements respectively. Then $\rho(x,y)=\frac{\sum x_{i} y_{i}-n\bar{x}\bar{y}}{(n-1)s_{x} s_{y}}=\frac{n \sum x_{i} y_{i}-\sum x_{i}\sum y_{i}}{\sqrt{n\sum x_{i}^{2} -(\sum x_{i})^{2}}\sqrt{n\sum y_{i}^{2}-(\sum y_{i})^{2}}}$, or equivalently $\rho(x,y)=\frac{\sum(x_{i}-\bar x)(y_{i}-\bar y)}{(n-1)s_{x} s_{y}}=\frac{1}{n-1}\sum\frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}$, where $\bar x$ and $\bar y$ are the sample means and $s_{x}$ and $s_{y}$ are the sample standard deviations of $X$ and $Y$.

====Hands-on Example====
Human weight and height: suppose we took only 6 of the over 25,000 observations of human weight and height included in the [[SOCR_Data_Dinov_020108_HeightsWeights| SOCR dataset]].
<center>
{| class="wikitable" style="text-align:center; width:95%" border="1"
|-
|Subject Index|| Height $(x_{i})$ in cm || Weight $(y_{i})$ in kg || $x_{i}-\bar x$ ||$y_{i}-\bar y$ || $(x_{i}-\bar x)^{2}$ || $(y_{i}-\bar y)^{2}$ ||$(x_{i}-\bar x)(y_{i}-\bar y)$
|-
|1||167||60|| 6|| 4.67|| 36|| 21.82|| 28.02
|-
|2|| 170|| 64|| 9|| 8.67 ||81|| 75.17|| 78.03
|-
|3|| 160|| 57|| -1|| 1.67|| 1|| 2.79|| -1.67
|-
|4|| 152|| 46|| -9|| -9.33|| 81|| 87.05 ||83.97
|-
|5|| 157|| 55|| -4|| -0.33|| 16|| 0.11|| 1.32
|-
|6|| 160|| 50|| -1|| -5.33|| 1|| 28.41|| 5.33
|-
|Total||966 ||332 ||0 ||0 ||216|| 215.33||195
|}
</center>

$\bar x=\frac{966}{6}=161,\bar y=\frac{332}{6}=55.33,s_{x}=\sqrt{\frac{216}{5}}=6.57, s_{y}=\sqrt{\frac {215.33}{5}}=6.56.$

$\rho(x,y)=\frac{1}{n-1}\sum\frac{x_{i}-\bar x}{s_{x}}\frac{y_{i}-\bar y}{s_{y}}=0.904$
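
The hand calculation can be checked in R. The following is a minimal sketch using the six height and weight pairs from the table; the variable names (<code>height</code>, <code>weight</code>, <code>r_manual</code>, <code>r_builtin</code>) are chosen here for illustration and are not part of the original example.

```r
# Heights (cm) and weights (kg) of the six subjects in the table
height <- c(167, 170, 160, 152, 157, 160)
weight <- c(60, 64, 57, 46, 55, 50)
n <- length(height)

# Sample means and sample standard deviations (denominator n - 1)
x_bar <- mean(height)
y_bar <- mean(weight)
s_x <- sd(height)
s_y <- sd(weight)

# Sample correlation: sum of products of standardized scores, divided by n - 1
r_manual <- sum((height - x_bar) / s_x * (weight - y_bar) / s_y) / (n - 1)

# Built-in Pearson correlation, for comparison
r_builtin <- cor(height, weight)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree with each other up to floating-point error, since <code>cor()</code> computes the same Pearson coefficient by default.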

====Slope inference====
We can conduct inference on the linear relationship between two quantitative variables through inference on the slope. The basic idea is to regress the dependent variable on the predictor, assuming they have a linear relationship described by the model $y=\alpha+\beta x+\varepsilon$, where $\beta$ is the true slope of the linear relationship, $\alpha$ is the intercept of the true line on the y-axis, and $\varepsilon$ is the random variation. The slope describes the change in the dependent variable $y$ associated with a unit change in $x$.

*Conditions for the test of the significance of the slope $\beta$: (1) there is evidence of a real linear relationship, which can be checked with the initial scatterplot of y vs. x and the residual plots; (2) observations must be independent, and the best evidence is a random sample; (3) the variation about the line must be constant, that is, the variance of the residuals should be constant, which can be checked with plots of the residuals; (4) the response variable must have a normal distribution centered on the line, which can be checked with a histogram or normal probability plot.
*Formula we use: $t=\frac{b-\beta}{SE_{b}}$, where $b$ is the estimated slope (the statistic), $\beta$ is the parameter being tested, and $SE_{b}$ is the standard error of $b$. The null hypothesis is $\beta=0$, that is, there is no linear relationship between y and x, so under the null hypothesis the test statistic is $t=\frac {b} {SE_{b}}$.

====R Examples====
=====Body Fat and Age=====
Consider a study conducted to see if body fat is associated with age. The data include 18 subjects, with the percentage of body fat and the age of each subject.

<center>
{| class="wikitable" style="text-align:center;" border="1"
|-
|Age|| Percentage of Body Fat
|-
|23||9.5
|-
|23||27.9
|-
|27||7.8
|-
|27|| 17.8
|-
|39 ||31.4
|-
|41|| 25.9
|-
|45 ||27.4
|-
|49|| 25.2
|-
|50 ||31.1
|-
|53 ||34.7
|-
|53 ||42
|-
|54 ||29.1
|-
|56 ||32.5
|-
|57 ||30.3
|-
|58|| 33
|-
|58|| 33.8
|-
|60|| 41.1
|-
|61|| 34.5
|}
</center>

The hypotheses tested are $H_{0}:\beta=0$ vs. $H_{a}:\beta\ne0$; we use a t-test here, given that the data are a random sample from the population.

In R
###
###
## first check the linearity of the relationship using a scatterplot
x <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
y <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)
plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')
cor(x,y)

[1] 0.7920862

[[Image:SMHS SLR Fig 1.png|500px]]

The scatterplot shows that there is a linear relationship between x and y, and there is a strong positive association of $r=0.7920862$, which further confirms the eyeball test from the scatterplot about the linear relationship between age and percentage of body fat.

Then we fit a simple linear regression of y on x and draw the scatterplot along with the fitted line:

fit <- lm(y~x)

plot(x,y,main='Scatterplot',xlab='Age',ylab='% fat')

abline(fit)

[[Image:SMHS SLR Fig 2.png|500px]]

summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-10.2166 -3.3214 -0.8424 1.9466 12.0753

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.2209 5.0762 0.635 0.535
x 0.5480 0.1056 5.191 8.93e-05 ***

plot(resid(fit),main='Residual Plot') # residuals against observation index
abline(h=0) # horizontal reference line at zero

[[Image:SMHS SLR Fig3.png|500px]]

qqnorm(resid(fit)) # check the normality of the residuals

[[Image:SMHS SLR Fig4.png|500px]]

From the residual plot and the QQ plot of the residuals, we can see that the data meet the constant variance and normality requirements, with no heavy tails, so the regression model is reasonable. From the summary of the regression model, the t-test on the slope has t value 5.191 and p-value 8.93e-05. We can reject the null hypothesis of no linear relationship and conclude that there is a significant linear relationship between age and percentage of body fat at the 5% level of significance.

The confidence interval for the parameter tested is $b±t^{*} SE_{b}$, where b is the slope of the least squares regression line, $t^{*}$ is the upper $\frac {1-C} {2}$ critical value from the t distribution with n-2 degrees of freedom, and $SE_{b}$ is the standard error of the slope.

The standard error of the slope is 0.1056, so the 95% CI of the slope is $(0.5480-0.1056*2.12,0.5480+0.1056*2.12)$, that is, $(0.324,0.772)$. So, we are 95% confident that the true slope lies between 0.324 and 0.772.
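
The same interval can be obtained directly from the fitted model. The sketch below refits the body fat data under illustrative names (<code>age</code>, <code>fat</code>, <code>fit_fat</code>, which are assumptions of this sketch and differ from the <code>x</code>, <code>y</code>, <code>fit</code> used above) and compares the manual formula with the base R function <code>confint()</code>.

```r
# Age and percentage of body fat for the 18 subjects
age <- c(23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61)
fat <- c(9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1, 34.7, 42,
         29.1, 32.5, 30.3, 33, 33.8, 41.1, 34.5)

fit_fat <- lm(fat ~ age)

# Slope estimate and its standard error from the coefficient table
coef_table <- coef(summary(fit_fat))
b_hat <- coef_table["age", "Estimate"]
se_b  <- coef_table["age", "Std. Error"]

# Upper 2.5% critical value of the t distribution with n - 2 degrees of freedom
t_star <- qt(0.975, df = df.residual(fit_fat))

# Manual 95% confidence interval: b +/- t* x SE_b
ci_manual <- c(lower = b_hat - t_star * se_b, upper = b_hat + t_star * se_b)

# Built-in interval for the slope, for comparison
ci_builtin <- confint(fit_fat, "age", level = 0.95)

print(ci_manual)
print(ci_builtin)
```

Using <code>qt()</code> instead of a rounded table value avoids small rounding differences in the interval endpoints.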

=====Baseball Example=====
We are studying a random sample (size 16) of baseball teams; the data show each team's batting average and the total number of runs scored for the season.

<center>
{| class="wikitable" style="text-align:center; width:35%" border="1"
|-
|Batting average|| Number of runs scored
|-
|0.294|| 968
|-
|0.278|| 938
|-
|0.278 ||925
|-
|0.27|| 887
|-
|0.274 ||825
|-
|0.271|| 810
|-
|0.263|| 807
|-
|0.257 ||798
|-
|0.267 ||793
|-
|0.265 || 792
|-
|0.256|| 764
|-
|0.254|| 752
|-
|0.246|| 740
|-
|0.266|| 738
|-
|0.262||731
|-
|0.251 ||708
|}
</center>

In R:
x <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
y <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)
cor(x,y)
[1] 0.8654923

The correlation between x and y is 0.8655, which is a fairly strong positive correlation. So it is reasonable to assume a linear regression model of the number of runs scored on the batting average.

fit <- lm(y~x)
summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-74.427 -26.596 1.899 38.156 57.062

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) -706.2 234.9 -3.006 0.00943 **
x 5709.2 883.1 6.465 1.49e-05 ***

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.98 on 14 degrees of freedom

Multiple R-squared: 0.7491, Adjusted R-squared: 0.7312

F-statistic: 41.79 on 1 and 14 DF, p-value: 1.486e-05

plot(x,y,main='Scatterplot',xlab='Batting average',ylab='Number of runs')

abline(fit)

[[Image:SMHS_SLR_Fig5.png|500px]]

par(mfrow=c(1,2))

plot(resid(fit),main='Residual Plot')

abline(h=0)

qqnorm(resid(fit))

[[Image:SMHS SLR Fig6.png|500px]]

The estimated value of the slope is 5709.2, with standard error 883.1, t value = 6.465, and p-value 1.49e-05, so we reject the null hypothesis and conclude that there is a significant linear relationship between the batting average and the number of runs. The 95% CI of the slope is $(5709.2-883.1*2.145,5709.2+883.1*2.145)$, that is, approximately $(3815,7603)$. So, we are 95% confident that the true slope lies between about 3815 and 7603.
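
The quantities reported by <code>summary()</code> are linked by simple identities that can be checked directly. The sketch below, which uses illustrative names (<code>avg</code>, <code>runs</code>, <code>fit_runs</code>) that are assumptions of this example, verifies that the t value is the slope divided by its standard error and that, in simple linear regression, R-squared equals the squared correlation.

```r
# Batting averages and runs scored for the 16 teams
avg  <- c(0.294, 0.278, 0.278, 0.270, 0.274, 0.271, 0.263, 0.257,
          0.267, 0.265, 0.256, 0.254, 0.246, 0.266, 0.262, 0.251)
runs <- c(968, 938, 925, 887, 825, 810, 807, 798,
          793, 792, 764, 752, 740, 738, 731, 708)

fit_runs <- lm(runs ~ avg)
coef_table <- coef(summary(fit_runs))

# Slope, its standard error, and the t statistic for H0: slope = 0
b_hat   <- coef_table["avg", "Estimate"]
se_b    <- coef_table["avg", "Std. Error"]
t_value <- b_hat / se_b

# In simple linear regression, R-squared is the squared correlation
r_squared <- summary(fit_runs)[["r.squared"]]
r_xy      <- cor(avg, runs)

# 95% confidence interval for the slope
ci_slope <- confint(fit_runs, "avg", level = 0.95)

print(c(t_value = t_value, r_squared = r_squared, r_xy_squared = r_xy^2))
print(ci_slope)
```

Recomputing the interval with <code>confint()</code> is a convenient guard against transcription errors when copying a standard error by hand.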

You can also use [http://www.socr.ucla.edu/htmls/ana/SimpleRegression_Analysis.html SOCR SLR Analysis Simple Regression] to copy-paste the data in the applet, estimate regression slope and intercept and compute the corresponding statistics and p-values.

Simple Linear Regression Results:

Mean of C1 = 46.33333
Mean of C2 = 28.61111
Regression Line:
C2 = 3.22086 + 0.5479910213243551 C1
Correlation(C1, C2) = .79209
R-Square = .62740
Intercept:
Parameter Estimate: 3.22086
Standard Error: 5.07616
T-Statistics: .63451
P-Value: .53472
Slope:
Parameter Estimate: .54799
Standard Error: .10558
T-Statistics: 5.19053
P-Value: .00009

[[Image:SMHS SLR Fig7.png|600px]]

[[Image:SMHS SLR Fig8.png|600px]]

[[Image:SMHS SLR Fig9.png|600px]]

====Statistical inference on correlation coefficient====
To test $H_{0}:r=\rho$ vs. $H_{a}:r\neq\rho$, where $r$ is the correlation between X and Y, we use the test statistic $t_o = \frac{r-\rho}{\sqrt{\frac{1-r^2}{N-2}}}$, which follows a [[AP_Statistics_Curriculum_2007_StudentsT|T distribution]] with $df=N-2$.

Comparing two correlation coefficients: Fisher's transformation provides a mechanism for comparing two correlation coefficients using the Normal distribution. Suppose we have 2 independent paired samples
$\{(X_{i},Y_{i})\}_{i=1}^{n_{1}}$ and $\{(U_{j},V_{j})\}_{j=1}^{n_{2}}$ with $r_{1}=corr(X,Y)$ and $r_{2}=corr(U,V)$, and we are testing $H_{0}: r_{1}=r_{2}$ vs. $H_{a}:r_{1}\neq r_{2}$. The Fisher's transformation of a correlation is defined by $\hat{r}=\frac{1}{2}\log_{e}\left|\frac{1+r}{1-r}\right|$. Transforming the two correlation coefficients separately yields $r_{11}=\frac{1}{2}\log_{e}\left|\frac {1+r_{1}}{1-r_{1}}\right|$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right|$. The test statistic is $Z_{0}=\frac {r_{11}-r_{22}} {\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}}$, which approximately follows a standard Normal distribution under $H_{0}$.

Note that the hypotheses for the single and double sample inference are $H_{0}:r=0$ vs. $H_{a}:r\neq 0$ and $H_{0}:r_{1}-r_{2}=0$ vs. $H_{a}:r_{1}-r_{2}\neq 0$ respectively. An estimate of the standard deviation of the (Fisher transformed!) correlation is $SD(\hat{r})=\sqrt{\frac{1}{n-3}}$; thus, under $H_{0}:r=0$, $\hat{r}\sim N\bigg (0,\sqrt{\frac{1} {n-3}}\bigg )$.

=====Brain Volume Example=====
The brain volumes (responses) and age (predictors) for 2 cohorts of subjects (Group 1 and Group 2).

<center>
{|class="wikitable" style="text-align:center; width:90%" border="1"
|-
|Group1 ||Age1 ||Volume1||Group2||Age2 ||Volume2
|-
|1|| 58|| 0.269609 ||2|| 59 ||0.27905
|-
|1|| 55|| 0.277243 ||2|| 50 ||0.262916
|-
|1|| 61|| 0.236264|| 2|| 58|| 0.290697
|-
|1|| 70|| 0.218015|| 2|| 58|| 0.269361
|-
|1|| 38|| 0.287205|| 2|| 61|| 0.268247
|-
|1|| 41 ||0.307387 ||2|| 57|| 0.294204
|-
|1|| 40|| 0.271063|| 2|| 50|| 0.292699
|-
|1|| 25 ||0.307688|| 2|| 38|| 0.273969
|-
|1|| 70|| 0.237811|| 2|| 57|| 0.29049
|-
|1|| 49|| 0.293371|| 2|| 64|| 0.286564
|-
|1|| 56|| 0.252592|| 2|| 71|| 0.257386
|-
|1|| 56|| 0.251349|| 2|| 34|| 0.314958
|-
|1 ||40|| 0.29616 ||2|| 53|| 0.298022
|-
|1|| 50|| 0.249596|| 2|| 53|| 0.269229
|-
|1|| 55|| 0.282721|| 2|| 25|| 0.270634
|-
|1 ||69|| 0.247565|| 2|| 61|| 0.266905
|}
</center>

We have two independent groups, with $Y=volume1$ (response) and $X=age1$ (predictor), and $V=volume2$ and $U=age2$, where $n_{1}=27$, $n_{2}=27$. We compute the 2 correlation coefficients: $r_{1}=corr(X,Y)=-0.75338$ and $r_{2}=corr(U,V)=-0.49491.$ Using the Fisher's transformation we obtain $r_{11}=\frac{1}{2}\log_{e}\left|\frac {1+r_{1}}{1-r_{1}}\right| = -0.980749$ and $r_{22}=\frac{1}{2}\log_{e}\left|\frac{1+r_{2}}{1-r_{2}}\right| = -0.5425423,$ so that $Z_{0}=\frac {r_{11}-r_{22}} {\sqrt {\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}} = -1.517993.$ The corresponding 1-sided p-value is $0.064508$ and the double-sided p-value is $0.129016$.
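
The two-sample comparison can be sketched in R from the summary statistics alone. This example assumes the standard error of the difference of the Fisher-transformed correlations is $\sqrt{\frac{1}{n_{1}-3}+\frac{1}{n_{2}-3}}$ and uses the base R function <code>atanh()</code>, which equals $\frac{1}{2}\log_{e}\frac{1+r}{1-r}$ for $|r|<1$; the variable names are illustrative only.

```r
# Reported correlations and sample sizes for the two cohorts
r1 <- -0.75338; n1 <- 27
r2 <- -0.49491; n2 <- 27

# Fisher transformation of each correlation
z1 <- atanh(r1)
z2 <- atanh(r2)

# Standard error of the difference of two independent transformed correlations
se_diff <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

# Test statistic, approximately standard Normal under H0: r1 = r2
z0 <- (z1 - z2) / se_diff

# One-sided (lower tail, since z0 is negative) and two-sided p-values
p_one_sided <- pnorm(z0)
p_two_sided <- 2 * pnorm(-abs(z0))

print(c(z1 = z1, z2 = z2, z0 = z0,
        p_one_sided = p_one_sided, p_two_sided = p_two_sided))
```

Because the correlations are entered to only five decimal places, the computed values may differ from the reported ones in the last digits.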

===Simple linear regression (SLR)===
Modeling of the linear relations between two variables using regression analysis.
Two designs are common: either $Y$ is an observed variable and $X$ is specified by the researcher (e.g., $Y$ is hair growth after $X$ months for individuals at a certain dose level of hair growth cream), or $X$ and $Y$ are both observed variables.
*Estimating the best linear fit: the simple linear regression model $Y=a+bX+\varepsilon$ can be estimated using least squares, which fits the line minimizing the sum of the squared residuals $\hat\varepsilon_{i}=\hat y_{i}-y_{i}$, that is, $\sum_{i=1}^{N} \hat\varepsilon_{i}^{2}=\sum_{i=1}^{N}(\hat y_{i}-y_{i})^{2}$, where $y_{i}$ and $\hat y_{i}=a+bx_{i}$ are the observed and predicted values of $Y$ for $x_{i}$.

*Solving for the minimization problem:
: $\hat{b} = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y}) } {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}} {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2} = \rho_{X,Y} \frac {s_y}{s_x},$
: where [[AP_Statistics_Curriculum_2007_GLM_Corr | $\rho_{X,Y}$ is the correlation coefficient]].

: $\hat a=\bar y-\hat b\bar x$.

*Properties of the least square line: (1) the line goes through the point of ($\bar{X},\bar{Y}$); (2) the sum of the residuals is equal to zero; (3) the estimates are unbiased, that is their expected values are equal to the real slope and intercept values.

*Regression coefficients inference: when the error terms are normally distributed, the estimate of the slope coefficient has a normal distribution with mean equal to $b$ and standard error $SE(\hat b) = s_{\hat b}=\sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}\hat\varepsilon_{i}^{2}}{\sum_{i=1}^{N}(x_{i}-\bar x)^{2}}}$. This allows us to construct confidence intervals for the slope and intercept of the linear model. Given that $\frac{\hat b-b}{s_{\hat b}}$ follows a T distribution with $N-2$ degrees of freedom, the confidence interval for $b$ is $[\hat b-s_{\hat b}t(\frac{\alpha}{2},N-2),\hat b+s_{\hat b}t(\frac{\alpha}{2},N-2)]$. The corresponding test for the regression slope coefficient $b$ is computed analogously $(H_{0}:b=b_{0}$ vs. $H_{a}: b\neq b_{0})$, with test statistic $t_{0}=\frac{\hat b-b_{0}}{s_{\hat b}} \sim T_{\{df=N-2\}}.$
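
The least squares formulas can be evaluated step by step and compared with <code>lm()</code>. The sketch below reuses the age and body fat data from the R examples; the intermediate names (<code>Sxx</code>, <code>Sxy</code>, <code>b_hat</code>, <code>a_hat</code>, <code>se_b</code>, <code>t0</code>) are introduced here for illustration.

```r
# Age (x) and percentage of body fat (y) for 18 subjects
x <- c(23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61)
y <- c(9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1, 34.7, 42,
       29.1, 32.5, 30.3, 33, 33.8, 41.1, 34.5)
N <- length(x)

# Sums of squares and cross-products about the means
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

# Least squares estimates of the slope and intercept
b_hat <- Sxy / Sxx
a_hat <- mean(y) - b_hat * mean(x)

# Residuals and the standard error of the slope
res  <- y - (a_hat + b_hat * x)
se_b <- sqrt((sum(res^2) / (N - 2)) / Sxx)

# t statistic and two-sided p-value for H0: b = 0
t0      <- b_hat / se_b
p_value <- 2 * pt(-abs(t0), df = N - 2)

# Equivalent slope formula: correlation times the ratio of standard deviations
b_from_r <- cor(x, y) * sd(y) / sd(x)

print(c(a_hat = a_hat, b_hat = b_hat, se_b = se_b, t0 = t0, p_value = p_value))
print(coef(lm(y ~ x)))  # should match a_hat and b_hat
```

Computing the estimates by hand once makes it easier to see which parts of the <code>summary(lm())</code> output correspond to which formula.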

====Earthquake Data Example====
* Use the [[SOCR_Data_Dinov_021708_Earthquakes |SOCR Data Earthquakes]] to find the best linear fit between the longitude and the latitude of the California earthquakes since 1900.
* The SOCR Geomap of these earthquakes, [http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Earthquake5Data_GoogleMap.html SOCR Google Map Earthquakes], shows the SLR fit to the earthquake data.

===Applications===
* [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_AnalysisActivities_SLR This article] presents the SLR analysis activity in SOCR analysis. It starts with a general introduction to the SLR model and then illustrates the method in detail with various examples. The article helps readers read SLR results, interpret the slope and intercept, and interpret various data and resulting plots, including scatter plots, normal QQ plots and diagnostic plots such as the residual-on-fit plot.

* [http://europepmc.org/abstract/MED/3840866 This article], titled Simple Linear Regression in Medical Research, discusses the method of fitting a straight line to data by linear regression and focuses on examples from 36 original articles published in 1978 and 1979. It concludes that investigators need to become better acquainted with residual plots, which give insight into how well the fitted line models the data, and with confidence bounds for regression lines. Statistical computing packages enable investigators to use these techniques easily.

* [http://ww2.coastal.edu/kingw/statistics/R-tutorials/simplelinear.html This article] presents an R tutorial for simple linear regression. It starts with basic checks of the data and comments on the patterns found, and then fits a linear regression model using height and weight. It also augments the regression with Lowess smoothing and discusses the locally weighted scatterplot smoother. The article is a comprehensive study of SLR and correlation in R.

* [http://www.tandfonline.com/doi/abs/10.1080/00401706.1975.10489279 This article], titled The Probability Plot Correlation Coefficient Test for Normality, introduces the normal probability plot correlation coefficient as a test statistic in complete samples for the composite hypothesis of normality. The proposed test statistic is conceptually simple and is readily extendable to testing non-normal distribution hypotheses. The paper includes an empirical power study showing that the normal probability plot correlation coefficient compares favorably with seven other normality test statistics.

===Software===
* [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
* [http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate Normal Experiment]
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm Normal Chi-Squared F Tables]

===Problems===
Example 1: Simple linear correlation and regression in R:
<pre>
> library(MASS)
> data(cats)
> str(cats)
'data.frame':   144 obs. of  3 variables:
 $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
 $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

> summary(cats)
 Sex         Bwt             Hwt
 F:47   Min.   :2.000   Min.   : 6.30
 M:97   1st Qu.:2.300   1st Qu.: 8.95
        Median :2.700   Median :10.10
        Mean   :2.724   Mean   :10.63
        3rd Qu.:3.025   3rd Qu.:12.12
        Max.   :3.900   Max.   :20.50
</pre>
[[Image:SMHS SLR Fig10.png|300px]]

A positive correlation between two variables X and Y means that if X increases, this will cause the value of Y to increase.
*(a) This is always true.
*(b) This is sometimes true.
*(c) This is never true.

The correlation between working out and body fat was found to be exactly -1.0. Which of the following would not be true about the corresponding scatterplot?
*(a) The slope of the best line of fit should be -1.0.
*(b) All the points would lie along a perfect straight line, with no deviation at all.
*(c) The best fitting line would have a downhill (negative) slope.
*(d) 100% of the variance in body fat can be predicted from workout.

Suppose that the correlation between working out and body fat was found to be exactly -1.0. Which of the following would NOT be true, about the corresponding scatterplot?
*(a) All points would lie along a straight line, with no deviation at all.
*(b) 100% of the variance in body fat can be predicted from the workout.
*(c) The slope of the linear model is -1.0.
*(d) The best fitting line would have a negative slope.

If the correlation coefficient is 0.80, then:
*(a) The explanatory variable is usually less than the response variable.
*(b) The explanatory variable is usually more than the response variable.
*(c) None of the statements are correct.
*(d) Below-average values of the explanatory variable are more often associated with below-average values of the response variable.
*(e) Below-average values of the explanatory variable are more often associated with above-average values of the response variable.

Two different researchers wanted to study the relationship between math anxiety and taking exams. Researcher A measured anxiety with a scale that had a minimum score of 0 and a maximum score of 20, and a final exam that had a minimum score of 0 and a maximum score of 50. He tested 120 students. Researcher B measured anxiety with a scale that had a minimum of 0 and a maximum of 30, and a final exam that had a minimum score of 0 and a maximum score of 35. He tested 60 students. Researcher A found that the coefficient of correlation between a student's math anxiety and his or her score on the final was -0.60. Researcher B found the correlation between a student's math anxiety and his or her score on the final was -0.30.
*(a) The coefficient of correlation for researcher A is twice as strong as the coefficient of correlation for researcher B.
*(b) Based on the study by researcher A one can conclude that high math anxiety is the reason that a lot of the students do not do well in math.
*(c) Given that the coefficient of correlation shows the association between standardized scores, one can conclude that for researcher A a greater percentage of the students who have above average anxiety are likely to have below average score on the final.
*(d) Given that the minimum and the maximum values for math and anxiety are so different for the two researchers one cannot compare the coefficient of correlation found by these two researchers.

In the early 1900's, when Francis Galton and Karl Pearson measured 1078 pairs of fathers and their grown-up sons, they calculated that the mean height for fathers was 68 inches with a standard deviation of 3 inches. For their sons, the mean height was 69 inches with a standard deviation of 3 inches. (The actual deviations are a bit smaller, but we will work with these values to keep the calculations simple.) The correlation coefficient was 0.50. Use the information to calculate the slope of the linear model that predicts the height of the son from the height of the father.
*(a) 35.00
*(b) 0.50
*(c) The slope cannot be determined without the actual data
*(d) 3/3 = 1.00

Suppose that wildlife researchers monitor the local alligator population by taking aerial photographs on a regular schedule. They determine that the best fitting linear model to predict weight in pounds from the length of the gators in inches is:
Weight = -393 + 5.9*Length, with $r^2 = 0.836$.
Which of the following statements is true?
*(a) A gator that is about 10 inches above average in length is about 59 pounds above the average weight of these gators.
*(b) The correlation between a gator's length and weight is 0.836.
*(c) The correlation between a gator's height and weight cannot be determined without the actual data.
*(d) The correlation between a gator's height and weight is about -0.914.

Which of the following is NOT a property of the LSR Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The average x value and the average y value lies on the LSR Line
*(c) The sum of squared residuals is minimized
*(d) The sum of the residuals = 0

Suppose that the linear model that predicts fat content in grams from the protein of selected items from Burger Queen menu is: Fat = 6.83 + 0.97*Protein. We learn that there are actually 20 grams of fat in the Chucking burger that has 20 grams of protein. Which of the following statements is true?
*(a) The linear model underestimates the actual fat content and produces a residual of -6.23
*(b) The linear model overestimates the fat content and produces a residual of -6.23
*(c) The linear model underestimates the fat content and produces a residual of -6.23
*(d) The linear model overestimates the fat content and produces a residual of 6.2

Which statement describes the principle of "least squares" that we use in determining the best-fit line?
*(a) The best-fit line minimizes the distances between the observed values and the predicted values.
*(b) The best-fit line minimizes the sum of the squared residuals.
*(c) The best-fit line minimizes the sum of the residuals.
*(d) The best-fit line minimizes the sum of the distances between the actual values and the predicted values.

The scores of midterm and final exams for a random sample of Stats 10 students can be summarized as follows:
Mean of midterm score = 36.92; SD of midterm score = 37.79 Mean of final score = 24.71; SD of final score= 25.21 r= 0.978
Choose one answer.
*(a) 23.44
*(b) 0.62
*(c) 25.21
*(d) 35

Which of the following is NOT a property of the Least Squares Regression Line?
*(a) The sum of the distances between each point and the LSR Line is minimized.
*(b) The sum of squared residuals is minimized
*(c) The average x value and the average y value lie on the LSR Line
*(d) The sum of the residuals = 0

Tom and Sue wanted to estimate the average self-esteem score. The true population average for self-esteem score is 20. Tom estimates that average by taking a sample of size n and then constructing a confidence interval. Which of the following is true?
I. The resulting interval will contain 20 II. The 95 percent confidence interval for n = 100 will generally be more narrow than the 95 percent confidence interval for n = 50. III. For n = 100, the 95 percent confidence interval will be wider than the 90 percent confidence interval.
*(a) II only
*(b) III only
*(c) I only
*(d) II and III

A simple random sample of 1000 persons is taken to estimate the percentage of Democrats in a large population. It turns out that 543 of the people in the sample are Democrats. Is the following statement true or false? Explain. "(51%, 57.5%) is approximately a 95% confidence interval for the sample percentage of Democrats."
*(a) False, that is the approximate confidence interval for p. There is no confidence interval for the sample proportion.
*(b) True, we did the computations and those are approximately the numbers for the confidence interval for p.
*(c) True, that is the confidence interval for the sample mean.
*(d) False, the confidence interval for the sample proportion should be smaller than that.

Use the linear model to predict the height of a son whose father's height is 6 feet.
*(a) The son's height = 35 + 0.5(6) inches
*(b) The son's height = 35 + 0.5(72) inches
*(c) The "Regression Effect" states that the son will be a bit taller than his father
*(d) Cannot be determined without the data

A statistician wants to predict Z from Y. He finds that r-squared is 5%. Which one of the following conclusions is correct?
*(a) The coefficient of correlation between Y and Z is 0.05
*(b) Y explains 5% of the variance in Z
*(c) Y is a good predictor of Z
*(d) Z is a good predictor of Y

===References===
*[[Probability_and_statistics_EBook#Chapter_X:_Correlation_and_Regression | SOCR Correlation and Regression Chapter]]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_SLR}}

SMHS PowerSensitivitySpecificity

2014-09-15T02:20:30Z

Jslavine: /* Type I Error, Type II Error and Power */

==[[SMHS| Scientific Methods for Health Sciences]] - Statistical Power, Sample-Size, Sensitivity and Specificity ==

===Overview:===

In statistics, we have many ways to evaluate and choose tests or models. In this lecture, we are going to introduce some commonly used methods to describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. This lecture will present background knowledge of these concepts and illustrate their power and application through examples.

===Motivation:===

Experiments, models and tests are fundamental to the field of statistics. All researchers are faced with the question of how to choose appropriate models and set up tests. We are interested in studying some of the most commonly used methods, including power, effect size, sensitivity and specificity. Focusing on these characteristics will greatly help us to choose an appropriate model and understand the results. We must consider questions such as: What would be a reasonable sample size to reach a balance in the trade-off between cost and efficiency? What is the probability that a test will reject a false null hypothesis? What is the test’s ability to correctly accept a true null hypothesis or reject a false alternative hypothesis?

===Theory===

====Type I Error, Type II Error and Power====
*Type I error: the false positive (Type I) error of rejecting the null hypothesis given that it is actually true; e.g., the purses are detected as containing the radioactive material while they actually do not.
*Type II error: the false negative (Type II) error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., the purses are detected as not containing the radioactive material while they actually do.
*Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
*Test specificity (ability of a test to correctly accept a true null hypothesis) $=\frac{d}{b+d}$, where $b$ and $d$ denote the false positive (FP) and true negative (TN) cells of the table below.
*Test sensitivity (ability of a test to correctly reject a false null hypothesis) $=\frac{a}{a+c}$, where $a$ and $c$ denote the true positive (TP) and false negative (FN) cells of the table below.

*The table below gives an example of calculating specificity, sensitivity, false positive rate $\alpha$, false negative rate $\beta$ and power from the four cell proportions ''TN'', ''FN'', ''FP'' and ''TP''.

<center>
{| class="wikitable" style="text-align:center;width: 25%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative (fail to reject $H_0$)''' || Condition absent + Negative result = True (accurate) Negative ('''TN''', 0.98505) || Condition present + Negative result = False (invalid) Negative ('''FN''', 0.00025) '''Type II error''' (proportional to $\beta$)
|-
| '''Positive (reject $H_0$)''' || Condition absent + Positive result = False Positive ('''FP''', 0.00995) '''Type I error''' (proportional to $\alpha$) || Condition present + Positive result = True Positive ('''TP''', 0.00475)
|-
|'''Test Interpretation''' || $Power = 1-\frac{FN}{FN+TP}= 0.95 $ ||'''Specificity''': TN/(TN+FP) = 0.98505/(0.98505+ 0.00995) = 0.99 ||'''Sensitivity''': TP/(TP+FN) = 0.00475/(0.00475+ 0.00025)= 0.95
|-
|}
</center>

Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\dfrac{TP}{TP+FN}$, $\alpha=\dfrac {FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power$=1-\beta.$ Note that both (''Type I ($\alpha$)'' and ''Type II ($\beta$)'') errors are proportions in the range [0,1], so they represent ''error-rates''. The reason they are listed in the corresponding cells is that they are directly proportionate to the numerical values of the FP and FN, respectively.

Note that the two alternative definitions of ''power'' are equivalent:
: power$=1-\beta$, and
: power=sensitivity
This is because power=$1-\beta=1-\frac{FN}{FN+TP}=\frac{FN+TP}{FN+TP} - \frac{FN}{FN+TP}=\frac{TP}{FN+TP}=$ sensitivity.
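As a quick numerical check, the rates defined above can be recomputed from the four cell proportions of the table. The following Python sketch is illustrative only; the function name <code>diagnostic_rates</code> and the dictionary layout are assumptions made for this example, not part of any SOCR tool.

```python
def diagnostic_rates(tn, fn, fp, tp):
    """Compute test characteristics from the four cells of a 2x2 table.

    The inputs may be proportions (as in the table above) or raw counts.
    """
    specificity = tn / (tn + fp)   # true negative rate
    sensitivity = tp / (tp + fn)   # true positive rate
    alpha = fp / (fp + tn)         # false positive (Type I error) rate
    beta = fn / (fn + tp)          # false negative (Type II error) rate
    power = 1 - beta               # algebraically equal to sensitivity
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "alpha": alpha,
        "beta": beta,
        "power": power,
    }


# Cell proportions taken from the table above.
rates = diagnostic_rates(tn=0.98505, fn=0.00025, fp=0.00995, tp=0.00475)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

With the table values this should reproduce specificity 0.99 and sensitivity 0.95, and it makes the identity power = sensitivity easy to verify numerically.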

====Sample size====
Sample size is the number of observations or replicates included in a statistical sample. It is an important feature of any empirical study that aims to make inferences about a population. In complicated studies, there may be several different sample sizes involved: for example, in a survey involving stratified sampling, there may be a different sample size for each subpopulation (stratum).

*Factors influencing sample size: expense of data collection; need to have sufficient statistical power.

*Ways to choose sample sizes: (1) expedience, e.g., a simple experiment where the sample data are readily available or convenient to collect, although the sample size remains crucial in avoiding wide confidence intervals or risks of errors in statistical hypothesis testing; (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.

*Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations, the increase in accuracy for larger sample sizes is minimal, or even nonexistent. This can result from the presence of systematic error or strong dependence in the data, or if the data follow a heavy-tailed distribution. Sample size is judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.

*Choose the sample size based on our expectation of other measures.
*Consider the simple experiment of flipping a coin, where the estimator of the proportion of heads is $\hat{p}=\frac{X}{n}$ and $X$ is the number of heads in $n$ flips. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Under this approximation, about $95\%$ of the distribution's probability lies within 2 standard deviations of the mean. Using the Wald method for the binomial distribution with the conservative bound $\hat{p}(1-\hat{p}) \le 0.25$, an interval of the form $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms a 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then $4\sqrt{\frac{0.25}{n}}=W$; solving for $n$ gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=\frac{W}{2}$ is the error bound on the estimate, i.e., the estimate is usually given as within $\pm B$. Hence, if $B=0.1$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.
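
The rule $n=\frac{1}{B^2}$ from the coin-flipping example can be sketched in a few lines of Python. This is a minimal illustration under the same conservative assumption (variance bound 0.25 and a multiplier of 2 standard errors); the function name <code>sample_size_for_proportion</code> is hypothetical.

```python
import math


def sample_size_for_proportion(error_bound):
    """Smallest n such that p_hat +/- 2*sqrt(0.25/n) has half-width <= error_bound.

    Implements n = 1 / B^2, rounded up to a whole observation. The value is
    rounded to 9 decimals first so floating-point noise (e.g. 399.99999999999994)
    does not shift the result by one.
    """
    return math.ceil(round(1 / error_bound ** 2, 9))


for b in (0.1, 0.05, 0.03):
    print(f"B = {b}: n = {sample_size_for_proportion(b)}")
```

An error bound of $B=0.03$ corresponds to the interval width $W=0.06$ mentioned earlier in this section, since $W=2B$.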

=====Proportion=====
A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $ \sigma ^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. With [[SMHS_CLT_LLN|CLT]], the 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$. If we wish to have a confidence interval $W$ units wide, then solving $\frac{4\sigma}{\sqrt n}=W$ for $n$ gives $n=\frac{16\sigma^2}{W^2}$.
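
The formula $n=\frac{16\sigma^2}{W^2}$ can be checked with a short Python sketch. The function name <code>sample_size_for_mean</code> and the numeric inputs are assumptions chosen only for illustration.

```python
import math


def sample_size_for_mean(sigma, width):
    """Smallest n such that the interval x_bar +/- 2*sigma/sqrt(n) is at most `width` wide.

    Implements n = 16 * sigma^2 / W^2, rounded up; rounding to 9 decimals
    first guards against floating-point noise just below a whole number.
    """
    return math.ceil(round(16 * sigma ** 2 / width ** 2, 9))


# Hypothetical example: sigma = 15 and a desired total width of 5 units.
print(sample_size_for_mean(sigma=15, width=5))

# A proportion as a special case: sigma^2 <= 0.25 and W = 2B = 0.1
# gives the same n as 1 / B^2 with B = 0.05.
print(sample_size_for_mean(sigma=0.5, width=0.1))
```

The second call illustrates why a proportion is a special case of a mean: plugging the worst-case standard deviation 0.5 into the mean formula recovers $n=\frac{4}{W^2}$.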

=====Hypothesis tests=====
Sample size for hypothesis tests: Let $X_i,i=1,2,…,n$ be independent observations taken from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. The null hypothesis vs. alternative hypothesis: $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$. If we wish to:
# reject $H_0$ with a probability of at least $1-\beta$ when $H_a$ is true,
# reject $H_0$ with probability $\alpha$ when $H_0$ is true.

For (2), we need $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha $, so the decision rule "reject $H_0$ if the sample average exceeds $\frac{z_\alpha\sigma} {\sqrt n}$" satisfies (2). Here $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution (as this is a one-sided test). For (1), we want this rejection to happen with probability at least $1-\beta$ when $H_a$ is true, that is, when the sample average comes from a normal distribution with the different mean $\mu^*$.

Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta $. Under $H_a$, each observation satisfies $x \sim N(\mu^*, \sigma^2)$, so $\bar{x} \sim N(\mu^*, \frac{\sigma^2}{n})$ and the standardized statistic $\frac{\bar{x} -\mu^*}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1)$. The rejection event $\bar x > \frac{z_\alpha \sigma}{\sqrt n}$ is equivalent to $\frac{\bar{x} -\mu^*}{\frac{\sigma}{\sqrt{n}}} > z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}$, which has probability $1-\Phi\left(z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}}\right)$, where $\Phi$ is the [[SMHS_ProbabilityDistributions#Normal_distribution|normal cumulative distribution function]]. Requiring this right-tail probability to be at least $1-\beta$ gives $z_{\alpha} - \frac{\mu^*}{\frac{\sigma}{\sqrt{n}}} \le \Phi^{-1}(\beta) = -\Phi^{-1}(1-\beta)$. Solving for $\sqrt{n}$ (and then $n$), with $\mu^*>0$, we have $n \ge \left( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}}\right)^{2}$.
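
The sample-size requirement for this one-sided test can be explored numerically. The Python sketch below uses only the standard library (<code>statistics.NormalDist</code>) and the standard one-sided form $n \ge \left(\frac{z_\alpha + \Phi^{-1}(1-\beta)}{\mu^*/\sigma}\right)^2$; it then verifies the answer by evaluating the achieved power $1-\Phi\left(z_\alpha - \frac{\mu^*\sqrt{n}}{\sigma}\right)$ directly. The function names and the numeric inputs are assumptions made for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def sample_size_one_sided_z(mu_star, sigma, alpha=0.05, power=0.80):
    """Smallest n for a one-sided z test of H0: mu = 0 vs Ha: mu = mu_star > 0."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # upper alpha point
    z_power = STD_NORMAL.inv_cdf(power)       # Phi^{-1}(1 - beta)
    return math.ceil(((z_alpha + z_power) / (mu_star / sigma)) ** 2)


def achieved_power(n, mu_star, sigma, alpha=0.05):
    """P(reject H0 | Ha true) when rejecting for x_bar > z_alpha * sigma / sqrt(n)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - mu_star * math.sqrt(n) / sigma)


# Hypothetical inputs: effect mu* = 0.5, sigma = 1, alpha = 5%, target power 80%.
n = sample_size_one_sided_z(mu_star=0.5, sigma=1.0, alpha=0.05, power=0.80)
print("required n:", n)
print("power at n:", round(achieved_power(n, 0.5, 1.0), 4))
print("power at n - 1:", round(achieved_power(n - 1, 0.5, 1.0), 4))
```

Comparing the power at $n$ and at $n-1$ is a simple way to confirm that the returned value is the smallest sample size meeting the target.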

====Effect size====
[http://books.google.com/books?id=J8AlAgAAQBAJ&pg=PT176&lpg=PT176&dq=Effect+size+is+a+descriptive+statistic+that+conveys+the+estimated+magnitude+of+a+relationship+without+making+any+statement+about+whether+the+apparent+relationship+in+the+data+reflects+a+true+relationship+in+the+population&source=bl&ots=YcgNM4azVu&sig=ut-4IHx-SrRoHqMZjAmQtXxxYp4&hl=en&sa=X&ei=wQkGVPzhIsrHggT68YDQBA&ved=0CDMQ6AEwAg#v=onepage&q=Effect%20size%20is%20a%20descriptive%20statistic%20that%20conveys%20the%20estimated%20magnitude%20of%20a%20relationship%20without%20making%20any%20statement%20about%20whether%20the%20apparent%20relationship%20in%20the%20data%20reflects%20a%20true%20relationship%20in%20the%20population&f=false Effect size] is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.

====Other common measures====
*Pearson $r$ (correlation): an effect size used when paired quantitative data are available, for instance when studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between the two variables.

*Coefficient of determination, $ r^2 $: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1 and is always nonnegative. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.

*Eta-squared, $ \eta^2 $, describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to the $ r^2 $. It is a biased estimator of the variance explained by the model in the population. $ \eta^2=\frac{SS_{treatment}} {SS_{total}} $ .

*Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment} \cdot MS_{error}}{SS_{total}+MS_{error}}$. Because it is less biased, $\omega^2$ is preferable to $\eta^2$; however, it can be more inconvenient to calculate for complex analyses.

* Cohen’s $ f^2 $: one of several effect size measures to use in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$, where $R^2 $ is the squared multiple correlation.
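
The variance-based effect sizes above can be computed together from an ANOVA table. The Python sketch below is a minimal illustration under two stated assumptions: a one-way design, so that $SS_{error}=SS_{total}-SS_{treatment}$, and the use of $\eta^2$ in place of $R^2$ when computing $f^2$. The function name and the sums of squares are hypothetical values, not data from any study.

```python
def anova_effect_sizes(ss_treatment, ss_total, df_treatment, df_error):
    """Return (eta_squared, omega_squared, cohens_f_squared) for a one-way ANOVA."""
    ss_error = ss_total - ss_treatment          # assumption: one-way design
    ms_error = ss_error / df_error              # mean squared error
    eta_sq = ss_treatment / ss_total
    omega_sq = (ss_treatment - df_treatment * ms_error) / (ss_total + ms_error)
    f_sq = eta_sq / (1 - eta_sq)                # assumption: eta^2 used as R^2
    return eta_sq, omega_sq, f_sq


# Hypothetical ANOVA table: 3 groups (df_treatment = 2), 30 observations (df_error = 27).
eta_sq, omega_sq, f_sq = anova_effect_sizes(
    ss_treatment=30.0, ss_total=100.0, df_treatment=2, df_error=27
)
print(f"eta^2   = {eta_sq:.4f}")
print(f"omega^2 = {omega_sq:.4f}")
print(f"f^2     = {f_sq:.4f}")
```

In this example $\omega^2$ comes out smaller than $\eta^2$, which is the expected direction of the bias correction.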

===Applications===
*[http://www.sciencedirect.com/science/article/pii/0197245681900015 This article], titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviewed the importance of sample size in clinical trials and presented a general method from which specific equations are derived for sample size determination and analysis of power for a wide variety of statistical procedures. The paper discussed the method in detail, with illustrations for the t test, tests for proportions, tests for survival time and tests for correlations that commonly occur in clinical trials.

*[http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/pdf This article] advocates the presentation of measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrated the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focused on standardized effect size statistics and extensively discussed two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and also because their calculations are essential for meta-analysis. The paper provided potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or variances, and (4) when data are non-independent.

*[http://www.sciencedirect.com/science/article/pii/019724569090005M This article] reviewed methods of sample size and power calculation for the most commonly used study designs. It presents two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrates the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html Student Calculator]
*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Normal T Chi-Squared F Tables]

===Problems===
Other things being equal, which of the following actions will reduce the power of a hypothesis test?

I. Increasing sample size.  II. Increasing significance level.  III. Increasing beta, the probability of a Type II error.

:(A) I only
:(B) II only  
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?

I. The power of the hypothesis test.  II. The effect size of the hypothesis test.  III. The probability of making a Type II error.

:(A) I only
:(B) II only
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
<center>
{| class="wikitable" style="text-align:center;width: 25%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative (fail to reject $H_0$)''' || 0.983 || 0.0025
|-
| '''Positive (reject $H_0$)''' || 0.0085 || 0.0055
|}
</center>

Suppose we are running a test on a simple experiment where the population standard deviation is $ 0.06$. $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a type I error of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?

===References===

*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Hypothesis_Basics SOCR]

*[http://en.wikipedia.org/wiki/Sample_size_determination Sample Size Determination Wikipedia]

*[http://en.wikipedia.org/wiki/Effect_size Effect Size Wikipedia]

*[http://books.google.com/books?id=whF18jCxyv0C&pg=PT4&lpg=PT4&dq=e-Study+Guide+for+Statistics+for+the+Behavioral+Sciences,+textbook+by+Susan&source=bl&ots=9vlDcJMtv1&sig=lUFE0l5GeZdyX8iasXUNgSpb6UI&hl=en&sa=X&ei=CQoGVMjCNs_HgwTi1YCICw&ved=0CD8Q6AEwAw#v=onepage&q=e-Study%20Guide%20for%20Statistics%20for%20the%20Behavioral%20Sciences%2C%20textbook%20by%20Susan&f=false e-Study Guide for Statistics for the Behavioral Sciences, textbook by Susan]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PowerSensitivitySpecificity}}

SMHS PowerSensitivitySpecificity

2014-09-03T14:31:49Z

Jslavine: /* Motivation: */

==[[SMHS| Scientific Methods for Health Sciences]] - Statistical Power, Sample-Size, Sensitivity and Specificity ==

===Overview:===

In statistics, we have many ways to evaluate and choose tests or models. In this lecture, we are going to introduce some commonly used methods to describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. This lecture will present background knowledge of these concepts and illustrate their power and application through examples.

===Motivation:===

Experiments, models and tests are fundamental to the field of statistics. All researchers are faced with the question of how to choose appropriate models and set up tests. We are interested in studying some of the most commonly used methods, including power, effect size, sensitivity and specificity. Focusing on these characteristics will greatly help us to choose an appropriate model and understand the results. We must consider questions such as: What would be a reasonable sample size to reach a balance in the trade-off between cost and efficiency? What is the probability that a test will reject a false null hypothesis? What is the test’s ability to correctly accept a true null hypothesis or reject a false alternative hypothesis?

===Theory===

====Type I Error, Type II Error and Power====
*Type I error: the false positive (Type I) error of rejecting the null hypothesis given that it is actually true; e.g., the purses are detected as containing the radioactive material while they actually do not.
*Type II error: the false negative (Type II) error of failing to reject the null hypothesis given that the alternative hypothesis is actually true; e.g., the purses are detected as not containing the radioactive material while they actually do.
*Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chances of a Type II error decrease.
*Test specificity (ability of a test to correctly accept a true null hypothesis) $=\frac{d}{b+d}$, where $b$ and $d$ denote the false positive (FP) and true negative (TN) cells of the table below.
*Test sensitivity (ability of a test to correctly reject a false null hypothesis) $=\frac{a}{a+c}$, where $a$ and $c$ denote the true positive (TP) and false negative (FN) cells of the table below.

*The table below gives an example of calculating specificity, sensitivity, false positive rate $\alpha$, false negative rate $\beta$ and power from the four cell proportions ''TN'', ''FN'', ''FP'' and ''TP''.

<center>
{| class="wikitable" style="text-align:center;width: 25%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative (fail to reject $H_0$)''' || Condition absent + Negative result = True (accurate) Negative ('''TN''', 0.98505) || Condition present + Negative result = False (invalid) Negative ('''FN''', 0.00025) '''Type II error''' (proportional to $\beta$)
|-
| '''Positive (reject $H_0$)''' || Condition absent + Positive result = False Positive ('''FP''', 0.00995) '''Type I error''' (proportional to $\alpha$) || Condition present + Positive result = True Positive ('''TP''', 0.00475)
|-
|'''Test Interpretation''' || $Power = 1-\frac{FN}{FN+TP}= 1-\frac{0.00025}{0.00025+0.00475} = 0.95 $ ||'''Specificity''': TN/(TN+FP) = 0.98505/(0.98505+ 0.00995) = 0.99 ||'''Sensitivity''': TP/(TP+FN) = 0.00475/(0.00475+ 0.00025)= 0.95
|-
|}
</center>

Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\dfrac{TP}{TP+FN}$, $\alpha=\dfrac {FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power$=1-\beta.$

====Sample size====
Sample size is the number of observations or replicates included in a statistical sample. It is an important feature of any empirical study that aims to make inferences about a population. In complicated studies, there may be several different sample sizes involved: for example, in a survey involving stratified sampling, there may be a different sample size for each subpopulation (stratum).

*Factors influencing sample size: expense of data collection; need to have sufficient statistical power.

*Ways to choose sample sizes: (1) expedience, e.g., a simple experiment where the sample data are readily available or convenient to collect, although the sample size remains crucial in avoiding wide confidence intervals or risks of errors in statistical hypothesis testing; (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.

*Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations, the increase in accuracy for larger sample sizes is minimal, or even nonexistent. This can result from the presence of systematic error or strong dependence in the data, or if the data follow a heavy-tailed distribution. Sample size is judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.

*Choose the sample size based on our expectation of other measures.
*Consider the simple experiment of flipping a coin, where the estimator of the proportion of heads is $\hat{p}=\frac{X}{n}$ and $X$ is the number of heads in $n$ flips. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Under this approximation, about $95\%$ of the distribution's probability lies within 2 standard deviations of the mean. Using the Wald method for the binomial distribution with the conservative bound $\hat{p}(1-\hat{p}) \le 0.25$, an interval of the form $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms a 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then $4\sqrt{\frac{0.25}{n}}=W$; solving for $n$ gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=\frac{W}{2}$ is the error bound on the estimate, i.e., the estimate is usually given as within $\pm B$. Hence, if $B=0.1$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.

*A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $ \sigma ^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. With [[SMHS_CLT_LLN|CLT]], the 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$. If we wish to have a confidence interval $W$ units wide, then solving $\frac{4\sigma}{\sqrt n}=W$ for $n$ gives $n=\frac{16\sigma^2}{W^2}$.

*Sample size for hypothesis tests: Let $X_i, i=1,2,\ldots,n$ be independent observations taken from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Consider the null vs. alternative hypotheses $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$, with $\mu^*>0$. Suppose we wish to (1) reject $H_0$ with probability at least $1-\beta$ when $H_a$ is true, and (2) reject $H_0$ with probability $\alpha$ when $H_0$ is true. Since $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha $, where $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution, the decision rule "reject $H_0$ if the sample average exceeds $\frac{z_\alpha\sigma} {\sqrt n}$" satisfies (2). For (1), this rejection must occur with probability at least $1-\beta$ when $H_a$ is true, in which case the sample average comes from a normal distribution with mean $\mu^*$ and standard deviation $\frac{\sigma}{\sqrt n}$.

: Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta $, that is, $\Phi\left(\frac{\mu^{*}\sqrt n}{\sigma}-z_{\alpha}\right)\ge 1-\beta$. Solving for $n$ gives $n \ge \left( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}}\right)^{2}$, where $\Phi$ is the [[SMHS_ProbabilityDistributions#Normal_distribution|normal cumulative distribution function]].

====Effect size====
[http://books.google.com/books?id=J8AlAgAAQBAJ&pg=PT176&lpg=PT176&dq=Effect+size+is+a+descriptive+statistic+that+conveys+the+estimated+magnitude+of+a+relationship+without+making+any+statement+about+whether+the+apparent+relationship+in+the+data+reflects+a+true+relationship+in+the+population&source=bl&ots=YcgNM4azVu&sig=ut-4IHx-SrRoHqMZjAmQtXxxYp4&hl=en&sa=X&ei=wQkGVPzhIsrHggT68YDQBA&ved=0CDMQ6AEwAg#v=onepage&q=Effect%20size%20is%20a%20descriptive%20statistic%20that%20conveys%20the%20estimated%20magnitude%20of%20a%20relationship%20without%20making%20any%20statement%20about%20whether%20the%20apparent%20relationship%20in%20the%20data%20reflects%20a%20true%20relationship%20in%20the%20population&f=false Effect size] is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.

====Other common measures====
*Pearson $r$ (correlation): an effect size used when paired quantitative data are available, for instance when studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between the two variables.

*Coefficient of determination, $ r^2 $: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.

*Eta-squared, $ \eta^2 $: describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to $ r^2 $. It is a biased estimator of the variance explained by the model in the population. $ \eta^2=\frac{SS_{treatment}} {SS_{total}} $.

*Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment}\times MS_{error}}{SS_{total}+MS_{error}}$. Because it is less biased, $\omega^2$ is preferable to $\eta^2$; however, it can be more inconvenient to calculate for complex analyses.

* Cohen’s $ f^2 $: one of several effect size measures to use in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$, where $R^2$ is the squared multiple correlation.

===Applications===
*[http://www.sciencedirect.com/science/article/pii/0197245681900015 This article], titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviews the importance of sample size in clinical trials and presents a general method from which specific equations are derived for sample size determination and power analysis for a wide variety of statistical procedures. The paper discusses the method in detail, with illustrations for the t test, tests for proportions, tests for survival time and tests for correlations that commonly occur in clinical trials.

*[http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/pdf This article] advocates presenting measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrates the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focuses on standardized effect size statistics and extensively discusses two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and because their calculation is essential for meta-analysis. The paper provides potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or variances, and (4) when data are non-independent.

*[http://www.sciencedirect.com/science/article/pii/019724569090005M This article] reviews methods of sample size and power calculation for the most commonly used study designs. It presents two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrates the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html Student Calculator]
*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Normal T Chi-Squared F Tables]

===Problems===
Other things being equal, which of the following actions will reduce the power of a hypothesis test?

I. Increasing sample size.  II. Increasing significance level.  III. Increasing beta, the probability of a Type II error.

:(A) I only
:(B) II only  
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?

I. The power of the hypothesis test.  II. The effect size of the hypothesis test.  III. The probability of making a Type II error.

:(A) I only
:(B) II only
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
<center>
{| class="wikitable" style="text-align:center;width: 25%"border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative(fail to reject $H_0$)''' || 0.983 || 0.0025
|-
| '''Positive (reject $H_0$)''' || 0.0085 ||0.0055
|}
</center>

Suppose we are running a test on a simple experiment where the population standard deviation is $ 0.06$. $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a type I error rate of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?

===References===

*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Hypothesis_Basics SOCR]

*[http://en.wikipedia.org/wiki/Sample_size_determination Sample Size Determination Wikipedia]

*[http://en.wikipedia.org/wiki/Effect_size Effect Size Wikipedia]

*[http://books.google.com/books?id=whF18jCxyv0C&pg=PT4&lpg=PT4&dq=e-Study+Guide+for+Statistics+for+the+Behavioral+Sciences,+textbook+by+Susan&source=bl&ots=9vlDcJMtv1&sig=lUFE0l5GeZdyX8iasXUNgSpb6UI&hl=en&sa=X&ei=CQoGVMjCNs_HgwTi1YCICw&ved=0CD8Q6AEwAw#v=onepage&q=e-Study%20Guide%20for%20Statistics%20for%20the%20Behavioral%20Sciences%2C%20textbook%20by%20Susan&f=false e-Study Guide for Statistics for the Behavioral Sciences, textbook by Susan]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PowerSensitivitySpecificity}}

SMHS PowerSensitivitySpecificity

2014-09-03T14:29:20Z

Jslavine: /* Overview: */

==[[SMHS| Scientific Methods for Health Sciences]] - Statistical Power, Sample-Size, Sensitivity and Specificity ==

===Overview:===

In statistics, we have many ways to evaluate and choose tests or models. In this lecture, we are going to introduce some commonly used methods to describe the characteristics of a test: power, sample size, effect size, sensitivity and specificity. This lecture will present background knowledge of these concepts and illustrate their power and application through examples.

===Motivation:===

Experiments, models and tests are fundamental to the field of statistics, and we have all faced the question of how to set up the right test and how to choose a better model. We are interested in studying some of the most commonly used methods, including power, effect size, sensitivity and specificity, which will greatly help us in understanding and choosing the model. So, what would be a reasonable sample size to balance the trade-off between cost and efficiency? What is the probability that the test will reject a false null hypothesis? What is the test’s ability to correctly retain a true null hypothesis or to correctly reject a false null hypothesis?

===Theory===

====Type I Error, Type II Error and Power====
*Type I error: the false positive error of rejecting the null hypothesis when it is actually true; e.g., the purses are flagged as containing radioactive material when they actually do not.
*Type II error: the false negative error of failing to reject the null hypothesis when the alternative hypothesis is actually true; e.g., the purses are flagged as not containing radioactive material when they actually do.
*Statistical power: the probability that the test will reject a false null hypothesis (that it will not make a Type II error). When power increases, the chance of a Type II error decreases.
*Test specificity: the ability of a test to correctly retain a true null hypothesis (the true negative rate), $=\frac{d}{b+d}=\frac{TN}{FP+TN}$.
*Test sensitivity: the ability of a test to correctly reject a false null hypothesis (the true positive rate), $=\frac{a}{a+c}=\frac{TP}{TP+FN}$. Here $a$, $b$, $c$ and $d$ denote the true positive (TP), false positive (FP), false negative (FN) and true negative (TN) cells, respectively.

*The table below gives an example of calculating specificity, sensitivity, false positive rate $\alpha$, false negative rate $\beta$ and power from the values of ''TN'', ''FN'', ''FP'' and ''TP''.

<center>
{| class="wikitable" style="text-align:center;width: 25%"border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative (fail to reject $H_0$)''' || Condition absent + Negative result = True (accurate) Negative ('''TN''', 0.98505) || Condition present + Negative result = False (invalid) Negative ('''FN''', 0.00025), '''Type II error''' (β)
|-
| '''Positive (reject $H_0$)''' || Condition absent + Positive result = False Positive ('''FP''', 0.00995), '''Type I error''' (α) || Condition present + Positive result = True Positive ('''TP''', 0.00475)
|-
|'''Test Interpretation''' || '''Power''': $1-\beta = 1-\frac{FN}{FN+TP} = 1-\frac{0.00025}{0.00025+0.00475} = 0.95$ ||'''Specificity''': TN/(TN+FP) = 0.98505/(0.98505+ 0.00995) = 0.99 ||'''Sensitivity''': TP/(TP+FN) = 0.00475/(0.00475+ 0.00025)= 0.95
|-
|}
</center>

Specificity $=\frac{TN}{TN + FP}$, Sensitivity $=\dfrac{TP}{TP+FN}$, $\alpha=\dfrac {FP}{FP+TN}$, $\beta=\frac{FN}{FN+TP}$, power$=1-\beta.$
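
The formulas above can be checked with a short script. This is a minimal Python sketch, not part of the original lesson; the function name <code>diagnostic_summary</code> is a hypothetical helper, and the inputs are the four cell values from the table.

```python
def diagnostic_summary(tn, fn, fp, tp):
    """Compute test characteristics from the four cells of a 2x2 table.

    The cells may be counts or joint probabilities; only ratios are used.
    """
    specificity = tn / (tn + fp)   # true negative rate
    sensitivity = tp / (tp + fn)   # true positive rate
    alpha = fp / (fp + tn)         # false positive rate (Type I error)
    beta = fn / (fn + tp)          # false negative rate (Type II error)
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "alpha": alpha,
        "beta": beta,
        "power": 1.0 - beta,
    }


# Cell values taken from the example table above.
result = diagnostic_summary(tn=0.98505, fn=0.00025, fp=0.00995, tp=0.00475)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Note that power and sensitivity coincide here, since both equal $1-\beta$ under these definitions.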

====Sample size====
Sample size is the number of observations or replicates included in a statistical sample. It is an important feature of any empirical study that aims to make inferences about a population. In complicated studies, several different sample sizes may be involved: for example, in a survey using stratified sampling, there may be a different sample size for each stratum.

*Factors influencing sample size: the expense of data collection and the need for sufficient statistical power.

*Ways to choose sample sizes: (1) expedience, e.g., a simple experiment where the sample data are readily available or convenient to collect, although the sample size remains crucial for avoiding wide confidence intervals or risks of errors in statistical hypothesis testing; (2) using a target variance for an estimate to be derived from the sample eventually obtained; (3) using a target for the power of a statistical test to be applied once the sample is collected.

*Intuitively, larger sample sizes generally lead to increased precision in estimating unknown parameters. However, in some situations the gain in accuracy from a larger sample is minimal or even nonexistent. This can result from systematic error, strong dependence in the data, or data that follow a heavy-tailed distribution. Sample size is judged by the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish the 95% confidence interval to be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test.

*Choose the sample size based on our expectation of other measures.
*Consider the simple experiment of flipping a coin, where the estimator of the proportion of heads is $\hat{p}=\frac{X}{n}$ and $X$ is the number of heads in $n$ flips. $X$ follows a binomial distribution and, when $n$ is sufficiently large, the distribution of $\hat{p}$ is closely approximated by a normal distribution. Under this approximation, about $95\%$ of the distribution’s probability lies within 2 standard deviations of the mean. Using the Wald method with the conservative variance bound $p(1-p)\le 0.25$, the interval $(\hat{p} -2\sqrt{\frac{0.25}{n}}, \hat{p} + 2\sqrt{\frac{0.25}{n}})$ forms an approximate 95% CI for the true proportion. If this interval needs to be no more than $W$ units wide, then $4\sqrt{\frac{0.25}{n}}=W$; solving for $n$ gives $n=\frac{4}{W^2}=\frac{1}{B^2}$, where $B=\frac{W}{2}$ is the error bound on the estimate, i.e., the estimate is usually reported as within $\pm B$. Hence, if $B=0.1$ (10%), then $n=100$; and if $B=0.05$ (5%), then $n=400$.

*A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed sample of size $n$, where each observation has variance $\sigma^{2}$, the standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$. By the [[SMHS_CLT_LLN|CLT]], an approximate 95% CI is $(\bar x - \frac {2\sigma}{\sqrt n},\bar x +\frac{2\sigma}{\sqrt n})$. If we wish the confidence interval to be $W$ units wide, then $\frac{4\sigma}{\sqrt n}=W$, and solving for $n$ gives $n=\frac{16\sigma^2}{W^2}$.

*Sample size for hypothesis tests: Let $X_i, i=1,2,\ldots,n$ be independent observations taken from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. Consider the null vs. alternative hypotheses $H_0:\mu=0$ vs. $H_a:\mu=\mu^*$, with $\mu^*>0$. Suppose we wish to (1) reject $H_0$ with probability at least $1-\beta$ when $H_a$ is true, and (2) reject $H_0$ with probability $\alpha$ when $H_0$ is true. Since $P(\bar x >\frac{z_{\alpha}\sigma}{\sqrt n}|H_0 \text{ true})=\alpha $, where $z_\alpha$ is the upper $\alpha$ percentage point of the standard normal distribution, the decision rule "reject $H_0$ if the sample average exceeds $\frac{z_\alpha\sigma} {\sqrt n}$" satisfies (2). For (1), this rejection must occur with probability at least $1-\beta$ when $H_a$ is true, in which case the sample average comes from a normal distribution with mean $\mu^*$ and standard deviation $\frac{\sigma}{\sqrt n}$.

: Therefore, we require $P (\bar x >\frac {z_{\alpha}\sigma}{\sqrt n}|H_a \text{ true})\ge 1-\beta $, that is, $\Phi\left(\frac{\mu^{*}\sqrt n}{\sigma}-z_{\alpha}\right)\ge 1-\beta$. Solving for $n$ gives $n \ge \left( \frac{z_{\alpha}+\Phi^{-1}(1-\beta)}{\frac{\mu^{*}}{\sigma}}\right)^{2}$, where $\Phi$ is the [[SMHS_ProbabilityDistributions#Normal_distribution|normal cumulative distribution function]].
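
The sample-size rules in this list can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, the confidence-interval rules use the "2 standard deviations" approximation from the text, and the hypothesis-test rule uses $n \ge \left(\frac{(z_\alpha + z_\beta)\sigma}{\mu^*}\right)^2$ with $z_\beta=\Phi^{-1}(1-\beta)$, which follows from requiring power of at least $1-\beta$ under $H_a$ for a one-sided test with $\mu^*>0$. The numeric inputs are made up for illustration.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def n_for_proportion_ci(width):
    """Conservative n so a 95% CI for a proportion is at most `width` wide (n = 4 / W^2)."""
    return math.ceil(round(4.0 / width ** 2, 9))


def n_for_mean_ci(sigma, width):
    """n so a 95% CI for a mean is at most `width` wide (n = 16 sigma^2 / W^2)."""
    return math.ceil(round(16.0 * sigma ** 2 / width ** 2, 9))


def n_for_one_sided_z_test(mu_star, sigma, alpha, power):
    """Smallest n with the requested power for H0: mu = 0 vs Ha: mu = mu_star > 0."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)  # upper alpha point
    z_beta = STD_NORMAL.inv_cdf(power)         # Phi^{-1}(1 - beta)
    return math.ceil(((z_alpha + z_beta) * sigma / mu_star) ** 2)


def achieved_power(n, mu_star, sigma, alpha):
    """Power of the rule 'reject H0 if the sample mean exceeds z_alpha * sigma / sqrt(n)'."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_alpha - mu_star * math.sqrt(n) / sigma)


# Error bound B = 0.05 means total width W = 0.1.
print(n_for_proportion_ci(0.1))
# Hypothetical inputs: sigma = 2, desired CI width = 1.
print(n_for_mean_ci(sigma=2.0, width=1.0))
# Hypothetical inputs: mu* = 0.5, sigma = 1, alpha = 0.05, power = 0.80.
n = n_for_one_sided_z_test(mu_star=0.5, sigma=1.0, alpha=0.05, power=0.80)
print(n, achieved_power(n, mu_star=0.5, sigma=1.0, alpha=0.05))
```

Rounding up with <code>math.ceil</code> keeps the requested width or power as a guarantee instead of an approximation.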

====Effect size====
[http://books.google.com/books?id=J8AlAgAAQBAJ&pg=PT176&lpg=PT176&dq=Effect+size+is+a+descriptive+statistic+that+conveys+the+estimated+magnitude+of+a+relationship+without+making+any+statement+about+whether+the+apparent+relationship+in+the+data+reflects+a+true+relationship+in+the+population&source=bl&ots=YcgNM4azVu&sig=ut-4IHx-SrRoHqMZjAmQtXxxYp4&hl=en&sa=X&ei=wQkGVPzhIsrHggT68YDQBA&ved=0CDMQ6AEwAg#v=onepage&q=Effect%20size%20is%20a%20descriptive%20statistic%20that%20conveys%20the%20estimated%20magnitude%20of%20a%20relationship%20without%20making%20any%20statement%20about%20whether%20the%20apparent%20relationship%20in%20the%20data%20reflects%20a%20true%20relationship%20in%20the%20population&f=false Effect size] is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. It complements inferential statistics such as p-value and plays an important role in statistical studies. The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by the experiment's model.

====Other common measures====
*Pearson $r$ (correlation): an effect size used when paired quantitative data are available, for instance when studying the relationship between birth weight and longevity. It varies from -1 to 1, with -1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation and 0 indicating no linear relation between the two variables.

*Coefficient of determination, $ r^2 $: calculated as the square of the Pearson correlation $r$. It varies from 0 to 1. For example, if $r=0.2$ then $r^2=0.04$, meaning that $4\%$ of the variance of either variable is shared with the other variable.

*Eta-squared, $ \eta^2 $: describes the ratio of variance explained in the dependent variable by a predictor while controlling for other predictors, making it analogous to $ r^2 $. It is a biased estimator of the variance explained by the model in the population. $ \eta^2=\frac{SS_{treatment}} {SS_{total}} $.

*Omega-squared, $\omega^2$: a less biased estimator of the variance explained in the population. $\omega^2 =\frac{SS_{treatment}-df_{treatment}\times MS_{error}}{SS_{total}+MS_{error}}$. Because it is less biased, $\omega^2$ is preferable to $\eta^2$; however, it can be more inconvenient to calculate for complex analyses.

* Cohen’s $ f^2 $: one of several effect size measures to use in the context of an F test for ANOVA or multiple regression. Its amount of bias depends on the bias of its underlying measurement of variance explained. $f^2=\frac{R^2}{1-R^2}$, where $R^2$ is the squared multiple correlation.
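
As a rough illustration of how these measures relate, the Python sketch below evaluates $\eta^2$, $\omega^2$ and $f^2$ using the formulas listed above. The function names and the one-way ANOVA summary values (sums of squares and degrees of freedom) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def eta_squared(ss_treatment, ss_total):
    """Proportion of total variation attributed to the treatment."""
    return ss_treatment / ss_total


def omega_squared(ss_treatment, ss_total, df_treatment, ms_error):
    """Less biased estimate of the population variance explained."""
    return (ss_treatment - df_treatment * ms_error) / (ss_total + ms_error)


def cohens_f2(r_squared):
    """Cohen's f^2 from a squared (multiple) correlation."""
    return r_squared / (1.0 - r_squared)


# Hypothetical ANOVA summary: 3 groups (df_treatment = 2), 30 subjects (df_error = 27).
ss_treatment, ss_error = 40.0, 60.0
ss_total = ss_treatment + ss_error
df_treatment, df_error = 2, 27
ms_error = ss_error / df_error

eta2 = eta_squared(ss_treatment, ss_total)
omega2 = omega_squared(ss_treatment, ss_total, df_treatment, ms_error)
f2 = cohens_f2(eta2)  # treating eta^2 as the R^2 of the one-way model

print(f"eta^2 = {eta2:.4f}, omega^2 = {omega2:.4f}, f^2 = {f2:.4f}")
```

In this made-up example $\omega^2$ comes out smaller than $\eta^2$, which is the direction one would expect from a correction for upward bias.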

===Applications===
*[http://www.sciencedirect.com/science/article/pii/0197245681900015 This article], titled Introduction To Sample Size Determination And Power Analysis For Clinical Trials, reviews the importance of sample size in clinical trials and presents a general method from which specific equations are derived for sample size determination and power analysis for a wide variety of statistical procedures. The paper discusses the method in detail, with illustrations for the t test, tests for proportions, tests for survival time and tests for correlations that commonly occur in clinical trials.

*[http://onlinelibrary.wiley.com/doi/10.1111/j.1469-185X.2007.00027.x/pdf This article] advocates presenting measures of the magnitude of effects (i.e., effect size statistics) and their confidence intervals in all biological journals. It illustrates the combined use of an effect size and its confidence interval, which enables one to assess the relationships within data more effectively than the use of p values, regardless of statistical significance. It focuses on standardized effect size statistics and extensively discusses two dimensionless classes of effect size statistics: d statistics (standardized mean difference) and r statistics (correlation coefficient), because these can be calculated from almost all study designs and because their calculation is essential for meta-analysis. The paper provides potential solutions for four main technical problems researchers may encounter when calculating effect sizes and CIs: (1) when covariates exist, (2) when bias in estimating effect size is possible, (3) when data have non-normal error structure and/or variances, and (4) when data are non-independent.

*[http://www.sciencedirect.com/science/article/pii/019724569090005M This article] reviews methods of sample size and power calculation for the most commonly used study designs. It presents two generic formulae for sample size and power calculation, from which the commonly used methods are derived. It also illustrates the calculation with a computer program, which can be used for studies with dichotomous, continuous, or survival response measures.

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html Student Calculator]
*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Normal T Chi-Squared F Tables]

===Problems===
Other things being equal, which of the following actions will reduce the power of a hypothesis test?

I. Increasing sample size.  II. Increasing significance level.  III. Increasing beta, the probability of a Type II error.

:(A) I only
:(B) II only  
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose a researcher conducts an experiment to test a hypothesis. If she doubles her sample size, which of the following will increase?

I. The power of the hypothesis test.  II. The effect size of the hypothesis test.  III. The probability of making a Type II error.

:(A) I only
:(B) II only
:(C) III only
:(D) All of the above
:(E) None of the above

Suppose we have the following measurements taken. Calculate the corresponding power, specificity and sensitivity.
<center>
{| class="wikitable" style="text-align:center;width: 25%"border="1"
|-
| colspan=2 rowspan=2| || colspan=2| '''Actual Condition'''
|-
| '''Absent ($H_0$ is true)''' || '''Present ($H_1$ is true)'''
|-
| rowspan=2| '''Test Result'''|| '''Negative(fail to reject $H_0$)''' || 0.983 || 0.0025
|-
| '''Positive (reject $H_0$)''' || 0.0085 ||0.0055
|}
</center>

Suppose we are running a test on a simple experiment where the population standard deviation is $ 0.06$. $H_0: \mu=0$ vs. $H_a: \mu=0.5$. With a type I error rate of 5%, what would be a reasonable sample size if we want to achieve at least 98% power?

===References===

*[http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Hypothesis_Basics SOCR]

*[http://en.wikipedia.org/wiki/Sample_size_determination Sample Size Determination Wikipedia]

*[http://en.wikipedia.org/wiki/Effect_size Effect Size Wikipedia]

*[http://books.google.com/books?id=whF18jCxyv0C&pg=PT4&lpg=PT4&dq=e-Study+Guide+for+Statistics+for+the+Behavioral+Sciences,+textbook+by+Susan&source=bl&ots=9vlDcJMtv1&sig=lUFE0l5GeZdyX8iasXUNgSpb6UI&hl=en&sa=X&ei=CQoGVMjCNs_HgwTi1YCICw&ved=0CD8Q6AEwAw#v=onepage&q=e-Study%20Guide%20for%20Statistics%20for%20the%20Behavioral%20Sciences%2C%20textbook%20by%20Susan&f=false e-Study Guide for Statistics for the Behavioral Sciences, textbook by Susan]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PowerSensitivitySpecificity}}

SMHS CIs

2014-09-02T15:13:41Z

Jslavine: /* Theory */

==[[SMHS| Scientific Methods for Health Sciences]] - Point and Interval Estimation: MoM and MLE ==

===Overview===
The estimation of population parameters is critical in many applications. In statistics, estimation uses a combination of effect sizes, confidence intervals, and meta-analysis to plan experiments, analyze data and interpret results. It is most frequently carried out in terms of point-estimates or interval estimates for population parameters of interest. This lesson aims to study the various methodologies commonly used in point and interval estimates like Method of Moments (MOM) and Maximum Likelihood Estimation (MLE). We are interested in methods of estimating population parameters based on a sample distribution. We illustrate point and interval estimations of population means, proportions, and variances using methods introduced in this class.

===Motivation===
Suppose we want to estimate the probability of a head occurring when we flip a specific coin by repeating the experiment several times. How much confidence do we have in our estimation? There are a number of other similar situations where we need to evaluate, predict or estimate a population parameter of interest using an observed data sample. The method of moments (MOM) and maximum likelihood estimation (MLE) are among the most commonly used methods to estimate various population parameters.

In point and interval estimation, not only do we need to consider the distribution and model on which the estimates are based, we also need to make assumptions in terms of the population distribution. Additionally, the estimates of parameters are influenced by other population parameters. For example, the estimate of the mean of the population is influenced by parameters like variance and sample size.

A confidence interval is an interval constructed so that it contains the true value of a parameter of interest for $(1-\alpha)100\%$ of samples taken. It is called a $(1-\alpha)100\%$ confidence interval for that parameter, and the ends of the CI are called confidence limits.

===Theory===
====Method of Moments (MOM) Estimation====
This method uses the sample data to calculate some sample moments and then sets these equal to their corresponding population counterparts. Steps:
# Determine the k parameters of interest and the specific distribution for this process;
# Compute the first $k$ (or more) sample moments;
# Set the sample moments equal to the population moments and solve for a (linear or non-linear) system of $k$ equations with $k$ unknowns.

* MOM proportion example: Consider the example of flipping a coin 10 times and recording the outcomes of heads and tails. We use the outcomes to infer the true probability of a head ($p=P(Head)$). Suppose we observe the outcome $\{H,T,T,T,T,H,H,T,H,T\}$. In the MOM framework, this is a [[SMHS_ProbabilityDistributions#Binomial_distribution |Binomial experiment]] with $n=10$ and $E[X]=np$, where $X$ is the number of heads in the experiment. We observe $X=4$ heads, so setting $np=4$ gives the MOM estimate $\hat{p} = \frac{4}{10}$.

* MOM Beta distribution example: Suppose we have 10 observations we suspect came from a [http://www.distributome.org/js/calc/BetaCalculator.html Beta distribution].
<center>
{| class="wikitable" style="text-align:center; width:75%"border="1"
|-
! Data||0.055||1.005||0.075||0.005||0.075||1.005||0.005||0.035||0.225
|}
</center>

The beta distribution mean and variance are defined explicitly in terms of two parameters.
* Mean: $\mu=\frac{\alpha}{\alpha+\beta}$,
* Variance: $\sigma^2=\frac{\alpha \beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}$.

The sample mean and sample variance are $\bar{x} = 0.251$, and $s^2=0.6187$. Solve for $\alpha$ and $\beta$.
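
Equating the sample mean $m$ and sample variance $v$ to the Beta mean and variance and solving gives $\alpha = m\left(\frac{m(1-m)}{v}-1\right)$ and $\beta=(1-m)\left(\frac{m(1-m)}{v}-1\right)$, which requires $0<v<m(1-m)$. The Python sketch below implements this solution; the function name <code>beta_mom</code> and the first pair of inputs are hypothetical, chosen so that the answer is easy to verify by hand.

```python
def beta_mom(sample_mean, sample_var):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution.

    Raises ValueError when the moments are incompatible with any Beta
    distribution, i.e. unless 0 < mean < 1 and 0 < var < mean * (1 - mean).
    """
    m, v = sample_mean, sample_var
    if not (0.0 < m < 1.0) or not (0.0 < v < m * (1.0 - m)):
        raise ValueError("moments lie outside the Beta parameter space")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


# Hypothetical moments: mean 0.25, variance 0.0375 correspond to Beta(1, 3).
alpha_hat, beta_hat = beta_mom(0.25, 0.0375)
print(alpha_hat, beta_hat)

# The summary values quoted in the text fail the compatibility check,
# because 0.6187 is larger than 0.251 * (1 - 0.251), which is about 0.188.
try:
    beta_mom(0.251, 0.6187)
    quoted_values_ok = True
except ValueError as err:
    quoted_values_ok = False
    print("quoted values:", err)
```

If the quoted summary values are taken at face value, the moment equations have no solution with positive $\alpha$ and $\beta$, so the inputs would be worth rechecking before solving.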

====Maximum likelihood estimation (MLE)====
Modeling the parameters of a distribution using MLE based on observed real world data offers a way to tune the free parameters of a model to provide an optimum fit.

Suppose we observe a sample, $x_1,x_2,\ldots,x_n$, of $n$ values from one distribution with probability density or mass function $f_\theta$, and we are trying to estimate the parameter $\theta$. We can compute the (multivariate) probability density associated with our observed data $f_\theta (x_1,x_2,\ldots,x_n \mid \theta)$ as a function of $\theta$ with $x_1,x_2,\ldots,x_n$ fixed. The likelihood function is
$$L(\theta)=f_\theta (x_1,x_2,\ldots,x_n \mid \theta).$$

The MLE of $\theta$ is the value of $\theta$ that maximizes $L(\theta)$: $\arg\max_\theta{L(\theta)}.$

It is typically assumed that the observed data are independent and identically distributed (iid) with unknown parameter $\theta$. The likelihood can then be written as a product of $n$ univariate probability densities: $L(\theta)=\prod_{i=1}^n {f_\theta (x_i \mid \theta)}$. Because maxima are unaffected by monotone transformations, one can take the logarithm of this expression and turn it into a sum: $L^* (\theta)=\sum_{i=1}^n {\ln{f_\theta (x_i \mid \theta)}}$. The maximum of this expression can then be found numerically using various optimization algorithms.

* Note: The MLE may not be unique and is not guaranteed to exist.

* Example: Consider the coin flipping example above, in which we observed 4 heads in 10 flips, and use this to infer the true probability $p=P(Head)$.
: Likelihood function: $L(\theta)=f(x \mid \theta=p)={10 \choose 4} p^4 (1-p)^6$
: Log-likelihood function: $L^* (\theta)=\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}$.
: Maximize the log-likelihood function by setting its first derivative to zero and solving for $p$:
$$ 0=\frac{d}{dp}\left(\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}\right) =\frac{4}{p}-\frac{6}{1-p}, \quad \text{so } \hat{p}=\frac{2}{5}.$$
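
When no closed-form solution is available, the log-likelihood is maximized numerically. The Python sketch below does this for the coin example with a golden-section search, one simple choice among many optimization algorithms and not necessarily the one used elsewhere in this lesson; the function names are hypothetical. It relies on the log-likelihood being unimodal on $(0,1)$.

```python
import math


def log_likelihood(p, heads=4, flips=10):
    """Binomial log-likelihood for observing `heads` successes in `flips` trials."""
    return (math.log(math.comb(flips, heads))
            + heads * math.log(p)
            + (flips - heads) * math.log(1.0 - p))


def golden_section_max(f, lo, hi, tol=1e-10):
    """Locate the maximizer of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            a = c   # the maximum lies in [c, b]
        else:
            b = d   # the maximum lies in [a, d]
    return (a + b) / 2.0


# Keep the search strictly inside (0, 1) so the logarithms stay finite.
p_hat = golden_section_max(log_likelihood, 1e-9, 1.0 - 1e-9)
print(f"numerical MLE of p: {p_hat:.4f}")
```

The numerical answer should agree with the analytic solution $\hat{p}=2/5$ up to the tolerance of the search.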

====MOM vs. MLE====
* When the family of distributions is known, the MOM is generally considered inferior to Fisher’s MLE method because ML estimates have a higher probability of being close to the quantities to be estimated.
* MLE may be intractable in some situations, whereas the MOM estimates can be quickly and easily calculated by hand or using a computer.
* MOM estimates may be used as first approximations to the solutions of the MLE method, and successive improved approximations may then be found by the [http://en.wikipedia.org/wiki/Newton-Raphson_method Newton-Raphson method]. In this respect, the MOM and MLE are symbiotic.
* Sometimes, MOM estimates may fall outside of the parameter space, which makes them unreliable; this is never a problem with the ML method.
* MOM estimates are not necessarily sufficient statistics, i.e., they sometimes fail to take into account all relevant information in the sample.
* MOM may be preferred to MLE for estimating some structural parameters when appropriate probability distributions are unknown.

===Student’s T Distribution===
Student’s t distribution is used to estimate the mean of a normally distributed population when the sample size is small and the population variance is unknown. It is the basis of the popular Student’s t-tests for the statistical significance of the difference between two sample means, and for confidence intervals for the difference between two population means.

Suppose $X_1,X_2,\ldots,X_n$ are independent random variables that are normally distributed with expected value $\mu$ and variance $\sigma^2$. The sample mean is $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n{X_i}$ and the sample variance is $S_n^2=\frac{1}{n-1} \sum_{i=1}^n{(X_i-\bar{X}_n)^2}$. Since the sample mean $\bar{X}_n$ is normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, the statistic
$$Z=\frac{\bar{X}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$$
is normally distributed with mean 0 and variance 1. The statistic
$$T=\frac{\bar{X}_n-\mu}{\frac{S_n}{\sqrt{n}}}$$

replaces the unknown $\sigma$ with the sample standard deviation $S_n$ and follows Student’s t distribution with $n-1$ degrees of freedom. Also, $(n-1) \frac{S_n^2}{\sigma^2}$ has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-square distribution]] $\chi_{n-1}^2$ with $n-1$ degrees of freedom.

* Example: suppose a study involves 25 patients and the relevant measurements are recorded:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Variable ||N || N* || Mean ||SE of Mean||StDev ||Minimum || Q1|| Median || Q3 ||Maximum
|-
| CD4 || 25|| 0 ||321.4|| 14.8 || 73.8 ||208.0 ||261.5 || 325.0 ||394.0 || 449.0
|}
</center>

What do we know from the background information?
: $\bar{y}= 321.4$
: $s = 73.8$
: $SE = 14.8$
: $n = 25$

: $CI(\alpha)=CI(0.05)$: $\bar{y} \pm t_{\alpha\over 2} {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{\frac{(y_i-\bar{y})^2}{n-1}}}.$

: $321.4 \pm t_{(24, 0.025)}{73.8\over \sqrt{25}}$
: $321.4 \pm 2.064\times 14.8$
: $[290.85, 351.95]$
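
The arithmetic above can be reproduced in a few lines. The following is a minimal Python sketch that assumes only the summary statistics from the table; the function name <code>t_interval</code> is hypothetical, and the critical value $t_{(24, 0.025)}=2.064$ is taken from the text rather than computed. Because the sketch uses the unrounded standard error $73.8/\sqrt{25}=14.76$ instead of the rounded 14.8, its limits differ slightly from those above.

```python
import math

def t_interval(mean, sd, n, t_crit):
    """Return (lower, upper) limits of mean +/- t_crit * sd / sqrt(n)."""
    se = sd / math.sqrt(n)          # standard error of the mean
    margin = t_crit * se            # half-width of the interval
    return mean - margin, mean + margin

# CD4 summary statistics from the table; 2.064 is the T(df=24) critical
# value at alpha/2 = 0.025, as quoted in the text (not computed here).
lower, upper = t_interval(mean=321.4, sd=73.8, n=25, t_crit=2.064)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```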

====Estimating a population mean with large samples====
We use the following protocol to find point and interval estimates when the sample sizes are large, say exceeding 100.
* Assumptions: The [[SMHS_CLT_LLN|Central Limit Theorem]] guarantees that for large samples, the method above provides a valid recipe for constructing a confidence interval for the population mean, no matter what the distribution for the observed data may be. Of course, for significantly non-Normal distributions, we may need to increase the sample size to guarantee that the sampling distribution of the mean is approximately Normal.

* Point estimation of population mean: $\bar{X_n}={1\over n}\sum_{i=1}^n{X_i}$, constructed from a random sample of the process {$X_1, X_2, X_3, \cdots , X_n$}, which is an [http://en.wikipedia.org/wiki/Estimator_bias unbiased] estimate of the population mean $\mu$, if it exists! Note that the [[AP_Statistics_Curriculum_2007_EDA_Center | sample average may be susceptible to outliers]].

* Interval estimation of a population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> will be
: <math>CI(\alpha): \overline{x} \pm z_{\alpha\over 2} E,</math>
:: The '''Error''' term, E, is defined as
:: <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimated <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>

* <math>z_{\alpha\over 2}</math> is the [[AP_Statistics_Curriculum_2007_Normal_Critical | Critical Value]] for a [[AP_Statistics_Curriculum_2007_Normal_Std |Standard Normal]] distribution at <math>{\alpha\over 2}</math>.

* Example: a random sample of the number of sentences found in 30 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11, 17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12. The point estimate is the sample mean, <math>\overline{x}=14.77</math>.
A confidence interval estimate of μ is a range of values used to estimate a population parameter.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16\over \sqrt{30}}=[11.03;18.51]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16\over \sqrt{30}}=[9.96;19.57]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16\over \sqrt{30}}=[7.24;22.29]</math></center>
: Notice that the CIs widen (directly related to the decrease of <math>\alpha</math>), reflecting our choice of higher confidence.

** Unknown variance: use the sample variance 273 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=16.54</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16.54\over \sqrt{30}}=[10.90;18.63]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16.54\over \sqrt{30}}=[9.80;19.73]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16.54\over \sqrt{30}}=[6.99;22.54]</math></center>
: Notice that the CIs widen (directly related to the decrease of <math>\alpha</math>), reflecting our choice of higher confidence.

: You can use the [http://www.socr.ucla.edu/htmls/ana/ConfidenceInterval_Analysis.html SOCR CI Analysis Applet] to compute these interval estimates.
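
As a cross-check of the large-sample recipe, here is a minimal Python sketch using only the standard library. The function name <code>z_interval</code> and its arguments are hypothetical; the critical value is computed with <code>statistics.NormalDist().inv_cdf</code> instead of being read from a table, so the limits may differ from hand calculations in the last digit.

```python
import math
from statistics import NormalDist, mean, stdev

# Number of sentences in the 30 magazine advertisements from the example
sentences = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11,
             17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12]

def z_interval(data, conf_level, sigma=None):
    """Normal-based CI for the mean; uses the sample SD when sigma is None."""
    n = len(data)
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)            # critical value z_{alpha/2}
    sd = sigma if sigma is not None else stdev(data)   # stdev divides by n-1
    margin = z * sd / math.sqrt(n)
    center = mean(data)
    return center - margin, center + margin

for level in (0.80, 0.90, 0.99):
    known = z_interval(sentences, level, sigma=16)     # known variance 256
    unknown = z_interval(sentences, level)             # sample SD as estimate
    print(level, [round(v, 2) for v in known], [round(v, 2) for v in unknown])
```

Passing <code>sigma</code> corresponds to the known-variance case, and omitting it corresponds to the unknown-variance case where the sample standard deviation is used.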

====Estimating a population mean with small samples (say <30 observations)====
For small samples, the point estimates are less precise and the interval estimates produce wider intervals, compared to the case of large samples.

* Assumptions: we need evidence that the data we observed and used for point and interval estimates come from a distribution that is (approximately) normal. If this assumption is violated, then the interval estimate we are going to introduce may significantly misrepresent the real confidence interval.
* Interval estimation of population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> is defined in terms of the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html T-distribution]:
:: <math>CI(\alpha): \overline{x} \pm t_{\alpha\over 2} E.</math>
:: The '''Error''' term, E, is defined as <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimate <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>
:: $t_{\alpha\over 2}$ is the [[AP_Statistics_Curriculum_2007_StudentsT |Critical Value for the T(df=sample-size -1) distribution at <math>{\alpha\over 2}</math>]].

* Example: a random sample of the number of sentences found in 10 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12. The point estimate is the sample mean, 22.1.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{16\over \sqrt{10}}=22.1 \pm 1.383{16\over \sqrt{10}}=[15.10 ; 29.10]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{16\over \sqrt{10}}=22.1 \pm 1.833{16\over \sqrt{10}}=[12.83 ; 31.37]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{16\over \sqrt{10}}=22.1 \pm 3.250{16\over \sqrt{10}}=[5.66 ; 38.54]</math></center>
:: Notice that the CIs widen (directly related to the decrease of <math>\alpha</math>), reflecting our choice of higher confidence.

** Unknown variance: Suppose that we do '''not''' know the variance for the ''number of sentences per advertisement'' but use the sample variance 737.88 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=27.16390579</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{27.16390579\over \sqrt{10}}=22.1 \pm 1.383{27.16390579\over \sqrt{10}}=[10.22 ; 33.98]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{27.16390579\over \sqrt{10}}=22.1 \pm 1.833{27.16390579\over \sqrt{10}}=[6.35 ; 37.85]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{27.16390579 \over \sqrt{10}}=22.1 \pm 3.250{27.16390579\over \sqrt{10}}=[-5.82 ; 50.02]</math></center>
:: Notice that the CIs widen (directly related to the decrease of <math>\alpha</math>), reflecting our choice of higher confidence.
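
The unknown-variance intervals for the small sample can also be computed directly from the raw data. The sketch below is a minimal Python illustration; the names <code>T_CRIT_DF9</code> and <code>t_intervals</code> are hypothetical, and the T(df=9) critical values are copied from the text because the Python standard library does not provide t quantiles.

```python
import math
from statistics import mean, stdev

# The 10 observations from the small-sample example
sample = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12]

# Critical values of T(df = 9) quoted in the text, keyed by confidence level
T_CRIT_DF9 = {0.80: 1.383, 0.90: 1.833, 0.99: 3.250}

def t_intervals(data, t_table):
    """Return {confidence level: (lower, upper)} using the sample SD."""
    n = len(data)
    center = mean(data)
    se = stdev(data) / math.sqrt(n)   # stdev uses the n-1 denominator
    return {level: (center - t * se, center + t * se)
            for level, t in t_table.items()}

intervals = t_intervals(sample, T_CRIT_DF9)
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: [{lower:.2f}, {upper:.2f}]")
```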

====Estimating a population proportion====
When the sample size is large, the sampling distribution of the sample proportion <math>\hat{p}</math> is approximately Normal, by [[AP_Statistics_Curriculum_2007_Limits_CLT |CLT]], as the sample proportion may be presented as a [[AP_Statistics_Curriculum_2007_Limits_Norm2Bin |sample average of Bernoulli random variables]]. When the sample size is small, the normal approximation may be inadequate. To accommodate this, we will modify the '''sample-proportion''' <math>\hat{p}</math> slightly and obtain the '''corrected-sample-proportion''' <math>\tilde{p}</math>:
: <math>\hat{p}={y\over n} \longrightarrow \tilde{p}={y+0.5z_{\alpha \over 2}^2 \over n+z_{\alpha \over 2}^2},</math>
where [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2}</math> is the normal critical value we saw earlier]].

The standard error of <math>\hat{p}</math> also needs a slight modification
: <math>SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})\over n} \longrightarrow SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})\over n+z_{\alpha \over 2}^2}.</math>

* Example: Suppose a researcher is interested in studying the effect of aspirin in reducing heart attacks. He randomly recruits 500 subjects with evidence of early heart disease and has them take one aspirin daily for two years. At the end of the two years, he finds that during the study only 17 subjects had a heart attack. Calculate a 95% (<math>\alpha=0.05</math>) confidence interval for the true (unknown) proportion of subjects with early heart disease that have a heart attack while taking aspirin daily. Note that [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2} = z_{0.025}=1.96</math>]]:
:: <math>\hat{p} = {17\over 500}=0.034</math> ; <math>\tilde{p} = {17+0.5z_{0.025}^2\over 500+z_{0.025}^2}= {17+1.92\over 500+3.84}=0.038</math>
:: <math>SE_{\hat{p}}= \sqrt{0.034(1-0.034)\over 500}=0.0081</math>; <math>SE_{\tilde{p}}= \sqrt{0.038(1-0.038)\over 500+3.84}=0.0085</math>
::And the corresponding confidence intervals are given by
:: <math>\hat{p}\pm 1.96 SE_{\hat{p}}=[0.0181, 0.0499]</math>
:: <math>\tilde{p}\pm 1.96 SE_{\tilde{p}}=[0.0213, 0.0547]</math>

:: [[AP_Statistics_Curriculum_2007_Estim_Proportion#Sample-Size_Estimation_2|See this example of estimation of sample-size, given margin of error]]
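
The plain and corrected proportion intervals can be compared side by side with a short script. This is a minimal Python sketch, assuming the formulas for <math>\hat{p}</math>, <math>\tilde{p}</math> and their standard errors given above; the function name <code>proportion_intervals</code> is hypothetical. The sketch keeps full precision throughout, so its limits can differ in the third or fourth decimal from hand calculations that round intermediate values.

```python
import math
from statistics import NormalDist

def proportion_intervals(successes, n, conf_level=0.95):
    """Return the plain and corrected CIs for a proportion as (lower, upper) pairs."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # z_{alpha/2}
    # Plain sample proportion and its standard error
    p_hat = successes / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    # Corrected proportion and standard error, as defined in this section
    p_tilde = (successes + 0.5 * z**2) / (n + z**2)
    se_tilde = math.sqrt(p_tilde * (1 - p_tilde) / (n + z**2))
    plain = (p_hat - z * se_hat, p_hat + z * se_hat)
    corrected = (p_tilde - z * se_tilde, p_tilde + z * se_tilde)
    return plain, corrected

# Aspirin example: 17 heart attacks among 500 subjects
plain, corrected = proportion_intervals(successes=17, n=500)
print("plain:    ", [round(v, 4) for v in plain])
print("corrected:", [round(v, 4) for v in corrected])
```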

====Estimating population variance====
An unbiased point estimate for the population variance <math>\sigma^2</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample-Variance (<math>s^2</math>)]] and the point estimate for the population standard deviation <math>\sigma</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample Standard Deviation (s)]].

We use a [http://en.wikipedia.org/wiki/Chi_square_distribution Chi-Square Distribution] to construct confidence intervals for the variance and standard deviation. If the process or phenomenon we study generates a Normal random variable, then the following random variable (computed for a sample of size <math>n>1</math>) has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-Square Distribution]]
: <math>\chi_o^2 = {(n-1)s^2 \over \sigma^2}</math>

* Chi-Square Distribution Properties
** All chi-squares values <math>\chi_o^2 \geq 0</math>.
** The chi-square distribution is a family of curves, each is determined by the degrees of freedom (n-1). See the interactive [http://socr.ucla.edu/htmls/SOCR_Distributions.html Chi-Square distribution].
** To form a confidence interval for the variance (<math>\sigma^2</math>), use the <math>\chi^2(df=n-1)</math> distribution with degrees of freedom equal to one less than the sample size.
** The area under each curve of the Chi-Square Distribution equals one.
** All Chi-Square Distributions are positively skewed.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig1.jpg|500px]]</center>

* Interval Estimates of Population Variance and Standard Deviation:
:: Notice that the Chi-Square Distribution is '''not''' symmetric (positively skewed) and therefore, there are two critical values for each level of confidence. The value <math>\chi_L^2</math> represents the left-tail critical value and <math>\chi_R^2</math> represents the right-tail critical value. For various degrees of freedom and areas, you can compute all critical values either using the [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions] or using the [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Chi-square Distribution Calculator].

::: Example: Find the critical values, <math>\chi_L^2</math> and <math>\chi_R^2</math>, for a 90% confidence interval when the sample size is 25. Use the following Protocol:
::: Identify the degrees of freedom (<math>df=n-1=24</math>) and the level of confidence (<math>{\alpha\over 2}=0.05</math>).
::: Find the left and right critical values, <math>\chi_L^2=13.848</math> and <math>\chi_R^2=36.415</math>, as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig2.jpg|500px]]</center>

* Confidence Interval for <math>\sigma^2</math>
:: <math>{(n-1)s^2 \over \chi_R^2} \leq \sigma^2 \leq {(n-1)s^2 \over \chi_L^2}</math>

* Confidence Interval for <math>\sigma</math>
:: <math>\sqrt{(n-1)s^2 \over \chi_R^2} \leq \sigma \leq \sqrt{(n-1)s^2 \over \chi_L^2}</math>

====Hands-on Activity====
Construct the confidence intervals for <math>\sigma^2</math> and <math>\sigma</math> assuming the observations below represent a random sample from the liquid content (in fluid ounces) of 16 beverage cans and can be considered as Normally distributed. Use a 90% level of confidence.
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| 14.816 || 14.863 || 14.814 || 14.998 || 14.965 || 14.824 || 14.884 || 14.838 || 14.916 || 15.021 || 14.874 || 14.856 || 14.860 || 14.772 || 14.980 || 14.919
|}
</center>

* Get the sample statistics from [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] (e.g., Index Plot); Sample-Mean=14.8875; Sample-SD=0.072700298, Sample-Var=0.005285333.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig3.jpg|500px]]</center>

* Identify the degrees of freedom (<math>df=n-1=15</math>) and the level of confidence (<math>{\alpha/2}=0.05</math>), as we are looking for a <math>(1-\alpha)100% CI(\sigma^2)</math>.
* Find the left and right critical values, <math>\chi_L^2=7.261</math> and <math>\chi_R^2=24.9958</math> using [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Chi-Square Distribution], as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig4.jpg|500px]]</center>

* CI(<math>\sigma^2</math>)
: <math>0.00318={15\times 0.0053 \over 24.9958} \leq \sigma^2 \leq {15\times 0.0053 \over 7.261}=0.01095</math>

* CI(<math>\sigma</math>)
: <math>0.0564=\sqrt{15\times 0.0053 \over 24.9958} \leq \sigma \leq \sqrt{15\times 0.0053 \over 7.261}=0.10464</math>

** [[AP_Statistics_Curriculum_2007_Estim_Var#More_Examples|See more examples here]].
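
The hands-on activity can be checked with a few lines of code. The following minimal Python sketch assumes the Chi-Square critical values quoted in the activity (they are passed in, not computed, since the standard library has no Chi-Square quantile function); the function name <code>variance_interval</code> is hypothetical. It uses the unrounded sample variance, so the limits may differ slightly from those obtained with the rounded value 0.0053.

```python
import math
from statistics import variance

# Liquid content (fluid ounces) of the 16 beverage cans
cans = [14.816, 14.863, 14.814, 14.998, 14.965, 14.824, 14.884, 14.838,
        14.916, 15.021, 14.874, 14.856, 14.860, 14.772, 14.980, 14.919]

def variance_interval(data, chi2_left, chi2_right):
    """CI for sigma^2: [(n-1)s^2 / chi2_right, (n-1)s^2 / chi2_left]."""
    n = len(data)
    s2 = variance(data)                 # sample variance, n-1 denominator
    return (n - 1) * s2 / chi2_right, (n - 1) * s2 / chi2_left

# Critical values for df = 15 and alpha/2 = 0.05, as quoted in the activity
var_lower, var_upper = variance_interval(cans, chi2_left=7.261, chi2_right=24.9958)
sd_lower, sd_upper = math.sqrt(var_lower), math.sqrt(var_upper)
print(f"CI(sigma^2): [{var_lower:.5f}, {var_upper:.5f}]")
print(f"CI(sigma):   [{sd_lower:.4f}, {sd_upper:.4f}]")
```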

===Applications===
* [http://www.tandfonline.com/doi/abs/10.1207/.U5ys8BZRXKw This article], titled ''Reliability of Scales with General Structure: Point and Interval Estimation Using a Structural Equation'', discusses a method of obtaining point and interval estimates of reliability for composites of measures with a general structure. The approach is based on fitting a correspondingly constrained structural equation model and generalizes earlier covariance structure analysis methods for scale reliability estimation with congeneric tests. The procedure can be used with weighted or unweighted composites, in which the weights need not be known in advance but may be estimated simultaneously. The method presented in this paper allows one to obtain an approximate standard error and confidence interval for scale reliability using the bootstrap.

* [[SOCR_EduMaterials_ModelerActivities_NormalBetaModelFit| This activity]] shows normal and beta distribution model fit. It describes the process of SOCR model fitting in the case of using Normal or Beta distribution models. The article aims to motivate the need for analytical modeling of natural processes, illustrates how to use SOCR Modeler to fit models to real data, and presents applications of model fitting. It provides specific examples illustrating model fitting and two exercises to practice and learn.

* [[SOCR_EduMaterials_Activities_General_CI_Experiment|This experiment]] presents the SOCR activity on general confidence intervals and demonstrates the usage and functionality of the SOCR general confidence interval applet. It demonstrates the theory behind the use of interval-based estimates of parameters, illustrates various confidence interval construction recipes, draws parallels between the construction algorithms and the intuitive meaning of confidence intervals, and presents a new technology-enhanced approach for understanding and utilizing confidence intervals in various applications. The article presents a specific example and exercises on this topic and works as a good supplement to point and interval estimates.

===Software===
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Tables]
* [http://socr.ucla.edu/htmls/exp/Confidence_Interval_Experiment_General.html SOCR General CI Experiment]
* [http://socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler]
* [http://socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
* [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

===Problems===
* Tom is in charge of sampling sugar measurements from a very large population of sugar. Lately his standard errors have been alarmingly high for his sample means. If he wants to decrease his sampling error (standard deviation of his sample means) by 1/2, what should he do?
: (a) Quadruple the variation inherent in the population.
: (b) Triple his sample size.
: (c) Quadruple his sample size.
: (d) Halve his sample size.

* The average standardized math score for eighth graders in the state of Michigan is 70 and the standard deviation is 10. We want to find out if the average standardized math score in district A is higher than the average score for the state of Michigan. The mean for a random sample of 36 students from this district is 72. What is the best response?
: (a) The p-value is around 0.76 and it is concluded that the average standardized math score in this district is not different from the overall population mean.
: (b) The p-value is around 0.12 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (c) The p-value is around 0.24 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (d) The p-value is around 0.88 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.

* A random sample of 121 students from the UMich was selected to estimate the average ACT score of all UMich students. The average for the sample was 23.4 and the sample standard deviation was 3.65. If you wanted to calculate a more precise and accurate prediction of the average ACT score of UMich students, which one of the following would be the best thing to do?
: (a) Decrease the sample size to 91.
: (b) Increase the sample size to 151.
: (c) Increase the confidence level to 99%.
: (d) Decrease the confidence level to 90%.

* How do the shape, center, and spread of t-models change as the degrees of freedom increase?
: (a) The shape and center stay the same, but the spread becomes narrower.
: (b) The shape and center stay the same, but the spread becomes wider.
: (c) The shape and spread stay the same, but the center will increase.
: (d) The shape and spread stay the same, but the center will decrease.

* Estimate the critical value of t for a 95% confidence interval with df = 15
: (a) 1.71
: (b) 2.131
: (c) 1.17
: (d) 3.45

* True or False: In a well-designed sample survey like the Current Population Survey, the observed sample percentage (e.g., percentage unemployed) is equal to the population percentage. Thus, it is appropriate to just report the sample percentage, without any measure of accuracy (i.e. without the margin of error).
: (a) True
: (b) False

* Suppose an NPR news story reports that: "A polling agency reports that the percentage of the American public who agree we should spend more money on the mental health of the war veterans is 42% +/- 3%." Which of the following is the best interpretation of this report?
: (a) The probability that the American public agree that we should spend more money on the mental health of the war veterans is between 39% to 42%.
: (b) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (c) We are 95% confident that the percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (d) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is 42%.

* A major newspaper wants to hire a polling agency to predict who will be the next governor. Agency A proposes to do the job with a random sample of 5000 voters at a cost of $\$50K$ (K = one thousand). Agency B proposes to do the job with a random sample of 7500 voters at a cost of $\$75K$. Assume both agencies find the percentage of voters to be 40% and both use the normal model to calculate the 95% interval. Which agency will you hire? Hint: Compare the margin of error for the two agencies and the relative costs before making your decision.
: (a) I will hire B.
: (b) I have no preference.
: (c) I need more information to decide who to hire.
: (d) I will hire A.

* Suppose that the proportion of the adult population who jog is 0.15. What is the probability that the proportion of joggers in a random sample of size n =200 lies between 0.13 and 0.17?
: (a) 0.5762 approximately
: (b) 0.8125 approximately
: (c) 0.2345 approximately
: (d) 0.1234 approximately

* Records at a large university indicate that 20% of all freshmen are placed on academic probation at the end of the first semester. A random sample of 100 freshmen found that 25% of them were placed on probation. The results of the sample:
: (a) are surprising since it indicates that 5% more of these freshmen were placed on probation than expected
: (b) are surprising since the standard deviation of the sampling distribution is 0.4%.
: (c) are biased since an increase of 5% could not happen without injecting bias into the sample.
: (d) are not surprising since the standard deviation of the sampling distribution is 4%.
: (e) are surprising since SAT scores have increased over the past years

* We have discussed that the standard deviation of the distribution of sample percentages, $SE(\hat{p})$ is calculated by taking the square root of $\frac{\hat{p}(1-\hat{p})}{N}$, where $\hat{p}$ is the proportion in the sample and N is sample size. What does $SE(\hat{p})$ show?
: (a) It shows the standard error of the mean across repeated samples from the population.
: (b) It shows the distribution of $\hat{p}$ for the single sample that the researcher draws from the population.
: (c) It shows the standard deviation of $\hat{p}$ for repeated samples from the population.
: (d) It shows the variation for $\hat{p}$ values for repeated samples from the population.

===References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_VII:_Point_and_Interval_Estimates SOCR]
* [http://en.wikipedia.org/wiki/Method_of_moments_(statistics) MoM Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_CIs}}

SMHS CIs

2014-09-01T18:40:07Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Point and Interval Estimation: MoM and MLE ==

===Overview===
The estimation of population parameters is critical in many applications. In statistics, estimation uses a combination of effect sizes, confidence intervals, and meta-analysis to plan experiments, analyze data and interpret results. It is most frequently carried out in terms of point-estimates or interval estimates for population parameters of interest. This lesson aims to study the various methodologies commonly used in point and interval estimates like Method of Moments (MOM) and Maximum Likelihood Estimation (MLE). We are interested in methods of estimating population parameters based on a sample distribution. We illustrate point and interval estimations of population means, proportions, and variances using methods introduced in this class.

===Motivation===
Suppose we want to estimate the probability of a head occurring when we flip a specific coin by repeating the experiment several times. How much confidence do we have in our estimation? There are a number of other similar situations where we need to evaluate, predict or estimate a population parameter of interest using an observed data sample. The method of moments (MOM) and maximum likelihood estimation (MLE) are among the most commonly used methods to estimate various population parameters.

In point and interval estimation, not only do we need to consider the distribution and model on which the estimates are based, we also need to make assumptions in terms of the population distribution. Additionally, the estimates of parameters are influenced by other population parameters. For example, the estimate of the mean of the population is influenced by parameters like variance and sample size.

A confidence interval is an interval that contains the true value of a parameter of interest for $(1-α)100\%$ of samples taken. It is called a $(1-\alpha)100\%$ confidence interval for that parameter, and the ends of the CI are called confidence limits.

===Theory===
====Method of Moments (MOM) Estimation====
This method uses the sample data to calculate some sample moments and then sets these equal to their corresponding population counterparts. Steps:
# Determine the k parameters of interest and the specific distribution for this process;
# Compute the first $k$ (or more) sample moments;
# Set the sample moments equal to the population moments and solve for a (linear or non-linear) system of $k$ equations with $k$ unknowns.

* MOM proportion example: consider flipping a coin 10 times and recording the outcomes of heads and tails. We use the outcomes to infer the true probability of a head ($p=P(Head)$). Suppose we observe the outcome $\{H,T,T,T,T,H,H,T,H,T\}$. This is a [[SMHS_ProbabilityDistributions#Binomial_distribution |Binomial experiment]], where $X$ is the number of heads in the experiment and $E[X]=np$. Setting the population moment equal to the observed count of 4 heads gives $np=10p=4$, hence $MOM(p)=\hat{p} = \frac{4}{10}$.

* MOM Beta distribution example: Suppose we have 10 observations we suspect came from a [http://www.distributome.org/js/calc/BetaCalculator.html Beta distribution].
<center>
{| class="wikitable" style="text-align:center; width:75%"border="1"
|-
! Data||0.055||1.005||0.075||0.005||0.075||1.005||0.005||0.035||0.225
|}
</center>

The beta distribution mean and variance are defined explicitly in terms of two parameters.
* Mean: $μ=\frac{α}{α+β}$,
* Variance: $σ^2=\frac{αβ}{(α+β)^2 (α+β+1)}$.

The sample mean and sample variance are $\bar{x} = 0.251$, and $s^2=0.6187$. Solve for α and β.
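
One way to carry out the solving step: since $σ^2=\frac{μ(1-μ)}{α+β+1}$, matching moments gives $\hat{α}+\hat{β}=\frac{\bar{x}(1-\bar{x})}{s^2}-1$, and then $\hat{α}=\bar{x}(\hat{α}+\hat{β})$ and $\hat{β}=(1-\bar{x})(\hat{α}+\hat{β})$. The minimal Python sketch below implements these formulas; the function name <code>beta_mom</code> is hypothetical, and the moments passed to it are illustrative values chosen for this sketch, not the data in the table. The formulas yield valid (positive) estimates only when $s^2 < \bar{x}(1-\bar{x})$, so the sketch rejects other inputs.

```python
def beta_mom(sample_mean, sample_var):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution."""
    if not (0 < sample_mean < 1) or sample_var <= 0:
        raise ValueError("need 0 < mean < 1 and a positive variance")
    # From var = mean * (1 - mean) / (alpha + beta + 1), solve for alpha + beta
    total = sample_mean * (1 - sample_mean) / sample_var - 1
    if total <= 0:
        raise ValueError("variance too large for a Beta distribution")
    return sample_mean * total, (1 - sample_mean) * total

# Hypothetical moments chosen for illustration (not the data above):
# a Beta(1, 3) distribution has mean 0.25 and variance 0.0375.
alpha_hat, beta_hat = beta_mom(sample_mean=0.25, sample_var=0.0375)
print(round(alpha_hat, 6), round(beta_hat, 6))
```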

====Maximum likelihood estimation (MLE)====
Modeling distribution parameters using MLE estimation based on observed real world data offers a way of tuning the free parameters of the model to provide an optimum fit.

Suppose we observe a sample $x_1,x_2,…,x_n$ of $n$ values from one distribution with probability density/mass function $f_θ$, and we are trying to estimate the parameter $θ$. We can compute the (multivariate) probability density associated with our observed data, $f_θ (x_1,x_2,…,x_n│θ)$. As a function of $θ$ with $x_1,x_2,…,x_n$ fixed, the likelihood function is
$$L(θ)=f_θ (x_1,x_2,…,x_n│θ).$$

The MLE of $θ$ is the value of $θ$ that maximizes $L(θ)$: $\arg\max_θ{L(θ)}.$

It is typically assumed that the observed data are independent and identically distributed (iid) with unknown parameter $θ$. The likelihood can then be written as a product of $n$ univariate probability densities: $L(θ)=\prod_{i=1}^n {f_θ (x_i |θ)}$. Since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum: $L^* (θ)=\sum_{i=1}^n {\ln{f_θ (x_i |θ)}}$. The maximum of this expression can then be found numerically using various optimization algorithms.

* Note: The MLE may not be unique, or guaranteed to exist.

* Example: consider the coin flipping example above, observing the number of heads in the outcomes and using this to infer the true probability of p(Head).
: Likelihood function: $L(θ)=f(x│θ=p)={10 \choose 4} p^4 (1-p)^6$
: Log-likelihood function: $L^* (θ)=\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}$.
: Maximize the log-likelihood function by setting its first derivative to zero:
$$ 0=\frac{d}{dp}\left(\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}\right) =\frac{4}{p}-\frac{6}{1-p} \Longrightarrow \hat{p}=\frac{2}{5}.$$
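
When no closed-form solution is available, the log-likelihood can be maximized numerically. As a minimal Python sketch of that idea, applied to the coin example, the code below uses a simple ternary search, which is one of many possible optimizers and works here because the log-likelihood has a single peak on $(0,1)$. The function names <code>log_likelihood</code> and <code>maximize</code> are hypothetical.

```python
import math

def log_likelihood(p, heads=4, flips=10):
    """Binomial log-likelihood for the coin example (4 heads in 10 flips)."""
    return (math.log(math.comb(flips, heads))
            + heads * math.log(p)
            + (flips - heads) * math.log(1 - p))

def maximize(f, lo=1e-9, hi=1 - 1e-9, tol=1e-10):
    """Ternary search for the maximizer of a single-peaked function on (lo, hi)."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1      # the maximum lies to the right of m1
        else:
            hi = m2      # the maximum lies to the left of m2
    return (lo + hi) / 2

p_hat = maximize(log_likelihood)
print(round(p_hat, 4))
```

The numerical maximizer should agree with the analytic solution of the derivative equation up to the search tolerance.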

====MOM vs. MLE====
* The MOM is generally considered inferior to Fisher’s MLE method, because ML estimates have a higher probability of being close to the quantities being estimated.
* MLE may be intractable in some situations, whereas the MOM estimates can be quickly and easily calculated by hand or using a computer.
* MOM estimates may be used as the first approximations to the solutions of the MLE method, and successive improved approximations may then be found by the [http://en.wikipedia.org/wiki/Newton-Raphson_method Newton-Raphson method]. In this respect, the MOM and MLE are symbiotic.
* Sometimes, MOM estimates may be outside of the parameter space, i.e., they are unreliable, which is never a problem with ML method.
* MOM estimates are not necessarily sufficient statistics, i.e., they sometimes fail to take into account all relevant information in the sample.
* MOM may be preferred to MLE for estimating some structural parameters, when appropriate probability distributions are unknown.

===Student’s T Distribution===
Student’s t distribution is the distribution needed to estimate the mean of a normally distributed population when the sample size is small and the population variance is unknown. It is the basis of the popular Student’s t-tests for the statistical significance of the difference between two sample means, and for confidence intervals for the difference between two population means.

Suppose $X_1,X_2,…,X_n$ are independent random variables that are normally distributed with expected value $μ$ and variance $σ^2$. Sample mean: $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n{x_i}$. Sample variance: $S_n^2=\frac{1}{n-1} \sum_{i=1}^n{(x_i-\bar{x}_n)^2}$. The statistic $Z=\frac{\bar{x}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$ is normally distributed with mean 0 and variance 1, since the sample mean ($\bar{x}_n$) is normally distributed with mean $μ$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
$$Z=\frac{\bar{x}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$$
$$T=\frac{\bar{x}_n-\mu}{\frac{S_n}{\sqrt{n}}}$$

$T$ replaces $\sigma$ with the sample standard deviation $S_n$ and follows Student’s t distribution with $n-1$ degrees of freedom. Also, $(n-1) \frac{S_n^2}{\sigma^2}$ has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-square distribution]] $\chi_{n-1}^2$ with degrees of freedom equal to $n-1$.

* Example: suppose a study involves 25 patients and the relevant measurements are recorded:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Variable ||N || N* || Mean ||SE of Mean||StDev ||Minimum || Q1|| Median || Q3 ||Maximum
|-
| CD4 || 25|| 0 ||321.4|| 14.8 || 73.8 ||208.0 ||261.5 || 325.0 ||394.0 || 449.0
|}
</center>

What do we know from the background information?
: $\bar{y}= 321.4$
: $s = 73.8$
: $SE = 14.8$
: $n = 25$

: $CI(\alpha)=CI(0.05)$: $\bar{y} \pm t_{\alpha\over 2} {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{\frac{(y_i-\bar{y})^2}{n-1}}}.$

: $321.4 \pm t_{(24, 0.025)}{73.8\over \sqrt{25}}$
: $321.4 \pm 2.064\times 14.8$
: $[290.85, 351.95]$
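
The CD4 calculation above can be reproduced with a short script. This is a minimal sketch rather than a reference implementation: the function name <code>t_interval</code> is a hypothetical helper, and the critical value 2.064 is simply copied from the worked example (it is not computed by the code).

```python
import math

def t_interval(mean, sd, n, t_crit):
    """Two-sided CI for a mean: mean +/- t_crit * sd / sqrt(n)."""
    se = sd / math.sqrt(n)      # standard error of the mean
    margin = t_crit * se        # half-width of the interval
    return mean - margin, mean + margin, se

# Summary statistics from the CD4 table; t_crit is the tabulated
# t critical value for df = 24 quoted in the example (an input, not computed).
lower, upper, se = t_interval(mean=321.4, sd=73.8, n=25, t_crit=2.064)
print(f"SE = {se:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The script uses the unrounded standard error 73.8/5 = 14.76, so its limits differ slightly from the hand calculation, which rounds the standard error to 14.8.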

====Estimating a population mean with large samples====
We use the following protocol to find point and interval estimates when the sample sizes are large, say exceeding 100.
* Assumptions: The [[SMHS_CLT_LLN|Central Limit Theorem]] guarantees that for large samples, the method above provides a valid recipe for constructing a confidence interval for the population mean, no matter what the distribution for the observed data may be. Of course, for significantly non-Normal distributions, we may need to increase the sample size to guarantee that the sampling distribution of the mean is approximately Normal.

* Point estimation of population mean: $\bar{X_n}={1\over n}\sum_{i=1}^n{X_i}$, constructed from a random sample of the process {$X_1, X_2, X_3, \cdots , X_n$}, which is an [http://en.wikipedia.org/wiki/Estimator_bias unbiased] estimate of the population mean $\mu$, if it exists! Note that the [[AP_Statistics_Curriculum_2007_EDA_Center | sample average may be susceptible to outliers]].

* Interval estimation of a population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> will be
: <math>CI(\alpha): \overline{x} \pm z_{\alpha\over 2} E,</math>
:: The '''Error''' term, E, is defined as
:: <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimated <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>

* <math>z_{\alpha\over 2}</math> is the [[AP_Statistics_Curriculum_2007_Normal_Critical | Critical Value]] for a [[AP_Statistics_Curriculum_2007_Normal_Std |Standard Normal]] distribution at <math>{\alpha\over 2}</math>.

* Example: a random sample of the number of sentences found in 30 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11, 17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12. The point estimate (the sample mean) is 14.77.
A confidence interval estimate of μ is a range of values used to estimate a population parameter.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16\over \sqrt{30}}=[11.03;18.51]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16\over \sqrt{30}}=[9.96;19.57]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16\over \sqrt{30}}=[7.24;22.29]</math></center>
: Notice the increase of the CI's (directly related to the decrease of <math>\alpha</math>) reflecting our choice for higher confidence.

** Unknown variance: use the sample variance 273 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=16.54</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16.54\over \sqrt{30}}=[10.90;18.63]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16.54\over \sqrt{30}}=[9.80;19.73]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16.54\over \sqrt{30}}=[6.99;22.54]</math></center>
: Notice the increase of the CI's (directly related to the decrease of <math>\alpha</math>) reflecting our choice for higher confidence.

: You can use the [http://www.socr.ucla.edu/htmls/ana/ConfidenceInterval_Analysis.html SOCR CI Analysis Applet] to compute these interval estimates.
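
The large-sample intervals above can also be checked with Python's standard library. The sketch below is illustrative only: <code>z_interval</code> is a hypothetical helper name, and the critical values come from <code>statistics.NormalDist</code> rather than from a table, so the last digit may differ from the hand calculations, which use rounded values such as 1.28 and 2.575.

```python
import math
from statistics import NormalDist, mean, stdev

sentences = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11,
             17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12]

def z_interval(data, confidence, sigma=None):
    """Normal-based CI for the mean; uses the sample SD when sigma is not given."""
    n = len(data)
    sd = sigma if sigma is not None else stdev(data)    # stdev uses the n - 1 divisor
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # upper alpha/2 critical value
    margin = z * sd / math.sqrt(n)
    return mean(data) - margin, mean(data) + margin

for level in (0.80, 0.90, 0.99):
    known = z_interval(sentences, level, sigma=16)
    unknown = z_interval(sentences, level)
    print(f"{level:.0%}: known sigma [{known[0]:.2f}, {known[1]:.2f}], "
          f"estimated sigma [{unknown[0]:.2f}, {unknown[1]:.2f}]")
```

Passing <code>sigma=16</code> corresponds to the known-variance case; omitting it falls back to the sample standard deviation, as in the unknown-variance case.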

====Estimating a population mean with small samples (say <30 observations)====
For small samples, the point estimates are less precise and the interval estimates produce wider intervals, compared to the case of large samples.

* Assumptions: we need evidence that the data we observed and used for point and interval estimates come from a distribution that is (approximately) normal. If this assumption is violated, then the interval estimate we are going to introduce may significantly misrepresent the real confidence interval.
* Interval estimation of population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> is defined in terms of the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html T-distribution]:
:: <math>CI(\alpha): \overline{x} \pm t_{\alpha\over 2} E.</math>
:: The '''Error''' term, E, is defined as <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimate <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>
:: $t_{\alpha\over 2}$ is the [[AP_Statistics_Curriculum_2007_StudentsT |Critical Value for the T(df=sample-size -1) distribution at <math>{\alpha\over 2}</math>]].

* Example: a random sample of the number of sentences found in 10 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12. The point estimate (the sample mean) is 22.1.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{16\over \sqrt{10}}=22.1 \pm 1.383{16\over \sqrt{10}}=[15.10 ; 29.10]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{16\over \sqrt{10}}=22.1 \pm 1.833{16\over \sqrt{10}}=[12.83 ; 31.37]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{16\over \sqrt{10}}=22.1 \pm 3.250{16\over \sqrt{10}}=[5.66 ; 38.54]</math></center>
:: Notice the increase of the CI's (directly related to the decrease of <math>\alpha</math>) reflecting our choice for higher confidence.

** Unknown variance: Suppose that we do '''not''' know the variance for the ''number of sentences per advertisement'' but use the sample variance 737.88 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=27.16390579</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{27.16390579\over \sqrt{10}}=22.1 \pm 1.383{27.16390579\over \sqrt{10}}=[10.22 ; 33.98]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{27.16390579\over \sqrt{10}}=22.1 \pm 1.833{27.16390579\over \sqrt{10}}=[6.35 ; 37.85]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{27.16390579 \over \sqrt{10}}=22.1 \pm 3.250{27.16390579\over \sqrt{10}}=[-5.82 ; 50.02]</math></center>
:: Notice the increase of the CI's (directly related to the decrease of <math>\alpha</math>) reflecting our choice for higher confidence.
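
The small-sample intervals can be recomputed in the same way. In this sketch the t critical values for df = 9 are copied from the example above and stored as constants (the standard library has no t quantile function), and <code>t_interval</code> is a hypothetical helper name.

```python
import math
from statistics import mean, stdev

sample = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12]

# Critical values of the t distribution with df = n - 1 = 9, copied from the text
# (tabulated constants, not computed by this script).
T_CRIT_DF9 = {"80%": 1.383, "90%": 1.833, "99%": 3.250}

def t_interval(data, t_crit, sigma=None):
    """Return (lower, upper) for mean(data) +/- t_crit * sd / sqrt(n)."""
    n = len(data)
    sd = sigma if sigma is not None else stdev(data)  # stdev uses the n - 1 divisor
    margin = t_crit * sd / math.sqrt(n)
    return mean(data) - margin, mean(data) + margin

for level, t_crit in T_CRIT_DF9.items():
    known = t_interval(sample, t_crit, sigma=16)
    unknown = t_interval(sample, t_crit)
    print(f"{level}: known sigma [{known[0]:.2f}, {known[1]:.2f}], "
          f"estimated sigma [{unknown[0]:.2f}, {unknown[1]:.2f}]")
```

The single outlier (99) inflates the sample standard deviation to about 27.16, which is why the intervals based on the estimated standard deviation are so much wider than those based on <math>\sigma=16</math>.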

====Estimating a population proportion====
When the sample size is large, the sampling distribution of the sample proportion <math>\hat{p}</math> is approximately Normal, by [[AP_Statistics_Curriculum_2007_Limits_CLT |CLT]], as the sample proportion may be presented as a [[AP_Statistics_Curriculum_2007_Limits_Norm2Bin |sample average of Bernoulli random variables]]. When the sample size is small, the normal approximation may be inadequate. To accommodate this, we will modify the '''sample-proportion''' <math>\hat{p}</math> slightly and obtain the '''corrected-sample-proportion''' <math>\tilde{p}</math>:
: <math>\hat{p}={y\over n} \longrightarrow \tilde{p}={y+0.5z_{\alpha \over 2}^2 \over n+z_{\alpha \over 2}^2},</math>
where [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2}</math> is the normal critical value we saw earlier]].

The standard error of <math>\hat{p}</math> also needs a slight modification
: <math>SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})\over n} \longrightarrow SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})\over n+z_{\alpha \over 2}^2}.</math>

* Example: Suppose a researcher is interested in studying the effect of aspirin in reducing heart attacks. He randomly recruits 500 subjects with evidence of early heart disease and has them take one aspirin daily for two years. At the end of the two years, he finds that during the study only 17 subjects had a heart attack. Calculate a 95% (<math>\alpha=0.05</math>) confidence interval for the true (unknown) proportion of subjects with early heart disease that have a heart attack while taking aspirin daily. Note that [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2} = z_{0.025}=1.96</math>]]:
:: <math>\hat{p} = {17\over 500}=0.034</math> ; <math>\tilde{p} = {17+0.5z_{0.025}^2\over 500+z_{0.025}^2}= {17+1.92\over 500+3.84}=0.038</math>
:: <math>SE_{\hat{p}}= \sqrt{0.034(1-0.034)\over 500}=0.0081</math>; <math>SE_{\tilde{p}}= \sqrt{0.038(1-0.038)\over 500+3.84}=0.0085</math>
::And the corresponding confidence intervals are given by
:: <math>\hat{p}\pm 1.96 SE_{\hat{p}}=[0.0181, 0.0499]</math>
:: <math>\tilde{p}\pm 1.96 SE_{\tilde{p}}=[0.0213, 0.0547]</math>

:: [[AP_Statistics_Curriculum_2007_Estim_Proportion#Sample-Size_Estimation_2|See this example of estimation of sample-size, given margin of error]]
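
As a check on the aspirin example, the following sketch computes both the plain and the corrected proportion intervals from the formulas above. The function name <code>proportion_intervals</code> and the dictionary keys are illustrative assumptions, not part of any SOCR tool.

```python
import math
from statistics import NormalDist

def proportion_intervals(y, n, confidence=0.95):
    """Plain and corrected CIs for a proportion, following the formulas above."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    p_hat = y / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    p_tilde = (y + 0.5 * z**2) / (n + z**2)              # corrected sample proportion
    se_tilde = math.sqrt(p_tilde * (1 - p_tilde) / (n + z**2))
    return {
        "p_hat": p_hat, "se_hat": se_hat,
        "plain": (p_hat - z * se_hat, p_hat + z * se_hat),
        "p_tilde": p_tilde, "se_tilde": se_tilde,
        "corrected": (p_tilde - z * se_tilde, p_tilde + z * se_tilde),
    }

result = proportion_intervals(y=17, n=500)   # 17 heart attacks among 500 subjects
for key, value in result.items():
    print(key, value)
```

Because the code keeps full precision for <math>z_{0.025}</math> and <math>\tilde{p}</math>, its corrected interval differs slightly from the hand calculation, which rounds <math>\tilde{p}</math> to 0.038.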

====Estimating population variance====
An unbiased point estimate for the population variance <math>\sigma^2</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample-Variance (<math>s^2</math>)]], and a point estimate for the population standard deviation <math>\sigma</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample Standard Deviation (s)]].

We use a [http://en.wikipedia.org/wiki/Chi_square_distribution Chi-Square Distribution] to construct confidence intervals for the variance and standard deviation. If the process or phenomenon we study generates a Normal random variable, then the following random variable (computed from a sample of size <math>n>1</math>) has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-Square Distribution]] with <math>n-1</math> degrees of freedom:
: <math>\chi_o^2 = {(n-1)s^2 \over \sigma^2}</math>

* Chi-Square Distribution Properties
** All chi-square values <math>\chi_o^2 \geq 0</math>.
** The chi-square distribution is a family of curves, each determined by the degrees of freedom (n-1). See the interactive [http://socr.ucla.edu/htmls/SOCR_Distributions.html Chi-Square distribution].
** To form a confidence interval for the variance (<math>\sigma^2</math>), use the <math>\chi^2(df=n-1)</math> distribution with degrees of freedom equal to one less than the sample size.
** The area under each curve of the Chi-Square Distribution equals one.
** All Chi-Square Distributions are positively skewed.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig1.jpg|500px]]</center>

* Interval Estimates of Population Variance and Standard Deviation:
:: Notice that the Chi-Square Distribution is '''not''' symmetric (positively skewed) and therefore, there are two critical values for each level of confidence. The value <math>\chi_L^2</math> represents the left-tail critical value and <math>\chi_R^2</math> represents the right-tail critical value. For various degrees of freedom and areas, you can compute all critical values either using the [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions] or using the [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Chi-square Distribution Calculator].

::: Example: Find the critical values, <math>\chi_L^2</math> and <math>\chi_R^2</math>, for a 90% confidence interval when the sample size is 25. Use the following Protocol:
::: Identify the degrees of freedom (<math>df=n-1=24</math>) and the tail area for each critical value (<math>{\alpha\over 2}=0.05</math>, since the confidence level is 90%).
::: Find the left and right critical values, <math>\chi_L^2=13.848</math> and <math>\chi_R^2=36.415</math>, as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig2.jpg|500px]]</center>

* Confidence Interval for <math>\sigma^2</math>
:: <math>{(n-1)s^2 \over \chi_R^2} \leq \sigma^2 \leq {(n-1)s^2 \over \chi_L^2}</math>

* Confidence Interval for <math>\sigma</math>
:: <math>\sqrt{(n-1)s^2 \over \chi_R^2} \leq \sigma \leq \sqrt{(n-1)s^2 \over \chi_L^2}</math>

====Hands-on Activity====
Construct the confidence intervals for <math>\sigma^2</math> and <math>\sigma</math> assuming the observations below represent a random sample from the liquid content (in fluid ounces) of 16 beverage cans and can be considered as Normally distributed. Use a 90% level of confidence.
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| 14.816 || 14.863 || 14.814 || 14.998 || 14.965 || 14.824 || 14.884 || 14.838 || 14.916 || 15.021 || 14.874 || 14.856 || 14.860 || 14.772 || 14.980 || 14.919
|}
</center>

* Get the sample statistics from [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] (e.g., Index Plot); Sample-Mean=14.8875; Sample-SD=0.072700298, Sample-Var=0.005285333.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig3.jpg|500px]]</center>

* Identify the degrees of freedom (<math>df=n-1=15</math>) and the tail area for each critical value (<math>{\alpha/2}=0.05</math>), as we are looking for a <math>(1-\alpha)100% CI(\sigma^2)</math>.
* Find the left and right critical values, <math>\chi_L^2=7.261</math> and <math>\chi_R^2=24.9958</math> using [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Chi-Square Distribution], as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig4.jpg|500px]]</center>

* CI(<math>\sigma^2</math>)
: <math>0.00318={15\times 0.0053 \over 24.9958} \leq \sigma^2 \leq {15\times 0.0053 \over 7.261}=0.01095</math>

* CI(<math>\sigma</math>)
: <math>0.0564=\sqrt{15\times 0.0053 \over 24.9958} \leq \sigma \leq \sqrt{15\times 0.0053 \over 7.261}=0.10464</math>

** [[AP_Statistics_Curriculum_2007_Estim_Var#More_Examples|See more examples here]].
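
The hands-on activity can be verified with a few lines of Python. This is a sketch under stated assumptions: the two chi-square critical values for df = 15 are copied from the activity above as constants (they are not computed here), and <code>variance_interval</code> is a hypothetical helper name.

```python
import math
from statistics import variance

cans = [14.816, 14.863, 14.814, 14.998, 14.965, 14.824, 14.884, 14.838,
        14.916, 15.021, 14.874, 14.856, 14.860, 14.772, 14.980, 14.919]

# Left and right chi-square critical values for df = 15 and a 90% CI,
# copied from the activity (tabulated constants, not computed here).
CHI2_LEFT, CHI2_RIGHT = 7.261, 24.9958

def variance_interval(data, chi2_left, chi2_right):
    """CI for sigma^2: (n-1)s^2/chi2_right <= sigma^2 <= (n-1)s^2/chi2_left."""
    n = len(data)
    s2 = variance(data)                      # sample variance, n - 1 divisor
    return (n - 1) * s2 / chi2_right, (n - 1) * s2 / chi2_left

var_lo, var_hi = variance_interval(cans, CHI2_LEFT, CHI2_RIGHT)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)   # CI for sigma
print(f"CI(sigma^2) = [{var_lo:.5f}, {var_hi:.5f}]")
print(f"CI(sigma)   = [{sd_lo:.4f}, {sd_hi:.4f}]")
```

The script uses the unrounded sample variance (about 0.005285), so its limits are marginally narrower than the hand calculation, which rounds the variance to 0.0053.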

===Applications===
* [http://www.tandfonline.com/doi/abs/10.1207/.U5ys8BZRXKw This article], titled Reliability of Scales with General Structure: Point and Interval Estimation Using a Structural Equation, discusses a method of obtaining point and interval estimates of reliability for composites of measures with a general structure. The approach is based on fitting a correspondingly constrained structural equation model and generalizes earlier covariance structure analysis methods for scale reliability estimation with congeneric tests. The procedure can be used with weighted or unweighted composites, in which the weights need not be known in advance but may be estimated simultaneously. The method presented in this paper allows one to obtain an approximate standard error and confidence interval for scale reliability using the bootstrap.

* [[SOCR_EduMaterials_ModelerActivities_NormalBetaModelFit| This activity]] shows normal and beta distribution model fitting. It describes the process of SOCR model fitting using Normal or Beta distribution models. The article motivates the need for analytical modeling of natural processes, illustrates how to use the SOCR Modeler to fit models to real data, and presents applications of model fitting. It provides specific examples illustrating model fitting and two exercises to practice and learn.

* [[SOCR_EduMaterials_Activities_General_CI_Experiment|This experiment]] is a SOCR activity on general confidence intervals that demonstrates the usage and functionality of the SOCR general confidence interval applet. It demonstrates the theory behind the use of interval-based estimates of parameters, illustrates various confidence interval construction recipes, draws parallels between the construction algorithms and the intuitive meaning of confidence intervals, and presents a new technology-enhanced approach for understanding and utilizing confidence intervals in various applications. The article presents specific examples and exercises on this topic and works as a good supplement to point and interval estimates.

===Software===
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Tables]
* [http://socr.ucla.edu/htmls/exp/Confidence_Interval_Experiment_General.html SOCR General CI Experiment]
* [http://socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler]
* [http://socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
* [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

===Problems===
* Tom is in charge of sampling sugar measurements from a very large population of sugar. Lately his standard errors have been alarmingly high for his sample means. If he wants to decrease his sampling error (standard deviation of his sample means) by 1/2, what should he do?
: (a) Quadruple the variation inherent in the population.
: (b) Triple his sample size.
: (c) Quadruple his sample size.
: (d) Halve his sample size.

* The average standardized math score for eighth graders in the state of Michigan is 70 and the standard deviation is 10. We want to find out if the average standardized math score in district A is higher than the average score for the state of Michigan. The mean for a random sample of 36 students from this district is 72. What is the best response?
: (a) The p-value is around 0.76 and it is concluded that the average standardized math score in this district is not different from the overall population mean.
: (b) The p-value is around 0.12 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (c) The p-value is around 0.24 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (d) The p-value is around 0.88 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.

* A random sample of 121 students from the UMich was selected to estimate the average ACT score of all UMich students. The average for the sample was 23.4 and the sample standard deviation was 3.65. If you wanted to calculate a more precise and accurate prediction of the average ACT score of UMich students, which one of the following would be the best thing to do?
: (a) Decrease the sample size to 91.
: (b) Increase the sample size to 151.
: (c) Increase the confidence level to 99%.
: (d) Decrease the confidence level to 90%.

* How do the shape, center, and spread of t-models change as the degrees of freedom increase?
: (a) The shape and center stay the same, but the spread becomes narrower.
: (b) The shape and center stay the same, but the spread becomes wider.
: (c) The shape and spread stay the same, but the center will increase.
: (d) The shape and spread stay the same, but the center will decrease.

* Estimate the critical value of t for a 95% confidence interval with df = 15
: (a) 1.71
: (b) 2.131
: (c) 1.17
: (d) 3.45

* True or False: In a well-designed sample survey like the Current Population Survey, the observed sample percentage (e.g., percentage unemployed) is equal to the population percentage. Thus, it is appropriate to just report the sample percentage, without any measure of accuracy (i.e. without the margin of error).
: (a) True
: (b) False

* Suppose an NPR news story reports that: "A polling agency reports that the percentage of the American public who agree we should spend more money on the mental health of the war veterans is 42% +/- 3%."
: (a) The probability that the American public agree that we should spend more money on the mental health of the war veterans is between 39% to 42%.
: (b) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (c) We are 95% confident that the percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (d) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is 42%.

* A major newspaper wants to hire a polling agency to predict who will be the next governor. Agency A proposes to do the job with a random sample of 5000 voters at a cost of $\$ 50K$ (K = one thousand). Agency B proposes to do the job with a random sample of 7500 voters at a cost of $\$ 75K$. Assume both agencies find the percentage of voters to be 40% and both use the normal model to calculate the 95% interval. Which agency will you hire? Hint: Compare the margin of error for the two agencies and the relative costs before making your decision.
: (a) I will hire B.
: (b) I have no preference.
: (c) I need more information to decide who to hire.
: (d) I will hire A.

* Suppose that the proportion of the adult population who jog is 0.15. What is the probability that the proportion of joggers in a random sample of size n =200 lies between 0.13 and 0.17?
: (a) 0.5762 approximately
: (b) 0.8125 approximately
: (c) 0.2345 approximately
: (d) 0.1234 approximately

* Records at a large university indicate that 20% of all freshmen are placed on academic probation at the end of the first semester. A random sample of 100 freshmen found that 25% of them were placed on probation. The results of the sample:
: (a) are surprising since it indicates that 5% more of these freshmen were placed on probation than expected
: (b) are surprising since the standard deviation of the sampling distribution is 0.4%.
: (c) are biased since an increase of 5% could not happen without injecting bias into the sample.
: (d) are not surprising since the standard deviation of the sampling distribution is 4%.
: (e) are surprising since SAT scores have increased over the past years

* We have discussed that the standard deviation of the distribution of sample percentages, $SE(\hat{p})$ is calculated by taking the square root of $\frac{\hat{p}(1-\hat{p})}{N}$, where $\hat{p}$ is the proportion in the sample and N is sample size. What does $SE(\hat{p})$ show?
: (a) It shows the standard error of the mean across repeated samples from the population.
: (b) It shows the distribution of $\hat{p}$ for the single sample that the researcher draws from the population.
: (c) It shows the standard deviation of $\hat{p}$ for repeated samples from the population.
: (d) It shows the variation for $\hat{p}$ values for repeated samples from the population.

===References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_VII:_Point_and_Interval_Estimates SOCR]
* [http://en.wikipedia.org/wiki/Method_of_moments_(statistics) MoM Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_CIs}}

SMHS CIs

2014-09-01T18:39:44Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Point and Interval Estimation: MoM and MLE ==

===Overview===
The estimation of population parameters is critical in many applications. In statistics, estimation uses a combination of effect sizes, confidence intervals, and meta-analysis to plan experiments, analyze data and interpret results. It is most frequently carried out in terms of point-estimates or interval estimates for population parameters of interest. This lesson aims to study the various methodologies commonly used in point and interval estimates like Method of Moments (MOM) and Maximum Likelihood Estimation (MLE). We are interested in methods of estimating population parameters based on a sample distribution. We illustrate point and interval estimations of population means, proportions, and variances using methods introduced in this class.

===Motivation===
Suppose we want to estimate the probability of a head occurring when we flip a specific coin by repeating the experiment several times. How much confidence do we have in our estimation? There are a number of other similar situations where we need to evaluate, predict or estimate a population parameter of interest using an observed data sample. The method of moments (MOM) and maximum likelihood estimation (MLE) are among the most commonly used methods to estimate various population parameters.

In point and interval estimation, not only do we need to consider the distribution and model on which the estimates are based, we also need to make assumptions in terms of the population distribution. Additionally, the estimates of parameters are influenced by other population parameters. For example, the estimate of the mean of the population is influenced by parameters like variance and sample size.

A confidence interval is an interval that contains the true value of a parameter of interest for $(1-\alpha)100\%$ of samples taken. It is called a $(1-\alpha)100\%$ confidence interval for that parameter, and the ends of the CI are called confidence limits.

===Theory===
====Method of Moments (MOM) Estimation====
This method uses the sample data to calculate some sample moments and then sets these equal to their corresponding population counterparts. Steps:
# Determine the k parameters of interest and the specific distribution for this process;
# Compute the first $k$ (or more) sample moments;
# Set the sample moments equal to the population moments and solve for a (linear or non-linear) system of $k$ equations with $k$ unknowns.

* MOM proportion example: consider the example of flipping a coin 10 times and recording the outcomes of heads and tails. We use the outcomes to infer the true probability of a head ($p=P(Head)$). Suppose we observe the outcome $\{H,T,T,T,T,H,H,T,H,T\}$. This is a [[SMHS_ProbabilityDistributions#Binomial_distribution |Binomial experiment]] with $E[X]=np$, where $X$ is the number of heads in the $n=10$ flips. Setting the population moment equal to the observed count, $np=4$, gives the MOM estimate $\hat{p}_{MOM} = \frac{4}{10}$.

* MOM Beta distribution example: Suppose we have 10 observations we suspect came from a [http://www.distributome.org/js/calc/BetaCalculator.html Beta distribution].
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! Data||0.055||1.005||0.075||0.005||0.075||1.005||0.005||0.035||0.225
|}
</center>

The beta distribution mean and variance are defined explicitly in terms of two parameters.
* Mean: $μ=\frac{α}{α+β}$,
* Variance: $σ^2=\frac{αβ}{(α+β)^2 (α+β+1)}$.

The sample mean and sample variance are $\bar{x} = 0.251$, and $s^2=0.6187$. Solve for α and β.
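
Solving the two moment equations for the parameters gives $α=\bar{x}\,c$ and $β=(1-\bar{x})\,c$, where $c=\frac{\bar{x}(1-\bar{x})}{s^2}-1$. A minimal sketch of this calculation follows; the function name <code>beta_mom</code> and the second set of summary values (0.25 and 0.01) are hypothetical and used only for illustration.

```python
def beta_mom(sample_mean, sample_var):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution.

    Obtained by solving mean = a/(a+b) and var = ab/((a+b)^2 (a+b+1))
    for a and b in terms of the sample mean and sample variance.
    """
    common = sample_mean * (1 - sample_mean) / sample_var - 1   # equals a + b
    return sample_mean * common, (1 - sample_mean) * common

def in_parameter_space(alpha, beta):
    """A Beta distribution requires alpha > 0 and beta > 0."""
    return alpha > 0 and beta > 0

# Summary values quoted in the text above
a_text, b_text = beta_mom(0.251, 0.6187)
print(a_text, b_text, in_parameter_space(a_text, b_text))

# Hypothetical summary values, chosen only to illustrate a valid solution
a_demo, b_demo = beta_mom(0.25, 0.01)
print(a_demo, b_demo, in_parameter_space(a_demo, b_demo))
```

With the summary values quoted above, $s^2$ exceeds $\bar{x}(1-\bar{x})$, so $c$ is negative and the moment estimates fall outside the parameter space ($α>0$, $β>0$). This is an instance of the known limitation that MOM estimates are not guaranteed to be admissible.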

====Maximum likelihood estimation (MLE)====
Estimating distribution parameters by MLE from observed real-world data offers a way of tuning the free parameters of the model to provide an optimum fit.

Suppose we observe a sample $x_1,x_2,…,x_n$ of $n$ values from one distribution with probability density/mass function $f_θ$, and we are trying to estimate the parameter $θ$. We can compute the (multivariate) probability density associated with our observed data, $f_θ (x_1,x_2,…,x_n│θ)$. As a function of $θ$ with $x_1,x_2,…,x_n$ fixed, the likelihood function is
$$L(θ)=f_θ (x_1,x_2,…,x_n│θ).$$

The MLE of $θ$ is the value of $θ$ that maximizes $L(θ)$: $\arg\max_θ{L(θ)}.$

It is typically assumed that the observed data are independent and identically distributed (iid) with unknown parameter $θ$. The likelihood can then be written as a product of $n$ univariate probability densities, $L(θ)=\prod_{i=1}^n {f_θ (x_i |θ)}$. Since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum: $L^* (θ)=\sum_{i=1}^n {\ln{f_θ (x_i |θ)}}$. The maximum of this expression can then be found numerically using various optimization algorithms.

* Note: The MLE may not be unique, or guaranteed to exist.

* Example: consider the coin flipping example above, observing the number of heads in the outcomes and using this to infer the true probability of a head, $p=P(Head)$.
: Likelihood function: $L(θ)=f(x│θ=p)={10 \choose 4} p^4 (1-p)^6$
: Log-likelihood function: $L^* (θ)=\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}$.
: Maximize the log-likelihood function by setting its first derivative to zero:
$$ 0=\frac{d}{dp}\left(\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}\right) =\frac{4}{p}-\frac{6}{1-p} \quad\Longrightarrow\quad \hat{p}=\frac{2}{5}.$$
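
The same maximum can be located numerically, which is how the MLE is usually found when no closed form exists. The sketch below uses a simple grid search over $p$ (an illustrative choice of optimizer, not one prescribed by the text) and compares the result with the closed-form answer.

```python
import math

flips = "HTTTTHHTHT"              # the observed outcome from the example
heads, n = flips.count("H"), len(flips)

def log_likelihood(p, heads, n):
    """Binomial log-likelihood: ln C(n, heads) + heads*ln(p) + (n - heads)*ln(1 - p)."""
    return (math.log(math.comb(n, heads))
            + heads * math.log(p)
            + (n - heads) * math.log(1 - p))

# Grid search over the open interval (0, 1) in steps of 0.001
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, heads, n))

p_closed = heads / n              # closed-form solution of 4/p - 6/(1-p) = 0
print(f"grid-search MLE = {p_grid}, closed-form MLE = {p_closed}")
```

For this example the MLE coincides with the MOM estimate obtained earlier, since both reduce to the observed proportion of heads.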

====MOM vs. MLE====
* The MOM is generally inferior to Fisher’s MLE method, because MLEs have a higher probability of being close to the quantities to be estimated.
* MLE may be intractable in some situations, whereas the MOM estimates can be quickly and easily calculated by hand or using a computer.
* MOM estimates may be used as the first approximations to the solutions of the MLE method, and successive improved approximations may then be found by the [http://en.wikipedia.org/wiki/Newton-Raphson_method Newton-Raphson method]. In this respect, the MOM and MLE are symbiotic.
* Sometimes, MOM estimates may fall outside of the parameter space, i.e., they are unreliable, which is never a problem with the ML method.
* MOM estimates are not necessarily sufficient statistics, i.e., they sometimes fail to take into account all relevant information in the sample.
* MOM may be preferred to MLE for estimating some structural parameters, when appropriate probability distributions are unknown.

===Student’s T Distribution===
Student’s t distribution is the distribution needed to estimate the mean of a normally distributed population when the sample size is small and the population variance is unknown. It is the basis of the popular Student’s t-tests for the statistical significance of the difference between two sample means, and for confidence intervals for the difference between two population means.

Suppose $X_1,X_2,…,X_n$ are independent random variables that are normally distributed with expected value $μ$ and variance $σ^2$. Sample mean: $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n{x_i}$. Sample variance: $S_n^2=\frac{1}{n} \sum_{i=1}^n{(x_i-\bar{x})^2}$, $Z=\frac{\bar{x}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$ is normally distributed with mean 0 and variance 1, since the sample mean ($\bar{x}_n$) is normally distributed with mean μ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
$$Z=\frac{\bar{x}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$$
$$T=\frac{\bar{x}_n-\mu}{\frac{S_n}{\sqrt{n}}}$$

$T$ replaces $\sigma$ with the sample standard deviation $S_n$ and follows Student’s t distribution with $n-1$ degrees of freedom. Also, $(n-1) \frac{S_n^2}{\sigma^2}$ has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-square distribution]] $\chi_{n-1}^2$ with $n-1$ degrees of freedom.

* Example: suppose a study involves 25 patients and the relevant measurements are recorded:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Variable ||N || N* || Mean ||SE of Mean||StDev ||Minimum || Q1|| Median || Q3 ||Maximum
|-
| CD4 || 25|| 0 ||321.4|| 14.8 || 73.8 ||208.0 ||261.5 || 325.0 ||394.0 || 449.0
|}
</center>

What do we know from the background information?
: $\bar{y}= 321.4$
: $s = 73.8$
: $SE = 14.8$
: $n = 25$

: $CI(\alpha)=CI(0.05)$: $\bar{y} \pm t_{\alpha\over 2} {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{\frac{(y_i-\bar{y})^2}{n-1}}}.$

: $321.4 \pm t_{(24, 0.025)}{73.8\over \sqrt{25}}$
: $321.4 \pm 2.064\times 14.8$
: $[290.85, 351.95]$
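
The arithmetic of this interval can be checked with a few lines of Python. This is a minimal sketch that takes the summary statistics from the table and the critical value 2.064 as given above rather than computing them; the variable names are illustrative only.

```python
import math

n = 25
mean_cd4 = 321.4      # sample mean from the table
sd_cd4 = 73.8         # sample standard deviation from the table
se_reported = 14.8    # "SE of Mean" column from the table
t_crit = 2.064        # t critical value for df = 24, alpha/2 = 0.025 (quoted above)

# The reported SE is the rounded value of sd / sqrt(n).
se_exact = sd_cd4 / math.sqrt(n)

margin = t_crit * se_reported
ci = (round(mean_cd4 - margin, 2), round(mean_cd4 + margin, 2))
print(round(se_exact, 2), ci)
```

Using the unrounded standard error (14.76) instead of 14.8 would shift the limits slightly, which is why hand calculations and software output may differ in the last digit.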

====Estimating a population mean with large samples====
We use the following protocol to find point and interval estimates when the sample sizes are large, say exceeding 100.
* Assumptions: The [[SMHS_CLT_LLN|Central Limit Theorem]] guarantees that for large samples, the method above provides a valid recipe for constructing a confidence interval for the population mean, no matter what the distribution for the observed data may be. Of course, for significantly non-Normal distributions, we may need to increase the sample size to guarantee that the sampling distribution of the mean is approximately Normal.

* Point estimation of population mean: $\bar{X_n}={1\over n}\sum_{i=1}^n{X_i}$, constructed from a random sample of the process {$X_1, X_2, X_3, \cdots , X_n$}, which is an [http://en.wikipedia.org/wiki/Estimator_bias unbiased] estimate of the population mean $\mu$, if it exists! Note that the [[AP_Statistics_Curriculum_2007_EDA_Center | sample average may be susceptible to outliers]].

* Interval estimation of a population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> will be
: <math>CI(\alpha): \overline{x} \pm z_{\alpha\over 2} E,</math>
:: The '''Error''' term, E, is defined as
:: <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimated <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>

* <math>z_{\alpha\over 2}</math> is the [[AP_Statistics_Curriculum_2007_Normal_Critical | Critical Value]] for a [[AP_Statistics_Curriculum_2007_Normal_Std |Standard Normal]] distribution at <math>{\alpha\over 2}</math>.

* Example: a random sample of the number of sentences found in 30 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11, 17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12. The point estimate is the sample mean, <math>\overline{x}=443/30\approx 14.77</math>, which is the value used in the intervals below.
A confidence interval estimate of μ is a range of values used to estimate the population mean.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16\over \sqrt{30}}=[11.03;18.51]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16\over \sqrt{30}}=[9.96;19.57]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16\over \sqrt{30}}=[7.24;22.29]</math></center>
: Notice that the intervals widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.

** Unknown variance: use the sample variance 273 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=16.54</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16.54\over \sqrt{30}}=[10.90;18.63]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16.54\over \sqrt{30}}=[9.80;19.73]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16.54\over \sqrt{30}}=[6.99;22.54]</math></center>
: Notice that the intervals widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.

: You can use the [http://www.socr.ucla.edu/htmls/ana/ConfidenceInterval_Analysis.html SOCR CI Analysis Applet] to compute these interval estimates.
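
As an alternative to the applet, the large-sample intervals can be reproduced with Python's standard library. The sketch below is an illustration, not part of the SOCR tools: the function name <code>z_interval</code> is hypothetical, and the critical values are computed with <code>statistics.NormalDist</code> instead of being rounded to 1.28, 1.645 and 2.575, so the limits may differ from the hand calculations in the second decimal place.

```python
import math
from statistics import NormalDist, mean, stdev

sentences = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12,
             5, 9, 17, 6, 11, 17, 18, 20, 6, 14,
             7, 11, 12, 5, 18, 6, 4, 13, 11, 12]

def z_interval(data, conf, sigma=None):
    """Normal-based CI for the mean; falls back to the sample SD when sigma is unknown."""
    n = len(data)
    spread = sigma if sigma is not None else stdev(data)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # critical value z_{alpha/2}
    margin = z * spread / math.sqrt(n)
    return mean(data) - margin, mean(data) + margin

x_bar = mean(sentences)                            # point estimate of the population mean
print("point estimate:", round(x_bar, 2))
for conf in (0.80, 0.90, 0.99):
    known = z_interval(sentences, conf, sigma=16)  # known variance 256
    unknown = z_interval(sentences, conf)          # sample SD used instead
    print(conf, [round(v, 2) for v in known], [round(v, 2) for v in unknown])
```

Passing <code>sigma=16</code> corresponds to the known-variance case, and omitting it corresponds to the unknown-variance case.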

====Estimating a population mean with small samples (say <30 observations)====
For small samples, the point estimates are less precise and the interval estimates produce wider intervals, compared to the case of large samples.

* Assumptions: we need evidence that the data we observed and used for the point and interval estimates come from a distribution that is (approximately) normal. If this assumption is violated, then the interval estimate we are going to introduce may significantly misrepresent the real confidence interval.
* Interval estimation of population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> is defined in terms of the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html T-distribution]:
:: <math>CI(\alpha): \overline{x} \pm t_{\alpha\over 2} E.</math>
:: The '''Error''' term, E, is defined as <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimate <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>
:: $t_{\alpha\over 2}$ is the [[AP_Statistics_Curriculum_2007_StudentsT |Critical Value for the T(df=sample-size -1) distribution at <math>{\alpha\over 2}</math>]].

* Example: a random sample of the number of sentences found in 10 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12. The point estimate is the sample mean, <math>\overline{x}=221/10=22.1</math>.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{16\over \sqrt{10}}=22.1 \pm 1.383{16\over \sqrt{10}}=[15.10 ; 29.10]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{16\over \sqrt{10}}=22.1 \pm 1.833{16\over \sqrt{10}}=[12.83 ; 31.37]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{16\over \sqrt{10}}=22.1 \pm 3.250{16\over \sqrt{10}}=[5.66 ; 38.54]</math></center>
:: Notice that the intervals widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.

** Unknown variance: Suppose that we do '''not''' know the variance for the ''number of sentences per advertisement'' but use the sample variance 737.88 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=27.16390579</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{27.16390579\over \sqrt{10}}=22.1 \pm 1.383{27.16390579\over \sqrt{10}}=[10.22 ; 33.98]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{27.16390579\over \sqrt{10}}=22.1 \pm 1.833{27.16390579\over \sqrt{10}}=[6.35 ; 37.85]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{27.16390579 \over \sqrt{10}}=22.1 \pm 3.250{27.16390579\over \sqrt{10}}=[-5.82 ; 50.02]</math></center>
:: Notice that the intervals widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.
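
The small-sample intervals above can be reproduced with a short Python sketch. It assumes the t critical values for 9 degrees of freedom (1.383, 1.833, 3.250) exactly as quoted in the example, since the standard library has no t quantile function; the helper name <code>t_interval</code> is hypothetical.

```python
import math
from statistics import mean, stdev

sample = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12]

# t critical values for df = 9, copied from the example above (not computed here)
t_crit = {0.80: 1.383, 0.90: 1.833, 0.99: 3.250}

def t_interval(data, t_value, sigma=None):
    """CI for the mean using a supplied t critical value and either sigma or the sample SD."""
    n = len(data)
    spread = sigma if sigma is not None else stdev(data)
    margin = t_value * spread / math.sqrt(n)
    return mean(data) - margin, mean(data) + margin

for conf, t_value in t_crit.items():
    known = t_interval(sample, t_value, sigma=16)   # variance assumed known (256)
    unknown = t_interval(sample, t_value)           # sample SD, about 27.16
    print(conf, [round(v, 2) for v in known], [round(v, 2) for v in unknown])
```

The single observation 99 inflates the sample standard deviation to about 27.16, which is why the unknown-variance intervals are so much wider than the known-variance ones.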

====Estimating a population proportion====
When the sample size is large, the sampling distribution of the sample proportion <math>\hat{p}</math> is approximately Normal, by [[AP_Statistics_Curriculum_2007_Limits_CLT |CLT]], as the sample proportion may be represented as a [[AP_Statistics_Curriculum_2007_Limits_Norm2Bin |sample average of Bernoulli random variables]]. When the sample size is small, the normal approximation may be inadequate. To accommodate this, we will modify the '''sample-proportion''' <math>\hat{p}</math> slightly and obtain the '''corrected-sample-proportion''' <math>\tilde{p}</math>:
: <math>\hat{p}={y\over n} \longrightarrow \tilde{p}={y+0.5z_{\alpha \over 2}^2 \over n+z_{\alpha \over 2}^2},</math>
where [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2}</math> is the normal critical value we saw earlier]].

The standard error of <math>\hat{p}</math> also needs a slight modification
: <math>SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})\over n} \longrightarrow SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})\over n+z_{\alpha \over 2}^2}.</math>

* Example: Suppose a researcher is interested in studying the effect of aspirin in reducing heart attacks. He randomly recruits 500 subjects with evidence of early heart disease and has them take one aspirin daily for two years. At the end of the two years, he finds that during the study only 17 subjects had a heart attack. Calculate a 95% (<math>\alpha=0.05</math>) confidence interval for the true (unknown) proportion of subjects with early heart disease that have a heart attack while taking aspirin daily. Note that [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2} = z_{0.025}=1.96</math>]]:
:: <math>\hat{p} = {17\over 500}=0.034</math> ; <math>\tilde{p} = {17+0.5z_{0.025}^2\over 500+z_{0.025}^2}= {17+1.92\over 500+3.84}=0.038</math>
:: <math>SE_{\hat{p}}= \sqrt{0.034(1-0.034)\over 500}=0.0081</math>; <math>SE_{\tilde{p}}= \sqrt{0.038(1-0.038)\over 500+3.84}=0.0085</math>
:: And the corresponding confidence intervals are given by
:: <math>\hat{p}\pm 1.96 SE_{\hat{p}}=[0.0181, 0.0499]</math>
:: <math>\tilde{p}\pm 1.96 SE_{\tilde{p}}=[0.0213, 0.0547]</math>

:: [[AP_Statistics_Curriculum_2007_Estim_Proportion#Sample-Size_Estimation_2|See this example of estimation of sample-size, given margin of error]]
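
The two proportion estimates and their standard errors for the aspirin example can be computed side by side with the following Python sketch. The function name <code>proportion_intervals</code> and the dictionary keys are hypothetical; the formulas are the ones given above with <math>z_{0.025}=1.96</math>. Because the code carries full precision rather than rounded intermediate values, its output may differ slightly in the last digits from hand calculations.

```python
import math

def proportion_intervals(successes, n, z=1.96):
    """Plain and corrected sample proportions with their standard errors and CIs."""
    p_hat = successes / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)

    # Corrected estimate: add z^2/2 successes and z^2 trials.
    p_tilde = (successes + 0.5 * z**2) / (n + z**2)
    se_tilde = math.sqrt(p_tilde * (1 - p_tilde) / (n + z**2))

    return {
        "p_hat": p_hat,
        "se_hat": se_hat,
        "ci_hat": (p_hat - z * se_hat, p_hat + z * se_hat),
        "p_tilde": p_tilde,
        "se_tilde": se_tilde,
        "ci_tilde": (p_tilde - z * se_tilde, p_tilde + z * se_tilde),
    }

result = proportion_intervals(successes=17, n=500)
for name, value in result.items():
    print(name, value)
```

With 500 subjects the two intervals are close to each other; the correction matters more when the sample is small or the proportion is near 0 or 1.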

====Estimating population variance====
An unbiased point estimate for the population variance <math>\sigma^2</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample-Variance (<math>s^2</math>)]] and the corresponding point estimate for the population standard deviation <math>\sigma</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample Standard Deviation (s)]].

We use a [http://en.wikipedia.org/wiki/Chi_square_distribution Chi-Square Distribution] to construct confidence intervals for the variance and standard deviation. If the process or phenomenon we study generates a Normal random variable, then the following random variable (computed from a sample of size <math>n>1</math>) has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-Square Distribution]] with <math>n-1</math> degrees of freedom:
: <math>\chi_o^2 = {(n-1)s^2 \over \sigma^2}</math>

* Chi-Square Distribution Properties
** All chi-squares values <math>\chi_o^2 \geq 0</math>.
** The chi-square distribution is a family of curves, each determined by its degrees of freedom (n-1). See the interactive [http://socr.ucla.edu/htmls/SOCR_Distributions.html Chi-Square distribution].
** To form a confidence interval for the variance (<math>\sigma^2</math>), use the <math>\chi^2(df=n-1)</math> distribution with degrees of freedom equal to one less than the sample size.
** The area under each curve of the Chi-Square Distribution equals one.
** All Chi-Square Distributions are positively skewed.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig1.jpg|500px]]</center>

* Interval Estimates of Population Variance and Standard Deviation:
:: Notice that the Chi-Square Distribution is '''not''' symmetric (positively skewed) and therefore, there are two critical values for each level of confidence. The value <math>\chi_L^2</math> represents the left-tail critical value and <math>\chi_R^2</math> represents the right-tail critical value. For various degrees of freedom and areas, you can compute all critical values either using the [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions] or using the [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Chi-square Distribution Calculator].

::: Example: Find the critical values, <math>\chi_L^2</math> and <math>\chi_R^2</math>, for a 90% confidence interval when the sample size is 25. Use the following Protocol:
::: Identify the degrees of freedom (<math>df=n-1=24</math>) and the level of confidence (<math>{\alpha\over 2}=0.05</math>).
::: Find the left and right critical values, <math>\chi_L^2=13.848</math> and <math>\chi_R^2=36.415</math>, as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig2.jpg|500px]]</center>

* Confidence Interval for <math>\sigma^2</math>
:: <math>{(n-1)s^2 \over \chi_R^2} \leq \sigma^2 \leq {(n-1)s^2 \over \chi_L^2}</math>

* Confidence Interval for <math>\sigma</math>
:: <math>\sqrt{(n-1)s^2 \over \chi_R^2} \leq \sigma \leq \sqrt{(n-1)s^2 \over \chi_L^2}</math>

====Hands-on Activity====
Construct the confidence intervals for <math>\sigma^2</math> and <math>\sigma</math> assuming the observations below represent a random sample from the liquid content (in fluid ounces) of 16 beverage cans and can be considered as Normally distributed. Use a 90% level of confidence.
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| 14.816 || 14.863 || 14.814 || 14.998 || 14.965 || 14.824 || 14.884 || 14.838 || 14.916 || 15.021 || 14.874 || 14.856 || 14.860 || 14.772 || 14.980 || 14.919
|}
</center>

* Get the sample statistics from [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] (e.g., Index Plot); Sample-Mean=14.8875; Sample-SD=0.072700298, Sample-Var=0.005285333.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig3.jpg|500px]]</center>

* Identify the degrees of freedom (<math>df=n-1=15</math>) and the level of confidence (<math>{\alpha/2}=0.05</math>), as we are looking for a <math>(1-\alpha)100% CI(\sigma^2)</math>.
* Find the left and right critical values, <math>\chi_L^2=7.261</math> and <math>\chi_R^2=24.9958</math> using [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Chi-Square Distribution], as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig4.jpg|500px]]</center>

* CI(<math>\sigma^2</math>)
: <math>0.00318={15\times 0.0053 \over 24.9958} \leq \sigma^2 \leq {15\times 0.0053 \over 7.261}=0.01095</math>

* CI(<math>\sigma</math>)
: <math>0.0564=\sqrt{15\times 0.0053 \over 24.9958} \leq \sigma \leq \sqrt{15\times 0.0053 \over 7.261}=0.10464</math>

** [[AP_Statistics_Curriculum_2007_Estim_Var#More_Examples|See more examples here]].
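
The hands-on activity can also be checked numerically. The following Python sketch assumes the two chi-square critical values quoted above for 15 degrees of freedom (7.261 and 24.9958) instead of computing them, and uses the unrounded sample variance, so the limits may differ slightly from the ones obtained with the rounded value 0.0053.

```python
import math
from statistics import mean, variance

cans = [14.816, 14.863, 14.814, 14.998, 14.965, 14.824, 14.884, 14.838,
        14.916, 15.021, 14.874, 14.856, 14.860, 14.772, 14.980, 14.919]

# Critical values for df = 15 and a 90% confidence level, as quoted above.
chi2_left, chi2_right = 7.261, 24.9958

n = len(cans)
s2 = variance(cans)                       # sample variance (divisor n - 1)

# CI for the variance: (n-1)s^2 / chi2_R  <=  sigma^2  <=  (n-1)s^2 / chi2_L
var_ci = ((n - 1) * s2 / chi2_right, (n - 1) * s2 / chi2_left)
# CI for the standard deviation: square roots of the variance limits
sd_ci = tuple(math.sqrt(limit) for limit in var_ci)

print("mean:", round(mean(cans), 4), "variance:", round(s2, 6))
print("CI for variance:", [round(v, 5) for v in var_ci])
print("CI for SD:", [round(v, 4) for v in sd_ci])
```

Note that the right-tail critical value produces the lower limit and the left-tail critical value produces the upper limit, because the critical values appear in the denominators.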

===Applications===
* [http://www.tandfonline.com/doi/abs/10.1207/.U5ys8BZRXKw This article] titled Reliability of Scales with General Structure: Point and Interval Estimation Using a Structural Equation discusses a method of obtaining point and interval estimates of reliability for composites of measures with a general structure. The approach is based on fitting a correspondingly constrained structural equation model and generalizes earlier covariance structure analysis methods for scale reliability estimation with congeneric tests. The procedure can be used with weighted or unweighted composites, in which the weights need not be known in advance but may be estimated simultaneously. The method presented in this paper allows one to obtain an approximate standard error and confidence interval for scale reliability using the bootstrap.

* [[SOCR_EduMaterials_ModelerActivities_NormalBetaModelFit| This activity]] shows normal and beta distribution model fitting. It describes the process of SOCR model fitting using Normal or Beta distribution models. The article aims to motivate the need for analytical modeling of natural processes, illustrates how to use SOCR Modeler to fit models to real data, and presents applications of model fitting. It provides specific examples illustrating model fitting and two exercises to practice and learn.

* [[SOCR_EduMaterials_Activities_General_CI_Experiment|This experiment]] presents the SOCR activity on general confidence intervals and demonstrates the usage and functionality of the SOCR general confidence interval applet. It demonstrates the theory behind the use of interval-based estimates of parameters, illustrates various recipes for constructing confidence intervals, draws parallels between the construction algorithms and the intuitive meaning of confidence intervals, and presents a new technology-enhanced approach for understanding and utilizing confidence intervals for various applications. The article presents specific examples and exercises on this topic and works as a good supplement to point and interval estimates.

===Software===
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Tables]
* [http://socr.ucla.edu/htmls/exp/Confidence_Interval_Experiment_General.html SOCR General CI Experiment]
* [http://socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler]
* [http://socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
* [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

===Problems===
* Tom is in charge of sampling sugar measurements from a very large population of sugar. Lately his standard errors have been alarmingly high for his sample means. If he wants to decrease his sampling error (standard deviation of his sample means) by 1/2, what should he do?
: (a) Quadruple the variation inherent in the population.
: (b) Triple his sample size.
: (c) Quadruple his sample size.
: (d) Halve his sample size.

* The average standardized math score for eighth graders in the state of Michigan is 70 and the standard deviation is 10. We want to find out if the average standardized math score in district A is higher than the average score for the state of Michigan. The mean for a random sample of 36 students from this district is 72. What is the best response?
: (a) The p-value is around 0.76 and it is concluded that the average standardized math score in this district is not different from the overall population mean.
: (b) The p-value is around 0.12 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (c) The p-value is around 0.24 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (d) The p-value is around 0.88 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.

* A random sample of 121 students from the UMich was selected to estimate the average ACT score of all UMich students. The average for the sample was 23.4 and the sample standard deviation was 3.65. If you wanted to calculate a more precise and accurate prediction of the average ACT score of UMich students, which one of the following would be the best thing to do?
: (a) Decrease the sample size to 91.
: (b) Increase the sample size to 151.
: (c) Increase the confidence level to 99%.
: (d) Decrease the confidence level to 90%.

* How do the shape, center, and spread of t-models change as the degrees of freedom increase?
: (a) The shape and center stay the same, but the spread becomes narrower.
: (b) The shape and center stay the same, but the spread becomes wider.
: (c) The shape and spread stay the same, but the center will increase.
: (d) The shape and spread stay the same, but the center will decrease.

* Estimate the critical value of t for a 95% confidence interval with df = 15
: (a) 1.71
: (b) 2.131
: (c) 1.17
: (d) 3.45

* True or False: In a well-designed sample survey like the Current Population Survey, the observed sample percentage (e.g., percentage unemployed) is equal to the population percentage. Thus, it is appropriate to just report the sample percentage, without any measure of accuracy (i.e. without the margin of error).
: (a) True
: (b) False

* Suppose an NPR news story reports that: "A polling agency reports that the percentage of the American public who agree we should spend more money on the mental health of the war veterans is 42% +/- 3%." Which of the following is the best interpretation of this statement?
: (a) The probability that the American public agree that we should spend more money on the mental health of the war veterans is between 39% to 42%.
: (b) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (c) We are 95% confident that the percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (d) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is 42%.

* A major newspaper wants to hire a polling agency to predict who will be the next governor. Agency A proposes to do the job with a random sample of 5000 voters at a cost of $\$50K$ (K = one thousand). Agency B proposes to do the job with a random sample of 7500 voters at a cost of $\$75K$. Assume both agencies find the percentage of voters to be 40% and both use the normal model to calculate the 95% interval. Which agency will you hire? Hint: Compare the margin of error for the two agencies and the relative costs before making your decision.
: (a) I will hire B.
: (b) I have no preference.
: (c) I need more information to decide who to hire.
: (d) I will hire A.

* Suppose that the proportion of the adult population who jog is 0.15. What is the probability that the proportion of joggers in a random sample of size n =200 lies between 0.13 and 0.17?
: (a) 0.5762 approximately
: (b) 0.8125 approximately
: (c) 0.2345 approximately
: (d) 0.1234 approximately

* Records at a large university indicate that 20% of all freshmen are placed on academic probation at the end of the first semester. A random sample of 100 freshmen found that 25% of them were placed on probation. The results of the sample:
: (a) are surprising since it indicates that 5% more of these freshmen were placed on probation than expected
: (b) are surprising since the standard deviation of the sampling distribution is 0.4%.
: (c) are biased since an increase of 5% could not happen without injecting bias into the sample.
: (d) are not surprising since the standard deviation of the sampling distribution is 4%.
: (e) are surprising since SAT scores have increased over the past years

* We have discussed that the standard deviation of the distribution of sample percentages, $SE(\hat{p})$ is calculated by taking the square root of $\frac{\hat{p}(1-\hat{p})}{N}$, where $\hat{p}$ is the proportion in the sample and N is sample size. What does $SE(\hat{p})$ show?
: (a) It shows the standard error of the mean across repeated samples from the population.
: (b) It shows the distribution of $\hat{p}$ for the single sample that the researcher draws from the population.
: (c) It shows the standard deviation of $\hat{p}$ for repeated samples from the population.
: (d) It shows the variation for $\hat{p}$ values for repeated samples from the population.

===References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_VII:_Point_and_Interval_Estimates SOCR]
* [http://en.wikipedia.org/wiki/Method_of_moments_(statistics) MoM Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_CIs}}

SMHS CIs

2014-09-01T18:37:14Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Point and Interval Estimation: MoM and MLE ==

===Overview===
The estimation of population parameters is critical in many applications. In statistics, estimation uses a combination of effect sizes, confidence intervals, and meta-analysis to plan experiments, analyze data and interpret results. It is most frequently carried out in terms of point-estimates or interval estimates for population parameters of interest. This lesson aims to study the various methodologies commonly used in point and interval estimates like Method of Moments (MOM) and Maximum Likelihood Estimation (MLE). We are interested in methods of estimating population parameters based on a sample distribution. We illustrate point and interval estimations of population means, proportions, and variances using methods introduced in this class.

===Motivation===
Suppose we want to estimate the probability of a head occurring when we flip a specific coin by repeating the experiment several times. How much confidence do we have in our estimation? There are a number of other similar situations where we need to evaluate, predict or estimate a population parameter of interest using an observed data sample. The method of moments (MOM) and maximum likelihood estimation (MLE) are among the most commonly used methods to estimate various population parameters.

In point and interval estimation, not only do we need to consider the distribution and model on which the estimates are based, we also need to make assumptions in terms of the population distribution. Additionally, the estimates of parameters are influenced by other population parameters. For example, the estimate of the mean of the population is influenced by parameters like variance and sample size.

A confidence interval is an interval that contains the true value of a parameter of interest for $(1-α)100\%$ of samples taken. It is called a $(1-α)100\%$ confidence interval for that parameter, and the ends of the CI are called confidence limits.

===Theory===
====Method of Moments (MOM) Estimation====
This method uses the sample data to calculate some sample moments and then sets these equal to their corresponding population counterparts. Steps:
# Determine the k parameters of interest and the specific distribution for this process;
# Compute the first $k$ (or more) sample moments;
# Set the sample moments equal to the population moments and solve for a (linear or non-linear) system of $k$ equations with $k$ unknowns.

* MOM proportion example: consider flipping a coin 10 times and recording the outcomes of heads and tails. We use the outcomes to infer the true probability of a head ($p=P(Head)$). Suppose we observe the outcome $\{H,T,T,T,T,H,H,T,H,T\}$. This is a [[SMHS_ProbabilityDistributions#Binomial_distribution |Binomial experiment]] in which $X$ is the number of heads and $E[X]=np$. Setting the population moment equal to the observed count gives $np=4$, so the MOM estimate is $\hat{p} = \frac{4}{10}$.

* MOM Beta distribution example: Suppose we have 10 observations we suspect came from a [http://www.distributome.org/js/calc/BetaCalculator.html Beta distribution].
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! Data||0.055||1.005||0.075||0.005||0.075||1.005||0.005||0.035||0.225
|}
</center>

The beta distribution mean and variance are defined explicitly in terms of two parameters.
* Mean: $μ=\frac{α}{α+β}$,
* Variance: $σ^2=\frac{αβ}{(α+β)^2 (α+β+1)}$.

The sample mean and sample variance are $\bar{x} = 0.251$, and $s^2=0.6187$. Solve for α and β.
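
One way to carry out this step is sketched below in Python. Substituting the mean into the variance formula gives $σ^2=\frac{μ(1-μ)}{α+β+1}$, so $α+β=\frac{μ(1-μ)}{σ^2}-1$, $α=μ(α+β)$ and $β=(1-μ)(α+β)$. The function name <code>beta_mom</code> is hypothetical, and the sketch simply plugs in whatever sample moments it is given. Note that a valid Beta fit requires $s^2 < \bar{x}(1-\bar{x})$; with the summary values quoted above this condition fails, so the returned values fall outside the parameter space, which illustrates one of the known weaknesses of MOM estimates.

```python
def beta_mom(sample_mean, sample_var):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution."""
    # alpha + beta, obtained from var = mean * (1 - mean) / (alpha + beta + 1)
    total = sample_mean * (1 - sample_mean) / sample_var - 1
    alpha = sample_mean * total
    beta = (1 - sample_mean) * total
    return alpha, beta

# Sanity check on a case with a known answer: Beta(2, 2) has mean 0.5 and variance 0.05.
print(beta_mom(0.5, 0.05))

# Summary values quoted in the text; the estimates are only valid Beta
# parameters if both are positive.
alpha_hat, beta_hat = beta_mom(0.251, 0.6187)
print(alpha_hat, beta_hat, alpha_hat > 0 and beta_hat > 0)
```

Checking the sign of the returned values is a cheap safeguard before using MOM estimates as starting points for an MLE search.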

====Maximum likelihood estimation (MLE)====
Estimating distribution parameters by MLE from observed real-world data offers a way of tuning the free parameters of the model to provide an optimal fit.

Suppose we observe a sample $x_1,x_2,…,x_n$ of $n$ values from one distribution with probability density/mass function $f_θ$, and we are trying to estimate the parameter $θ$. We can compute the (multivariate) probability density associated with our observed data, $f_θ (x_1,x_2,…,x_n│θ)$. As a function of $θ$ with $x_1,x_2,…,x_n$ fixed, the likelihood function is
$$L(θ)=f_θ (x_1,x_2,…,x_n│θ).$$

The MLE of $θ$ is the value of $θ$ that maximizes $L(θ)$: $\arg\max_θ{L(θ)}.$

It is typically assumed that the observed data are independent and identically distributed (iid) with unknown parameter $θ$. The likelihood can then be written as a product of $n$ univariate probability densities: $L(θ)=\prod_{i=1}^n {f_θ (x_i |θ)}$. Since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum: $L^* (θ)=\sum_{i=1}^n {\ln{f_θ (x_i |θ)}}$. The maximum of this expression can then be found numerically using various optimization algorithms.

* Note: The MLE may not be unique, or guaranteed to exist.

* Example: consider the coin flipping example above, observing the number of heads in the outcomes and using this to infer the true probability of p(Head).
: Likelihood function: $L(θ)=f(x│θ=p)={10 \choose 4} p^4 (1-p)^6$
: Log-likelihood function: $L^* (θ)=\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}$.
: Maximize the log-likelihood function by setting its first derivative to zero:
$$ 0=\frac{d\left(\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}\right)}{dp} =\frac{4}{p}-\frac{6}{1-p} \quad \Rightarrow \quad p=\frac{2}{5}.$$
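
The same maximum can be located numerically, which is how the MLE is usually found when no closed-form solution exists. The Python sketch below evaluates the log-likelihood of the coin example on a grid of candidate values and picks the largest; the grid resolution of 0.001 and the function name <code>log_likelihood</code> are arbitrary illustrative choices, and a real application would normally use a proper optimization routine instead of a grid.

```python
import math

def log_likelihood(p, heads=4, flips=10):
    """Binomial log-likelihood for the coin example (4 heads in 10 flips)."""
    return (math.log(math.comb(flips, heads))
            + heads * math.log(p)
            + (flips - heads) * math.log(1 - p))

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded because log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)
print(p_hat)
```

The constant term $\ln{10 \choose 4}$ does not depend on $p$, so dropping it would leave the location of the maximum unchanged.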

====MOM vs. MLE====
* MOM is generally considered inferior to Fisher’s MLE method, because maximum likelihood estimates have a higher probability of being close to the quantities being estimated.
* MLE may be intractable in some situations, whereas the MOM estimates can be quickly and easily calculated by hand or using a computer.
* MOM estimates may be used as the first approximations to the solutions of the MLE method, and successive improved approximations may then be found by the [http://en.wikipedia.org/wiki/Newton-Raphson_method Newton-Raphson method]. In this respect, the MOM and MLE are symbiotic.
* Sometimes MOM estimates fall outside of the parameter space, in which case they are unreliable; this is never a problem with the MLE method.
* MOM estimates are not necessarily sufficient statistics, i.e., they sometimes fail to take into account all relevant information in the sample.
* MOM may be preferred to MLE for estimating some structural parameters, when appropriate probability distributions are unknown.

===Student’s T Distribution===
Student’s t distribution is needed to estimate the mean of a normally distributed population when the sample size is small and the population variance is unknown. It is the basis of the popular Student’s t-tests for the statistical significance of the difference between two sample means, and for confidence intervals for the difference between two population means.

Suppose $X_1,X_2,…,X_n$ are independent random variables that are normally distributed with expected value $μ$ and variance $σ^2$. The sample mean is $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n{x_i}$ and the sample variance is $S_n^2=\frac{1}{n-1} \sum_{i=1}^n{(x_i-\bar{x}_n)^2}$. The statistic $Z=\frac{\bar{x}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$ is normally distributed with mean 0 and variance 1, since the sample mean ($\bar{x}_n$) is normally distributed with mean $μ$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.
$$Z=\frac{\bar{x}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$$
$$T=\frac{\bar{x}_n-\mu}{\frac{S_n}{\sqrt{n}}}$$

T replaces $\sigma$ with with sample standard deviation. Also, $(n-1) \frac{S_n^2}{\sigma^2}$ has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-square distribution]] $\chi_{n-1}^2$ with degree of freedom equal to $n-1$.

* Example: suppose a study involves 25 patients and the relevant measurements are recorded:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Variable ||N || N* || Mean ||SE of Mean||StDev ||Minimum || Q1|| Median || Q3 ||Maximum
|-
| CD4 || 25|| 0 ||321.4|| 14.8 || 73.8 ||208.0 ||261.5 || 325.0 ||394.0 || 449.0
|}
</center>

What do we know from the background information?
: $\bar{y}= 321.4$
: $s = 73.8$
: $SE = 14.8$
: $n = 25$

: $CI(\alpha)=CI(0.05)$: $\bar{y} \pm t_{\alpha\over 2} {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{\frac{(y_i-\bar{y})^2}{n-1}}}.$

: $321.4 \pm t_{(24, 0.025)}{73.8\over \sqrt{25}}$
: $321.4 \pm 2.064\times 14.8$
: $[290.85, 351.95]$

====Estimating a population mean with large samples====
We use the following protocol to find point and interval estimates when the sample sizes are large, say exceeding 100.
* Assumptions: The [[SMHS_CLT_LLN|Central Limit Theorem]] guarantees that for large samples, the method above provides a valid recipe for constructing a confidence interval for the population mean, no matter what the distribution for the observed data may be. Of course, for significantly non-Normal distributions, we may need to increase the sample size to guarantee that the sampling distribution of the mean is approximately Normal.

* Point estimation of population mean: $\bar{X_n}={1\over n}\sum_{i=1}^n{X_i}$, constructed from a random sample of the process {$X_1, X_2, X_3, \cdots , X_n$}, which is an [http://en.wikipedia.org/wiki/Estimator_bias unbiased] estimate of the population mean $\mu$, if it exists! Note that the [[AP_Statistics_Curriculum_2007_EDA_Center | sample average may be susceptible to outliers]].

* Interval estimation of a population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> will be
: <math>CI(\alpha): \overline{x} \pm z_{\alpha\over 2} E,</math>
:: The '''Error''' term, E, is defined as
:: <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimated <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>

* <math>z_{\alpha\over 2}</math> is the [[AP_Statistics_Curriculum_2007_Normal_Critical | Critical Value]] for a [[AP_Statistics_Curriculum_2007_Normal_Std |Standard Normal]] distribution at <math>{\alpha\over 2}</math>.

* Example: a random sample of the number of sentences found in 30 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11, 17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12. The point estimate (the sample mean) is 14.77.
A confidence interval estimate of μ is a range of values used to estimate a population parameter.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16\over \sqrt{30}}=[11.03;18.51]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16\over \sqrt{30}}=[9.96;19.57]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16\over \sqrt{30}}=[7.24;22.29]</math></center>
: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of higher confidence.

** Unknown variance: use the sample variance 273 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=16.54</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16.54\over \sqrt{30}}=[10.90;18.63]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16.54\over \sqrt{30}}=[9.80;19.73]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16.54\over \sqrt{30}}=[6.99;22.54]</math></center>
: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of higher confidence.

: You can use the [http://www.socr.ucla.edu/htmls/ana/ConfidenceInterval_Analysis.html SOCR CI Analysis Applet] to compute these interval estimates.
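The interval recipe above can also be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name <code>z_interval</code> is an assumption introduced here for illustration, and the critical values come from <code>statistics.NormalDist</code> rather than from a table, so the endpoints may differ from the hand calculations in the last digit.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Number of sentences in 30 magazine advertisements (data from the example above).
sentences = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11,
             17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12]


def z_interval(data, confidence, sigma=None):
    """Normal-based CI for the mean; uses the sample SD when sigma is None."""
    n = len(data)
    x_bar = mean(data)
    spread = sigma if sigma is not None else stdev(data)  # stdev divides by n-1
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # critical value z_{alpha/2}
    margin = z * spread / sqrt(n)
    return x_bar - margin, x_bar + margin


for level in (0.80, 0.90, 0.99):
    known = z_interval(sentences, level, sigma=16)   # known population SD
    unknown = z_interval(sentences, level)           # SD estimated from the sample
    print(f"{level:.0%}  known sigma: [{known[0]:.2f}, {known[1]:.2f}]"
          f"  unknown sigma: [{unknown[0]:.2f}, {unknown[1]:.2f}]")
```

Passing <code>sigma=16</code> reproduces the known-variance case, and omitting it falls back to the sample standard deviation, as in the unknown-variance case.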

====Estimating a population mean with small samples (say <30 observations)====
For small samples, the point estimates are less precise and the interval estimates produce wider intervals, compared to the case of large samples.

* Assumptions: we need evidence that the data we observed and used for point and interval estimates come from a distribution that is (approximately) normal. If this assumption is violated, then the interval estimate we are going to introduce may significantly misrepresent the real confidence interval.
* Interval estimation of population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> is defined in terms of the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html T-distribution]:
:: <math>CI(\alpha): \overline{x} \pm t_{\alpha\over 2} E.</math>
:: The '''Error''' term, E, is defined as <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimate <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>
:: $t_{\alpha\over 2}$ is the [[AP_Statistics_Curriculum_2007_StudentsT |Critical Value for the T(df=sample-size -1) distribution at <math>{\alpha\over 2}</math>]].

* Example: a random sample of the number of sentences found in 10 magazine advertisements is listed below. Use this sample to find point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12. Suppose the point estimate is 22.1.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{16\over \sqrt{10}}=22.1 \pm 1.383{16\over \sqrt{10}}=[15.10 ; 29.10]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{16\over \sqrt{10}}=22.1 \pm 1.833{16\over \sqrt{10}}=[12.83 ; 31.37]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{16\over \sqrt{10}}=22.1 \pm 3.250{16\over \sqrt{10}}=[5.66 ; 38.54]</math></center>
:: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of higher confidence.

** Unknown variance: Suppose that we do '''not''' know the variance for the ''number of sentences per advertisement'' but use the sample variance 737.88 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=27.16390579</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{27.16390579\over \sqrt{10}}=22.1 \pm 1.383{27.16390579\over \sqrt{10}}=[10.22 ; 33.98]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{27.16390579\over \sqrt{10}}=22.1 \pm 1.833{27.16390579\over \sqrt{10}}=[6.35 ; 37.85]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{27.16390579 \over \sqrt{10}}=22.1 \pm 3.250{27.16390579\over \sqrt{10}}=[-5.82 ; 50.02]</math></center>
:: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of higher confidence.
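The small-sample intervals can be reproduced with a short Python sketch. The standard library has no quantile function for the T-distribution, so the critical values for df = 9 are simply copied from the example; the names <code>T_CRIT_DF9</code> and <code>t_interval</code> are assumptions introduced here for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Number of sentences in 10 magazine advertisements (data from the example above).
sample = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12]

# Critical values t_{alpha/2} for df = n - 1 = 9, copied from the worked example.
T_CRIT_DF9 = {0.80: 1.383, 0.90: 1.833, 0.99: 3.250}


def t_interval(data, t_crit, sigma=None):
    """CI for the mean: x_bar +/- t_crit * E, with E = sigma/sqrt(n) or SE."""
    n = len(data)
    x_bar = mean(data)
    spread = sigma if sigma is not None else stdev(data)  # stdev divides by n-1
    margin = t_crit * spread / sqrt(n)
    return x_bar - margin, x_bar + margin


for level, t_crit in T_CRIT_DF9.items():
    known = t_interval(sample, t_crit, sigma=16)
    unknown = t_interval(sample, t_crit)
    print(f"{level:.0%}  known sigma: [{known[0]:.2f}, {known[1]:.2f}]"
          f"  unknown sigma: [{unknown[0]:.2f}, {unknown[1]:.2f}]")
```

The single outlier (99) inflates the sample standard deviation to about 27.16, which is why the unknown-variance intervals are so much wider than the known-variance ones.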

====Estimating a population proportion====
When the sample size is large, the sampling distribution of the sample proportion <math>\hat{p}</math> is approximately Normal, by [[AP_Statistics_Curriculum_2007_Limits_CLT |CLT]], as the sample proportion may be presented as a [[AP_Statistics_Curriculum_2007_Limits_Norm2Bin |sample average of Bernoulli random variables]]. When the sample size is small, the normal approximation may be inadequate. To accommodate this, we will modify the '''sample-proportion''' <math>\hat{p}</math> slightly and obtain the '''corrected-sample-proportion''' <math>\tilde{p}</math>:
: <math>\hat{p}={y\over n} \longrightarrow \tilde{p}={y+0.5z_{\alpha \over 2}^2 \over n+z_{\alpha \over 2}^2},</math>
where [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2}</math> is the normal critical value we saw earlier]].

The standard error of <math>\hat{p}</math> also needs a slight modification
: <math>SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})\over n} \longrightarrow SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})\over n+z_{\alpha \over 2}^2}.</math>

* Example: Suppose a researcher is interested in studying the effect of aspirin in reducing heart attacks. He randomly recruits 500 subjects with evidence of early heart disease and has them take one aspirin daily for two years. At the end of the two years, he finds that during the study only 17 subjects had a heart attack. Calculate a 95% (<math>\alpha=0.05</math>) confidence interval for the true (unknown) proportion of subjects with early heart disease that have a heart attack while taking aspirin daily. Note that [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2} = z_{0.025}=1.96</math>]]:
:: <math>\hat{p} = {17\over 500}=0.034</math> ; <math>\tilde{p} = {17+0.5z_{0.025}^2\over 500+z_{0.025}^2}= {17+1.92\over 500+3.84}=0.038</math>
:: <math>SE_{\hat{p}}= \sqrt{0.034(1-0.034)\over 500}=0.0081</math>; <math>SE_{\tilde{p}}= \sqrt{0.038(1-0.038)\over 500+3.84}=0.0085</math>
::And the corresponding confidence intervals are given by
:: <math>\hat{p}\pm 1.96 SE_{\hat{p}}=[0.0181, 0.0499]</math>
:: <math>\tilde{p}\pm 1.96 SE_{\tilde{p}}=[0.0213, 0.0547]</math>

:: [[AP_Statistics_Curriculum_2007_Estim_Proportion#Sample-Size_Estimation_2|See this example of estimation of sample-size, given margin of error]]
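The two proportion intervals can be computed side by side with a small Python sketch. The function name <code>proportion_intervals</code> and its return layout are assumptions made here for illustration. The code carries full precision throughout, so its endpoints differ slightly from hand calculations that first round <math>\tilde{p}</math> to 0.038.

```python
from math import sqrt
from statistics import NormalDist


def proportion_intervals(y, n, confidence=0.95):
    """Return (plain, corrected), each as (estimate, se, low, high)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}

    # Plain sample proportion and its standard error.
    p_hat = y / n
    se_hat = sqrt(p_hat * (1 - p_hat) / n)

    # Corrected sample proportion: add z^2/2 successes and z^2 trials.
    n_tilde = n + z ** 2
    p_tilde = (y + 0.5 * z ** 2) / n_tilde
    se_tilde = sqrt(p_tilde * (1 - p_tilde) / n_tilde)

    plain = (p_hat, se_hat, p_hat - z * se_hat, p_hat + z * se_hat)
    corrected = (p_tilde, se_tilde, p_tilde - z * se_tilde, p_tilde + z * se_tilde)
    return plain, corrected


# Aspirin example: 17 heart attacks among 500 subjects, 95% confidence.
plain, corrected = proportion_intervals(17, 500)
print("plain:     p=%.4f  SE=%.4f  CI=[%.4f, %.4f]" % plain)
print("corrected: p=%.4f  SE=%.4f  CI=[%.4f, %.4f]" % corrected)
```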

====Estimating population variance====
An unbiased point estimate for the population variance <math>\sigma^2</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample-Variance (s<sup>2</sup>)]], and the usual point estimate for the population standard deviation <math>\sigma</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample Standard Deviation (s)]].

We use a [http://en.wikipedia.org/wiki/Chi_square_distribution Chi-Square Distribution] to construct confidence intervals for the variance and standard deviation. If the process or phenomenon we study generates a Normal random variable, then the following random variable (computed for a sample of size <math>n>1</math>) has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-Square Distribution]] with <math>n-1</math> degrees of freedom:
: <math>\chi_o^2 = {(n-1)s^2 \over \sigma^2}</math>

* Chi-Square Distribution Properties
** All chi-squares values <math>\chi_o^2 \geq 0</math>.
** The chi-square distribution is a family of curves, each is determined by the degrees of freedom (n-1). See the interactive [http://socr.ucla.edu/htmls/SOCR_Distributions.html Chi-Square distribution].
** To form a confidence interval for the variance (<math>\sigma^2</math>), use the <math>\chi^2(df=n-1)</math> distribution with degrees of freedom equal to one less than the sample size.
** The area under each curve of the Chi-Square Distribution equals one.
** All Chi-Square Distributions are positively skewed.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig1.jpg|500px]]</center>

* Interval Estimates of Population Variance and Standard Deviation:
:: Notice that the Chi-Square Distribution is '''not''' symmetric (positively skewed) and therefore, there are two critical values for each level of confidence. The value <math>\chi_L^2</math> represents the left-tail critical value and <math>\chi_R^2</math> represents the right-tail critical value. For various degrees of freedom and areas, you can compute all critical values either using the [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions] or using the [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Chi-square Distribution Calculator].

::: Example: Find the critical values, <math>\chi_L^2</math> and <math>\chi_R^2</math>, for a 90% confidence interval when the sample size is 25. Use the following Protocol:
::: Identify the degrees of freedom (<math>df=n-1=24</math>) and the level of confidence (<math>{\alpha\over 2}=0.05</math>).
::: Find the left and right critical values, <math>\chi_L^2=13.848</math> and <math>\chi_R^2=36.415</math>, as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig2.jpg|500px]]</center>

* Confidence Interval for <math>\sigma^2</math>
:: <math>{(n-1)s^2 \over \chi_R^2} \leq \sigma^2 \leq {(n-1)s^2 \over \chi_L^2}</math>

* Confidence Interval for <math>\sigma</math>
:: <math>\sqrt{(n-1)s^2 \over \chi_R^2} \leq \sigma \leq \sqrt{(n-1)s^2 \over \chi_L^2}</math>

====Hands-on Activity====
Construct the confidence intervals for <math>\sigma^2</math> and <math>\sigma</math> assuming the observations below represent a random sample from the liquid content (in fluid ounces) of 16 beverage cans and can be considered as Normally distributed. Use a 90% level of confidence.
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| 14.816 || 14.863 || 14.814 || 14.998 || 14.965 || 14.824 || 14.884 || 14.838 || 14.916 || 15.021 || 14.874 || 14.856 || 14.860 || 14.772 || 14.980 || 14.919
|}
</center>

* Get the sample statistics from [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] (e.g., Index Plot); Sample-Mean=14.8875; Sample-SD=0.072700298, Sample-Var=0.005285333.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig3.jpg|500px]]</center>

* Identify the degrees of freedom (<math>df=n-1=15</math>) and the level of confidence (<math>{\alpha/2}=0.05</math>), as we are looking for a <math>(1-\alpha)100% CI(\sigma^2)</math>.
* Find the left and right critical values, <math>\chi_L^2=7.261</math> and <math>\chi_R^2=24.9958</math> using [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Chi-Square Distribution], as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig4.jpg|500px]]</center>

* CI(<math>\sigma^2</math>)
: <math>0.00318={15\times 0.0053 \over 24.9958} \leq \sigma^2 \leq {15\times 0.0053 \over 7.261}=0.01095</math>

* CI(<math>\sigma</math>)
: <math>0.0564=\sqrt{15\times 0.0053 \over 24.9958} \leq \sigma \leq \sqrt{15\times 0.0053 \over 7.261}=0.10464</math>

** [[AP_Statistics_Curriculum_2007_Estim_Var#More_Examples|See more examples here]].
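The hands-on activity can be checked numerically with the sketch below. The Python standard library has no Chi-Square quantile function, so the two critical values for df = 15 are copied from the activity; the names <code>CHI2_LEFT</code>, <code>CHI2_RIGHT</code> and <code>variance_interval</code> are assumptions introduced for illustration.

```python
from math import sqrt
from statistics import variance

# Liquid content (fluid ounces) of 16 beverage cans, from the table above.
cans = [14.816, 14.863, 14.814, 14.998, 14.965, 14.824, 14.884, 14.838,
        14.916, 15.021, 14.874, 14.856, 14.860, 14.772, 14.980, 14.919]

# Left and right critical values of chi-square(df=15) for a 90% CI,
# copied from the activity rather than computed.
CHI2_LEFT, CHI2_RIGHT = 7.261, 24.9958


def variance_interval(data, chi2_left, chi2_right):
    """CI for sigma^2: (n-1)s^2/chi2_R <= sigma^2 <= (n-1)s^2/chi2_L."""
    n = len(data)
    s2 = variance(data)  # sample variance, divides by n-1
    return (n - 1) * s2 / chi2_right, (n - 1) * s2 / chi2_left


low, high = variance_interval(cans, CHI2_LEFT, CHI2_RIGHT)
print(f"CI(sigma^2): [{low:.5f}, {high:.5f}]")
print(f"CI(sigma):   [{sqrt(low):.4f}, {sqrt(high):.4f}]")
```

Because the code uses the unrounded sample variance (about 0.005285) instead of the rounded value 0.0053, its endpoints differ from the hand calculation in the last digits.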

===Applications===
* [http://www.tandfonline.com/doi/abs/10.1207/.U5ys8BZRXKw This article], titled Reliability of Scales with General Structure: Point and Interval Estimation Using a Structural Equation, discusses a method of obtaining point and interval estimates of reliability for composites of measures with a general structure. The approach is based on fitting a correspondingly constrained structural equation model and generalizes earlier covariance structure analysis methods for scale reliability estimation with congeneric tests. The procedure can be used with weighted or unweighted composites, in which the weights need not be known in advance but may be estimated simultaneously. The method presented in this paper allows one to obtain an approximate standard error and confidence interval for scale reliability using bootstrap.

* [[SOCR_EduMaterials_ModelerActivities_NormalBetaModelFit| This activity]] shows normal and beta distribution model fitting. It describes the process of SOCR model fitting using Normal or Beta distribution models. The article aims to motivate the need for analytical modeling of natural processes, illustrates how to use SOCR Modeler to fit models to real data, and presents applications of model fitting. It provides specific examples illustrating model fitting and two exercises to practice and learn.

* [[SOCR_EduMaterials_Activities_General_CI_Experiment|This experiment]] presents the SOCR activity on general confidence intervals and demonstrates the usage and functionality of the SOCR general confidence interval applet. It demonstrates the theory behind the use of interval-based estimates of the parameters, illustrates various confidence interval construction recipes, draws parallels between the construction algorithms and the intuitive meaning of confidence intervals, and presents a new technology-enhanced approach for understanding and utilizing confidence intervals for various applications. The article presents specific examples and exercises on this topic and works as a good supplement to point and interval estimates.

===Software===
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Tables]
* [http://socr.ucla.edu/htmls/exp/Confidence_Interval_Experiment_General.html SOCR General CI Experiment]
* [http://socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler]
* [http://socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
* [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

===Problems===
* Tom is in charge of sampling sugar measurements from a very large population of sugar. Lately his standard errors have been alarmingly high for his sample means. If he wants to decrease his sampling error (standard deviation of his sample means) by 1/2, what should he do?
: (a) Quadruple the variation inherent in the population.
: (b) Triple his sample size.
: (c) Quadruple his sample size.
: (d) Halve his sample size.

* The average standardized math score for eighth graders in the state of Michigan is 70 and the standard deviation is 10. We want to find out if the average standardized math score in district A is higher than the average score for the state of Michigan. The mean for a random sample of 36 students from this district is 72. What is the best response?
: (a) The p-value is around 0.76 and it is concluded that the average standardized math score in this district is not different from the overall population mean.
: (b) The p-value is around 0.12 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (c) The p-value is around 0.24 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (d) The p-value is around 0.88 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.

* A random sample of 121 students from the UMich was selected to estimate the average ACT score of all UMich students. The average for the sample was 23.4 and the sample standard deviation was 3.65. If you wanted to calculate a more precise and accurate prediction of the average ACT score of UMich students, which one of the following would be the best thing to do?
: (a) Decrease the sample size to 91.
: (b) Increase the sample size to 151.
: (c) Increase the confidence level to 99%.
: (d) Decrease the confidence level to 90%.

* How do the shape, center, and spread of t-models change as the degrees of freedom increase?
: (a) The shape and center stay the same, but the spread becomes narrower.
: (b) The shape and center stay the same, but the spread becomes wider.
: (c) The shape and spread stay the same, but the center will increase.
: (d) The shape and spread stay the same, but the center will decrease.

* Estimate the critical value of t for a 95% confidence interval with df = 15
: (a) 1.71
: (b) 2.131
: (c) 1.17
: (d) 3.45

* True or False: In a well-designed sample survey like the Current Population Survey, the observed sample percentage (e.g., percentage unemployed) is equal to the population percentage. Thus, it is appropriate to just report the sample percentage, without any measure of accuracy (i.e. without the margin of error).
: (a) True
: (b) False

* Suppose an NPR news story reports that: "A polling agency reports that the percentage of the American public who agree we should spend more money on the mental health of the war veterans is 42% +/- 3%."
: (a) The probability that the American public agree that we should spend more money on the mental health of the war veterans is between 39% to 42%.
: (b) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (c) We are 95% confident that the percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (d) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is 42%.

* A major newspaper wants to hire a polling agency to predict who will be the next governor. Agency A proposes to do the job with a random sample of 5000 voters at a cost of $\$50K$ (K = one thousand). Agency B proposes to do the job with a random sample of 7500 voters at a cost of $\$75K$. Assume both agencies find the percentage of voters to be 40% and both use the normal model to calculate the 95% interval. Which agency will you hire? Hint: Compare the margin of error for the two agencies and the relative costs before making your decision.
: (a) I will hire B.
: (b) I have no preference.
: (c) I need more information to decide who to hire.
: (d) I will hire A.

* Suppose that the proportion of the adult population who jog is 0.15. What is the probability that the proportion of joggers in a random sample of size n =200 lies between 0.13 and 0.17?
: (a) 0.5762 approximately
: (b) 0.8125 approximately
: (c) 0.2345 approximately
: (d) 0.1234 approximately

* Records at a large university indicate that 20% of all freshmen are placed on academic probation at the end of the first semester. A random sample of 100 freshmen found that 25% of them were placed on probation. The results of the sample:
: (a) are surprising since it indicates that 5% more of these freshmen were placed on probation than expected
: (b) are surprising since the standard deviation of the sampling distribution is 0.4%.
: (c) are biased since an increase of 5% could not happen without injecting bias into the sample.
: (d) are not surprising since the standard deviation of the sampling distribution is 4%.
: (e) are surprising since SAT scores have increased over the past years

* We have discussed that the standard deviation of the distribution of sample percentages, $SE(\hat{p})$ is calculated by taking the square root of $\frac{\hat{p}(1-\hat{p})}{N}$, where $\hat{p}$ is the proportion in the sample and N is sample size. What does $SE(\hat{p})$ show?
: (a) It shows the standard error of the mean across repeated samples from the population.
: (b) It shows the distribution of $\hat{p}$ for the single sample that the researcher draws from the population.
: (c) It shows the standard deviation of $\hat{p}$ for repeated samples from the population.
: (d) It shows the variation for $\hat{p}$ values for repeated samples from the population.

===References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_VII:_Point_and_Interval_Estimates SOCR]
* [http://en.wikipedia.org/wiki/Method_of_moments_(statistics) MoM Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_CIs}}

SMHS CIs

2014-09-01T18:35:06Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Point and Interval Estimation: MoM and MLE ==

===Overview===
The estimation of population parameters is critical in many applications. In statistics, estimation uses a combination of effect sizes, confidence intervals, and meta-analysis to plan experiments, analyze data and interpret results. It is most frequently carried out in terms of point-estimates or interval estimates for population parameters of interest. This lesson aims to study the various methodologies commonly used in point and interval estimates like Method of Moments (MOM) and Maximum Likelihood Estimation (MLE). We are interested in methods of estimating population parameters based on a sample distribution. We illustrate point and interval estimations of population means, proportions, and variances using methods introduced in this class.

===Motivation===
Suppose we wanted to estimate the probability of a head when flipping a specific coin by repeating the experiment several times. How confident are we in our estimate? There are a number of other similar situations where we need to evaluate, predict or estimate a population parameter of interest using an observed data sample. The method of moments (MOM) and maximum likelihood estimation (MLE) are among the most commonly used methods to estimate various population parameters.

In point and interval estimation, not only do we need to consider the distribution and model the estimates are based on, we also need to make assumptions in terms of the population distribution. Also, the estimates of parameters are influenced by other parameters of the population. For example, the estimate of the mean of the population is influenced by parameters like variance and sample size.

A confidence interval is an interval estimate of a parameter of interest. An interval constructed so that it contains the true value of the parameter for $(1-α)100\%$ of the samples taken is called a $(1-α)100\%$ confidence interval for that parameter, and the ends of the CI are called confidence limits.

===Theory===
====Method of Moments (MOM) Estimation====
This method uses the sample data to calculate some sample moments and then sets these equal to their corresponding population counterparts. Steps:
# Determine the k parameters of interest and the specific distribution for this process;
# Compute the first $k$ (or more) sample moments;
# Set the sample moments equal to the population moments and solve for a (linear or non-linear) system of $k$ equations with $k$ unknowns.

* MOM proportion example: consider flipping a coin 10 times and recording the outcomes (heads or tails). We use the outcomes to infer the true probability of a head ($p=P(Head)$). Suppose we observe the outcome $\{H,T,T,T,T,H,H,T,H,T\}$. This is a [[SMHS_ProbabilityDistributions#Binomial_distribution |Binomial experiment]], where $X$ is the number of heads and $E[X]=np$. Setting the population moment equal to the observed count gives $np=4$ with $n=10$, hence $MOM(p)=\hat{p} = \frac{4}{10}$.

* MOM Beta distribution example: Suppose we have 10 observations we suspect came from a [http://www.distributome.org/js/calc/BetaCalculator.html Beta distribution].
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
! Data||0.055||1.005||0.075||0.005||0.075||1.005||0.005||0.035||0.225
|}
</center>

The beta distribution mean and variance are defined explicitly in terms of two parameters.
* Mean: $μ=\frac{α}{α+β}$,
* Variance: $σ^2=\frac{αβ}{(α+β)^2 (α+β+1)}$.

The sample mean and sample variance are $\bar{x} = 0.251$, and $s^2=0.6187$. Solve for α and β.
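Equating the two Beta moments to $\bar{x}$ and $s^2$ and solving gives $α+β=\frac{\bar{x}(1-\bar{x})}{s^2}-1$, so $\hat{α}=\bar{x}(α+β)$ and $\hat{β}=(1-\bar{x})(α+β)$. A valid solution (both estimates positive) exists only when $s^2<\bar{x}(1-\bar{x})$. The Python sketch below implements these formulas; the function name <code>beta_mom</code> and the inputs $\bar{x}=0.25$, $s^2=0.03$ in the usage lines are hypothetical, chosen only so that the condition holds.

```python
def beta_mom(x_bar, s2):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution."""
    if not 0 < x_bar < 1:
        raise ValueError("sample mean must lie strictly between 0 and 1")
    if not 0 < s2 < x_bar * (1 - x_bar):
        # Otherwise alpha + beta <= 0, i.e. outside the parameter space.
        raise ValueError("no valid solution: need 0 < s2 < x_bar * (1 - x_bar)")
    total = x_bar * (1 - x_bar) / s2 - 1  # this quantity equals alpha + beta
    return x_bar * total, (1 - x_bar) * total


# Hypothetical summary statistics, used only to illustrate the call.
alpha_hat, beta_hat = beta_mom(0.25, 0.03)
print(alpha_hat, beta_hat)
```

With the summary values quoted above, $s^2=0.6187$ exceeds $\bar{x}(1-\bar{x})\approx 0.188$, so the function raises an error instead of returning estimates. This is the situation in which MOM estimates fall outside the parameter space.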

====Maximum likelihood estimation (MLE)====
Modeling distribution parameters using MLE estimation based on observed real world data offers a way of tuning the free parameters of the model to provide an optimum fit.

Suppose we observe a sample $x_1,x_2,…,x_n$ of $n$ values from one distribution with probability density/mass function $f_θ$, and we are trying to estimate the parameter $θ$. We can compute the (multivariate) probability density associated with our observed data, $f_θ (x_1,x_2,…,x_n│θ)$. As a function of $θ$ with $x_1,x_2,…,x_n$ fixed, the likelihood function is
$$L(θ)=f_θ (x_1,x_2,…,x_n│θ).$$

The MLE of $θ$ is the value of $θ$ that maximizes $L(θ)$: $\arg\max_θ{L(θ)}.$

It is typically assumed that the observed data are independent and identically distributed (iid) with unknown parameter $θ$. The likelihood can then be written as a product of $n$ univariate probability densities, $L(θ)=\prod_{i=1}^n {f_θ (x_i |θ)}$. Since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum: $L^* (θ)=\sum_{i=1}^n {\ln{f_θ (x_i |θ)}}$. The maximum of this expression can then be found numerically using various optimization algorithms.

* Note: The MLE may not be unique, or guaranteed to exist.

* Example: consider the coin flipping example above, observing the number of heads in the outcomes and using this to infer the true probability of p(Head).
: Likelihood function: $L(θ)=f(x│θ=p)={10 \choose 4} p^4 (1-p)^6$
: Log-likelihood function: $L^* (θ)=\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}$.
: Maximize the log-likelihood function by setting its first derivative to zero:
$$ 0=\frac{d}{dp}\left(\ln{10 \choose 4} + 4\ln{p} + 6\ln{(1-p)}\right) =\frac{4}{p}-\frac{6}{1-p} \Rightarrow \hat{p}=\frac{2}{5}.$$
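The same maximizer can be found numerically. The Python sketch below evaluates the log-likelihood on a grid of candidate values of $p$ and keeps the largest. The grid search is a deliberately simple stand-in for a proper optimization algorithm, and the names <code>log_likelihood</code> and <code>p_mle</code> are assumptions introduced for illustration.

```python
from math import comb, log


def log_likelihood(p, heads=4, flips=10):
    """Binomial log-likelihood of observing `heads` heads in `flips` flips."""
    return log(comb(flips, heads)) + heads * log(p) + (flips - heads) * log(1 - p)


# Candidate values 0.001, 0.002, ..., 0.999 (the endpoints 0 and 1 are
# excluded because the logarithm is undefined there).
grid = [k / 1000 for k in range(1, 1000)]
p_mle = max(grid, key=log_likelihood)
print(p_mle)  # 0.4
```

The constant term $\ln{10 \choose 4}$ does not depend on $p$, so dropping it would not change the maximizer.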

====MOM vs. MLE====
* MOM estimates are generally inferior to those of Fisher’s MLE method, because maximum likelihood estimators have a higher probability of being close to the quantities being estimated.
* MLE may be intractable in some situations, whereas the MOM estimates can be quickly and easily calculated by hand or using a computer.
* MOM estimates may be used as the first approximations to the solutions of the MLE method, and successive improved approximations may then be found by the [http://en.wikipedia.org/wiki/Newton-Raphson_method Newton-Raphson method]. In this respect, the MOM and MLE are symbiotic.
* Sometimes, MOM estimates may fall outside of the parameter space, which makes them unreliable; this is never a problem with the ML method.
* MOM estimates are not necessarily sufficient statistics, i.e., they sometimes fail to take into account all relevant information in the sample.
* MOM may be preferred to MLE for estimating some structural parameters, when appropriate probability distributions are unknown.

===Student’s T Distribution===
Student’s t distribution is the distribution needed to estimate the mean of a normally distributed population when the sample size is small and the population variance is unknown. It is the basis of the popular Student’s t-tests for the statistical significance of the difference between two sample means, and for confidence intervals for the difference between two population means.

Suppose $X_1,X_2,…,X_n$ are independent random variables that are normally distributed with expected value $μ$ and variance $σ^2$. Define the sample mean $\bar{x}_n = \frac{1}{n} \sum_{i=1}^n{x_i}$ and the sample variance $S_n^2=\frac{1}{n-1} \sum_{i=1}^n{(x_i-\bar{x}_n)^2}$. Since the sample mean $\bar{x}_n$ is normally distributed with mean $μ$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, the standardized statistic
$$Z=\frac{\bar{x}_n-\mu}{\frac{\sigma}{\sqrt{n}}}$$
is normally distributed with mean 0 and variance 1. Replacing $\sigma$ with the sample standard deviation $S_n$ gives
$$T=\frac{\bar{x}_n-\mu}{\frac{S_n}{\sqrt{n}}},$$
which has a Student’s t distribution with $n-1$ degrees of freedom. Also, $(n-1) \frac{S_n^2}{\sigma^2}$ has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-square distribution]] $\chi_{n-1}^2$ with $n-1$ degrees of freedom.

* Example: suppose a study involves 25 patients and the following summary of their CD4 measurements is recorded:
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| Variable ||N || N* || Mean ||SE of Mean||StDev ||Minimum || Q1|| Median || Q3 ||Maximum
|-
| CD4 || 25|| 0 ||321.4|| 14.8 || 73.8 ||208.0 ||261.5 || 325.0 ||394.0 || 449.0
|}
</center>

What do we know from the background information?
: $\bar{y}= 321.4$
: $s = 73.8$
: $SE = 14.8$
: $n = 25$

: $CI(\alpha)=CI(0.05)$: $\bar{y} \pm t_{(n-1,{\alpha\over 2})} {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{\frac{(y_i-\bar{y})^2}{n-1}}}.$

: $321.4 \pm t_{(24, 0.025)}{73.8\over \sqrt{25}}$
: $321.4 \pm 2.064\times 14.8$
: $[290.85, 351.95]$
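
The CD4 calculation can be reproduced with a few lines of Python. This is a minimal sketch that takes the summary statistics from the table and the critical value 2.064 quoted in the text as given; the variable names are illustrative. Because the text rounds the standard error to 14.8 before multiplying, the endpoints computed here may differ from the hand calculation in the first decimal place.

```python
import math

# Summary statistics taken from the CD4 table.
n = 25
mean = 321.4
sd = 73.8

# Critical value t(24, 0.025) as quoted in the text (read from a t table).
t_crit = 2.064

se = sd / math.sqrt(n)   # standard error of the mean
margin = t_crit * se     # half-width of the 95% interval
ci = (mean - margin, mean + margin)

print(f"SE = {se:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```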

====Estimating a population mean with large samples====
We use the following protocol to find point and interval estimates when the sample sizes are large, say exceeding 100.
* Assumptions: The [[SMHS_CLT_LLN|Central Limit Theorem]] guarantees that for large samples, the method above provides a valid recipe for constructing a confidence interval for the population mean, no matter what the distribution for the observed data may be. Of course, for significantly non-Normal distributions, we may need to increase the sample size to guarantee that the sampling distribution of the mean is approximately Normal.

* Point estimation of population mean: $\bar{X_n}={1\over n}\sum_{i=1}^n{X_i}$, constructed from a random sample of the process {$X_1, X_2, X_3, \cdots , X_n$}, which is an [http://en.wikipedia.org/wiki/Estimator_bias unbiased] estimate of the population mean $\mu$, if it exists! Note that the [[AP_Statistics_Curriculum_2007_EDA_Center | sample average may be susceptible to outliers]].

* Interval estimation of a population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> will be
: <math>CI(\alpha): \overline{x} \pm z_{\alpha\over 2} E,</math>
:: The '''Error''' term, E, is defined as
:: <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimated <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>

* <math>z_{\alpha\over 2}</math> is the [[AP_Statistics_Curriculum_2007_Normal_Critical | Critical Value]] for a [[AP_Statistics_Curriculum_2007_Normal_Std |Standard Normal]] distribution at <math>{\alpha\over 2}</math>.

* Example: a random sample of the number of sentences found in 30 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11, 17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12. The point estimate is the sample mean, <math>\overline{x}=443/30\approx 14.77</math>.
A confidence interval estimate of μ is a range of values used to estimate a population parameter.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16\over \sqrt{30}}=[11.03;18.51]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16\over \sqrt{30}}=[9.96;19.57]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16\over \sqrt{30}}=[7.24;22.29]</math></center>
: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.

** Unknown variance: Suppose that we do '''not''' know the population variance and use the sample variance (approximately 273.4) as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=16.54</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.28SE(\overline{x})=14.77 \pm 1.28{16.54\over \sqrt{30}}=[10.90;18.63]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.645SE(\overline{x})=14.77 \pm 1.645{16.54\over \sqrt{30}}=[9.80;19.73]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 2.575SE(\overline{x})=14.77 \pm 2.575{16.54\over \sqrt{30}}=[6.99;22.54]</math></center>
: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.

: You can use the [http://www.socr.ucla.edu/htmls/ana/ConfidenceInterval_Analysis.html SOCR CI Analysis Applet] to compute these interval estimates.
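
As an alternative to the applet, the large-sample intervals in this example can be sketched in Python using only the standard library. The helper name <code>z_interval</code> is hypothetical. The critical values come from <code>statistics.NormalDist</code> rather than from a rounded table, so the endpoints may differ from the hand calculations in the last digit.

```python
import math
from statistics import NormalDist, mean, stdev

ads = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12, 5, 9, 17, 6, 11,
       17, 18, 20, 6, 14, 7, 11, 12, 5, 18, 6, 4, 13, 11, 12]

def z_interval(data, confidence, sigma=None):
    """Normal-based CI for the mean; uses the sample SD when sigma is None."""
    n = len(data)
    spread = stdev(data) if sigma is None else sigma
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z * spread / math.sqrt(n)
    center = mean(data)
    return center - margin, center + margin

for level in (0.80, 0.90, 0.99):
    known = z_interval(ads, level, sigma=16)   # known-variance case
    unknown = z_interval(ads, level)           # sample SD as the estimate
    print(level, [round(v, 2) for v in known], [round(v, 2) for v in unknown])
```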

====Estimating a population mean with small samples (say <30 observations)====
For small samples, the point estimates are less precise and the interval estimates produce wider intervals, compared to the case of large samples.

* Assumptions: We need evidence that the observed data used for the point and interval estimates come from a distribution that is (approximately) Normal. If this assumption is violated, then the interval estimate introduced below may significantly misrepresent the real confidence interval.
* Interval estimation of a population mean: Choose a confidence level <math>(1-\alpha)100%</math>, where <math>\alpha</math> is small (e.g., 0.1, 0.05, 0.025, 0.01, 0.001, etc.). Then a <math>(1-\alpha)100%</math> confidence interval for <math>\mu</math> is defined in terms of the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html T-distribution]:
:: <math>CI(\alpha): \overline{x} \pm t_{\alpha\over 2} E.</math>
:: The '''Error''' term, E, is defined as <math>E = \begin{cases}{\sigma\over\sqrt{n}},& \texttt{for-known}-\sigma,\\
{SE},& \texttt{for-unknown}-\sigma.\end{cases}</math>
:: The '''Standard Error''' of the estimate <math>\overline {x}</math> is obtained by replacing the unknown population standard deviation by the sample standard deviation:
<math>SE(\overline {x}) = {1\over \sqrt{n}} \sqrt{\sum_{i=1}^n{(x_i-\overline{x})^2\over n-1}}</math>
:: $t_{\alpha\over 2}$ is the [[AP_Statistics_Curriculum_2007_StudentsT |Critical Value for the T(df=sample-size -1) distribution at <math>{\alpha\over 2}</math>]].

* Example: a random sample of the number of sentences found in 10 magazine advertisements is listed below. Use this sample to find a point estimate for the population mean μ. Samples: 16, 9, 14, 11, 17, 12, 99, 18, 13, 12. The point estimate is the sample mean, 22.1.
** Known variance: Suppose that the variance for the ''number of sentences per advertisement'' example above is known to be 256 (so the population standard deviation is <math>\sigma=16</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{16\over \sqrt{10}}=22.1 \pm 1.383{16\over \sqrt{10}}=[15.10 ; 29.10]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{16\over \sqrt{10}}=22.1 \pm 1.833{16\over \sqrt{10}}=[12.83 ; 31.37]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{16\over \sqrt{10}}=22.1 \pm 3.250{16\over \sqrt{10}}=[5.66 ; 38.54]</math></center>
:: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.

** Unknown variance: Suppose that we do '''not''' know the variance for the ''number of sentences per advertisement'' but use the sample variance 737.88 as an estimate (so the sample standard deviation is <math>s=\hat{\sigma}=27.16390579</math>).
:: For <math>{\alpha \over 2}=0.1</math>, the <math>80% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.383{27.16390579\over \sqrt{10}}=22.1 \pm 1.383{27.16390579\over \sqrt{10}}=[10.22 ; 33.98]</math></center>
:: For <math>{\alpha \over 2}=0.05</math>, the <math>90% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 1.833{27.16390579\over \sqrt{10}}=22.1 \pm 1.833{27.16390579\over \sqrt{10}}=[6.35 ; 37.85]</math></center>
:: For <math>{\alpha \over 2}=0.005</math>, the <math>99% CI(\mu)</math> is constructed by:
<center> <math>\overline{x}\pm 3.250{27.16390579 \over \sqrt{10}}=22.1 \pm 3.250{27.16390579\over \sqrt{10}}=[-5.82 ; 50.02]</math></center>
:: Notice that the CIs widen as <math>\alpha</math> decreases, reflecting our choice of a higher confidence level.
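
The small-sample intervals in this example can be checked with the Python sketch below. It assumes the critical values for the T distribution with 9 degrees of freedom are read from a table (here copied from the example), since the Python standard library does not provide T quantiles. The helper name <code>t_interval</code> is hypothetical.

```python
import math
from statistics import mean, stdev

sample = [16, 9, 14, 11, 17, 12, 99, 18, 13, 12]

# Critical values t_{alpha/2} for df = 9, copied from the example.
T_CRIT = {0.80: 1.383, 0.90: 1.833, 0.99: 3.250}

def t_interval(data, t_crit, sigma=None):
    """CI for the mean: center +/- t_crit * E, with E = sigma/sqrt(n) or SE."""
    n = len(data)
    spread = stdev(data) if sigma is None else sigma
    margin = t_crit * spread / math.sqrt(n)
    center = mean(data)
    return center - margin, center + margin

for level, t_crit in T_CRIT.items():
    known = t_interval(sample, t_crit, sigma=16)   # known-variance case
    unknown = t_interval(sample, t_crit)           # sample SD as the estimate
    print(level, [round(v, 2) for v in known], [round(v, 2) for v in unknown])
```

The single outlier (99) inflates both the sample mean and the sample standard deviation, which is why the unknown-variance intervals are so much wider than the known-variance ones.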

====Estimating a population proportion====
When the sample size is large, the sampling distribution of the sample proportion <math>\hat{p}</math> is approximately Normal, by [[AP_Statistics_Curriculum_2007_Limits_CLT |CLT]], as the sample proportion may be presented as a [[AP_Statistics_Curriculum_2007_Limits_Norm2Bin |sample average of Bernoulli random variables]]. When the sample size is small, the normal approximation may be inadequate. To accommodate this, we will modify the '''sample-proportion''' <math>\hat{p}</math> slightly and obtain the '''corrected-sample-proportion''' <math>\tilde{p}</math>:
: <math>\hat{p}={y\over n} \longrightarrow \tilde{p}={y+0.5z_{\alpha \over 2}^2 \over n+z_{\alpha \over 2}^2},</math>
where [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2}</math> is the normal critical value we saw earlier]].

The standard error of <math>\hat{p}</math> also needs a slight modification
: <math>SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})\over n} \longrightarrow SE_{\tilde{p}} = \sqrt{\tilde{p}(1-\tilde{p})\over n+z_{\alpha \over 2}^2}.</math>

* Example: Suppose a researcher is interested in studying the effect of aspirin in reducing heart attacks. He randomly recruits 500 subjects with evidence of early heart disease and has them take one aspirin daily for two years. At the end of the two years, he finds that during the study only 17 subjects had a heart attack. Calculate a 95% (<math>\alpha=0.05</math>) confidence interval for the true (unknown) proportion of subjects with early heart disease that have a heart attack while taking aspirin daily. Note that [[AP_Statistics_Curriculum_2007_Normal_Critical | <math>z_{\alpha \over 2} = z_{0.025}=1.96</math>]]:
:: <math>\hat{p} = {17\over 500}=0.034</math> ; <math>\tilde{p} = {17+0.5z_{0.025}^2\over 500+z_{0.025}^2}= {17+1.92\over 500+3.84}=0.038</math>
:: <math>SE_{\hat{p}}= \sqrt{0.034(1-0.034)\over 500}=0.0081</math>; <math>SE_{\tilde{p}}= \sqrt{0.038(1-0.038)\over 500+3.84}=0.0085</math>
:: And the corresponding confidence intervals are given by
:: <math>\hat{p}\pm 1.96 SE_{\hat{p}}=[0.0181, 0.0499]</math>
:: <math>\tilde{p}\pm 1.96 SE_{\tilde{p}}=[0.0213, 0.0547]</math>

:: [[AP_Statistics_Curriculum_2007_Estim_Proportion#Sample-Size_Estimation_2|See this example of estimation of sample-size, given margin of error]]
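
The aspirin example can be sketched in Python as follows. The function name <code>proportion_intervals</code> is hypothetical, and the code implements only the two formulas given in this section (the plain and the corrected sample proportion). It keeps full precision throughout, so the printed endpoints may differ slightly from hand calculations that round <math>\tilde{p}</math> and the standard errors first.

```python
import math
from statistics import NormalDist

def proportion_intervals(y, n, confidence=0.95):
    """Return (plain, corrected) normal-approximation CIs for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}

    # Plain sample proportion and its standard error.
    p_hat = y / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)

    # Corrected proportion: add z^2/2 successes and z^2 trials.
    p_tilde = (y + 0.5 * z**2) / (n + z**2)
    se_tilde = math.sqrt(p_tilde * (1 - p_tilde) / (n + z**2))

    plain = (p_hat - z * se_hat, p_hat + z * se_hat)
    corrected = (p_tilde - z * se_tilde, p_tilde + z * se_tilde)
    return plain, corrected

plain, corrected = proportion_intervals(17, 500)
print([round(v, 4) for v in plain], [round(v, 4) for v in corrected])
```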

====Estimating population variance====
An unbiased point estimate for the population variance <math>\sigma^2</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample-Variance (s<sup>2</sup>)]], and the corresponding point estimate for the population standard deviation <math>\sigma</math> is the [[AP_Statistics_Curriculum_2007_EDA_Var | Sample Standard Deviation (s)]].

We use a [http://en.wikipedia.org/wiki/Chi_square_distribution Chi-Square Distribution] to construct confidence intervals for the variance and standard deviation. If the process or phenomenon we study generates a Normal random variable, then the following random variable (computed from a sample of size <math>n>1</math>) has a [[AP_Statistics_Curriculum_2007_Chi-Square|Chi-Square Distribution]] with <math>n-1</math> degrees of freedom:
: <math>\chi_o^2 = {(n-1)s^2 \over \sigma^2}</math>

* Chi-Square Distribution Properties
** All chi-square values <math>\chi_o^2 \geq 0</math>.
** The chi-square distribution is a family of curves, each determined by the degrees of freedom (n-1). See the interactive [http://socr.ucla.edu/htmls/SOCR_Distributions.html Chi-Square distribution].
** To form a confidence interval for the variance (<math>\sigma^2</math>), use the <math>\chi^2(df=n-1)</math> distribution with degrees of freedom equal to one less than the sample size.
** The area under each curve of the Chi-Square Distribution equals one.
** All Chi-Square Distributions are positively skewed.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig1.jpg|500px]]</center>

* Interval Estimates of Population Variance and Standard Deviation:
:: Notice that the Chi-Square Distribution is '''not''' symmetric (positively skewed) and therefore, there are two critical values for each level of confidence. The value <math>\chi_L^2</math> represents the left-tail critical value and <math>\chi_R^2</math> represents the right-tail critical value. For various degrees of freedom and areas, you can compute all critical values either using the [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions] or using the [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Chi-square Distribution Calculator].

::: Example: Find the critical values, <math>\chi_L^2</math> and <math>\chi_R^2</math>, for a 90% confidence interval when the sample size is 25. Use the following Protocol:
::: Identify the degrees of freedom (<math>df=n-1=24</math>) and the level of confidence (<math>{\alpha\over 2}=0.05</math>).
::: Find the left and right critical values, <math>\chi_L^2=13.848</math> and <math>\chi_R^2=36.415</math>, as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig2.jpg|500px]]</center>

* Confidence Interval for <math>\sigma^2</math>
:: <math>{(n-1)s^2 \over \chi_R^2} \leq \sigma^2 \leq {(n-1)s^2 \over \chi_L^2}</math>

* Confidence Interval for <math>\sigma</math>
:: <math>\sqrt{(n-1)s^2 \over \chi_R^2} \leq \sigma \leq \sqrt{(n-1)s^2 \over \chi_L^2}</math>

====Hands-on Activity====
Construct the confidence intervals for <math>\sigma^2</math> and <math>\sigma</math> assuming the observations below represent a random sample from the liquid content (in fluid ounces) of 16 beverage cans and can be considered as Normally distributed. Use a 90% level of confidence.
<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
| 14.816 || 14.863 || 14.814 || 14.998 || 14.965 || 14.824 || 14.884 || 14.838 || 14.916 || 15.021 || 14.874 || 14.856 || 14.860 || 14.772 || 14.980 || 14.919
|}
</center>

* Get the sample statistics from [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts] (e.g., Index Plot); Sample-Mean=14.8875; Sample-SD=0.072700298, Sample-Var=0.005285333.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig3.jpg|500px]]</center>

* Identify the degrees of freedom (<math>df=n-1=15</math>) and the level of confidence (<math>{\alpha/2}=0.05</math>), as we are looking for a <math>(1-\alpha)100% CI(\sigma^2)</math>.
* Find the left and right critical values, <math>\chi_L^2=7.261</math> and <math>\chi_R^2=24.9958</math> using [http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Chi-Square Distribution], as in the image below.
<center>[[Image:SOCR_EBook_Dinov_Estim_Var_020408_Fig4.jpg|500px]]</center>

* CI(<math>\sigma^2</math>)
: <math>0.00318={15\times 0.0053 \over 24.9958} \leq \sigma^2 \leq {15\times 0.0053 \over 7.261}=0.01095</math>

* CI(<math>\sigma</math>)
: <math>0.0564=\sqrt{15\times 0.0053 \over 24.9958} \leq \sigma \leq \sqrt{15\times 0.0053 \over 7.261}=0.10464</math>

** [[AP_Statistics_Curriculum_2007_Estim_Var#More_Examples|See more examples here]].
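
The hands-on activity can also be checked in Python. This sketch assumes the two Chi-Square critical values for 15 degrees of freedom are taken from the SOCR calculator as quoted above, since the Python standard library has no Chi-Square quantile function; the variable names are illustrative. It uses the unrounded sample variance, so the endpoints may differ slightly from a hand calculation that rounds the variance to 0.0053.

```python
import math
from statistics import variance

cans = [14.816, 14.863, 14.814, 14.998, 14.965, 14.824, 14.884, 14.838,
        14.916, 15.021, 14.874, 14.856, 14.860, 14.772, 14.980, 14.919]

# Chi-square critical values for df = 15 and alpha/2 = 0.05, as quoted above.
chi2_left, chi2_right = 7.261, 24.9958

df = len(cans) - 1
s2 = variance(cans)  # sample variance (divides by n - 1)

# The right-tail critical value gives the lower bound, and vice versa.
var_ci = (df * s2 / chi2_right, df * s2 / chi2_left)
sd_ci = tuple(math.sqrt(v) for v in var_ci)

print(f"s^2 = {s2:.9f}")
print(f"90% CI for variance: [{var_ci[0]:.5f}, {var_ci[1]:.5f}]")
print(f"90% CI for SD:       [{sd_ci[0]:.4f}, {sd_ci[1]:.4f}]")
```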

===Applications===
* [http://www.tandfonline.com/doi/abs/10.1207/.U5ys8BZRXKw This article], titled ''Reliability of Scales with General Structure: Point and Interval Estimation Using a Structural Equation'', discusses a method of obtaining point and interval estimates of reliability for composites of measures with a general structure. The approach is based on fitting a correspondingly constrained structural equation model and generalizes earlier covariance structure analysis methods for scale reliability estimation with congeneric tests. The procedure can be used with weighted or unweighted composites, in which the weights need not be known in advance but may be estimated simultaneously. The method presented in this paper allows one to obtain an approximate standard error and confidence interval for scale reliability using the bootstrap.

* [[SOCR_EduMaterials_ModelerActivities_NormalBetaModelFit| This activity]] shows normal and beta distribution model fitting. It describes the process of SOCR model fitting using Normal or Beta distribution models. The article motivates the need for analytical modeling of natural processes, illustrates how to use the SOCR Modeler to fit models to real data, and presents applications of model fitting. It provides specific examples illustrating model fitting and two exercises for practice.

* [[SOCR_EduMaterials_Activities_General_CI_Experiment|This experiment]] is a SOCR activity on general confidence intervals that demonstrates the usage and functionality of the SOCR general confidence interval applet. It demonstrates the theory behind the use of interval-based estimates of parameters, illustrates various recipes for constructing confidence intervals, draws parallels between the construction algorithms and the intuitive meaning of confidence intervals, and presents a new technology-enhanced approach for understanding and utilizing confidence intervals in various applications. The article presents specific examples and exercises on this topic and works as a good supplement to point and interval estimates.

===Software===
* [http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm SOCR Tables]
* [http://socr.ucla.edu/htmls/exp/Confidence_Interval_Experiment_General.html SOCR General CI Experiment]
* [http://socr.ucla.edu/htmls/SOCR_Modeler.html SOCR Modeler]
* [http://socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
* [http://socr.ucla.edu/htmls/SOCR_Charts.html SOCR Charts]

===Problems===
* Tom is in charge of sampling sugar measurements from a very large population of sugar. Lately, the standard errors of his sample means have been alarmingly high. If he wants to decrease his sampling error (the standard deviation of his sample means) by 1/2, what should he do?
: (a) Quadruple the variation inherent in the population.
: (b) Triple his sample size.
: (c) Quadruple his sample size.
: (d) Halve his sample size.

* The average standardized math score for eighth graders in the state of Michigan is 70 and the standard deviation is 10. We want to find out if the average standardized math score in district A is higher than the average score for the state of Michigan. The mean for a random sample of 36 students from this district is 72. What is the best response?
: (a) The p-value is around 0.76 and it is concluded that the average standardized math score in this district is not different from the overall population mean.
: (b) The p-value is around 0.12 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (c) The p-value is around 0.24 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.
: (d) The p-value is around 0.88 and it is concluded that the average standardized math score in this district is not higher than the overall population mean.

* A random sample of 121 students from the UMich was selected to estimate the average ACT score of all UMich students. The average for the sample was 23.4 and the sample standard deviation was 3.65. If you wanted to calculate a more precise and accurate prediction of the average ACT score of UMich students, which one of the following would be the best thing to do?
: (a) Decrease the sample size to 91.
: (b) Increase the sample size to 151.
: (c) Increase the confidence level to 99%.
: (d) Decrease the confidence level to 90%.

* How do the shape, center, and spread of t-models change as the degrees of freedom increase?
: (a) The shape and center stay the same, but the spread becomes narrower.
: (b) The shape and center stay the same, but the spread becomes wider.
: (c) The shape and spread stay the same, but the center will increase.
: (d) The shape and spread stay the same, but the center will decrease.

* Estimate the critical value of t for a 95% confidence interval with df = 15
: (a) 1.71
: (b) 2.131
: (c) 1.17
: (d) 3.45

* True or False: In a well-designed sample survey like the Current Population Survey, the observed sample percentage (e.g., percentage unemployed) is equal to the population percentage. Thus, it is appropriate to just report the sample percentage, without any measure of accuracy (i.e. without the margin of error).
: (a) True
: (b) False

* Suppose an NPR news story reports that: "A polling agency reports that the percentage of the American public who agree we should spend more money on the mental health of the war veterans is 42% +/- 3%."
: (a) The probability that the American public agree that we should spend more money on the mental health of the war veterans is between 39% to 42%.
: (b) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (c) We are 95% confident that the percentage of the American public who agree that we should spend more money on the mental health of the war veterans is between 39% to 45%.
: (d) The percentage of the American public who agree that we should spend more money on the mental health of the war veterans is 42%.

* A major newspaper wants to hire a polling agency to predict who will be the next governor. Agency A proposes to do the job with a random sample of 5000 voters at a cost of $\$50K$ (K = one thousand). Agency B proposes to do the job with a random sample of 7500 voters at a cost of $\$75K$. Assume both agencies find the percentage of voters to be 40% and both use the normal model to calculate the 95% interval. Which agency will you hire? Hint: Compare the margin of error for the two agencies and the relative costs before making your decision.
: (a) I will hire B.
: (b) I have no preference.
: (c) I need more information to decide who to hire.
: (d) I will hire A.

* Suppose that the proportion of the adult population who jog is 0.15. What is the probability that the proportion of joggers in a random sample of size n =200 lies between 0.13 and 0.17?
: (a) 0.5762 approximately
: (b) 0.8125 approximately
: (c) 0.2345 approximately
: (d) 0.1234 approximately

* Records at a large university indicate that 20% of all freshmen are placed on academic probation at the end of the first semester. A random sample of 100 freshmen found that 25% of them were placed on probation. The results of the sample:
: (a) are surprising since it indicates that 5% more of these freshmen were placed on probation than expected
: (b) are surprising since the standard deviation of the sampling distribution is 0.4%.
: (c) are biased since an increase of 5% could not happen without injecting bias into the sample.
: (d) are not surprising since the standard deviation of the sampling distribution is 4%.
: (e) are surprising since SAT scores have increased over the past years

* We have discussed that the standard deviation of the distribution of sample percentages, $SE(\hat{p})$ is calculated by taking the square root of $\frac{\hat{p}(1-\hat{p})}{N}$, where $\hat{p}$ is the proportion in the sample and N is sample size. What does $SE(\hat{p})$ show?
: (a) It shows the standard error of the mean across repeated samples from the population.
: (b) It shows the distribution of $\hat{p}$ for the single sample that the researcher draws from the population.
: (c) It shows the standard deviation of $\hat{p}$ for repeated samples from the population.
: (d) It shows the variation for $\hat{p}$ values for repeated samples from the population.

===References===
* [http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_VII:_Point_and_Interval_Estimates SOCR]
* [http://en.wikipedia.org/wiki/Method_of_moments_(statistics) MoM Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_CIs}}

SMHS PCA ICA FA

2014-09-01T18:31:43Z

Jslavine: /* Applications */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible while being orthogonal to the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unit vectors ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. Finding the loading vector then involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quotient given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The full transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, so the right singular vectors are eigenvectors of $X^T X$ and the squared singular values are the corresponding eigenvalues. The score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations as you like;
** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
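
The computational steps above can be sketched in plain Python for the two-variable case, where the eigenvalues of the symmetric $2\times 2$ covariance matrix have a closed form. The function name <code>pca_2d</code> and the small dataset are made up for illustration; for more than two variables a numerical linear algebra library would normally be used instead.

```python
import math

def pca_2d(xs, ys):
    """Eigenvalues and first principal direction for two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n

    # Center each variable by subtracting its mean.
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root

    # Eigenvector for the largest eigenvalue, normalized to length 1.
    if sxy != 0:
        v = (lam1 - syy, sxy)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    w1 = (v[0] / norm, v[1] / norm)
    return (lam1, lam2), w1

# Illustrative data: Y is exactly 2X, so all variance lies along one direction.
(lam1, lam2), w1 = pca_2d([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(lam1, lam2, w1)
```

With perfectly collinear data such as this, the second eigenvalue is zero (up to rounding) and the first principal direction is proportional to $(1, 2)$, which is what the most significant component is expected to capture.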

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the matrix whose columns are the eigenvectors of $\Sigma$ ordered by decreasing eigenvalue;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization $\alpha_k' \alpha_k=1$, for $k=1,2,…,p$.
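
The spectral decomposition and the trace properties above can be checked numerically. The sketch below uses the sample covariance matrix of the built-in USArrests data as a stand-in for $\Sigma$; the helper tr_y() and the choice $q=2$ are assumptions made for this illustration.

```r
Sigma  <- cov(USArrests)            # sample covariance matrix standing in for Sigma
e      <- eigen(Sigma)
lambda <- e$values                  # lambda_1 >= ... >= lambda_p
A      <- e$vectors                 # columns are the unit-length eigenvectors alpha_k
p      <- ncol(A)

# Rebuild Sigma as the sum of the rank-one terms lambda_k * alpha_k alpha_k'
terms <- lapply(seq_len(p), function(k) lambda[k] * outer(A[, k], A[, k]))
Sigma_rebuilt <- Reduce(`+`, terms)
max(abs(Sigma - Sigma_rebuilt))     # should be numerically zero

# Trace of Sigma_y = B' Sigma B for the first q and the last q columns of A
q    <- 2
tr_y <- function(B) sum(diag(t(B) %*% Sigma %*% B))
tr_y(A[, 1:q])                      # sum of the q largest eigenvalues
tr_y(A[, (p - q + 1):p])            # sum of the q smallest eigenvalues
```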

====PCA Limitations====
* The results of PCA depend on the scaling of the variables.
* The applicability of PCA is limited by certain assumptions made in its derivation.

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal
# components respectively.
biplot(pc.cr) ## shows the plot of PCA in a different format

From the chart above, we can see the distribution of the variance attributed to different variables in the four principal components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
[[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR activity]] illustrates the use of PCA.

====ICA (independent component analysis)====
ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other.
** The distributions of the values in each source signal are non-Gaussian.
** Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;
** Normality: Based on [[SMHS_CLT_LLN|CLT]], the distribution of the sum of independent random variables approximates a Gaussian distribution;
** Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires that: (1) at most one of the sources $s_k$ is Gaussian; and (2) the number of observed mixtures, $m$, is at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), so that the mixing matrix $A$ has full rank and can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example demonstrates how to separate two mixed independent uniform variables:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" and "%*%" indicate element-wise and matrix multiplication, respectively!

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[, 1], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.
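
One consequence of these assumptions, added here as a standard derivation rather than taken from the original text, is the covariance structure that the model implies for a single observation (one column of $x$). Writing $\Psi = cov(ε)$, a symbol introduced only for this sketch, which is diagonal because the error terms are independent:

$$cov(x) = E\left[(LF+ε)(LF+ε)'\right] = L\,E(FF')\,L' + L\,E(Fε') + E(εF')\,L' + E(εε') = LL' + \Psi.$$

The cross terms vanish because $ε$ and $F$ are independent with mean zero, and $E(FF')=cov(F)=I$. The off-diagonal covariances between the observed variables are therefore explained entirely by the loadings $L$.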

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. If a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a=\frac{1}{N_i} \sum_i {x_{ai}}$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In index notation, this model can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
or, in matrix form, $Z=LF+ϵ$, where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings for the $a^{th}$ subject, for $p=1,2$.
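
To make the two-factor example concrete, the following sketch simulates data from the model $Z=LF+ϵ$ and fits it with factanal(). The loading values (0.8 and 0.2), the error standard deviation (0.5) and the names L_true, Fmat and eps are hypothetical choices made for this illustration; they are not part of the original example.

```r
set.seed(1)
N_i <- 1000   # students
N_a <- 10     # subjects
N_p <- 2      # factors

# Hypothetical loadings: subjects 1-5 load mainly on factor 1, subjects 6-10 on factor 2
L_true <- cbind(c(rep(0.8, 5), rep(0.2, 5)),
                c(rep(0.2, 5), rep(0.8, 5)))

Fmat <- matrix(rnorm(N_p * N_i), N_p, N_i)             # factors with E(F) = 0, cov(F) = I
eps  <- matrix(rnorm(N_a * N_i, sd = 0.5), N_a, N_i)   # independent error terms
Z    <- L_true %*% Fmat + eps                          # N_a x N_i data matrix

# factanal() expects observations in rows, hence the transpose
fit <- factanal(t(Z), factors = N_p, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)
```

The estimated loadings are only determined up to rotation and sign, so they should show the same two-block pattern as L_true rather than reproduce its values exactly.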

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # principal axis factoring; factor.pa() is deprecated in recent versions of psych
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and computes the eigenvectors of a slightly different matrix. Principal components are new variables formed as linear combinations of the original variables, and these new variables are all mutually orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach that seeks to reproduce the variable's total variance. In PCA, the components reflect both common and unique variance of the variable. It is generally preferred for the purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that factor analysis also involves linear combinations of variables.

* In contrast to PCA, factor analysis is a correlation-focused approach seeking to reproduce the correlations among variables. The factors represent the common variance of variables and exclude unique variance. Factor analysis is generally used when the purpose of the research is detection of underlying structure in the data (i.e., latent constructs or factors) or causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the utilization of a SOCR analysis package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and demonstrates how to use PCA and read and interpret the outcome. It introduces students to inputting data in the correct format, reading the results of PCA and interpreting the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, is recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants block-scaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduces the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It relates the order determination of an autoregressive model to the determination of the number of factors in a maximum likelihood factor analysis. The use of the AIC criterion in factor analysis is particularly interesting when it is used with a Bayesian model. This observation reveals that AIC can be applied more widely than only to the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads to the handling of the problem of improper solutions by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields in recent years. The basic independent component model is a semi-parametric model that assumes that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate the unmixing matrix so the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (i.e., consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so-called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> dev.new() # opens a new graphics device on any platform (windows() works only on Windows)
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> dev.new()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T18:26:01Z

Jslavine: /* PCA, ICA, FA: Similarities and Differences */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unit vectors ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$, $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$, and then finding the loading vector that extracts the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quotient given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The full transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$, where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of non-negative numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^2 W^T$ (writing $\Sigma^2$ for $\Sigma^T \Sigma$), and the score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations as you like;
** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
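
The SVD relations and the truncated transformation $T_L=XW_L$ described above can be checked numerically. The sketch below uses the centered built-in USArrests data and $L=2$, both chosen only for illustration. Note that R's svd() returns the thin decomposition by default, so the $U$ it returns is $n\times p$ rather than the $n\times n$ matrix used in the text.

```r
X <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)  # column means shifted to zero
s <- svd(X)                        # X = U diag(d) W'
W <- s$v                           # right singular vectors = loading vectors
T_full  <- X %*% W                 # score matrix T = XW
U_Sigma <- s$u %*% diag(s$d)       # U Sigma from the thin SVD
max(abs(T_full - U_Sigma))         # should be numerically zero

# The eigenvalues of X'X are the squared singular values
ev <- eigen(crossprod(X))$values
all.equal(ev, s$d^2)

# Truncated transformation keeping the first L components
L   <- 2
T_L <- X %*% W[, 1:L]              # n rows, L columns
dim(T_L)
sum(s$d[1:L]^2) / sum(s$d^2)       # fraction of the total variance retained
```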

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the matrix whose columns are the eigenvectors of $\Sigma$ ordered by decreasing eigenvalue;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization $\alpha_k' \alpha_k=1$, for $k=1,2,…,p$.

====PCA Limitations====
* The results of PCA depend on the scaling of the variables.
* The applicability of PCA is limited by certain assumptions made in its derivation.

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal
# components respectively.
biplot(pc.cr) ## shows the plot of PCA in a different format

From the chart above, we can see the distribution of the variance attributed to different variables in the four principal components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
[[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR activity]] illustrates the use of PCA.

====ICA (independent component analysis)====
ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other.
** The distributions of the values in each source signal are non-Gaussian.
** Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;
** Normality: Based on [[SMHS_CLT_LLN|CLT]], the distribution of the sum of independent random variables approximates a Gaussian distribution;
** Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted by $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$, using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), so that the mixing matrix $A$ is of full rank and can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example demonstrates how to separate two mixed independent uniform variables:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" is element-wise multiplication and "%*%" is matrix multiplication

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[, 1], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")
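
ICA implementations typically centre and whiten (sphere) the mixed data before searching for independent directions. The sketch below illustrates that pre-processing step in base R, reusing the mixing matrix from the first example; the names <code>Xc</code>, <code>K</code> and <code>Z</code> are hypothetical, and the internal pre-processing of <code>fastICA</code> may differ in detail.

```r
# Whitening (sphering) sketch in base R; A is the mixing matrix of the first example
set.seed(1)                                    # arbitrary seed, for reproducibility only
S <- matrix(runif(10000), 5000, 2)             # two independent uniform sources
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A                                   # observed mixtures

Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
e  <- eigen(cov(Xc))                           # eigen-decomposition of the covariance
K  <- e$vectors %*% diag(1 / sqrt(e$values))   # whitening matrix (hypothetical name)
Z  <- Xc %*% K                                 # whitened data

round(cov(Z), 6)                               # identity matrix, up to rounding
```

After whitening, the columns of <code>Z</code> are uncorrelated with unit variance, so the remaining task for ICA is to find a rotation that makes them as independent as possible.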

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$, with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where the $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, with $n$ observations, we have $x-μ = LF+ε$, where $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, so that the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix $L$.
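
Under assumptions (1)-(3), the model implies that the covariance matrix of $x$ equals $LL^T+\Psi$, where $\Psi$ is the diagonal covariance matrix of $ε$ (a standard consequence of the model, not stated explicitly above). The following sketch checks this by simulation; the loading matrix, error variances and sample size are hypothetical values chosen only for illustration.

```r
# Simulate x - mu = L F + eps and compare cov(x) with L L' + Psi
set.seed(42)                          # arbitrary seed
p <- 4; k <- 2; n <- 20000            # hypothetical sizes
L <- matrix(c(0.9, 0.8, 0.0, 0.1,     # hypothetical p x k loading matrix (filled by column)
              0.0, 0.2, 0.7, 0.8), nrow = p, ncol = k)
psi <- c(0.3, 0.4, 0.5, 0.2)          # hypothetical error variances

Fmat <- matrix(rnorm(k * n), k, n)                # factors: E(F) = 0, cov(F) = I
eps  <- matrix(rnorm(p * n), p, n) * sqrt(psi)    # row i has variance psi[i]
x    <- L %*% Fmat + eps                          # p x n data matrix (mu = 0)

implied    <- L %*% t(L) + diag(psi)  # model-implied covariance
sample_cov <- cov(t(x))               # cov() expects observations in rows
round(sample_cov - implied, 2)        # entries should be close to zero
```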

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using the letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. If a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances, or sets of observations. To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a=\frac{1}{N_i} \sum_i {x_{ai}}$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In index notation, this model (which reads $Z=LF+\varepsilon$ in matrix form) can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings for the $a^{th}$ subject, for $p=1,2$.
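
To make the example concrete, the sketch below simulates scores for $N_i=1000$ students on $N_a=10$ subjects driven by two latent factors, and then fits a two-factor model with <code>factanal()</code>. All numerical values (loadings of 0.8, error standard deviation of 0.6, the seed) are hypothetical and were chosen only so that the simulated variables have roughly unit variance.

```r
# Simulated version of the 10-subject, 2-factor example (all numbers are hypothetical)
set.seed(123)
N_i <- 1000                                   # students
verbal <- rnorm(N_i)                          # F_1: "verbal intelligence"
math   <- rnorm(N_i)                          # F_2: "mathematical intelligence"

# Subjects 1-5 load on the verbal factor, subjects 6-10 on the mathematical one
ell <- rbind(cbind(rep(0.8, 5), 0), cbind(0, rep(0.8, 5)))   # 10 x 2 loadings
scores <- cbind(verbal, math) %*% t(ell) +
  matrix(rnorm(N_i * 10, sd = 0.6), N_i, 10)  # add unique (error) variation
colnames(scores) <- paste0("subject", 1:10)

fit <- factanal(scores, factors = 2, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)             # two blocks of large loadings are expected
```

The order and signs of the estimated factors are arbitrary, so the "verbal" block may appear in either column.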

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # fa(..., fm="pa") replaces the deprecated factor.pa()
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach that seeks to reproduce the variable's total variance. In PCA, the components reflect both common and unique variance of the variable. It is generally preferred for the purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that factor analysis also involves linear combinations of variables.

* In contrast to PCA, factor analysis is a correlation-focused approach seeking to reproduce the correlations among variables. The factors represent the common variance of variables and exclude unique variance. Factor analysis is generally used when the purpose of the research is to detect underlying structure in the data (i.e., latent constructs or factors) or to perform causal modeling.
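
The distinction between total variance (PCA) and common variance (FA) can be illustrated with a small simulation. This is only a sketch under assumed values: the six variables, their loadings and the error standard deviation are hypothetical, and <code>communality</code> is an illustrative name for one minus the uniqueness reported by <code>factanal()</code>.

```r
# Contrast PCA and FA on simulated data (loadings and sizes are hypothetical)
set.seed(7)
n <- 2000
F1 <- rnorm(n); F2 <- rnorm(n)                # two latent factors
X <- cbind(0.8 * F1, 0.7 * F1, 0.6 * F1,
           0.8 * F2, 0.7 * F2, 0.6 * F2) + matrix(rnorm(n * 6, sd = 0.6), n, 6)
colnames(X) <- paste0("x", 1:6)

# PCA on the correlation matrix: the eigenvalues add up to the total variance (p = 6)
pca_var <- eigen(cor(X))$values
sum(pca_var)

# FA: each variable keeps a unique variance that the common factors do not explain
fa_fit <- factanal(X, factors = 2)
communality <- 1 - fa_fit$uniquenesses        # share of variance due to the common factors
round(communality, 2)
```

The principal components jointly reproduce all of the variance, whereas the factor model attributes only the communality of each variable to the common factors.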

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and its theoretical background, and shows how to use PCA and how to read and interpret the outcome. It introduces students to entering data in the correct format, reading the results of PCA, and interpreting the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and FA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> windows()
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> windows()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T18:22:45Z

Jslavine: /* Factor analysis (FA) */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible, subject to being orthogonal to the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be of unit length ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$, $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$, and then finding the loading vector that extracts the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quotient given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, where $\Sigma^T \Sigma$ is the $p\times p$ diagonal matrix of squared singular values; and the score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.
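
The eigen-decomposition and SVD routes to the scores, and the truncated transformation $T_L=XW_L$, can be compared numerically. The sketch below uses the built-in USArrests data after centring; the object names are illustrative, and note that R's <code>svd()</code> returns the thin decomposition (with $U$ of size $n\times p$) rather than the full $n\times n$ matrix described above.

```r
# PCA of the centred USArrests data via eigen-decomposition and via SVD
X <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)  # column means are zero

W_eig <- eigen(crossprod(X))$vectors      # eigenvectors of X^T X (loading vectors)
sv    <- svd(X)                           # X = U diag(d) W^T (thin SVD)

T_eig <- X %*% W_eig                      # scores T = X W
T_svd <- sv$u %*% diag(sv$d)              # scores T = U Sigma

# Columns agree up to an arbitrary sign flip, so compare absolute values
max(abs(abs(T_eig) - abs(T_svd)))         # numerically zero

# Truncated transformation T_L = X W_L keeping L = 2 components
L   <- 2
T_L <- X %*% W_eig[, 1:L]
dim(T_L)                                  # n rows, L columns
round(cor(T_eig), 10)                     # scores are uncorrelated over the dataset
```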

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations as you like;
** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
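
The computational steps listed above can be sketched in base R as follows. The choice of the two USArrests variables as $X$ and $Y$ is an assumption made purely for illustration.

```r
# Step-by-step PCA following the list above, using two USArrests variables as X and Y
dat <- USArrests[, c("Murder", "Assault")]           # step 1: at least two variables

centred <- sweep(as.matrix(dat), 2, colMeans(dat))   # step 2: subtract each column mean
S <- cov(centred)                                    # step 3: covariance matrix
e <- eigen(S)                                        # step 4: eigenvalues and unit-length eigenvectors
pc1 <- e$vectors[, which.max(e$values)]              # step 5: eigenvector with the largest eigenvalue

pc1
sqrt(sum(pc1^2))                                     # length of the eigenvector (should be 1)

# Cross-check against prcomp(); the sign of a component is arbitrary
abs(pc1) - abs(prcomp(dat)$rotation[, 1])
```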

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the matrix whose columns are the eigenvectors $\alpha_1,…,\alpha_p$ of $\Sigma$ ordered by decreasing eigenvalue;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. Hence, the last few principal components are not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization $\alpha_k' \alpha_k=1$, for $k=1,2,…,p$.
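
These properties can be verified numerically. The following sketch, which assumes the sample covariance matrix of the built-in USArrests data plays the role of $\Sigma$, checks the spectral decomposition, the variance identity, and the value of the trace for $B=A_q$.

```r
# Check the spectral decomposition on the covariance matrix of USArrests
Sigma <- cov(USArrests)
e <- eigen(Sigma)
lambda <- e$values                 # eigenvalues, in decreasing order
A <- e$vectors                     # columns are the eigenvectors alpha_k

# Sigma = sum_k lambda_k * alpha_k alpha_k'
Sigma_rebuilt <- Reduce(`+`, lapply(seq_along(lambda), function(k) {
  lambda[k] * tcrossprod(A[, k])
}))
max(abs(Sigma_rebuilt - Sigma))    # numerically zero

# var(x_j) = sum_k lambda_k * alpha_kj^2
var_rebuilt <- as.vector(A^2 %*% lambda)
cbind(diag(Sigma), var_rebuilt)

# tr(B' Sigma B) with B = A_q (first q columns) equals the sum of the q largest eigenvalues
q <- 2
sum(diag(t(A[, 1:q]) %*% Sigma %*% A[, 1:q]))
sum(lambda[1:q])
```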

====PCA Limitations====
* The results of PCA depend on the scaling of the variables.
* The applicability of PCA is limited by certain assumptions made in its derivation.
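
The dependence on scaling is easy to demonstrate. In the sketch below, multiplying one variable by 100 (a hypothetical change of measurement units) changes which variable dominates the first covariance-based component, whereas the standardized analysis is unaffected.

```r
# Effect of rescaling one variable on covariance-based PCA (USArrests)
pc_raw <- prcomp(USArrests)                      # unscaled analysis
dominant_raw <- names(which.max(abs(pc_raw$rotation[, 1])))

rescaled <- USArrests
rescaled$UrbanPop <- rescaled$UrbanPop * 100     # hypothetical change of units
pc_rescaled <- prcomp(rescaled)
dominant_rescaled <- names(which.max(abs(pc_rescaled$rotation[, 1])))

c(raw = dominant_raw, rescaled = dominant_rescaled)

# Standardising first (scale. = TRUE) makes the result invariant to such unit changes
max(abs(abs(prcomp(USArrests, scale. = TRUE)$rotation) -
        abs(prcomp(rescaled,  scale. = TRUE)$rotation)))
```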

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The bar chart (scree plot) shows the variance attributable to each of the four principal components,
# in decreasing order.
biplot(pc.cr) ## shows the plot of PCA in a different format

The biplot shows the observations and the original variables projected onto the first two principal components, which indicates how strongly each variable contributes to each of these components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
[[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR activity]] illustrates the use of PCA.

====ICA (independent component analysis)====
ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other.
** The distributions of the values in each source signal are non-Gaussian.
** Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;
** Normality: Based on [[SMHS_CLT_LLN|CLT]], the distribution of the sum of independent random variables approximates a Gaussian distribution;
** Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
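
The normality assumption can be illustrated numerically: a mixture of independent non-Gaussian sources is closer to Gaussian than the sources themselves, which is why many ICA algorithms search for directions of maximal non-Gaussianity. The sketch below uses excess kurtosis as a simple measure; <code>excess_kurtosis</code> is a hypothetical helper written for this illustration, not a function of any package.

```r
# Mixtures of independent non-Gaussian sources look "more Gaussian" than the sources
set.seed(2024)                                   # arbitrary seed
excess_kurtosis <- function(v) {                 # hypothetical helper; 0 for a Gaussian
  z <- (v - mean(v)) / sd(v)
  mean(z^4) - 3
}

s1 <- runif(1e5)                                 # uniform sources: excess kurtosis about -1.2
s2 <- runif(1e5)
mix <- s1 + s2                                   # an equal-weight mixture of the two sources

c(source = excess_kurtosis(s1), mixture = excess_kurtosis(mix))
```

The excess kurtosis of the mixture should be noticeably closer to zero than that of either source.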

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted by $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$, using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), so that the mixing matrix $A$ is of full rank and can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example demonstrates how to separate two mixed independent uniform variables:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" is element-wise multiplication and "%*%" is matrix multiplication

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[, 1], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$, with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where the $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, which ensures that the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix $L$.

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using the letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. If a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances (observations). To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

This model, which in matrix form reads $Z=LF+\varepsilon$, can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings of the $a^{th}$ subject, for $p=1,2$.
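
As a rough illustration of the model above, the following R sketch simulates standardized scores for $N_i=1000$ students on $N_a=10$ questions driven by $N_p=2$ factors, and then compares the sample correlation matrix with the correlation implied by the model. The loading values, the random seed and the names <code>L</code>, <code>psi</code>, <code>Fmat</code> and <code>Z</code> are assumptions chosen only for illustration; they do not come from a real dataset.

```r
# Simulation sketch of the two-factor model Z = L F + epsilon (hypothetical loadings).
set.seed(42)
N_i <- 1000; N_a <- 10; N_p <- 2          # students, questions, factors

# Hypothetical loading matrix L (N_a x N_p): questions 1-5 load mainly on
# factor 1, questions 6-10 mainly on factor 2.
L <- cbind(c(rep(0.8, 5), rep(0.1, 5)),
           c(rep(0.1, 5), rep(0.7, 5)))
psi <- 1 - rowSums(L^2)                   # unique variances, so each z_a has unit variance

Fmat <- matrix(rnorm(N_p * N_i), N_p, N_i)                        # factors: E(F) = 0, cov(F) = I
eps  <- matrix(rnorm(N_a * N_i, sd = rep(sqrt(psi), N_i)), N_a, N_i)  # row a has variance psi[a]
Z    <- L %*% Fmat + eps                                          # N_a x N_i matrix of scores

# Correlation implied by the model (L L' + diag(psi)) versus the sample correlation
R_model  <- tcrossprod(L) + diag(psi)
R_sample <- cor(t(Z))
max_gap  <- max(abs(R_sample - R_model))  # small, up to sampling error

# Maximum likelihood factor analysis on the simulated data
Zt <- t(Z)
colnames(Zt) <- paste0("q", 1:N_a)
fit <- factanal(Zt, factors = N_p, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)
```

The estimated loadings should resemble the columns of <code>L</code>, possibly with the factors in a different order or with flipped signs, since the factors are only identified up to rotation.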

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # fa() with fm="pa" replaces the deprecated factor.pa()
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components are new variables formed as linear combinations of the original variables, with the property that they are mutually orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach that seeks to reproduce the total variance of the variables; its components reflect both the common and the unique variance of each variable. It is generally preferred for data reduction (i.e., translating variable space into an optimal factor space), but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that it also involves linear combinations of variables.

* Unlike PCA, factor analysis is a correlation-focused approach that seeks to reproduce the inter-correlations among variables; its factors represent the common variance of the variables, excluding unique variance. Factor analysis is generally used when the research purpose is to detect data structure (i.e., latent constructs or factors) or to perform causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and its theoretical background, and shows how to use PCA and how to read and interpret the outcome. It shows students how to input data in the correct format, read the results of PCA and interpret the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function (stats package)]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and FA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> windows()
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> windows()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T18:17:53Z

Jslavine: /* ICA (independent component analysis) */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible, subject to being uncorrelated with the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unit vectors ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping is such that the individual elements of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. Finding the loading vector then amounts to extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quantity being maximized given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of non-negative numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, and the score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations as you like;
** Normalize (center) the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$;
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
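
The computational steps above, and their connection to the SVD, can be sketched in base R as follows. This is only a minimal illustration on the built-in <code>USArrests</code> data; the variable names (<code>scores_eig</code>, <code>scores_svd</code>, <code>w1</code>) are hypothetical, and in practice <code>prcomp()</code> or <code>princomp()</code> would normally be used instead.

```r
# Minimal PCA sketch using only base R and the built-in USArrests data.
X <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)  # center each column
n <- nrow(X)

# Covariance matrix of the centered data and its eigen-decomposition
S   <- cov(X)                        # p x p sample covariance (divisor n - 1)
eig <- eigen(S, symmetric = TRUE)    # eigenvalues in decreasing order, unit-length eigenvectors
W   <- eig$vectors                   # loading vectors w_(k) stored as columns
scores_eig <- X %*% W                # score matrix T = X W

# The same decomposition through the SVD X = U Sigma W^T, so that T = U Sigma
sv <- svd(X)
scores_svd <- sv$u %*% diag(sv$d)

# Eigenvalues of the covariance matrix are the squared singular values over (n - 1)
all.equal(eig$values, sv$d^2 / (n - 1))
# The scores agree up to the arbitrary sign of each column
all.equal(abs(scores_eig), abs(scores_svd), check.attributes = FALSE)

# The most significant principal component: eigenvector with the largest eigenvalue
w1 <- W[, 1]
```

Working from the SVD avoids forming the covariance matrix explicitly, which is one reason <code>prcomp()</code> is often preferred numerically over an eigen-decomposition.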

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix of $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the matrix whose columns are the eigenvectors $\alpha_k$ of $\Sigma$ ordered by decreasing eigenvalue $\lambda_k$;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are therefore not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization constraint $\alpha_k' \alpha_k=1$, for $k=1,2,…,p$.
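
These properties can be checked numerically. The sketch below uses the sample covariance matrix of the built-in <code>USArrests</code> data as a stand-in for $\Sigma$ (an assumption made only for illustration), with $q=2$; the helper <code>tr()</code> and the other variable names are hypothetical.

```r
# Numerical check of the PCA properties, using a sample covariance matrix as Sigma.
Sigma  <- cov(USArrests)
eig    <- eigen(Sigma, symmetric = TRUE)
lambda <- eig$values     # lambda_1 >= ... >= lambda_p
A      <- eig$vectors    # columns are the unit-length eigenvectors alpha_1, ..., alpha_p

# Spectral decomposition: Sigma = sum_k lambda_k alpha_k alpha_k'
Sigma_rebuilt <- Reduce(`+`, lapply(seq_along(lambda),
                                    function(k) lambda[k] * tcrossprod(A[, k])))
all.equal(Sigma_rebuilt, Sigma, check.attributes = FALSE)

# var(x_j) = sum_k lambda_k alpha_kj^2  (row j of A holds the j-th element of each alpha_k)
var_from_eig <- as.vector((A^2) %*% lambda)
all.equal(var_from_eig, unname(diag(Sigma)))

# Trace property: B = first q columns of A maximizes tr(B' Sigma B),
# B = last q columns of A minimizes it.
tr <- function(M) sum(diag(M))
q  <- 2
B_first <- A[, 1:q]
B_last  <- A[, (ncol(A) - q + 1):ncol(A)]
tr_max <- tr(t(B_first) %*% Sigma %*% B_first)   # sum of the q largest eigenvalues
tr_min <- tr(t(B_last)  %*% Sigma %*% B_last)    # sum of the q smallest eigenvalues
```

The two traces bracket the value obtained from any other $p\times q$ matrix $B$ with orthonormal columns.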

====PCA Limitations====
* The results of PCA depend on the scaling of the variables.
* The applicability of PCA is limited by certain assumptions made in its derivation.

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The bar chart (scree plot) shows the variance attributable to each of the four
# principal components.
biplot(pc.cr) ## shows the plot of PCA in a different format

From the chart above, we can see the distribution of the variance attributed to different variables in the four principal components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
[[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR activity]] illustrates the use of PCA.

====ICA (independent component analysis)====
ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other.
** The distributions of the values in each source signal are non-Gaussian.
** Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;
** Normality: Based on [[SMHS_CLT_LLN|CLT]], the distribution of the sum of independent random variables approximates a Gaussian distribution;
** Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
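
The normality assumption can be illustrated with a small simulation: a mixture of independent non-Gaussian sources is "more Gaussian" than the sources themselves, which is why ICA searches for maximally non-Gaussian projections. The sketch below is a base R illustration only; the helper <code>excess_kurtosis()</code>, the seed and the choice of uniform sources are assumptions, not part of any ICA package.

```r
# A mixture of independent non-Gaussian sources is closer to Gaussian than the sources.
set.seed(1)
n  <- 5000
s1 <- runif(n)            # two independent, non-Gaussian (uniform) sources
s2 <- runif(n)
x  <- s1 + s2             # a simple signal mixture

# Sample excess kurtosis: 0 for a Gaussian variable; the theoretical value is
# -1.2 for a uniform variable and -0.6 for the sum of two independent uniforms.
excess_kurtosis <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^4) - 3
}

k_sources <- c(excess_kurtosis(s1), excess_kurtosis(s2))
k_mixture <- excess_kurtosis(x)

# The mixture's excess kurtosis is closer to zero (the Gaussian value)
abs(k_mixture) < min(abs(k_sources))
```

Kurtosis is only one possible measure of non-Gaussianity; the <code>fastICA</code> examples in this section use the "logcosh" contrast function instead.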

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted by $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with the additional assumption of zero-mean, uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources. This requires that: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, is at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), and the mixing matrix $A$ is of full rank so that the mixing can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example demonstrates how to separate two mixed independent uniform variables:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" is element-wise multiplication and "%*%" is matrix multiplication

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data-frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[, 1], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$, with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where the $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, which ensures that the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix $L$.
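
Under these assumptions the model implies a specific covariance structure for the observed variables, which is what estimation procedures fit to the data. A short derivation for a single observation (treating $x$, $F$ and $ε$ as random vectors) is given below; the symbol $\Psi = cov(ε)$ is introduced here only for illustration, and it is diagonal because the error terms are independent:

$$\begin{aligned} cov(x) &= E\left[(LF+\varepsilon)(LF+\varepsilon)^T\right] \\ &= L\,cov(F)\,L^T + L\,E(F\varepsilon^T) + E(\varepsilon F^T)\,L^T + cov(\varepsilon) \\ &= LL^T + \Psi. \end{aligned}$$

The cross terms vanish because $ε$ and $F$ are independent with mean zero, and $cov(F)=I$ reduces the first term to $LL^T$. The loadings therefore account for all of the covariance between different observed variables, while $\Psi$ contributes only to their individual variances.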

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using the letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. If a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances (observations). To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In index notation (equivalent to the matrix form $Z=LF+\varepsilon$), this model can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings of the $a^{th}$ subject on factor $p$, for $p=1,2$.
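The model above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the loading values, the seed and the names <code>L_true</code>, <code>Fmat</code> and <code>Z</code> are hypothetical choices made for illustration, and the data are generated directly from $Z=LF+\varepsilon$ with $N_a=10$, $N_p=2$ and $N_i=1000$ rather than taken from real test scores.

```r
# Simulate standardized scores from a two-factor model and refit it with factanal().
set.seed(2014)
N_i <- 1000   # students (instances)
N_a <- 10     # subjects (observed variables)
N_p <- 2      # factors

# Hypothetical loadings: subjects 1-5 load mainly on factor 1, subjects 6-10 on factor 2
L_true <- cbind(c(rep(0.8, 5), rep(0.2, 5)),
                c(rep(0.2, 5), rep(0.8, 5)))
psi_true <- 1 - rowSums(L_true^2)   # unique variances chosen so that var(z_a) = 1

Fmat <- matrix(rnorm(N_p * N_i), N_p, N_i)                    # factors with cov(F) = I
eps  <- matrix(rnorm(N_a * N_i, sd = sqrt(psi_true)), N_a, N_i) # independent errors
Z    <- L_true %*% Fmat + eps                                   # N_a x N_i data matrix

# factanal() expects observations in rows, hence the transpose
fit   <- factanal(t(Z), factors = N_p, rotation = "varimax")
L_hat <- unclass(fit$loadings)

# Correlation matrix implied by the fitted model vs. the sample correlation matrix
implied      <- L_hat %*% t(L_hat) + diag(fit$uniquenesses)
max_abs_diff <- max(abs(implied - cor(t(Z))))
```

Printing <code>fit</code> shows the estimated loadings, which should separate the first five subjects from the last five (possibly with the factor order or signs swapped), and <code>max_abs_diff</code> summarizes how closely the fitted model reproduces the observed correlations.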

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # factor.pa() is deprecated in psych; fa() with fm="pa" performs principal axis factoring
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components are new variables that are linear combinations of the original variables. These new variables are all mutually orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.

* Unlike PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool, and shows how to use PCA and how to read and interpret the outcome. It shows students how to input data in the correct format, read the results of PCA and interpret the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA | SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> dev.new() # open a new graphics device (portable replacement for windows())
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> dev.new() # open a new graphics device (portable replacement for windows())
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T16:02:11Z

Jslavine: /* PCA using SOCR Analyses */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible while being orthogonal to the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unit vectors ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. Finding the loading vector then involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quotient given by their corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The full transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$, where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of non-negative numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, so the right singular vectors of $X$ are the eigenvectors of $X^T X$ and the squared singular values are its eigenvalues. The score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two variables (dimensions). The dataset can contain as many observations as you like;
** Center the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the centered dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$;
** Calculate the covariance matrix between the variables of the centered dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
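
The computational steps above, together with the SVD route and the truncated transformation $T_L=XW_L$, can be sketched in a few lines of base R. This is a minimal illustration rather than a reference implementation: it assumes the built-in <code>USArrests</code> data and, as an extra assumption beyond the steps listed, standardizes the variables instead of only centering them; the object names (<code>scores_eig</code>, <code>scores_svd</code>, <code>T_L</code>, <code>cum_var</code>) are made up for this example.

```r
# PCA "by hand" on the built-in USArrests data.
# Assumption: variables are standardized (centered and scaled), not only centered.
X <- scale(as.matrix(USArrests))   # n x p, column means 0, column sds 1
n <- nrow(X)

# Covariance matrix of the centered data and its eigendecomposition
S  <- cov(X)
ed <- eigen(S, symmetric = TRUE)   # eigenvalues are returned in decreasing order
W  <- ed$vectors                   # unit-length loading vectors w_(k) as columns
scores_eig <- X %*% W              # principal component scores, T = XW

# Same scores from the SVD X = U Sigma W^T (columns agree up to sign)
sv <- svd(X)
scores_svd <- sv$u %*% diag(sv$d)  # T = U Sigma

# Truncated transformation keeping only the first L components
L   <- 2
T_L <- X %*% W[, 1:L]              # n x L matrix of reduced scores
cum_var <- cumsum(ed$values) / sum(ed$values)  # cumulative proportion of variance
```

Here the eigenvalues in <code>ed$values</code> equal the squared singular values divided by $n-1$, and <code>cum_var[2]</code> is about 0.87, i.e., the first two components retain roughly 87% of the total variance of the standardized data.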

====PCA Properties====
* Let $\Sigma$ be the variance-covariance matrix of $x$, and let $A$ be the $p\times p$ matrix whose columns $\alpha_1,…,\alpha_p$ are the unit-length eigenvectors of $\Sigma$, ordered by decreasing eigenvalue $\lambda_1 \geq ⋯ \geq \lambda_p$. For any integer $q$, $1 \leq q \leq p$, consider the orthonormal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
* With $y=B'x$ as above, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization $\alpha_k' \alpha_k=1$, for $k=1,2,…,p$.
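
These properties can be checked numerically. The sketch below assumes, purely for illustration, that $\Sigma$ is the sample correlation matrix of the built-in <code>USArrests</code> data and compares the eigenvector-based choices of $B$ with one randomly generated orthonormal $B$; the helper <code>tr()</code> and the other names are hypothetical.

```r
# Numerical check of the trace and spectral-decomposition properties.
Sigma  <- cor(USArrests)                 # stand-in for the covariance matrix Sigma
ed     <- eigen(Sigma, symmetric = TRUE)
lambda <- ed$values                      # lambda_1 >= ... >= lambda_p
A      <- ed$vectors                     # columns alpha_1, ..., alpha_p
p      <- ncol(Sigma)
q      <- 2

tr <- function(M) sum(diag(M))           # trace of a square matrix

# Spectral decomposition: Sigma = sum_k lambda_k alpha_k alpha_k'
terms <- lapply(seq_len(p), function(k) lambda[k] * tcrossprod(A[, k]))
Sigma_rebuilt <- Reduce(`+`, terms)

# tr(B' Sigma B) for the first q eigenvectors, the last q, and a random orthonormal B
A_first  <- A[, 1:q]
A_last   <- A[, (p - q + 1):p]
tr_first <- tr(t(A_first) %*% Sigma %*% A_first)
tr_last  <- tr(t(A_last)  %*% Sigma %*% A_last)

set.seed(1)
B       <- qr.Q(qr(matrix(rnorm(p * q), p, q)))  # random p x q matrix with orthonormal columns
tr_rand <- tr(t(B) %*% Sigma %*% B)
```

In this sketch <code>tr_first</code> equals the sum of the $q$ largest eigenvalues and <code>tr_last</code> the sum of the $q$ smallest, with <code>tr_rand</code> falling between the two.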

====PCA Limitations====
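* The results of PCA depend on the scaling of the variables, which is why variables measured in different units are usually standardized before the analysis (i.e., PCA is run on the correlation matrix).
* The applicability of PCA is limited by certain assumptions made in its derivation, e.g., that the structure of interest is linear and that directions of large variance are the ones that matter.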

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The bar chart (scree plot) shows the variance attributable to each of the four principal components.
biplot(pc.cr) ## shows the plot of PCA in a different format

From the biplot above, we can see how the observations and the original variables are positioned with respect to the first two principal components.
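
As a small follow-up sketch (base R only; the object names are chosen for this illustration), the "Proportion of Variance" rows reported by <code>summary()</code> can be recomputed from the component standard deviations. The sketch also compares the covariance-based fits of <code>princomp()</code> and <code>prcomp()</code>; the interpretation given in the comments, that the two functions use divisors $n$ and $n-1$ respectively, is stated here as background and is not part of the listing above.

```r
# Covariance-based fits: princomp() divides by n, prcomp() by n - 1
n     <- nrow(USArrests)                       # 50 states
ratio <- unname(princomp(USArrests)$sdev) / prcomp(USArrests)$sdev

# Correlation-based fit: proportion of variance from the standard deviations
pc_cor   <- princomp(USArrests, cor = TRUE)
prop_var <- unname(pc_cor$sdev^2 / sum(pc_cor$sdev^2))
cum_var  <- cumsum(prop_var)
```

Each element of <code>ratio</code> equals $\sqrt{49/50}$, and <code>prop_var</code> and <code>cum_var</code> reproduce the "Proportion of Variance" and "Cumulative Proportion" rows of the summary shown above.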

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
[[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR activity]] illustrates the use of PCA.

====ICA (independent component analysis)====
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other;
** The values in each source signal have a non-Gaussian distribution.
** Independence: the source signals are assumed to be independent, but their signal mixtures are not, because the mixtures share the same source signals;
** Normality: by the [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables tends toward a Gaussian distribution, so a signal mixture is typically closer to Gaussian than any of its source signals;
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
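
The normality point above can be illustrated with simulated data. The following sketch uses excess kurtosis as a simple measure of departure from a Gaussian distribution (a Gaussian has excess kurtosis 0); this choice of measure, the uniform sources, the seed and the function name <code>excess_kurtosis</code> are assumptions made for illustration only.

```r
# A mixture of independent non-Gaussian sources is closer to Gaussian than the sources.
set.seed(42)
n <- 1e5

excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)   # standardize
  mean(z^4) - 3                # 0 for a Gaussian distribution
}

s1  <- runif(n)                # two independent uniform (non-Gaussian) sources
s2  <- runif(n)
mix <- 0.5 * s1 + 0.5 * s2     # an equally weighted signal mixture

k_source <- excess_kurtosis(s1)   # theoretical value for a uniform: -1.2
k_mix    <- excess_kurtosis(mix)  # theoretical value for the sum of two uniforms: -0.6
```

The mixture's excess kurtosis is closer to zero than that of either source, which is the property ICA exploits in reverse: it looks for linear combinations of the observed mixtures that are as non-Gaussian as possible.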

: ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA can’t identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components are denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$, using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires that: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, is at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), so that the mixing matrix $A$ is of full rank and can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" and "%*%" indicate "scalar" and matrix multiplication, respectively!

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind does the same by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$, with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where the $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix $L$.
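
The assumptions above determine the covariance structure that the factor model implies for the observed variables. As a short worked derivation for a single observation vector $x$, writing $\Psi=cov(ε)$ (a symbol introduced here only for illustration; it is diagonal when the error terms are mutually uncorrelated):
:$$\begin{aligned} cov(x) &= E\left[(LF+ε)(LF+ε)^T\right] \\ &= L\,E(FF^T)\,L^T + L\,E(Fε^T) + E(εF^T)\,L^T + E(εε^T) \\ &= L\,cov(F)\,L^T + \Psi \\ &= LL^T+\Psi. \end{aligned}$$
The cross terms vanish because $ε$ and $F$ are independent with mean zero, and the last step uses $cov(F)=I$. For any orthogonal $k\times k$ matrix $Q$, $(LQ)(LQ)^T=LL^T$, so the loadings are determined only up to a rotation; this is why rotation criteria such as varimax are applied when fitting the model.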

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In this example, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances (observations). To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In index notation (equivalent to the matrix form $Z=LF+\varepsilon$), this model can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings of the $a^{th}$ subject on factor $p$, for $p=1,2$.

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # factor.pa() is deprecated in psych; fa() with fm="pa" performs principal axis factoring
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components are new variables that are linear combinations of the original variables. These new variables are all mutually orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.

* Unlike PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.
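
The contrast between the two approaches can be made concrete on a single correlation matrix. The sketch below is an illustration under stated assumptions: it uses the <code>ability.cov</code> covariance data that ships with R (not a dataset discussed on this page), fits a one-factor maximum likelihood model with <code>factanal()</code>, and compares it with the first principal component; all object names are hypothetical.

```r
# PCA vs. one-factor FA on the same correlation matrix (built-in ability.cov data).
R <- cov2cor(ability.cov$cov)      # 6 x 6 correlation matrix of ability tests
p <- ncol(R)

# PCA: eigenvalues account for the total variance, trace(R) = p
ev        <- eigen(R, symmetric = TRUE)
pca_load1 <- ev$vectors[, 1] * sqrt(ev$values[1])  # correlations of variables with PC1

# FA: one common factor plus a unique variance for every variable
fa1      <- factanal(factors = 1, covmat = ability.cov)
fa_load1 <- as.vector(fa1$loadings)

comparison <- data.frame(variable      = colnames(R),
                         pca_loading   = abs(pca_load1),  # sign of a component is arbitrary
                         fa_loading    = abs(fa_load1),
                         fa_uniqueness = fa1$uniquenesses)
```

The PCA eigenvalues sum to $p$, so the components jointly reproduce all of the variance, whereas the factor model sets aside a separate uniqueness for each variable and uses the common factor only to explain the correlations among variables.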

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and its theoretical background, and shows how to use PCA and how to read and interpret the outcome. It introduces students to entering data in the correct format, reading the PCA results, and interpreting the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function (stats package)]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> dev.new() # opens a new graphics device on any platform (windows() is Windows-only)
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> dev.new()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T16:01:44Z

Jslavine: /* PCA in R */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible, subject to being orthogonal to the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,\ldots,w_p)_{(k)}$, constrained to have unit length ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,\ldots,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping is chosen so that the individual elements of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. The loading vector is then the one that extracts the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the maximized quantity given by their corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The full transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of non-negative numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, where $\Sigma^T \Sigma$ is the $p\times p$ diagonal matrix of squared singular values; the score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations as you like;
** Center the observations for each variable by subtracting the variable's mean (average) from each of its observations. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,\ldots,X_n$, and variable $Y$ containing observations $Y_1,Y_2,\ldots,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+\cdots+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+\cdots+Y_n}{n}$ be the average of the $Y$ observations. Then, the centered dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, \ldots,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, \ldots,Y_n-\bar{Y}\}$;
** Calculate the covariance matrix between the variables of the centered dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue; the remaining components follow in order of decreasing eigenvalue.
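
The computational steps listed above, and their relation to the SVD, can be sketched in a few lines of base R. This is a minimal illustration rather than a reference implementation: the choice of the built-in <code>iris</code> measurements as example data and all variable names are assumptions made only for this sketch.

```r
# Illustrative data: the four numeric columns of the built-in iris data set
X  <- as.matrix(iris[, 1:4])
n  <- nrow(X)

Xc <- scale(X, center = TRUE, scale = FALSE)  # subtract each column mean
S  <- cov(Xc)                                  # covariance matrix (divisor n - 1)
e  <- eigen(S, symmetric = TRUE)               # eigenvalues and unit-length eigenvectors
W  <- e$vectors                                # columns are the loading vectors w_(k)

scores   <- Xc %*% W                           # full transformation T = X W
scores_2 <- Xc %*% W[, 1:2]                    # truncated transformation T_L with L = 2

# The same decomposition through the SVD X = U Sigma W^T
sv <- svd(Xc)
eig_from_svd    <- sv$d^2 / (n - 1)            # squared singular values give the eigenvalues
scores_from_svd <- sv$u %*% diag(sv$d)         # T = U Sigma (columns may differ in sign)

round(e$values, 4)
round(eig_from_svd, 4)
```

The eigenvalues obtained from the covariance matrix and from the squared singular values agree, and the score columns agree up to an arbitrary sign, because the sign of an eigenvector is not determined.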

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the $p\times p$ matrix whose columns are the eigenvectors $\alpha_k$ of $\Sigma$ ordered by decreasing eigenvalue $\lambda_k$;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are therefore not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+\cdots+\lambda_p \alpha_p \alpha_p'$. Consequently, $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$. The elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization constraint $\alpha_k' \alpha_k=1$, for $k=1,2,\ldots,p$.
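
These properties can be checked numerically. The following sketch is an illustration only: it assumes the built-in <code>USArrests</code> data as an example, and the variable names and the random comparison matrix <code>B</code> are hypothetical choices made for this check.

```r
Sigma  <- cov(USArrests)                      # example variance-covariance matrix
e      <- eigen(Sigma, symmetric = TRUE)
lambda <- e$values                            # eigenvalues, in decreasing order
A      <- e$vectors                           # column k is the eigenvector alpha_k
p      <- length(lambda)

# Spectral decomposition: Sigma = sum_k lambda_k alpha_k alpha_k'
terms <- lapply(seq_len(p), function(k) lambda[k] * tcrossprod(A[, k]))
Sigma_rebuilt <- Reduce(`+`, terms)

# var(x_j) = sum_k lambda_k alpha_kj^2, computed for all j at once
var_from_pcs <- as.vector(A^2 %*% lambda)

# Trace property for q = 2: compare A_q and A_q^* with a random orthonormal B
set.seed(1)
q <- 2
B <- qr.Q(qr(matrix(rnorm(p * q), p, q)))     # p x q matrix with orthonormal columns
trace_random <- sum(diag(t(B) %*% Sigma %*% B))
trace_first  <- sum(lambda[1:q])              # B = A_q   (first q eigenvectors)
trace_last   <- sum(lambda[(p - q + 1):p])    # B = A_q^* (last q eigenvectors)
c(last = trace_last, random = trace_random, first = trace_first)
```

The trace obtained with a random orthonormal $B$ falls between the values obtained with the last $q$ and the first $q$ eigenvectors, as the first two properties state.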

====PCA Limitations====
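* The results of PCA depend on the scaling of the variables: variables measured on larger scales dominate the leading components unless the data are standardized (e.g., by working with the correlation matrix).
* The applicability of PCA is limited by the assumptions made in its derivation; in particular, it captures only linear relationships among the variables and summarizes the data through variances and covariances.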

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The bar chart (scree plot) shows the variance attributable to each of the four principal components.
biplot(pc.cr) ## shows the observations and the variable loadings on the first two components

The biplot displays each observation's scores on the first two principal components, together with arrows showing how strongly each original variable loads on those components.
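
The role of the divisor behind the <code>sqrt(49/50)</code> comment above can be checked directly. The sketch below is a hedged illustration with hypothetical variable names: it relies on <code>princomp()</code> using the divisor $n$ and <code>prcomp()</code> using $n-1$ for the covariance matrix, so the covariance-based standard deviations differ by $\sqrt{(n-1)/n}$ (here $n=50$), while for the correlation-based fits this scaling cancels in the reported standard deviations.

```r
n <- nrow(USArrests)                               # 50 observations

# Covariance-based fits: divisor n (princomp) versus n - 1 (prcomp)
sd_princomp <- unname(princomp(USArrests)$sdev)
sd_prcomp   <- prcomp(USArrests)$sdev
ratio       <- sd_princomp / sd_prcomp
ratio
sqrt((n - 1) / n)

# Correlation-based fits: the divisor cancels in the standard deviations
sd_princomp_cor <- unname(princomp(USArrests, cor = TRUE)$sdev)
sd_prcomp_cor   <- prcomp(USArrests, scale. = TRUE)$sdev

# Proportion of variance explained by each correlation-based component
prop_var <- sd_prcomp_cor^2 / sum(sd_prcomp_cor^2)
round(prop_var, 4)
```

The proportions of variance computed in the last step correspond to the "Proportion of Variance" row of the <code>summary()</code> output shown above.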

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
This [[SOCR_EduMaterials_AnalysisActivities_PCA| SOCR Activity illustrates the use of PCA]].

====ICA (independent component analysis)====
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other;
** The values in each source signal have a non-Gaussian distribution.
** Independence: the source signals are assumed to be independent, but their signal mixtures are not, because the mixtures share the same source signals;
** Normality: by the [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables tends toward a Gaussian distribution, so a signal mixture is typically closer to Gaussian than its sources;
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
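
The normality principle above can be illustrated with excess kurtosis, which is zero for a Gaussian distribution. The sketch below is only an illustration: the helper <code>excess_kurtosis()</code> and the choice of uniform sources are assumptions made for this example. A uniform variable has excess kurtosis $-1.2$, while the sum of two independent uniforms has excess kurtosis $-0.6$, i.e. the mixture is closer to Gaussian than either source.

```r
set.seed(1)

# Hypothetical helper: sample excess kurtosis (0 for a Gaussian distribution)
excess_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

s1  <- runif(1e5)           # first non-Gaussian source
s2  <- runif(1e5)           # second, independent source
mix <- s1 + s2              # a simple signal mixture

k_source  <- excess_kurtosis(s1)
k_mixture <- excess_kurtosis(mix)
round(c(source = k_source, mixture = k_mixture), 2)
```

ICA algorithms exploit this effect in reverse: they search for projections of the mixtures that are as far from Gaussian as possible.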

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,\ldots,x_m )^t$ and the components are denoted by $s=(s_1,s_2,\ldots,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires that: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, is at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), so that the mixing matrix $A$ has full rank and can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" is element-wise multiplication and "%*%" is matrix multiplication

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,\ldots,x_p$, with means $\mu_1,\mu_2,\ldots,\mu_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,\ldots,p\}$, $j \in \{1,\ldots,k\}$ and $k<p$, we have $x_i-\mu_i=l_{i,1} F_1 + \cdots +l_{i,k} F_k + \varepsilon_i$, where the $\varepsilon_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-\mu = LF+\varepsilon$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $\varepsilon$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, which ensures that the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.
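
Under these assumptions the model implies that the covariance matrix of $x$ equals $LL' + \Psi$, where $\Psi$ is the diagonal matrix of error variances. The simulation below is a minimal sketch of this relationship: the loading matrix, the sample size and all variable names are hypothetical, chosen only so that each simulated variable has unit variance.

```r
set.seed(42)
p <- 6; k <- 2; n <- 50000

# Hypothetical loadings: each factor drives three of the six variables
L   <- cbind(c(0.9, 0.8, 0.7, 0, 0, 0),
             c(0, 0, 0, 0.8, 0.7, 0.6))
psi <- 1 - rowSums(L^2)                       # error variances (uniquenesses)

Fmat <- matrix(rnorm(k * n), nrow = k)        # k x n factors, E(F) = 0, cov(F) = I
eps  <- matrix(rnorm(p * n, sd = sqrt(psi)), nrow = p)  # row i has variance psi[i]
x    <- L %*% Fmat + eps                      # p x n data with mean zero
rownames(x) <- paste0("x", 1:p)

implied    <- L %*% t(L) + diag(psi)          # covariance implied by the model
sample_cov <- cov(t(x))                       # covariance estimated from the data
max_gap    <- max(abs(sample_cov - implied))

# Maximum likelihood factor analysis recovers the uniquenesses
fit <- factanal(t(x), factors = k)
round(rbind(true = psi, estimated = unname(fit$uniquenesses)), 2)
```

The estimated loadings themselves are only determined up to a rotation, which is why the comparison above uses the uniquenesses, which do not depend on the rotation.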

* Example: Suppose a sample of $N_i=1000$ students is scored on $N_a=10$ academic subjects, and we hypothesize that $N_p=2$ unobserved factors explain the scores. In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. The $i^{th}$ student's score on the $a^{th}$ subject is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances, or sets of observations. To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- \mu_a}{\sigma_a}$, where $\mu_a=\frac{1}{N_i}\sum_i x_{ai}$ is the sample mean and $\sigma_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-\mu_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In index notation, this model can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
or, in matrix form, $Z=LF+\varepsilon$, where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings for the $a^{th}$ subject, for $p=1,2$.
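
The standardization step in the example can be sketched as follows. The scores are simulated and purely hypothetical; the point is only that, after standardizing each subject across students, the average cross-product of the standardized scores is the correlation matrix that factor analysis tries to reproduce.

```r
set.seed(2014)
N_a <- 10; N_i <- 1000

# Hypothetical scores: rows are subjects (index a), columns are students (index i)
x <- matrix(rnorm(N_a * N_i, mean = 70, sd = 10), nrow = N_a)
x[2, ] <- 0.6 * x[1, ] + 0.4 * x[2, ]        # make two subjects correlated

mu    <- rowMeans(x)                          # sample mean of each subject
sigma <- sqrt(rowMeans((x - mu)^2))           # sample standard deviation of each subject
z     <- (x - mu) / sigma                     # standardized scores z_ai

R_from_z <- z %*% t(z) / N_i                  # average cross-products over students
round(R_from_z[1:3, 1:3], 2)
```

Because the same divisor appears in the standardization and in the average, the result coincides with the usual sample correlation matrix of the subjects.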

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # factor.pa() is deprecated in psych; fa(..., fm="pa") performs principal axis factoring
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
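
The idea behind choosing the number of factors by parallel analysis can also be sketched in base R, without additional packages. This is a simplified illustration under stated assumptions, not a re-implementation of the <code>nFactors</code> functions: the simulated two-factor data, the 95th percentile reference and all variable names are hypothetical choices.

```r
set.seed(123)
n <- 1000; p <- 6

# Simulated data with a known two-factor structure (hypothetical loadings)
L <- cbind(c(0.8, 0.7, 0.6, 0, 0, 0),
           c(0, 0, 0, 0.8, 0.7, 0.6))
x <- matrix(rnorm(n * 2), n, 2) %*% t(L) +
     matrix(rnorm(n * p), n, p) %*% diag(sqrt(1 - rowSums(L^2)))

# Eigenvalues of the observed correlation matrix
observed <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

# Reference eigenvalues from uncorrelated normal data of the same dimensions
reps <- 100
random_ev <- replicate(reps,
  eigen(cor(matrix(rnorm(n * p), n, p)), symmetric = TRUE, only.values = TRUE)$values)
reference <- apply(random_ev, 1, quantile, probs = 0.95)

# Keep the leading eigenvalues that exceed what random data would produce
n_parallel <- sum(cumprod(observed > reference))
n_kaiser   <- sum(observed > 1)               # Kaiser rule: eigenvalues above 1
c(parallel = n_parallel, kaiser = n_kaiser)
```

Comparing against random data guards against retaining factors whose eigenvalues are only as large as sampling noise would produce.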

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components are new variables formed as linear combinations of the original variables, and they are mutually orthogonal (uncorrelated). The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when the goal is to detect latent constructs or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.

* Unlike PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and its theoretical background, and shows how to use PCA and how to read and interpret the outcome. It introduces students to entering data in the correct format, reading the PCA results, and interpreting the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function (stats package)]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> dev.new() # opens a new graphics device on any platform (windows() works only on Windows)
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> dev.new() # opens a new graphics device on any platform (windows() works only on Windows)
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T16:00:08Z

Jslavine: /* PCA Limitations */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible, subject to being orthogonal to the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,\ldots,w_p)_{(k)}$, constrained to be unit vectors ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,\ldots,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. Finding the loading vector then involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quantity being maximized given by their corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, where $\Sigma^T \Sigma$ is the $p\times p$ diagonal matrix of squared singular values $\sigma_{(k)}^2$, so the right singular vectors in $W$ are the eigenvectors of $X^T X$. The score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (rows) as you like;
** Normalize (center) the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,\ldots,X_n$, and variable $Y$ containing observations $Y_1,Y_2,\ldots,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+\cdots+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+\cdots+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, \ldots,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, \ldots,Y_n-\bar{Y}\}$;
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
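
The computational steps above can be sketched in base R as follows. This is a minimal illustration rather than a reference implementation: the choice of the built-in USArrests data and all object names (Xc, Cmat, e, W, scores) are assumptions made for this example. The sketch also checks numerically that the eigen-decomposition route agrees with prcomp() and with the SVD relation $T=U\Sigma$, up to the arbitrary sign of each column.

```r
# Minimal sketch: PCA "by hand" on the built-in USArrests data (an illustrative choice)
X    <- as.matrix(USArrests)                    # step 1: n = 50 observations, p = 4 variables
Xc   <- scale(X, center = TRUE, scale = FALSE)  # step 2: subtract each column mean
Cmat <- cov(Xc)                                 # step 3: covariance matrix (divisor n - 1)
e    <- eigen(Cmat, symmetric = TRUE)           # step 4: eigenvalues (decreasing) and unit-length eigenvectors
W    <- e$vectors                               # loading vectors w_(1), ..., w_(p) as columns
w1   <- W[, 1]                                  # step 5: direction with the largest eigenvalue

scores <- Xc %*% W                              # full transformation T = XW
e$values / sum(e$values)                        # proportion of variance carried by each component

# Cross-checks against built-in routines
pc <- prcomp(X)                                 # centers by default and also uses divisor n - 1
sv <- svd(Xc)                                   # Xc = U diag(d) W'
max(abs(sqrt(e$values) - pc$sdev))              # numerically zero
max(abs(abs(scores) - abs(sv$u %*% diag(sv$d))))  # numerically zero: T = U Sigma up to column signs
```

Keeping only the first $L$ columns of W (for example, scores[, 1:2]) gives the truncated transformation $T_L = XW_L$ described under dimensionality reduction.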

====PCA Properties====
* Let $\Sigma$ be the covariance matrix of the $p$-element vector $x$, with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ and corresponding unit eigenvectors $\alpha_1,\ldots,\alpha_p$, and let $A$ be the $p\times p$ matrix whose $k^{th}$ column is $\alpha_k$. For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. Hence the last few principal components are not simply unstructured left-overs after removing the important ones: they are the directions of smallest variance;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+\cdots+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because $\alpha_k' \alpha_k=1$ for $k=1,2,\ldots,p$.
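
These properties can be checked numerically. The sketch below is only an illustration under assumed inputs: it uses the covariance matrix of the built-in USArrests data as $\Sigma$, takes $q=2$, and compares the trace obtained from the first and last $q$ eigenvectors with the trace from one randomly generated matrix $B$ with orthonormal columns. The object names (tr, B, tr_rand, and so on) are hypothetical.

```r
# Minimal sketch of the trace and spectral-decomposition properties (illustrative data)
Sigma  <- cov(USArrests)                        # plays the role of the covariance matrix Sigma
e      <- eigen(Sigma, symmetric = TRUE)
A      <- e$vectors                             # columns alpha_1, ..., alpha_p
lambda <- e$values                              # lambda_1 >= ... >= lambda_p
p <- length(lambda)
q <- 2
tr <- function(M) sum(diag(M))                  # trace of a square matrix

# Spectral decomposition: Sigma = sum_k lambda_k * alpha_k alpha_k'
terms <- lapply(seq_len(p), function(k) lambda[k] * tcrossprod(A[, k]))
Sigma_rebuilt <- Reduce("+", terms)
max(abs(Sigma_rebuilt - Sigma))                 # numerically zero

# Trace of B' Sigma B for three choices of B
A_first  <- A[, 1:q]                            # first q eigenvectors
A_last   <- A[, (p - q + 1):p]                  # last q eigenvectors
tr_first <- tr(t(A_first) %*% Sigma %*% A_first)   # equals lambda_1 + ... + lambda_q
tr_last  <- tr(t(A_last)  %*% Sigma %*% A_last)    # equals the sum of the q smallest eigenvalues

set.seed(3)                                     # arbitrary seed, for reproducibility only
B <- qr.Q(qr(matrix(rnorm(p * q), p, q)))       # random p x q matrix with orthonormal columns
tr_rand <- tr(t(B) %*% Sigma %*% B)

c(last = tr_last, random = tr_rand, first = tr_first)   # random value lies between the two extremes
```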

====PCA Limitations====
* The results of PCA depend on the scaling of the variables.
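* The applicability of PCA is limited by certain assumptions made in its derivation, notably that the relationships among the variables are linear and that the directions of largest variance are the ones of interest.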

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The bar chart (scree plot) shows the variance attributable to each of the four principal
# components.
biplot(pc.cr) ## plots the observations and the variable loadings on the first two components

The biplot shows how each of the original variables contributes to the first two principal components, together with the scores of the individual observations on those components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
This [[SOCR_EduMaterials_AnalysisActivities_PCA| SOCR Activity illustrates the use of PCA]].

====ICA (independent component analysis)====
ICA is a computational method for separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA assumptions:
** The source signals are independent of each other;
** The distribution of the values in each source signal is non-Gaussian.
* Effects of mixing the source signals:
** Independence: the source signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;
** Normality: based on the [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables tends toward a Gaussian distribution, so a signal mixture is typically closer to Gaussian than its sources;
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
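
The independence and normality effects can be illustrated with a short base-R simulation. This is a sketch under stated assumptions: the two sources are independent uniforms, the mixing matrix is the one used in the fastICA example on this page, and excess kurtosis (0 for a Gaussian, -1.2 for a uniform) is used here as a simple, assumed measure of non-Gaussianity. The helper excess_kurtosis is defined only for this example.

```r
# Minimal sketch: mixing independent non-Gaussian sources makes them dependent and "more Gaussian"
set.seed(1)                                        # arbitrary seed, for reproducibility only
n <- 10000
S <- matrix(runif(2 * n), n, 2)                    # two independent uniform sources
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)    # mixing matrix
X <- S %*% A                                       # mixtures: x1 = s1 - s2, x2 = s1 + 3*s2

# Excess kurtosis: 0 for a Gaussian, -1.2 for a uniform distribution
excess_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3
}

cor(S)[1, 2]                                       # close to 0: the sources are uncorrelated
cor(X)[1, 2]                                       # clearly non-zero: the mixtures share both sources

kS <- apply(S, 2, excess_kurtosis)                 # both close to -1.2
kX <- apply(X, 2, excess_kurtosis)                 # both closer to 0 than the sources
rbind(sources = kS, mixtures = kX)
```

ICA algorithms exploit this in reverse: they search for an un-mixing transformation whose outputs are as far from Gaussian as possible.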

: ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,\ldots,x_m )^t$ and the components are denoted by $s=(s_1,s_2,\ldots,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), so that the mixing matrix $A$ has full rank and can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" and "%*%" indicate element-wise and matrix multiplication, respectively!

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,\ldots,x_p$, with means $\mu_1,\mu_2,\ldots,\mu_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,\ldots,p\}$, $j \in \{1,\ldots,k\}$ and $k<p$, we have $x_i-\mu_i=l_{i,1} F_1 + \cdots +l_{i,k} F_k + \varepsilon_i$, where the $\varepsilon_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-\mu = LF+\varepsilon$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $\varepsilon$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.
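
Under assumptions (1)-(3), and writing $\Psi$ for the diagonal matrix of error variances (a symbol introduced here for illustration, not used elsewhere on this page), the model implies $cov(x)=LL^T+\Psi$. The following base-R sketch simulates data from the model and compares the sample covariance with this implied covariance. All numerical values (the loading matrix, sample size and seed) are hypothetical choices for the example, and factanal() is used only to show that the error variances are recovered approximately.

```r
# Minimal sketch: simulate x = LF + eps and compare cov(x) with L L' + Psi
set.seed(2)                                     # arbitrary seed, for reproducibility only
p <- 6; k <- 2; n <- 5000
# Hypothetical loading matrix L (p x k): variables 1-3 load on factor 1, variables 4-6 on factor 2
L <- cbind(c(0.9, 0.8, 0.7, 0, 0, 0),
           c(0, 0, 0, 0.8, 0.7, 0.6))
psi  <- 1 - rowSums(L^2)                        # error variances, chosen so that var(x_i) = 1
Fmat <- matrix(rnorm(k * n), nrow = k)          # factors with E(F) = 0 and cov(F) = I
eps  <- matrix(rnorm(p * n, sd = sqrt(psi)), nrow = p)  # row i has variance psi[i] (sd recycles down columns)
x <- L %*% Fmat + eps                           # p x n data matrix with mu = 0
rownames(x) <- paste0("x", 1:p)

Sigma_model  <- L %*% t(L) + diag(psi)          # covariance implied by the model
Sigma_sample <- cov(t(x))                       # sample covariance over the n observations
max(abs(Sigma_sample - Sigma_model))            # small for large n

fit <- factanal(t(x), factors = k)              # maximum-likelihood FA (varimax rotation by default)
round(cbind(true = psi, estimated = fit$uniquenesses), 2)
```

Because the loadings are identified only up to rotation, the estimated loading matrix need not match L column by column, whereas the uniquenesses (error variances on the correlation scale) are directly comparable.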

* Example: Suppose a sample of $N_i=1000$ students is scored on $N_a=10$ subjects (questions), and the scores are assumed to be driven by $N_p=2$ factors. In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. The $i^{th}$ student's score for the $a^{th}$ subject is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances, or sets of observations. To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- \mu_a}{\sigma_a}$, where $\mu_a=\frac{1}{N_i} \sum_i {x_{ai}}$ is the sample mean and $\sigma_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-\mu_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In matrix form this model is $Z=LF+\varepsilon$; element-wise, it can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings for the $a^{th}$ subject, for $p=1,2$.

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # principal axis factoring; fa() supersedes the older factor.pa()
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components are new variables formed as linear combinations of the original variables, and they are mutually orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variables. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when the goal is to detect latent constructs or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.

* In contrast to PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.
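
The contrast between total variance (PCA) and common variance (FA) can be seen on a small example. The sketch below is an illustration under assumptions: it uses the ability.cov dataset shipped with R (six ability tests), extracts one principal component from the correlation matrix with eigen(), and fits a one-factor model with factanal(). The object names are hypothetical, and the single-factor choice is made only to keep the comparison short.

```r
# Minimal sketch: one principal component vs. one common factor on the same correlation matrix
R   <- cov2cor(ability.cov$cov)                 # correlation matrix of six ability tests (datasets package)
pca <- eigen(R, symmetric = TRUE)

# First principal component, scaled so that entries are correlations with the original variables
pca_load1 <- pca$vectors[, 1] * sqrt(pca$values[1])

# One-factor maximum-likelihood factor analysis on the same covariance information
fa1      <- factanal(factors = 1, covmat = ability.cov)
fa_load1 <- unclass(fa1$loadings)[, 1]

comparison <- cbind(PCA = abs(pca_load1),       # signs are arbitrary, so compare absolute values
                    FA = abs(fa_load1),
                    uniqueness = fa1$uniquenesses)
round(comparison, 2)

sum(pca$values)                                 # equals 6, the number of variables: PCA partitions the total variance
```

The uniqueness column is the part of each variable's variance that the factor model sets aside as unique, which PCA does not separate from the common variance.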

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool, and shows how to use PCA and how to read and interpret the outcome. It shows students how to input data in the correct format, read the results of PCA and interpret the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function (stats package)]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> dev.new() # opens a new graphics device on any platform (windows() works only on Windows)
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> dev.new() # opens a new graphics device on any platform (windows() works only on Windows)
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T15:59:28Z

Jslavine: /* PCA Properties */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to have unit length ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping is chosen so that the individual elements of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$, $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$, and then finding the loading vector that extracts the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quotient given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, where $\Sigma^T \Sigma$ is the $p\times p$ diagonal matrix of squared singular values $\sigma_{(k)}^2$. The score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations as you like;
** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
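
The computational steps above can be sketched in a few lines of base R. This is a minimal illustration rather than a reference implementation: the built-in USArrests data is an arbitrary choice, and the object names (W_eig, W_svd, T_scores, T_L) are introduced here only for the example. It computes the loading vectors both from the eigen-decomposition of the covariance matrix and from the SVD of the centered data, and then forms the full and truncated score matrices.

```r
# Sketch: PCA "by hand" on the built-in USArrests data (illustrative choice of data)
X <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)  # column-centered data

# Route 1: eigen-decomposition of the sample covariance matrix
eig   <- eigen(cov(X))
W_eig <- eig$vectors             # columns are unit-length loading vectors

# Route 2: singular value decomposition X = U Sigma W^T
sv    <- svd(X)
W_svd <- sv$v                    # right singular vectors (loadings)
T_svd <- sv$u %*% diag(sv$d)     # scores computed as T = U Sigma

# Scores computed from the loadings, T = X W
T_scores <- X %*% W_svd

# Truncated transformation T_L = X W_L keeping the first n_keep components
n_keep <- 2
T_L <- X %*% W_svd[, 1:n_keep]

# Reference fit for comparison; prcomp() is also based on the SVD
pc <- prcomp(USArrests, center = TRUE, scale. = FALSE)
```

The two routes agree up to the sign of each loading vector, which is arbitrary. The eigenvalues of the covariance matrix equal the squared singular values divided by $n-1$, so sqrt(eig$values) matches pc$sdev.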

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the orthogonal matrix whose $k^{th}$ column $\alpha_k$ is the eigenvector of $\Sigma$ corresponding to its $k^{th}$ largest eigenvalue $\lambda_k$;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. Hence the last few principal components are not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because $\alpha_k' \alpha_k=1,$ for $k=1,2,…,p$.
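
These properties can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the covariance matrix of the built-in USArrests data as $\Sigma$, and the names A, lambda, tr_first and tr_last are introduced here for the example. It rebuilds $\Sigma$ from its spectral decomposition, recovers the variable variances from the eigenvalues and eigenvectors, and compares the trace obtained from the first and the last $q$ eigenvectors.

```r
# Sketch: numerical check of the PCA properties on an illustrative covariance matrix
Sigma  <- cov(USArrests)
e      <- eigen(Sigma)
lambda <- e$values        # eigenvalues, in decreasing order
A      <- e$vectors       # column k is the eigenvector alpha_k

# Spectral decomposition: Sigma = sum_k lambda_k alpha_k alpha_k'
terms <- lapply(seq_along(lambda), function(k) lambda[k] * tcrossprod(A[, k]))
Sigma_rebuilt <- Reduce(`+`, terms)

# var(x_j) = sum_k lambda_k alpha_kj^2  (A[j, k] is the j-th element of alpha_k)
var_rebuilt <- as.vector(A^2 %*% lambda)

# tr(B' Sigma B) for B = first q columns of A versus B = last q columns of A
q <- 2
p <- ncol(A)
B_first  <- A[, 1:q]
B_last   <- A[, (p - q + 1):p]
tr_first <- sum(diag(t(B_first) %*% Sigma %*% B_first))
tr_last  <- sum(diag(t(B_last)  %*% Sigma %*% B_last))
```

Here tr_first equals the sum of the $q$ largest eigenvalues and tr_last equals the sum of the $q$ smallest, which is the content of the first two properties.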

====PCA Limitations====
* The results of PCA depend on the scaling of the variables;
* The applicability of PCA is limited by certain assumptions made in its derivation.

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal
# components respectively.
biplot(pc.cr) ## shows the plot of PCA in a different format

From the chart above, we can see the distribution variance attributed to different variables on the four principal components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
This [[SOCR_EduMaterials_AnalysisActivities_PCA| SOCR Activity illustrates the use of PCA]].

====ICA (independent component analysis)====
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other;
** The distribution of the values in each source signal is non-Gaussian.
** Independence: the source signals are assumed to be independent, but their signal mixtures are not, because the mixtures share the same source signals;
** Normality: based on [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables approximates a Gaussian distribution;
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
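
The normality assumption can be illustrated with a small simulation in base R. This is a sketch under assumptions: the sources are two independent uniforms mixed with the same matrix used in the fastICA example in this section, and excess_kurtosis() is a helper defined here, not a function from any package. Excess kurtosis is zero for a Gaussian, so a value closer to zero indicates a more Gaussian-looking signal.

```r
# Sketch: mixtures of independent non-Gaussian sources look "more Gaussian" than the sources
set.seed(1)
n <- 5000
S <- matrix(runif(2 * n), n, 2)                    # two independent uniform sources
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)    # mixing matrix from the fastICA example
X <- S %*% A                                       # observed signal mixtures

# Helper (defined here for illustration): sample excess kurtosis, 0 for a Gaussian
excess_kurtosis <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^4) - 3
}

k_sources  <- apply(S, 2, excess_kurtosis)   # close to -1.2, the value for a uniform
k_mixtures <- apply(X, 2, excess_kurtosis)   # closer to 0 than the sources
```

ICA algorithms exploit this in reverse: they look for linear combinations of the mixtures that are as non-Gaussian as possible.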

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA can’t identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components are denoted by $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$, using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, where independence is measured by some chosen function. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" and "%*%" indicate "scalar" and matrix multiplication, respectively!

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data-frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables, called factors. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, which ensures that the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix $L$.

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. If a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances, or sets of observations. To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In matrix form ($Z=LF+ϵ$), this model can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence''; $l_{ap}$ are the factor loadings for the $a^{th}$ subject for $p=1,2$.
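
To make the model concrete, the following sketch simulates data from $Z=LF+ϵ$ with $N_a=10$ subjects, $N_p=2$ factors and $N_i=1000$ students, and then fits it with factanal() from base R. The loading values are hypothetical numbers chosen only for illustration, and the object names (L_true, Fac, psi, implied) are introduced here, not taken from the text.

```r
# Sketch: simulate the two-factor model Z = L F + eps and fit it with factanal()
set.seed(42)
N_i <- 1000   # students (samples)
N_a <- 10     # subjects (observed variables)
N_p <- 2      # factors

# Hypothetical loading matrix (N_a x N_p): subjects 1-5 load mainly on the verbal factor
L_true <- cbind(verbal = c(0.9, 0.8, 0.7, 0.8, 0.6, 0.1, 0.2, 0.1, 0.0, 0.2),
                math   = c(0.1, 0.2, 0.1, 0.0, 0.2, 0.9, 0.8, 0.7, 0.8, 0.6))
rownames(L_true) <- paste0("subject", 1:N_a)

Fac <- matrix(rnorm(N_p * N_i), N_p, N_i)     # factors with E(F) = 0 and cov(F) = I
psi <- 1 - rowSums(L_true^2)                  # unique variances, so that var(z_a) = 1
eps <- matrix(rnorm(N_a * N_i, sd = rep(sqrt(psi), N_i)), N_a, N_i)  # row a has sd sqrt(psi[a])
Z   <- L_true %*% Fac + eps                   # N_a x N_i matrix of standardized scores

# factanal() expects observations in rows, hence the transpose
fit <- factanal(t(Z), factors = N_p, rotation = "varimax")

# Correlation matrix implied by the model: L L' + diag(psi)
implied <- tcrossprod(L_true) + diag(psi)
```

The sample correlation matrix cor(t(Z)) should be close to implied, and fit$uniquenesses should be close to psi. The estimated loadings are identified only up to rotation and sign, so they need not match L_true column by column.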

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # fa() with fm="pa" replaces the deprecated factor.pa()
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
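
The idea behind the parallel analysis used above can also be sketched in base R, without the nFactors package. This is a simplified illustration, not a re-implementation of nFactors: it uses the built-in mtcars data as a stand-in for mydata, and the names ev_obs, ev_rand, threshold and n_factors are introduced here. Observed eigenvalues of the correlation matrix are compared with the 95th percentile of eigenvalues obtained from uncorrelated random normal data of the same size.

```r
# Sketch: parallel analysis by simulation (mtcars is a stand-in dataset)
set.seed(123)
dat <- mtcars
n <- nrow(dat)
p <- ncol(dat)

ev_obs <- eigen(cor(dat))$values          # observed eigenvalues

# Eigenvalues of correlation matrices of random normal data with the same n and p
reps    <- 200
ev_rand <- replicate(reps, eigen(cor(matrix(rnorm(n * p), n, p)))$values)  # p x reps matrix

# 95th percentile of the random eigenvalues, separately for each component position
threshold <- apply(ev_rand, 1, quantile, probs = 0.95)

# Retain the leading components whose eigenvalues exceed the random threshold
n_factors <- sum(cumprod(ev_obs > threshold))

# For comparison: Kaiser rule (eigenvalues of the correlation matrix greater than 1)
n_kaiser <- sum(ev_obs > 1)
```

The cumprod() call stops counting at the first component that fails the comparison, so only an unbroken run of leading components is retained.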

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components are new variables that are linear combinations of the original variables, and they are all mutually orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.

* Unlike PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool, and shows how to use PCA and how to read and interpret the outcome. It introduces students to entering data in the correct format, reading the results of PCA and interpreting the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR ANalyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> windows()
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> windows()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T15:53:03Z

Jslavine: /* PCA (principal component analysis) */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible while being uncorrelated with the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to have unit length ($||w_{(k)}||=1$), which map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping is chosen so that the individual elements of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. Finding the loading vector then involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quotient given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The full transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$, where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of non-negative numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, where $\Sigma^T \Sigma$ is the $p\times p$ diagonal matrix of squared singular values, so the right singular vectors in $W$ are the eigenvectors of $X^T X$. The score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.
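The relation between the SVD and the eigen-decomposition of $X^T X$ can be checked numerically. The following is a minimal sketch in base R, assuming simulated data (the matrix, its size and the seed are illustrative choices, not part of the original notes). Note that R's svd() returns the "thin" decomposition, so $U$ has only $p$ columns here.

```r
set.seed(1)
# Simulated, column-centered data matrix (n = 50 rows, p = 4 columns); illustrative only
X <- scale(matrix(rnorm(200), nrow = 50, ncol = 4), center = TRUE, scale = FALSE)

sv <- svd(X)                # thin SVD: X = U diag(d) W^T, with W stored in sv$v
ev <- eigen(crossprod(X))   # eigen-decomposition of X^T X

# Squared singular values should equal the eigenvalues of X^T X
max(abs(sv$d^2 - ev$values))

# Score matrix: T = X W should equal U Sigma
T_scores <- X %*% sv$v
max(abs(T_scores - sv$u %*% diag(sv$d)))

# prcomp() yields the same scores (columns are defined only up to sign)
pc <- prcomp(X, center = FALSE)
max(abs(abs(pc$x) - abs(T_scores)))
```

All three printed differences should be at the level of floating-point rounding error.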

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations as you like;
** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
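The computational steps listed above can be sketched in base R as follows. This is an illustration only: the choice of two columns of the built-in USArrests data as the variables $X$ and $Y$ is an assumption made for the example.

```r
# Step 1: a dataset with two variables (columns chosen for illustration)
X <- as.matrix(USArrests[, c("Murder", "Assault")])

# Step 2: subtract the column mean from each observation
Xc <- sweep(X, 2, colMeans(X))

# Step 3: covariance matrix of the centered data
S <- cov(Xc)

# Step 4: eigenvalues and unit-length eigenvectors of the covariance matrix
e <- eigen(S)

# Step 5: the first principal component is the eigenvector with the largest eigenvalue
w1 <- e$vectors[, 1]
scores1 <- Xc %*% w1          # scores of each observation on the first component

sum(w1^2)                     # unit length
as.numeric(var(scores1))      # variance of the scores ...
e$values[1]                   # ... equals the largest eigenvalue
prcomp(X)$sdev[1]^2           # and agrees with prcomp()
```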

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the $p\times p$ matrix whose columns are the eigenvectors $\alpha_k$ of $\Sigma$ ordered by decreasing eigenvalue $\lambda_k$;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. This implies that the last few principal components are not simply unstructured left-overs after removing the important ones;
* Spectral decomposition: $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization constraint $\alpha_k' \alpha_k=1$, for $k=1,2,…,p$.
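These properties can be verified numerically. The sketch below uses the covariance matrix of the built-in USArrests data and $q=2$; both choices are assumptions made only for illustration.

```r
Sigma  <- cov(USArrests)
e      <- eigen(Sigma)
lambda <- e$values        # eigenvalues in decreasing order
A      <- e$vectors       # columns are the unit-length eigenvectors alpha_k

# Spectral decomposition: Sigma = sum_k lambda_k alpha_k alpha_k'
Sigma_rebuilt <- Reduce(`+`, lapply(seq_along(lambda),
                                    function(k) lambda[k] * tcrossprod(A[, k])))

# var(x_j) = sum_k lambda_k alpha_kj^2  (A[j, k] is the j-th element of alpha_k)
var_from_spectrum <- as.vector(A^2 %*% lambda)

# Trace property with q = 2: first q columns maximize, last q columns minimize tr(B' Sigma B)
q <- 2
B_first  <- A[, 1:q]
B_last   <- A[, (ncol(A) - q + 1):ncol(A)]
tr_first <- sum(diag(t(B_first) %*% Sigma %*% B_first))   # sum of the q largest eigenvalues
tr_last  <- sum(diag(t(B_last)  %*% Sigma %*% B_last))    # sum of the q smallest eigenvalues

c(tr_first = tr_first, tr_last = tr_last)
```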

====PCA Limitations====
* The results of PCA depend on the scaling of the variables;
* The applicability of PCA is limited by certain assumptions made in its derivation.

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The bar heights in the scree plot show the variance attributable to each of the four principal components.
biplot(pc.cr) ## shows the plot of PCA in a different format

From the biplot, we can see how each of the original variables loads on the first two principal components, together with the scores of the individual observations.
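The "Proportion of Variance" and "Cumulative Proportion" rows of the summary output can be reproduced directly from the component standard deviations. The sketch below does this and picks the smallest number of components reaching a cumulative threshold; the 85% threshold is a hypothetical choice for illustration, not a rule from the notes.

```r
pc <- princomp(USArrests, cor = TRUE)

# Each component's variance is the square of its standard deviation
prop_var <- pc$sdev^2 / sum(pc$sdev^2)
cum_var  <- cumsum(prop_var)
round(rbind(prop_var, cum_var), 4)

# Smallest number of components explaining at least 85% of the variance
# (0.85 is an illustrative threshold)
n_keep <- unname(which(cum_var >= 0.85)[1])
n_keep
```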

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
This [[SOCR_EduMaterials_AnalysisActivities_PCA| SOCR Activity illustrates the use of PCA]].

====ICA (independent component analysis)====
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other;
** The distribution of the values in each source signal is non-Gaussian;
** Independence: the source signals are assumed to be independent, but their signal mixtures are not, because the mixtures share the same source signals;
** Normality: by the [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables tends toward a Gaussian distribution, so a signal mixture is typically closer to Gaussian than its source signals;
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
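The "Normality" point can be illustrated with a small simulation. This is a sketch under stated assumptions: two independent uniform sources, an equal-weight mixture, and excess kurtosis as a simple (hypothetical) measure of departure from a Gaussian distribution, for which the excess kurtosis is zero.

```r
set.seed(42)
n  <- 100000
s1 <- runif(n)        # independent, non-Gaussian sources
s2 <- runif(n)
x  <- s1 + s2         # a simple signal mixture

# Excess kurtosis: 0 for a Gaussian distribution
excess_kurtosis <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^4) - 3
}

k_source <- excess_kurtosis(s1)   # theoretical value for a uniform: -1.2
k_mix    <- excess_kurtosis(x)    # theoretical value for the sum of two uniforms: -0.6
c(source = k_source, mixture = k_mix)
```

The mixture is closer to zero excess kurtosis than the source, which is the property that ICA algorithms exploit when they search for maximally non-Gaussian directions.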

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components are denoted by $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires that: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, is at least as large as the number of estimated components, $n$ (i.e., $n\leq m$), so that the mixing matrix $A$ has full rank and can be inverted.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" is element-wise multiplication and "%*%" is matrix multiplication

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")
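The identifiability statement (sources are recoverable only up to permutation and scaling) can be seen without running any ICA algorithm. In the base-R sketch below, the permutation matrix P and the scaling matrix D are arbitrary illustrative choices: reordering and rescaling the sources, while compensating in the mixing matrix, yields exactly the same observed data, so no method can distinguish the two representations.

```r
set.seed(7)
S <- matrix(runif(2000), 1000, 2)                  # two independent uniform sources
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)    # mixing matrix
X <- S %*% A                                       # observed mixtures

P <- matrix(c(0, 1, 1, 0), 2, 2)    # permutation: swaps the two sources (illustrative)
D <- diag(c(2, -0.5))               # arbitrary non-zero scaling, including a sign flip

S2 <- S %*% D %*% P                 # rescaled and reordered sources (still independent)
A2 <- solve(D %*% P) %*% A          # compensating mixing matrix
X2 <- S2 %*% A2

max(abs(X - X2))                    # the observed data are unchanged (up to rounding)
```

This is why the order, sign and scale of the estimated components in the fastICA examples above may differ from those of the original signals.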

====Factor analysis (FA)====
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where the $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, which ensures that the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix $L$.
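Under the three assumptions above, the model implies that the covariance matrix of $x$ equals $LL' + \Psi$, where $\Psi$ is the diagonal covariance matrix of the errors $ε$. The following base-R sketch simulates the model and compares the sample covariance with this implied structure; the loading matrix, error variances and sample size are hypothetical values chosen for illustration.

```r
set.seed(123)
p <- 6; k <- 2; n <- 5000

# Hypothetical loading matrix L (p x k) and error variances psi
L   <- matrix(c(0.9, 0.8, 0.7, 0,   0,   0,
                0,   0,   0,   0.8, 0.7, 0.6), p, k)
psi <- 1 - rowSums(L^2)                      # chosen so that each x_i has unit variance

Fm  <- matrix(rnorm(k * n), k, n)            # factors: E(F) = 0, cov(F) = I
eps <- matrix(rnorm(p * n, sd = sqrt(psi)), p, n)   # row i has error variance psi[i]
x   <- L %*% Fm + eps                        # p x n data matrix, as in the text (mu = 0)

implied <- L %*% t(L) + diag(psi)            # covariance implied by the model
sampled <- cov(t(x))                         # sample covariance (rows of t(x) are observations)
max(abs(sampled - implied))                  # small, shrinking as n grows
```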

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. If a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances, or sets of observations. To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance of variable $a$ over the $N_i$ instances. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In index notation (matrix form $Z=LF+ε$), this model can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings of the $a^{th}$ subject on factor $p$, for $p=1,2$.
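The example can be mimicked with simulated data. The sketch below is only an illustration: the loading values are hypothetical, and the data are stored with students in rows ($N_i \times N_a$), which is the R convention and the transpose of the layout used in the text. Note also that scale() standardizes with the $N_i-1$ denominator.

```r
set.seed(2014)
N_i <- 1000; N_a <- 10; N_p <- 2

# Hypothetical loadings: subjects 1-5 load mainly on "verbal", 6-10 mainly on "math"
L_true <- cbind(verbal = c(0.8, 0.7, 0.8, 0.6, 0.7, 0.1, 0.0, 0.2, 0.0, 0.1),
                math   = c(0.1, 0.0, 0.2, 0.0, 0.1, 0.8, 0.7, 0.8, 0.6, 0.7))

Fs <- matrix(rnorm(N_i * N_p), N_i, N_p)                 # uncorrelated factor scores
E  <- matrix(rnorm(N_i * N_a), N_i, N_a) %*%
      diag(sqrt(1 - rowSums(L_true^2)))                  # subject-specific errors
X  <- Fs %*% t(L_true) + E                               # simulated scores, N_i x N_a

Z   <- scale(X)                                          # standardize each subject
fit <- factanal(Z, factors = N_p, rotation = "varimax")  # maximum likelihood FA
print(fit$loadings, cutoff = 0.3)                        # hide small loadings
```

The estimated loadings should show two blocks of subjects, although the order and signs of the estimated factors are arbitrary.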

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # fa() with fm="pa" replaces the deprecated factor.pa()
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
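The idea behind the parallel-analysis step can also be sketched in base R, without the nFactors package. This is a simplified illustration under stated assumptions: the data are simulated from a hypothetical two-factor model, 100 random normal datasets are used as the reference, and the 95th percentile of the random eigenvalues serves as the retention threshold.

```r
set.seed(99)
n <- 600; p <- 8

# Hypothetical data generated from two underlying factors
L  <- cbind(c(0.8, 0.8, 0.7, 0.7, 0, 0, 0, 0),
            c(0, 0, 0, 0, 0.8, 0.8, 0.7, 0.7))
Fs <- matrix(rnorm(n * 2), n, 2)
X  <- Fs %*% t(L) + matrix(rnorm(n * p), n, p) %*% diag(sqrt(1 - rowSums(L^2)))

ev_obs <- eigen(cor(X))$values                 # observed eigenvalues

# Eigenvalues of correlation matrices of uncorrelated random data of the same size
ev_rand <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
ev_thr  <- apply(ev_rand, 1, quantile, probs = 0.95)

# Retain factors whose observed eigenvalue exceeds the random-data threshold
n_keep <- sum(ev_obs > ev_thr)
rbind(observed = round(ev_obs, 2), threshold = round(ev_thr, 2))
n_keep
```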

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that they are all orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both the common and the unique variance of the variables. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when the goal is to detect latent constructs or factors. Factor analysis is similar to principal component analysis in that factor analysis also involves linear combinations of variables.

* Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool, and shows how to use PCA and how to read and interpret the outcome. It shows students how to enter data in the correct format, read the results of PCA and interpret the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> dev.new() # open a new graphics device (portable alternative to windows())
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> dev.new()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T15:39:19Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to have unit length ($||w_{(k)}||=1$), that map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$, in such a way that the individual elements of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. Finding the loading vector then involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quantity in brackets given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The full transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables that are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$, where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of non-negative numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, so the right singular vectors $W$ are the eigenvectors of $X^T X$ and the squared singular values are its eigenvalues. The score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (variables). The dataset can contain any number of observations;
** Normalize (center) the observations for each variable. To do this, simply subtract the mean (average) from each observation in a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then the normalized dataset would be: for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$;
** Calculate the covariance matrix between the variables of the normalized dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
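
The computational steps listed above can be sketched in base R. This is only an illustration under stated assumptions: it uses the built-in USArrests data, and the object names (Xc, S, W, scores) are chosen here for convenience. It also cross-checks the eigen-decomposition route against the SVD route and against prcomp().

```r
# Minimal sketch of the PCA steps, using the built-in USArrests data (an assumption).
X <- as.matrix(USArrests)

# Center each variable (column) at zero
Xc <- scale(X, center = TRUE, scale = FALSE)

# Covariance matrix of the centered data
S <- cov(Xc)

# Eigenvalues and unit-length eigenvectors of the covariance matrix
e <- eigen(S)
W <- e$vectors              # columns are the loading vectors w_(k)

# The first principal component is the eigenvector with the largest eigenvalue
w1 <- W[, 1]
scores <- Xc %*% W          # full score matrix T = XW

# SVD route: T = U Sigma, and eigenvalues of cov(X) equal sigma^2 / (n - 1)
sv <- svd(Xc)
n <- nrow(Xc)
all.equal(e$values, sv$d^2 / (n - 1))

# Scores agree up to the (arbitrary) sign of each column
all.equal(abs(scores), abs(sv$u %*% diag(sv$d)), check.attributes = FALSE)

# prcomp() reports the square roots of the same eigenvalues
all.equal(sqrt(e$values), prcomp(X)$sdev)
```

The comparisons use absolute values because each loading vector is only determined up to its sign.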

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthonormal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the matrix whose columns are the eigenvectors of $\Sigma$ ordered by decreasing eigenvalue;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are therefore not simply unstructured left-overs after removing the important ones;
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization constraint $\alpha_k' \alpha_k=1,$ for $k=1,2,…,p$.
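
The spectral decomposition in the last property can be checked numerically. The sketch below is an illustration only; it assumes the built-in USArrests data, and the names Sigma_rebuilt and var_from_pcs are introduced here for the example.

```r
# Numerical check of Sigma = sum_k lambda_k * alpha_k alpha_k'
Sigma  <- cov(USArrests)
e      <- eigen(Sigma)
lambda <- e$values            # eigenvalues, in decreasing order
A      <- e$vectors           # column k is the eigenvector alpha_k

# Rebuild Sigma as a sum of rank-one terms
Sigma_rebuilt <- Reduce(`+`, lapply(seq_along(lambda), function(k) {
  lambda[k] * tcrossprod(A[, k])
}))
all.equal(Sigma_rebuilt, Sigma, check.attributes = FALSE)

# var(x_j) = sum_k lambda_k * alpha_kj^2, where alpha_kj is A[j, k]
var_from_pcs <- as.vector(A^2 %*% lambda)
all.equal(var_from_pcs, unname(diag(Sigma)))

# Normalization constraint: each eigenvector has unit length
colSums(A^2)
```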

====PCA Limitations====
* The results of PCA depend on the scaling of the variables;
* The applicability of PCA is limited by certain assumptions made in its derivation.

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal
# components respectively.
biplot(pc.cr) ## shows the plot of PCA in a different format

From the chart above, we can see the distribution variance attributed to different variables on the four principal components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
This [[SOCR_EduMaterials_AnalysisActivities_PCA| SOCR Activity illustrates the use of PCA]].

====ICA (independent component analysis)====
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other;
** The distribution of the values in each source signal is non-Gaussian.
* Effects of mixing the source signals:
** Independence: the source signals are assumed to be independent, but their signal mixtures are not, because the mixtures share the same source signals;
** Normality: by the [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution, so a mixture is typically closer to Gaussian than its sources;
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
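
The "Normality" point can be illustrated with a small simulation in base R. This is a sketch under assumptions: the sources are independent uniforms, the 2 x 2 mixing matrix is chosen only for illustration, and excess_kurtosis() is a helper defined here rather than a library function. Excess kurtosis is 0 for a Gaussian variable, so values closer to 0 indicate a more Gaussian-like distribution.

```r
set.seed(42)
n <- 1e5

# Helper (defined for this example): sample excess kurtosis
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

# Two independent, non-Gaussian (uniform) sources
S <- matrix(runif(2 * n), n, 2)

# Illustrative mixing matrix; each column of X is a weighted sum of both sources
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A

k_src <- apply(S, 2, excess_kurtosis)   # sources: clearly negative
k_mix <- apply(X, 2, excess_kurtosis)   # mixtures: closer to zero

rbind(sources = k_src, mixtures = k_mix)
```

ICA algorithms exploit this effect in reverse: they look for directions in which the projected data are as non-Gaussian as possible.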

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components are denoted by $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$, using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires that: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, is at least as large as the number of estimated components, $n$, i.e., $n\leq m$; equivalently, the mixing matrix $A$ must be of full rank for its inverse to exist.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" and "%*%" indicate element-wise and matrix multiplication, respectively!

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data-frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes the variability among observed correlated variables in terms of a potentially lower number of unobserved variables, called factors. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$, with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where the $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, which ensures that the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix $L$.

* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In this example, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances, or sets of observations. To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In matrix form, this model is $Z=LF+ϵ$; element-wise, it can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ are the factor loadings of the $a^{th}$ subject on factor $p$, for $p=1,2$.
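
The model $x-μ = LF+ε$ with $cov(F)=I$ implies that the covariance of the observed variables is $LL'+Ψ$, where $Ψ$ is the diagonal matrix of error variances. The simulation below is a minimal sketch of this relationship. The loading matrix, sample size and object names (Fmat, eps, psi) are hypothetical values chosen for illustration; factanal() from the stats package is then used to estimate the unique variances.

```r
set.seed(123)
n <- 5000; p <- 6; k <- 2

# Hypothetical loading matrix L (p x k): variables 1-3 load on factor 1, 4-6 on factor 2
L <- matrix(c(0.9, 0.8, 0.7, 0.0, 0.0, 0.0,
              0.0, 0.0, 0.0, 0.9, 0.8, 0.7), nrow = p, ncol = k)
psi <- 1 - rowSums(L^2)        # error variances chosen so that each x_i has unit variance

# Factors with E(F) = 0 and cov(F) = I; independent errors with variances psi
Fmat <- matrix(rnorm(k * n), k, n)
eps  <- matrix(rnorm(p * n, sd = sqrt(psi)), p, n)   # row i has sd sqrt(psi[i])

x <- L %*% Fmat + eps          # p x n data matrix (means are zero here)

# Covariance implied by the model versus the sample covariance
implied    <- L %*% t(L) + diag(psi)
sample_cov <- cov(t(x))
max(abs(sample_cov - implied))

# Maximum likelihood factor analysis on the simulated data (rows = observations)
fit <- factanal(t(x), factors = k, rotation = "varimax")
round(cbind(true = psi, estimated = fit$uniquenesses), 2)
```

The loadings themselves are only identified up to rotation and sign, which is why the comparison above uses the unique variances.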

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa") # fa() with fm="pa" replaces the deprecated factor.pa()
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components are new variables formed as linear combinations of the original variables, and they are mutually orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach that seeks to reproduce the total variance of the variables; its components reflect both the common and the unique variance of each variable. It is generally preferred for data reduction (i.e., translating variable space into optimal factor space), but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that it also involves linear combinations of variables.

* In contrast to PCA, factor analysis is a correlation-focused approach that seeks to reproduce the inter-correlations among variables; its factors represent the common variance of the variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool, and shows how to use PCA and how to read and interpret its output. It teaches students to enter data in the correct format, read the PCA results and interpret the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function (stats package)]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed data matrix (whitened/sphered data)
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (XW = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> windows()
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> windows()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS PCA ICA FA

2014-09-01T15:38:29Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Dimensionality Reduction: PCA, ICA, FA ==

===Overview===
* ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
* ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

===Motivation===
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods, and what are their strengths and weaknesses? How can we decide on the right method for a specific dataset?

===Theory===
====PCA (principal component analysis)====
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible, subject to being orthogonal to the preceding components. PCA is the simplest of the true eigenvector-based multivariate analyses, and it reveals the internal structure of the data in a way that best explains the variance in the data. It has found application in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unit vectors ($||w_{(k)}||=1$), that map each row vector $x_{(i)}$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$, in such a way that the individual variables of $t$, considered over the data set, successively inherit the maximum possible variance from $x$.

* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$: $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$. Finding the loading vector then involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\hat{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$, with the maximum values of the quantity in brackets given by the corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$, where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

* Dimensionality reduction: The full transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables that are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$, where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

* Singular value decomposition (SVD)
: The SVD of $X$ is $X=U \Sigma W^T$, where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of non-negative numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix whose columns are orthogonal unit vectors of length $n$, called the left singular vectors of $X$; and $W$ is a $p\times p$ matrix whose columns are orthogonal unit vectors of length $p$, called the right singular vectors of $X$. With this factorization, $X^T X=W \Sigma^T U^T U \Sigma W^T = W \Sigma^T \Sigma W^T$, so the right singular vectors $W$ are the eigenvectors of $X^T X$ and the squared singular values are its eigenvalues. The score matrix is $T=XW=U\Sigma W^T W=U\Sigma$, so each column of $T$ is one of the left singular vectors of $X$ multiplied by the corresponding singular value.

* Computational details of PCA:
** Begin with a dataset that contains at least two dimensions (variables). The dataset can contain as many observations (cases) as are available;
** Center the observations for each variable. To do this, simply subtract the mean (average) from each observation of a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ $X$ observations, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then the centered dataset is: for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$;
** Calculate the covariance matrix between the variables of the centered dataset;
** Calculate the eigenvalues and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
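
The computational steps above can be sketched directly in R. This is an illustrative example only, assuming the built-in <code>USArrests</code> data and the covariance (unscaled) version of PCA; the variable names are not part of the original material.

```r
# 1) Start with a dataset that has at least two variables
X <- as.matrix(USArrests)

# 2) Center each variable by subtracting its mean
Xc <- sweep(X, 2, colMeans(X))

# 3) Covariance matrix of the centered variables
S <- cov(Xc)

# 4) Eigenvalues and unit-length eigenvectors of the covariance matrix
e <- eigen(S, symmetric = TRUE)
lambda <- e$values     # variances of the principal components (decreasing)
V <- e$vectors         # columns are the loading vectors

# 5) The first principal component is the eigenvector with the largest eigenvalue
pc1 <- V[, 1]
scores1 <- Xc %*% pc1  # projection of the data onto the first component

# Proportion of the total variance explained by each component
prop_var <- lambda / sum(lambda)
```

The square roots of <code>lambda</code> should match the standard deviations reported by <code>prcomp(USArrests)</code>, which uses the same <code>n-1</code> divisor as <code>cov()</code>.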

====PCA Properties====
* For any integer $q$, $1 \leq q \leq p$, consider the orthonormal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times p$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix of $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$, the matrix whose columns are the eigenvectors $\alpha_1, \ldots, \alpha_p$ of $\Sigma$ ordered by decreasing eigenvalue;
* For the same transformation $y=B'x$, $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$, where $A_q^*$ consists of the last $q$ columns of $A$. Thus, the last few principal components are not simply unstructured left-overs after removing the important ones;
* Spectral decomposition: $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$, and consequently $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$. The elements of $\lambda_k \alpha_k \alpha_k'$ tend to become smaller as $k$ increases (because $\lambda_k$ decreases), whereas the elements of $\alpha_k$ tend to stay about the same size because of the normalization $\alpha_k' \alpha_k=1,$ for $k=1,2,…,p$.

====PCA Limitations====
* The results of PCA depend on the scaling of the variables;
* The applicability of PCA is limited by certain assumptions made in its derivation.

==== PCA in R====
require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests)) # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr) # note that blank entries are small but not zero
## The signs of the columns are arbitrary

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818

Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00

plot(pc.cr) # shows a screeplot.

# The bar chart (scree plot) shows the variance attributable to each of the four principal
# components.
biplot(pc.cr) ## shows the observations and the variable loadings on the first two components

The biplot shows how strongly each of the original variables contributes to the first two principal components.

==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
This [[SOCR_EduMaterials_AnalysisActivities_PCA| SOCR Activity illustrates the use of PCA]].

====ICA (independent component analysis)====
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

* ICA Assumptions:
** The source signals are independent of each other;
** The values in each source signal have a non-Gaussian distribution.
* Effects of mixing source signals:
** Independence: the source signals are assumed to be independent, but their signal mixtures are not, because the mixtures share the same source signals;
** Normality: by the [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables with finite variance tends toward a Gaussian distribution, so a signal mixture is typically closer to Gaussian than its sources;
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.

: ICA finds the independent components by maximizing the statistical independence of the estimated components. In general, ICA cannot identify the actual number of source signals, nor can it identify the proper scaling of the source signals. Suppose the data are represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components are denoted by $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$, using a linear transformation $W$, $s=Wx$, into maximally independent components $s$, as measured by some function of independence. There are alternative ''models'' for ICA:
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
* Linear noisy ICA: with the additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(\cdot|\theta)$ with parameter $\theta$, the nonlinear ICA model is $x=f(s|\theta)+n$.
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires that: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, is at least as large as the number of estimated components, $n$, i.e., $n\leq m$; equivalently, the mixing matrix $A$ must be of full rank for its inverse to exist.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R, "*" is element-wise multiplication and "%*%" is matrix multiplication

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "C", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$X, main = "Pre-processed data")
plot(a$S, main = "ICA components")

: Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] takes a sequence of vector, matrix or data frame arguments and combines them by columns (rbind combines them by rows).

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200,
tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$S[,1 ], type = "l", main = "ICA source estimates",
xlab = "", ylab = "")
plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "")

====Factor analysis (FA)====
FA is a statistical method that describes the variability among observed, correlated variables in terms of a potentially lower number of unobserved variables (factors). Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$, with means $μ_1,μ_2,…,μ_p$, respectively. Suppose that, for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$, $j \in \{1,…,k\}$ and $k<p$, we have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where the $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, $x-μ = LF+ε$; with $n$ observations, $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is a $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$, so that the factors are uncorrelated. Solving the equations above yields the factors $F$ and the loading matrix $L$.

* Example: Suppose a sample of $N_i=1000$ students each respond to $N_a=10$ questions (subjects), and the responses are driven by $N_p=2$ latent factors. In the following, matrices are indicated by indexed variables. "Subject" indices are indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in this example. "Factor" indices are indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$, which is equal to $2$ in this example. "Instance" or "sample" indices are indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. The $i^{th}$ student's score for the $a^{th}$ question is given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$, of which the $x_{ai}$ are particular instances, or sets of observations. To ensure that all variables are on an equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a=\frac{1}{N_i} \sum_i {x_{ai}}$ is the sample mean and $σ_a^2=\frac{1}{N_i} \sum_i {(x_{ai}-μ_a)^2}$ is the sample variance. The factor analysis model is expressed by:
:$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
\vdots & & \vdots & & \vdots & & \vdots \\
z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i}
\end{matrix}$$

In index notation (equivalently, $Z=LF+ε$ in matrix form), this model can be expressed as:
$$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$
where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence'', and $\ell_{ap}$ is the loading of the $a^{th}$ subject on factor $p$, for $p=1,2$.
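
To make the model concrete, the sketch below simulates data from $Z=LF+ε$ with 10 variables, 2 factors and 1000 students, and then recovers the loadings with <code>factanal()</code>. The loading values, the seed and all variable names are hypothetical choices made only for illustration.

```r
set.seed(1)
N_i <- 1000   # students (instances)
N_a <- 10     # observed variables (subjects)
N_p <- 2      # latent factors

# Hypothetical loading matrix L (N_a x N_p): the first five variables load
# mainly on factor 1, the last five mainly on factor 2
L <- cbind(c(rep(0.8, 5), rep(0.1, 5)),
           c(rep(0.1, 5), rep(0.8, 5)))

# Uncorrelated factors with E(F) = 0 and cov(F) = I, one column per student
Fmat <- matrix(rnorm(N_p * N_i), N_p, N_i)

# Independent errors, with variances chosen so each variable has unit variance
psi <- 1 - rowSums(L^2)
eps <- matrix(rnorm(N_a * N_i, sd = rep(sqrt(psi), N_i)), N_a, N_i)

# Factor model Z = L F + eps (N_a x N_i); factanal() expects cases in rows
Z <- L %*% Fmat + eps
fit <- factanal(t(Z), factors = N_p, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)
```

The estimated loadings are identified only up to rotation and sign, so the columns may appear in a different order or with flipped signs relative to <code>L</code>.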

=====FA in R: using ''factanal()''=====
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors,
# with varimax rotation

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables.
# The psychological variables are locus_of_control, self_concept and motivation.
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science).
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary.
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix.
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load,type="n") # set up plot
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
# factor.pa() is deprecated in recent versions of psych; fa() with fm="pa" replaces it
fit <- fa(mydata, nfactors=3, rotate="varimax", fm="pa")
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

===PCA, ICA, FA: Similarities and Differences===
* PCA is closely related to factor analysis. The latter typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that they are all orthogonal. The principal components can be used to find clusters in a set of data.

* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variables. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when the goal is to detect latent constructs or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.

* Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.

===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the use of the SOCR Analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and its theoretical background, and shows how to use PCA and how to read and interpret the outcome. It shows students how to input data in the correct format, read the results of PCA and interpret the resulting transformed data.

* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

===Software ===
* [http://socr.ucla.edu/htmls/ana/ SOCR Analyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp function]
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]

===Problems===

====R Example 1====
# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),
# save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and ICA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
# X: pre-processed (centered) data matrix
# K: pre-whitening matrix that projects data onto the first n.comp
# principal components.
# W: estimated un-mixing matrix (X K W = S)
# A: estimated mixing matrix (X = SA)
# S: estimated source matrix (factor scores, $\Theta$ in the notes)

> dev.new() # opens a new graphics window on any platform (windows() works only on MS Windows)
> biplot(pc.cor)

> S <- ica$S
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep=""))
> A <- ica$A
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> dev.new()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor)

Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576
Fuel_Oil -0.351 0.547 -0.198 0.312
Bananas -0.373 -0.415 0.258 0.365 0.393 0.578
Tomatoes -0.369 -0.320 0.357 -0.738 -0.294
Orange_Juice -0.324 -0.311 -0.871 -0.119
Beef -0.424 0.220 -0.216 0.721 -0.449
Gasoline -0.380 0.479 -0.216 0.161
Comp.7
Electricity -0.131
Fuel_Oil -0.657
Bananas
Tomatoes
Orange_Juice 0.108
Beef
Gasoline 0.733

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143
Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857
Comp.7
SS loadings 1.000
Proportion Var 0.143
Cumulative Var 1.000

> field <- function(x) { substr(paste(x," ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity
Fuel_Oil
Bananas -2.59
Tomatoes -2.9
Orange_Juice -2.66
Beef
Gasoline

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity -0.41 0.51
Fuel_Oil -0.35 0.55
Bananas -0.37 -0.41 0.37
Tomatoes -0.37 -0.32 0.36 -0.74
Orange_Juice -0.32 -0.31 -0.87
Beef -0.42
Gasoline -0.38 0.48

===References===
* [http://nyx-www.informatik.uni-bremen.de/664/1/smith_tr_02.pdf A tutorial on Principal Components Analysis]
* [http://mirlyn.lib.umich.edu/Record/004232056 Sampling / Steven K. Thompson]
* [http://mirlyn.lib.umich.edu/Record/004133572 Sampling theory and methods / S. Sampath]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor Analysis details]

===Appendix===
[[SOCR_Data_Dinov_021808_ConsumerPriceIndex#Consumer_Price_Index_Data|SOCR 1981-2006 CPI Dataset]]

<center>
{| class="wikitable" style="text-align:center; width:25%" border="1"
|-
! Year||Electricity||Fuel_Oil||Bananas||Tomatoes||Orange_Juice||Beef||Gasoline
|-
| 1981||31.552||1.15||0.343||0.792||1.141||1.856||1.269
|-
| 1982||36.006||1.254||0.364||0.763||1.465||1.794||1.341
|-
| 1983||37.184||1.194||0.332||0.726||1.418||1.756||1.214
|-
| 1984||38.6||1.122||0.344||0.854||1.408||1.721||1.2
|-
| 1985||38.975||1.078||0.35||0.697||1.685||1.711||1.145
|-
| 1986||40.223||1.126||0.337||1.104||1.756||1.662||1.19
|-
| 1987||40.022||0.817||0.374||0.871||1.512||1.694||0.868
|-
| 1988||40.195||0.89||0.394||0.797||1.638||1.736||0.947
|-
| 1989||40.828||0.883||0.429||1.735||1.868||1.806||0.944
|-
| 1990||41.663||1.259||0.438||0.912||1.817||1.907||1.09
|-
| 1991||43.226||1.235||0.428||0.936||2.005||1.996||1.304
|-
| 1992||44.501||0.985||0.426||1.141||1.879||1.926||1.135
|-
| 1993||46.959||0.969||0.44||1.604||1.677||1.97||1.182
|-
| 1994||48.2||0.919||0.503||1.323||1.674||1.892||1.109
|-
| 1995||48.874||0.913||0.463||1.103||1.583||1.847||1.19
|-
| 1996||48.538||1.007||0.497||1.213||1.577||1.799||1.186
|-
| 1997||49.245||1.136||0.473||1.452||1.737||1.85||1.318
|-
| 1998||46.401||0.966||0.489||1.904||1.601||1.818||1.186
|-
| 1999||45.061||0.834||0.49||1.443||1.753||1.834||1.031
|-
| 2000||45.207||1.189||0.5||1.414||1.823||1.903||1.356
|-
| 2001||47.472||1.509||0.509||1.451||1.863||2.037||1.525
|-
| 2002||47.868||1.123||0.526||1.711||1.876||2.152||1.209
|-
| 2003||47.663||1.396||0.512||1.472||1.848||2.131||1.557
|-
| 2004||49.159||1.508||0.485||1.66||1.957||2.585||1.635
|-
| 2005||50.847||1.859||0.49||2.162||1.872||2.478||1.866
|-
| 2006||57.223||2.418||0.505||1.621||1.853||2.607||2.359
|}
</center>

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PCA_ICA_FA}}

SMHS ResamplingSimulation

2014-08-31T18:18:34Z

Jslavine: /* Problems */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. ''Resampling'' refers to a variety of methods that do one of the following: (1) estimate the precision of sample statistics (e.g., medians, percentiles) by using subsets of the available data (''jackknifing'') or by drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchange labels on data points when performing significance tests (''permutation tests''); or (3) validate models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real world processes or systems over time. We usually apply simulation after a model, which represents the key characteristics of the process, is developed. Simulation is widely applied in many contexts such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible, or is difficult and/or costly to experiment with, and it provides us with an easier way to obtain data about the system or test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data, for example, if we know they follow a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a small amount of data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.

*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.

*Key features of the bootstrap: The draws must be independent, and each observation in the observed sample must have an equal chance of being selected on every draw. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would be identical to the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: Bootstrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
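
The common bootstrap process described above (case resampling) can be sketched in a few lines of R. This is a minimal illustration, assuming a simulated exponential sample and the median as the statistic of interest; the seed, sample size and number of replicates <code>B</code> are arbitrary choices.

```r
set.seed(123)

# Step 1: a hypothetical observed sample of size N (simulated for illustration)
x <- rexp(50, rate = 1)
N <- length(x)
B <- 1000   # number of bootstrap replicates

# Steps 2-4: resample N observations with replacement, compute the statistic,
# save it, and repeat B times
boot_medians <- replicate(B, median(sample(x, size = N, replace = TRUE)))

# Step 5: use the bootstrap distribution to estimate the standard error
# and a 95% percentile confidence interval for the median
se_boot <- sd(boot_medians)
ci_boot <- quantile(boot_medians, probs = c(0.025, 0.975))

se_boot
ci_boot
```

The percentile interval is only one of several bootstrap interval constructions; it is used here because it follows directly from the saved replicates.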

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*The jackknife relies on the independence of the data. Extensions of the jackknife that allow for dependence in the data have been proposed.

*Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.

*Limitations: The jackknife is less general than the bootstrap and thus, is used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples and it does not perform well with small samples because you cannot generate many resamples.
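
A leave-one-out jackknife can be sketched as follows. The helper <code>jackknife()</code> is a hypothetical function written for this illustration (it is not from a package), and the data are simulated. It uses the standard estimates: bias $=(n-1)(\bar{\theta}_{(\cdot)}-\hat{\theta})$ and standard error $=\sqrt{\frac{n-1}{n}\sum_i (\hat{\theta}_{(i)}-\bar{\theta}_{(\cdot)})^2}$.

```r
set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)   # hypothetical sample
n <- length(x)

# Generic leave-one-out jackknife for a statistic `stat`
jackknife <- function(x, stat) {
  n <- length(x)
  theta_hat <- stat(x)                                    # full-sample estimate
  theta_i <- sapply(seq_len(n), function(i) stat(x[-i]))  # leave-one-out replicates
  theta_bar <- mean(theta_i)
  list(estimate = theta_hat,
       bias = (n - 1) * (theta_bar - theta_hat),
       se = sqrt((n - 1) / n * sum((theta_i - theta_bar)^2)),
       replicates = theta_i)
}

# For the sample mean, the jackknife SE equals the usual sd(x) / sqrt(n)
jk_mean <- jackknife(x, mean)

# Biased (divide-by-n) variance estimator; the jackknife bias correction
# recovers the usual unbiased sample variance
var_n <- function(v) mean((v - mean(v))^2)
jk_var <- jackknife(x, var_n)
var_corrected <- jk_var$estimate - jk_var$bias
```

Inspecting <code>jk_mean$replicates</code> shows which left-out observations change the estimate the most, which is how the jackknife helps flag influential cases.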

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model by assessing a statistical model using a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as ''validating sets''; a model is fit to the remaining data (i.e., ''training set'') and used to predict the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) Randomly partition the data into a training set and a validating set, (2) fit the model to the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the validating set, (4) repeat several times and average to reduce variability.

*'''Types of CV''':
**''Leave-one-out CV'': This is an iterative method in which the number of iterations equals the sample size, so each observation becomes the validating set exactly once. Steps: 1) Delete observation #1 from the data, 2) fit the model to observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model to observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**''K-fold cross-validation'' splits the data into K subsets and each is held out in turn as the validation set. This avoids self-influence, which is similar to the way in which, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when n is small. It is more computationally demanding than calculating in-sample measures. It is subject to the researcher’s selection of an appropriate fit statistic.

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but it is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ and sample sizes $n_A$ and $n_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T(obs)$, is calculated; (2) the observations from both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of sizes $n_A$ and $n_B$; the set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations where the difference in means is greater than or equal to $T(obs)$, and the two-sided p-value is calculated as the proportion of permutations where the absolute difference is greater than or equal to $|T(obs)|$; (4) sort the recorded differences and check whether $T(obs)$ lies within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.

====Simulation====
A common assumption is that the coefficients we are trying to estimate come from a probability distribution. With large enough sample sizes, according to the central limit theorem (CLT), this distribution is multivariate normal.
The goal of simulation is to make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients.
*Steps: (1) Choose a quantity of interest (QI), e.g., expected value, predicted probability, odds ratio, first difference, etc.; (2) set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (3) calculate the QI with each set of simulated coefficients; (4) set the variable to a new value; (5) calculate the QI with each set of simulated coefficients; (6) repeat as appropriate; (7) efficiently summarize the distribution of the computed QI at each value of the variable of interest.

*Advantages: Simulation provides more information than a table of regression outputs. It accounts for uncertainty in the QI and is flexible to many different types of models, QIs and variable specifications. After performing it once, it is easy to use and can be much easier than working with analytic solutions.

*Limitations: It relies on the CLT to justify asymptotic normality. In contrast, a fully Bayesian model using MCMC could produce an exact finite-sample distribution and bootstrapping would require no distributional assumptions. It can be computationally intense and large models can produce great uncertainty regarding quantities of interest.

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem using the SOCR applet for a demonstration activity. It describes an innovative effort at using technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments to demonstrate the assumptions, meaning and implications of the CLT, as well as a hands-on simulation and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], entitled "Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models," provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model via examples. The paper presents an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and discusses software implementations of these methods in SAS, SPSS macros, etc.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents a resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities and video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] presents an experiment on sampling distributions using the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT via an experiment. The sampling distribution CLT experiment provides a simulation accessible to the public that demonstrates characteristics of various sample statistics and the CLT and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply the topics to various types of activities by explaining concepts such as the native distribution, sample distribution and numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> sample(names,N,replace=F)
[1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
[8] "Jack" "Tim" "Alfred"

> sample(names,N,replace=T)
[1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
[8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate stock closing prices, $S_t$, on 252 trading days where $S_{t}$ satisfies: $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, $S_0=36$, $\sigma=2$ and $v=0.01$.
# Now suppose you bought a call on this stock with strike price 40. Based on your simulated data, what percentage of days would you profit from exercising the call option? (This is the percentage of days your simulated $S_t$ is greater than 40).

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]


{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T18:14:45Z

Jslavine: /* Applications */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. ''Resampling'' is any of a variety of methods in which the following processes are implemented: (1) estimating the precision of sample statistics (e.g., medians, percentiles) by using subsets of available data (''jackknifing'') or drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchanging labels on data points when performing significance tests (''permutation tests''); or (3) validating models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real world processes or systems over time. We usually apply simulation after a model, which represents the key characteristics of the process, is developed. Simulation is widely applied in many contexts such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult or costly to use, and it provides us with an easier way to obtain data about the system or test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a small amount of data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.

*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.

*Key features of the bootstrap: The draws must be independent, and each observation in the observed sample must have an equal chance of being selected. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would be identical to the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: Bootstrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
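
The common process listed above can be sketched in R. This is only a minimal illustration under stated assumptions: the observed sample is simulated from an exponential distribution, the statistic of interest is the median, and the names <code>x</code>, <code>B</code> and <code>boot_medians</code> are hypothetical choices made for this example.

```r
set.seed(1)                      # for reproducibility of this illustration only
x <- rexp(50, rate = 1)          # (1) observed sample of size N (simulated here)
N <- length(x)
B <- 1000                        # number of bootstrap replicates

# (2)-(4) resample with replacement, recompute and save the statistic, B times
boot_medians <- replicate(B, median(sample(x, N, replace = TRUE)))

# (5) use the replicates to approximate the sampling distribution of the median
boot_se <- sd(boot_medians)                          # bootstrap standard error
boot_ci <- quantile(boot_medians, c(0.025, 0.975))   # percentile 95% interval
```

The percentile interval is just one of several possible bootstrap intervals; it is used here because it follows directly from step (5).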

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*It is dependent on the independence of the data. Extensions of the jackknife to allow for dependences in the data have been proposed.

*Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.

*Limitations: The jackknife is less general than the bootstrap and is therefore used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples, and it does not perform well with small samples because only a few resamples can be generated.
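
A leave-one-out jackknife can be sketched in R as follows. The helper <code>jackknife()</code> is a hypothetical function written for this illustration (it is not part of base R), and it uses the standard delete-one formulas for the bias and standard error; the data are simulated.

```r
# Hypothetical helper: delete-one jackknife for a statistic `stat` of a vector `x`
jackknife <- function(x, stat) {
  n <- length(x)
  theta_hat <- stat(x)                                    # full-sample estimate
  theta_i <- sapply(seq_len(n), function(i) stat(x[-i]))  # leave-one-out replicates
  theta_dot <- mean(theta_i)
  list(
    bias = (n - 1) * (theta_dot - theta_hat),             # jackknife bias estimate
    se = sqrt((n - 1) / n * sum((theta_i - theta_dot)^2)),# jackknife standard error
    replicates = theta_i
  )
}

set.seed(2)
x <- rnorm(30, mean = 10, sd = 2)   # simulated sample, for illustration only
jk <- jackknife(x, mean)
```

For the sample mean, the jackknife standard error coincides with the usual <code>sd(x)/sqrt(n)</code> and the bias estimate is zero, which makes the mean a convenient check. The replicates in <code>jk$replicates</code> that differ most from the rest point to the most influential observations.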

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model by assessing a statistical model using a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as ''validating sets''; a model is fit to the remaining data (i.e., ''training set'') and used to predict the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) Randomly partition the data into a training set and a validating set, (2) fit the model to the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the validating set, (4) repeat several times and average to reduce variability.

*'''Types of CV''':
**''Leave-one-out CV'': This is an iterative method in which the number of iterations equals the sample size, so each observation becomes the validating set exactly once. Steps: 1) Delete observation #1 from the data, 2) fit the model to observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model to observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**''K-fold cross-validation'' splits the data into K subsets and each is held out in turn as the validation set. This avoids self-influence, which is similar to the way in which, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when n is small. It is more computationally demanding than calculating in-sample measures. It is subject to the researcher’s selection of an appropriate fit statistic.
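
K-fold cross-validation of a linear regression can be sketched in R as follows. This is a minimal illustration, assuming simulated data, K = 5 and mean squared error as the fit measure; the names <code>dat</code>, <code>fold</code> and <code>cv_mse</code> are hypothetical.

```r
set.seed(3)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)           # simulated data, for illustration only

K <- 5
fold <- sample(rep(seq_len(K), length.out = n))   # random assignment to K folds

cv_mse <- sapply(seq_len(K), function(k) {
  train <- dat[fold != k, ]                 # training set: all folds except k
  valid <- dat[fold == k, ]                 # validating set: fold k
  fit <- lm(y ~ x, data = train)            # fit the model to the training set
  pred <- predict(fit, newdata = valid)     # predict the held-out observations
  mean((valid$y - pred)^2)                  # chosen fit measure: MSE
})

cv_estimate <- mean(cv_mse)                 # average across the K validating sets
in_sample_mse <- mean(residuals(lm(y ~ x, data = dat))^2)
```

Setting <code>K <- n</code> in this sketch assigns one observation to each fold, which gives leave-one-out CV. Comparing <code>cv_estimate</code> with <code>in_sample_mse</code> shows how much the in-sample measure benefits from each observation influencing its own prediction.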

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but it is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ and sample sizes $n_A$ and $n_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T(obs)$, is calculated; (2) the observations from both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of sizes $n_A$ and $n_B$; the set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations where the difference in means is greater than or equal to $T(obs)$, and the two-sided p-value is calculated as the proportion of permutations where the absolute difference is greater than or equal to $|T(obs)|$; (4) sort the recorded differences and check whether $T(obs)$ lies within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
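
The two-group test described above can be sketched in R. Because the number of possible relabelings grows very quickly with the sample sizes, this sketch assumes a Monte Carlo approximation that uses a large number of random relabelings instead of enumerating all of them; the data are simulated and the names <code>a</code>, <code>b</code>, <code>n_perm</code> and <code>t_perm</code> are hypothetical.

```r
set.seed(4)
a <- rnorm(12, mean = 0.8)        # group A (simulated, for illustration only)
b <- rnorm(15, mean = 0)          # group B (simulated, for illustration only)
pooled <- c(a, b)
n_a <- length(a)

t_obs <- mean(a) - mean(b)        # (1) observed difference in means

n_perm <- 5000                    # number of random relabelings (assumption)
t_perm <- replicate(n_perm, {
  idx <- sample(length(pooled), n_a)          # relabel without replacement
  mean(pooled[idx]) - mean(pooled[-idx])      # (2) difference under relabeling
})

# (3) p-values: proportion of relabelings at least as extreme as the observed one
p_one_sided <- mean(t_perm >= t_obs)
p_two_sided <- mean(abs(t_perm) >= abs(t_obs))
```

With small groups, the exact test can instead enumerate every split of the pooled values, for example with <code>combn()</code>.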

====Simulation====
A common assumption is that the coefficients we are trying to estimate come from a probability distribution. With large enough sample sizes, according to the central limit theorem (CLT), this distribution is multivariate normal.
The goal of simulation is to make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients.
*Steps: (1) Choose a quantity of interest (QI), e.g., expected value, predicted probability, odds ratio, first difference, etc.; (2) set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (3) calculate the QI with each set of simulated coefficients; (4) set the variable to a new value; (5) calculate the QI with each set of simulated coefficients; (6) repeat as appropriate; (7) efficiently summarize the distribution of the computed QI at each value of the variable of interest.

*Advantages: Simulation provides more information than a table of regression outputs. It accounts for uncertainty in the QI and is flexible to many different types of models, QIs and variable specifications. After performing it once, it is easy to use and can be much easier than working with analytic solutions.

*Limitations: It relies on the CLT to justify asymptotic normality. In contrast, a fully Bayesian model using MCMC could produce an exact finite-sample distribution and bootstrapping would require no distributional assumptions. It can be computationally intense and large models can produce great uncertainty regarding quantities of interest.
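
The simulation steps above can be sketched in R for a logistic regression. This is a minimal illustration under stated assumptions: the data are simulated, the QI is the predicted probability (and its first difference) at two values of a single predictor, and the coefficient draws come from a multivariate normal distribution centered at the estimates with the estimated covariance matrix. The names <code>beta_sim</code> and <code>qi_at</code> are hypothetical.

```r
set.seed(5)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # simulated binary outcome
fit <- glm(y ~ x, family = binomial)

beta_hat <- coef(fit)                        # point estimates
V <- vcov(fit)                               # estimated covariance matrix
n_sim <- 1000

# Draw coefficient vectors from N(beta_hat, V) using the Cholesky factor of V
Z <- matrix(rnorm(n_sim * length(beta_hat)), nrow = n_sim)
beta_sim <- sweep(Z %*% chol(V), 2, beta_hat, "+")

# QI: predicted probability with the key variable set to x0
qi_at <- function(x0) plogis(beta_sim %*% c(1, x0))
p_low <- qi_at(-1)                           # QI at the first value
p_high <- qi_at(1)                           # QI at the new value
first_diff <- p_high - p_low                 # simulated first differences

# Summarize the distribution of the computed QI
qi_summary <- quantile(first_diff, c(0.025, 0.5, 0.975))
```

The spread of <code>first_diff</code> reflects the estimation uncertainty in the coefficients, which a table of regression output alone does not convey for a nonlinear QI.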

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem using the SOCR applet for a demonstration activity. It describes an innovative effort at using technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments to demonstrate the assumptions, meaning and implications of the CLT, as well as a hands-on simulation and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], entitled "Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models," provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model via examples. The paper presents an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and discusses software implementations of these methods in SAS, SPSS macros, etc.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents a resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities and video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] presents an experiment on sampling distributions using the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT via an experiment. The sampling distribution CLT experiment provides a simulation accessible to the public that demonstrates characteristics of various sample statistics and the CLT and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply the topics to various types of activities by explaining concepts such as the native distribution, sample distribution and numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> sample(names,N,replace=F)
[1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
[8] "Jack" "Tim" "Alfred"

> sample(names,N,replace=T)
[1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
[8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate the stock closing price $S_t$ on 252 trading days, where $S_{t}$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$ and $v=0.01\%$.
# Now suppose you bought a call on this stock with strike price 40. Based on your simulated data, on what percentage of days would you profit from exercising the call option? (That is, the percentage of days on which your $S_t$ is greater than 40.)

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]


{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T18:07:09Z

Jslavine: /* Simulation */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. ''Resampling'' is any of a variety of methods in which the following processes are implemented: (1) estimating the precision of sample statistics (e.g., medians, percentiles) by using subsets of available data (''jackknifing'') or drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchanging labels on data points when performing significance tests (''permutation tests''); or (3) validating models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real world processes or systems over time. We usually apply simulation after a model, which represents the key characteristics of the process, is developed. Simulation is widely applied in many contexts such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult or costly to use, and it provides us with an easier way to obtain data about the system or test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a small amount of data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.

*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the sampling distribution of that statistic.

*Key features of the bootstrap: The draws must be independent, and each observation in the observed sample must have an equal chance of being selected on every draw. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would be identical to the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others might not appear at all.

*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: Bootstrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
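
The common process listed above can be sketched in a few lines of R. This is only an illustrative sketch: the sample <code>x</code> is simulated from an exponential distribution purely as a stand-in for real data, and the seed, the sample size of 50, the choice of the median as the statistic and the use of a percentile interval are all assumptions made for the example.

```r
# Nonparametric bootstrap of the sample median (illustrative sketch).
set.seed(1)                      # arbitrary seed, used only for reproducibility
x <- rexp(50, rate = 1 / 10)     # hypothetical observed sample of size N = 50
N <- length(x)
B <- 1000                        # number of bootstrap replicates

# Steps (2)-(4): draw N observations with replacement, compute and store
# the statistic of interest, and repeat B times.
boot_medians <- replicate(B, median(sample(x, size = N, replace = TRUE)))

# Step (5): summarize the bootstrap distribution of the statistic.
boot_se <- sd(boot_medians)                          # bootstrap standard error
boot_ci <- quantile(boot_medians, c(0.025, 0.975))   # 95% percentile interval

print(boot_se)
print(boot_ci)
```

Any other statistic can be substituted for <code>median</code>; the resampling loop itself does not change.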

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*The jackknife depends on the independence of the data. Extensions of the jackknife that allow for dependence in the data have been proposed.

*Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.

*Limitations: The jackknife is less general than the bootstrap and thus, is used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples and it does not perform well with small samples because you cannot generate many resamples.
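
As a minimal sketch of the leave-one-out jackknife, the R code below estimates the bias and standard error of the plug-in (divide-by-n) variance. The data vector and the helper <code>plugin_var</code> are hypothetical and chosen only for illustration. For this particular statistic, the jackknife bias correction is known to recover the usual unbiased sample variance, which gives a convenient check.

```r
# Leave-one-out jackknife for the plug-in variance (illustrative sketch).
x <- c(4.1, 5.3, 2.8, 6.0, 5.5, 3.9, 4.7, 7.2)   # hypothetical sample
n <- length(x)

# Hypothetical helper: biased variance estimate that divides by n.
plugin_var <- function(v) mean((v - mean(v))^2)

theta_hat <- plugin_var(x)                        # estimate from the full sample
# Recompute the statistic n times, leaving out one observation each time.
theta_loo <- sapply(seq_len(n), function(i) plugin_var(x[-i]))

jack_bias <- (n - 1) * (mean(theta_loo) - theta_hat)                  # bias estimate
jack_se   <- sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2)) # standard error
theta_corrected <- theta_hat - jack_bias                              # bias-corrected estimate

print(c(plug_in = theta_hat, corrected = theta_corrected, unbiased = var(x)))
```

The individual values in <code>theta_loo</code> can also be inspected directly: replicates that differ most from the rest point to influential observations.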

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model by assessing a statistical model using a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as ''validating sets''; a model is fit to the remaining data (i.e., ''training set'') and used to predict the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) Randomly partition the data into a training set and a validating set, (2) fit the model to the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the validating set, (4) repeat several times and average to reduce variability.

*'''Types of CV''':
**''Leave-one-out CV'': This is an iterative method in which the number of iterations equals the sample size, so each observation becomes the validating set exactly once. Steps: 1) Delete observation #1 from the data, 2) fit the model to observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model to observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**''K-fold cross-validation'' splits the data into K subsets, and each is held out in turn as the validation set. This avoids self-influence, which is similar to the way in which, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when n is small. It is more computationally demanding than calculating in-sample measures. It is subject to the researcher’s selection of an appropriate fit statistic.
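
A hedged sketch of K-fold cross-validation for a simple linear regression is shown below. The data are simulated, and the model, the choice of K = 5 and the use of root mean squared error (RMSE) as the fit measure are assumptions made for illustration only.

```r
# 5-fold cross-validation of a linear regression (illustrative sketch).
set.seed(2)                                  # arbitrary seed for reproducibility
n <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 3)    # hypothetical data-generating model

K <- 5
# Assign each row to one of K folds at random (equal fold sizes here).
fold <- sample(rep(seq_len(K), length.out = n))

cv_mse <- sapply(seq_len(K), function(k) {
  train <- dat[fold != k, ]                  # training set: all folds but k
  valid <- dat[fold == k, ]                  # validating set: fold k
  fit <- lm(y ~ x, data = train)
  mean((valid$y - predict(fit, newdata = valid))^2)
})

cv_rmse <- sqrt(mean(cv_mse))                # out-of-sample fit measure
in_sample_rmse <- sqrt(mean(residuals(lm(y ~ x, data = dat))^2))
print(c(cv = cv_rmse, in_sample = in_sample_rmse))
```

Comparing the two printed values shows how far the in-sample measure is from the cross-validated one for this data set.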

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but it is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. Let $n_A$ and $n_B$ be the sample sizes of the two groups. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T_{obs} = \bar{x}_A - \bar{x}_B$, is calculated; (2) the observations of both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of sizes $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations in which the difference in means is greater than or equal to $T_{obs}$; the two-sided p-value is calculated as the proportion of permutations in which the absolute difference is greater than or equal to $|T_{obs}|$; (4) equivalently, sort the recorded differences and check whether $T_{obs}$ lies within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
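
When the number of possible relabelings is too large to enumerate, a random subset of them is used instead. The R sketch below does this for the difference in means; the two data vectors, the seed and the number of relabelings <code>R</code> are hypothetical values chosen for the example.

```r
# Monte Carlo permutation test for a difference in means (illustrative sketch).
set.seed(3)                                       # arbitrary seed
a <- c(12.1, 14.3, 11.8, 15.2, 13.9, 14.8)        # hypothetical group A
b <- c(10.4, 11.9, 12.5, 10.8, 11.1, 12.0, 10.2)  # hypothetical group B
n_a <- length(a)
pooled <- c(a, b)                                 # pool both groups

t_obs <- mean(a) - mean(b)                        # observed difference in means

R <- 9999                                         # number of random relabelings
t_perm <- replicate(R, {
  idx <- sample(length(pooled), n_a)              # new "group A", drawn without replacement
  mean(pooled[idx]) - mean(pooled[-idx])
})

p_one <- mean(t_perm >= t_obs)                    # one-sided p-value
p_two <- mean(abs(t_perm) >= abs(t_obs))          # two-sided p-value
print(c(one_sided = p_one, two_sided = p_two))
```

Because only a sample of relabelings is used, the p-values are approximations whose precision improves as <code>R</code> grows.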

====Simulation====
A common assumption is that the coefficients we are trying to estimate come from a probability distribution. With large enough sample sizes, according to the central limit theorem (CLT), this distribution is multivariate normal.
The goal of simulation is to make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients.
*Steps: (1) Choose a quantity of interest (QI), e.g., expected value, predicted probability, odds ratio, first difference, etc.; (2) set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (3) calculate the QI with each set of simulated coefficients; (4) set the variable to a new value; (5) calculate the QI again with each set of simulated coefficients; (6) repeat as appropriate; (7) efficiently summarize the distribution of the computed QI at each value of the variable of interest.

*Advantages: Simulation provides more information than a table of regression outputs. It accounts for uncertainty in the QI and is flexible to many different types of models, QIs and variable specifications. After performing it once, it is easy to use and can be much easier than working with analytic solutions.

*Limitations: It relies on the CLT to justify asymptotic normality. In contrast, a fully Bayesian model using MCMC could produce an exact finite-sample distribution and bootstrapping would require no distributional assumptions. It can be computationally intense and large models can produce great uncertainty regarding quantities of interest.
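
The simulation steps can be sketched in R as follows. Everything specific here is an assumption made for illustration: the data are simulated, the model is a logistic regression with predictors <code>x1</code> and <code>x2</code>, the quantity of interest is the predicted probability, and the multivariate normal draws are generated in base R from a Cholesky factor of the estimated covariance matrix.

```r
# Simulating a quantity of interest from a fitted model (illustrative sketch).
set.seed(4)                                   # arbitrary seed
n  <- 500
x1 <- rnorm(n)                                # hypothetical key variable
x2 <- rbinom(n, 1, 0.4)                       # hypothetical control variable
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.5 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

beta_hat <- coef(fit)                         # point estimates
V <- vcov(fit)                                # estimated covariance matrix
S <- 1000                                     # number of simulated coefficient sets

# Draw S coefficient vectors from N(beta_hat, V): Z %*% chol(V) has covariance V.
Z <- matrix(rnorm(S * length(beta_hat)), nrow = S)
beta_sim <- sweep(Z %*% chol(V), 2, beta_hat, "+")

# QI: predicted probability at several values of x1, with x2 held at its mean.
x1_values <- c(-1, 0, 1)
qi <- sapply(x1_values, function(v) {
  profile <- c(1, v, mean(x2))                # intercept, key variable, control
  as.vector(plogis(beta_sim %*% profile))     # one QI value per simulated draw
})
colnames(qi) <- paste0("x1=", x1_values)

# Summarize the simulated QI distribution at each value of x1.
summary_qi <- apply(qi, 2, function(p) c(mean = mean(p), quantile(p, c(0.025, 0.975))))
print(round(summary_qi, 3))
```

Each column of <code>summary_qi</code> gives a point estimate and a 95% simulation interval for the predicted probability at one value of the key variable.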

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem (CLT) using a SOCR applet and demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulations and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], titled ''Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models'', provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model through examples. The paper presents an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, as well as software implementations of these methods, including SAS and SPSS macros.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents the resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling and randomization based statistical inference and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities, and a video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] is an experiment on sampling distributions and the Central Limit Theorem (CLT). It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and that empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply them to various types of activities involving the concepts of a native distribution, sample distribution and numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> sample(names,N,replace=FALSE)   # without replacement: a random permutation of the names
[1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
[8] "Jack" "Tim" "Alfred"

> sample(names,N,replace=TRUE)    # with replacement: some names may repeat, others may be absent
[1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
[8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate the stock closing price $S_t$ over 252 trading days, where $S_{t}$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$ and $v=0.01\%$.
# Now suppose you bought a call option on this stock with strike price 40. Based on your simulated data, on what percentage of days could you profit from exercising the option (that is, on what percentage of days is $S_t$ greater than 40)?

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]


{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T17:59:10Z

Jslavine: /* Cross-validation */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research across many fields. ''Resampling'' refers to any of a variety of methods that do one of the following: (1) estimate the precision of sample statistics (e.g., medians, percentiles) by using subsets of the available data (''jackknifing'') or by drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchange labels on data points when performing significance tests (''permutation tests''); or (3) validate models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques, including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real-world processes or systems over time. We usually apply simulation after a model that represents the key characteristics of the process has been developed. Simulation is widely applied in many contexts, such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult or costly to work with, and it provides an easier way to obtain data about the system or to test it. We are going to present an introduction to simulation, including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data, for example, that they follow a normal distribution, then we can easily generate a series of normally distributed values and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a few data points from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to provide students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.

*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the sampling distribution of that statistic.

*Key features of the bootstrap: The draws must be independent, and each observation in the observed sample must have an equal chance of being selected on every draw. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would be identical to the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others might not appear at all.

*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: Bootstrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
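
Among the schemes listed above, case resampling applied to regression is perhaps the easiest to sketch. In the R example below, whole rows of a data frame are resampled with replacement and the slope is re-estimated each time. The simulated data, the heavy-tailed error distribution and the number of replicates are assumptions made only for illustration.

```r
# Case-resampling bootstrap for a regression slope (illustrative sketch).
set.seed(5)                                  # arbitrary seed
n <- 80
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 1.5 * dat$x + rt(n, df = 3)     # hypothetical model with heavy-tailed errors

B <- 1000                                    # number of bootstrap replicates
boot_slopes <- replicate(B, {
  rows <- sample(n, n, replace = TRUE)       # resample whole cases (rows)
  coef(lm(y ~ x, data = dat[rows, ]))[["x"]] # refit and keep the slope
})

boot_se  <- sd(boot_slopes)                  # bootstrap standard error of the slope
model_se <- summary(lm(y ~ x, data = dat))$coefficients["x", "Std. Error"]
print(c(bootstrap = boot_se, model_based = model_se))
```

Printing both standard errors side by side makes it easy to see whether the model-based formula and the bootstrap agree for a given data set.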

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*The jackknife depends on the independence of the data. Extensions of the jackknife that allow for dependence in the data have been proposed.

*Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.

*Limitations: The jackknife is less general than the bootstrap and thus, is used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples and it does not perform well with small samples because you cannot generate many resamples.
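
For the leave-one-out case, the bias and variance estimates mentioned above are usually written as follows. The notation is introduced here only for illustration: $\hat{\theta}$ is the statistic computed on the full sample of size $n$, and $\hat{\theta}_{(i)}$ is the same statistic computed with observation $i$ removed.

$$\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}$$

$$\widehat{bias}_{jack} = (n-1)\left(\hat{\theta}_{(\cdot)} - \hat{\theta}\right)$$

$$\widehat{Var}_{jack} = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\right)^2$$

The bias-corrected estimate is then $\hat{\theta} - \widehat{bias}_{jack} = n\hat{\theta} - (n-1)\hat{\theta}_{(\cdot)}$, and the jackknife standard error is the square root of $\widehat{Var}_{jack}$.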

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model by assessing a statistical model using a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as ''validating sets''; a model is fit to the remaining data (i.e., ''training set'') and used to predict the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) Randomly partition the data into a training set and a validating set, (2) fit the model to the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the validating set, (4) repeat several times and average to reduce variability.

*'''Types of CV''':
**''Leave-one-out CV'': This is an iterative method in which the number of iterations equals the sample size, so each observation becomes the validating set exactly once. Steps: 1) Delete observation #1 from the data, 2) fit the model to observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model to observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**''K-fold cross-validation'' splits the data into K subsets, and each is held out in turn as the validation set. This avoids self-influence, which is similar to the way in which, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: The training and validating data must be random samples from the same population. It will be most different from in-sample measures when n is small. It is more computationally demanding than calculating in-sample measures. It is subject to the researcher’s selection of an appropriate fit statistic.
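
The leave-one-out steps can be sketched in R as below, using simulated data and a simple linear regression chosen only for illustration. As a sanity check, the sketch also uses a well-known identity for ordinary least squares: the leave-one-out prediction error of observation i equals its ordinary residual divided by one minus its leverage, so the same result can be obtained from a single fit.

```r
# Leave-one-out cross-validation for a linear regression (illustrative sketch).
set.seed(6)                                   # arbitrary seed
n <- 40
dat <- data.frame(x = runif(n, 0, 5))
dat$y <- 3 - 0.7 * dat$x + rnorm(n)           # hypothetical data-generating model

# Delete each observation in turn, refit, and predict the deleted observation.
loo_pred <- unname(sapply(seq_len(n), function(i) {
  fit <- lm(y ~ x, data = dat[-i, ])
  predict(fit, newdata = dat[i, , drop = FALSE])
}))
loo_mse <- mean((dat$y - loo_pred)^2)         # chosen fit measure: mean squared error

# Check with the single-fit identity for least squares.
full <- lm(y ~ x, data = dat)
press_mse <- mean((residuals(full) / (1 - hatvalues(full)))^2)

print(c(loop = loo_mse, shortcut = press_mse))
```

The explicit loop generalizes to any model, whereas the single-fit shortcut applies only to linear least-squares fits.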

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but it is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. Let $n_A$ and $n_B$ be the sample sizes of the two groups. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T_{obs} = \bar{x}_A - \bar{x}_B$, is calculated; (2) the observations of both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of sizes $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations in which the difference in means is greater than or equal to $T_{obs}$; the two-sided p-value is calculated as the proportion of permutations in which the absolute difference is greater than or equal to $|T_{obs}|$; (4) equivalently, sort the recorded differences and check whether $T_{obs}$ lies within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
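
For small samples, every possible division of the pooled values can be enumerated, which gives the exact permutation distribution. The R sketch below does this with <code>combn</code>; the two groups of four hypothetical observations are chosen only so that the enumeration stays small (70 possible splits).

```r
# Exact permutation test by full enumeration (illustrative sketch).
a <- c(23, 27, 31, 35)                        # hypothetical group A
b <- c(18, 21, 24, 26)                        # hypothetical group B
pooled <- c(a, b)
n_a <- length(a)

t_obs <- mean(a) - mean(b)                    # observed difference in means

# Each column of 'splits' holds the indices assigned to "group A" in one relabeling.
splits <- combn(length(pooled), n_a)
t_all <- apply(splits, 2, function(idx) mean(pooled[idx]) - mean(pooled[-idx]))

p_one <- mean(t_all >= t_obs)                 # exact one-sided p-value
p_two <- mean(abs(t_all) >= abs(t_obs))       # exact two-sided p-value
print(c(one_sided = p_one, two_sided = p_two))
```

The number of splits grows very quickly with the group sizes, so full enumeration is practical only for small samples.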

====Simulation====
A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With a large enough sample size, according to the central limit theorem (CLT), this distribution is multivariate normal.
*Steps: (1) make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients, (2) choose a quantity of interest (QI), e.g., expected value, predicted probability, odds ratio, first difference, etc., (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes, (4) calculate the QI with each set of simulated coefficients, (5) set the variable to a new value, (6) calculate the QI again with each set of simulated coefficients, (7) repeat as appropriate, (8) efficiently summarize the distribution of the computed QI at each value of the variable.

*Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; flexible to many different types of models, QIs and variable specifications; after doing it once, easy to use; can be much easier than working with analytic solutions.

*Limitations: relies on the CLT to justify asymptotic normality (a fully Bayesian model using MCMC could produce an exact finite-sample distribution; bootstrapping would require no distributional assumption); computational intensity; large models can produce a lot of uncertainty around the quantity of interest.

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem (CLT) using a new SOCR applet and demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulations and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], titled ''Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models'', provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model. The paper includes an illustrative example that assesses and contrasts potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and it describes software implementations of these methods, including SAS and SPSS macros.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents a resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference, and demonstrate the similarities and differences between parametric and resampling-based statistical inference. The article provides specific steps for implementing the activities, and a video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] describes an experiment on sampling distributions and the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and that empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply them to various types of activities, as well as the concepts of a native distribution, sample distribution and numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> sample(names,N,replace=FALSE)   # without replacement: a random permutation of the names
[1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
[8] "Jack" "Tim" "Alfred"

> sample(names,N,replace=TRUE)    # with replacement: names may repeat or be left out
[1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
[8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate the closing price $S_t$ of a stock over 252 trading days, where $S_t$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$ and $v=0.01\%$.
# Now suppose you bought a call option on this stock with strike price 40. Using your simulated data, on what percentage of days could you profit from exercising the call option (that is, on what percentage of days is $S_t$ greater than 40)?

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T17:48:26Z

Jslavine: /* Jackknife */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. ''Resampling'' is any of a variety of methods that do one of the following: (1) estimate the precision of sample statistics (e.g., medians, percentiles) by using subsets of the available data (''jackknifing'') or by drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchange labels on data points when performing significance tests (''permutation tests''); or (3) validate models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real-world processes or systems over time. We usually apply simulation after a model that represents the key characteristics of the process has been developed. Simulation is widely applied in many contexts, such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult and/or costly to work with, and it provides an easier way to obtain data about the system or to test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a small amount of data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.

*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.

*Key features of the bootstrap: The draws must be independent, and each observation in the observed sample must have an equal chance of being selected on every draw. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would contain exactly the same observations as the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: Bootstrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
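
The common process listed above can be sketched in a few lines of code. This is a minimal illustration rather than a reference implementation: it uses Python's standard library, the data values are invented, and the helper names <code>bootstrap_statistic</code> and <code>percentile_interval</code> are hypothetical. The interval shown is the simple percentile interval, which is only one of several ways to build a bootstrap confidence interval.

```python
import random
import statistics

def bootstrap_statistic(sample, statistic, n_resamples=1000, seed=0):
    """Return the statistic computed on n_resamples bootstrap resamples."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations independently and with replacement.
        resample = rng.choices(sample, k=n)
        replicates.append(statistic(resample))
    return replicates

def percentile_interval(replicates, level=0.95):
    """Simple percentile confidence interval from bootstrap replicates."""
    ordered = sorted(replicates)
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * len(ordered))
    hi_idx = int((1.0 - alpha) * len(ordered)) - 1
    return ordered[lo_idx], ordered[hi_idx]

# Hypothetical data, invented for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]

medians = bootstrap_statistic(data, statistics.median)
boot_se = statistics.stdev(medians)    # bootstrap estimate of the standard error
ci_low, ci_high = percentile_interval(medians)
print(boot_se, ci_low, ci_high)
```

Any function of a sample can be passed as <code>statistic</code>, which is what makes the approach attractive when no simple formula for the standard error exists.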

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is to systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*Jackknife estimates of variance asymptotically tend to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variable, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*The jackknife assumes that the observations are independent. Extensions of the jackknife that allow for dependence in the data have been proposed.

*Advantages: This method is good at detecting outliers and influential cases. Those sub-sample estimates that differ most indicate cases that have the strongest influence on those estimates in the original full sample analysis.

*Limitations: The jackknife is less general than the bootstrap and thus, is used less frequently. It does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples and it does not perform well with small samples because you cannot generate many resamples.
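
A leave-one-out jackknife can be sketched as follows. The formulas in the comments are the standard delete-one jackknife estimates of bias and variance; the data and the function name <code>jackknife</code> are assumptions made for this illustration. The last line also shows how the replicates can flag the most influential observation, as described under the advantages above.

```python
import statistics

def jackknife(sample, statistic):
    """Leave-one-out jackknife: returns (estimate, bias, standard error, replicates)."""
    n = len(sample)
    full_estimate = statistic(sample)
    # Recompute the statistic n times, leaving out one observation each time.
    replicates = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean_rep = sum(replicates) / n
    # bias = (n - 1) * (mean of replicates - full-sample estimate)
    bias = (n - 1) * (mean_rep - full_estimate)
    # variance = (n - 1) / n * sum of squared deviations of the replicates
    variance = (n - 1) / n * sum((r - mean_rep) ** 2 for r in replicates)
    return full_estimate, bias, variance ** 0.5, replicates

# Hypothetical data, invented for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]

estimate, jack_bias, jack_se, reps = jackknife(data, statistics.mean)

# The observation whose removal changes the estimate most is the most influential one.
most_influential = max(range(len(data)), key=lambda i: abs(reps[i] - estimate))
print(estimate, jack_bias, jack_se, data[most_influential])
```

For the sample mean, the jackknife standard error coincides with the usual formula (sample standard deviation divided by the square root of n), which gives a convenient check of the code.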

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model, assessing a statistical model on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) randomly partition the available data into a training set and a testing set, (2) fit the model on the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average the results to reduce variability.

*Types of CV:
**Leave-one-out CV: an iterative method in which the number of iterations equals the sample size and each observation serves as the validation set exactly once. Steps: 1) delete observation #1 from the data, 2) fit the model on observations #2-n, 3) apply the coefficients from step 2 to observation #1 and calculate the chosen fit measure, 4) restore observation #1 and delete observation #2 from the data, 5) fit the model on observations #1 and #3-n, 6) apply the coefficients from step 5 to observation #2 and calculate the chosen fit measure, 7) repeat until every observation has been deleted once.
**K-fold CV: splits the data into K subsets, each of which is held out in turn as the validation set. This avoids self-influence. For comparison, in a regression method such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: training and testing data must be random samples from the same population; will show biggest differences from in-sample measures when n is small; higher computational demand than calculating in-sample measures; subject to researcher’s selection of an appropriate fit statistic.
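
The K-fold and leave-one-out procedures above can be illustrated with a small sketch. It assumes a simple linear regression of y on a single x, uses mean squared error as the fit measure, and relies on invented data; the function names <code>fit_line</code> and <code>k_fold_mse</code> are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def k_fold_mse(xs, ys, k, seed=0):
    """Average held-out mean squared error over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)        # random partition of the data
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # The fit measure uses only observations that were not used for fitting.
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.2, 16.9, 19.1, 20.8]

cv_mse_5fold = k_fold_mse(xs, ys, k=5)
loo_mse = k_fold_mse(xs, ys, k=len(xs))   # leave-one-out is the special case k = n

a, b = fit_line(xs, ys)
in_sample_mse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(in_sample_mse, cv_mse_5fold, loo_mse)
```

Comparing <code>in_sample_mse</code> with <code>loo_mse</code> shows the self-influence effect described above: the in-sample error is the more optimistic of the two.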

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but one that is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ and sample sizes $n_A$ and $n_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) calculate the observed difference in means between groups A and B, $T(obs)$; (2) pool the observations, then calculate and record the difference in sample means for each possible way of dividing the pooled values into two groups of sizes $n_A$ and $n_B$ (or, when full enumeration is infeasible, for a large random sample of such divisions). The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) calculate the one-sided p-value as the proportion of permutations in which the difference in means is greater than or equal to $T(obs)$, and the two-sided p-value as the proportion of permutations in which the absolute difference is greater than or equal to $|T(obs)|$; (4) equivalently, sort the recorded differences and check whether $T(obs)$ falls within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
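
The two-group test described above can be sketched as an exact permutation test that enumerates every split of the pooled data. The data and the function name <code>permutation_test</code> are invented for this illustration, and full enumeration is only practical for small groups; with larger groups one would typically evaluate a random sample of splits instead.

```python
import itertools

def permutation_test(group_a, group_b):
    """Exact two-sample permutation test on the difference in means.

    Returns (observed difference, one-sided p-value, two-sided p-value).
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled = list(group_a) + list(group_b)
    total = sum(pooled)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    tol = 1e-12                      # guards the >= comparisons against rounding error
    n_splits = 0
    count_one_sided = 0
    count_two_sided = 0
    # Every way of choosing which n_a pooled values are labelled "A".
    for idx in itertools.combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        n_splits += 1
        if diff >= observed - tol:
            count_one_sided += 1
        if abs(diff) >= abs(observed) - tol:
            count_two_sided += 1
    return observed, count_one_sided / n_splits, count_two_sided / n_splits

# Hypothetical data, invented for illustration only.
group_a = [12.1, 14.3, 13.8, 15.2, 14.9]
group_b = [10.4, 11.9, 12.5, 11.1, 10.8]

t_obs, p_one, p_two = permutation_test(group_a, group_b)
print(t_obs, p_one, p_two)
```

Rejecting $H_0$ when the two-sided p-value is below 0.05 is equivalent to step (4) above, where $T(obs)$ falls outside the middle 95% of the recorded differences.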

====Simulation====
A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With a large enough sample size, according to the CLT, this distribution is approximately multivariate normal.
*Steps: (1) make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients; (2) choose a quantity of interest (QI), such as an expected value, predicted probability, odds ratio or first difference; (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (4) calculate the QI with each set of simulated coefficients; (5) set the key variable to a new value; (6) calculate the QI again with each set of simulated coefficients; (7) repeat as appropriate; (8) summarize the distribution of the computed QI at each value of the key variable.

*Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; flexible to many different types of models, QIs and variable specifications; after doing it once, easy to use; can be much easier than working with analytic solutions.

*Limitations: relies on CLT to justify asymptotic normality (fully Bayesian model using MCMC could produce exact finite-sample distribution; bootstrapping would require no distributional assumption); computational intensity; large models can produce lots of uncertainty around quantity of interest.
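
The simulation steps above can be sketched for a deliberately small case. The example assumes a logistic regression with an intercept and one key variable, with invented coefficient estimates and an invented covariance matrix; the QI is the predicted probability. Because the model has only one covariate, the step of holding the other variables at their means or modes does not arise here. The function names are hypothetical.

```python
import math
import random
import statistics

def simulate_coefficients(beta_hat, cov, n_draws=1000, seed=0):
    """Draw (b0, b1) pairs from a bivariate normal with mean beta_hat and covariance cov."""
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix, written out by hand.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    draws = []
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        draws.append((beta_hat[0] + l11 * z1,
                      beta_hat[1] + l21 * z1 + l22 * z2))
    return draws

def predicted_probability(b0, b1, x):
    """QI for a logistic model: P(Y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical estimates and covariance matrix, invented for illustration only.
beta_hat = (-1.0, 0.8)
cov = [[0.04, -0.01],
       [-0.01, 0.01]]

draws = simulate_coefficients(beta_hat, cov)

summary = {}
for x in (0.0, 1.0, 2.0):                 # values of the key variable
    qi = sorted(predicted_probability(b0, b1, x) for b0, b1 in draws)
    # Mean of the simulated QI and an approximate central 95% interval.
    summary[x] = (statistics.mean(qi), qi[24], qi[974])
print(summary)
```

The spread of the simulated QI at each value of the key variable reflects the uncertainty in the estimated coefficients, which a table of point estimates alone would not show.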

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem (CLT) using a new SOCR applet and demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulations and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], titled ''Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models'', provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model. The paper includes an illustrative example that assesses and contrasts potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and it describes software implementations of these methods, including SAS and SPSS macros.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents a resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference, and demonstrate the similarities and differences between parametric and resampling-based statistical inference. The article provides specific steps for implementing the activities, and a video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] describes an experiment on sampling distributions and the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and that empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply them to various types of activities, as well as the concepts of a native distribution, sample distribution and numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> sample(names,N,replace=FALSE)   # without replacement: a random permutation of the names
[1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
[8] "Jack" "Tim" "Alfred"

> sample(names,N,replace=TRUE)    # with replacement: names may repeat or be left out
[1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
[8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate the closing price $S_t$ of a stock over 252 trading days, where $S_t$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$ and $v=0.01\%$.
# Now suppose you bought a call option on this stock with strike price 40. Using your simulated data, on what percentage of days could you profit from exercising the call option (that is, on what percentage of days is $S_t$ greater than 40)?

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T17:43:54Z

Jslavine: /* Bootstrapping */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. ''Resampling'' is any of a variety of methods that do one of the following: (1) estimate the precision of sample statistics (e.g., medians, percentiles) by using subsets of the available data (''jackknifing'') or by drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchange labels on data points when performing significance tests (''permutation tests''); or (3) validate models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real-world processes or systems over time. We usually apply simulation after a model that represents the key characteristics of the process has been developed. Simulation is widely applied in many contexts, such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult and/or costly to work with, and it provides an easier way to obtain data about the system or to test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process are very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a small amount of data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like the median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls into the broader class of resampling methods.

*Situations where bootstrapping applies: (1) When the theoretical distribution of a statistic of interest is complicated or unknown; (2) When the sample size is insufficient for straightforward statistical inference; (3) When power calculations have to be performed, and a small pilot sample is available.

*Bootstrapping is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or when parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population based on sample data (sample → population) can be modeled by resampling the data and performing inference on this new sample (resample → sample). More formally, bootstrapping works by treating inference regarding the true probability distribution given the data as being analogous to inference regarding the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the distribution. If the empirical distribution is a reasonable approximation of the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) Begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (e.g., 1000), and (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.

*Key features of the bootstrap: The draws must be independent, and each observation in the observed sample must have an equal chance of being selected on every draw. The simulated sample must be of size N to take full advantage of the information in the sample. Resampling must be done with replacement; otherwise, every simulated sample of size N would contain exactly the same observations as the original sample. Resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap schemes: (1) ''Case resampling'' using the Monte Carlo algorithm; (2) estimating the distribution of the sample means; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: Bootstrapping is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: It does not provide general finite-sample guarantees. Additionally, the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is systematically recomputing the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*The jackknife estimate of variance tends asymptotically to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variate, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*It relies on the independence of the data. Extensions of the jackknife to allow for dependence in the data have been proposed.

*Advantages: good at detecting outliers/influential cases. The sub-sample estimates that differ most from the rest indicate the cases that have the most influence on the estimates in the original full-sample analysis.

*Limitations: The jackknife is less general than the bootstrap, and thus used less frequently; it does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples; it does not perform well in small samples because you don’t end up generating many resamples.
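
A minimal sketch of the leave-one-out jackknife in R follows. The helper <code>jackknife()</code> and the data vector are hypothetical illustrations, not code from the original material; the bias and standard-error expressions are the standard leave-one-out jackknife formulas.

```r
# Leave-one-out jackknife estimates of bias and standard error.
jackknife <- function(x, statistic) {
  n <- length(x)
  theta_hat <- statistic(x)                                    # full-sample estimate
  theta_i <- sapply(seq_len(n), function(i) statistic(x[-i]))  # leave-one-out replicates
  theta_bar <- mean(theta_i)
  list(
    estimate = theta_hat,
    bias = (n - 1) * (theta_bar - theta_hat),
    se = sqrt((n - 1) / n * sum((theta_i - theta_bar)^2)),
    replicates = theta_i
  )
}

# Hypothetical data; the value 25.0 plays the role of an outlier
x <- c(8.2, 9.1, 7.7, 10.4, 8.8, 9.5, 25.0, 8.1, 9.9, 8.6)
jk <- jackknife(x, mean)

print(jk$se)   # for the sample mean this coincides with sd(x) / sqrt(length(x))

# The replicate farthest from the others flags the most influential case
most_influential <- which.max(abs(jk$replicates - mean(jk$replicates)))
print(most_influential)
```

For the sample mean the jackknife reproduces the usual standard error exactly, which makes it a convenient check before applying the same function to a less tractable statistic.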

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model, assessing a statistical model on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) randomly partition the available data into a training set and a testing set, (2) fit the model on the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average to reduce variability.

*Types of CV:
**Leave-one-out CV: an iterative method with number of iterations = sample size, in which each observation serves as the validation set exactly once. Steps: 1) delete observation #1 from the data, 2) fit the model on observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model on observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**K-fold cross-validation: splits the data into K subsets, and each is held out in turn as the validation set. This avoids self-influence. For comparison, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: training and testing data must be random samples from the same population; will show biggest differences from in-sample measures when n is small; higher computational demand than calculating in-sample measures; subject to researcher’s selection of an appropriate fit statistic.
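
The leave-one-out steps described above can be sketched in R as follows. The simulated data, the seed and the choice of mean squared error as the fit measure are assumptions made for illustration only.

```r
# Leave-one-out cross-validation for a simple linear regression.
set.seed(1)                                     # arbitrary seed
n <- 30
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n, sd = 1)     # hypothetical data-generating model

sq_err <- numeric(n)
for (i in seq_len(n)) {
  fit <- lm(y ~ x, data = dat[-i, ])                       # fit without observation i
  pred <- predict(fit, newdata = dat[i, , drop = FALSE])   # predict the held-out case
  sq_err[i] <- (dat$y[i] - pred)^2                         # chosen fit measure: squared error
}
loocv_mse <- mean(sq_err)

# In-sample mean squared error from the full fit, for comparison
in_sample_mse <- mean(residuals(lm(y ~ x, data = dat))^2)
print(c(loocv = loocv_mse, in_sample = in_sample_mse))
```

For ordinary least squares the leave-one-out error is never smaller than the in-sample error, which is the self-influence effect mentioned in the K-fold item.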

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but it is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. Let $n_A$ and $n_B$ be the sample sizes of the two groups. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T(obs) = \bar{x}_A - \bar{x}_B$, is calculated; (2) the observations from both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of size $n_A$ and $n_B$; the set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations where the difference in means was greater than or equal to $T(obs)$, and the two-sided p-value is calculated as the proportion of permutations where the absolute difference was greater than or equal to $ABS(T(obs))$; (4) equivalently, sort the recorded differences and observe whether $T(obs)$ is contained within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
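
When the number of possible relabelings is too large to enumerate, the permutation distribution is usually approximated by random relabelings. The R sketch below illustrates this Monte Carlo variant; the two data vectors, the seed, the number of relabelings and the "+ 1" adjustment to the p-value are illustrative assumptions, not part of the original text.

```r
# Monte Carlo permutation test for a difference in means.
set.seed(42)                                              # arbitrary seed
a <- c(12.1, 14.3, 11.8, 15.2, 13.9, 14.8, 13.1, 15.0)    # hypothetical group A
b <- c(10.2, 11.5, 12.0, 10.9, 11.1, 12.4, 10.5)          # hypothetical group B
n_a <- length(a)
pooled <- c(a, b)

t_obs <- mean(a) - mean(b)            # observed difference in means, T(obs)

R <- 9999                             # number of random relabelings
t_perm <- replicate(R, {
  idx <- sample(length(pooled), n_a)  # relabel: sampling without replacement
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Proportion of relabelings at least as extreme as T(obs);
# the "+ 1" counts the observed labeling itself (an assumed convention)
p_one_sided <- (sum(t_perm >= t_obs) + 1) / (R + 1)
p_two_sided <- (sum(abs(t_perm) >= abs(t_obs)) + 1) / (R + 1)
print(c(t_obs = t_obs, one_sided = p_one_sided, two_sided = p_two_sided))
```

Increasing <code>R</code> reduces the Monte Carlo error of the estimated p-value at the cost of more computation.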

====Simulation====
A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With a large enough sample size, according to the CLT, this distribution is approximately multivariate normal.
*Steps: (1) make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients, (2) choose a quantity of interest (QI), such as an expected value, predicted probability, odds ratio or first difference, (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes, (4) calculate the QI with each set of simulated coefficients, (5) set the key variable to a new value, (6) calculate the QI again with each set of simulated coefficients, (7) repeat as appropriate, (8) efficiently summarize the distribution of the computed QI at each value of the key variable.

*Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; flexible across many different types of models, QIs and variable specifications; easy to reuse after doing it once; can be much easier than working with analytic solutions.

*Limitations: relies on CLT to justify asymptotic normality (fully Bayesian model using MCMC could produce exact finite-sample distribution; bootstrapping would require no distributional assumption); computational intensity; large models can produce lots of uncertainty around quantity of interest.
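
The steps above can be sketched in R for a linear model. Everything specific here is an assumption made for illustration: the simulated data, the variable names <code>x1</code> and <code>x2</code>, the choice of the expected value of <code>y</code> as the QI, and the use of a Cholesky factor to draw from the multivariate normal distribution.

```r
# Simulating a quantity of interest (QI) from the estimated coefficient distribution.
set.seed(7)                                         # arbitrary seed
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1 * dat$x2 + rnorm(n)     # hypothetical data

fit <- lm(y ~ x1 + x2, data = dat)
beta_hat <- coef(fit)          # point estimates
V <- vcov(fit)                 # estimated covariance matrix of the coefficients

# Step (1): draw coefficient vectors from N(beta_hat, V) using the Cholesky factor of V
n_sims <- 1000
p <- length(beta_hat)
z <- matrix(rnorm(n_sims * p), nrow = n_sims)
sim_beta <- sweep(z %*% chol(V), 2, beta_hat, "+")  # one simulated coefficient vector per row

# Steps (2)-(7): QI = expected value of y, with x1 set to each value of interest
# and x2 held at its sample mean
x1_values <- c(-1, 0, 1)
qi <- sapply(x1_values, function(v) as.vector(sim_beta %*% c(1, v, mean(dat$x2))))

# Step (8): summarize the simulated QI at each value of x1
qi_summary <- t(apply(qi, 2, quantile, probs = c(0.025, 0.5, 0.975)))
rownames(qi_summary) <- paste0("x1=", x1_values)
print(qi_summary)
```

Each row of the summary gives a simulated 95% interval and median for the expected value of <code>y</code> at one value of <code>x1</code>.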

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem (CLT) using a new SOCR applet and demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulations and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], titled ''Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models'', provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents, through examples, methods for contrasting two or more mediators within a single model. The paper gives an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and describes software implementations of these methods, including SAS and SPSS macros.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents the resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference, and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities, and a video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] describes an experiment on sampling distributions and the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply concepts such as a native distribution, sample distribution and numerical parameter estimate to various types of activities.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
 > names <- c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
 > N <- length(names)
 > sample(names, N, replace=FALSE)
 [1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
 [8] "Jack" "Tim" "Alfred"

 > sample(names, N, replace=TRUE)
 [1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
 [8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate the stock closing price $S_t$ on 252 trading days, where $S_t$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$ and $v=0.01\%$.
# Now suppose you bought a call option on this stock with strike price 40. Based on your simulated data, on what percentage of days could you profit from exercising the call option? (That is, on what percentage of days is $S_t$ greater than 40?)

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T17:36:42Z

Jslavine: /* Resampling methods */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. ''Resampling'' is any of a variety of methods in which the following processes are implemented: (1) estimating the precision of sample statistics (e.g., medians, percentiles) by using subsets of available data (''jackknifing'') or drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchanging labels on data points when performing significance tests (''permutation tests''); or (3) validating models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real world processes or systems over time. We usually apply simulation after a model, which represents the key characteristics of the process, is developed. Simulation is widely applied in many contexts such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult and/or costly to work with, and it provides us with an easier way to obtain data about the system or test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process is very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use these to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have few data from the past few years and we notice that they follow a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; the patterns in these samples are then summarized and analyzed. In resampling methods, the simulated samples are drawn from the existing sample of data and not from a theoretically defined data generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP but can still learn about it.

*Principles: The assumption is that there is some population DGP that remains unobserved and that the DGP produced the one sample of data you have; all information about the population contained in the original data set is also contained in the distribution of these simulated samples. We draw a new ‘sample’ of data that consists of a mix of the observations from the original sample and repeat this many times so we have many new simulated ‘samples’. One can consider the original data set to be a reasonable representation of the population, and the distribution of parameter estimates produced from running a model on a series of resampled data sets will provide a good approximation of the distribution in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter such as a median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls in the broader class of resampling methods.

*Situations where bootstrapping applies: (1) when the theoretical distribution of a statistic of interest is complicated or unknown; (2) when the sample size is insufficient for straightforward statistical inference; (3) when power calculations have to be performed, and a small pilot sample is available.

*It is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample). More formally, the bootstrap works by treating inference of the true probability distribution, given the original data, as being analogous to inference of the empirical distribution given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the empirical distribution. If the empirical distribution is a reasonable approximation to the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (say 1000), (5) treat the distribution of your estimated statistics of interest as an estimate of the population distribution of that statistic.

*Key features of the bootstrap: the draws must be independent, and each observation in the observed sample must have an equal chance of being selected; the simulated sample must be of size N to take full advantage of the information in the sample; resampling must be done with replacement, because otherwise every simulated sample of size N would contain exactly the same observations as the original sample; resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap scheme: (1) case resampling: the Monte Carlo algorithm; (2) estimating the distribution of sample mean; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: it is simple, and it is straightforward to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: does not provide general finite-sample guarantees; the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
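
The remark that some cases will not appear at all in a given simulated sample can be quantified with a short calculation. Assuming N independent draws with equal selection probability 1/N, as described in the key features above, the probability that a particular observation is never selected is

: <math>P(\text{observation } i \text{ not selected}) = \left(1 - \frac{1}{N}\right)^{N} \longrightarrow e^{-1} \approx 0.368 \quad \text{as } N \to \infty .</math>

So, for moderately large N, a bootstrap sample contains on average roughly 63% of the distinct original observations, with the remaining positions filled by repeats.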

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is systematically recomputing the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*The jackknife estimate of variance tends asymptotically to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*The jackknife is not consistent for the sample median. In the case of a unimodal variate, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*It relies on the independence of the data. Extensions of the jackknife to allow for dependence in the data have been proposed.

*Advantages: good at detecting outliers/influential cases. The sub-sample estimates that differ most from the rest indicate the cases that have the most influence on the estimates in the original full-sample analysis.

*Limitations: The jackknife is less general than the bootstrap, and thus used less frequently; it does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples; it does not perform well in small samples because you don’t end up generating many resamples.
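
For reference, the standard leave-one-out jackknife estimates can be written out explicitly. The notation is introduced here for illustration and does not appear in the original text: <math>\hat{\theta}</math> is the statistic computed from all ''n'' observations, and <math>\hat{\theta}_{(i)}</math> is the same statistic computed with observation ''i'' left out.

: <math>\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}, \qquad \widehat{\text{bias}}_{jack} = (n-1)\left(\hat{\theta}_{(\cdot)} - \hat{\theta}\right), \qquad \widehat{\text{var}}_{jack} = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\right)^{2} .</math>

When the statistic is the sample mean, the bias estimate is zero and the variance estimate reduces to the familiar <math>s^2/n</math>.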

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model, assessing a statistical model on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) randomly partition the available data into a training set and a testing set, (2) fit the model on the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average to reduce variability.

*Types of CV:
**Leave-one-out CV: an iterative method with number of iterations = sample size, in which each observation serves as the validation set exactly once. Steps: 1) delete observation #1 from the data, 2) fit the model on observations #2-n, 3) apply the coefficients from step #2 to observation #1 and calculate the chosen fit measure, 4) delete observation #2 from the data, 5) fit the model on observations #1 and #3-n, 6) apply the coefficients from step #5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**K-fold cross-validation: splits the data into K subsets, and each is held out in turn as the validation set. This avoids self-influence. For comparison, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: training and testing data must be random samples from the same population; will show biggest differences from in-sample measures when n is small; higher computational demand than calculating in-sample measures; subject to researcher’s selection of an appropriate fit statistic.
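
A minimal sketch of K-fold cross-validation in R is given below. The simulated data, the seed, K = 5 and the use of mean squared error as the fit statistic are assumptions chosen for illustration.

```r
# K-fold cross-validation for a simple linear regression.
set.seed(2)                                        # arbitrary seed
n <- 60
K <- 5
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 0.8 * dat$x + rnorm(n, sd = 2)        # hypothetical data

# Randomly assign each row to one of K equally sized folds
fold <- sample(rep(seq_len(K), length.out = n))

fold_mse <- sapply(seq_len(K), function(k) {
  train <- dat[fold != k, ]                        # K - 1 folds used for fitting
  test <- dat[fold == k, ]                         # held-out validation fold
  fit <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)  # prediction error on the held-out fold
})

cv_mse <- mean(fold_mse)                           # average over the K validation sets
print(fold_mse)
print(cv_mse)
```

Setting <code>K</code> equal to <code>n</code> turns this into leave-one-out cross-validation.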

====Permutation test====
A permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is another form of resampling, but it is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. Let $n_A$ and $n_B$ be the sample sizes of the two groups. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T(obs) = \bar{x}_A - \bar{x}_B$, is calculated; (2) the observations from both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of size $n_A$ and $n_B$; the set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations where the difference in means was greater than or equal to $T(obs)$, and the two-sided p-value is calculated as the proportion of permutations where the absolute difference was greater than or equal to $ABS(T(obs))$; (4) equivalently, sort the recorded differences and observe whether $T(obs)$ is contained within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
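
For small samples the exact test described above can be carried out by enumerating every split of the pooled data. The R sketch below does this with <code>combn()</code>; the two data vectors and the small tolerance used to guard against floating-point ties are illustrative assumptions.

```r
# Exact permutation test: enumerate every way of splitting the pooled data.
a <- c(19.5, 22.1, 24.3, 21.8, 23.0)    # hypothetical group A
b <- c(17.2, 18.9, 20.4, 16.8)          # hypothetical group B
pooled <- c(a, b)
n_a <- length(a)

t_obs <- mean(a) - mean(b)              # observed difference in means, T(obs)

# Each column of 'splits' holds the indices labeled "A" in one split;
# there are choose(9, 5) = 126 splits in total
splits <- combn(length(pooled), n_a)
t_all <- apply(splits, 2, function(idx) mean(pooled[idx]) - mean(pooled[-idx]))

tol <- 1e-9                             # tolerance for floating-point ties
p_one_sided <- mean(t_all >= t_obs - tol)
p_two_sided <- mean(abs(t_all) >= abs(t_obs) - tol)
print(c(n_splits = length(t_all), one_sided = p_one_sided, two_sided = p_two_sided))
```

The number of splits grows very quickly with the sample sizes, so full enumeration is practical only for small groups.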

====Simulation====
A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With a large enough sample size, according to the CLT, this distribution is approximately multivariate normal.
*Steps: (1) make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients, (2) choose a quantity of interest (QI), such as an expected value, predicted probability, odds ratio or first difference, (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes, (4) calculate the QI with each set of simulated coefficients, (5) set the key variable to a new value, (6) calculate the QI again with each set of simulated coefficients, (7) repeat as appropriate, (8) efficiently summarize the distribution of the computed QI at each value of the key variable.

*Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; flexible across many different types of models, QIs and variable specifications; easy to reuse after doing it once; can be much easier than working with analytic solutions.

*Limitations: relies on CLT to justify asymptotic normality (fully Bayesian model using MCMC could produce exact finite-sample distribution; bootstrapping would require no distributional assumption); computational intensity; large models can produce lots of uncertainty around quantity of interest.

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem (CLT) using a new SOCR applet and demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulations and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], titled ''Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models'', provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents, through examples, methods for contrasting two or more mediators within a single model. The paper gives an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and describes software implementations of these methods, including SAS and SPSS macros.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents the resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference, and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities, and a video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] describes an experiment on sampling distributions and the Central Limit Theorem. It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply concepts such as a native distribution, sample distribution and numerical parameter estimate to various types of activities.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
 > names <- c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
 > N <- length(names)
 > sample(names, N, replace=FALSE)
 [1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
 [8] "Jack" "Tim" "Alfred"

 > sample(names, N, replace=TRUE)
 [1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
 [8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate the stock closing price $S_t$ on 252 trading days, where $S_t$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$ and $v=0.01\%$.
# Now suppose you bought a call option on this stock with strike price 40. Based on your simulated data, on what percentage of days could you profit from exercising the call option? (That is, on what percentage of days is $S_t$ greater than 40?)

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T17:32:11Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research and projects from various fields. ''Resampling'' is any of a variety of methods in which the following processes are implemented: (1) estimating the precision of sample statistics (e.g., medians, percentiles) by using subsets of available data (''jackknifing'') or drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchanging labels on data points when performing significance tests (''permutation tests''); or (3) validating models by using random subsets (e.g., bootstrapping, ''cross-validation''). We are going to introduce some common resampling techniques including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real world processes or systems over time. We usually apply simulation after a model, which represents the key characteristics of the process, is developed. Simulation is widely applied in many contexts such as simulation of technology for performance optimization, testing and video games. It is often applied when the real system is not accessible or is difficult and/or costly to work with, and it provides us with an easier way to obtain data about the system or test it. We are going to present an introduction to simulation including the basic methods, applications, advantages and limitations.

===Motivation===
Imagine we want to evaluate the quality of a system or process, but data on the process is very hard to collect. How can we evaluate the system without having to collect samples? If we know the characteristics of the data set, for example, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use it to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case in which, instead of knowing the exact characteristics of the data, we have only a small amount of data from the past few years and we notice that it follows a certain pattern. Here, we can use this data set to work out the characteristics of the data and develop a model. We can then generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; patterns in these samples are then summarized and analyzed. Unlike simulation from a known model, however, the simulated samples are drawn from the existing sample of data in hand and not from a theoretically defined data-generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP, but the goal of learning about the DGP remains.

*Principles: the assumption is that some population DGP remains unobserved and that this DGP produced the one sample of data in hand. We then draw a new ‘sample’ that consists of a different mix of the cases in the original sample, and repeat this many times to obtain a large number of new simulated ‘samples’. All the information about the population contained in the original sample is also contained in the distribution of these simulated samples. If the sample in hand is a reasonable representation of the population, then the distribution of parameter estimates produced by running a model on a series of resampled data sets provides a good approximation of the distribution of those estimates in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter such as a median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls in the broader class of resampling methods.

*Situations where bootstrapping applies: (1) when the theoretical distribution of a statistic of interest is complicated or unknown; (2) when the sample size is insufficient for straightforward statistical inference; (3) when power calculations have to be performed, and a small pilot sample is available.

*It is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample). More formally, the bootstrap works by treating inference of the true probability distribution, given the original data, as being analogous to inference of the empirical distribution, given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the empirical distribution. If the empirical distribution is a reasonable approximation to the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (say 1000), (5) treat the distribution of your estimated statistics of interest as an estimate of the sampling distribution of that statistic.

*Key features of the bootstrap: the draws must be independent, and each observation in the observed sample must have an equal chance of being selected; the simulated sample must be of size N to take full advantage of the information in the sample; resampling must be done with replacement, because otherwise every simulated sample of size N would contain exactly the same cases as the original sample; resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap scheme: (1) case resampling: the Monte Carlo algorithm; (2) estimating the distribution of sample mean; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: it is simple, and it provides a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: does not provide general finite-sample guarantees; the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.
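
The common process listed above can be sketched in a few lines of R. This is only an illustrative sketch: the data vector <code>x</code>, the seed, the choice of the median as the statistic, and the names <code>boot_medians</code>, <code>boot_se</code> and <code>boot_ci</code> are assumptions made for the example and do not come from the original text.

```r
# Case-resampling bootstrap of the sample median (illustrative sketch).
set.seed(1)                            # arbitrary seed, for reproducibility only
x <- rexp(50, rate = 1 / 10)           # hypothetical observed sample, N = 50
N <- length(x)
B <- 1000                              # number of bootstrap replicates

# Steps (2)-(4): draw N cases with replacement, save the statistic, repeat B times
boot_medians <- replicate(B, median(sample(x, N, replace = TRUE)))

# Step (5): summarize the bootstrap distribution of the statistic
boot_se <- sd(boot_medians)                          # bootstrap standard error
boot_ci <- quantile(boot_medians, c(0.025, 0.975))   # 95% percentile interval
```

The percentile interval used here is only one of several ways to build a bootstrap confidence interval; it was chosen because it follows directly from step (5).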

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is systematically recomputing the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*The jackknife estimate of variance tends asymptotically to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*Jackknife is not consistent for the sample median. In the case of a unimodal variate the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*It is dependent on the independence of the data. Extensions of the jackknife to allow for dependence in the data have been proposed.

*Advantages: good at detecting outliers and influential cases. The sub-sample estimates that differ most from the rest indicate the cases that have the most influence on the estimates in the original full-sample analysis.

*Limitations: The jackknife is less general than the bootstrap, and thus used less frequently; it does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples; it does not perform well in small samples because you don’t end up generating many resamples.
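
As a hedged illustration of the leave-one-out idea described above, the following R sketch computes jackknife estimates of bias and standard error. The sample <code>x</code>, the seed, and the helper <code>plugin_var</code> (the variance estimator that divides by n rather than n - 1) are hypothetical choices made for this example.

```r
# Leave-one-out jackknife for a (biased) plug-in variance estimator.
set.seed(2)                               # arbitrary seed
x <- rnorm(30, mean = 5, sd = 2)          # hypothetical sample
n <- length(x)

plugin_var <- function(v) mean((v - mean(v))^2)   # divides by n, hence biased

theta_hat <- plugin_var(x)                # estimate from the full sample
theta_loo <- sapply(seq_len(n), function(i) plugin_var(x[-i]))  # n replicates
theta_bar <- mean(theta_loo)

jack_bias <- (n - 1) * (theta_bar - theta_hat)                   # bias estimate
jack_se   <- sqrt((n - 1) / n * sum((theta_loo - theta_bar)^2))  # standard error
theta_corrected <- theta_hat - jack_bias  # bias-corrected estimate
```

For this particular statistic the bias-corrected value coincides with the usual unbiased sample variance <code>var(x)</code>, which gives a convenient check on the computation.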

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model by assessing it on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validation sets; a model is fit to the remaining data (the training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) randomly partition the available data into a training set and a testing set, (2) fit the model on the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average the results to reduce variability.

*Types of CV:
**Leave-one-out CV: an iterative method with number of iterations = sample size, in which each observation serves as the testing set exactly once. Steps: 1) delete observation #1 from the data, 2) fit the model on observations #2-n, 3) apply the coefficients from step 2 to observation #1 and calculate the chosen fit measure, 4) restore observation #1 and delete observation #2 from the data, 5) fit the model on observations #1 and #3-n, 6) apply the coefficients from step 5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**K-fold cross-validation: splits the data into K subsets, each of which is held out in turn as the validation set. This avoids self-influence. For comparison, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: training and testing data must be random samples from the same population; will show biggest differences from in-sample measures when n is small; higher computational demand than calculating in-sample measures; subject to researcher’s selection of an appropriate fit statistic.
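
The K-fold procedure described above might look as follows in R for a simple linear regression. The simulated data frame <code>dat</code>, the seed, the choice K = 5, and the use of mean squared error as the fit measure are all assumptions made for illustration.

```r
# K-fold cross-validation of a simple linear regression (illustrative sketch).
set.seed(3)                                   # arbitrary seed
n <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n)           # hypothetical linear relationship

K <- 5
fold <- sample(rep(seq_len(K), length.out = n))   # random fold label per row

cv_mse <- sapply(seq_len(K), function(k) {
  train <- dat[fold != k, ]                   # fit on the other K - 1 folds
  test  <- dat[fold == k, ]                   # held-out validation fold
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)   # out-of-sample fit measure
})

cv_estimate   <- mean(cv_mse)                 # average over the K folds
in_sample_mse <- mean(residuals(lm(y ~ x, data = dat))^2)   # for comparison
```

Comparing <code>cv_estimate</code> with <code>in_sample_mse</code> shows how far the in-sample measure may understate prediction error; setting K equal to n turns the same code into leave-one-out CV.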

====Permutation test====
Permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is just another form of resampling but is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ and sample sizes $n_A$ and $n_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T(obs)$, is calculated; (2) the observations of both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of size $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations where the difference in means was greater than or equal to $T(obs)$; the two-sided p-value is calculated as the proportion of permutations where the absolute difference was greater than or equal to $|T(obs)|$; (4) equivalently, sort the recorded differences and check whether $T(obs)$ is contained within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.
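
A minimal R sketch of this two-group test is given below. Because enumerating every possible split of the pooled data is often infeasible, the sketch uses a random sample of permutations, which approximates the exact permutation distribution. The group data <code>a</code> and <code>b</code>, the seed, and the number of permutations <code>R</code> are hypothetical.

```r
# Monte Carlo permutation test for a difference in two group means.
set.seed(4)                              # arbitrary seed
a <- rnorm(12, mean = 10, sd = 2)        # hypothetical group A
b <- rnorm(15, mean = 12, sd = 2)        # hypothetical group B

pooled <- c(a, b)
n_a    <- length(a)
t_obs  <- mean(a) - mean(b)              # step (1): observed difference T(obs)

R <- 5000                                # number of random relabelings
t_perm <- replicate(R, {
  idx <- sample(length(pooled), n_a)     # relabel: draw group A without replacement
  mean(pooled[idx]) - mean(pooled[-idx]) # step (2): difference under the relabeling
})

# Step (3): proportions of permuted differences at least as extreme as T(obs)
p_one_sided <- mean(t_perm >= t_obs)
p_two_sided <- mean(abs(t_perm) >= abs(t_obs))
```

Note that <code>sample()</code> is called without <code>replace = TRUE</code>, in line with the statement above that permutation tests resample without replacement.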

====Simulation====
A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With a large enough sample size, according to the Central Limit Theorem (CLT), this distribution is approximately multivariate normal.
*Steps: (1) make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients; (2) choose a quantity of interest (QI), e.g., expected value, predicted probability, odds ratio or first difference; (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (4) calculate the QI with each set of simulated coefficients; (5) set the key variable to a new value; (6) calculate the QI again with each set of simulated coefficients; (7) repeat as appropriate; (8) efficiently summarize the distribution of the computed QI at each value of the key variable.

*Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; flexible to many different types of models, QIs and variable specifications; after doing it once, easy to use; can be much easier than working with analytic solutions.

*Limitations: relies on CLT to justify asymptotic normality (fully Bayesian model using MCMC could produce exact finite-sample distribution; bootstrapping would require no distributional assumption); computational intensity; large models can produce lots of uncertainty around quantity of interest.
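
The simulation steps above can be sketched in base R as follows, assuming a logistic regression and taking the predicted probability as the quantity of interest. The simulated data, the model <code>y ~ x1 + x2</code>, the seed, and all object names are hypothetical; the multivariate normal draws are generated with a Cholesky factor of the estimated covariance matrix so that no add-on package is required.

```r
# Simulating a quantity of interest (QI) from estimated model coefficients.
set.seed(5)                                   # arbitrary seed
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.4))
dat$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * dat$x1 + 0.6 * dat$x2))
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

beta_hat <- coef(fit)                         # point estimates
V <- vcov(fit)                                # estimated covariance matrix
S <- 1000                                     # number of simulated coefficient sets

# Step (1): draw from N(beta_hat, V); chol(V) is upper triangular with t(U) %*% U = V
Z <- matrix(rnorm(S * length(beta_hat)), nrow = S)
sim_beta <- sweep(Z %*% chol(V), 2, beta_hat, "+")

# Steps (2)-(7): QI = predicted probability; vary x1, hold x2 at its mode
x1_values <- c(-1, 0, 1)
x2_mode <- as.numeric(names(which.max(table(dat$x2))))
qi <- sapply(x1_values, function(v) {
  as.vector(plogis(sim_beta %*% c(1, v, x2_mode)))   # S simulated probabilities
})

# Step (8): median and 95% interval of the QI at each value of x1
qi_summary <- apply(qi, 2, quantile, probs = c(0.025, 0.5, 0.975))
```

Each column of <code>qi_summary</code> corresponds to one value of <code>x1</code>, so the spread of each column reflects the estimation uncertainty in the QI at that value.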

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem (CLT) using a new SOCR applet and demonstration activity. It describes an innovative effort to use technological tools for improving student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulations and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], titled ''Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models'', provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model through examples. The paper presents an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and describes software implementations of these methods, including SAS and SPSS macros, among others.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents the resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference, and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities, and a video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] is an experiment on sampling distributions and the Central Limit Theorem (CLT). It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply them to various types of activities, along with the concepts of a native distribution, sample distribution and numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
 > names <- c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
 > N <- length(names)
 > sample(names, N, replace=FALSE)
 [1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
 [8] "Jack" "Tim" "Alfred"

 > sample(names, N, replace=TRUE)
 [1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
 [8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in articles 4.2 and 4.3.
# Simulate the stock closing price $S_t$ over 252 trading days, where $S_{t}$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$, $v=0.01\%$.
# Now suppose you bought a call option on this stock with strike price 40. Based on your simulation data, on what percentage of days could you profit from exercising the call option? (That is, the percentage of days on which $S_t$ is greater than 40.)

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ResamplingSimulation

2014-08-31T17:29:17Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Resampling and Simulation ==

===Overview===
In statistics, ''resampling'' and ''simulation'' are two important concepts with wide applications in research across many fields. ''Resampling'' refers to any of a variety of methods that do one of the following: (1) estimate the precision of sample statistics (e.g., medians, percentiles) by using subsets of the available data (''jackknifing'') or by drawing randomly with replacement from a set of data (''bootstrapping''); (2) exchange labels on data points when performing significance tests (''permutation tests''); or (3) validate models by using random subsets of the data (bootstrapping, ''cross-validation''). We introduce some common resampling techniques, including bootstrapping, jackknifing, cross-validation and permutation tests. ''Simulation'' involves imitating real-world processes or systems over time. We usually apply simulation after developing a model that represents the key characteristics of the process. Simulation is widely applied in many contexts, such as simulation of technology for performance optimization, testing and video games. It is often used when the real system is not accessible, or is difficult or costly to work with, and it provides an easier way to obtain data about the system or to test it. We also present an introduction to simulation, including the basic methods, applications, advantages and limitations.

===Motivation===
Suppose we want to evaluate the quality of a system or process, but the data is very hard to collect. How can we evaluate it without actually having to take samples from the system? In this case, it would be great if we knew the characteristics of the data set; say, if we know it follows a normal distribution, then we could easily generate a series of data following a normal distribution and use it to test the system. In fact, we can easily generate a large amount of data and test the system with more power. Consider another case where, instead of knowing the exact characteristics of the data, we only have very little data from the last few years, which follows a certain pattern. Here, we can use this data set to work out the characteristics of the data and generate a new data set from the model we developed. A popular example is the bootstrapping method in the interest rate model. In order to learn more about resampling and simulation methods, we are going to introduce the fundamental concepts, rules and methodologies commonly applied in these fields to prepare students with the necessary background in resampling and simulation.

===Theory===
==== Resampling methods====
Resampling methods use a computer to generate a large number of simulated samples; patterns in these samples are then summarized and analyzed. Unlike simulation from a known model, however, the simulated samples are drawn from the existing sample of data in hand and not from a theoretically defined data-generating process (DGP). Thus, in resampling methods, the researcher does not know or control the DGP, but the goal of learning about the DGP remains.

*Principles: the assumption is that some population DGP remains unobserved and that this DGP produced the one sample of data in hand. We then draw a new ‘sample’ that consists of a different mix of the cases in the original sample, and repeat this many times to obtain a large number of new simulated ‘samples’. All the information about the population contained in the original sample is also contained in the distribution of these simulated samples. If the sample in hand is a reasonable representation of the population, then the distribution of parameter estimates produced by running a model on a series of resampled data sets provides a good approximation of the distribution of those estimates in the population. Resampling methods can be either parametric or non-parametric.

====Bootstrapping====
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter such as a median, odds ratio or regression coefficient. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods, and it falls in the broader class of resampling methods.

*Situations where bootstrapping applies: (1) when the theoretical distribution of a statistic of interest is complicated or unknown; (2) when the sample size is insufficient for straightforward statistical inference; (3) when power calculations have to be performed, and a small pilot sample is available.

*It is the practice of estimating properties of an estimator by measuring those properties when sampling from an approximating distribution, say the empirical distribution of the observed data. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. It may also be used for constructing hypothesis tests.

*The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample). More formally, the bootstrap works by treating inference of the true probability distribution, given the original data, as being analogous to inference of the empirical distribution, given the resampled data. The accuracy of inferences regarding the empirical distribution using the resampled data can be assessed because we know the empirical distribution. If the empirical distribution is a reasonable approximation to the true probability distribution, then the quality of inference on the true probability distribution can in turn be inferred.

*Common process: (1) begin with an observed sample of size N, (2) generate a simulated sample of size N by drawing observations from your observed sample independently and with replacement, (3) compute and save the statistic of interest, (4) repeat this process many times (say 1000), (5) treat the distribution of your estimated statistics of interest as an estimate of the sampling distribution of that statistic.

*Key features of the bootstrap: the draws must be independent, and each observation in the observed sample must have an equal chance of being selected; the simulated sample must be of size N to take full advantage of the information in the sample; resampling must be done with replacement, because otherwise every simulated sample of size N would contain exactly the same cases as the original sample; resampling with replacement means that in any given simulated sample, some cases might appear more than once while others will not appear at all.

*Types of bootstrap scheme: (1) case resampling: the Monte Carlo algorithm; (2) estimating the distribution of sample mean; (3) regression; (4) Bayesian bootstrap; (5) smooth bootstrap; (6) parametric bootstrap; (7) resampling residuals; (8) Gaussian process regression bootstrap; (9) wild bootstrap; (10) block bootstrap.

*Advantages: it is simple, and it provides a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution; it is also an appropriate way to control and check the stability of the results.

*Limitations: does not provide general finite-sample guarantees; the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis where these would be more formally stated in other approaches.

====Jackknife====
The Jackknife method estimates the bias and standard error of a statistic when a random sample of observations is used to calculate it. The basic idea is systematically recomputing the statistic estimate, leaving out one or more observations at a time from the sample set. From the new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.

*The jackknife estimate of variance tends asymptotically to the true value almost surely. The jackknife is consistent for sample means, sample variances, etc.

*Jackknife is not consistent for the sample median. In the case of a unimodal variate the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
*It is dependent on the independence of the data. Extensions of the jackknife to allow for dependence in the data have been proposed.

*Advantages: good at detecting outliers and influential cases. The sub-sample estimates that differ most from the rest indicate the cases that have the most influence on the estimates in the original full-sample analysis.

*Limitations: The jackknife is less general than the bootstrap, and thus used less frequently; it does not perform well if the statistic under consideration does not change ‘smoothly’ across simulated samples; it does not perform well in small samples because you don’t end up generating many resamples.

==== Cross-validation====
Cross-validation (CV) is a statistical method for validating a predictive model by assessing it on a data set that is independent of the data set used to fit the model. Subsets of the data are held out for use as validation sets; a model is fit to the remaining data (the training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.

*Steps: (1) randomly partition the available data into a training set and a testing set, (2) fit the model on the training set, (3) take the parameter estimates from that model and use them to calculate a measure of fit on the testing set, (4) repeat several times and average the results to reduce variability.

*Types of CV:
**Leave-one-out CV: an iterative method with number of iterations = sample size, in which each observation serves as the testing set exactly once. Steps: 1) delete observation #1 from the data, 2) fit the model on observations #2-n, 3) apply the coefficients from step 2 to observation #1 and calculate the chosen fit measure, 4) restore observation #1 and delete observation #2 from the data, 5) fit the model on observations #1 and #3-n, 6) apply the coefficients from step 5 to observation #2 and calculate the chosen fit measure, 7) repeat until all observations have been deleted once.
**K-fold cross-validation: splits the data into K subsets, each of which is held out in turn as the validation set. This avoids self-influence. For comparison, in regression analysis methods such as linear regression, each y value draws the regression line toward itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.

*Limitations of CV: training and testing data must be random samples from the same population; will show biggest differences from in-sample measures when n is small; higher computational demand than calculating in-sample measures; subject to researcher’s selection of an appropriate fit statistic.

====Permutation test====
Permutation (or randomization or re-randomization) test is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. It is just another form of resampling but is done without replacement.

*Rather than assume a distribution for the null hypothesis, we simulate what it would be by randomly reconfiguring our sample lots of times (say 1000) in a way that ‘breaks’ the relationship in our sample data.

*Suppose we have group A and group B with sample means $\bar{x}_A$ and $\bar{x}_B$ and sample sizes $n_A$ and $n_B$, respectively, and we want to test, at a 5% significance level, whether they come from the same distribution. A permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis $H_0$: the two groups have identical probability distributions. The test proceeds as follows: (1) the observed difference in means between groups A and B, $T(obs)$, is calculated; (2) the observations of both groups are pooled, and the difference in sample means is calculated and recorded for each possible way of dividing the pooled values into two groups of size $n_A$ and $n_B$. The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label does not matter; (3) the one-sided p-value of the test is calculated as the proportion of permutations where the difference in means was greater than or equal to $T(obs)$; the two-sided p-value is calculated as the proportion of permutations where the absolute difference was greater than or equal to $|T(obs)|$; (4) equivalently, sort the recorded differences and check whether $T(obs)$ is contained within the middle 95% of them; if not, reject $H_0$ at the 5% significance level.

====Simulation====
A common assumption is that the coefficients we estimate are drawn from a probability distribution that describes the larger population. With a large enough sample size, according to the Central Limit Theorem (CLT), this distribution is approximately multivariate normal.
*Steps: (1) make random draws from this distribution to simulate many ‘hypothetical values’ of the coefficients; (2) choose a quantity of interest (QI), e.g., expected value, predicted probability, odds ratio or first difference; (3) set a key variable in the model to a theoretically interesting value and the rest to their means or modes; (4) calculate the QI with each set of simulated coefficients; (5) set the key variable to a new value; (6) calculate the QI again with each set of simulated coefficients; (7) repeat as appropriate; (8) efficiently summarize the distribution of the computed QI at each value of the key variable.

*Advantages: provides more information than just a table of regression output; accounts for uncertainty in the QI; flexible to many different types of models, QIs and variable specifications; after doing it once, easy to use; can be much easier than working with analytic solutions.

*Limitations: relies on CLT to justify asymptotic normality (fully Bayesian model using MCMC could produce exact finite-sample distribution; bootstrapping would require no distributional assumption); computational intensity; large models can produce lots of uncertainty around quantity of interest.
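
The simulation steps listed above can be sketched as follows. Everything numeric here is an assumption made for illustration: the logistic model with one covariate, the estimates <code>beta_hat</code>, and the covariance matrix <code>cov_hat</code> are hypothetical and do not come from any data set discussed in this section. Because the model has a single covariate, there are no other variables to hold at their means or modes.

```python
import math
import random
from statistics import mean, quantiles

random.seed(1)

# Hypothetical fitted logistic-regression estimates (illustrative only).
beta_hat = (-1.0, 0.5)                        # intercept, slope
cov_hat = ((0.04, -0.005), (-0.005, 0.01))    # estimated covariance matrix

def draw_coefficients():
    """One draw from the bivariate normal N(beta_hat, cov_hat) via a 2x2 Cholesky factor."""
    (a, b), (_, c) = cov_hat
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return beta_hat[0] + l11 * z1, beta_hat[1] + l21 * z1 + l22 * z2

def predicted_probability(b0, b1, x):
    """Quantity of interest: predicted probability at covariate value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Step (1): simulate many hypothetical coefficient vectors.
draws = [draw_coefficients() for _ in range(5000)]

# Steps (3)-(8): evaluate the QI at several values of the key variable and summarize.
summary = {}
for x in (0.0, 2.0, 4.0):
    qi = [predicted_probability(b0, b1, x) for b0, b1 in draws]
    cuts = quantiles(qi, n=40)                # cut points at 2.5%, 5%, ..., 97.5%
    summary[x] = (mean(qi), cuts[0], cuts[-1])
    print(x, summary[x])
```

Each entry of <code>summary</code> holds the simulated mean of the predicted probability together with an approximate 95% interval, which is the kind of uncertainty statement a plain regression table does not provide.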

===Applications===
* [http://www.amstat.org/publications/jse/v16n2/dinov.html This article] presents an application of the Central Limit Theorem (CLT) using a SOCR applet and demonstration activity. It describes an innovative effort to use technological tools to improve student motivation and learning of the theory, practice and usability of the CLT in probability and statistics courses. The method is based on harnessing the computational libraries developed by SOCR to design a new interactive Java applet and a corresponding demonstration activity that illustrate the meaning and power of the CLT. It includes four experiments that demonstrate the assumptions, meaning and implications of the CLT, as well as hands-on simulations and a number of examples illustrating the theory and application of the CLT.

* [http://link.springer.com/article/10.3758/BRM.40.3.879 This article], titled ''Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models'', provides an overview of simple and multiple mediation and explores three approaches that can be used to investigate indirect processes. It also presents methods for contrasting two or more mediators within a single model. The paper includes an illustrative example, assessing and contrasting potential mediators of the relationship between the helpfulness of socialization agents and job satisfaction, and describes software implementations of these methods, including SAS and SPSS macros.

* [[SOCR_ResamplingSimulation_Activity|This article]] presents the resampling, randomization and simulation activity and illustrates the processes of sampling, resampling and randomization using the SOCR webapp. It aims to demonstrate the concepts of simulation and data generation, illustrate data resampling on a massive scale, reinforce the concept of resampling- and randomization-based statistical inference, and demonstrate the similarities and differences between parametric-based and resampling-based statistical inference. The article provides specific steps to implement the activities, and a video is also provided for reference.

* [[SOCR_EduMaterials_Activities_SamplingDistributionCLTExperiment | This article]] describes an experiment on sampling distributions and the Central Limit Theorem (CLT). It demonstrates the properties of the sampling distributions of various sample statistics and illustrates the CLT through the experiment. The sampling distribution CLT experiment provides a publicly accessible simulation that demonstrates characteristics of various sample statistics and the CLT, and empirically demonstrates that the sample average is unique. The article helps users develop a better understanding of the two topics and apply them to various types of activities, using concepts such as a native distribution, a sample distribution and a numerical parameter estimate.

===Software ===
*[http://ww2.coastal.edu/kingw/statistics/R-tutorials/resample.html R-Tutorial]
*[http://www.socr.ucla.edu/htmls/SOCR_Experiments.html SOCR Experiments]
*[http://socr.ucla.edu/htmls/HTML5/SOCR_Resampling_Webapp/ SOCR Resampling Webapp]

* Sampling with/without replacement in R:
> names<-c('Ann','Tom','William','Tim','Kate','Mike','Rose','Alfred','Jef','Jack')
> N<-length(names)
> sample(names,N,replace=F)
[1] "William" "Kate" "Mike" "Ann" "Jef" "Tom" "Rose"
[8] "Jack" "Tim" "Alfred"

> sample(names,N,replace=T)
[1] "Mike" "Rose" "William" "Rose" "Jef" "Mike" "Jack"
[8] "Rose" "Tom" "Ann"

===Problems===
# Go over the examples in article 4.2 and 4.3.
# Simulate the closing price $S_t$ of a stock on 252 trading days, where $S_{t}$ satisfies $S_t=S_0 e^{vt+\sigma \sqrt{t} Z}$, $Z \sim Normal(0,1)$, with $S_0=36$, $\sigma=2\%$, $v=0.01\%$.
# Now suppose you bought a call on this stock with strike price 40. Using your simulated data, what is the percentage of days on which you can profit from exercising the call option? (That is, the percentage of days on which $S_t$ is greater than 40.)

<hr>
* SOCR Home page: http://www.socr.umich.edu

=== References===
* [http://en.wikipedia.org/wiki/Resampling_(statistics) Resampling Wikipedia]
* [http://en.wikipedia.org/wiki/Cross-validation_(statistics) Cross Validation Wikipedia]
* [http://en.wikipedia.org/wiki/Bootstrapping_(statistics) Bootstrapping Wikipedia]

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ResamplingSimulation}}

SMHS ProbabilityDistributions

2014-08-31T14:20:59Z

Jslavine: /* Theory */

==[[SMHS| Scientific Methods for Health Sciences]] - Probability Distributions ==

===Overview===
Distributions are the fundamental basis of probability theory. There are two types of processes that we observe in nature, discrete and continuous, and they are modeled by the corresponding distributions. (There can also be mixture-, multidimensional and tensor distributions; these are not discussed here). The type of distribution depends on the type of data. Discrete and continuous distributions represent discrete or continuous random variables, respectively. This section aims to introduce various discrete and continuous distributions and discuss the relationships between distributions.

*Discrete distributions: [[AP_Statistics_Curriculum_2007_Distrib_Binomial|Bernoulli distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Binomial|Binomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Multinomial|Multinomial distribution]], [[SOCR_EduMaterials_Activities_Explore_Distributions#Geometric_probability_distribution|Geometric distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#HyperGeometric|Hypergeometric distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Binomial|Negative binomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Multinomial_Distribution_.28NMD.29|Negative multinomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Poisson|Poisson distribution]].

*Continuous distributions: [[AP_Statistics_Curriculum_2007#Chapter_V:_Normal_Probability_Distribution|Normal distribution]], [[SOCR_BivariateNormal_JS_Activity| Multivariate normal distribution]].

===Motivation===
We have talked about different types of data and the fundamentals of probability theory. In order to capture and estimate patterns in data, we introduced the concept of a distribution. A probability distribution assigns a probability to each measurable subset of the possible outcomes of a random experiment. It can be either univariate or multivariate. A univariate distribution gives the probability of a single random variable, while a multivariate distribution (i.e., a joint probability distribution) gives the probability of a random vector, which is a set of two or more random variables taking on various combinations of values. Consider the coin tossing experiment: what distribution would we expect the outcomes to follow?

===Theory===
'''Random variables''': A random variable is a function or a mapping from a sample space into the real numbers (most of the time). In other words, a random variable assigns real values to outcomes of experiments.

'''Probability density / mass functions and the cumulative distribution function'''
The probability density function (pdf) or probability mass function (pmf) of a continuous or discrete random variable, respectively, describes how probability is distributed over the values of the variable. For a discrete random variable, the pmf is the probability of the subset of the sample space $\{s\in S \mid X(s)=x\}\subset S$: $p(x)=P(\{s\in S \mid X(s)=x\})$, for all $x$. For a continuous random variable, the pdf $p(x)$ gives probabilities by integration: $P(a\leq X\leq b)=\int_a^b{p(x)dx}$.

The cumulative distribution function (cdf) $F(x)$ of any random variable $X$ with probability mass or density function $p(x)$ is the total probability of all outcomes $s\in S$ for which $X(s) \leq x$: $F(x)=P(X\leq x)$, for all $x$.

'''Expectation and variance'''
*Expectation: The expected value, expectation or mean, of a discrete random variable $X$ is defined as $E[X]=\sum_i {x_i P(X=x_i)}$. The expected value of a continuous random variable $Y$ is defined as $E[Y]=\int_y{yP(y)dy}$. This is the integral over the domain of $Y$, where $P(y)$ is the probability density function of $Y$. An important property of the expectation is that it is a linear functional, i.e., $E[aX+bY]=aE[X]+bE[Y]$.

*Variance: The variance of a discrete random variable $X$ is defined as $VAR[X]=\sum_i {(x_i-E[X])^2 P(X=x_i)}$. The variance of a continuous random variable $Y$ is defined as $VAR[Y]=\int_y {(y-E[Y])^2 P(y)dy}$. This is the integral over the domain of $Y$, where $P(y)$ is the probability density function of $Y$. The variance (the second central moment) is not linear like the expectation: $VAR[aX]= a^2 VAR[X]$ and $VAR[X+Y]=VAR[X]+VAR[Y]+2COV(X,Y)$.
*Covariance: $COV(X,Y)=E[(X-E[X])(Y-E[Y])]$.
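
The definitions above can be checked numerically on a small example. The joint mass function <code>joint</code> below is hypothetical (chosen only so that the numbers are easy to follow); the sketch computes the expectations, variances and covariance as probability-weighted sums and confirms the linearity of $E$ and the formula for $VAR[X+Y]$.

```python
# Hypothetical joint pmf of (X, Y); the values are illustrative only.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def expect(g):
    """E[g(X, Y)] as a probability-weighted sum over the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov_xy = expect(lambda x, y: (x - ex) * (y - ey))

# Linearity of expectation: E[aX + bY] = aE[X] + bE[Y].
a, b = 2.0, -3.0
mean_lin = expect(lambda x, y: a * x + b * y)

# Variance of a sum, computed directly from the definition.
var_sum = expect(lambda x, y: (x + y - ex - ey) ** 2)

print(ex, ey, var_x, var_y, cov_xy)
print(mean_lin, a * ex + b * ey)
print(var_sum, var_x + var_y + 2 * cov_xy)
```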

====Bernoulli distribution====
A [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Bernoulli_process|Bernoulli trial]] is an experiment whose dichotomous outcomes are random (e.g., ‘head’ vs. ‘tail’). $X(s)= \begin{cases}
1, & s=\text{head} \\
0, & s=\text{tail}
\end{cases}$.
If ''p''=P(''head''), then $E[X]=p$ and $VAR[X]=p(1-p)$.

====Binomial distribution====
Suppose we conduct an experiment observing $n$ trials of a Bernoulli process. If we are interested in the RV $X$ = {Number of heads in $n$ trials}, then $X$ is called a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Binomial_Random_Variables|binomial RV]] and its distribution is called the binomial distribution. We write $X \sim B(n,p)$, where $n$ is the number of trials and $p$ is the probability of heads in one trial. $P(X=x)={n\choose x} p^x (1-p)^{n-x}$, for $x=0,1,\ldots,n$, where ${n\choose x}=\frac {n!} {x!(n-x)!}$ is the binomial coefficient.
$$E[X]=np;VAR[X]=np(1-p)$$
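
As a quick numerical check of the binomial formulas above, the following sketch evaluates the mass function for one hypothetical choice of parameters ($n=10$, $p=0.3$) and compares the mean and variance computed from the definition with $np$ and $np(1-p)$. The function name <code>binomial_pmf</code> is our own.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3                      # hypothetical parameters
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                            # should be 1
mean = sum(x * q for x, q in enumerate(pmf))                # compare with n*p
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))   # compare with n*p*(1-p)

print(total, mean, var)
print(n * p, n * p * (1 - p))
```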

====Multinomial distribution====
The [[AP_Statistics_Curriculum_2007_Distrib_Multinomial|multinomial distribution]] is an extension of the binomial where the experiment consists of $n$ repeated trials and each trial has one of $k$ possible outcomes. In any given trial, the probability $p_i$ that outcome $i$ will occur is constant, and the trials are independent. Here $X_i$ counts the trials that result in outcome $i$.

$P(X_1=r_1 \cap \cdots \cap X_k=r_{k} \mid r_1 + \cdots +r_k=n)={n\choose r_1,\ldots,r_k} p_1^{r_1} p_2^{r_2}\cdots p_k^{r_k}$, for all $r_1+\cdots+r_k=n$, where ${n\choose r_1,\ldots,r_k}=\frac {n!}{r_1! \times \cdots \times r_k!}$.
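
The multinomial mass function can be evaluated directly from the multinomial coefficient. The sketch below uses a hypothetical three-outcome experiment with $n=4$ trials, checks that the probabilities over all count vectors sum to one, and confirms that with two outcomes the formula reduces to the binomial case. The function name <code>multinomial_pmf</code> is an assumption of this example.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1 = r_1, ..., X_k = r_k) for n = sum(counts) independent trials."""
    n = sum(counts)
    coef = factorial(n)
    for r in counts:
        coef //= factorial(r)       # stays an integer at every step
    return coef * prod(p ** r for p, r in zip(probs, counts))

# Hypothetical experiment: 3 possible outcomes, n = 4 trials.
probs = (0.2, 0.3, 0.5)
n = 4
total = sum(
    multinomial_pmf((r1, r2, n - r1 - r2), probs)
    for r1 in range(n + 1)
    for r2 in range(n + 1 - r1)
)

# With k = 2 outcomes the multinomial is the binomial distribution.
two_outcome = multinomial_pmf((3, 7), (0.3, 0.7))
binomial = comb(10, 3) * 0.3 ** 3 * 0.7 ** 7

print(total, two_outcome, binomial)
```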

====Geometric distribution====
The probability distribution of the number, $X$, of Bernoulli trials needed to obtain the first success is called the [[AP_Statistics_Curriculum_2007_Distrib_Dists#Geometric|geometric distribution]]. It is supported on the set $\{1,2,3,\ldots\}$. $P(X=x)=(1-p)^{x-1}p$, for $x = 1, 2, \ldots$

$$E[X]=\dfrac {1} {p},VAR[X]= \frac {1-p} {p^{2}}$$
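
The mean and variance of the geometric distribution can be verified numerically by summing over a long but finite part of the support. The success probability $p=0.2$ and the truncation point are hypothetical choices for this sketch; the truncation is chosen so that the neglected tail is far below floating-point precision.

```python
def geometric_pmf(x, p):
    """P(X = x): first success occurs on trial x."""
    return (1 - p) ** (x - 1) * p

p = 0.2                              # hypothetical success probability
support = range(1, 2001)             # truncated support; the tail is negligible

total = sum(geometric_pmf(x, p) for x in support)
mean = sum(x * geometric_pmf(x, p) for x in support)
var = sum((x - mean) ** 2 * geometric_pmf(x, p) for x in support)

print(total, mean, var)
print(1 / p, (1 - p) / p ** 2)
```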

====Hypergeometric distribution====
The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of $n$ draws from a finite population without replacement. An experimental design for using the [[AP_Statistics_Curriculum_2007_Distrib_Dists#HyperGeometric|hypergeometric distribution]] is illustrated in the table below. A shipment of $N$ objects includes $m$ defective ones. The hypergeometric distribution describes the probability that in a sample of $n$ distinct objects drawn from the shipment, exactly $k$ will be defective.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
|'''Type''' ||'''Drawn''' ||'''Not-Drawn''' || '''Total'''
|-
|Defective || $k$|| $m-k$ || $m$
|-
|Non-Defective || $n-k$ || $N+k-n-m$ ||$N-m$
|-
|Total || $n$|| $N-n$ || $N$
|}
</center>

$$ P(X=k)=\frac {{m \choose k}{N-m \choose n-k}} {N \choose n}, E[X]=\frac{nm}{N}, VAR[X]=\frac{\frac{nm}{N}(1-\frac{m}{N})(N-n)} {N-1}$$
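
The hypergeometric formulas can be checked on a concrete case. In this sketch the shipment size $N=50$, the number of defective objects $m=5$ and the sample size $n=10$ are hypothetical values; the mean and variance computed from the mass function are compared with the closed-form expressions above.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(X = k): exactly k defective objects among n drawn without replacement."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

N, m, n = 50, 5, 10                  # hypothetical shipment, defectives, sample size
support = range(max(0, n - (N - m)), min(n, m) + 1)

total = sum(hypergeom_pmf(k, N, m, n) for k in support)
mean = sum(k * hypergeom_pmf(k, N, m, n) for k in support)
var = sum((k - mean) ** 2 * hypergeom_pmf(k, N, m, n) for k in support)

mean_formula = n * m / N
var_formula = (n * m / N) * (1 - m / N) * (N - n) / (N - 1)
print(total, mean, var)
print(mean_formula, var_formula)
```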

==== Negative binomial distribution====
Suppose $X$ is the trial index ($n$) of the $r^{th}$ success, i.e., the total number of experiments ($n$) needed to get $r$ successes. The [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Binomial| negative binomial distribution]] has the following mass function: $P(X=n)={n-1 \choose r-1} p^r (1-p)^{(n-r)}$, for $n=r,r+1,r+2,\ldots$, where $n$ is the trial number of the $r^{th}$ success.

$$E[X]=\frac {r} {p},VAR[X]=\frac {r(1-p)} {p^{2}}$$

Suppose $Y$ is the number of failures ($k$) before the $r^{th}$ success. $P(Y=k)={k+r-1 \choose k} p^{r} (1-p)^{k}$, for $k=0,1,2,\ldots$. We write $Y \sim NegBin(r,p)$; $P(Y=k)$ is the probability of $k$ failures and $r$ successes in $n = k+r$ $Bernoulli(p)$ trials with a success on the last trial.

$$E[Y]=\frac{r(1-p)}{p},VAR[Y]=\frac {r(1-p)} {p^{2}}$$

NOTE: $X=Y+r,E[X]=E[Y]+r,VAR[X]=VAR[Y]$.
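
The relation between the two parameterizations can be confirmed numerically. The sketch below, with hypothetical values $r=3$ and $p=0.4$, checks that $P(X=k+r)=P(Y=k)$ and that the moments of $Y$ agree with the formulas above; the sums are truncated where the tail is negligible.

```python
from math import comb

def negbin_trials_pmf(n, r, p):
    """P(X = n): the r-th success occurs on trial n."""
    return comb(n - 1, r - 1) * p ** r * (1 - p) ** (n - r)

def negbin_failures_pmf(k, r, p):
    """P(Y = k): k failures occur before the r-th success."""
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k

r, p = 3, 0.4                        # hypothetical parameters
ks = range(400)                      # truncated support; the tail is negligible

# X = Y + r, so the two mass functions agree after shifting by r.
max_gap = max(abs(negbin_trials_pmf(k + r, r, p) - negbin_failures_pmf(k, r, p))
              for k in ks)

mean_y = sum(k * negbin_failures_pmf(k, r, p) for k in ks)
var_y = sum((k - mean_y) ** 2 * negbin_failures_pmf(k, r, p) for k in ks)
mean_x = sum((k + r) * negbin_trials_pmf(k + r, r, p) for k in ks)

print(max_gap, mean_y, var_y, mean_x)
```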

====Negative multinomial distribution (NMD)====
The [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Multinomial_Distribution_.28NMD.29|NMD]] is a generalization of the two-parameter $NegBin(r,p)$ to more than one outcome. Suppose we have $m+1$ possible outcomes $\{X_0,\ldots,X_m\}$ with probabilities $\{p_0,\ldots,p_m \}$, respectively, where $0<p_i<1$ and $\sum_{i=0}^m {p_i} =1$. Suppose the experiment generates independent outcomes until $\{X_0,\ldots,X_m \}$ occur exactly $\{k_0,\ldots,k_m \}$ times; then $\{X_{0},\ldots,X_{m}\}$ follows a negative multinomial distribution with parameter vector $(k_0,\{p_{1},\ldots,p_{m}\})$, where $m$ represents the degrees of freedom.

* In the special case of $m=1$, if $X$ is the total number of experiments ($n$) necessary to get $k_{0}$ outcomes of $X_0$ and $n-k_{0}$ outcomes of the other possible outcome ($X_{1}$), then $X \sim NegativeMultinomial(k_{0},\{p_{0},p_{1}\})$.

* NMD Probability Mass Function: <math> P(k_1, \cdots, k_m|k_0,\{p_1,\cdots,p_m\}) = \left (\sum_{i=0}^m{k_i}-1\right)!\frac{p_0^{k_0}}{(k_0-1)!} \prod_{i=1}^m{\frac{p_i^{k_i}}{k_i!}},</math> or equivalently:
: <math> P(k_1, \cdots, k_m|k_0,\{p_1,\cdots,p_m\}) = \Gamma\left(\sum_{i=0}^m{k_i}\right)\frac{p_0^{k_0}}{\Gamma(k_0)} \prod_{i=1}^m{\frac{p_i^{k_i}}{k_i!}},</math>
: where <math>\Gamma(x)</math> is the [http://en.wikipedia.org/wiki/Gamma_function Gamma function].
* Mean (vector): <math>\mu=E(X_1,\cdots,X_m)= (\mu_1=E(X_1), \cdots, \mu_m=E(X_m)) = \left ( \frac{k_0p_1}{p_0}, \cdots, \frac{k_0p_m}{p_0} \right).</math>
* Variance-Covariance (matrix): <math>Cov(X_i,X_j)= \{cov[i,j]\},</math> where
: <math> cov[i,j] = \begin{cases} \frac{k_0 p_i p_j}{p_0^2},& i\not= j,\\
\frac{k_0 p_i (p_i + p_0)}{p_0^2},& i=j.\end{cases}</math>
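
The factorial form of the NMD mass function above is straightforward to evaluate. The following sketch uses hypothetical parameters and a function name of our own (<code>nmd_pmf</code>); it checks that for $m=1$ the NMD reduces to the negative binomial distribution of the number of failures, and that for $m=2$ the mean of $X_1$ agrees with $\frac{k_0p_1}{p_0}$ (using a truncated sum).

```python
from math import comb, factorial

def nmd_pmf(ks, k0, ps):
    """NMD mass function; ks = (k_1..k_m), ps = (p_1..p_m), p_0 = 1 - sum(ps)."""
    p0 = 1.0 - sum(ps)
    value = factorial(k0 + sum(ks) - 1) / factorial(k0 - 1) * p0 ** k0
    for k, p in zip(ks, ps):
        value *= p ** k / factorial(k)
    return value

# m = 1: compare with the negative binomial pmf of k failures before the k0-th success.
nmd_m1 = nmd_pmf((4,), 3, (0.6,))
negbin = comb(4 + 3 - 1, 4) * 0.4 ** 3 * 0.6 ** 4

# m = 2 with hypothetical parameters; sums truncated where the tail is negligible.
k0, ps = 2, (0.2, 0.3)
grid = [(k1, k2) for k1 in range(60) for k2 in range(60)]
total = sum(nmd_pmf(k, k0, ps) for k in grid)
mean_1 = sum(k[0] * nmd_pmf(k, k0, ps) for k in grid)

print(nmd_m1, negbin)
print(total, mean_1, k0 * ps[0] / (1.0 - sum(ps)))
```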

====Poisson distribution====
The discrete [[AP_Statistics_Curriculum_2007_Distrib_Poisson|Poisson distribution]] expresses the probability of a number of events occurring in a fixed interval of time, given that these events occur at a known average rate that is independent of the time since the last event. The figure below shows the probability mass function (PMF) of a Poisson distribution with varying parameter ($\lambda$) values.
<center>[[image:SMHS_Probability_Fig1.png]]</center>

The distribution is right-skewed, but for increasing $\lambda$ (say $\lambda>40$) the distribution becomes bell shaped. See the [[AP_Statistics_Curriculum_2007_Limits_Norm2Poisson|Normal approximation to the Poisson distribution section]]. The Figure below shows the CDF of the Poisson distribution with varying parameter values.
<center>[[Image:SMHS_ProbabilityDistribution_fig2.png ]]</center> You can also see the [http://www.distributome.org/V3/calc/PoissonCalculator.html Distributome interactive Poisson calculator].

$$P(X=k)=\frac{\lambda^{k}e^{-\lambda}}{k!},\quad E[X]=\lambda,\quad VAR[X]=\lambda.$$

The CDF is discontinuous at the non-negative integer values of $k$ and flat everywhere else because the variable only takes on integer values. That is, the CDF of the Poisson distribution is right continuous but not left continuous. Also note that the CDF equals 0 for negative arguments, equals $e^{-\lambda}$ at 0 occurrences, and is non-decreasing with increasing numbers of occurrences. It increases towards 1 and is practically equal to 1 after a certain number of occurrences.
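
The step-function shape of the Poisson CDF can be seen by evaluating it just below, at, and just above an integer. In this sketch $\lambda=3$ is a hypothetical rate and the function names are our own; note that the value at an integer equals the value just to its right, while the value just to its left is smaller by the mass at that integer.

```python
from math import exp, factorial, floor

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * exp(-lam) / factorial(k)

def poisson_cdf(x, lam):
    """F(x) = P(X <= x): sum of the pmf over the integers 0..floor(x)."""
    if x < 0:
        return 0.0
    return sum(poisson_pmf(k, lam) for k in range(floor(x) + 1))

lam = 3.0                                    # hypothetical rate
just_below = poisson_cdf(1.999, lam)
at_two = poisson_cdf(2.0, lam)
just_above = poisson_cdf(2.9, lam)
jump = at_two - just_below                   # equals P(X = 2)

print(just_below, at_two, just_above, jump)
print(poisson_cdf(-1.0, lam), poisson_cdf(0.0, lam), poisson_cdf(50.0, lam))
```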

====Normal distribution====
The continuous [[AP_Statistics_Curriculum_2007_Normal_Std|standard normal distribution]] has a
* ''probability density'' function $ f(x)= {e^{-x^2 \over 2} \over \sqrt{2 \pi}} $ and a
* ''cumulative distribution'' function $\Phi(y)= \int_{-\infty}^{y}{{e^{-x^2 \over 2} \over \sqrt{2 \pi}} dx}.$
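
The standard normal CDF has no elementary closed form, but it can be written in terms of the error function as $\Phi(y)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{y}{\sqrt{2}}\right)\right)$. The sketch below (function names are our own) evaluates the density and the CDF and cross-checks the CDF against a simple trapezoidal integration of the density; the lower limit $-8$ stands in for $-\infty$.

```python
from math import erf, exp, pi, sqrt

def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def std_normal_cdf(y):
    """Phi(y), expressed through the error function."""
    return 0.5 * (1 + erf(y / sqrt(2)))

def trapezoid_cdf(y, lower=-8.0, steps=9000):
    """Numerical integral of the density from `lower` (a stand-in for -infinity) to y."""
    h = (y - lower) / steps
    inner = sum(std_normal_pdf(lower + i * h) for i in range(1, steps))
    return h * (0.5 * std_normal_pdf(lower) + inner + 0.5 * std_normal_pdf(y))

phi_1 = std_normal_cdf(1.0)
phi_1_numeric = trapezoid_cdf(1.0)
print(std_normal_pdf(0.0), std_normal_cdf(0.0))
print(phi_1, phi_1_numeric)
```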

===Applications===
*[http://www.mdm.com/articles/28757-the-case-for-proactive-inside-sales?v=preview The article] examines how a proactive inside sales force can be critical to serving mid-market and small customers as part of a broader multichannel strategy, and includes steps for initiating an effective program.

*[http://wiki.socr.umich.edu/index.php/SOCR_EduMaterials_Activities_NegativeBinomial This article] provides an example of Negative Binomial Experiment by SOCR. The goal of this experiment is to provide a simulation demonstrating properties of the Negative Binomial(k,p) distribution. The applet facilitates the calculations of the Negative Binomial mass/density function, the moments and cumulative distribution function. It gives the specific steps of the experiment in SOCR and it allows users to learn about the variation of the distribution with changing parameters.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
*[http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate_Normal_Experiment]
*[http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm Normal T Chi$^{2}$]
*[http://socr.ucla.edu/htmls/dist/Multinomial_Distribution.html Multinomial_Distribution]
*[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_Binomial_Distributions Activities Binomial Distributions]

===Problems===
* If sampling distributions of sample means are examined for samples of size 1, 5, 10, 16 and 50, you will notice that as sample size increases, the shape of the sampling distribution appears more like that of the:
: (a) normal distribution
: (b) uniform distribution
: (c) population distribution
: (d) binomial distribution

* Which of the following statements best describes the effect on the Binomial Probability Model if the number of trials is held constant and $p$ (the probability of "success") increases?
: (a) None of these statements are true
: (b) The mean and the standard deviation both increase
: (c) The mean decreases and the standard deviation increases
: (d) The mean increases and the standard deviation decreases
: (e) The mean and standard deviation both decrease

* Suppose you draw one card from a standard deck three times, with replacement. What is the probability that you get spades all three times? Choose one answer.
: (a) 0.002
: (b) 0.321
: (c) 0.015
: (d) 0.021

* Suppose the number of cars that enter a parking lot in an hour is a Poisson random variable, and suppose that P(X=0)=0.05. Determine the variance of X.
: (a) 0.349
: (b) 3.232
: (c) 9.321
: (d) 2.996

* A researcher converts 100 lung capacity measurements to z-scores. The lung capacity measurements do not follow a normal distribution. What can we say about the standard deviation of the 100 z-scores?
: (a) It depends on the standard deviation of the raw scores
: (b) It equals 1
: (c) It equals 100
: (d) It must always be less than the standard deviation of the raw scores
: (e) It depends on the shape of the raw score distribution

* Among first year students at a certain university, scores on the verbal SAT follow the normal curve. The average is around 500 and the SD is about 100. Tatiana took the SAT, and placed at the 85th percentile. What was her verbal SAT score?
: (a) 604
: (b) 560
: (c) 90
: (d) 403

* A researcher took a random sample of 100 orc soldiers and found the mean and the standard deviation of their weights to be 200 lbs and 20 lbs, respectively. He can be 68% confident that the mean weight in the population of orc soldiers is between
: (a) 196 to 204 lbs
: (b) 198 to 202 lbs
: (c) 194 to 206 lbs
: (d) None of the above

* The Rockwell hardness of certain metal pins is known to have a mean of 50 and a standard deviation of 1.5. If the distribution of all such pin hardness measurements is known to be normal, what is the probability that the average hardness for a random sample of nine pins is at least 50.5?
: (a) Approximately 4
: (b) 0.4
: (c) Approximately 0.1587
: (d) Approximately 0

* You read that the heights of college women are nearly normal with a mean of 65 inches and a standard deviation of 2 inches. If Vanessa is at the 10th percentile (shortest 10% for women) in height for college women, then her height is closest to:
: (a) 64.5 inches
: (b) It cannot be determined from this information
: (c) 60.5 inches
: (d) 62.44 inches

* The settlement (in cm) of a structure shown in the following figure may be evaluated from S = 0.3A + 0.2B + 0.1C,
<center>[[Image:SMHS_ProbabDist_Fig3.png]]</center>
: where A, B, and C are respectively the thickness (in m) of the three layers of soil as shown. Suppose A, B, and C are modeled as independent normal random variables as: $A \sim N(5,1)$, $B \sim N(8,2)$, $C \sim N(7,1)$,
: (a) Determine the probability that the settlement will exceed 4 cm.
: (b) If the total thickness of the three layers is known exactly as 20 m; and furthermore, thicknesses A and B are correlated with correlation coefficient equal to 0.5, determine the probability that the settlement will exceed 4 cm.

* Suppose that the distribution of X in the population is strongly skewed to the left. If you took 200 independent and random samples of size 3 from this population, calculated the mean for each of the 200 samples, and drew the distribution of the sample means, what would the sampling distribution of the means look like?
: (a) It will be perfectly normal and the mean will be equal to the median.
: (b) It will be close to the normal and the mean will be close to the median.
: (c) On a p-plot, most of the points will be on the line.
: (d) It will be skewed to the left and the mean will be less than the median.

* A polling agency has been hired to predict the proportion of voters who favor a certain candidate. The polling agency picks a random sample of 1000 voters of which 400 indicate that they favor the candidate. If they increase the sample size to 2000, how does the standard error change?
: (a) The standard error will decrease by one-fourth
: (b) The standard error will not change; the margin of error changes
: (c) Since the sample size is doubled, the standard error will be halved
: (d) The standard error will decrease not by a factor of 1/2 but by the square root of 1/2

* The probability of winning a certain instant scratch-n-win game is 0.02. You play the game 80 times. Find the probability that you win 3 times.
: (a) 0.2983
: (b) 0.1378
: (c) 0.3231
: (d) 0.2391

* After firing 1000 boxes of ammunition, a certain handgun jammed according to a Poisson distribution with a mean of 0.4 jams per box of ammunition. Approximate the probability that more than 350 boxes of ammunition contain at least one jam.
: (a) .0032
: (b) .0012
: (c) .0231
: (d) .0089

===References===
* [http://wiki.socr.umich.edu/index.php/Probability_and_statistics_EBook#Chapter_IV:_Probability_Distributions SOCR EBook Probability Chapter]
* [http://en.wikipedia.org/wiki/Poisson_distribution Poisson Distribution Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ProbabilityDistributions}}

SMHS ProbabilityDistributions

2014-08-31T13:56:09Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Probability Distributions ==

===Overview===
Distributions are the fundamental basis of probability theory. There are two types of processes that we observe in nature, discrete and continuous, and they are modeled by the corresponding distributions. (There can also be mixture-, multidimensional and tensor distributions; these are not discussed here). The type of distribution depends on the type of data. Discrete and continuous distributions represent discrete or continuous random variables, respectively. This section aims to introduce various discrete and continuous distributions and discuss the relationships between distributions.

*Discrete distributions: [[AP_Statistics_Curriculum_2007_Distrib_Binomial|Bernoulli distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Binomial|Binomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Multinomial|Multinomial distribution]], [[SOCR_EduMaterials_Activities_Explore_Distributions#Geometric_probability_distribution|Geometric distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#HyperGeometric|Hypergeometric distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Binomial|Negative binomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Multinomial_Distribution_.28NMD.29|Negative multinomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Poisson|Poisson distribution]].

*Continuous distributions: [[AP_Statistics_Curriculum_2007#Chapter_V:_Normal_Probability_Distribution|Normal distribution]], [[SOCR_BivariateNormal_JS_Activity| Multivariate normal distribution]].

===Motivation===
We have talked about different types of data and the fundamentals of probability theory. In order to capture and estimate patterns in data, we introduced the concept of a distribution. A probability distribution assigns a probability to each measurable subset of the possible outcomes of a random experiment. It can be either univariate or multivariate. A univariate distribution gives the probability of a single random variable, while a multivariate distribution (i.e., a joint probability distribution) gives the probability of a random vector, which is a set of two or more random variables taking on various combinations of values. Consider the coin tossing experiment: what distribution would we expect the outcomes to follow?

===Theory===
'''Random variables''': a random variable is a function or a mapping from a sample space into the real numbers (most of the time). In other words, a random variable assigns real values to outcomes of experiments.

'''Probability density / mass and (cumulative) distribution functions'''
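The probability density function (pdf) or probability mass function (pmf) of a continuous or discrete random variable, respectively, describes how probability is distributed over the values of the variable. For a discrete random variable, the pmf is the probability of the subset of the sample space $\{s\in S \mid X(s)=x\}\subset S$: $p(x)=P(\{s\in S \mid X(s)=x\})$, for all $x$. For a continuous random variable, the pdf $p(x)$ gives probabilities by integration: $P(a\leq X\leq b)=\int_a^b{p(x)dx}$.
The cumulative distribution function (cdf) $F(x)$ of any random variable $X$ with probability mass or density function $p(x)$ is the total probability of all outcomes $s\in S$ for which $X(s) \leq x$: $F(x)=P(X\leq x)$, for all $x$.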

'''Expectation and variance'''
*Expectation: The expected value, expectation or mean, of a discrete random variable $X$ is defined as $E[X]=\sum_i {x_i P(X=x_i)}$. The expected value of a continuous random variable $Y$ is defined as $E[Y]=\int_y{yP(y)dy}$, which is the integral over the domain of $Y$ and $P(y)$ is the probability density function of $Y$. An important property of expectation is that it is a linear functional, i.e., $E[aX+bY]=aE[X]+bE[Y]$.

*Variance: The variance of a discrete random variable $X$ is defined as $VAR[X]=\sum_i {(x_i-E[X])^2 P(X=x_i)}$. The variance of a continuous random variable $Y$ is defined as $VAR[Y]=\int_y {(y-E[Y])^2 P(y)dy}$, which is the integral over the domain of $Y$, where $P(y)$ is the probability density function of $Y$. The variance (the second central moment) is not linear like the expectation: $VAR[aX]= a^2 VAR[X]$ and $VAR[X+Y]=VAR[X]+VAR[Y]+2COV(X,Y)$.
*Covariance: $COV(X,Y)=E[(X-E[X])(Y-E[Y])]$.

====Bernoulli distribution====
A [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Bernoulli_process|Bernoulli trial]] is an experiment whose dichotomous outcomes are random (e.g., ‘head’ vs. ‘tail’). $X(s)= \begin{cases}
1, & s=\text{head} \\
0, & s=\text{tail}
\end{cases}$.
If ''p''=P(''head''), then $E[X]=p$ and $VAR[X]=p(1-p)$.

====Binomial distribution====
Suppose we conduct an experiment observing $n$ trials of a Bernoulli process. If we are interested in the RV $X$ = {Number of heads in the $n$ trials}, then $X$ is called a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Binomial_Random_Variables|binomial RV]] and its distribution is called the binomial distribution, $X \sim B(n,p)$, where $n$ is the number of trials and $p$ is the probability of heads in one trial. $P(X=x)={n\choose x} p^x (1-p)^{n-x}$, for $x=0,1,\ldots,n$, where ${n\choose x}=\frac {n!} {x!(n-x)!}$ is the binomial coefficient.
$$E[X]=np,VAR[X]=np(1-p)$$

====Multinomial distribution====
The [[AP_Statistics_Curriculum_2007_Distrib_Multinomial|multinomial distribution]] is an extension of the binomial where the experiment consists of $n$ repeated trials and each trial has one of $k$ possible outcomes; on any given trial, the probability $p_i$ that outcome $i$ will occur is constant; the trials are independent. Here $X_i$ counts the trials that result in outcome $i$.

$P(X_1=r_1 \cap \cdots \cap X_k=r_{k} \mid r_1 + \cdots +r_k=n)={n\choose r_1,\ldots,r_k} p_1^{r_1} p_2^{r_2}\cdots p_k^{r_k}$, for all $r_1+\cdots+r_k=n$, where ${n\choose r_1,\ldots,r_k}=\frac {n!}{r_1! \times \cdots \times r_k!}$.

====Geometric distribution====
The probability distribution of the number $X$ of Bernoulli trials needed to get the first success is called the [[AP_Statistics_Curriculum_2007_Distrib_Dists#Geometric|geometric distribution]]. It is supported on the set $\{1,2,3,\ldots\}$. $P(X=x)=(1-p)^{x-1}p$, for $x = 1, 2, \ldots$

$$E[X]=\dfrac {1} {p},VAR[X]= \frac {1-p} {p^{2}}$$

====Hypergeometric distribution====
The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of $n$ draws from a finite population without replacement. An experimental design for using the [[AP_Statistics_Curriculum_2007_Distrib_Dists#HyperGeometric|hypergeometric distribution]] is illustrated in the table below: a shipment of $N$ objects includes $m$ defective ones. The hypergeometric distribution describes the probability that in a sample of $n$ distinct objects drawn from the shipment exactly $k$ objects are defective.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
|'''Type''' ||'''Drawn''' ||'''Not-Drawn''' || '''Total'''
|-
|Defective || $k$|| $m-k$ || $m$
|-
|Non-Defective || $n-k$ || $N+k-n-m$ ||$N-m$
|-
|Total || $n$|| $N-n$ || $N$
|}
</center>

$$ P(X=k)=\frac {{m \choose k}{N-m \choose n-k}} {N \choose n}, E[X]=\frac{nm}{N}, VAR[X]=\frac{\frac{nm}{N}(1-\frac{m}{N})(N-n)} {N-1}$$

==== Negative binomial distribution====
Suppose $X$ is the trial index ($n$) of the $r^{th}$ success, i.e., the total number of experiments ($n$) needed to get $r$ successes. The [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Binomial| negative binomial distribution]] has the following mass function: $P(X=n)={n-1 \choose r-1} p^r (1-p)^{(n-r)}$, for $n=r,r+1,r+2,\ldots$, where $n$ is the trial number of the $r^{th}$ success.

$$E[X]=\frac {r} {p},VAR[X]=\frac {r(1-p)} {p^{2}}$$

Suppose $Y$ is the number of failures ($k$) before the $r^{th}$ success. $P(Y=k)={k+r-1 \choose k} p^{r} (1-p)^{k}$, for $k=0,1,2,\ldots$. We write $Y \sim NegBin(r,p)$; $P(Y=k)$ is the probability of $k$ failures and $r$ successes in $n = k+r$ $Bernoulli(p)$ trials with a success on the last trial.

$$E[Y]=\frac{r(1-p)}{p},VAR[Y]=\frac {r(1-p)} {p^{2}}$$

NOTE: $X=Y+r,E[X]=E[Y]+r,VAR[X]=VAR[Y]$.

====Negative multinomial distribution (NMD)====
[[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Multinomial_Distribution_.28NMD.29|NMD]] is a generalization of the two-parameter $NegBin(r,p)$ to more than one outcomes. Suppose we have $m$ possible outcomes $\{X_0,…,X_m\}$ each with probability $\{p_0,…,p_m \}$, respectively, where $0<p_i<1$ and $\sum_{i=0}^m {p_i} =1$. Suppose the experiment generates independent outcomes until $\{X_0,…,X_m \}$ occur exactly $\{k_0,…,k_m \}$ times, then $\{X_{0},…,X_{m}\}$ is Negative Multinomial with parameter vector $(k_0,\{p_{1},…,p_{m}\})$, with $m$ representing the degrees of freedom.

* In the special case of $m=1$, if $X$ is the total number of experiments ($n$) necessary to get $k_{0}$ outcomes of $X_{0}$ and $n-k_{0}$ outcomes of the other possible outcome ($X_{1}$), then $X \sim NegativeMultinomial(k_{0},\{p_{0},p_{1}\})$.

* NMD Probability Mass Function: <math> P(k_1, \cdots, k_m|k_0,\{p_1,\cdots,p_m\}) = \left (\sum_{i=0}^m{k_i}-1\right)!\frac{p_0^{k_0}}{(k_0-1)!} \prod_{i=1}^m{\frac{p_i^{k_i}}{k_i!}},</math> or equivalently:
: <math> P(k_1, \cdots, k_m|k_0,\{p_1,\cdots,p_m\}) = \Gamma\left(\sum_{i=0}^m{k_i}\right)\frac{p_0^{k_0}}{\Gamma(k_0)} \prod_{i=1}^m{\frac{p_i^{k_i}}{k_i!}},</math>
: where <math>\Gamma(x)</math> is the [http://en.wikipedia.org/wiki/Gamma_function Gamma function].
* Mean (vector): <math>\mu=E(X_1,\cdots,X_m)= (\mu_1=E(X_1), \cdots, \mu_m=E(X_m)) = \left ( \frac{k_0p_1}{p_0}, \cdots, \frac{k_0p_m}{p_0} \right).</math>
* Variance-Covariance (matrix): <math>Cov(X_i,X_j)= \{cov[i,j]\},</math> where
: <math> cov[i,j] = \begin{cases} \frac{k_0 p_i p_j}{p_0^2},& i\not= j,\\
\frac{k_0 p_i (p_i + p_0)}{p_0^2},& i=j.\end{cases}</math>

====Poisson distribution====
The discrete [[AP_Statistics_Curriculum_2007_Distrib_Poisson|Poisson distribution]] expresses the probability of the number of events occurring in a fixed interval of time if these events occur with a known average rate and independently of the time since the last event. The Figure below shows the pdf of Poisson Distribution with varying parameter ($\lambda$) values.
<center>[[image:SMHS_Probability_Fig1.png]]</center>

The distribution is right-skewed, but for increasing $\lambda$ (say $\lambda>40$) the distribution becomes bell shaped. See the [[AP_Statistics_Curriculum_2007_Limits_Norm2Poisson|Normal approximation to Poisson distribution section]]. The Figure below shows the CDF of Poisson Distribution with varying parameter values.
<center>[[Image:SMHS_ProbabilityDistribution_fig2.png ]]</center> You can also see the [http://www.distributome.org/V3/calc/PoissonCalculator.html Distributome interactive Poisson calculator].

$$P(X=k)=\frac{λ^{k}e^{-λ}}{k!},E[X]=λ,VAR[X]=λ.$$

The CDF is discontinuous at the non-negative integers $k$ and flat everywhere else because the variable only takes on integer values. That is, the CDF of the Poisson distribution is right-continuous but not left-continuous. Also note that the CDF equals 0 for negative arguments, takes the value $e^{-\lambda}$ at 0 occurrences, and is non-decreasing in the number of occurrences. It approaches 1 as the number of occurrences grows.

====Normal distribution====
The continuous [[AP_Statistics_Curriculum_2007_Normal_Std|Standard Normal distribution]] has
* ''density'' function $ f(x)= {e^{-x^2 \over 2} \over \sqrt{2 \pi}}. $
* ''cumulative distribution'' function $\Phi(y)= \int_{-\infty}^{y}{{e^{-x^2 \over 2} \over \sqrt{2 \pi}} dx}.$

===Applications===
*[http://www.mdm.com/articles/28757-the-case-for-proactive-inside-sales?v=preview The article] examined how a proactive inside sales force can be critical to serving mid-market and small customers as part of a broader multichannel strategy, and it included steps for initiating an effective program.

*[http://wiki.socr.umich.edu/index.php/SOCR_EduMaterials_Activities_NegativeBinomial This article] provides an example of Negative Binomial Experiment by SOCR. The goal of this experiment is to provide a simulation demonstrating properties of the Negative Binomial(k,p) distribution. The applet facilitates the calculations of the Negative Binomial mass/density function, the moments and cumulative distribution function. It gives the specific steps of the experiment in SOCR and it allows users to learn about the variation of the distribution with changing parameters.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
*[http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate_Normal_Experiment]
*[http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm Normal T Chi$^{2}$]
*[http://socr.ucla.edu/htmls/dist/Multinomial_Distribution.html Multinomial_Distribution]
*[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_Binomial_Distributions Activities Binomial Distributions]

===Problems===
* If sampling distributions of sample means are examined for samples of size 1, 5, 10, 16 and 50, you will notice that as sample size increases, the shape of the sampling distribution appears more like that of the:
: (a) normal distribution
: (b) uniform distribution
: (c) population distribution
: (d) binomial distribution

* Which of the following statements best describes the effect on the Binomial Probability Model if the number of trials is held constant and $p$ (the probability of "success") increases?
: (a) None of these statements are true
: (b) The mean and the standard deviation both increase
: (c) The mean decreases and the standard deviation increases
: (d) The mean increases and the standard deviation decreases
: (e) The mean and standard deviation both decrease

* Suppose you draw one card from a standard deck three times, with replacement. What is the probability that you get spades all three times? Choose one answer.
: (a) 0.002
: (b) 0.321
: (c) 0.015
: (d) 0.021

* Suppose the number of cars that enter a parking lot in an hour is a Poisson random variable, and suppose that P(X=0)=0.05. Determine the variance of X.
: (a) 0.349
: (b) 3.232
: (c) 9.321
: (d) 2.996

* A researcher converts 100 lung capacity measurements to z-scores. The lung capacity measurements do not follow a normal distribution. What can we say about the standard deviation of the 100 z-scores?
: (a) It depends on the standard deviation of the raw scores
: (b) It equals 1
: (c) It equals 100
: (d) It must always be less than the standard deviation of the raw scores
: (e) It depends on the shape of the raw score distribution

* Among first-year students at a certain university, scores on the verbal SAT follow the normal curve. The average is around 500 and the SD is about 100. Tatiana took the SAT and placed at the 85th percentile. What was her verbal SAT score?
: (a) 604
: (b) 560
: (c) 90
: (d) 403

* A researcher took a random sample of 100 orc soldiers and found the mean and the standard deviation of their weights to be 200 lbs and 20 lbs, respectively. He can be 68% confident that the mean weight in the population of orc soldiers is between
: (a) 196 to 204 lbs
: (b) 198 to 202 lbs
: (c) 194 to 206 lbs
: (d) None of the above

* The Rockwell hardness of certain metal pins is known to have a mean of 50 and a standard deviation of 1.5. If the distribution of all such pin hardness measurements is known to be normal, what is the probability that the average hardness for a random sample of nine pins is at least 50.5?
: (a) Approximately 4
: (b) 0.4
: (c) Approximately 0.1587
: (d) Approximately 0

* You read that the heights of college women are nearly normal with a mean of 65 inches and a standard deviation of 2 inches. If Vanessa is at the 10th percentile (shortest 10% for women) in height for college women, then her height is closest to:
: (a) 64.5 inches
: (b) It cannot be determined from this information
: (c) 60.5 inches
: (d) 62.44 inches

* The settlement (in cm) of a structure shown in the following figure may be evaluated from S = 0.3A + 0.2B + 0.1C,
<center>[[Image:SMHS_ProbabDist_Fig3.png]]</center>
: where A, B, and C are respectively the thickness (in m) of the three layers of soil as shown. Suppose A, B, and C are modeled as independent normal random variables as: $A \sim N(5,1)$, $B \sim N(8,2)$, $C \sim N(7,1)$,
: (a) Determine the probability that the settlement will exceed 4 cm.
: (b) If the total thickness of the three layers is known exactly as 20 m; and furthermore, thicknesses A and B are correlated with correlation coefficient equal to 0.5, determine the probability that the settlement will exceed 4 cm.

* Suppose that the distribution of X in the population is strongly skewed to the left. If you took 200 independent and random samples of size 3 from this population, calculated the mean for each of the 200 samples, and drew the distribution of the sample means, what would the sampling distribution of the means look like?
: (a) It will be perfectly normal and the mean will be equal to the median.
: (b) It will be close to the normal and the mean will be close to the median.
: (c) On a p-plot, most of the points will be on the line.
: (d) It will be skewed to the left and the mean will be less than the median.

* A polling agency has been hired to predict the proportion of voters who favor a certain candidate. The polling agency picks a random sample of 1000 voters of which 400 indicate that they favor the candidate. If they increase the sample size to 2000, how does the standard error change?
: (a) The standard error will decrease by one-fourth
: (b) The standard error will not change; the margin of error changes
: (c) Since the sample size is doubled, the standard error will be halved
: (d) The standard error will decrease not by a factor of 1/2 but by a factor of the square root of 1/2

* The probability of winning a certain instant scratch-n-win game is 0.02. You play the game 80 times. Find the probability that you win 3 times.
: (a) 0.2983
: (b) 0.1378
: (c) 0.3231
: (d) 0.2391

* After firing 1000 boxes of ammunition, a certain handgun jammed according to a Poisson distribution with a mean of 0.4 per box of ammunition. Approximate the probability that more than 350 boxes of ammunition contain some that is jammed.
: (a) .0032
: (b) .0012
: (c) .0231
: (d) .0089

===References===
* [http://wiki.socr.umich.edu/index.php/Probability_and_statistics_EBook#Chapter_IV:_Probability_Distributions SOCR EBook Probability Chapter]
* [http://en.wikipedia.org/wiki/Poisson_distribution Poisson Distribution Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ProbabilityDistributions}}

SMHS ProbabilityDistributions

2014-08-31T13:54:15Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Probability Distributions ==

===Overview===
Distributions are the fundamental basis of probability theory. There are two types of processes that we observe in nature, discrete and continuous, and they are modeled by the corresponding distributions. (There can also be mixture, multidimensional, and tensor distributions; these are not discussed here.) The type of distribution depends on the type of data. Discrete and continuous distributions represent discrete or continuous random variables, respectively. This section aims to introduce various discrete and continuous distributions and discuss the relationships between distributions.

*Discrete distributions: [[AP_Statistics_Curriculum_2007_Distrib_Binomial|Bernoulli distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Binomial|Binomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Multinomial|Multinomial distribution]], [[SOCR_EduMaterials_Activities_Explore_Distributions#Geometric_probability_distribution|Geometric distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#HyperGeometric|Hypergeometric distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Binomial|Negative binomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Multinomial_Distribution_.28NMD.29|Negative multinomial distribution]], [[AP_Statistics_Curriculum_2007_Distrib_Poisson|Poisson distribution]].

*Continuous distributions: [[AP_Statistics_Curriculum_2007#Chapter_V:_Normal_Probability_Distribution|Normal distribution]], [[SOCR_BivariateNormal_JS_Activity| Multivariate normal distribution]].

===Motivation===
We have talked about different types of data and the fundamentals of probability theory. In order to capture and estimate the patterns of data, we introduced the concept of distribution. A probability distribution assigns a probability to each measurable subset of the possible outcomes of a random experiment. It can be either univariate or multivariate. A univariate distribution gives the probability of a single random variable, while a multivariate distribution (a joint probability distribution) gives the probability of a random vector, which is a set of two or more random variables taking on various combinations of values. Consider the coin tossing experiment: what would be the distribution of the outcome?

===Theory===
'''Random variables''': a random variable is a function or a mapping from a sample space into the real numbers (most of the time). In other words, a random variable assigns real values to outcomes of experiments.

'''Probability density / mass and (cumulative) distribution functions'''
The probability density or probability mass function (pdf), for a continuous or discrete random variable, is the function defined by the probability of the subset of the sample space $\{s\in S \mid X(s)=x\}\subset S$: $p(x)=P(\{s\in S \mid X(s)=x\})$, for all $x$.
The cumulative distribution function (cdf) $F(x)$ of any random variable $X$ with probability mass or density function $p(x)$ is defined as the total probability of all outcomes $s\in S$ for which $X(s) \leq x$: $F(x)=P(X\leq x)$, for all $x$.

'''Expectation and variance'''
*Expectation: The expected value, expectation or mean, of a discrete random variable $X$ is defined as $E[X]=\sum_i {x_i P(X=x_i)}$. The expected value of a continuous random variable $Y$ is defined as $E[Y]=\int_y{yP(y)dy}$, which is the integral over the domain of $Y$ and $P(y)$ is the probability density function of $Y$. An important property of expectation is that it is a linear functional, i.e., $E[aX+bY]=aE[X]+bE[Y]$.

*Variance: The variance of a discrete random variable $X$ is defined as $VAR[X]=\sum_i {(x_i-E[X])^2 P(X=x_i)}$. Variance of a continuous random variable $Y$ is defined as $VAR[Y]=\int_y {(y-E[Y])^2 P(y)dy}$, which is the integral over the domain of $Y$ and $P(y)$ is the probability density function of $Y$. The second moment, variance, does not quite have the same linear functional properties as the expectation: $VAR[aX]= a^2 VAR[X]$ and $VAR[X+Y]=VAR[X]+VAR[Y]+2COV(X,Y)$.
*Covariance:$COV(X,Y)=E[(X-E[X])(Y-E[Y])]$.
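The definitions of expectation and variance for a discrete random variable can be checked on a small example. The following is a minimal sketch, assuming a fair six-sided die as the random variable; the helper names <code>expectation</code> and <code>variance</code> are illustrative and not part of any SOCR tool.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum_i x_i * P(X = x_i) for a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """VAR[X] = sum_i (x_i - E[X])^2 * P(X = x_i)."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Assumed example: a fair die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mean_die = expectation(die)
var_die = variance(die)

# Scaling property from the text: VAR[aX] = a^2 VAR[X].
a = 3
scaled = {a * x: p for x, p in die.items()}
var_scaled = variance(scaled)
```

Exact fractions are used here so that the scaling property can be compared without floating-point rounding.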

====Bernoulli distribution====
A [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Bernoulli_process|Bernoulli trial]] is an experiment whose dichotomous outcomes are random (e.g., ‘head’ vs. ‘tail’). $X(s)= \begin{cases}
1, & \text{s=head} \\
0, & \text{s=tail}
\end{cases}$.
If ''p''=P(''head''), then $E[X]=p$ and $VAR[X]=p(1-p)$.

====Binomial distribution====
Suppose we conduct an experiment observing an $n$-trial Bernoulli process. If we are interested in the RV $X$ = {Number of heads in the $n$ trials}, then $X$ is called a [[AP_Statistics_Curriculum_2007_Distrib_Binomial#Binomial_Random_Variables|Binomial RV and its distribution is called Binomial distribution]], $X \sim B(n,p)$, where $n$ is the number of trials (sample size) and $p$ is the probability of a head on one trial. $P(X=x)={n\choose x} p^x (1-p)^{n-x}$, for $x=0,1,…,n$, where ${n\choose x}=\frac {n!} {x!(n-x)!}$ is the binomial coefficient.
$$E[X]=np,VAR[X]=np(1-p)$$
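The binomial mass function and its moments can be verified numerically. This is a minimal sketch using only the Python standard library; the values $n=10$ and $p=0.3$ are arbitrary assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # assumed illustrative parameters
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                         # should be 1
mean = sum(x * q for x, q in enumerate(pmf))             # should match n*p
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))  # should match n*p*(1-p)
```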

====Multinomial distribution====
The [[AP_Statistics_Curriculum_2007_Distrib_Multinomial|multinomial distribution]] is an extension of the binomial where the experiment consists of $n$ repeated trials and each trial has a discrete number ($k$) of possible outcomes; on any given trial, the probability that a particular outcome will occur is constant; the trials are independent.

$ p=P(X_1=r_1 \cap \cdots \cap X_k=r_{k} \mid r_1 + \cdots +r_k=n)$ = ${n\choose r_1,…,r_k} p_1^{r_1} p_2^{r_2} \cdots p_k^{r_k}$ for all $r_1+\cdots+r_k=n$, where ${n\choose r_1,…,r_k}=\frac {n!}{r_1! \times … \times r_k!}$.

====Geometric distribution====
The probability distribution of number X of Bernoulli trials needed to get one success is called [[AP_Statistics_Curriculum_2007_Distrib_Dists#Geometric|Geometric]]. It is supported on the set $\{1,2,3,…\}$. $P(X=x)=(1-p)^{x-1}p$, for $x = 1, 2, … $

$$E[X]=\dfrac {1} {p},VAR[X]= \frac {1-p} {p^{2}}$$

====Hypergeometric distribution====
A discrete probability distribution that describes the number of successes in a sequence of $n$ draws from a finite population without replacement. An experimental design for using the [[AP_Statistics_Curriculum_2007_Distrib_Dists#HyperGeometric|Hypergeometric distribution]] is illustrated in the table below: a shipment of $N$ objects includes $m$ defective ones. The Hypergeometric distribution describes the probability that a sample of $n$ distinct objects drawn from the shipment contains exactly $k$ defective objects.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-
|'''Type''' ||'''Drawn''' ||'''Not-Drawn''' || '''Total'''
|-
|Defective || $k$|| $m-k$ || $m$
|-
|Non-Defective || $n-k$ || $N+k-n-m$ ||$N-m$
|-
|Total || $n$|| $N-n$ || $N$
|}
</center>

$$ P(X=k)=\frac {{m \choose k}{N-m \choose n-k}} {N \choose n}, E[X]=\frac{nm}{N}, VAR[X]=\frac{\frac{nm}{N}(1-\frac{m}{N})(N-n)} {N-1}$$
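The hypergeometric mass function and the closed-form mean and variance can be cross-checked directly. The sketch below assumes a hypothetical shipment of $N=50$ objects with $m=5$ defective ones and a sample of $n=10$; these numbers are not from the source.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(X = k) = C(m, k) * C(N - m, n - k) / C(N, n)."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

N, m, n = 50, 5, 10  # assumed shipment size, defectives, sample size

# Support: k cannot exceed m or n, and n - k cannot exceed N - m.
support = range(max(0, n - (N - m)), min(n, m) + 1)
pmf = {k: hypergeom_pmf(k, N, m, n) for k in support}

total = sum(pmf.values())
mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())

# Closed forms from the text.
mean_formula = n * m / N
var_formula = (n * m / N) * (1 - m / N) * (N - n) / (N - 1)
```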

==== Negative binomial distribution====
Suppose $X$ is the trial index ($n$) of the $r^{th}$ success, i.e., the total number of experiments ($n$) needed to get $r$ successes. The [[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Binomial| Negative binomial distribution]] has the following mass function: $P(X=n)={n-1 \choose r-1} p^r (1-p)^{(n-r)}$, for $n=r,r+1,r+2,…$, where $n$ is the trial number of the $r^{th}$ success.

$$E[X]=\frac {r} {p},VAR[X]=\frac {r(1-p)} {p^{2}}$$

Suppose $Y$ is the number of failures ($k$) before the $r^{th}$ success. Then $P(Y=k)={k+r-1 \choose k} p^{r} (1-p)^{k}$, for $k=0,1,2,…$. $Y \sim NegBin(r,p)$ gives the probability of $k$ failures and $r$ successes in $n = k+r$ $Bernoulli(p)$ trials with a success on the last trial.

$$E[Y]=\frac{r(1-p)}{p},VAR[Y]=\frac {r(1-p)} {p^{2}}$$

NOTE: $X=Y+r,E[X]=E[Y]+r,VAR[X]=VAR[Y]$.
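The relation $X=Y+r$ between the two parametrizations can be checked numerically. The following sketch assumes $r=3$ and $p=0.4$ and truncates the infinite support at 500 failures, which leaves a negligible tail for these values.

```python
from math import comb

def negbin_trials_pmf(n, r, p):
    """P(X = n): the r-th success occurs on trial n (n = r, r+1, ...)."""
    return comb(n - 1, r - 1) * p ** r * (1 - p) ** (n - r)

def negbin_failures_pmf(k, r, p):
    """P(Y = k): k failures occur before the r-th success (k = 0, 1, ...)."""
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k

r, p = 3, 0.4  # assumed illustrative parameters
ks = range(500)  # truncated support for Y

# X = Y + r, so P(X = k + r) must equal P(Y = k).
max_gap = max(abs(negbin_trials_pmf(k + r, r, p) - negbin_failures_pmf(k, r, p))
              for k in ks)

mean_y = sum(k * negbin_failures_pmf(k, r, p) for k in ks)
var_y = sum((k - mean_y) ** 2 * negbin_failures_pmf(k, r, p) for k in ks)
mean_x = mean_y + r  # shifting by r changes the mean but not the variance
```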

====Negative multinomial distribution (NMD)====
[[AP_Statistics_Curriculum_2007_Distrib_Dists#Negative_Multinomial_Distribution_.28NMD.29|NMD]] is a generalization of the two-parameter $NegBin(r,p)$ to more than one outcome. Suppose we have $m+1$ possible outcomes $\{X_0,…,X_m\}$ with probabilities $\{p_0,…,p_m \}$, respectively, where $0<p_i<1$ and $\sum_{i=0}^m {p_i} =1$. Suppose the experiment generates independent outcomes until $\{X_0,…,X_m \}$ occur exactly $\{k_0,…,k_m \}$ times; then $\{X_{0},…,X_{m}\}$ is Negative Multinomial with parameter vector $(k_0,\{p_{1},…,p_{m}\})$, with $m$ representing the degrees of freedom.

* In the special case of $m=1$, if $X$ is the total number of experiments ($n$) necessary to get $k_{0}$ outcomes of $X_{0}$ and $n-k_{0}$ outcomes of the other possible outcome ($X_{1}$), then $X \sim NegativeMultinomial(k_{0},\{p_{0},p_{1}\})$.

* NMD Probability Mass Function: <math> P(k_1, \cdots, k_m|k_0,\{p_1,\cdots,p_m\}) = \left (\sum_{i=0}^m{k_i}-1\right)!\frac{p_0^{k_0}}{(k_0-1)!} \prod_{i=1}^m{\frac{p_i^{k_i}}{k_i!}},</math> or equivalently:
: <math> P(k_1, \cdots, k_m|k_0,\{p_1,\cdots,p_m\}) = \Gamma\left(\sum_{i=0}^m{k_i}\right)\frac{p_0^{k_0}}{\Gamma(k_0)} \prod_{i=1}^m{\frac{p_i^{k_i}}{k_i!}},</math>
: where <math>\Gamma(x)</math> is the [http://en.wikipedia.org/wiki/Gamma_function Gamma function].
* Mean (vector): <math>\mu=E(X_1,\cdots,X_m)= (\mu_1=E(X_1), \cdots, \mu_m=E(X_m)) = \left ( \frac{k_0p_1}{p_0}, \cdots, \frac{k_0p_m}{p_0} \right).</math>
* Variance-Covariance (matrix): <math>Cov(X_i,X_j)= \{cov[i,j]\},</math> where
: <math> cov[i,j] = \begin{cases} \frac{k_0 p_i p_j}{p_0^2},& i\not= j,\\
\frac{k_0 p_i (p_i + p_0)}{p_0^2},& i=j.\end{cases}</math>
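As a sanity check on the NMD mass function, the case $m=1$ should reduce to the negative binomial count of failures. The sketch below implements the factorial form of the mass function, $\left(\sum_{i=0}^m k_i - 1\right)!$, written with log-gamma terms for numerical stability; the function name <code>nmd_pmf</code> and the parameter values are assumptions made for illustration.

```python
from math import comb, exp, lgamma, log

def nmd_pmf(ks, k0, ps):
    """NMD mass function for counts ks = (k_1..k_m) given k0 and ps = (p_1..p_m).

    Uses (k0 + sum(ks) - 1)! = Gamma(k0 + sum(ks)) and (k0 - 1)! = Gamma(k0).
    """
    p0 = 1.0 - sum(ps)
    log_prob = lgamma(k0 + sum(ks)) - lgamma(k0) + k0 * log(p0)
    for k, p in zip(ks, ps):
        log_prob += k * log(p) - lgamma(k + 1)  # p_i^{k_i} / k_i!
    return exp(log_prob)

# Assumed example with m = 1: k0 = 3 occurrences of X_0, p_1 = 0.6 (so p_0 = 0.4).
k0, p1, k1 = 3, 0.6, 2
nmd_value = nmd_pmf([k1], k0, [p1])

# Negative binomial: k1 "failures" before the k0-th "success" with success prob p0.
negbin_value = comb(k1 + k0 - 1, k1) * (1 - p1) ** k0 * p1 ** k1
```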

====Poisson distribution====
The discrete [[AP_Statistics_Curriculum_2007_Distrib_Poisson|Poisson distribution]] expresses the probability of the number of events occurring in a fixed interval of time if these events occur with a known average rate and independently of the time since the last event. The Figure below shows the pdf of Poisson Distribution with varying parameter ($\lambda$) values.
<center>[[image:SMHS_Probability_Fig1.png]]</center>

The distribution is right-skewed, but for increasing $\lambda$ (say $\lambda>40$) the distribution becomes bell shaped. See the [[AP_Statistics_Curriculum_2007_Limits_Norm2Poisson|Normal approximation to Poisson distribution section]]. The Figure below shows the CDF of Poisson Distribution with varying parameter values.
<center>[[Image:SMHS_ProbabilityDistribution_fig2.png ]]</center> You can also see the [http://www.distributome.org/V3/calc/PoissonCalculator.html Distributome interactive Poisson calculator].

$$P(X=k)=\frac{λ^{k}e^{-λ}}{k!},E[X]=λ,VAR[X]=λ.$$

The CDF is discontinuous at the non-negative integers $k$ and flat everywhere else because the variable only takes on integer values. That is, the CDF of the Poisson distribution is right-continuous but not left-continuous. Also note that the CDF equals 0 for negative arguments, takes the value $e^{-\lambda}$ at 0 occurrences, and is non-decreasing in the number of occurrences. It approaches 1 as the number of occurrences grows.
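The step behaviour of the Poisson CDF can be checked numerically: the CDF is constant between integers and jumps at each integer $k$, where it already includes $P(X=k)$. This is a minimal sketch with an assumed rate $\lambda=4$.

```python
from math import exp, factorial, floor

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!."""
    return lam ** k * exp(-lam) / factorial(k)

def poisson_cdf(x, lam):
    """F(x) = P(X <= x), defined for any real x."""
    if x < 0:
        return 0.0
    return sum(poisson_pmf(k, lam) for k in range(floor(x) + 1))

lam = 4.0  # assumed illustrative rate

at_two = poisson_cdf(2, lam)               # value at the jump point
just_below_two = poisson_cdf(2 - 1e-9, lam)  # value immediately to the left
between = poisson_cdf(2.5, lam)            # flat segment between 2 and 3
jump = at_two - just_below_two             # size of the jump at k = 2
```

The jump at $k=2$ equals $P(X=2)$, and the value at 2 matches the value on the flat segment to its right.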

====Normal distribution====
The continuous [[AP_Statistics_Curriculum_2007_Normal_Std|Standard Normal distribution]] has
* ''density'' function $ f(x)= {e^{-x^2 \over 2} \over \sqrt{2 \pi}}. $
* ''cumulative distribution'' function $\Phi(y)= \int_{-\infty}^{y}{{e^{-x^2 \over 2} \over \sqrt{2 \pi}} dx}.$
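The standard normal CDF has no elementary closed form, but it can be expressed through the error function as $\Phi(y)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{y}{\sqrt{2}}\right)\right)$. The sketch below uses this identity and compares it with a simple midpoint-rule integration of the density; the integration limits and step count are assumptions chosen for illustration.

```python
from math import erf, exp, pi, sqrt

def std_normal_pdf(x):
    """f(x) = exp(-x^2 / 2) / sqrt(2 * pi)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def std_normal_cdf(y):
    """Phi(y) via the error function."""
    return 0.5 * (1 + erf(y / sqrt(2)))

def integrate_pdf(a, b, steps=20000):
    """Midpoint-rule approximation of the integral of the density over [a, b]."""
    h = (b - a) / steps
    return h * sum(std_normal_pdf(a + (i + 0.5) * h) for i in range(steps))

phi_one = std_normal_cdf(1.0)
# The mass below -8 is negligible, so -8 stands in for minus infinity.
phi_one_numeric = integrate_pdf(-8.0, 1.0)
```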

===Applications===
*[http://www.mdm.com/articles/28757-the-case-for-proactive-inside-sales?v=preview The article] examined how a proactive inside sales force can be critical to serving mid-market and small customers as part of a broader multichannel strategy, and it included steps for initiating an effective program.

*[http://wiki.socr.umich.edu/index.php/SOCR_EduMaterials_Activities_NegativeBinomial This article] provides an example of Negative Binomial Experiment by SOCR. The goal of this experiment is to provide a simulation demonstrating properties of the Negative Binomial(k,p) distribution. The applet facilitates the calculations of the Negative Binomial mass/density function, the moments and cumulative distribution function. It gives the specific steps of the experiment in SOCR and it allows users to learn about the variation of the distribution with changing parameters.

===Software===
*[http://socr.ucla.edu/htmls/SOCR_Distributions.html SOCR Distributions]
*[http://socr.ucla.edu/htmls/exp/Bivariate_Normal_Experiment.html Bivariate_Normal_Experiment]
*[http://socr.ucla.edu/Applets.dir/Normal_T_Chi2_F_Tables.htm Normal T Chi$^{2}$]
*[http://socr.ucla.edu/htmls/dist/Multinomial_Distribution.html Multinomial_Distribution]
*[http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_Binomial_Distributions Activities Binomial Distributions]

===Problems===
* If sampling distributions of sample means are examined for samples of size 1, 5, 10, 16 and 50, you will notice that as sample size increases, the shape of the sampling distribution appears more like that of the:
: (a) normal distribution
: (b) uniform distribution
: (c) population distribution
: (d) binomial distribution

* Which of the following statements best describes the effect on the Binomial Probability Model if the number of trials is held constant and $p$ (the probability of "success") increases?
: (a) None of these statements are true
: (b) The mean and the standard deviation both increase
: (c) The mean decreases and the standard deviation increases
: (d) The mean increases and the standard deviation decreases
: (e) The mean and standard deviation both decrease

* Suppose you draw one card from a standard deck three times, with replacement. What is the probability that you get spades all three times? Choose one answer.
: (a) 0.002
: (b) 0.321
: (c) 0.015
: (d) 0.021

* Suppose the number of cars that enter a parking lot in an hour is a Poisson random variable, and suppose that P(X=0)=0.05. Determine the variance of X.
: (a) 0.349
: (b) 3.232
: (c) 9.321
: (d) 2.996

* A researcher converts 100 lung capacity measurements to z-scores. The lung capacity measurements do not follow a normal distribution. What can we say about the standard deviation of the 100 z-scores?
: (a) It depends on the standard deviation of the raw scores
: (b) It equals 1
: (c) It equals 100
: (d) It must always be less than the standard deviation of the raw scores
: (e) It depends on the shape of the raw score distribution

* Among first-year students at a certain university, scores on the verbal SAT follow the normal curve. The average is around 500 and the SD is about 100. Tatiana took the SAT and placed at the 85th percentile. What was her verbal SAT score?
: (a) 604
: (b) 560
: (c) 90
: (d) 403

* A researcher took a random sample of 100 orc soldiers and found the mean and the standard deviation of their weights to be 200 lbs and 20 lbs, respectively. He can be 68% confident that the mean weight in the population of orc soldiers is between
: (a) 196 to 204 lbs
: (b) 198 to 202 lbs
: (c) 194 to 206 lbs
: (d) None of the above

* The Rockwell hardness of certain metal pins is known to have a mean of 50 and a standard deviation of 1.5. If the distribution of all such pin hardness measurements is known to be normal, what is the probability that the average hardness for a random sample of nine pins is at least 50.5?
: (a) Approximately 4
: (b) 0.4
: (c) Approximately 0.1587
: (d) Approximately 0

* You read that the heights of college women are nearly normal with a mean of 65 inches and a standard deviation of 2 inches. If Vanessa is at the 10th percentile (shortest 10% for women) in height for college women, then her height is closest to:
: (a) 64.5 inches
: (b) It cannot be determined from this information
: (c) 60.5 inches
: (d) 62.44 inches

* The settlement (in cm) of a structure shown in the following figure may be evaluated from S = 0.3A + 0.2B + 0.1C,
<center>[[Image:SMHS_ProbabDist_Fig3.png]]</center>
: where A, B, and C are respectively the thickness (in m) of the three layers of soil as shown. Suppose A, B, and C are modeled as independent normal random variables as: $A \sim N(5,1)$, $B \sim N(8,2)$, $C \sim N(7,1)$,
: (a) Determine the probability that the settlement will exceed 4 cm.
: (b) If the total thickness of the three layers is known exactly as 20 m; and furthermore, thicknesses A and B are correlated with correlation coefficient equal to 0.5, determine the probability that the settlement will exceed 4 cm.

* Suppose that the distribution of X in the population is strongly skewed to the left. If you took 200 independent and random samples of size 3 from this population, calculated the mean for each of the 200 samples, and drew the distribution of the sample means, what would the sampling distribution of the means look like?
: (a) It will be perfectly normal and the mean will be equal to the median.
: (b) It will be close to the normal and the mean will be close to the median.
: (c) On a p-plot, most of the points will be on the line.
: (d) It will be skewed to the left and the mean will be less than the median.

* A polling agency has been hired to predict the proportion of voters who favor a certain candidate. The polling agency picks a random sample of 1000 voters of which 400 indicate that they favor the candidate. If they increase the sample size to 2000, how does the standard error change?
: (a) The standard error will decrease by one-fourth
: (b) The standard error will not change; the margin of error changes
: (c) Since the sample size is doubled, the standard error will be halved
: (d) The standard error will decrease not by a factor of 1/2 but by a factor of the square root of 1/2

* The probability of winning a certain instant scratch-n-win game is 0.02. You play the game 80 times. Find the probability that you win 3 times.
: (a) 0.2983
: (b) 0.1378
: (c) 0.3231
: (d) 0.2391

* After firing 1000 boxes of ammunition, a certain handgun jammed according to a Poisson distribution with a mean of 0.4 per box of ammunition. Approximate the probability that more than 350 boxes of ammunition contain some that is jammed.
: (a) .0032
: (b) .0012
: (c) .0231
: (d) .0089

===References===
* [http://wiki.socr.umich.edu/index.php/Probability_and_statistics_EBook#Chapter_IV:_Probability_Distributions SOCR EBook Probability Chapter]
* [http://en.wikipedia.org/wiki/Poisson_distribution Poisson Distribution Wikipedia]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ProbabilityDistributions}}

SMHS OR RR

2014-08-31T13:37:34Z

Jslavine: /* Problems */

==[[SMHS| Scientific Methods for Health Sciences]] - Odds Ratio and Relative Risk ==

===Overview===
The ''relative risk'' is a measure of dependence that allows us to compare two probabilities in terms of their ratio $ \frac{p_1}{p_2} $ rather than their difference
$(p_1 - p_2) $. Relative risk is a commonly used measure in public health studies. Another way to compare two probabilities is in terms of the odds. If an event takes place with probability $p$, then the odds of the event occurring are $ \frac{p}{1 - p} $. The ''odds ratio'' is the ratio of the odds corresponding to the two probabilities being compared.

===Motivation===
Suppose we study brain cancer in the context of cell phone use. The table below illustrates some simulated data. One clear healthcare question in this case-study could be: ''Is cell phone use associated with a higher incidence of brain cancer?'' To address this question, we can look at the relative risk of brain cancer in people who use cell phones.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Brain Cancer (BC)''' ||rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Cell Phone (CP)''' || '''Yes''' || 18 || 80 || 98 '''(B)'''
|-
| '''No''' ||7 || 95 || 102 '''(C)'''
|-
| colspan=2|'''Total''' || 25 || 175 || 200
|}
</center>

First, we compute the (conditional!) probabilities (P) of brain cancer (BC) given either cell phone use, P1, or no cell-phone use, P2. We can then form their ratio to determine whether the relative risk of brain cancer (BC) is higher in cell phone users (CP) than in non-users (NCP).

$$ P_1 = P(BC|CP) = \dfrac {18}{98} = 0.184 $$

$$ P_2= P(BC|NCP) = \dfrac {7} {102} = 0.069 $$

Therefore, the relative risk of brain cancer in cell phone users is:
$$ RR= \frac{P(BC|CP)}{P(BC|NCP)} = \frac {18/98}{7/102} \approx 2.68.$$

The risk of having brain cancer is more than 2.5 times greater among cell phone users compared to non-cell phone users.

For the same example, the odds ratio (OR) of brain cancer relative to cell-phone use is:

$$ OR = \frac{\frac{P \left( BC \mid CP \right)}{1 - P \left( BC \mid CP \right)}}{\frac{P \left( BC \mid NCP \right)}{1 - P \left( BC \mid NCP \right)}}
= \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} = \frac{18 \times 95}{80 \times 7} \approx 3.05 $$

Thus, the odds of having brain cancer are about 3 times greater for cell phone users than for non-cell phone users.
We could instead have compared the odds of owning a cell phone given that a patient had brain cancer (i.e., the column-wise probabilities), $ P(CP|BC) = 18/25 = 0.72 $ versus $ P(CP|NBC) = 80/175 = 0.457 $. However, this does not seem as important scientifically.
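The relative risk and odds ratio for the cell phone example can be reproduced step by step from the table counts. This is a minimal sketch; the variable names are illustrative, and the computation keeps full precision rather than rounding the intermediate probabilities.

```python
# Counts from the cell phone / brain cancer table above.
bc_cp, no_bc_cp = 18, 80      # cell phone users: cancer, no cancer
bc_ncp, no_bc_ncp = 7, 95     # non-users: cancer, no cancer

# Row-wise conditional probabilities of brain cancer.
p1 = bc_cp / (bc_cp + no_bc_cp)       # P(BC | CP)  = 18/98
p2 = bc_ncp / (bc_ncp + no_bc_ncp)    # P(BC | NCP) = 7/102

relative_risk = p1 / p2

odds_cp = p1 / (1 - p1)       # odds of BC among users
odds_ncp = p2 / (1 - p2)      # odds of BC among non-users
odds_ratio = odds_cp / odds_ncp
```

Rounding the probabilities to three decimals before dividing shifts the last reported digit slightly, which is why unrounded intermediates are used here.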

===Theory===
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2|Factor 1|| rowspan=2|Total
|-
|Yes||No
|-
| rowspan=2|Factor 2||Yes||$n_{1,1}$||$n_{1,2}$||$n_{1,1} + n_{1,2}$
|-
|No||$n_{2,1}$||$n_{2,2}$||$n_{2,1} + n_{2,2}$
|-
| colspan=2|Total||$n_{1,1} + n_{2,1}$||$n_{1,2} + n_{2,2}$||$N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$
|}
</center>

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$

====Interpretation====
* '''RR''': In general, the measure relative risk (RR) is interpreted as follows:
**RR = 1 indicates that the probabilities of two events are the same.
**RR > 1 implies that there is an increased risk.
**RR < 1 implies that there is a decreased risk.

* '''OR'''
**If event $A|B$ has probability $p = 1/2$, then the odds are $\frac{1/2}{1/2}=1$, or $1:1$, or 1 to 1. This means the probability that event A|B occurs is equal to the probability that it does not occur.
**If event $A|C$ has probability $p = 3/4$, then the odds are $\frac{3/4}{1/4}= 3$, or 3 to 1. The probability that the event A|C occurs is three times as large as the probability that it does not occur.
**Similarly, if event $A|D$ has probability $p = 1/4$, then the odds are $\frac {1/4}{3/4}=\frac {1}{3}$, or 1 to 3. The probability that the event A|D occurs is one third of the probability that it does not occur.

*'''RR vs. OR'''
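**The formula and reasoning for the relative risk are a little easier to follow. The OR and RR are roughly equal only when the outcome is rare in both groups; in the cell-phone example above they differ noticeably (RR = 2.67 vs. OR = 3.04).
**Odds ratios have an advantage over relative risk because they take the same value whether the comparison is made row-wise or column-wise.
**Relative risk can be estimated directly in a cohort study, but not in a case-control design, where the numbers of cases and controls are fixed by the investigator.
**The odds ratio is related to the relative risk by $ OR = RR \times \frac{1-P_2} {1-P_1}$, so the OR approximates the RR when both $P_1$ and $P_2$ are small.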

====Inference====
* ''Inference on OR'': In practice, we commonly report ORs along with their confidence intervals (CIs). The sampling distribution of the OR is not normal; however, the ''log-transformed OR is approximately normally distributed'', and the standard error of $ ln(OR)$ is:

$$ SE(ln(OR))= \sqrt{\frac {1} {n_{1,1}}+ \frac {1} {n_{1,2}} + \frac {1} {n_{2,1}} + \frac{1} {n_{2,2}}}.$$

Thus, if $\alpha$ is the false-positive (i.e., Type I) error rate, the $ (1-\alpha)100\% $ CI of the log-transformed OR can be computed by:
$$ln(OR) \pm z_{\frac{\alpha}{2}} SE(ln(OR)),$$

where the OR point estimate is $OR = \frac {n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}$, and the standard error of the log-transformed OR is listed above (i.e., $SE(ln(OR))$).

You can use the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html SOCR Student’s T-distribution calculator] to compute the critical value $z_{\frac{\alpha}{2}}$ for a given false-positive rate, $\alpha$; with a large number of degrees of freedom, the T-distribution is practically identical to the standard normal distribution.

NOTE: Remember that once you find the lower, $L=ln(OR)-z_{\frac{\alpha}{2}}SE(ln(OR))$, and upper, $U=ln(OR)+z_{\frac{\alpha}{2}}SE(ln(OR))$, limits of the $ln(OR)$ confidence interval, these limits are on the log scale. To convert them into real OR terms, you need to invert the log transform (i.e., apply the exponential function). Thus, the $ CI(OR) $ would be: $(e^L,e^U)$.
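
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>odds_ratio_ci</code> is hypothetical, the critical value is taken from the standard normal distribution via the standard library, and the counts are those of the simulated cell-phone table (18, 80, 7, 95). With these counts, the 95% interval comes out at roughly (1.2, 7.7), which excludes 1.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(n11, n12, n21, n22, alpha=0.05):
    """Point estimate and (1 - alpha) CI for the odds ratio of a 2x2 table.

    Hypothetical helper: the interval is built on the log scale and then
    mapped back with the exponential function, as described in the text.
    """
    or_hat = (n11 * n22) / (n12 * n21)
    # Standard error of ln(OR): square root of the sum of reciprocal counts
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    # Two-sided critical value z_{alpha/2} of the standard normal
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower_log = math.log(or_hat) - z * se_log
    upper_log = math.log(or_hat) + z * se_log
    return or_hat, math.exp(lower_log), math.exp(upper_log)


# Simulated cell-phone / brain-cancer counts from the Motivation table
or_hat, ci_low, ci_high = odds_ratio_ci(18, 80, 7, 95)
print(f"OR = {or_hat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Because the interval is symmetric on the log scale, it is not symmetric around the OR itself: the upper limit lies much further from the point estimate than the lower limit.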

===Applications===
* [http://dx.doi.org/10.1016/j.ijnurstu.2012.11.014 This article] retrospectively studies the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance is moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events is mediated by surveillance.

: The study demonstrates that one additional full-time registered nurse equivalent per day reduced the odds of in-hospital mortality, respiratory failure, pneumonia, and failure to rescue; the greatest cost-benefit was found in adult surgical patients. Table 4 of the results shows the OR and CI(OR). Interpret the findings.

<center>Predictors of adverse events as shown in final logistic regression analysis.</center>
<center>
{| class="wikitable" style="text-align:left; width:75%" border="1"
|-
| '''Factors''' || '''β (S.E.)''' || '''p-Value''' || '''Odds ratio [95% CI]'''
|-
| '''Staffing''' || −0.41 (0.33) || 0.219 || 0.66 [0.35, 1.28]
|-
|'''American Society of Anesthesiologists Physical Status''' ||0.94 (0.39) ||0.017 || 2.57 [1.88, 5.55]
|-
|'''Comorbidity''' || 0.57 (0.43) || 0.189 || 1.76 [0.76, 4.12]
|-
| '''Perioperative complication'''|| 0.64 (0.22)|| 0.003 || 1.90 [1.24, 2.92]
|-
|'''Interaction staffing × surveillance''' || −1.04 (0.42) || 0.012 ||0.354 [0.157, 0.798]
|}
</center>

* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ This article] investigates whether hospitals with well-organized care (e.g., improved nurse staffing and work environments) provide better patient care and nurse workforce stability in European countries and the United States. It uses data from 488 clinics in 12 European countries and 617 in the US. It is based on 33,659 nurses and 11,318 patients in Europe and 27,509 nurses and more than 120,000 patients in the US.

: The authors found, among other things, that nurses in hospitals with better work environments were approximately half as likely to (a) report poor or fair quality of care (Europe, adjusted odds ratio 0.56, 95% confidence interval 0.51 to 0.61; US, 0.54, 0.51 to 0.58) and (b) give their hospitals poor or failing grades on patient safety (0.50, 0.44 to 0.56 EU; 0.55, 0.50 to 0.61 US).

: Interpret the results in the table below. Note that in this nurse outcomes study, the [[SMHS_OR_RR#References|authors adjusted the regression estimates]] (odds ratios) at the hospital level for differences in the composition of nurses between hospitals and between countries (i.e., age, sex, full time employment status, and specialty) using a multilevel model structure in which nurses were nested within hospitals and countries.

<center> Effects of nurse staffing and practice environment on nurse outcomes in study countries. </center>

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-

| rowspan="2" |'''Nurse Outcome''' || colspan="2" |'''Europe''' || || colspan="2" |'''US'''
|-
| '''Unadjusted odds ratio (95% CI)''' ||'''Adjusted odds ratio (95% CI)''' || || '''Unadjusted odds ratio (95% CI)''' || '''Adjusted odds ratio (95% CI)'''
|-
| colspan="6" |'''Poor or fair quality of care in ward'''
|-
|ROWSPAN="2" style="text-align: center;" |'''Practice environment''' || 0.58 || 0.56 || || 0.52 || 0.54
|-
|(0.53 to 0.63)|| (0.51 to 0.61) || || (0.49 to 0.56) || (0.51 to 0.58)
|-
|rowspan="2" |'''Staffing''' || 1.11 || 1.11 || || 1.2 || 1.06
|-
| (1.08 to 1.13) ||(1.07 to 1.15) || || (1.16 to 1.25) || (1.03 to 1.1)
|-
| colspan="6" |'''Poor or failing grade on patient safety'''
|-
|rowspan="2" |'''Practice environment''' || 0.5|| 0.5 || || 0.53 || 0.55
|-
|(0.43 to 0.57) ||(0.44 to 0.56) || || (0.48 to 0.59) ||(0.5 to 0.61)
|-
|rowspan="2"| '''Staffing''' || 1.04 || 1.1 || || 1.18 ||1.05
|-
| (1.01 to 1.08) || (1.05 to 1.16) || || (1.12 to 1.23) ||(1 to 1.1)
|-
|colspan="6" |'''Burnout'''
|-
|rowspan="2"|'''Practice environment''' || 0.69|| 0.67|| || 0.69 || 0.71
|-
|(0.63 to 0.76)||(0.61 to 0.73)|| || (0.66 to 0.73) ||(0.68 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.06 || 1.05|| || 1.12 ||1.03
|-
| (1.04 to 1.08)||(1.02 to 1.09)|| ||(1.08 to 1.15)||(1 to 1.06)
|-
| colspan="6" |'''Job dissatisfaction'''
|-
|rowspan="2"|'''Practice environment''' ||0.63|| 0.52 || || 0.58 || 0.6
|-
|(0.57 to 0.69) || (0.47 to 0.57) || ||(0.55 to 0.61) ||(0.57 to 0.64)
|-
|rowspan="2"|'''Staffing''' || 1.1 || 1.07 || || 1.17 || 1.06
|-
| (1.08 to 1.12) || (1.04 to 1.11) || || (1.13 to 1.21) ||(1.03 to 1.09)
|-
| colspan="6" |'''Intention to leave in the next year'''
|-
|rowspan="2"|'''Practice environment''' || 0.72 || 0.61 || || 0.7 ||0.69
|-
|(0.66 to 0.79) ||(0.56 to 0.67)|| || (0.65 to 0.76) ||(0.64 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.04 || 1.05 || || 1.1 || 1.03
|-
| (1.01 to 1.06)|| (1.02 to 1.09)|| || (1.05 to 1.15)||(0.98 to 1.08)
|-
| colspan="6" |'''Not confident that patients can manage own care after hospital discharge'''
|-
|rowspan="2"|'''Practice environment''' || 0.62 || 0.73 || || 0.71 || 0.72
|-
|(0.56 to 0.69)||(0.69 to 0.78) || ||(0.67 to 0.75) ||(0.68 to 0.77)
|-
|rowspan="2"|'''Staffing''' || 1.08 || 1.03 || ||1.1 || 1.04
|-
| (1.05 to 1.11) ||(1 to 1.05) || || (1.06 to 1.13) || (1.01 to 1.07)
|-
| colspan="6" |'''Not confident that hospital management would resolve patients’ problems'''
|-
|rowspan="2"|'''Practice environment''' || 0.5 || 0.53 || || 0.56 || 0.56
|-
| (0.46 to 0.54) ||(0.48 to 0.58)|| ||(0.53 to 0.59) ||(0.54 to 0.59)
|-
|rowspan="2"|'''Staffing''' || 1.04 ||1.02 || || 1.12 || 1.01
|-
|(1.01 to 1.07) ||(0.98 to 1.06) || ||(1.09 to 1.17) ||(0.98 to 1.03)
|}
</center>

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html T Distribution Calculator]

*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Distribution Applets]

===Problems===
Formulate some clinically relevant questions in terms of the OR and RR, and try to answer them in the following situations. Interpret the results. E.g., the estimate of the relative risk of a heart attack is approximately ____ for those who smoke versus those who do not smoke. Compute the CI of the OR.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Heart Attack (HA)''' || rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Smoking (S)''' || '''Yes''' || 33 || 18 || 51
|-
| '''No''' ||167 || 182 || 349
|-
| colspan=2|'''Total''' ||200 || 200 || 400
|}
</center>

===References===
* [http://www.sciencedirect.com/science/article/pii/S0378375812001954 Reducing bias and mean squared error associated with regression-based odds ratio estimators]
* [http://www.sciencedirect.com/science/article/pii/S0020748912004166 Nursing surveillance moderates the relationship between staffing levels and pediatric postoperative serious adverse events: A nested case–control study]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ Patient safety, satisfaction, and quality of hospital care: cross sectional surveys of nurses and patients in 12 countries in Europe and the United States]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_OR_RR}}

SMHS OR RR

2014-08-31T13:35:08Z

Jslavine: /* Applications */

==[[SMHS| Scientific Methods for Health Sciences]] - Odds Ratio and Relative Risk ==

===Overview===
The ''relative risk'' is a measure of dependence that allows us to compare two probabilities in terms of their ratio $ \frac{p_1}{p_2} $ rather than their difference $(p_1 - p_2)$. Relative risk is a commonly used measure in public health studies. Another way to compare two probabilities is in terms of the odds. If an event takes place with probability $p$, then the odds of the event occurring are $ \frac{p}{1 - p} $. The ''odds ratio'' is the ratio of the odds corresponding to the two probabilities being compared, $ \frac{p_1/(1-p_1)}{p_2/(1-p_2)} $.

===Motivation===
Suppose we study brain cancer in the context of cell phone use. The table below illustrates some simulated data. One clear healthcare question in this case-study could be: ''Is cell phone use associated with a higher incidence of brain cancer?'' To address this question, we can look at the relative risk of brain cancer in people who use cell phones.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Brain Cancer (BC)''' ||rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Cell Phone (CP)''' || '''Yes''' || 18 || 80 || 98 '''(B)'''
|-
| '''No''' ||7 || 95 || 102 '''(C)'''
|-
| colspan=2|'''Total''' || 25 || 175 || 200
|}
</center>

First, we compute the (conditional!) probabilities (P) of brain cancer (BC) given either cell phone use, P1, or no cell-phone use, P2. We can then form their ratio to determine whether the relative risk of brain cancer (BC) is higher in cell phone users (CP) than in non-users (NCP).

$$ P_1 = P(BC|CP) = \dfrac {18}{98} = 0.184 $$

$$ P_2= P(BC|NCP) = \dfrac {7} {102} = 0.069 $$

Therefore, the relative risk of brain cancer in cell phone users is:
$$ RR= \frac{P(BC|CP)}{P(BC|NCP)} = \frac {0.184}{0.069} = 2.67.$$

The risk of having brain cancer is more than 2.5 times greater among cell phone users compared to non-cell phone users.

For the same example, the odds ratio (OR) of brain cancer relative to cell-phone use is:

$$ OR = \frac{\frac{P \left( BC \mid CP \right)}{1 - P \left( BC \mid CP \right)}}{\frac{P \left( BC \mid NCP \right)}{1 - P \left( BC \mid NCP \right)}}
= \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} =\frac{\frac{0.184}{0.816}}{\frac{0.069}{0.931}} = 3.04 $$

Thus, the odds of having brain cancer are about 3 times greater for cell phone users than for non-cell phone users.
We could have compared the odds of owning a cell phone given that a patient had brain cancer (i.e., the column-wise probabilities), $ P(CP|BC) = 18/25 = 0.72 $ versus $ P(CP|NBC) = 80/175 = 0.457 $. However, this does not seem as important scientifically.
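
As a minimal sketch (Python, not part of the original SOCR materials), both measures can be computed directly from the four cell counts of the table above. The function names <code>relative_risk</code> and <code>odds_ratio</code> are illustrative. Because the counts are not rounded first, the results (about 2.68 and 3.05) differ slightly in the second decimal from the hand calculations above, which start from the rounded probabilities 0.184 and 0.069.

```python
def relative_risk(n11, n12, n21, n22):
    """RR for a 2x2 table whose rows are exposure groups and whose
    first column counts the subjects with the outcome."""
    p1 = n11 / (n11 + n12)  # P(outcome | exposed), e.g. P(BC | CP)
    p2 = n21 / (n21 + n22)  # P(outcome | unexposed), e.g. P(BC | NCP)
    return p1 / p2


def odds_ratio(n11, n12, n21, n22):
    """OR as the cross-product ratio of the four cell counts."""
    return (n11 * n22) / (n12 * n21)


# Simulated counts: cell phone users (18 BC, 80 no BC), non-users (7 BC, 95 no BC)
rr = relative_risk(18, 80, 7, 95)
or_hat = odds_ratio(18, 80, 7, 95)
print(f"RR = {rr:.2f}, OR = {or_hat:.2f}")
```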

===Theory===
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2|Factor 1|| rowspan=2|Total
|-
|Yes||No
|-
| rowspan=2|Factor 2||Yes||$n_{1,1}$||$n_{1,2}$||$n_{1,1} + n_{1,2}$
|-
|No||$n_{2,1}$||$n_{2,2}$||$n_{2,1} + n_{2,2}$
|-
| colspan=2|Total||$n_{1,1} + n_{2,1}$||$n_{1,2} + n_{2,2}$||$N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$
|}
</center>

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$

====Interpretation====
* '''RR''': In general, the measure relative risk (RR) is interpreted as follows:
**RR = 1 indicates that the probabilities of two events are the same.
**RR > 1 implies that there is an increased risk.
**RR < 1 implies that there is a decreased risk.

* '''OR'''
**If event $A|B$ has probability $p = 1/2$, then the odds are $\frac{1/2}{1/2}=1$, or $1:1$, or 1 to 1. This means the probability that event A|B occurs is equal to the probability that it does not occur.
**If event $A|C$ has probability $p = 3/4$, then the odds are $\frac{3/4}{1/4}= 3$, or 3 to 1. The probability that the event A|C occurs is three times as large as the probability that it does not occur.
**Similarly, if event $A|D$ has probability $p = 1/4$, then the odds are $\frac {1/4}{3/4}=\frac {1}{3}$, or 1 to 3. The probability that the event A|D occurs is one third of the probability that it does not occur.

*'''RR vs. OR'''
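**The formula and reasoning for the relative risk are a little easier to follow. The OR and RR are roughly equal only when the outcome is rare in both groups; in the cell-phone example above they differ noticeably (RR = 2.67 vs. OR = 3.04).
**Odds ratios have an advantage over relative risk because they take the same value whether the comparison is made row-wise or column-wise.
**Relative risk can be estimated directly in a cohort study, but not in a case-control design, where the numbers of cases and controls are fixed by the investigator.
**The odds ratio is related to the relative risk by $ OR = RR \times \frac{1-P_2} {1-P_1}$, so the OR approximates the RR when both $P_1$ and $P_2$ are small.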

====Inference====
* ''Inference on OR'': In practice, we commonly report ORs along with their confidence intervals (CIs). The sampling distribution of the OR is not normal; however, the ''log-transformed OR is approximately normally distributed'', and the standard error of $ ln(OR)$ is:

$$ SE(ln(OR))= \sqrt{\frac {1} {n_{1,1}}+ \frac {1} {n_{1,2}} + \frac {1} {n_{2,1}} + \frac{1} {n_{2,2}}}.$$

Thus, if $\alpha$ is the false-positive (i.e., Type I) error rate, the $ (1-\alpha)100\% $ CI of the log-transformed OR can be computed by:
$$ln(OR) \pm z_{\frac{\alpha}{2}} SE(ln(OR)),$$

where the OR point estimate is $OR = \frac {n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}$, and the standard error of the log-transformed OR is listed above (i.e., $SE(ln(OR))$).

You can use the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html SOCR Student’s T-distribution calculator] to compute the critical value $z_{\frac{\alpha}{2}}$ for a given false-positive rate, $\alpha$; with a large number of degrees of freedom, the T-distribution is practically identical to the standard normal distribution.

NOTE: Remember that once you find the lower, $L=ln(OR)-z_{\frac{\alpha}{2}}SE(ln(OR))$, and upper, $U=ln(OR)+z_{\frac{\alpha}{2}}SE(ln(OR))$, limits of the $ln(OR)$ confidence interval, these limits are on the log scale. To convert them into real OR terms, you need to invert the log transform (i.e., apply the exponential function). Thus, the $ CI(OR) $ would be: $(e^L,e^U)$.

===Applications===
* [http://dx.doi.org/10.1016/j.ijnurstu.2012.11.014 This article] retrospectively studies the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance is moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events is mediated by surveillance.

: The study demonstrates that one additional full-time registered nurse equivalent per day reduced the odds of in-hospital mortality, respiratory failure, pneumonia, and failure to rescue; the greatest cost-benefit was found in adult surgical patients. Table 4 of the results shows the OR and CI(OR). Interpret the findings.

<center>Predictors of adverse events as shown in final logistic regression analysis.</center>
<center>
{| class="wikitable" style="text-align:left; width:75%" border="1"
|-
| '''Factors''' || '''β (S.E.)''' || '''p-Value''' || '''Odds ratio [95% CI]'''
|-
| '''Staffing''' || −0.41 (0.33) || 0.219 || 0.66 [0.35, 1.28]
|-
|'''American Society of Anesthesiologists Physical Status''' ||0.94 (0.39) ||0.017 || 2.57 [1.88, 5.55]
|-
|'''Comorbidity''' || 0.57 (0.43) || 0.189 || 1.76 [0.76, 4.12]
|-
| '''Perioperative complication'''|| 0.64 (0.22)|| 0.003 || 1.90 [1.24, 2.92]
|-
|'''Interaction staffing × surveillance''' || −1.04 (0.42) || 0.012 ||0.354 [0.157, 0.798]
|}
</center>
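
In a logistic regression table such as this one, each odds ratio is the exponential of the corresponding coefficient, $OR = e^{\beta}$, and an approximate 95% CI is $e^{\beta \pm 1.96 \times SE}$. The sketch below illustrates this relation for the Staffing row. It assumes a Wald-type interval with $z = 1.96$; the article may have used a different method, and the helper name <code>or_from_logit</code> is hypothetical. The result agrees with the reported 0.66 [0.35, 1.28] up to rounding of the published β and S.E.

```python
import math


def or_from_logit(beta, se, z=1.96):
    """Odds ratio and approximate CI from a logistic regression
    coefficient and its standard error (Wald interval; z is an assumption)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Staffing row of the table: beta = -0.41, S.E. = 0.33
or_hat, ci_low, ci_high = or_from_logit(-0.41, 0.33)
print(f"OR = {or_hat:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

A CI that contains 1 (as for Staffing and Comorbidity) corresponds to a coefficient that is not significantly different from 0 at the 5% level, which matches the p-values above 0.05 in those rows.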

* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ This article] investigates whether hospitals with well-organized care (e.g., improved nurse staffing and work environments) provide better patient care and nurse workforce stability in European countries and the United States. It uses data from 488 clinics in 12 European countries and 617 in the US. It is based on 33,659 nurses and 11,318 patients in Europe and 27,509 nurses and more than 120,000 patients in the US.

: The authors found, among other things, that nurses in hospitals with better work environments were approximately half as likely to (a) report poor or fair quality of care (Europe, adjusted odds ratio 0.56, 95% confidence interval 0.51 to 0.61; US, 0.54, 0.51 to 0.58) and (b) give their hospitals poor or failing grades on patient safety (0.50, 0.44 to 0.56 EU; 0.55, 0.50 to 0.61 US).

: Interpret the results in the table below. Note that in this nurse outcomes study, the [[SMHS_OR_RR#References|authors adjusted the regression estimates]] (odds ratios) at the hospital level for differences in the composition of nurses between hospitals and between countries (i.e., age, sex, full time employment status, and specialty) using a multilevel model structure in which nurses were nested within hospitals and countries.

<center> Effects of nurse staffing and practice environment on nurse outcomes in study countries. </center>

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-

| rowspan="2" |'''Nurse Outcome''' || colspan="2" |'''Europe''' || || colspan="2" |'''US'''
|-
| '''Unadjusted odds ratio (95% CI)''' ||'''Adjusted odds ratio (95% CI)''' || || '''Unadjusted odds ratio (95% CI)''' || '''Adjusted odds ratio (95% CI)'''
|-
| colspan="6" |'''Poor or fair quality of care in ward'''
|-
|ROWSPAN="2" style="text-align: center;" |'''Practice environment''' || 0.58 || 0.56 || || 0.52 || 0.54
|-
|(0.53 to 0.63)|| (0.51 to 0.61) || || (0.49 to 0.56) || (0.51 to 0.58)
|-
|rowspan="2" |'''Staffing''' || 1.11 || 1.11 || || 1.2 || 1.06
|-
| (1.08 to 1.13) ||(1.07 to 1.15) || || (1.16 to 1.25) || (1.03 to 1.1)
|-
| colspan="6" |'''Poor or failing grade on patient safety'''
|-
|rowspan="2" |'''Practice environment''' || 0.5|| 0.5 || || 0.53 || 0.55
|-
|(0.43 to 0.57) ||(0.44 to 0.56) || || (0.48 to 0.59) ||(0.5 to 0.61)
|-
|rowspan="2"| '''Staffing''' || 1.04 || 1.1 || || 1.18 ||1.05
|-
| (1.01 to 1.08) || (1.05 to 1.16) || || (1.12 to 1.23) ||(1 to 1.1)
|-
|colspan="6" |'''Burnout'''
|-
|rowspan="2"|'''Practice environment''' || 0.69|| 0.67|| || 0.69 || 0.71
|-
|(0.63 to 0.76)||(0.61 to 0.73)|| || (0.66 to 0.73) ||(0.68 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.06 || 1.05|| || 1.12 ||1.03
|-
| (1.04 to 1.08)||(1.02 to 1.09)|| ||(1.08 to 1.15)||(1 to 1.06)
|-
| colspan="6" |'''Job dissatisfaction'''
|-
|rowspan="2"|'''Practice environment''' ||0.63|| 0.52 || || 0.58 || 0.6
|-
|(0.57 to 0.69) || (0.47 to 0.57) || ||(0.55 to 0.61) ||(0.57 to 0.64)
|-
|rowspan="2"|'''Staffing''' || 1.1 || 1.07 || || 1.17 || 1.06
|-
| (1.08 to 1.12) || (1.04 to 1.11) || || (1.13 to 1.21) ||(1.03 to 1.09)
|-
| colspan="6" |'''Intention to leave in the next year'''
|-
|rowspan="2"|'''Practice environment''' || 0.72 || 0.61 || || 0.7 ||0.69
|-
|(0.66 to 0.79) ||(0.56 to 0.67)|| || (0.65 to 0.76) ||(0.64 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.04 || 1.05 || || 1.1 || 1.03
|-
| (1.01 to 1.06)|| (1.02 to 1.09)|| || (1.05 to 1.15)||(0.98 to 1.08)
|-
| colspan="6" |'''Not confident that patients can manage own care after hospital discharge'''
|-
|rowspan="2"|'''Practice environment''' || 0.62 || 0.73 || || 0.71 || 0.72
|-
|(0.56 to 0.69)||(0.69 to 0.78) || ||(0.67 to 0.75) ||(0.68 to 0.77)
|-
|rowspan="2"|'''Staffing''' || 1.08 || 1.03 || ||1.1 || 1.04
|-
| (1.05 to 1.11) ||(1 to 1.05) || || (1.06 to 1.13) || (1.01 to 1.07)
|-
| colspan="6" |'''Not confident that hospital management would resolve patients’ problems'''
|-
|rowspan="2"|'''Practice environment''' || 0.5 || 0.53 || || 0.56 || 0.56
|-
| (0.46 to 0.54) ||(0.48 to 0.58)|| ||(0.53 to 0.59) ||(0.54 to 0.59)
|-
|rowspan="2"|'''Staffing''' || 1.04 ||1.02 || || 1.12 || 1.01
|-
|(1.01 to 1.07) ||(0.98 to 1.06) || ||(1.09 to 1.17) ||(0.98 to 1.03)
|}
</center>

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html T Distribution Calculator]

*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Distribution Applets]

===Problems===
Formulate some clinically relevant questions in terms of the OR and RR, and try to answer them in the following situations. Interpret the results. E.g., the estimated risk of a heart attack is about ____ times as great for those who smoke as for those who do not smoke. Compute the CI of the OR.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Heart Attack (HA)''' || rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Smoking (S)''' || '''Yes''' || 33 || 18 || 51
|-
| '''No''' ||167 || 182 || 349
|-
| colspan=2|'''Total''' ||200 || 200 || 400
|}
</center>

===References===
* [http://www.sciencedirect.com/science/article/pii/S0378375812001954 Reducing bias and mean squared error associated with regression-based odds ratio estimators]
* [http://www.sciencedirect.com/science/article/pii/S0020748912004166 Nursing surveillance moderates the relationship between staffing levels and pediatric postoperative serious adverse events: A nested case–control study]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ Patient safety, satisfaction, and quality of hospital care: cross sectional surveys of nurses and patients in 12 countries in Europe and the United States]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_OR_RR}}

SMHS OR RR

2014-08-31T13:26:15Z

Jslavine: /* Inference */

==[[SMHS| Scientific Methods for Health Sciences]] - Odds Ratio and Relative Risk ==

===Overview===
The ''relative risk'' is a measure of dependence that allows us to compare two probabilities in terms of their ratio $ \frac{p_1}{p_2} $ rather than their difference $(p_1 - p_2)$. Relative risk is a commonly used measure in public health studies. Another way to compare two probabilities is in terms of the odds. If an event takes place with probability $p$, then the odds of the event occurring are $ \frac{p}{1 - p} $. The ''odds ratio'' is the ratio of the odds corresponding to the two probabilities being compared, $ \frac{p_1/(1-p_1)}{p_2/(1-p_2)} $.

===Motivation===
Suppose we study brain cancer in the context of cell phone use. The table below illustrates some simulated data. One clear healthcare question in this case-study could be: ''Is cell phone use associated with a higher incidence of brain cancer?'' To address this question, we can look at the relative risk of brain cancer in people who use cell phones.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Brain Cancer (BC)''' ||rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Cell Phone (CP)''' || '''Yes''' || 18 || 80 || 98 '''(B)'''
|-
| '''No''' ||7 || 95 || 102 '''(C)'''
|-
| colspan=2|'''Total''' || 25 || 175 || 200
|}
</center>

First, we compute the (conditional!) probabilities (P) of brain cancer (BC) given either cell phone use, P1, or no cell-phone use, P2. We can then form their ratio to determine whether the relative risk of brain cancer (BC) is higher in cell phone users (CP) than in non-users (NCP).

$$ P_1 = P(BC|CP) = \dfrac {18}{98} = 0.184 $$

$$ P_2= P(BC|NCP) = \dfrac {7} {102} = 0.069 $$

Therefore, the relative risk of brain cancer in cell phone users is:
$$ RR= \frac{P(BC|CP)}{P(BC|NCP)} = \frac {0.184}{0.069} = 2.67.$$

The risk of having brain cancer is more than 2.5 times greater among cell phone users compared to non-cell phone users.

For the same example, the odds ratio (OR) of brain cancer relative to cell-phone use is:

$$ OR = \frac{\frac{P \left( BC \mid CP \right)}{1 - P \left( BC \mid CP \right)}}{\frac{P \left( BC \mid NCP \right)}{1 - P \left( BC \mid NCP \right)}}
= \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} =\frac{\frac{0.184}{0.816}}{\frac{0.069}{0.931}} = 3.04 $$

Thus, the odds of having brain cancer are about 3 times greater for cell phone users than for non-cell phone users.
We could have compared the odds of owning a cell phone given that a patient had brain cancer (i.e., the column-wise probabilities), $ P(CP|BC) = 18/25 = 0.72 $ versus $ P(CP|NBC) = 80/175 = 0.457 $. However, this does not seem as important scientifically.

===Theory===
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2|Factor 1|| rowspan=2|Total
|-
|Yes||No
|-
| rowspan=2|Factor 2||Yes||$n_{1,1}$||$n_{1,2}$||$n_{1,1} + n_{1,2}$
|-
|No||$n_{2,1}$||$n_{2,2}$||$n_{2,1} + n_{2,2}$
|-
| colspan=2|Total||$n_{1,1} + n_{2,1}$||$n_{1,2} + n_{2,2}$||$N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$
|}
</center>

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$

====Interpretation====
* '''RR''': In general, the measure relative risk (RR) is interpreted as follows:
**RR = 1 indicates that the probabilities of two events are the same.
**RR > 1 implies that there is an increased risk.
**RR < 1 implies that there is a decreased risk.

* '''OR'''
**If event $A|B$ has probability $p = 1/2$, then the odds are $\frac{1/2}{1/2}=1$, or $1:1$, or 1 to 1. This means the probability that event A|B occurs is equal to the probability that it does not occur.
**If event $A|C$ has probability $p = 3/4$, then the odds are $\frac{3/4}{1/4}= 3$, or 3 to 1. The probability that the event A|C occurs is three times as large as the probability that it does not occur.
**Similarly, if event $A|D$ has probability $p = 1/4$, then the odds are $\frac {1/4}{3/4}=\frac {1}{3}$, or 1 to 3. The probability that the event A|D occurs is one third of the probability that it does not occur.

*'''RR vs. OR'''
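**The formula and reasoning for the relative risk are a little easier to follow. The OR and RR are roughly equal only when the outcome is rare in both groups; in the cell-phone example above they differ noticeably (RR = 2.67 vs. OR = 3.04).
**Odds ratios have an advantage over relative risk because they take the same value whether the comparison is made row-wise or column-wise.
**Relative risk can be estimated directly in a cohort study, but not in a case-control design, where the numbers of cases and controls are fixed by the investigator.
**The odds ratio is related to the relative risk by $ OR = RR \times \frac{1-P_2} {1-P_1}$, so the OR approximates the RR when both $P_1$ and $P_2$ are small.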
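
The relation $ OR = RR \times \frac{1-P_2}{1-P_1} $ in the last bullet is an exact identity, and it can be checked numerically. The sketch below is illustrative only: the helper name <code>rr_and_or</code> is hypothetical, the first probability pair uses the rounded values from the cell-phone example, and the other two pairs are made-up values chosen to contrast a rare outcome with a common one.

```python
import math


def rr_and_or(p1, p2):
    """Relative risk and odds ratio for two event probabilities."""
    rr = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return rr, odds_ratio


# (p1, p2): cell-phone example, a rare outcome, a common outcome
cases = [(0.184, 0.069), (0.002, 0.001), (0.6, 0.3)]
results = []
for p1, p2 in cases:
    rr, odds_ratio = rr_and_or(p1, p2)
    # Right-hand side of the identity OR = RR * (1 - p2) / (1 - p1)
    via_identity = rr * (1 - p2) / (1 - p1)
    results.append((rr, odds_ratio, via_identity))
    print(f"p1={p1}, p2={p2}: RR={rr:.3f}, OR={odds_ratio:.3f}, "
          f"RR*(1-p2)/(1-p1)={via_identity:.3f}")
```

For the rare outcome the two measures nearly coincide, whereas for the common outcome the OR is much further from 1 than the RR, even though the RR is essentially the same in both cases.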

====Inference====
* ''Inference on OR'': In practice, we commonly report ORs along with their confidence intervals (CIs). The sampling distribution of the OR is not normal; however, the ''log-transformed OR is approximately normally distributed'', and the standard error of $ ln(OR)$ is:

$$ SE(ln(OR))= \sqrt{\frac {1} {n_{1,1}}+ \frac {1} {n_{1,2}} + \frac {1} {n_{2,1}} + \frac{1} {n_{2,2}}}.$$

Thus, if $\alpha$ is the false-positive (i.e., Type I) error rate, the $ (1-\alpha)100\% $ CI of the log-transformed OR can be computed by:
$$ln(OR) \pm z_{\frac{\alpha}{2}} SE(ln(OR)),$$

where the OR point estimate is $OR = \frac {n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}$, and the standard error of the log-transformed OR is listed above (i.e., $SE(ln(OR))$).

You can use the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html SOCR Student’s T-distribution calculator] to compute the critical value $z_{\frac{\alpha}{2}}$ for a given false-positive rate, $\alpha$; with a large number of degrees of freedom, the T-distribution is practically identical to the standard normal distribution.

NOTE: Remember that once you find the lower, $L=ln(OR)-z_{\frac{\alpha}{2}}SE(ln(OR))$, and upper, $U=ln(OR)+z_{\frac{\alpha}{2}}SE(ln(OR))$, limits of the $ln(OR)$ confidence interval, these limits are on the log scale. To convert them into real OR terms, you need to invert the log transform (i.e., apply the exponential function). Thus, the $ CI(OR) $ would be: $(e^L,e^U)$.

===Applications===
* [http://dx.doi.org/10.1016/j.ijnurstu.2012.11.014 This article] studies retrospectively the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance would be moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events would be mediated by surveillance.

: The study shows that one additional registered nurse full-time equivalent per day reduced the odds of in-hospital mortality, respiratory failure, pneumonia, and failure to rescue, with the greatest cost-benefit for adult surgical patients. Table 4 of the results shows the OR and CI(OR). Interpret the findings.

<center>Predictors of adverse events as shown in final logistic regression analysis.</center>
<center>
{| class="wikitable" style="text-align:left; width:75%" border="1"
|-
| '''Factors''' || '''β (S.E.)''' || '''p-Value''' || '''Odds ratio [95% CI]'''
|-
| '''Staffing''' || −0.41 (0.33) || 0.219 || 0.66 [0.35, 1.28]
|-
|'''American Society of Anesthesiologists Physical Status''' ||0.94 (0.39) ||0.017 || 2.57 [1.88, 5.55]
|-
|'''Comorbidity''' || 0.57 (0.43) || 0.189 || 1.76 [0.76, 4.12]
|-
| '''Perioperative complication'''|| 0.64 (0.22)|| 0.003 || 1.90 [1.24, 2.92]
|-
|'''Interaction staffing × surveillance''' || −1.04 (0.42) || 0.012 ||0.354 [0.157, 0.798]
|}
</center>

* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ This article] investigates whether hospitals with a good organization of care (e.g., improved nurse staffing and work environments) can affect patient care and nurse workforce stability in European countries and the United States. It uses data from 488 clinics in 12 European countries and 617 in the United States, and is based on 33,659 nurses and 11,318 patients in Europe and 27,509 nurses and more than 120,000 patients in the US.

: The authors found, among other things, that nurses in hospitals with better work environments were about half as likely to (a) report poor or fair care quality (Europe, adjusted odds ratio 0.56, 95% confidence interval 0.51 to 0.61; US, 0.54, 0.51 to 0.58) and (b) give their hospitals poor or failing grades on patient safety (0.50, 0.44 to 0.56 EU; 0.55, 0.50 to 0.61 US).

: Interpret the results in the Table below. Note that in this nurse outcomes study, the [[SMHS_OR_RR#References|authors adjusted the regression estimates]] (odds ratios) at the hospital level for differences in the composition of nurses between hospitals and between countries (age, sex, full time employment status, and specialty) by a multilevel model structure in which nurses were nested within hospitals and countries. Effects of nurse staffing and practice environment on nurse outcomes in study countries.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-

| rowspan="2" |'''Nurse Outcome''' || colspan="2" |'''Europe''' || || colspan="2" |'''US'''
|-
| '''Unadjusted odds ratio (95% CI)''' ||'''Adjusted odds ratio (95% CI)''' || || '''Unadjusted odds ratio (95% CI)''' || '''Adjusted odds ratio (95% CI)'''
|-
| colspan="6" |'''Poor or fair quality of care in ward'''
|-
|ROWSPAN="2" style="text-align: center;" |'''Practice environment''' || 0.58 || 0.56 || || 0.52 || 0.54
|-
|(0.53 to 0.63)|| (0.51 to 0.61) || || (0.49 to 0.56) || (0.51 to 0.58)
|-
|rowspan="2" |'''Staffing''' || 1.11 || 1.11 || || 1.2 || 1.06
|-
| (1.08 to 1.13) ||(1.07 to 1.15) || || (1.16 to 1.25) || (1.03 to 1.1)
|-
| colspan="6" |'''Poor or failing grade on patient safety'''
|-
|rowspan="2" |'''Practice environment''' || 0.5|| 0.5 || || 0.53 || 0.55
|-
|(0.43 to 0.57) ||(0.44 to 0.56) || || (0.48 to 0.59) ||(0.5 to 0.61)
|-
|rowspan="2"| '''Staffing''' || 1.04 || 1.1 || || 1.18 ||1.05
|-
| (1.01 to 1.08) || (1.05 to 1.16) || || (1.12 to 1.23) ||(1 to 1.1)
|-
|colspan="6" |'''Burnout'''
|-
|rowspan="2"|'''Practice environment''' || 0.69|| 0.67|| || 0.69 || 0.71
|-
|(0.63 to 0.76)||(0.61 to 0.73)|| || (0.66 to 0.73) ||(0.68 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.06 || 1.05|| || 1.12 ||1.03
|-
| (1.04 to 1.08)||(1.02 to 1.09)|| ||(1.08 to 1.15)||(1 to 1.06)
|-
| colspan="6" |'''Job dissatisfaction'''
|-
|rowspan="2"|'''Practice environment''' ||0.63|| 0.52 || || 0.58 || 0.6
|-
|(0.57 to 0.69) || (0.47 to 0.57) || ||(0.55 to 0.61) ||(0.57 to 0.64)
|-
|rowspan="2"|'''Staffing''' || 1.1 || 1.07 || || 1.17 || 1.06
|-
| (1.08 to 1.12) || (1.04 to 1.11) || || (1.13 to 1.21) ||(1.03 to 1.09)
|-
| colspan="6" |'''Intention to leave in the next year'''
|-
|rowspan="2"|'''Practice environment''' || 0.72 || 0.61 || || 0.7 ||0.69
|-
|(0.66 to 0.79) ||(0.56 to 0.67)|| || (0.65 to 0.76) ||(0.64 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.04 || 1.05 || || 1.1 || 1.03
|-
| (1.01 to 1.06)|| (1.02 to 1.09)|| || (1.05 to 1.15)||(0.98 to 1.08)
|-
| colspan="6" |'''Not confident that patients can manage own care after hospital discharge'''
|-
|rowspan="2"|'''Practice environment''' || 0.62 || 0.73 || || 0.71 || 0.72
|-
|(0.56 to 0.69)||(0.69 to 0.78) || ||(0.67 to 0.75) ||(0.68 to 0.77)
|-
|rowspan="2"|'''Staffing''' || 1.08 || 1.03 || ||1.1 || 1.04
|-
| (1.05 to 1.11) ||(1 to 1.05) || || (1.06 to 1.13) || (1.01 to 1.07)
|-
| colspan="6" |'''Not confident that hospital management would resolve patients’ problems'''
|-
|rowspan="2"|'''Practice environment''' || 0.5 || 0.53 || || 0.56 || 0.56
|-
| (0.46 to 0.54) ||(0.48 to 0.58)|| ||(0.53 to 0.59) ||(0.54 to 0.59)
|-
|rowspan="2"|'''Staffing''' || 1.04 ||1.02 || || 1.12 || 1.01
|-
|(1.01 to 1.07) ||(0.98 to 1.06) || ||(1.09 to 1.17) ||(0.98 to 1.03)
|}
</center>

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html T Distribution Calculator]

*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Distribution Applets]

===Problems===
Formulate some clinically relevant questions in terms of the OR and RR and try to answer them in the following situations. Interpret the results. E.g., the estimate of the relative risk of a heart attack is about <blank> as great for those who smoke as for those who do not smoke. Compute the CI(OR).

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Heart Attack (HA)''' || rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Smoking (S)''' || '''Yes''' || 33 || 18 || 51
|-
| '''No''' ||167 || 182 || 349
|-
| colspan=2|'''Total''' ||200 || 200 || 400
|}
</center>

===References===
* [http://www.sciencedirect.com/science/article/pii/S0378375812001954 Reducing bias and mean squared error associated with regression-based odds ratio estimators]
* [http://www.sciencedirect.com/science/article/pii/S0020748912004166 Nursing surveillance moderates the relationship between staffing levels and pediatric postoperative serious adverse events: A nested case–control study]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ Patient safety, satisfaction, and quality of hospital care: cross sectional surveys of nurses and patients in 12 countries in Europe and the United States]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_OR_RR}}

SMHS OR RR

2014-08-31T13:22:27Z

Jslavine: /* Interpretation */

==[[SMHS| Scientific Methods for Health Sciences]] - Odds Ratio and Relative Risk ==

===Overview===
The ''relative risk'' is a measure of dependence that allows us to compare two probabilities in terms of their ratio $ \frac{p_1}{p_2} $ rather than their difference $(p_1 - p_2) $. Relative risk is a commonly used measure in public health studies. Another way to compare two probabilities is in terms of the odds. If an event takes place with probability p, then the odds of the event occurring are $ \frac{p}{1 - p} $. The ''odds ratio'' is the ratio of the odds of an event under two conditions (for example, in an exposed group versus an unexposed group).

===Motivation===
Suppose we study brain cancer in the context of cell phone use. The table below illustrates some simulated data. One clear healthcare question in this case-study could be: ''Is cell phone use associated with a higher incidence of brain cancer?'' To address this question, we can look at the relative risk of brain cancer in people who use cell phones.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Brain Cancer (BC)''' ||rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Cell Phone (CP)''' || '''Yes''' || 18 || 80 || 98 '''(B)'''
|-
| '''No''' ||7 || 95 || 102 '''(C)'''
|-
| colspan=2|'''Total''' || 25 || 175 || 200
|}
</center>

First, we compute the (conditional!) probabilities (P) of brain cancer (BC) given either cell phone use, P1, or no cell-phone use, P2. We can then form their ratio to determine whether the relative risk of brain cancer (BC) is higher in cell phone users (CP) than in non-users (NCP).

$$ P_1 = P(BC|CP) = \dfrac {18}{98} = 0.184 $$

$$ P_2= P(BC|NCP) = \dfrac {7} {102} = 0.069 $$

Therefore, the relative risk of brain cancer in cell phone users is:
$$ RR= \frac{P(BC|CP)}{P(BC|NCP)} = \frac {0.184}{0.069} = 2.67.$$

The risk of having brain cancer is more than 2.5 times greater among cell phone users compared to non-cell phone users.

For the same example, the odds ratio (OR) of brain cancer relative to cell-phone use is:

$$ OR = \frac{\frac{P \left( BC \mid CP \right)}{1 - P \left( BC \mid CP \right)}}{\frac{P \left( BC \mid NCP \right)}{1 - P \left( BC \mid NCP \right)}}
= \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} =\frac{\frac{0.184}{0.816}}{\frac{0.069}{0.931}} = 3.04 $$

Thus, the odds of having brain cancer are about 3 times as large for cell phone users as for non-cell phone users.
We could instead have compared the probabilities of cell phone use given brain-cancer status (i.e., the column-wise probabilities), $ P(CP|BC) = 18/25 = 0.72 $ versus $ P(CP|NBC) = 80/175 = 0.457 $. However, this comparison does not seem as important scientifically.
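
The arithmetic in this section can be checked with a few lines of Python. This is a minimal sketch that uses only the counts from the cell phone table; the variable names are illustrative.

```python
# Counts from the cell phone / brain cancer table above
bc_cp, no_bc_cp = 18, 80    # cell phone users: with / without brain cancer
bc_ncp, no_bc_ncp = 7, 95   # non-users: with / without brain cancer

p1 = bc_cp / (bc_cp + no_bc_cp)      # P(BC | CP)
p2 = bc_ncp / (bc_ncp + no_bc_ncp)   # P(BC | NCP)

relative_risk = p1 / p2
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))

print(f"P1 = {p1:.3f}, P2 = {p2:.3f}")
print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}")
```

Working with unrounded probabilities gives RR ≈ 2.68 and OR ≈ 3.05. These differ from 2.67 and 3.04 in the text only because the text rounds P1 and P2 to three decimals before dividing.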

===Theory===
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2|Factor 1|| rowspan=2|Total
|-
|Yes||No
|-
| rowspan=2|Factor 2||Yes||$n_{1,1}$||$n_{1,2}$||$n_{1,1} + n_{1,2}$
|-
|No||$n_{2,1}$||$n_{2,2}$||$n_{2,1} + n_{2,2}$
|-
| colspan=2|Total||$n_{1,1} + n_{2,1}$||$n_{1,2} + n_{2,2}$||$N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$
|}
</center>

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$

====Interpretation====
* '''RR''': In general, the measure relative risk (RR) is interpreted as follows:
**RR = 1 indicates that the probabilities of two events are the same.
**RR > 1 implies that there is an increased risk.
**RR < 1 implies that there is a decreased risk.

* '''OR'''
**If event $A|B$ has probability $p = 1/2$, then the odds are $\frac{1/2}{1/2}=1$, or $1:1$, or 1 to 1. This means the probability that event A|B occurs is equal to the probability that it does not occur.
**If event $A|C$ has probability $p = 3/4$, then the odds are $\frac{3/4}{1/4}= 3$, or 3 to 1. The probability that the event A|C occurs is three times as large as the probability that it does not occur.
**Similarly, if A|D has probability $p = 1/4$, then the odds are $\frac {1/4}{3/4}=\frac {1}{3}$, or 1 to 3. The probability that the event A|D occurs is one third as large as the probability that it does not occur.

*'''RR vs. OR'''
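**The formula and reasoning for the relative risk are a little easier to follow. The OR and RR are roughly equal when the outcome is rare (both $P_1$ and $P_2$ are small); for common outcomes the OR lies farther from 1 than the RR.
**Odds ratios have an advantage over relative risk because their value is the same whether the comparison is made by rows or by columns.
**Relative risk cannot be estimated directly in a case-control design, because the investigator fixes the numbers of cases and controls; the odds ratio can still be estimated.
**The odds ratio is related to the relative risk by $ OR = RR \times \frac{1-P_2} {1-P_1}$, so the OR approximates the RR when $P_1$ and $P_2$ are both small.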

====Inference====
* ''Inference about the Odds Ratio'': In practice, we commonly report odds ratios along with their Confidence Intervals (CIs). The sampling distribution of the OR is not normal; however, the ''log-transformed OR is approximately normally distributed'', and the standard error of $ ln(OR)$ is:

$$ SE(ln(OR))= \sqrt{\frac {1} {n_{1,1}}+ \frac {1} {n_{1,2}} + \frac {1} {n_{2,1}} + \frac{1} {n_{2,2}}}.$$

Thus, if $\alpha$ is the false-positive (Type I) error rate, the $ (1-\alpha)100\% $ CI (of the log-transformed OR) can be computed by:
$$ln(OR) \pm z_{\frac{\alpha}{2}} SE(ln(OR)),$$

where the odds-ratio point estimate is $OR = \frac {n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}$ and the standard error of the log-transformed OR is listed above $ (SE(ln(OR))).$

You can use the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html SOCR Student’s T-distribution calculator] to compute the value of the standard-normal Z statistics (for a given false-positive error rate $\alpha$).

NOTE: Remember that once you find the lower $\left(L=ln(OR)-z_{\frac{\alpha}{2}}SE(ln(OR))\right)$ and upper $\left(U=ln(OR)+z_{\frac{\alpha}{2}}SE(ln(OR))\right)$ limits of the $ln(OR)$ confidence interval, these represent log-transformed data. To convert these confidence limits into real OR terms, you need to invert the log transform (using the exponential function). Thus, the $ CI(OR) $ would be: $(e^L,e^U)$.
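
The steps above (point estimate, standard error on the log scale, interval on the log scale, back-transformation) can be sketched in Python as follows. The function name <code>odds_ratio_ci</code> is hypothetical, and the default critical value 1.96 assumes $\alpha = 0.05$; the example counts are taken from the cell phone table in the Motivation section.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Return (OR, SE of ln(OR), lower limit, upper limit) for a 2x2 table.

    z is the standard-normal critical value; 1.96 assumes alpha = 0.05.
    All four cell counts must be positive.
    """
    or_hat = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_lower = math.log(or_hat) - z * se_log   # L, on the log scale
    log_upper = math.log(or_hat) + z * se_log   # U, on the log scale
    # Invert the log transform to return to the OR scale
    return or_hat, se_log, math.exp(log_lower), math.exp(log_upper)


# Cell phone / brain cancer counts from the Motivation section
or_hat, se_log, lower, upper = odds_ratio_ci(18, 80, 7, 95)
print(f"OR = {or_hat:.2f}, SE(ln OR) = {se_log:.3f}")
print(f"95% CI for OR: ({lower:.2f}, {upper:.2f})")
```

The resulting interval is not symmetric around the point estimate, because the symmetry holds on the log scale rather than on the OR scale.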

===Applications===
* [http://dx.doi.org/10.1016/j.ijnurstu.2012.11.014 This article] studies retrospectively the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance would be moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events would be mediated by surveillance.

: The study shows that one additional registered nurse full-time equivalent per day reduced the odds of in-hospital mortality, respiratory failure, pneumonia, and failure to rescue, with the greatest cost-benefit for adult surgical patients. Table 4 of the results shows the OR and CI(OR). Interpret the findings. The predictors of adverse events from the final logistic regression analysis are shown below.
<center>
{| class="wikitable" style="text-align:left; width:75%" border="1"
|-
| '''Factors''' || '''β (S.E.)''' || '''p-Value''' || '''Odds ratio [95% CI]'''
|-
| '''Staffing''' || −0.41 (0.33) || 0.219 || 0.66 [0.35, 1.28]
|-
|'''American Society of Anesthesiologists Physical Status''' ||0.94 (0.39) ||0.017 || 2.57 [1.88, 5.55]
|-
|'''Comorbidity''' || 0.57 (0.43) || 0.189 || 1.76 [0.76, 4.12]
|-
| '''Perioperative complication'''|| 0.64 (0.22)|| 0.003 || 1.90 [1.24, 2.92]
|-
|'''Interaction staffing × surveillance''' || −1.04 (0.42) || 0.012 ||0.354 [0.157, 0.798]
|}
</center>

* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ This article] investigates whether hospitals with a good organization of care (e.g., improved nurse staffing and work environments) can affect patient care and nurse workforce stability in European countries. It uses data from 488 clinics in 12 European countries and 617 in the United States, and is based on 33,659 nurses and 11,318 patients in Europe and on 27,509 nurses and more than 120,000 patients in the US.

: Some of the authors’ findings included (a) nurses in hospitals with better work environments were half as likely to report poor or fair care quality (Europe, adjusted odds ratio 0.56, 95% confidence interval 0.51 to 0.61; US, 0.54, 0.51 to 0.58) and (b) to give their hospitals poor or failing grades on patient safety (0.50, 0.44 to 0.56 EU; 0.55, 0.50 to 0.61 US).

: Interpret the results in the Table below. Note that in this nurse outcomes study, the [[SMHS_OR_RR#References|authors adjusted the regression estimates]] (odds ratios) at the hospital level for differences in the composition of nurses between hospitals and between countries (age, sex, full time employment status, and specialty) by a multilevel model structure in which nurses were nested within hospitals and countries. Effects of nurse staffing and practice environment on nurse outcomes in study countries.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-

| rowspan="2" |'''Nurse Outcome''' || colspan="2" |'''Europe''' || || colspan="2" |'''US'''
|-
| '''Unadjusted odds ratio (95% CI)''' ||'''Adjusted odds ratio (95% CI)''' || || '''Unadjusted odds ratio (95% CI)''' || '''Adjusted odds ratio (95% CI)'''
|-
| colspan="6" |'''Poor or fair quality of care in ward'''
|-
|ROWSPAN="2" style="text-align: center;" |'''Practice environment''' || 0.58 || 0.56 || || 0.52 || 0.54
|-
|(0.53 to 0.63)|| (0.51 to 0.61) || || (0.49 to 0.56) || (0.51 to 0.58)
|-
|rowspan="2" |'''Staffing''' || 1.11 || 1.11 || || 1.2 || 1.06
|-
| (1.08 to 1.13) ||(1.07 to 1.15) || || (1.16 to 1.25) || (1.03 to 1.1)
|-
| colspan="6" |'''Poor or failing grade on patient safety'''
|-
|rowspan="2" |'''Practice environment''' || 0.5|| 0.5 || || 0.53 || 0.55
|-
|(0.43 to 0.57) ||(0.44 to 0.56) || || (0.48 to 0.59) ||(0.5 to 0.61)
|-
|rowspan="2"| '''Staffing''' || 1.04 || 1.1 || || 1.18 ||1.05
|-
| (1.01 to 1.08) || (1.05 to 1.16) || || (1.12 to 1.23) ||(1 to 1.1)
|-
|colspan="6" |'''Burnout'''
|-
|rowspan="2"|'''Practice environment''' || 0.69|| 0.67|| || 0.69 || 0.71
|-
|(0.63 to 0.76)||(0.61 to 0.73)|| || (0.66 to 0.73) ||(0.68 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.06 || 1.05|| || 1.12 ||1.03
|-
| (1.04 to 1.08)||(1.02 to 1.09)|| ||(1.08 to 1.15)||(1 to 1.06)
|-
| colspan="6" |'''Job dissatisfaction'''
|-
|rowspan="2"|'''Practice environment''' ||0.63|| 0.52 || || 0.58 || 0.6
|-
|(0.57 to 0.69) || (0.47 to 0.57) || ||(0.55 to 0.61) ||(0.57 to 0.64)
|-
|rowspan="2"|'''Staffing''' || 1.1 || 1.07 || || 1.17 || 1.06
|-
| (1.08 to 1.12) || (1.04 to 1.11) || || (1.13 to 1.21) ||(1.03 to 1.09)
|-
| colspan="6" |'''Intention to leave in the next year'''
|-
|rowspan="2"|'''Practice environment''' || 0.72 || 0.61 || || 0.7 ||0.69
|-
|(0.66 to 0.79) ||(0.56 to 0.67)|| || (0.65 to 0.76) ||(0.64 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.04 || 1.05 || || 1.1 || 1.03
|-
| (1.01 to 1.06)|| (1.02 to 1.09)|| || (1.05 to 1.15)||(0.98 to 1.08)
|-
| colspan="6" |'''Not confident that patients can manage own care after hospital discharge'''
|-
|rowspan="2"|'''Practice environment''' || 0.62 || 0.73 || || 0.71 || 0.72
|-
|(0.56 to 0.69)||(0.69 to 0.78) || ||(0.67 to 0.75) ||(0.68 to 0.77)
|-
|rowspan="2"|'''Staffing''' || 1.08 || 1.03 || ||1.1 || 1.04
|-
| (1.05 to 1.11) ||(1 to 1.05) || || (1.06 to 1.13) || (1.01 to 1.07)
|-
| colspan="6" |'''Not confident that hospital management would resolve patients’ problems'''
|-
|rowspan="2"|'''Practice environment''' || 0.5 || 0.53 || || 0.56 || 0.56
|-
| (0.46 to 0.54) ||(0.48 to 0.58)|| ||(0.53 to 0.59) ||(0.54 to 0.59)
|-
|rowspan="2"|'''Staffing''' || 1.04 ||1.02 || || 1.12 || 1.01
|-
|(1.01 to 1.07) ||(0.98 to 1.06) || ||(1.09 to 1.17) ||(0.98 to 1.03)
|}
</center>

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html T Distribution Calculator]

*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Distribution Applets]

===Problems===
Formulate some clinically relevant questions in terms of the OR and RR and try to answer them in the following situations. Interpret the results. E.g., the estimate of the relative risk of a heart attack is about <blank> as great for those who smoke as for those who do not smoke. Compute the CI(OR).

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Heart Attack (HA)''' || rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Smoking (S)''' || '''Yes''' || 33 || 18 || 51
|-
| '''No''' ||167 || 182 || 349
|-
| colspan=2|'''Total''' ||200 || 200 || 400
|}
</center>

===References===
* [http://www.sciencedirect.com/science/article/pii/S0378375812001954 Reducing bias and mean squared error associated with regression-based odds ratio estimators]
* [http://www.sciencedirect.com/science/article/pii/S0020748912004166 Nursing surveillance moderates the relationship between staffing levels and pediatric postoperative serious adverse events: A nested case–control study]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ Patient safety, satisfaction, and quality of hospital care: cross sectional surveys of nurses and patients in 12 countries in Europe and the United States]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_OR_RR}}

SMHS OR RR

2014-08-31T13:21:52Z

Jslavine: /* Interpretation */

==[[SMHS| Scientific Methods for Health Sciences]] - Odds Ratio and Relative Risk ==

===Overview===
The ''relative risk'' is a measure of dependence that allows us to compare two probabilities in terms of their ratio $ \frac{p_1}{p_2} $ rather than their difference $(p_1 - p_2) $. Relative risk is a commonly used measure in public health studies. Another way to compare two probabilities is in terms of the odds. If an event takes place with probability p, then the odds of the event occurring are $ \frac{p}{1 - p} $. The ''odds ratio'' is the ratio of the odds of an event under two conditions (for example, in an exposed group versus an unexposed group).

===Motivation===
Suppose we study brain cancer in the context of cell phone use. The table below illustrates some simulated data. One clear healthcare question in this case-study could be: ''Is cell phone use associated with a higher incidence of brain cancer?'' To address this question, we can look at the relative risk of brain cancer in people who use cell phones.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Brain Cancer (BC)''' ||rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Cell Phone (CP)''' || '''Yes''' || 18 || 80 || 98 '''(B)'''
|-
| '''No''' ||7 || 95 || 102 '''(C)'''
|-
| colspan=2|'''Total''' || 25 || 175 || 200
|}
</center>

First, we compute the (conditional!) probabilities (P) of brain cancer (BC) given either cell phone use, P1, or no cell-phone use, P2. We can then form their ratio to determine whether the relative risk of brain cancer (BC) is higher in cell phone users (CP) than in non-users (NCP).

$$ P_1 = P(BC|CP) = \dfrac {18}{98} = 0.184 $$

$$ P_2= P(BC|NCP) = \dfrac {7} {102} = 0.069 $$

Therefore, the relative risk of brain cancer in cell phone users is:
$$ RR= \frac{P(BC|CP)}{P(BC|NCP)} = \frac {0.184}{0.069} = 2.67.$$

The risk of having brain cancer is more than 2.5 times greater among cell phone users compared to non-cell phone users.

For the same example, the odds ratio (OR) of brain cancer relative to cell-phone use is:

$$ OR = \frac{\frac{P \left( BC \mid CP \right)}{1 - P \left( BC \mid CP \right)}}{\frac{P \left( BC \mid NCP \right)}{1 - P \left( BC \mid NCP \right)}}
= \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} =\frac{\frac{0.184}{0.816}}{\frac{0.069}{0.931}} = 3.04 $$

Thus, the odds of having brain cancer are about 3 times as large for cell phone users as for non-cell phone users.
We could instead have compared the probabilities of cell phone use given brain-cancer status (i.e., the column-wise probabilities), $ P(CP|BC) = 18/25 = 0.72 $ versus $ P(CP|NBC) = 80/175 = 0.457 $. However, this comparison does not seem as important scientifically.

===Theory===
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2|Factor 1|| rowspan=2|Total
|-
|Yes||No
|-
| rowspan=2|Factor 2||Yes||$n_{1,1}$||$n_{1,2}$||$n_{1,1} + n_{1,2}$
|-
|No||$n_{2,1}$||$n_{2,2}$||$n_{2,1} + n_{2,2}$
|-
| colspan=2|Total||$n_{1,1} + n_{2,1}$||$n_{1,2} + n_{2,2}$||$N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$
|}
</center>

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$

====Interpretation====
* '''RR''': In general, the measure relative risk (RR) is interpreted as follows:
**RR = 1 indicates that the probabilities of two events are the same.
**RR > 1 implies that there is an increased risk.
**RR < 1 implies that there is a decreased risk.

* '''OR'''
**If event $A|B$ has probability $p = 1/2$, then the odds are $\frac{1/2}{1/2}=1$, or $1:1$, or 1 to 1. This means the probability that event A|B occurs is equal to the probability that it does not occur.
**If event $A|C$ has probability $p = 3/4$, then the odds are $\frac{3/4}{1/4}= 3$, or 3 to 1. The probability that the event A|C occurs is three times as large as the probability that it does not occur.
**Similarly, if A|D has probability $p = 1/4$, then the odds are $\frac {1/4}{3/4}=\frac {1}{3}$, or 1 to 3. The probability that the event A|D occurs is one third as large as the probability that it does not occur.

*RR vs. OR
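**The formula and reasoning for the relative risk are a little easier to follow. The OR and RR are roughly equal when the outcome is rare (both $P_1$ and $P_2$ are small); for common outcomes the OR lies farther from 1 than the RR.
**Odds ratios have an advantage over relative risk because their value is the same whether the comparison is made by rows or by columns.
**Relative risk cannot be estimated directly in a case-control design, because the investigator fixes the numbers of cases and controls; the odds ratio can still be estimated.
**The odds ratio is related to the relative risk by $ OR = RR \times \frac{1-P_2} {1-P_1}$, so the OR approximates the RR when $P_1$ and $P_2$ are both small.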
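
The interpretation points above can be illustrated numerically. The sketch below converts probabilities to odds, checks the identity $OR = RR \times \frac{1-P_2}{1-P_1}$ on the cell phone data from the Motivation section, and then compares OR and RR for a rare outcome. The rare-outcome probabilities (0.002 and 0.001) are made up for illustration, and the helper names <code>odds</code> and <code>rr_and_or</code> are hypothetical.

```python
def odds(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    return p / (1 - p)


def rr_and_or(p1, p2):
    """Return (relative risk, odds ratio) for two probabilities."""
    return p1 / p2, odds(p1) / odds(p2)


# Odds for the three probabilities used in the OR bullets
for p in (0.5, 0.75, 0.25):
    print(f"p = {p}: odds = {odds(p):.3f}")

# Common outcome: cell phone example from the Motivation section
p1, p2 = 18 / 98, 7 / 102
rr, or_ = rr_and_or(p1, p2)
identity = rr * (1 - p2) / (1 - p1)   # should reproduce the odds ratio
print(f"RR = {rr:.3f}, OR = {or_:.3f}, RR*(1-P2)/(1-P1) = {identity:.3f}")

# Rare outcome (hypothetical probabilities): OR and RR nearly coincide
rare_rr, rare_or = rr_and_or(0.002, 0.001)
print(f"rare outcome: RR = {rare_rr:.3f}, OR = {rare_or:.3f}")
```

In the cell phone example the outcome is fairly common among users (about 18%), so the OR is visibly larger than the RR; in the rare-outcome case the two measures agree to about two decimal places.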

====Inference====
* ''Inference about the Odds Ratio'': In practice, we commonly report odds ratios along with their Confidence Intervals (CIs). The sampling distribution of the OR is not normal; however, the ''log-transformed OR is approximately normally distributed'', and the standard error of $ ln(OR)$ is:

$$ SE(ln(OR))= \sqrt{\frac {1} {n_{1,1}}+ \frac {1} {n_{1,2}} + \frac {1} {n_{2,1}} + \frac{1} {n_{2,2}}}.$$

Thus, if $\alpha$ is the false-positive (Type I) error rate, the $ (1-\alpha)100\% $ CI (of the log-transformed OR) can be computed by:
$$ln(OR) \pm z_{\frac{\alpha}{2}} SE(ln(OR)),$$

where the odds-ratio point estimate is $OR = \frac {n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}$ and the standard error of the log-transformed OR is listed above $ (SE(ln(OR))).$

You can use the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html SOCR Student’s T-distribution calculator] to compute the value of the standard-normal Z statistics (for a given false-positive error rate $\alpha$).

NOTE: Remember that once you find the lower $\left(L=ln(OR)-z_{\frac{\alpha}{2}}SE(ln(OR))\right)$ and upper $\left(U=ln(OR)+z_{\frac{\alpha}{2}}SE(ln(OR))\right)$ limits of the $ln(OR)$ confidence interval, these represent log-transformed data. To convert these confidence limits into real OR terms, you need to invert the log transform (using the exponential function). Thus, the $ CI(OR) $ would be: $(e^L,e^U)$.

===Applications===
* [http://dx.doi.org/10.1016/j.ijnurstu.2012.11.014 This article] studies retrospectively the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance would be moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events would be mediated by surveillance.

: The study shows that one additional registered nurse full-time equivalent per day reduced the odds of in-hospital mortality, respiratory failure, pneumonia, and failure to rescue, with the greatest cost-benefit for adult surgical patients. Table 4 of the results shows the OR and CI(OR). Interpret the findings. The predictors of adverse events from the final logistic regression analysis are shown below.
<center>
{| class="wikitable" style="text-align:left; width:75%" border="1"
|-
| '''Factors''' || '''β (S.E.)''' || '''p-Value''' || '''Odds ratio [95% CI]'''
|-
| '''Staffing''' || −0.41 (0.33) || 0.219 || 0.66 [0.35, 1.28]
|-
|'''American Society of Anesthesiologists Physical Status''' ||0.94 (0.39) ||0.017 || 2.57 [1.88, 5.55]
|-
|'''Comorbidity''' || 0.57 (0.43) || 0.189 || 1.76 [0.76, 4.12]
|-
| '''Perioperative complication'''|| 0.64 (0.22)|| 0.003 || 1.90 [1.24, 2.92]
|-
|'''Interaction staffing × surveillance''' || −1.04 (0.42) || 0.012 ||0.354 [0.157, 0.798]
|}
</center>

* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ This article] investigates whether hospitals with a good organization of care (e.g., improved nurse staffing and work environments) can affect patient care and nurse workforce stability in European countries. It uses data from 488 clinics in 12 European countries and 617 in the United States, and is based on 33,659 nurses and 11,318 patients in Europe and on 27,509 nurses and more than 120,000 patients in the US.

: Some of the authors’ findings included (a) nurses in hospitals with better work environments were half as likely to report poor or fair care quality (Europe, adjusted odds ratio 0.56, 95% confidence interval 0.51 to 0.61; US, 0.54, 0.51 to 0.58) and (b) to give their hospitals poor or failing grades on patient safety (0.50, 0.44 to 0.56 EU; 0.55, 0.50 to 0.61 US).

: Interpret the results in the Table below. Note that in this nurse outcomes study, the [[SMHS_OR_RR#References|authors adjusted the regression estimates]] (odds ratios) at the hospital level for differences in the composition of nurses between hospitals and between countries (age, sex, full time employment status, and specialty) by a multilevel model structure in which nurses were nested within hospitals and countries. Effects of nurse staffing and practice environment on nurse outcomes in study countries.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-

| rowspan="2" |'''Nurse Outcome''' || colspan="2" |'''Europe''' || || colspan="2" |'''US'''
|-
| '''Unadjusted odds ratio (95% CI)''' ||'''Adjusted odds ratio (95% CI)''' || || '''Unadjusted odds ratio (95% CI)''' || '''Adjusted odds ratio (95% CI)'''
|-
| colspan="6" |'''Poor or fair quality of care in ward'''
|-
|ROWSPAN="2" style="text-align: center;" |'''Practice environment''' || 0.58 || 0.56 || || 0.52 || 0.54
|-
|(0.53 to 0.63)|| (0.51 to 0.61) || || (0.49 to 0.56) || (0.51 to 0.58)
|-
|rowspan="2" |'''Staffing''' || 1.11 || 1.11 || || 1.2 || 1.06
|-
| (1.08 to 1.13) ||(1.07 to 1.15) || || (1.16 to 1.25) || (1.03 to 1.1)
|-
| colspan="6" |'''Poor or failing grade on patient safety'''
|-
|rowspan="2" |'''Practice environment''' || 0.5|| 0.5 || || 0.53 || 0.55
|-
|(0.43 to 0.57) ||(0.44 to 0.56) || || (0.48 to 0.59) ||(0.5 to 0.61)
|-
|rowspan="2"| '''Staffing''' || 1.04 || 1.1 || || 1.18 ||1.05
|-
| (1.01 to 1.08) || (1.05 to 1.16) || || (1.12 to 1.23) ||(1 to 1.1)
|-
|colspan="6" |'''Burnout'''
|-
|rowspan="2"|'''Practice environment''' || 0.69|| 0.67|| || 0.69 || 0.71
|-
|(0.63 to 0.76)||(0.61 to 0.73)|| || (0.66 to 0.73) ||(0.68 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.06 || 1.05|| || 1.12 ||1.03
|-
| (1.04 to 1.08)||(1.02 to 1.09)|| ||(1.08 to 1.15)||(1 to 1.06)
|-
| colspan="6" |'''Job dissatisfaction'''
|-
|rowspan="2"|'''Practice environment''' ||0.63|| 0.52 || || 0.58 || 0.6
|-
|(0.57 to 0.69) || (0.47 to 0.57) || ||(0.55 to 0.61) ||(0.57 to 0.64)
|-
|rowspan="2"|'''Staffing''' || 1.1 || 1.07 || || 1.17 || 1.06
|-
| (1.08 to 1.12) || (1.04 to 1.11) || || (1.13 to 1.21) ||(1.03 to 1.09)
|-
| colspan="6" |'''Intention to leave in the next year'''
|-
|rowspan="2"|'''Practice environment''' || 0.72 || 0.61 || || 0.7 ||0.69
|-
|(0.66 to 0.79) ||(0.56 to 0.67)|| || (0.65 to 0.76) ||(0.64 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.04 || 1.05 || || 1.1 || 1.03
|-
| (1.01 to 1.06)|| (1.02 to 1.09)|| || (1.05 to 1.15)||(0.98 to 1.08)
|-
| colspan="6" |'''Not confident that patients can manage own care after hospital discharge'''
|-
|rowspan="2"|'''Practice environment''' || 0.62 || 0.73 || || 0.71 || 0.72
|-
|(0.56 to 0.69)||(0.69 to 0.78) || ||(0.67 to 0.75) ||(0.68 to 0.77)
|-
|rowspan="2"|'''Staffing''' || 1.08 || 1.03 || ||1.1 || 1.04
|-
| (1.05 to 1.11) ||(1 to 1.05) || || (1.06 to 1.13) || (1.01 to 1.07)
|-
| colspan="6" |'''Not confident that hospital management would resolve patients’ problems'''
|-
|rowspan="2"|'''Practice environment''' || 0.5 || 0.53 || || 0.56 || 0.56
|-
| (0.46 to 0.54) ||(0.48 to 0.58)|| ||(0.53 to 0.59) ||(0.54 to 0.59)
|-
|rowspan="2"|'''Staffing''' || 1.04 ||1.02 || || 1.12 || 1.01
|-
|(1.01 to 1.07) ||(0.98 to 1.06) || ||(1.09 to 1.17) ||(0.98 to 1.03)
|}
</center>

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html T Distribution Calculator]

*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Distribution Applets]

===Problems===
Formulate some clinically relevant questions in terms of the OR and RR and try to answer them in the following situations. Interpret the results. E.g., the estimate of the relative risk of a heart attack is about <blank> as great for those who smoke as for those who do not smoke. Compute the CI(OR).

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Heart Attack (HA)''' || rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Smoking (S)''' || '''Yes''' || 33 || 18 || 51
|-
| '''No''' ||167 || 182 || 349
|-
| colspan=2|'''Total''' ||200 || 200 || 400
|}
</center>

===References===
* [http://www.sciencedirect.com/science/article/pii/S0378375812001954 Reducing bias and mean squared error associated with regression-based odds ratio estimators]
* [http://www.sciencedirect.com/science/article/pii/S0020748912004166 Nursing surveillance moderates the relationship between staffing levels and pediatric postoperative serious adverse events: A nested case–control study]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ Patient safety, satisfaction, and quality of hospital care: cross sectional surveys of nurses and patients in 12 countries in Europe and the United States]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_OR_RR}}

SMHS OR RR

2014-08-31T13:13:38Z

Jslavine: /* Motivation */

==[[SMHS| Scientific Methods for Health Sciences]] - Odds Ratio and Relative Risk ==

===Overview===
The ''relative risk'' is a measure of dependence that allows us to compare two probabilities in terms of their ratio $ \frac{p_1}{p_2} $ rather than their difference
$(p_1 - p_2) $. Relative risk is a commonly used measure in public health studies. Another way to compare two probabilities is in terms of the odds. If an event takes place with probability $p$, then the odds of the event occurring are $ \frac{p}{1 - p} $. The ''odds ratio'' is the ratio of the odds of the same event under two different conditions (e.g., in an exposed group and in an unexposed group).

===Motivation===
Suppose we study brain cancer in the context of cell phone use. The table below illustrates some simulated data. One clear healthcare question in this case-study could be: ''Is cell phone use associated with a higher incidence of brain cancer?'' To address this question, we can look at the relative risk of brain cancer in people who use cell phones.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Brain Cancer (BC)''' ||rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Cell Phone (CP)''' || '''Yes''' || 18 || 80 || 98 '''(B)'''
|-
| '''No''' ||7 || 95 || 102 '''(C)'''
|-
| colspan=2|'''Total''' || 25 || 175 || 200
|}
</center>

First, we compute the (conditional!) probabilities (P) of brain cancer (BC) given either cell phone use, P1, or no cell-phone use, P2. We can then form their ratio to determine whether the relative risk of brain cancer (BC) is higher in cell phone users (CP) than in non-users (NCP).

$$ P_1 = P(BC|CP) = \dfrac {18}{98} = 0.184 $$

$$ P_2= P(BC|NCP) = \dfrac {7} {102} = 0.069 $$

Therefore, the relative risk of brain cancer in cell phone users is:
$$ RR= \frac{P(BC|CP)}{P(BC|NCP)} = \frac {0.184}{0.069} = 2.67.$$

The risk of having brain cancer is more than 2.5 times greater among cell phone users compared to non-cell phone users.

For the same example, the odds ratio (OR) of brain cancer relative to cell-phone use is:

$$ OR = \frac{\frac{P \left( BC \mid CP \right)}{1 - P \left( BC \mid CP \right)}}{\frac{P \left( BC \mid NCP \right)}{1 - P \left( BC \mid NCP \right)}}
= \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} =\frac{\frac{0.184}{0.816}}{\frac{0.069}{0.931}} = 3.04 $$

Thus, the odds of having brain cancer are about 3 times as great for cell phone users as for non-users.
We could instead have compared the probabilities of cell phone use given brain cancer status (i.e., the column-wise probabilities), $ P(CP|BC) = 18/25 = 0.72 $ versus $ P(CP|NBC) = 80/175 = 0.457 $. However, this comparison does not seem as important scientifically.
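The arithmetic above can be reproduced with a short script. This is only an illustrative sketch: the variable names (<code>n11</code>, <code>n12</code>, <code>n21</code>, <code>n22</code>, <code>cross_product</code>) are chosen here and are not part of the original text; the counts are taken from the table above.

```python
# Counts from the (simulated) cell phone / brain cancer table above.
n11, n12 = 18, 80   # cell phone users: brain cancer yes / no
n21, n22 = 7, 95    # non-users:        brain cancer yes / no

# Row-wise conditional probabilities of brain cancer.
p1 = n11 / (n11 + n12)   # P(BC | CP)
p2 = n21 / (n21 + n22)   # P(BC | NCP)

# Relative risk: ratio of the two probabilities.
rr = p1 / p2

# Odds ratio: ratio of the two odds p / (1 - p).
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))

# The same odds ratio written as a cross-product of the cell counts.
cross_product = (n11 * n22) / (n12 * n21)

print(f"P1 = {p1:.3f}, P2 = {p2:.3f}")
print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}, cross-product OR = {cross_product:.2f}")
```

Because the script does not round $P_1$ and $P_2$ before dividing, its results (about 2.68 and 3.05) differ slightly in the last digit from the hand calculation above, which uses the rounded values 0.184 and 0.069.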

===Theory===
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2|Factor 1|| rowspan=2|Total
|-
|Yes||No
|-
| rowspan=2|Factor 2||Yes||$n_{1,1}$||$n_{1,2}$||$n_{1,1} + n_{1,2}$
|-
|No||$n_{2,1}$||$n_{2,2}$||$n_{2,1} + n_{2,2}$
|-
| colspan=2|Total||$n_{1,1} + n_{2,1}$||$n_{1,2} + n_{2,2}$||$N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$
|}
</center>

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$

====Interpretation====
* RR: In general, the relative risk (RR) is interpreted as follows:
**RR = 1 indicates that the risk (the probability of the event) is the same in the two groups.
**RR > 1 implies an increased risk in the first group relative to the second.
**RR < 1 implies a decreased risk in the first group relative to the second.

* OR
**If event $A|B$ has probability $p = 1/2$, then the odds are $\frac{1/2}{1/2}=1$, or $1:1$, or '''1 to 1''' (the probability that event A|B occurs is equal to the probability that it does not occur).
**If event $A|C$ has probability $p = 3/4$, then the odds are $\frac{3/4}{1/4}= 3$, or 3 to 1 (the probability that event A|C occurs is three times as large as the probability that it does not occur).
**Similarly, if $A|D$ has probability $p = 1/4$, then the odds are $\frac {1/4}{3/4}=\frac {1}{3}$, or 1 to 3 (the probability that event A|D occurs is one third of the probability that it does not occur).

*RR vs. OR
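**The formula and reasoning for the relative risk are a little easier to follow. When the event is rare in both groups (small $P_1$ and $P_2$), the OR and RR are roughly equal to each other.
**Odds ratios have an advantage over relative risk because they give the same value whether the comparison is made row-wise or column-wise.
**Relative risk runs into problems in a case-control design, where the numbers of cases and controls are fixed by the investigator, so the risks cannot be estimated directly. The odds ratio can still be computed in that setting.
**The two measures are related by $ OR = RR \times \frac{1-P_2} {1-P_1}$, so the odds ratio approximates the relative risk when $P_1$ and $P_2$ are both small.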
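The relationship between the two measures can be checked numerically. The sketch below uses hypothetical helper names (<code>relative_risk</code>, <code>odds_ratio</code>); the probability pairs other than the cell phone example are made-up values chosen only to contrast a rare event with a common one.

```python
def relative_risk(p1, p2):
    """Ratio of two event probabilities."""
    return p1 / p2


def odds_ratio(p1, p2):
    """Ratio of the odds p / (1 - p) of the same event in two groups."""
    return (p1 / (1 - p1)) / (p2 / (1 - p2))


# (P1, P2) pairs: the cell phone example, a rare event, a common event.
examples = [(18 / 98, 7 / 102), (0.02, 0.01), (0.6, 0.3)]

for p1, p2 in examples:
    rr = relative_risk(p1, p2)
    oratio = odds_ratio(p1, p2)
    # Identity from the text: OR = RR * (1 - P2) / (1 - P1).
    via_identity = rr * (1 - p2) / (1 - p1)
    print(f"P1={p1:.3f} P2={p2:.3f} RR={rr:.3f} OR={oratio:.3f} identity={via_identity:.3f}")
```

For the rare-event pair the two measures nearly coincide, while for the common-event pair the odds ratio is considerably larger than the relative risk.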

====Inference====
* ''Inference about the Odds Ratio'': In practice, we commonly report odds ratios along with their Confidence Intervals (CIs). The sampling distribution of the OR is not normal; however, the ''log-transformed OR is approximately normally distributed'', and the standard error of $ \ln(OR)$ is:

$$ SE(\ln(OR))= \sqrt{\frac {1} {n_{1,1}}+ \frac {1} {n_{1,2}} + \frac {1} {n_{2,1}} + \frac{1} {n_{2,2}}}.$$

Thus, if $\alpha$ is the false-positive (Type I) error rate, the $ (1-\alpha)100\% $ CI (of the log-transformed OR) can be computed by:
$$\ln(OR) \pm z_{\frac{\alpha}{2}} SE(\ln(OR)),$$

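where the odds-ratio point estimate is $OR = \frac {n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}$, $z_{\frac{\alpha}{2}}$ is the critical value of the standard normal distribution, and the standard error of the log-transformed OR, $SE(\ln(OR))$, is given above.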

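You can use the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html SOCR Student’s T-distribution calculator] to compute the critical value $z_{\frac{\alpha}{2}}$ for a given false-positive error rate $\alpha$; with a large number of degrees of freedom, the T distribution closely approximates the standard normal distribution (e.g., $z_{0.025} \approx 1.96$ for a 95% CI).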

NOTE: Remember that once you find the lower, $L=\ln(OR)-z_{\frac{\alpha}{2}}SE(\ln(OR))$, and upper, $U=\ln(OR)+z_{\frac{\alpha}{2}}SE(\ln(OR))$, limits of the $\ln(OR)$ confidence interval, these limits are on the log scale. To convert these confidence limits into real OR terms, you need to invert the log transform (using the exponential function). Thus, the $ CI(OR) $ would be: $(e^L,e^U)$.
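The steps in this subsection can be sketched as a small function. The function name <code>odds_ratio_ci</code> and its arguments are illustrative choices, not part of the original text; the critical value is taken from the standard normal distribution via Python's <code>statistics.NormalDist</code> (Python 3.8 or later).

```python
import math
from statistics import NormalDist


def odds_ratio_ci(n11, n12, n21, n22, alpha=0.05):
    """Return (OR, lower, upper) for a 2x2 table with all counts > 0.

    The interval is built on the log scale and then exponentiated.
    """
    or_hat = (n11 * n22) / (n12 * n21)
    # Standard error of ln(OR).
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    # Two-sided standard normal critical value z_{alpha/2}.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    log_lower = math.log(or_hat) - z * se
    log_upper = math.log(or_hat) + z * se
    # Back-transform the limits from the log scale to the OR scale.
    return or_hat, math.exp(log_lower), math.exp(log_upper)


# Cell phone / brain cancer counts from the Motivation section.
or_hat, lower, upper = odds_ratio_ci(18, 80, 7, 95)
print(f"OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

For the cell phone data this gives an interval of roughly (1.21, 7.68). The interval is not symmetric around the point estimate, because it is symmetric only on the log scale.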

===Applications===
* [http://dx.doi.org/10.1016/j.ijnurstu.2012.11.014 This article] studies retrospectively the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance would be moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events would be mediated by surveillance.

: The study shows that one additional registered nurse full-time equivalent per day reduced the odds of in-hospital mortality, respiratory failure, pneumonia, and failure to rescue, with the greatest cost-benefit for adult surgical patients. Table 4 of the results shows the OR and CI(OR); it is reproduced below (predictors of adverse events in the final logistic regression analysis). Interpret the findings.
<center>
{| class="wikitable" style="text-align:left; width:75%" border="1"
|-
| '''Factors''' || '''β (S.E.)''' || '''p-Value''' || '''Odds ratio [95% CI]'''
|-
| '''Staffing''' || −0.41 (0.33) || 0.219 || 0.66 [0.35, 1.28]
|-
|'''American Society of Anesthesiologists Physical Status''' ||0.94 (0.39) ||0.017 || 2.57 [1.88, 5.55]
|-
|'''Comorbidity''' || 0.57 (0.43) || 0.189 || 1.76 [0.76, 4.12]
|-
| '''Perioperative complication'''|| 0.64 (0.22)|| 0.003 || 1.90 [1.24, 2.92]
|-
|'''Interaction staffing × surveillance''' || −1.04 (0.42) || 0.012 ||0.354 [0.157, 0.798]
|}
</center>
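: When interpreting such a table, it helps to remember that in logistic regression the odds ratio for a factor is $e^{\beta}$. The sketch below recomputes odds ratios and approximate 95% intervals from the reported $\beta$ and S.E. values, assuming a Wald-type interval $e^{\beta \pm 1.96 \times SE}$. This is an assumption made here for illustration; the text does not state how the authors computed their intervals, and the function name <code>wald_odds_ratio</code> is hypothetical.

```python
import math


def wald_odds_ratio(beta, se, z=1.96):
    """Odds ratio exp(beta) with an approximate Wald interval exp(beta +/- z*se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# (beta, S.E.) pairs copied from three rows of the table above.
rows = {
    "Staffing": (-0.41, 0.33),
    "Perioperative complication": (0.64, 0.22),
    "Interaction staffing x surveillance": (-1.04, 0.42),
}

for name, (beta, se) in rows.items():
    or_hat, lower, upper = wald_odds_ratio(beta, se)
    print(f"{name}: OR = {or_hat:.2f} [{lower:.2f}, {upper:.2f}]")
```

: The recomputed values are close to the tabulated ones; small differences are expected because the published $\beta$ and S.E. are rounded to two decimals.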

* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ This article] investigates whether hospitals with a good organization of care (e.g., improved nurse staffing and work environments) can affect patient care and nurse workforce stability in European countries. It uses data from 488 clinics in 12 European countries and 617 in the United States, and is based on 33,659 nurses and 11,318 patients in Europe, and 27,509 nurses and more than 120,000 patients in the US.

: Some of the authors’ findings were that nurses in hospitals with better work environments were about half as likely (a) to report poor or fair care quality (Europe, adjusted odds ratio 0.56, 95% confidence interval 0.51 to 0.61; US, 0.54, 0.51 to 0.58) and (b) to give their hospitals poor or failing grades on patient safety (Europe, 0.50, 0.44 to 0.56; US, 0.55, 0.50 to 0.61).

: Interpret the results in the table below (effects of nurse staffing and practice environment on nurse outcomes in the study countries). Note that in this nurse outcomes study, the [[SMHS_OR_RR#References|authors adjusted the regression estimates]] (odds ratios) at the hospital level for differences in the composition of nurses between hospitals and between countries (age, sex, full time employment status, and specialty) by a multilevel model structure in which nurses were nested within hospitals and countries.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-

| rowspan="2" |'''Nurse Outcome''' || colspan="2" |'''Europe''' || || colspan="2" |'''US'''
|-
| '''Unadjusted odds ratio (95% CI)''' ||'''Adjusted odds ratio (95% CI)''' || || '''Unadjusted odds ratio (95% CI)''' || '''Adjusted odds ratio (95% CI)'''
|-
| colspan="6" |'''Poor or fair quality of care in ward'''
|-
|ROWSPAN="2" style="text-align: center;" |'''Practice environment''' || 0.58 || 0.56 || || 0.52 || 0.54
|-
|(0.53 to 0.63)|| (0.51 to 0.61) || || (0.49 to 0.56) || (0.51 to 0.58)
|-
|rowspan="2" |'''Staffing''' || 1.11 || 1.11 || || 1.2 || 1.06
|-
| (1.08 to 1.13) ||(1.07 to 1.15) || || (1.16 to 1.25) || (1.03 to 1.1)
|-
| colspan="6" |'''Poor or failing grade on patient safety'''
|-
|rowspan="2" |'''Practice environment''' || 0.5|| 0.5 || || 0.53 || 0.55
|-
|(0.43 to 0.57) ||(0.44 to 0.56) || || (0.48 to 0.59) ||(0.5 to 0.61)
|-
|rowspan="2"| '''Staffing''' || 1.04 || 1.1 || || 1.18 ||1.05
|-
| (1.01 to 1.08) || (1.05 to 1.16) || || (1.12 to 1.23) ||(1 to 1.1)
|-
|colspan="6" |'''Burnout'''
|-
|rowspan="2"|'''Practice environment''' || 0.69|| 0.67|| || 0.69 || 0.71
|-
|(0.63 to 0.76)||(0.61 to 0.73)|| || (0.66 to 0.73) ||(0.68 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.06 || 1.05|| || 1.12 ||1.03
|-
| (1.04 to 1.08)||(1.02 to 1.09)|| ||(1.08 to 1.15)||(1 to 1.06)
|-
| colspan="6" |'''Job dissatisfaction'''
|-
|rowspan="2"|'''Practice environment''' ||0.63|| 0.52 || || 0.58 || 0.6
|-
|(0.57 to 0.69) || (0.47 to 0.57) || ||(0.55 to 0.61) ||(0.57 to 0.64)
|-
|rowspan="2"|'''Staffing''' || 1.1 || 1.07 || || 1.17 || 1.06
|-
| (1.08 to 1.12) || (1.04 to 1.11) || || (1.13 to 1.21) ||(1.03 to 1.09)
|-
| colspan="6" |'''Intention to leave in the next year'''
|-
|rowspan="2"|'''Practice environment''' || 0.72 || 0.61 || || 0.7 ||0.69
|-
|(0.66 to 0.79) ||(0.56 to 0.67)|| || (0.65 to 0.76) ||(0.64 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.04 || 1.05 || || 1.1 || 1.03
|-
| (1.01 to 1.06)|| (1.02 to 1.09)|| || (1.05 to 1.15)||(0.98 to 1.08)
|-
| colspan="6" |'''Not confident that patients can manage own care after hospital discharge'''
|-
|rowspan="2"|'''Practice environment''' || 0.62 || 0.73 || || 0.71 || 0.72
|-
|(0.56 to 0.69)||(0.69 to 0.78) || ||(0.67 to 0.75) ||(0.68 to 0.77)
|-
|rowspan="2"|'''Staffing''' || 1.08 || 1.03 || ||1.1 || 1.04
|-
| (1.05 to 1.11) ||(1 to 1.05) || || (1.06 to 1.13) || (1.01 to 1.07)
|-
| colspan="6" |'''Not confident that hospital management would resolve patients’ problems'''
|-
|rowspan="2"|'''Practice environment''' || 0.5 || 0.53 || || 0.56 || 0.56
|-
| (0.46 to 0.54) ||(0.48 to 0.58)|| ||(0.53 to 0.59) ||(0.54 to 0.59)
|-
|rowspan="2"|'''Staffing''' || 1.04 ||1.02 || || 1.12 || 1.01
|-
|(1.01 to 1.07) ||(0.98 to 1.06) || ||(1.09 to 1.17) ||(0.98 to 1.03)
|}
</center>

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html T Distribution Calculator]

*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Distribution Applets]

===Problems===
Formulate some clinically relevant questions in terms of the OR and RR and try to answer them for the data below. Interpret the results. E.g., the estimated relative risk of a heart attack is about ____ times as great for those who smoke as for those who do not smoke. Compute the confidence interval for the odds ratio, CI(OR).

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Heart Attack (HA)''' || rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Smoking (S)''' || '''Yes''' || 33 || 18 || 51
|-
| '''No''' ||167 || 182 || 349
|-
| colspan=2|'''Total''' ||200 || 200 || 400
|}
</center>

===References===
* [http://www.sciencedirect.com/science/article/pii/S0378375812001954 Reducing bias and mean squared error associated with regression-based odds ratio estimators]
* [http://www.sciencedirect.com/science/article/pii/S0020748912004166 Nursing surveillance moderates the relationship between staffing levels and pediatric postoperative serious adverse events: A nested case–control study]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ Patient safety, satisfaction, and quality of hospital care: cross sectional surveys of nurses and patients in 12 countries in Europe and the United States]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_OR_RR}}

SMHS OR RR

2014-08-31T13:08:43Z

Jslavine: /* Overview */

==[[SMHS| Scientific Methods for Health Sciences]] - Odds Ratio and Relative Risk ==

===Overview===
The ''relative risk'' is a measure of dependence that allows us to compare two probabilities in terms of their ratio $ \frac{p_1}{p_2} $ rather than their difference
$(p_1 - p_2) $. Relative risk is a commonly used measure in public health studies. Another way to compare two probabilities is in terms of the odds. If an event takes place with probability $p$, then the odds of the event occurring are $ \frac{p}{1 - p} $. The ''odds ratio'' is the ratio of the odds of the same event under two different conditions (e.g., in an exposed group and in an unexposed group).

===Motivation===
Suppose we study brain cancer in the context of cell phone use. The table below illustrates some simulated data. One clear healthcare question in this case-study could be: ''Is cell phone use associated with a higher incidence of brain cancer?'' To address this question, we can look at the relative risk of brain cancer in people who use cell phones.

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Brain Cancer (BC)''' ||rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Cell Phone (CP)''' || '''Yes''' || 18 || 80 || 98 '''(B)'''
|-
| '''No''' ||7 || 95 || 102 '''(C)'''
|-
| colspan=2|'''Total''' || 25 || 175 || 200
|}
</center>

First, we compute the (conditional!) probabilities (P) of brain cancer (BC) given either cell phone use, P1, or no cell phone use, P2. We can then form their ratio to determine whether the relative risk of brain cancer (BC) is higher in cell phone users (CP) than in non-users (NCP).

$$ P_1 = P(BC|CP) = \dfrac {18}{98} = 0.184 $$

$$ P_2= P(BC|NCP) = \dfrac {7} {102} = 0.069 $$

So the relative risk of brain cancer (cell-phone use vs. no cell-phone use) is:
$$ RR= \frac{P(BC|CP)}{P(BC|NCP)} = \frac {0.184}{0.069} = 2.67.$$

The risk of having brain cancer is more than 2.5 times greater for cell phone users than for non-users.

For the same example, the odds ratio (OR) of brain cancer relative to cell-phone use is:

$$ OR = \frac{\frac{P \left( BC \mid CP \right)}{1 - P \left( BC \mid CP \right)}}{\frac{P \left( BC \mid NCP \right)}{1 - P \left( BC \mid NCP \right)}}
= \frac{\frac{\frac{18}{98}}{1 - \frac{18}{98}}} {\frac{\frac{7}{102}}{1 - \frac{7}{102}}} =\frac{\frac{0.184}{0.816}}{\frac{0.069}{0.931}} = 3.04 $$

Thus, the odds of having brain cancer are about 3 times as great for cell phone users as for non-users.
We could instead have compared the probabilities of cell phone use given brain cancer status (i.e., the column-wise probabilities), $ P(CP|BC) = 18/25 = 0.72 $ versus $ P(CP|NBC) = 80/175 = 0.457 $. However, this comparison does not seem as important scientifically.

===Theory===
<center>
{|class="wikitable" style="text-align:center; width:75%" border="1"
|-
| colspan=2 rowspan=2| || colspan=2|Factor 1|| rowspan=2|Total
|-
|Yes||No
|-
| rowspan=2|Factor 2||Yes||$n_{1,1}$||$n_{1,2}$||$n_{1,1} + n_{1,2}$
|-
|No||$n_{2,1}$||$n_{2,2}$||$n_{2,1} + n_{2,2}$
|-
| colspan=2|Total||$n_{1,1} + n_{2,1}$||$n_{1,2} + n_{2,2}$||$N=n_{1,1} + n_{1,2} + n_{2,1} + n_{2,2}$
|}
</center>

$$RR=\frac{\frac{n_{1,1}}{n_{1,1}+ n_{1,2}}}{\frac{n_{2,1}}{n_{2,1}+n_{2,2}}}.$$

$$OR = \frac{n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}.$$

====Interpretation====
* RR: In general, the relative risk (RR) is interpreted as follows:
**RR = 1 indicates that the risk (the probability of the event) is the same in the two groups.
**RR > 1 implies an increased risk in the first group relative to the second.
**RR < 1 implies a decreased risk in the first group relative to the second.

* OR
**If event $A|B$ has probability $p = 1/2$, then the odds are $\frac{1/2}{1/2}=1$, or $1:1$, or '''1 to 1''' (the probability that event A|B occurs is equal to the probability that it does not occur).
**If event $A|C$ has probability $p = 3/4$, then the odds are $\frac{3/4}{1/4}= 3$, or 3 to 1 (the probability that event A|C occurs is three times as large as the probability that it does not occur).
**Similarly, if $A|D$ has probability $p = 1/4$, then the odds are $\frac {1/4}{3/4}=\frac {1}{3}$, or 1 to 3 (the probability that event A|D occurs is one third of the probability that it does not occur).

*RR vs. OR
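**The formula and reasoning for the relative risk are a little easier to follow. When the event is rare in both groups (small $P_1$ and $P_2$), the OR and RR are roughly equal to each other.
**Odds ratios have an advantage over relative risk because they give the same value whether the comparison is made row-wise or column-wise.
**Relative risk runs into problems in a case-control design, where the numbers of cases and controls are fixed by the investigator, so the risks cannot be estimated directly. The odds ratio can still be computed in that setting.
**The two measures are related by $ OR = RR \times \frac{1-P_2} {1-P_1}$, so the odds ratio approximates the relative risk when $P_1$ and $P_2$ are both small.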

====Inference====
* ''Inference about the Odds Ratio'': In practice, we commonly report odds ratios along with their Confidence Intervals (CIs). The sampling distribution of the OR is not normal; however, the ''log-transformed OR is approximately normally distributed'', and the standard error of $ \ln(OR)$ is:

$$ SE(\ln(OR))= \sqrt{\frac {1} {n_{1,1}}+ \frac {1} {n_{1,2}} + \frac {1} {n_{2,1}} + \frac{1} {n_{2,2}}}.$$

Thus, if $\alpha$ is the false-positive (Type I) error rate, the $ (1-\alpha)100\% $ CI (of the log-transformed OR) can be computed by:
$$\ln(OR) \pm z_{\frac{\alpha}{2}} SE(\ln(OR)),$$

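where the odds-ratio point estimate is $OR = \frac {n_{1,1} \times n_{2,2}}{n_{1,2} \times n_{2,1}}$, $z_{\frac{\alpha}{2}}$ is the critical value of the standard normal distribution, and the standard error of the log-transformed OR, $SE(\ln(OR))$, is given above.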

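You can use the [http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html SOCR Student’s T-distribution calculator] to compute the critical value $z_{\frac{\alpha}{2}}$ for a given false-positive error rate $\alpha$; with a large number of degrees of freedom, the T distribution closely approximates the standard normal distribution (e.g., $z_{0.025} \approx 1.96$ for a 95% CI).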

NOTE: Remember that once you find the lower, $L=\ln(OR)-z_{\frac{\alpha}{2}}SE(\ln(OR))$, and upper, $U=\ln(OR)+z_{\frac{\alpha}{2}}SE(\ln(OR))$, limits of the $\ln(OR)$ confidence interval, these limits are on the log scale. To convert these confidence limits into real OR terms, you need to invert the log transform (using the exponential function). Thus, the $ CI(OR) $ would be: $(e^L,e^U)$.
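As a worked illustration (added here, using the simulated cell phone counts from the Motivation section and $z_{0.025} \approx 1.96$), the 95% CI is obtained as follows. The point estimate uses the unrounded counts, so it is 3.05 rather than the 3.04 obtained from rounded probabilities.

$$ OR = \frac{18 \times 95}{80 \times 7} \approx 3.05, \qquad \ln(OR) \approx 1.1163, \qquad SE(\ln(OR)) = \sqrt{\frac{1}{18}+\frac{1}{80}+\frac{1}{7}+\frac{1}{95}} \approx 0.4706 $$

$$ L \approx 1.1163 - 1.96 \times 0.4706 \approx 0.1940, \qquad U \approx 1.1163 + 1.96 \times 0.4706 \approx 2.0386 $$

$$ CI(OR) \approx \left(e^{0.1940}, e^{2.0386}\right) \approx (1.21, 7.68). $$

Since this interval does not contain 1, these simulated data are consistent with an association between cell phone use and brain cancer at the 5% level.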

===Applications===
* [http://dx.doi.org/10.1016/j.ijnurstu.2012.11.014 This article] studies retrospectively the relationship between surveillance, staffing, and serious adverse events in children on general care postoperative units. The paper investigates these hypotheses: (1) the relationship between patient factors and surveillance would be moderated by staffing (i.e., registered nurse hours per patient per shift), and (2) the relationship between staffing and serious adverse events would be mediated by surveillance.

: The study shows that one additional registered nurse full-time equivalent per day reduced the odds of in-hospital mortality, respiratory failure, pneumonia, and failure to rescue, with the greatest cost-benefit for adult surgical patients. Table 4 of the results shows the OR and CI(OR); it is reproduced below (predictors of adverse events in the final logistic regression analysis). Interpret the findings.
<center>
{| class="wikitable" style="text-align:left; width:75%" border="1"
|-
| '''Factors''' || '''β (S.E.)''' || '''p-Value''' || '''Odds ratio [95% CI]'''
|-
| '''Staffing''' || −0.41 (0.33) || 0.219 || 0.66 [0.35, 1.28]
|-
|'''American Society of Anesthesiologists Physical Status''' ||0.94 (0.39) ||0.017 || 2.57 [1.88, 5.55]
|-
|'''Comorbidity''' || 0.57 (0.43) || 0.189 || 1.76 [0.76, 4.12]
|-
| '''Perioperative complication'''|| 0.64 (0.22)|| 0.003 || 1.90 [1.24, 2.92]
|-
|'''Interaction staffing × surveillance''' || −1.04 (0.42) || 0.012 ||0.354 [0.157, 0.798]
|}
</center>

* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ This article] investigates whether hospitals with a good organization of care (e.g., improved nurse staffing and work environments) can affect patient care and nurse workforce stability in European countries. It uses data from 488 clinics in 12 European countries and 617 in the United States, and is based on 33,659 nurses and 11,318 patients in Europe, and 27,509 nurses and more than 120,000 patients in the US.

: Some of the authors’ findings were that nurses in hospitals with better work environments were about half as likely (a) to report poor or fair care quality (Europe, adjusted odds ratio 0.56, 95% confidence interval 0.51 to 0.61; US, 0.54, 0.51 to 0.58) and (b) to give their hospitals poor or failing grades on patient safety (Europe, 0.50, 0.44 to 0.56; US, 0.55, 0.50 to 0.61).

: Interpret the results in the table below (effects of nurse staffing and practice environment on nurse outcomes in the study countries). Note that in this nurse outcomes study, the [[SMHS_OR_RR#References|authors adjusted the regression estimates]] (odds ratios) at the hospital level for differences in the composition of nurses between hospitals and between countries (age, sex, full time employment status, and specialty) by a multilevel model structure in which nurses were nested within hospitals and countries.

<center>
{| class="wikitable" style="text-align:center; width:75%" border="1"
|-

| rowspan="2" |'''Nurse Outcome''' || colspan="2" |'''Europe''' || || colspan="2" |'''US'''
|-
| '''Unadjusted odds ratio (95% CI)''' ||'''Adjusted odds ratio (95% CI)''' || || '''Unadjusted odds ratio (95% CI)''' || '''Adjusted odds ratio (95% CI)'''
|-
| colspan="6" |'''Poor or fair quality of care in ward'''
|-
|ROWSPAN="2" style="text-align: center;" |'''Practice environment''' || 0.58 || 0.56 || || 0.52 || 0.54
|-
|(0.53 to 0.63)|| (0.51 to 0.61) || || (0.49 to 0.56) || (0.51 to 0.58)
|-
|rowspan="2" |'''Staffing''' || 1.11 || 1.11 || || 1.2 || 1.06
|-
| (1.08 to 1.13) ||(1.07 to 1.15) || || (1.16 to 1.25) || (1.03 to 1.1)
|-
| colspan="6" |'''Poor or failing grade on patient safety'''
|-
|rowspan="2" |'''Practice environment''' || 0.5|| 0.5 || || 0.53 || 0.55
|-
|(0.43 to 0.57) ||(0.44 to 0.56) || || (0.48 to 0.59) ||(0.5 to 0.61)
|-
|rowspan="2"| '''Staffing''' || 1.04 || 1.1 || || 1.18 ||1.05
|-
| (1.01 to 1.08) || (1.05 to 1.16) || || (1.12 to 1.23) ||(1 to 1.1)
|-
|colspan="6" |'''Burnout'''
|-
|rowspan="2"|'''Practice environment''' || 0.69|| 0.67|| || 0.69 || 0.71
|-
|(0.63 to 0.76)||(0.61 to 0.73)|| || (0.66 to 0.73) ||(0.68 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.06 || 1.05|| || 1.12 ||1.03
|-
| (1.04 to 1.08)||(1.02 to 1.09)|| ||(1.08 to 1.15)||(1 to 1.06)
|-
| colspan="6" |'''Job dissatisfaction'''
|-
|rowspan="2"|'''Practice environment''' ||0.63|| 0.52 || || 0.58 || 0.6
|-
|(0.57 to 0.69) || (0.47 to 0.57) || ||(0.55 to 0.61) ||(0.57 to 0.64)
|-
|rowspan="2"|'''Staffing''' || 1.1 || 1.07 || || 1.17 || 1.06
|-
| (1.08 to 1.12) || (1.04 to 1.11) || || (1.13 to 1.21) ||(1.03 to 1.09)
|-
| colspan="6" |'''Intention to leave in the next year'''
|-
|rowspan="2"|'''Practice environment''' || 0.72 || 0.61 || || 0.7 ||0.69
|-
|(0.66 to 0.79) ||(0.56 to 0.67)|| || (0.65 to 0.76) ||(0.64 to 0.75)
|-
|rowspan="2"|'''Staffing''' || 1.04 || 1.05 || || 1.1 || 1.03
|-
| (1.01 to 1.06)|| (1.02 to 1.09)|| || (1.05 to 1.15)||(0.98 to 1.08)
|-
| colspan="6" |'''Not confident that patients can manage own care after hospital discharge'''
|-
|rowspan="2"|'''Practice environment''' || 0.62 || 0.73 || || 0.71 || 0.72
|-
|(0.56 to 0.69)||(0.69 to 0.78) || ||(0.67 to 0.75) ||(0.68 to 0.77)
|-
|rowspan="2"|'''Staffing''' || 1.08 || 1.03 || ||1.1 || 1.04
|-
| (1.05 to 1.11) ||(1 to 1.05) || || (1.06 to 1.13) || (1.01 to 1.07)
|-
| colspan="6" |'''Not confident that hospital management would resolve patients’ problems'''
|-
|rowspan="2"|'''Practice environment''' || 0.5 || 0.53 || || 0.56 || 0.56
|-
| (0.46 to 0.54) ||(0.48 to 0.58)|| ||(0.53 to 0.59) ||(0.54 to 0.59)
|-
|rowspan="2"|'''Staffing''' || 1.04 ||1.02 || || 1.12 || 1.01
|-
|(1.01 to 1.07) ||(0.98 to 1.06) || ||(1.09 to 1.17) ||(0.98 to 1.03)
|}
</center>

===Software===
*[http://www.distributome.org/V3/calc/StudentCalculator.html T Distribution Calculator]

*[http://socr.umich.edu/Applets/Normal_T_Chi2_F_Tables.html Distribution Applets]

===Problems===
Formulate some clinically relevant questions in terms of the OR and RR and try to answer them for the data below. Interpret the results. E.g., the estimated relative risk of a heart attack is about ____ times as great for those who smoke as for those who do not smoke. Compute the confidence interval for the odds ratio, CI(OR).

<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
|-
| colspan=2 rowspan=2 | || colspan=2| '''Heart Attack (HA)''' || rowspan=2|'''Total'''
|-
|'''Yes''' || '''No'''
|-
| rowspan=2| '''Smoking (S)''' || '''Yes''' || 33 || 18 || 51
|-
| '''No''' ||167 || 182 || 349
|-
| colspan=2|'''Total''' ||200 || 200 || 400
|}
</center>

===References===
* [http://www.sciencedirect.com/science/article/pii/S0378375812001954 Reducing bias and mean squared error associated with regression-based odds ratio estimators]
* [http://www.sciencedirect.com/science/article/pii/S0020748912004166 Nursing surveillance moderates the relationship between staffing levels and pediatric postoperative serious adverse events: A nested case–control study]
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3308724/ Patient safety, satisfaction, and quality of hospital care: cross sectional surveys of nurses and patients in 12 countries in Europe and the United States]

<hr>
* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_OR_RR}}