Difference between revisions of "SMHS PCA ICA FA"

From SOCR
Jump to: navigation, search
(PCA Limitations)
(Applications)
 
(25 intermediate revisions by 2 users not shown)
Line 10: Line 10:
  
 
===Theory===
 
===Theory===
 +
====Fundamentals====
 +
First review the [http://en.wikipedia.org/wiki/Matrix_multiplication rules for matrix and vector multiplication]. In an essence, the arithmetic process of multiplying numbers in row $i$ in matrix, or vector, $X^T$ and column $j$ in matrix, or vector, $A$ corresponds to component-wise multiplication followed by adding the terms, summation, to obtain entry $X\times A_{i,j}$ in the product matrix. As an example, $X_{1\times n}^T A_{n\times n}X_{n\times 1} = B_{1\times 1}$ is a scalar value. The transpose notation, $X^T$, indicates that the rows and columns are swapped, e.g., for a standard vector represented as a column, $X=\begin{pmatrix}X_1\\X_2\\...\\X_N \end{pmatrix}$, the transpose is the same vector as a row, $X^T=(X_1, X_2, ..., X_N)$.
 +
 +
Now, suppose we have a sample of $n$ observations as points in a plane: $\{p_1, p_2, …, p_n\}$ in the 2D space, $p=(x_1,x_2)$. How can we account (approximately) for the variation in the sample in as few variables as possible. The figure below illustrates the geometric nature of '''principal components (PCs)'''.
 +
 +
<center>
 +
[[Image:SMHS_PCA_ICA_FA_Fig1_c.png|500px]]
 +
</center>
 +
 +
The $1^{st}$ PC, $Z_1$, is a minimum distance fit to a line in the $X$ space. The $2^{nd}$ PC, $Z_2$, is a minimum distance fit to a line perpendicular to the $1^{st}$ PC (in an (N-1)D  hyper-plane). In general, the next PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous PCs.
 +
 +
Given a sample of n observations on a vector of $N$ variables
 +
$X=\{X_1, X_2, …, X_N\}$, the $1^{st}$ PC is defined by a linear transformation $Z_1=a_1^T X =\sum_{i=1}^N {a_{i,1}X_i}$, where the vector $a_1=\{a_{1,1}, a_{2,1}, …, a_{N,1} \}$ is chosen to maximize the Variance of $Z_1$.
 +
 +
Similarly, $k^{th}$ PC of the sample is defined by a linear transformation $Z_k=a_k^T X =\sum_{i=1}^N {a_{i,k}X_i}$, where the vector $a_k=\{a_{1,k}, a_{2,k}, …, a_{N,k} \}$ is chosen to maximize the Variance of $Z_k$, subject to 3 conditions:
 +
:: Variance of $Z_k$ is maximized, and
 +
:: $Cov(Z_k,Z_l)=0$ for all $1 \leq l < k$, and
 +
:: $a_k^Ta_k = 1.$
 +
 +
To find the transformation ($a$) we express the $var(Z_1)$ in terms of the variance-covariance matrix of $X$:
 +
:: $var(Z_1) = E(Z_1^2) – (E (Z_1))^2 = \sum _{i,j,=1}^N {a_{i,1} a_{j,1} E(x_i x_j)} –
 +
\sum _{i,j,=1}^N { a_{i,1} a_{j,1} E(x_i)E( x_j)} $ $= \sum _{i,j,=1}^N { a_{i,1} a_{j,1} S_{i,j}}$,
 +
where $ S_{i,j} = \sigma_{x_i x_j} = E(X_i X_j) – E(X_i)E(X_j)$.
 +
Thus,
 +
: $var(Z_1) =a_1^TSa_1$, where $S=(S_{i,j})$ is the variance covariance matrix of $X=\{X_1, …, X_N\}$.
 +
As $a_1$ maximizes $Var(Z_1)$ subject to $||a_1||=a_i^Ta_1=1$, we set up this optimization problem as:
 +
 +
$\max_{a_1} \{a_1^T Sa_1 - \lambda(a_1^Ta_1 -1) \}.$
 +
Differentiating w.r.t. $a_1$ and setting the derivative equal to zero, yields:
 +
$Sa_1 - \lambda a_1 = 0$, and thus, $(S-\lambda I_N)a_1=0$.
 +
Therefore, $a_1$ is an eigenvector of S, corresponding to the eigenvalue $\lambda=\lambda_1$.
 +
Maximizing $Var(Z_1)=a_1^TSa_1= a_1^T \lambda_1 a_1$, yields the largest eigenvalue of $S$, hence the first PC $Z_1$ retains the largest amount of variation in the sample.
 +
 +
Similarly, the second eigenvector $a_2$ maximizes $Var(Z_2)$, subject to:
 +
:: $0=Cov(Z_2,Z_1)=a_1^TSa_2=\lambda_1a_1^Ta_2$, and
 +
:: $a_2^Ta_2 = 1.$
 +
 +
Thus, for a pair of ([http://en.wikipedia.org/wiki/Lagrange_multiplier Lagrange multipliers]) $\lambda_1$ and $\lambda_2$, we need to maximize:
 +
$$ a_1^TSa_2 -\lambda_1(a_2^Ta_2-1) - \lambda_2(a_1^Ta_2).$$
 +
Differentiating w.r.t. $a_1$ and setting the derivative equal to zero yields that:
 +
$(S-\lambda_2 I_N)a_2=0$, which means that $a_2$ is another eigenvector of $S$ corresponding to the eigenvalue $\lambda_2$. And so on.
 +
In general, $Var(Z_k)=a_k^TSa_k=\lambda_k$, where $a_k$ is the $k^{th}$ largest eigenvalue of $S$ and $Z_k$ is the $k^{th}$ PC retaining the $k^{th}$ largest variation in the sample $X$.
 +
 +
In Matrix notation:
 +
:: Observations $X=\{X_1, X_2, …, X_N \}$
 +
:: PCs: $Z=\{Z_1, Z_2, …, Z_N \}$, with $Z=A^TZ$ and $A=\{a_1, a_2, …, a_N\}$ is an orthogonal $N\times N$ matrix with columns representing the eigenvectors of S. That is, $\Lambda = A^T S A$, where $\Lambda$ is a diagonal matrix $\Lambda=diag(\lambda_1, \lambda_2, …, \lambda_n)$ of eigenvalues.
 +
 
====PCA (principal component analysis)====
 
====PCA (principal component analysis)====
 
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.
 
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.
 
 
Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($\\w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.
+
Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($||w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.
  
 
* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.
 
* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.
Line 29: Line 76:
 
** Calculate the covariance matrix between the variables of the normalized dataset;  
 
** Calculate the covariance matrix between the variables of the normalized dataset;  
 
** Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);  
 
** Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);  
**(5) Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
+
** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
  
 
====PCA Properties====
 
====PCA Properties====
 
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times q$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
 
* For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times q$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
 
* $y=B'x$ and the $tr(\Sigma_y)$ are minimized by taking $B=A_q^*$ where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;  
 
* $y=B'x$ and the $tr(\Sigma_y)$ are minimized by taking $B=A_q^*$ where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;  
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tends to become smaller as SkS increases, whereas the elements of $\lambda_k$  tends to stay about the same size because $\alpha_k \alpha_k'=1,$ for $k=1,2,…,p$.
+
* $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tends to become smaller as $k$ increases, whereas the elements of $\alpha_k$  tends to stay about the same size because of the normalization constraint $\alpha_k \alpha_k'=1,$ for $k=1,2,…,p$.
  
 
====PCA Limitations====
 
====PCA Limitations====
Line 44: Line 91:
 
  ## The variances of the variables in the
 
  ## The variances of the variables in the
 
  ## USArrests data vary by orders of magnitude, so scaling is appropriate
 
  ## USArrests data vary by orders of magnitude, so scaling is appropriate
  (pc.cr <- princomp(USArrests))  # inappropriate
+
  # This data set contains statistics, in arrests per 100,000 residents for assault,
 +
# murder, and rape in each of the 50 US states in 1973. Also given is the percent of
 +
# the population living in urban areas.
 +
help(USArrests)
 +
head(USArrests)
 +
 
 +
pc.cr <- princomp(USArrests)  # inappropriate
 
   
 
   
 
  Call:
 
  Call:
Line 94: Line 147:
 
  biplot(pc.cr)  ## shows the plot of PCA in a different format
 
  biplot(pc.cr)  ## shows the plot of PCA in a different format
 
   
 
   
From the chart above, we can see the distribution variance attributed to different variables on the four principal components.
+
From the chart above, we can see the distribution of the variance attributed to different variables in the four principal components.
  
 
==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
 
==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]====
This [[SOCR_EduMaterials_AnalysisActivities_PCA| SOCR Activity illustrates the use of PCA]].
+
[[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR activity]] illustrates the use of PCA.
  
 
====ICA (independent component analysis)====
 
====ICA (independent component analysis)====
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
+
ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
  
 
* ICA Assumptions:  
 
* ICA Assumptions:  
** The source signals are independent of each other;
+
** The source signals are independent of each other.
** The distribution of the values in each source signals are non-Gaussian.
+
** The distributions of the values in each source signal are non-Gaussian.
** Independence: the source of signals are assumed to be independent but their signal mixture are not independent because they share the same source signals;  
+
** Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;  
** Normality: based on [[SMHS_CLT_LLN|CLT]], the distribution of a sum of independent random variables approximate a Gaussian distribution;  
+
** Normality: Based on [[SMHS_CLT_LLN|CLT]], the distribution of the sum of independent random variables approximates a Gaussian distribution;  
** Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
+
** Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
 
 
: ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA can’t identify the actual number of source signals nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $w$, $s=Wx$, into maximally independent components $s$ measured by some functions of independence. There are alternative ''models'' for ICA:  
+
: ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA cannot identify the actual number of source signals nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $w$, $s=Wx$, into maximally independent components $s$ measured by some functions of independence. There are alternative ''models'' for ICA:  
 
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.  
 
* Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.  
 
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
 
* Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
 
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
 
* Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
 
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse.
 
* Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse.
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example shows un-mixing two mixed independent uniforms:  
+
* ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example demonstrates how to separate two mixed independent uniform variables:  
  
 
  library(fastICA)
 
  library(fastICA)
Line 130: Line 183:
 
: Another example of un-mixing two independent signals is shown below:
 
: Another example of un-mixing two independent signals is shown below:
 
  S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
 
  S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
  # [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] combines objects by rows and columns. It takes a sequence of vector, matrix or data frames arguments and combines them by columns or rows, respectively.  
+
  # [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] combines objects by rows and columns.  
 +
#It takes a sequence of vector, matrix or data frames arguments and combines them by columns or rows, respectively.  
 
   
 
   
 
  A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
 
  A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
Line 149: Line 203:
  
 
====Factor analysis (FA)====
 
====Factor analysis (FA)====
FA is a statistical method, which describes variability among observed correlated variables in terms of potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$ and $j \in \{1,…,k\}$ where $k<p$. We have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, we have $x-μ = LF+ε$, with $n$ observations, we have $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.  
+
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$ and $j \in \{1,…,k\}$ where $k<p$. We have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, we have $x-μ = LF+ε$, with $n$ observations$x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.  
  
* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in the above example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$ which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In the example above, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question are given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ is a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and sample variance $σ_a^2=\frac{1}{N_a} \sum_i {(x_{ai}-μ_a)^2}$. The factor analysis model is expressed by:
+
* Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in the above example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$ which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In the example above, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question are given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ is a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and sample variance $σ_a^2=\frac{1}{N_a} \sum_i {(x_{ai}-μ_a)^2}$. The factor analysis model is expressed by:
 
:$$\begin{matrix}z_{1,i} & =  & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
 
:$$\begin{matrix}z_{1,i} & =  & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\
 
\vdots & & \vdots & & \vdots & & \vdots \\
 
\vdots & & \vdots & & \vdots & & \vdots \\
Line 200: Line 254:
  
 
===PCA, ICA, FA: Similarities and Differences===  
 
===PCA, ICA, FA: Similarities and Differences===  
* PCA is closely related to factor analysis. The later typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data.  
+
* PCA is closely related to factor analysis. The latter typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data.  
  
* PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when detect the latent construct or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.  
+
* PCA is a variance-focused approach that seeks to reproduce the variable's total variance. In PCA, the components reflect both common and unique variance of the variable. It is generally preferred for the purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that factor analysis also involves linear combinations of variables.  
  
* Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors “represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.
+
* In contrast to PCA, factor analysis is a correlation-focused approach seeking to reproduce the correlations among variables. The factors represent the common variance of variables and exclude unique variance. Factor analysis is generally used when the purpose of the research is detection of underlying structure in the data structure (i.e., latent constructs or factors) or causal modeling.
  
 
===Applications===
 
===Applications===
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrated the utilization of SOCR analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and shows how to use PCA and how to read and interpret the outcome. It introduced students to input data in the correct format, read the results of PCA and make interpretation of the resulting transformed data.
+
====Parkinson's disease case study====
 +
The table below includes data for N=33 Parkinson's disease (PD) cases (3001, …, 3808) from the [http://www.ppmi-info.org/ Parkinson’s Progression Markers Initiative (PPMI) database]. We are interested in examining the relationships between imaging biomarkers, derived from the red nucleus in the Substantia Nigra (SN), and [http://www.movementdisorders.org/MDS-Files1/PDFs/MDS-UPDRS-Rating-Scales/NewUPDRS7308final.pdf MDS-UPDRS assessment measures] for Parkinson's Diseases patients, Part I.A, Part I.B, Part II and Part III scores.
  
* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.  
+
* '''Hypothesis''': The a priori hypothesis is that there is a strong inverse correlation between the imaging biomarkers (voxel intensity ratios) and the MDS-UPDRS scores. In other words, an increase of the (average) voxel intensities (i.e., more intact nigrosome-1) would correspond with a decrease in UPDRS scores (i.e., less PD symptoms), and vice-versa, lower voxel intensity ratios would correspond with higher UPDRS scores.  
  
* [http://link.springer.com/article/10.1007/BF02294359 This article] introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.
+
* '''Data''':  
 +
: The derived imaging biomarkers include:
 +
:: '''Top of SN Voxel Intensity Ratio''' obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1 region. Five random points from the ventral medial portion of the substantia nigra within the slice of interest are averaged. The mean for the nigrosome-1 was divided over the mean for the other ROI’s to generate the Top of SN Voxel Intensity ratios bopth on the left and right side.
 +
:: '''Side of SN Voxel Intensity Ratio''' obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1. Again, five random points from the outer swallow tail of the substantia nigra within the slice of interest are averaged on the left and right side
  
* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.
+
: The Movement Disorder Society (MDS)-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): includes Part I (Non-motor Experiences of Daily Living); Part II (Motor Experiences of Daily Living); Part III (Motor Examination); and Part IV (Motor Complications) which was excluded as it was only conducted for PPMI subjects who had started PD medication.
 +
 
 +
<center>
 +
{| class="wikitable" style="text-align:center;" border="2"
 +
|-
 +
!Patient_ID||Top_of_SN_Voxel_Intensity_Ratio||Side_of_SN_Voxel_Intensity_Ratio||Part_IA||Part_IB||Part_II||Part_III
 +
|-
 +
|3001||1.188075152||0.940778688||0||8||2||12
 +
|-
 +
|3002||1.621806155||1.123298088||3||5||15||17
 +
|-
 +
|3003||1.058309086||1.015158507||1||11||6||29
 +
|-
 +
|3004||2.027838499||1.31348223||0||2||0||2
 +
|-
 +
|3006||1.754682932||1.064932943||1||2||4||22
 +
|-
 +
|3008||1.399353481||1.15022502||1||7||2||10
 +
|-
 +
|3010||1.327449894||0.930592788||2||9||11||19
 +
|-
 +
|3011||1.765207347||1.241295983||1||0||0||0
 +
|-
 +
|3012||1.565801729||1.329574405||4||7||6||20
 +
|-
 +
|3014||1.231222196||1.038040254||0||8||3||36
 +
|-
 +
|3016||1.560849827||1.305709046||0||2||0||1
 +
|-
 +
|3018||1.2011895||0.963859444||2||5||5||19
 +
|-
 +
|3020||1.693136427||1.010054879||2||13||10||20
 +
|-
 +
|3021||1.334255775||1.027113745||1||7||5||12
 +
|-
 +
|3023||1.221884984||0.949202296||4||5||4||30
 +
|-
 +
|3024||1.406800428||0.995810978||0||3||9||12
 +
|-
 +
|3029||1.74564323||1.22107546||0||1||2||3
 +
|-
 +
|3188||1.795850966||1.381062457||0||5||0||0
 +
|-
 +
|3551||1.811029249||1.160219407||0||2||0||0
 +
|-
 +
|3554||1.611463658||1.184391119||3||4||1||15
 +
|-
 +
|3361||2.149437904||1.197802706||0||1||2||9
 +
|-
 +
|3161||1.801208214||1.061680928||0||6||3||16
 +
|-
 +
|3767||1.484570315||1.208752123||3||4||6||21
 +
|-
 +
|3305||1.415250395||1.116697914||0||1||1||12
 +
|-
 +
|3307||1.79439256||1.111008744||1||8||17||26
 +
|-
 +
|3308||1.398475885||0.985062613||1||6||1||6
 +
|-
 +
|3309||1.151021722||0.963016059||3||6||8||29
 +
|-
 +
|3311||1.298993676||1.102174959||0||1||0||0
 +
|-
 +
|3314||1.371292596||1.214219683||1||2||0||0
 +
|-
 +
|3322||1.437867168||0.977680828||0||0||0||2
 +
|-
 +
|3352||1.84019943||0.960479831||6||13||10||29
 +
|-
 +
|3586||1.722158512||1.157249933||0||0||0||0
 +
|-
 +
|3808||1.355532247||1.111517624||1||8||2||13
 +
|}
 +
</center>
 +
 
 +
* R-script analyses of the PPMI data:
 +
 
 +
dataset <- read.csv('C:\\Users\\Desktop\\PPMI_data.csv', header = TRUE)
 +
# Patient_ID, Top_of_SN_Voxel_Intensity_Ratio, Side_of_SN_Voxel_Intensity_Ratio
 +
# Part_IA, Part_IB, Part_II, Part_III
 +
 +
# remove the first columns (Patient ID number)
 +
dataset <- dataset[,-1]
 +
attach(dataset)
 +
 
 +
# form the multivariate response variable, Y
 +
Y <- cbind(Part_IA, Part_IB, Part_II, Part_III)
 +
 +
# form the predictors vector, X
 +
X <- cbind(Top_of_SN_Voxel_Intensity_Ratio,Side_of_SN_Voxel_Intensity_Ratio)
 +
 +
summary(dataset)
 +
head(dataset)
 +
 
 +
# some diagnostic plots
 +
par(mfrow=c(2,3))
 +
plot(sort(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio))
 +
hist(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio)
 +
plot(density(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE))
 +
 +
plot(sort(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio))
 +
hist(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio)
 +
plot(density(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE))
 +
 
 +
# Try MANOVA first to explore predictive value of imaging biomarkers on clinical PD scores
 +
# MMLR Analysis
 +
# mmlr.model <- lm(Y ~ X)    # Fit the multivariate model
 +
# summary(manova(mmlr.model))
 +
# summary (manova(mmlr.model), test='Wilks')
 +
# manova(mmlr.model)
 +
# mmlr.model
 +
 
 +
# Fit linear regression models
 +
Fit_IA <- lm(Part_IA ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio)
 +
Fit_IA
 +
summary(Fit_IA)
 +
 +
Fit_IB <- lm(Part_IB ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio)
 +
Fit_IB
 +
summary(Fit_IB)
 +
 +
Fit_III <- lm(Part_III ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio)
 +
Fit_III
 +
summary(Fit_III)
 +
 
 +
par(mar = rep(3, 4))
 +
plot(Fit_IA); # text(0.5,0.5,"First title",cex=2,font=2)
 +
plot(Fit_IB); # text(0.5,0.5,"Second title",cex=2,font=2)
 +
plot(Fit_III); # text(0.5,0.5,"Third title",cex=2,font=2)
 +
 
 +
# PCA/FA Analyses
 +
par(mfrow=c(1,1))
 +
pca.model <- princomp(dataset, cor=TRUE)
 +
summary(pca.model) # print variance accounted for
 +
loadings(pca.model) # pc loadings (i.e., igenvector columns)
 +
 +
Loadings:
 +
                                Comp.1 Comp.2 Comp.3 Comp.4   Comp.5 Comp.6
 +
Top_of_SN_Voxel_Intensity_Ratio  -0.256 -0.713 -0.373 -0.105   0.477 -0.221
 +
Side_of_SN_Voxel_Intensity_Ratio -0.386 -0.472  0.357  0.433   -0.558
 +
Part_IA                          0.383 -0.373  0.710 -0.320   0.238  0.227
 +
Part_IB                          0.460        -0.112  0.794   0.292  0.226
 +
Part_II                          0.425 -0.342 -0.464 -0.262    -0.534  0.365
 +
Part_III                          0.498                          -0.183 -0.844
 +
 +
plot(pca.model, type="lines") # scree plot
 +
 +
pca.model$\$ $scores # the principal components
 +
par(mfrow=c(1,1))
 +
biplot(pca.model)
 +
 
 +
library("FactoMineR", lib.loc="~/R/win-library/3.1")
 +
pca.model <- PCA(dataset)
 +
pca.model
 +
 
 +
# Factor Analysis
 +
#install.packages("factanal")
 +
fa.model <- factanal(dataset, 2, rotation="varimax")
 +
print(fa.model, digits=2, cutoff=.3, sort=TRUE)
 +
# plot factor 1 by factor 2 
 +
load <- fa.model$\$ $loadings[,1:2]
 +
plot(load,type="n") # set up plot
 +
text(load, labels=names(fa.model),cex=.7) # add variable names
 +
 +
# factanal(x = dataset, factors = 2, rotation = "varimax")
 +
 +
Uniquenesses:
 +
Top_of_SN_Voxel_Intensity_Ratio      0.02
 +
Side_of_SN_Voxel_Intensity_Ratio    0.53
 +
                        Part_IA      0.57
 +
                        Part_IB      0.41
 +
                        Part_II      0.39
 +
                        Part_III      0.22
 +
 +
Loadings:
 +
                                Factor1 Factor2
 +
Part_IA                          0.65         
 +
Part_IB                          0.73         
 +
Part_II                          0.78         
 +
Part_III                          0.82  -0.32 
 +
Top_of_SN_Voxel_Intensity_Ratio          0.99 
 +
Side_of_SN_Voxel_Intensity_Ratio -0.42    0.54 
 +
 +
              Factor1 Factor2
 +
SS loadings      2.41    1.44
 +
Proportion Var    0.40    0.24
 +
Cumulative Var    0.40    0.64
 +
 
 +
* [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the utilization of a SOCR analysis package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and demonstrates how to use PCA and read and interpret the outcome. It introduces students to inputting data in the correct format, reading the results of PCA and interpreting the resulting transformed data.
 +
 
 +
* [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, is recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants block-scaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.
 +
 
 +
* [http://link.springer.com/article/10.1007/BF02294359 This article] introduces the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It relates the order determination of an autoregressive model to the determination of the number of factors in a maximum likelihood factor analysis. The use of the AIC criterion in factor analysis is particularly interesting when it used with a Bayesian model. This observation reveals that AIC can be more widely applied than only to the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads to the handling of the problem of improper solutions by introducing a natural prior distribution of factor loadings.
 +
 
 +
* [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields in recent years. The basic independent component model is a semi-parametric model that assumes that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate the unmixing matrix so the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (i.e., consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so-called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.
  
 
===Software ===
 
===Software ===
 
* [http://socr.ucla.edu/htmls/ana/ SOCR ANalyses]
 
* [http://socr.ucla.edu/htmls/ana/ SOCR ANalyses]
* [[SOCR_EduMaterials_AnalysisActivities_PCA SOCR PCA Activity]]
+
* [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]]
 
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp package]  
 
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp package]  
 
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]
 
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation]
Line 227: Line 479:
 
   
 
   
 
  > library(fastICA)
 
  > library(fastICA)
  # Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),  
+
  # Using the SOCR 1981-2006 CPI Data  
  # save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. Note the "dir" (folder) where you saved the data and reference it below
+
# (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex),  
 +
  # save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]].  
 +
# Note the "dir" (folder) where you saved the data and reference it below
 
  > CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))
 
  > CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))
 
   
 
   
Line 235: Line 489:
 
  > pc.cor <- princomp(X,cor=T)
 
  > pc.cor <- princomp(X,cor=T)
 
  > summary(pc.cor )
 
  > summary(pc.cor )
 +
 
  Importance of components:
 
  Importance of components:
 
                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6      Comp.7
 
                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6      Comp.7
Line 243: Line 498:
 
  > ica <- fastICA(X,n.comp=7)
 
  > ica <- fastICA(X,n.comp=7)
 
  > names(ica)
 
  > names(ica)
 +
 
  [1] "X" "K" "W" "A" "S"
 
  [1] "X" "K" "W" "A" "S"
 
  #  X: pre-processed data matrix (whitened/sphered data)
 
  #  X: pre-processed data matrix (whitened/sphered data)
Line 262: Line 518:
 
   
 
   
 
  > loadings(pc.cor)  
 
  > loadings(pc.cor)  
+
 
 
  Loadings:
 
  Loadings:
 
               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
 
               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Line 294: Line 550:
 
  > dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
 
  > dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
 
  > write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
 
  > write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
 +
 
   Cmp.1 Cmp.2 Cmp.3 Cmp.4
 
   Cmp.1 Cmp.2 Cmp.3 Cmp.4
 
  Electricity                             
 
  Electricity                             
Line 307: Line 564:
 
  > dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
 
  > dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep=""))
 
  > write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
 
  > write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
 +
 
  Cmp.1 Cmp.2 Cmp.3 Cmp.4
 
  Cmp.1 Cmp.2 Cmp.3 Cmp.4
 
  Electricity -0.41                0.51   
 
  Electricity -0.41                0.51   

Latest revision as of 11:39, 24 November 2014

Scientific Methods for Health Sciences - Dimensionality Reduction: PCA, ICA, FA

Overview

  • PCA (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
  • ICA (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
  • Factor analysis is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

Motivation

Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

Theory

Fundamentals

First review the rules for matrix and vector multiplication. In an essence, the arithmetic process of multiplying numbers in row $i$ in matrix, or vector, $X^T$ and column $j$ in matrix, or vector, $A$ corresponds to component-wise multiplication followed by adding the terms, summation, to obtain entry $X\times A_{i,j}$ in the product matrix. As an example, $X_{1\times n}^T A_{n\times n}X_{n\times 1} = B_{1\times 1}$ is a scalar value. The transpose notation, $X^T$, indicates that the rows and columns are swapped, e.g., for a standard vector represented as a column, $X=\begin{pmatrix}X_1\\X_2\\...\\X_N \end{pmatrix}$, the transpose is the same vector as a row, $X^T=(X_1, X_2, ..., X_N)$.

Now, suppose we have a sample of $n$ observations as points in a plane: $\{p_1, p_2, …, p_n\}$ in the 2D space, $p=(x_1,x_2)$. How can we account (approximately) for the variation in the sample in as few variables as possible. The figure below illustrates the geometric nature of principal components (PCs).

SMHS PCA ICA FA Fig1 c.png

The $1^{st}$ PC, $Z_1$, is a minimum distance fit to a line in the $X$ space. The $2^{nd}$ PC, $Z_2$, is a minimum distance fit to a line perpendicular to the $1^{st}$ PC (in an (N-1)D hyper-plane). In general, the next PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous PCs.

Given a sample of n observations on a vector of $N$ variables $X=\{X_1, X_2, …, X_N\}$, the $1^{st}$ PC is defined by a linear transformation $Z_1=a_1^T X =\sum_{i=1}^N {a_{i,1}X_i}$, where the vector $a_1=\{a_{1,1}, a_{2,1}, …, a_{N,1} \}$ is chosen to maximize the Variance of $Z_1$.

Similarly, $k^{th}$ PC of the sample is defined by a linear transformation $Z_k=a_k^T X =\sum_{i=1}^N {a_{i,k}X_i}$, where the vector $a_k=\{a_{1,k}, a_{2,k}, …, a_{N,k} \}$ is chosen to maximize the Variance of $Z_k$, subject to 3 conditions:

Variance of $Z_k$ is maximized, and
$Cov(Z_k,Z_l)=0$ for all $1 \leq l < k$, and
$a_k^Ta_k = 1.$

To find the transformation ($a$) we express the $var(Z_1)$ in terms of the variance-covariance matrix of $X$:

$var(Z_1) = E(Z_1^2) – (E (Z_1))^2 = \sum _{i,j,=1}^N {a_{i,1} a_{j,1} E(x_i x_j)} –

\sum _{i,j,=1}^N { a_{i,1} a_{j,1} E(x_i)E( x_j)} $ $= \sum _{i,j,=1}^N { a_{i,1} a_{j,1} S_{i,j}}$, where $ S_{i,j} = \sigma_{x_i x_j} = E(X_i X_j) – E(X_i)E(X_j)$. Thus,

$var(Z_1) =a_1^TSa_1$, where $S=(S_{i,j})$ is the variance covariance matrix of $X=\{X_1, …, X_N\}$.

As $a_1$ maximizes $Var(Z_1)$ subject to $||a_1||=a_i^Ta_1=1$, we set up this optimization problem as:

$\max_{a_1} \{a_1^T Sa_1 - \lambda(a_1^Ta_1 -1) \}.$ Differentiating w.r.t. $a_1$ and setting the derivative equal to zero, yields: $Sa_1 - \lambda a_1 = 0$, and thus, $(S-\lambda I_N)a_1=0$. Therefore, $a_1$ is an eigenvector of S, corresponding to the eigenvalue $\lambda=\lambda_1$. Maximizing $Var(Z_1)=a_1^TSa_1= a_1^T \lambda_1 a_1$, yields the largest eigenvalue of $S$, hence the first PC $Z_1$ retains the largest amount of variation in the sample.

Similarly, the second eigenvector $a_2$ maximizes $Var(Z_2)$, subject to:

$0=Cov(Z_2,Z_1)=a_1^TSa_2=\lambda_1a_1^Ta_2$, and
$a_2^Ta_2 = 1.$

Thus, for a pair of (Lagrange multipliers) $\lambda_1$ and $\lambda_2$, we need to maximize: $$ a_1^TSa_2 -\lambda_1(a_2^Ta_2-1) - \lambda_2(a_1^Ta_2).$$ Differentiating w.r.t. $a_1$ and setting the derivative equal to zero yields that: $(S-\lambda_2 I_N)a_2=0$, which means that $a_2$ is another eigenvector of $S$ corresponding to the eigenvalue $\lambda_2$. And so on. In general, $Var(Z_k)=a_k^TSa_k=\lambda_k$, where $a_k$ is the $k^{th}$ largest eigenvalue of $S$ and $Z_k$ is the $k^{th}$ PC retaining the $k^{th}$ largest variation in the sample $X$.

In Matrix notation:

Observations $X=\{X_1, X_2, …, X_N \}$
PCs: $Z=\{Z_1, Z_2, …, Z_N \}$, with $Z=A^TZ$ and $A=\{a_1, a_2, …, a_N\}$ is an orthogonal $N\times N$ matrix with columns representing the eigenvectors of S. That is, $\Lambda = A^T S A$, where $\Lambda$ is a diagonal matrix $\Lambda=diag(\lambda_1, \lambda_2, …, \lambda_n)$ of eigenvalues.

PCA (principal component analysis)

PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($||w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.

  • First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.
  • Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.
  • Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.
  • Singular value decomposition (SVD)
The SVD of $X=U \Sigma W^T$ where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.
  • Computational details of PCA:
    • Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (dimensions) as you like;
    • Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
    • Calculate the covariance matrix between the variables of the normalized dataset;
    • Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
    • Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.

PCA Properties

  • For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times q$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
  • $y=B'x$ and the $tr(\Sigma_y)$ are minimized by taking $B=A_q^*$ where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;
  • $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tends to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tends to stay about the same size because of the normalization constraint $\alpha_k \alpha_k'=1,$ for $k=1,2,…,p$.

PCA Limitations

  • The results of PCA depend on the scaling of the variables.
  • The applicability of PCA is limited by certain assumptions made in its derivation.

PCA in R

require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
# This data set contains statistics, in arrests per 100,000 residents for assault, 
# murder, and rape in each of the 50 US states in 1973. Also given is the percent of 
# the population living in urban areas.
help(USArrests)
head(USArrests)
pc.cr <- princomp(USArrests)  # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4 
82.890847 14.069560  6.424204  2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4 
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr)  # note that blank entries are small but not zero
## The signs of the columns are arbitrary


Loadings:
         Comp.1 Comp.2 Comp.3 Comp.4
Murder   -0.536  0.418 -0.341  0.649
Assault  -0.583  0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378  0.134
Rape     -0.543 -0.167  0.818   

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

plot(pc.cr) # shows a screeplot.
 
# The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal  
# components respectively.
biplot(pc.cr)  ## shows the plot of PCA in a different format

From the chart above, we can see the distribution of the variance attributed to different variables in the four principal components.

PCA using SOCR Analyses

This SOCR activity illustrates the use of PCA.

ICA (independent component analysis)

ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

  • ICA Assumptions:
    • The source signals are independent of each other.
    • The distributions of the values in each source signal are non-Gaussian.
    • Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;
    • Normality: Based on CLT, the distribution of the sum of independent random variables approximates a Gaussian distribution;
    • Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA cannot identify the actual number of source signals nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $w$, $s=Wx$, into maximally independent components $s$ measured by some functions of independence. There are alternative models for ICA:
  • Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
  • Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
  • Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
  • Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse.
  • ICA in R using package fastICA. This example demonstrates how to separate two mixed independent uniform variables:
library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R,  "*" and "%*%" indicate "scalar" and matrix multiplication, respectively!

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                 method = "C", row.norm = FALSE, maxit = 200,
                 tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$\$ $X, main = "Pre-processed data")
plot(a$\$ $S, main = "ICA components")

Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# cbind combines objects by rows and columns. 
#It takes a sequence of vector, matrix or data frames arguments and combines them by columns or rows, respectively. 

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                 method = "R", row.norm = FALSE, maxit = 200,
                 tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
         xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
         xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$\$ $S[,1 ], type = "l", main = "ICA source estimates",
         xlab = "", ylab = "")
plot(1:1000, a$\$ $S[, 2], type = "l", xlab = "", ylab = "")

Factor analysis (FA)

FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$ and $j \in \{1,…,k\}$ where $k<p$. We have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, we have $x-μ = LF+ε$, with $n$ observations; $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.

  • Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in the above example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$ which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In the example above, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question are given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ is a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and sample variance $σ_a^2=\frac{1}{N_a} \sum_i {(x_{ai}-μ_a)^2}$. The factor analysis model is expressed by:
$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\ \vdots & & \vdots & & \vdots & & \vdots \\ z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i} \end{matrix}$$

In matrix form ($Z=LF+ϵ$), this model can be expressed as: $$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$ where $F_{1,i}$ is the $i^{th}$ student’s verbal intelligence, $F_{2,i}$ is the $i^{th}$ student’s mathematical intelligence; $l_{ap}$ are the factor loadings for the $a^{th}$ subject for $p=1,2$.

FA in R: using factanal()
# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors, 
# with varimax rotation 

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables. 
# The psychological variables are locus_of_control, self_concept and motivation. 
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science). 
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary. 
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax")  # mydata can be a raw data matrix or a covariance matrix. 
# Pairwise deletion  of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2 
load <- fit$\$ $loadings[,1:2] 
plot(load,type="n") # set up plot 
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotation="varimax")
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$\$ $values, aparallel=ap$\$ $eigen$\$ $qevpea)
plotnScree(nS)

PCA, ICA, FA: Similarities and Differences

  • PCA is closely related to factor analysis. The latter typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data.
  • PCA is a variance-focused approach that seeks to reproduce the variable's total variance. In PCA, the components reflect both common and unique variance of the variable. It is generally preferred for the purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that factor analysis also involves linear combinations of variables.
  • In contrast to PCA, factor analysis is a correlation-focused approach seeking to reproduce the correlations among variables. The factors represent the common variance of variables and exclude unique variance. Factor analysis is generally used when the purpose of the research is detection of underlying structure in the data structure (i.e., latent constructs or factors) or causal modeling.

Applications

Parkinson's disease case study

The table below includes data for N=33 Parkinson's disease (PD) cases (3001, …, 3808) from the Parkinson’s Progression Markers Initiative (PPMI) database. We are interested in examining the relationships between imaging biomarkers, derived from the red nucleus in the Substantia Nigra (SN), and MDS-UPDRS assessment measures for Parkinson's Diseases patients, Part I.A, Part I.B, Part II and Part III scores.

  • Hypothesis: The a priori hypothesis is that there is a strong inverse correlation between the imaging biomarkers (voxel intensity ratios) and the MDS-UPDRS scores. In other words, an increase of the (average) voxel intensities (i.e., more intact nigrosome-1) would correspond with a decrease in UPDRS scores (i.e., less PD symptoms), and vice-versa, lower voxel intensity ratios would correspond with higher UPDRS scores.
  • Data:
The derived imaging biomarkers include:
Top of SN Voxel Intensity Ratio obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1 region. Five random points from the ventral medial portion of the substantia nigra within the slice of interest are averaged. The mean for the nigrosome-1 was divided over the mean for the other ROI’s to generate the Top of SN Voxel Intensity ratios bopth on the left and right side.
Side of SN Voxel Intensity Ratio obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1. Again, five random points from the outer swallow tail of the substantia nigra within the slice of interest are averaged on the left and right side
The Movement Disorder Society (MDS)-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): includes Part I (Non-motor Experiences of Daily Living); Part II (Motor Experiences of Daily Living); Part III (Motor Examination); and Part IV (Motor Complications) which was excluded as it was only conducted for PPMI subjects who had started PD medication.
Patient_ID Top_of_SN_Voxel_Intensity_Ratio Side_of_SN_Voxel_Intensity_Ratio Part_IA Part_IB Part_II Part_III
3001 1.188075152 0.940778688 0 8 2 12
3002 1.621806155 1.123298088 3 5 15 17
3003 1.058309086 1.015158507 1 11 6 29
3004 2.027838499 1.31348223 0 2 0 2
3006 1.754682932 1.064932943 1 2 4 22
3008 1.399353481 1.15022502 1 7 2 10
3010 1.327449894 0.930592788 2 9 11 19
3011 1.765207347 1.241295983 1 0 0 0
3012 1.565801729 1.329574405 4 7 6 20
3014 1.231222196 1.038040254 0 8 3 36
3016 1.560849827 1.305709046 0 2 0 1
3018 1.2011895 0.963859444 2 5 5 19
3020 1.693136427 1.010054879 2 13 10 20
3021 1.334255775 1.027113745 1 7 5 12
3023 1.221884984 0.949202296 4 5 4 30
3024 1.406800428 0.995810978 0 3 9 12
3029 1.74564323 1.22107546 0 1 2 3
3188 1.795850966 1.381062457 0 5 0 0
3551 1.811029249 1.160219407 0 2 0 0
3554 1.611463658 1.184391119 3 4 1 15
3361 2.149437904 1.197802706 0 1 2 9
3161 1.801208214 1.061680928 0 6 3 16
3767 1.484570315 1.208752123 3 4 6 21
3305 1.415250395 1.116697914 0 1 1 12
3307 1.79439256 1.111008744 1 8 17 26
3308 1.398475885 0.985062613 1 6 1 6
3309 1.151021722 0.963016059 3 6 8 29
3311 1.298993676 1.102174959 0 1 0 0
3314 1.371292596 1.214219683 1 2 0 0
3322 1.437867168 0.977680828 0 0 0 2
3352 1.84019943 0.960479831 6 13 10 29
3586 1.722158512 1.157249933 0 0 0 0
3808 1.355532247 1.111517624 1 8 2 13
  • R-script analyses of the PPMI data:
dataset <- read.csv('C:\\Users\\Desktop\\PPMI_data.csv', header = TRUE)
# Patient_ID, Top_of_SN_Voxel_Intensity_Ratio, Side_of_SN_Voxel_Intensity_Ratio	
# Part_IA, Part_IB, Part_II, Part_III

# remove the first columns (Patient ID number)
dataset <- dataset[,-1]
attach(dataset)
# form the multivariate response variable, Y
Y <- cbind(Part_IA, Part_IB, Part_II, Part_III)

# form the predictors vector, X
X <- cbind(Top_of_SN_Voxel_Intensity_Ratio,Side_of_SN_Voxel_Intensity_Ratio)

summary(dataset)
head(dataset)
# some diagnostic plots
par(mfrow=c(2,3))
plot(sort(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio))
hist(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio)
plot(density(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE)) 

plot(sort(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio))
hist(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio)
plot(density(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE)) 
# Try MANOVA first to explore predictive value of imaging biomarkers on clinical PD scores
# MMLR Analysis
# mmlr.model <- lm(Y ~ X)     # Fit the multivariate model
# summary(manova(mmlr.model)) 
# summary (manova(mmlr.model), test='Wilks')
# manova(mmlr.model)
# mmlr.model 
# Fit linear regression models
Fit_IA <- lm(Part_IA ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio)
Fit_IA
summary(Fit_IA)

Fit_IB <- lm(Part_IB ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio)
Fit_IB
summary(Fit_IB)

Fit_III <- lm(Part_III ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio)
Fit_III
summary(Fit_III)
par(mar = rep(3, 4))
plot(Fit_IA); # text(0.5,0.5,"First title",cex=2,font=2)
plot(Fit_IB); # text(0.5,0.5,"Second title",cex=2,font=2)
plot(Fit_III); # text(0.5,0.5,"Third title",cex=2,font=2)
# PCA/FA Analyses
par(mfrow=c(1,1))
pca.model <- princomp(dataset, cor=TRUE)
summary(pca.model) # print variance accounted for 
loadings(pca.model) # pc loadings (i.e., igenvector columns)

Loadings:
                                Comp.1 Comp.2 Comp.3 Comp.4	  Comp.5 	Comp.6
Top_of_SN_Voxel_Intensity_Ratio  -0.256 -0.713 -0.373 -0.105	  0.477 	-0.221
Side_of_SN_Voxel_Intensity_Ratio -0.386 -0.472  0.357  0.433	  -0.558
Part_IA                           0.383 -0.373  0.710 -0.320	  0.238  	0.227
Part_IB                           0.460        -0.112  0.794	  0.292  	0.226
Part_II                           0.425 -0.342 -0.464 -0.262     -0.534  	0.365
Part_III                          0.498                          -0.183 	-0.844

plot(pca.model, type="lines") # scree plot 

pca.model$\$ $scores # the principal components
par(mfrow=c(1,1))
biplot(pca.model)
library("FactoMineR", lib.loc="~/R/win-library/3.1")
pca.model <- PCA(dataset)
pca.model
# Factor Analysis
#install.packages("factanal")
fa.model <- factanal(dataset, 2, rotation="varimax")
print(fa.model, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2  
load <- fa.model$\$ $loadings[,1:2] 
plot(load,type="n") # set up plot 
text(load, labels=names(fa.model),cex=.7) # add variable names

# factanal(x = dataset, factors = 2, rotation = "varimax")

Uniquenesses:
Top_of_SN_Voxel_Intensity_Ratio      0.02 
Side_of_SN_Voxel_Intensity_Ratio     0.53 
                        Part_IA      0.57 
                        Part_IB      0.41 
                        Part_II      0.39 
                       Part_III      0.22 

Loadings:
                                Factor1 Factor2
Part_IA                           0.65          
Part_IB                           0.73          
Part_II                           0.78          
Part_III                          0.82   -0.32  
Top_of_SN_Voxel_Intensity_Ratio           0.99  
Side_of_SN_Voxel_Intensity_Ratio -0.42    0.54  

              Factor1 Factor2
SS loadings       2.41    1.44
Proportion Var    0.40    0.24
Cumulative Var    0.40    0.64
  • This SOCR Activity demonstrates the utilization of a SOCR analysis package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and demonstrates how to use PCA and read and interpret the outcome. It introduces students to inputting data in the correct format, reading the results of PCA and interpreting the resulting transformed data.
  • This article presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, is recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants block-scaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.
  • This article introduces the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It relates the order determination of an autoregressive model to the determination of the number of factors in a maximum likelihood factor analysis. The use of the AIC criterion in factor analysis is particularly interesting when it used with a Bayesian model. This observation reveals that AIC can be more widely applied than only to the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads to the handling of the problem of improper solutions by introducing a natural prior distribution of factor loadings.
  • This article contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields in recent years. The basic independent component model is a semi-parametric model that assumes that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate the unmixing matrix so the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (i.e., consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so-called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

Software

Problems

R Example 1

# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data 
# (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex), 
# save the table in ASCII text file CPI_Data.dat. 
# Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and FA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5     Comp.6      Comp.7
Standard deviation     2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion  0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
#  X: pre-processed data matrix (whitened/sphered data)
#  K: pre-whitening matrix that projects data onto the first n.comp
#     principal components.
#  W: estimated un-mixing matrix (XW = S)
#  A: estimated mixing matrix (X = SA)
#  S: estimated source matrix (factor scores, $\Theta$ in the notes)

> windows()
> biplot(pc.cor)

> S <- ica$\$ $S
> dimnames(S) <- list(dimnames(X)1,paste("Cmp.",1:7,sep=""))
> A <- ica$\$ $A
> dimnames(A) <- list(dimnames(X)2,paste("Cmp.",1:7,sep=""))
> windows()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor) 
Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity  -0.415 -0.227  0.164  0.512 -0.373 -0.576
Fuel_Oil     -0.351  0.547               -0.198  0.312
Bananas      -0.373 -0.415  0.258  0.365  0.393  0.578
Tomatoes     -0.369 -0.320  0.357 -0.738 -0.294       
Orange_Juice -0.324 -0.311 -0.871        -0.119       
Beef         -0.424  0.220        -0.216  0.721 -0.449
Gasoline     -0.380  0.479               -0.216  0.161
             Comp.7
Electricity  -0.131
Fuel_Oil     -0.657
Bananas            
Tomatoes           
Orange_Juice  0.108
Beef               
Gasoline      0.733

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857
               Comp.7
SS loadings     1.000
Proportion Var  0.143
Cumulative Var  1.000

> field <- function(x) { substr(paste(x,"           ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
 		Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity                            
Fuel_Oil                            
Bananas 	-2.59                      
Tomatoes 	-2.9                       
Orange_Juice 	-2.66                      
Beef                            
Gasoline                            

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
		Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity 	-0.41                0.51  
Fuel_Oil 	-0.35  0.55                
Bananas 	-0.37  -0.41         0.37  
Tomatoes 	-0.37  -0.32  0.36   -0.74 
Orange_Juice 	-0.32  -0.31  -0.87        
Beef 		-0.42                      
Gasoline 	-0.38  0.48

References


Appendix

SOCR 1981-2006 CPI Dataset

Year Electricity Fuel_Oil Bananas Tomatoes Orange_Juice Beef Gasoline
1981 31.552 1.15 0.343 0.792 1.141 1.856 1.269
1982 36.006 1.254 0.364 0.763 1.465 1.794 1.341
1983 37.184 1.194 0.332 0.726 1.418 1.756 1.214
1984 38.6 1.122 0.344 0.854 1.408 1.721 1.2
1985 38.975 1.078 0.35 0.697 1.685 1.711 1.145
1986 40.223 1.126 0.337 1.104 1.756 1.662 1.19
1987 40.022 0.817 0.374 0.871 1.512 1.694 0.868
1988 40.195 0.89 0.394 0.797 1.638 1.736 0.947
1989 40.828 0.883 0.429 1.735 1.868 1.806 0.944
1990 41.663 1.259 0.438 0.912 1.817 1.907 1.09
1991 43.226 1.235 0.428 0.936 2.005 1.996 1.304
1992 44.501 0.985 0.426 1.141 1.879 1.926 1.135
1993 46.959 0.969 0.44 1.604 1.677 1.97 1.182
1994 48.2 0.919 0.503 1.323 1.674 1.892 1.109
1995 48.874 0.913 0.463 1.103 1.583 1.847 1.19
1996 48.538 1.007 0.497 1.213 1.577 1.799 1.186
1997 49.245 1.136 0.473 1.452 1.737 1.85 1.318
1998 46.401 0.966 0.489 1.904 1.601 1.818 1.186
1999 45.061 0.834 0.49 1.443 1.753 1.834 1.031
2000 45.207 1.189 0.5 1.414 1.823 1.903 1.356
2001 47.472 1.509 0.509 1.451 1.863 2.037 1.525
2002 47.868 1.123 0.526 1.711 1.876 2.152 1.209
2003 47.663 1.396 0.512 1.472 1.848 2.131 1.557
2004 49.159 1.508 0.485 1.66 1.957 2.585 1.635
2005 50.847 1.859 0.49 2.162 1.872 2.478 1.866
2006 57.223 2.418 0.505 1.621 1.853 2.607 2.359





Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif