Difference between revisions of "SMHS PCA ICA FA"
(→Factor analysis (FA)) |
(→Applications) |
||
(58 intermediate revisions by 2 users not shown) | |||
Line 2: | Line 2: | ||
===Overview=== | ===Overview=== | ||
− | * ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a | + | * ''PCA'' (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation. |
* ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other. | * ''ICA'' (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other. | ||
− | * ''Factor analysis'' is a statistical method | + | * ''Factor analysis'' is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples. |
===Motivation=== | ===Motivation=== | ||
− | Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What | + | Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset? |
===Theory=== | ===Theory=== | ||
+ | ====Fundamentals==== | ||
+ | First review the [http://en.wikipedia.org/wiki/Matrix_multiplication rules for matrix and vector multiplication]. In an essence, the arithmetic process of multiplying numbers in row $i$ in matrix, or vector, $X^T$ and column $j$ in matrix, or vector, $A$ corresponds to component-wise multiplication followed by adding the terms, summation, to obtain entry $X\times A_{i,j}$ in the product matrix. As an example, $X_{1\times n}^T A_{n\times n}X_{n\times 1} = B_{1\times 1}$ is a scalar value. The transpose notation, $X^T$, indicates that the rows and columns are swapped, e.g., for a standard vector represented as a column, $X=\begin{pmatrix}X_1\\X_2\\...\\X_N \end{pmatrix}$, the transpose is the same vector as a row, $X^T=(X_1, X_2, ..., X_N)$. | ||
+ | |||
+ | Now, suppose we have a sample of $n$ observations as points in a plane: $\{p_1, p_2, …, p_n\}$ in the 2D space, $p=(x_1,x_2)$. How can we account (approximately) for the variation in the sample in as few variables as possible. The figure below illustrates the geometric nature of '''principal components (PCs)'''. | ||
+ | |||
+ | <center> | ||
+ | [[Image:SMHS_PCA_ICA_FA_Fig1_c.png|500px]] | ||
+ | </center> | ||
+ | |||
+ | The $1^{st}$ PC, $Z_1$, is a minimum distance fit to a line in the $X$ space. The $2^{nd}$ PC, $Z_2$, is a minimum distance fit to a line perpendicular to the $1^{st}$ PC (in an (N-1)D hyper-plane). In general, the next PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous PCs. | ||
+ | |||
+ | Given a sample of n observations on a vector of $N$ variables | ||
+ | $X=\{X_1, X_2, …, X_N\}$, the $1^{st}$ PC is defined by a linear transformation $Z_1=a_1^T X =\sum_{i=1}^N {a_{i,1}X_i}$, where the vector $a_1=\{a_{1,1}, a_{2,1}, …, a_{N,1} \}$ is chosen to maximize the Variance of $Z_1$. | ||
+ | |||
+ | Similarly, $k^{th}$ PC of the sample is defined by a linear transformation $Z_k=a_k^T X =\sum_{i=1}^N {a_{i,k}X_i}$, where the vector $a_k=\{a_{1,k}, a_{2,k}, …, a_{N,k} \}$ is chosen to maximize the Variance of $Z_k$, subject to 3 conditions: | ||
+ | :: Variance of $Z_k$ is maximized, and | ||
+ | :: $Cov(Z_k,Z_l)=0$ for all $1 \leq l < k$, and | ||
+ | :: $a_k^Ta_k = 1.$ | ||
+ | |||
+ | To find the transformation ($a$) we express the $var(Z_1)$ in terms of the variance-covariance matrix of $X$: | ||
+ | :: $var(Z_1) = E(Z_1^2) – (E (Z_1))^2 = \sum _{i,j,=1}^N {a_{i,1} a_{j,1} E(x_i x_j)} – | ||
+ | \sum _{i,j,=1}^N { a_{i,1} a_{j,1} E(x_i)E( x_j)} $ $= \sum _{i,j,=1}^N { a_{i,1} a_{j,1} S_{i,j}}$, | ||
+ | where $ S_{i,j} = \sigma_{x_i x_j} = E(X_i X_j) – E(X_i)E(X_j)$. | ||
+ | Thus, | ||
+ | : $var(Z_1) =a_1^TSa_1$, where $S=(S_{i,j})$ is the variance covariance matrix of $X=\{X_1, …, X_N\}$. | ||
+ | As $a_1$ maximizes $Var(Z_1)$ subject to $||a_1||=a_i^Ta_1=1$, we set up this optimization problem as: | ||
+ | |||
+ | $\max_{a_1} \{a_1^T Sa_1 - \lambda(a_1^Ta_1 -1) \}.$ | ||
+ | Differentiating w.r.t. $a_1$ and setting the derivative equal to zero, yields: | ||
+ | $Sa_1 - \lambda a_1 = 0$, and thus, $(S-\lambda I_N)a_1=0$. | ||
+ | Therefore, $a_1$ is an eigenvector of S, corresponding to the eigenvalue $\lambda=\lambda_1$. | ||
+ | Maximizing $Var(Z_1)=a_1^TSa_1= a_1^T \lambda_1 a_1$, yields the largest eigenvalue of $S$, hence the first PC $Z_1$ retains the largest amount of variation in the sample. | ||
+ | |||
+ | Similarly, the second eigenvector $a_2$ maximizes $Var(Z_2)$, subject to: | ||
+ | :: $0=Cov(Z_2,Z_1)=a_1^TSa_2=\lambda_1a_1^Ta_2$, and | ||
+ | :: $a_2^Ta_2 = 1.$ | ||
+ | |||
+ | Thus, for a pair of ([http://en.wikipedia.org/wiki/Lagrange_multiplier Lagrange multipliers]) $\lambda_1$ and $\lambda_2$, we need to maximize: | ||
+ | $$ a_1^TSa_2 -\lambda_1(a_2^Ta_2-1) - \lambda_2(a_1^Ta_2).$$ | ||
+ | Differentiating w.r.t. $a_1$ and setting the derivative equal to zero yields that: | ||
+ | $(S-\lambda_2 I_N)a_2=0$, which means that $a_2$ is another eigenvector of $S$ corresponding to the eigenvalue $\lambda_2$. And so on. | ||
+ | In general, $Var(Z_k)=a_k^TSa_k=\lambda_k$, where $a_k$ is the $k^{th}$ largest eigenvalue of $S$ and $Z_k$ is the $k^{th}$ PC retaining the $k^{th}$ largest variation in the sample $X$. | ||
+ | |||
+ | In Matrix notation: | ||
+ | :: Observations $X=\{X_1, X_2, …, X_N \}$ | ||
+ | :: PCs: $Z=\{Z_1, Z_2, …, Z_N \}$, with $Z=A^TZ$ and $A=\{a_1, a_2, …, a_N\}$ is an orthogonal $N\times N$ matrix with columns representing the eigenvectors of S. That is, $\Lambda = A^T S A$, where $\Lambda$ is a diagonal matrix $\Lambda=diag(\lambda_1, \lambda_2, …, \lambda_n)$ of eigenvalues. | ||
+ | |||
====PCA (principal component analysis)==== | ====PCA (principal component analysis)==== | ||
− | PCA is a statistical technique | + | PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets. |
− | Consider a data matrix | + | Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($||w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$. |
− | * First component: the first loading vector w_ | + | * First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$. |
− | * Further components: the k^th component can be found by subtracting the first k-1 principal components from X. X | + | * Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$. |
− | * Dimensionality reduction: The faithful transformation T = X W maps a data vector x_ | + | * Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible. |
− | * Singular value decomposition (SVD): | + | * Singular value decomposition (SVD) |
+ | : The SVD of $X=U \Sigma W^T$ where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value. | ||
* Computational details of PCA: | * Computational details of PCA: | ||
− | ** Begin with a dataset | + | ** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (dimensions) as you like; |
− | ** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation | + | ** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$ |
** Calculate the covariance matrix between the variables of the normalized dataset; | ** Calculate the covariance matrix between the variables of the normalized dataset; | ||
− | ** Calculate the | + | ** Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1); |
− | ** | + | ** Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue. |
− | + | ====PCA Properties==== | |
− | + | * For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times q$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$; | |
− | * | + | * $y=B'x$ and the $tr(\Sigma_y)$ are minimized by taking $B=A_q^*$ where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones; |
− | * | + | * $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tends to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tends to stay about the same size because of the normalization constraint $\alpha_k \alpha_k'=1,$ for $k=1,2,…,p$. |
− | + | ====PCA Limitations==== | |
− | + | * The results of PCA depend on the scaling of the variables. | |
− | + | * The applicability of PCA is limited by certain assumptions made in its derivation. | |
− | + | ==== PCA in R==== | |
require(graphics) | require(graphics) | ||
## The variances of the variables in the | ## The variances of the variables in the | ||
## USArrests data vary by orders of magnitude, so scaling is appropriate | ## USArrests data vary by orders of magnitude, so scaling is appropriate | ||
− | (pc.cr <- princomp(USArrests | + | # This data set contains statistics, in arrests per 100,000 residents for assault, |
+ | # murder, and rape in each of the 50 US states in 1973. Also given is the percent of | ||
+ | # the population living in urban areas. | ||
+ | help(USArrests) | ||
+ | head(USArrests) | ||
+ | |||
+ | pc.cr <- princomp(USArrests) # inappropriate | ||
Call: | Call: | ||
Line 93: | Line 147: | ||
biplot(pc.cr) ## shows the plot of PCA in a different format | biplot(pc.cr) ## shows the plot of PCA in a different format | ||
− | From the chart above, we can see the distribution variance attributed to different variables | + | From the chart above, we can see the distribution of the variance attributed to different variables in the four principal components. |
− | + | ==== [[SOCR_EduMaterials_AnalysisActivities_PCA|PCA using SOCR Analyses]]==== | |
+ | [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR activity]] illustrates the use of PCA. | ||
====ICA (independent component analysis)==== | ====ICA (independent component analysis)==== | ||
− | ICA is a computational method | + | ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other. |
* ICA Assumptions: | * ICA Assumptions: | ||
− | ** The source signals are independent of each other | + | ** The source signals are independent of each other. |
− | ** The | + | ** The distributions of the values in each source signal are non-Gaussian. |
− | ** Independence: | + | ** Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals; |
− | ** Normality: | + | ** Normality: Based on [[SMHS_CLT_LLN|CLT]], the distribution of the sum of independent random variables approximates a Gaussian distribution; |
− | ** Complexity: | + | ** Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal. |
− | ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA | + | : ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA cannot identify the actual number of source signals nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $w$, $s=Wx$, into maximally independent components $s$ measured by some functions of independence. There are alternative ''models'' for ICA: |
− | * Linear noiseless ICA: the components x_i of the data x=(x_1,x_2,…,x_m )^ | + | * Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$. |
− | * Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise n | + | * Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$. |
− | * Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function f(.|θ) with parameter | + | * Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$. |
− | * Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources s_k is Gaussian; (2) the number of observed mixtures, m, must be at least as large as the number of estimated components: n | + | * Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse. |
− | + | * ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example demonstrates how to separate two mixed independent uniform variables: | |
− | * ICA in R using package [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf fastICA]. This example | ||
library(fastICA) | library(fastICA) | ||
S <- matrix(runif(10000), 5000, 2) | S <- matrix(runif(10000), 5000, 2) | ||
− | + | A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE) | |
− | + | X <- S %*% A # In R, "*" and "%*%" indicate "scalar" and matrix multiplication, respectively! | |
− | + | ||
+ | a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1, | ||
method = "C", row.norm = FALSE, maxit = 200, | method = "C", row.norm = FALSE, maxit = 200, | ||
tol = 0.0001, verbose = TRUE) | tol = 0.0001, verbose = TRUE) | ||
− | + | par(mfrow = c(1, 2)) | |
− | + | plot(a$\$ $X, main = "Pre-processed data") | |
− | + | plot(a$\$ $S, main = "ICA components") | |
: Another example of un-mixing two independent signals is shown below: | : Another example of un-mixing two independent signals is shown below: | ||
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5)) | S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5)) | ||
− | + | # [http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html cbind] combines objects by rows and columns. | |
− | + | #It takes a sequence of vector, matrix or data frames arguments and combines them by columns or rows, respectively. | |
− | + | ||
+ | A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2) | ||
+ | X <- S %*% A | ||
+ | a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1, | ||
method = "R", row.norm = FALSE, maxit = 200, | method = "R", row.norm = FALSE, maxit = 200, | ||
tol = 0.0001, verbose = TRUE) | tol = 0.0001, verbose = TRUE) | ||
− | + | par(mfcol = c(2, 3)) | |
− | + | plot(1:1000, S[,1 ], type = "l", main = "Original Signals", | |
xlab = "", ylab = "") | xlab = "", ylab = "") | ||
− | + | plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "") | |
− | + | plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals", | |
xlab = "", ylab = "") | xlab = "", ylab = "") | ||
− | + | plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "") | |
− | + | plot(1:1000, a$\$ $S[,1 ], type = "l", main = "ICA source estimates", | |
xlab = "", ylab = "") | xlab = "", ylab = "") | ||
− | plot(1:1000, a$S[, 2], type = "l", xlab = "", ylab = "") | + | plot(1:1000, a$\$ $S[, 2], type = "l", xlab = "", ylab = "") |
====Factor analysis (FA)==== | ====Factor analysis (FA)==== | ||
− | FA is a statistical method | + | FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$ and $j \in \{1,…,k\}$ where $k<p$. We have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, we have $x-μ = LF+ε$, with $n$ observations; $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$. |
+ | |||
+ | * Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in the above example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$ which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In the example above, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question are given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ is a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and sample variance $σ_a^2=\frac{1}{N_a} \sum_i {(x_{ai}-μ_a)^2}$. The factor analysis model is expressed by: | ||
+ | :$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\ | ||
+ | \vdots & & \vdots & & \vdots & & \vdots \\ | ||
+ | z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i} | ||
+ | \end{matrix}$$ | ||
− | + | In matrix form ($Z=LF+ϵ$), this model can be expressed as: | |
+ | $$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$ | ||
+ | where $F_{1,i}$ is the $i^{th}$ student’s ''verbal intelligence'', $F_{2,i}$ is the $i^{th}$ student’s ''mathematical intelligence''; $l_{ap}$ are the factor loadings for the $a^{th}$ subject for $p=1,2$. | ||
− | + | =====FA in R: using ''factanal()''===== | |
# Maximum Likelihood Factor Analysis | # Maximum Likelihood Factor Analysis | ||
# entering raw data and extracting 3 factors, | # entering raw data and extracting 3 factors, | ||
Line 163: | Line 229: | ||
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset. | # To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset. | ||
summary(mydata) | summary(mydata) | ||
− | + | ||
sapply(mydata, sd) | sapply(mydata, sd) | ||
fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix. | fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix. | ||
# Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax". | # Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax". | ||
− | + | ||
print(fit, digits=2, cutoff=.3, sort=TRUE) | print(fit, digits=2, cutoff=.3, sort=TRUE) | ||
# plot factor 1 by factor 2 | # plot factor 1 by factor 2 | ||
Line 184: | Line 250: | ||
ev <- eigen(cor(mydata)) # get eigenvalues | ev <- eigen(cor(mydata)) # get eigenvalues | ||
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05) | ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05) | ||
− | nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea) | + | nS <- nScree(x=ev$\$ $values, aparallel=ap$\$ $eigen$\$ $qevpea) |
plotnScree(nS) | plotnScree(nS) | ||
− | ===PCA, ICA, FA: | + | ===PCA, ICA, FA: Similarities and Differences=== |
− | * PCA is closely related to factor analysis. The | + | * PCA is closely related to factor analysis. The latter typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data. |
− | * PCA is a variance-focused approach | + | * PCA is a variance-focused approach that seeks to reproduce the variable's total variance. In PCA, the components reflect both common and unique variance of the variable. It is generally preferred for the purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that factor analysis also involves linear combinations of variables. |
− | * | + | * In contrast to PCA, factor analysis is a correlation-focused approach seeking to reproduce the correlations among variables. The factors represent the common variance of variables and exclude unique variance. Factor analysis is generally used when the purpose of the research is detection of underlying structure in the data structure (i.e., latent constructs or factors) or causal modeling. |
===Applications=== | ===Applications=== | ||
− | + | ====Parkinson's disease case study==== | |
+ | The table below includes data for N=33 Parkinson's disease (PD) cases (3001, …, 3808) from the [http://www.ppmi-info.org/ Parkinson’s Progression Markers Initiative (PPMI) database]. We are interested in examining the relationships between imaging biomarkers, derived from the red nucleus in the Substantia Nigra (SN), and [http://www.movementdisorders.org/MDS-Files1/PDFs/MDS-UPDRS-Rating-Scales/NewUPDRS7308final.pdf MDS-UPDRS assessment measures] for Parkinson's Diseases patients, Part I.A, Part I.B, Part II and Part III scores. | ||
+ | |||
+ | * '''Hypothesis''': The a priori hypothesis is that there is a strong inverse correlation between the imaging biomarkers (voxel intensity ratios) and the MDS-UPDRS scores. In other words, an increase of the (average) voxel intensities (i.e., more intact nigrosome-1) would correspond with a decrease in UPDRS scores (i.e., less PD symptoms), and vice-versa, lower voxel intensity ratios would correspond with higher UPDRS scores. | ||
+ | |||
+ | * '''Data''': | ||
+ | : The derived imaging biomarkers include: | ||
+ | :: '''Top of SN Voxel Intensity Ratio''' obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1 region. Five random points from the ventral medial portion of the substantia nigra within the slice of interest are averaged. The mean for the nigrosome-1 was divided over the mean for the other ROI’s to generate the Top of SN Voxel Intensity ratios bopth on the left and right side. | ||
+ | :: '''Side of SN Voxel Intensity Ratio''' obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1. Again, five random points from the outer swallow tail of the substantia nigra within the slice of interest are averaged on the left and right side | ||
− | + | : The Movement Disorder Society (MDS)-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): includes Part I (Non-motor Experiences of Daily Living); Part II (Motor Experiences of Daily Living); Part III (Motor Examination); and Part IV (Motor Complications) which was excluded as it was only conducted for PPMI subjects who had started PD medication. | |
− | + | <center> | |
+ | {| class="wikitable" style="text-align:center;" border="2" | ||
+ | |- | ||
+ | !Patient_ID||Top_of_SN_Voxel_Intensity_Ratio||Side_of_SN_Voxel_Intensity_Ratio||Part_IA||Part_IB||Part_II||Part_III | ||
+ | |- | ||
+ | |3001||1.188075152||0.940778688||0||8||2||12 | ||
+ | |- | ||
+ | |3002||1.621806155||1.123298088||3||5||15||17 | ||
+ | |- | ||
+ | |3003||1.058309086||1.015158507||1||11||6||29 | ||
+ | |- | ||
+ | |3004||2.027838499||1.31348223||0||2||0||2 | ||
+ | |- | ||
+ | |3006||1.754682932||1.064932943||1||2||4||22 | ||
+ | |- | ||
+ | |3008||1.399353481||1.15022502||1||7||2||10 | ||
+ | |- | ||
+ | |3010||1.327449894||0.930592788||2||9||11||19 | ||
+ | |- | ||
+ | |3011||1.765207347||1.241295983||1||0||0||0 | ||
+ | |- | ||
+ | |3012||1.565801729||1.329574405||4||7||6||20 | ||
+ | |- | ||
+ | |3014||1.231222196||1.038040254||0||8||3||36 | ||
+ | |- | ||
+ | |3016||1.560849827||1.305709046||0||2||0||1 | ||
+ | |- | ||
+ | |3018||1.2011895||0.963859444||2||5||5||19 | ||
+ | |- | ||
+ | |3020||1.693136427||1.010054879||2||13||10||20 | ||
+ | |- | ||
+ | |3021||1.334255775||1.027113745||1||7||5||12 | ||
+ | |- | ||
+ | |3023||1.221884984||0.949202296||4||5||4||30 | ||
+ | |- | ||
+ | |3024||1.406800428||0.995810978||0||3||9||12 | ||
+ | |- | ||
+ | |3029||1.74564323||1.22107546||0||1||2||3 | ||
+ | |- | ||
+ | |3188||1.795850966||1.381062457||0||5||0||0 | ||
+ | |- | ||
+ | |3551||1.811029249||1.160219407||0||2||0||0 | ||
+ | |- | ||
+ | |3554||1.611463658||1.184391119||3||4||1||15 | ||
+ | |- | ||
+ | |3361||2.149437904||1.197802706||0||1||2||9 | ||
+ | |- | ||
+ | |3161||1.801208214||1.061680928||0||6||3||16 | ||
+ | |- | ||
+ | |3767||1.484570315||1.208752123||3||4||6||21 | ||
+ | |- | ||
+ | |3305||1.415250395||1.116697914||0||1||1||12 | ||
+ | |- | ||
+ | |3307||1.79439256||1.111008744||1||8||17||26 | ||
+ | |- | ||
+ | |3308||1.398475885||0.985062613||1||6||1||6 | ||
+ | |- | ||
+ | |3309||1.151021722||0.963016059||3||6||8||29 | ||
+ | |- | ||
+ | |3311||1.298993676||1.102174959||0||1||0||0 | ||
+ | |- | ||
+ | |3314||1.371292596||1.214219683||1||2||0||0 | ||
+ | |- | ||
+ | |3322||1.437867168||0.977680828||0||0||0||2 | ||
+ | |- | ||
+ | |3352||1.84019943||0.960479831||6||13||10||29 | ||
+ | |- | ||
+ | |3586||1.722158512||1.157249933||0||0||0||0 | ||
+ | |- | ||
+ | |3808||1.355532247||1.111517624||1||8||2||13 | ||
+ | |} | ||
+ | </center> | ||
− | * [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields | + | * R-script analyses of the PPMI data: |
+ | |||
+ | dataset <- read.csv('C:\\Users\\Desktop\\PPMI_data.csv', header = TRUE) | ||
+ | # Patient_ID, Top_of_SN_Voxel_Intensity_Ratio, Side_of_SN_Voxel_Intensity_Ratio | ||
+ | # Part_IA, Part_IB, Part_II, Part_III | ||
+ | |||
+ | # remove the first columns (Patient ID number) | ||
+ | dataset <- dataset[,-1] | ||
+ | attach(dataset) | ||
+ | |||
+ | # form the multivariate response variable, Y | ||
+ | Y <- cbind(Part_IA, Part_IB, Part_II, Part_III) | ||
+ | |||
+ | # form the predictors vector, X | ||
+ | X <- cbind(Top_of_SN_Voxel_Intensity_Ratio,Side_of_SN_Voxel_Intensity_Ratio) | ||
+ | |||
+ | summary(dataset) | ||
+ | head(dataset) | ||
+ | |||
+ | # some diagnostic plots | ||
+ | par(mfrow=c(2,3)) | ||
+ | plot(sort(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio)) | ||
+ | hist(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio) | ||
+ | plot(density(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE)) | ||
+ | |||
+ | plot(sort(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio)) | ||
+ | hist(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio) | ||
+ | plot(density(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE)) | ||
+ | |||
+ | # Try MANOVA first to explore predictive value of imaging biomarkers on clinical PD scores | ||
+ | # MMLR Analysis | ||
+ | # mmlr.model <- lm(Y ~ X) # Fit the multivariate model | ||
+ | # summary(manova(mmlr.model)) | ||
+ | # summary (manova(mmlr.model), test='Wilks') | ||
+ | # manova(mmlr.model) | ||
+ | # mmlr.model | ||
+ | |||
+ | # Fit linear regression models | ||
+ | Fit_IA <- lm(Part_IA ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio) | ||
+ | Fit_IA | ||
+ | summary(Fit_IA) | ||
+ | |||
+ | Fit_IB <- lm(Part_IB ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio) | ||
+ | Fit_IB | ||
+ | summary(Fit_IB) | ||
+ | |||
+ | Fit_III <- lm(Part_III ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio) | ||
+ | Fit_III | ||
+ | summary(Fit_III) | ||
+ | |||
+ | par(mar = rep(3, 4)) | ||
+ | plot(Fit_IA); # text(0.5,0.5,"First title",cex=2,font=2) | ||
+ | plot(Fit_IB); # text(0.5,0.5,"Second title",cex=2,font=2) | ||
+ | plot(Fit_III); # text(0.5,0.5,"Third title",cex=2,font=2) | ||
+ | |||
+ | # PCA/FA Analyses | ||
+ | par(mfrow=c(1,1)) | ||
+ | pca.model <- princomp(dataset, cor=TRUE) | ||
+ | summary(pca.model) # print variance accounted for | ||
+ | loadings(pca.model) # pc loadings (i.e., igenvector columns) | ||
+ | |||
+ | Loadings: | ||
+ | Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 | ||
+ | Top_of_SN_Voxel_Intensity_Ratio -0.256 -0.713 -0.373 -0.105 0.477 -0.221 | ||
+ | Side_of_SN_Voxel_Intensity_Ratio -0.386 -0.472 0.357 0.433 -0.558 | ||
+ | Part_IA 0.383 -0.373 0.710 -0.320 0.238 0.227 | ||
+ | Part_IB 0.460 -0.112 0.794 0.292 0.226 | ||
+ | Part_II 0.425 -0.342 -0.464 -0.262 -0.534 0.365 | ||
+ | Part_III 0.498 -0.183 -0.844 | ||
+ | |||
+ | plot(pca.model, type="lines") # scree plot | ||
+ | |||
+ | pca.model$\$ $scores # the principal components | ||
+ | par(mfrow=c(1,1)) | ||
+ | biplot(pca.model) | ||
+ | |||
+ | library("FactoMineR", lib.loc="~/R/win-library/3.1") | ||
+ | pca.model <- PCA(dataset) | ||
+ | pca.model | ||
+ | |||
+ | # Factor Analysis | ||
+ | #install.packages("factanal") | ||
+ | fa.model <- factanal(dataset, 2, rotation="varimax") | ||
+ | print(fa.model, digits=2, cutoff=.3, sort=TRUE) | ||
+ | # plot factor 1 by factor 2 | ||
+ | load <- fa.model$\$ $loadings[,1:2] | ||
+ | plot(load,type="n") # set up plot | ||
+ | text(load, labels=names(fa.model),cex=.7) # add variable names | ||
+ | |||
+ | # factanal(x = dataset, factors = 2, rotation = "varimax") | ||
+ | |||
+ | Uniquenesses: | ||
+ | Top_of_SN_Voxel_Intensity_Ratio 0.02 | ||
+ | Side_of_SN_Voxel_Intensity_Ratio 0.53 | ||
+ | Part_IA 0.57 | ||
+ | Part_IB 0.41 | ||
+ | Part_II 0.39 | ||
+ | Part_III 0.22 | ||
+ | |||
+ | Loadings: | ||
+ | Factor1 Factor2 | ||
+ | Part_IA 0.65 | ||
+ | Part_IB 0.73 | ||
+ | Part_II 0.78 | ||
+ | Part_III 0.82 -0.32 | ||
+ | Top_of_SN_Voxel_Intensity_Ratio 0.99 | ||
+ | Side_of_SN_Voxel_Intensity_Ratio -0.42 0.54 | ||
+ | |||
+ | Factor1 Factor2 | ||
+ | SS loadings 2.41 1.44 | ||
+ | Proportion Var 0.40 0.24 | ||
+ | Cumulative Var 0.40 0.64 | ||
+ | |||
+ | * [[SOCR_EduMaterials_AnalysisActivities_PCA| This SOCR Activity]] demonstrates the utilization of a SOCR analysis package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and demonstrates how to use PCA and read and interpret the outcome. It introduces students to inputting data in the correct format, reading the results of PCA and interpreting the resulting transformed data. | ||
+ | |||
+ | * [http://www.sciencedirect.com/science/article/pii/0169743987800849 This article] presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, is recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants block-scaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself. | ||
+ | |||
+ | * [http://link.springer.com/article/10.1007/BF02294359 This article] introduces the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It relates the order determination of an autoregressive model to the determination of the number of factors in a maximum likelihood factor analysis. The use of the AIC criterion in factor analysis is particularly interesting when it used with a Bayesian model. This observation reveals that AIC can be more widely applied than only to the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads to the handling of the problem of improper solutions by introducing a natural prior distribution of factor loadings. | ||
+ | |||
+ | * [http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vnn086/abstract This article] contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields in recent years. The basic independent component model is a semi-parametric model that assumes that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate the unmixing matrix so the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (i.e., consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so-called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example. | ||
===Software === | ===Software === | ||
* [http://socr.ucla.edu/htmls/ana/ SOCR ANalyses] | * [http://socr.ucla.edu/htmls/ana/ SOCR ANalyses] | ||
− | * [[SOCR_EduMaterials_AnalysisActivities_PCA SOCR PCA Activity]] | + | * [[SOCR_EduMaterials_AnalysisActivities_PCA|SOCR PCA Activity]] |
* [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp package] | * [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/princomp.html R princomp package] | ||
* [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation] | * [http://cran.r-project.org/web/packages/fastICA/ R fastICA package] and [http://cran.at.r-project.org/web/packages/fastICA/fastICA.pdf documentation] | ||
Line 215: | Line 479: | ||
> library(fastICA) | > library(fastICA) | ||
− | # Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex), | + | # Using the SOCR 1981-2006 CPI Data |
− | # save the table in ASCII text file | + | # (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex), |
− | > CPI_Data <- as.matrix(read.table("CPI_Data.dat",header=T)) | + | # save the table in ASCII text file [[SMHS_PCA_ICA_FA#Appendix|CPI_Data.dat]]. |
+ | # Note the "dir" (folder) where you saved the data and reference it below | ||
+ | > CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T)) | ||
> # compare PCA and FA analyses | > # compare PCA and FA analyses | ||
Line 223: | Line 489: | ||
> pc.cor <- princomp(X,cor=T) | > pc.cor <- princomp(X,cor=T) | ||
> summary(pc.cor ) | > summary(pc.cor ) | ||
+ | |||
Importance of components: | Importance of components: | ||
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 | Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 | ||
Line 231: | Line 498: | ||
> ica <- fastICA(X,n.comp=7) | > ica <- fastICA(X,n.comp=7) | ||
> names(ica) | > names(ica) | ||
+ | |||
[1] "X" "K" "W" "A" "S" | [1] "X" "K" "W" "A" "S" | ||
# X: pre-processed data matrix (whitened/sphered data) | # X: pre-processed data matrix (whitened/sphered data) | ||
Line 241: | Line 509: | ||
> windows() | > windows() | ||
> biplot(pc.cor) | > biplot(pc.cor) | ||
− | + | ||
− | > S <- ica$S | + | > S <- ica$\$ $S |
> dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep="")) | > dimnames(S) <- list(dimnames(X)[[1]],paste("Cmp.",1:7,sep="")) | ||
− | > A <- ica$A | + | > A <- ica$\$ $A |
> dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep="")) | > dimnames(A) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep="")) | ||
> windows() | > windows() | ||
Line 250: | Line 518: | ||
> loadings(pc.cor) | > loadings(pc.cor) | ||
− | + | ||
Loadings: | Loadings: | ||
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 | Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 | ||
Line 282: | Line 550: | ||
> dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep="")) | > dimnames(A.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep="")) | ||
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA) | > write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA) | ||
+ | |||
Cmp.1 Cmp.2 Cmp.3 Cmp.4 | Cmp.1 Cmp.2 Cmp.3 Cmp.4 | ||
Electricity | Electricity | ||
Line 295: | Line 564: | ||
> dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep="")) | > dimnames(L.str) <- list(dimnames(X)[[2]],paste("Cmp.",1:7,sep="")) | ||
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T) | > write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T) | ||
+ | |||
Cmp.1 Cmp.2 Cmp.3 Cmp.4 | Cmp.1 Cmp.2 Cmp.3 Cmp.4 | ||
Electricity -0.41 0.51 | Electricity -0.41 0.51 | ||
Line 302: | Line 572: | ||
Orange_Juice -0.32 -0.31 -0.87 | Orange_Juice -0.32 -0.31 -0.87 | ||
Beef -0.42 | Beef -0.42 | ||
− | Gasoline -0.38 0.48 | + | Gasoline -0.38 0.48 |
===References=== | ===References=== |
Latest revision as of 11:39, 24 November 2014
Contents
Scientific Methods for Health Sciences - Dimensionality Reduction: PCA, ICA, FA
Overview
- PCA (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
- ICA (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
- Factor analysis is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.
Motivation
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?
Theory
Fundamentals
First review the rules for matrix and vector multiplication. In an essence, the arithmetic process of multiplying numbers in row $i$ in matrix, or vector, $X^T$ and column $j$ in matrix, or vector, $A$ corresponds to component-wise multiplication followed by adding the terms, summation, to obtain entry $X\times A_{i,j}$ in the product matrix. As an example, $X_{1\times n}^T A_{n\times n}X_{n\times 1} = B_{1\times 1}$ is a scalar value. The transpose notation, $X^T$, indicates that the rows and columns are swapped, e.g., for a standard vector represented as a column, $X=\begin{pmatrix}X_1\\X_2\\...\\X_N \end{pmatrix}$, the transpose is the same vector as a row, $X^T=(X_1, X_2, ..., X_N)$.
Now, suppose we have a sample of $n$ observations as points in a plane: $\{p_1, p_2, …, p_n\}$ in the 2D space, $p=(x_1,x_2)$. How can we account (approximately) for the variation in the sample in as few variables as possible. The figure below illustrates the geometric nature of principal components (PCs).
The $1^{st}$ PC, $Z_1$, is a minimum distance fit to a line in the $X$ space. The $2^{nd}$ PC, $Z_2$, is a minimum distance fit to a line perpendicular to the $1^{st}$ PC (in an (N-1)D hyper-plane). In general, the next PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous PCs.
Given a sample of n observations on a vector of $N$ variables $X=\{X_1, X_2, …, X_N\}$, the $1^{st}$ PC is defined by a linear transformation $Z_1=a_1^T X =\sum_{i=1}^N {a_{i,1}X_i}$, where the vector $a_1=\{a_{1,1}, a_{2,1}, …, a_{N,1} \}$ is chosen to maximize the Variance of $Z_1$.
Similarly, $k^{th}$ PC of the sample is defined by a linear transformation $Z_k=a_k^T X =\sum_{i=1}^N {a_{i,k}X_i}$, where the vector $a_k=\{a_{1,k}, a_{2,k}, …, a_{N,k} \}$ is chosen to maximize the Variance of $Z_k$, subject to 3 conditions:
- Variance of $Z_k$ is maximized, and
- $Cov(Z_k,Z_l)=0$ for all $1 \leq l < k$, and
- $a_k^Ta_k = 1.$
To find the transformation ($a$) we express the $var(Z_1)$ in terms of the variance-covariance matrix of $X$:
- $var(Z_1) = E(Z_1^2) – (E (Z_1))^2 = \sum _{i,j,=1}^N {a_{i,1} a_{j,1} E(x_i x_j)} –
\sum _{i,j,=1}^N { a_{i,1} a_{j,1} E(x_i)E( x_j)} $ $= \sum _{i,j,=1}^N { a_{i,1} a_{j,1} S_{i,j}}$, where $ S_{i,j} = \sigma_{x_i x_j} = E(X_i X_j) – E(X_i)E(X_j)$. Thus,
- $var(Z_1) =a_1^TSa_1$, where $S=(S_{i,j})$ is the variance covariance matrix of $X=\{X_1, …, X_N\}$.
As $a_1$ maximizes $Var(Z_1)$ subject to $||a_1||=a_i^Ta_1=1$, we set up this optimization problem as:
$\max_{a_1} \{a_1^T Sa_1 - \lambda(a_1^Ta_1 -1) \}.$ Differentiating w.r.t. $a_1$ and setting the derivative equal to zero, yields: $Sa_1 - \lambda a_1 = 0$, and thus, $(S-\lambda I_N)a_1=0$. Therefore, $a_1$ is an eigenvector of S, corresponding to the eigenvalue $\lambda=\lambda_1$. Maximizing $Var(Z_1)=a_1^TSa_1= a_1^T \lambda_1 a_1$, yields the largest eigenvalue of $S$, hence the first PC $Z_1$ retains the largest amount of variation in the sample.
Similarly, the second eigenvector $a_2$ maximizes $Var(Z_2)$, subject to:
- $0=Cov(Z_2,Z_1)=a_1^TSa_2=\lambda_1a_1^Ta_2$, and
- $a_2^Ta_2 = 1.$
Thus, for a pair of (Lagrange multipliers) $\lambda_1$ and $\lambda_2$, we need to maximize: $$ a_1^TSa_2 -\lambda_1(a_2^Ta_2-1) - \lambda_2(a_1^Ta_2).$$ Differentiating w.r.t. $a_1$ and setting the derivative equal to zero yields that: $(S-\lambda_2 I_N)a_2=0$, which means that $a_2$ is another eigenvector of $S$ corresponding to the eigenvalue $\lambda_2$. And so on. In general, $Var(Z_k)=a_k^TSa_k=\lambda_k$, where $a_k$ is the $k^{th}$ largest eigenvalue of $S$ and $Z_k$ is the $k^{th}$ PC retaining the $k^{th}$ largest variation in the sample $X$.
In Matrix notation:
- Observations $X=\{X_1, X_2, …, X_N \}$
- PCs: $Z=\{Z_1, Z_2, …, Z_N \}$, with $Z=A^TZ$ and $A=\{a_1, a_2, …, a_N\}$ is an orthogonal $N\times N$ matrix with columns representing the eigenvectors of S. That is, $\Lambda = A^T S A$, where $\Lambda$ is a diagonal matrix $\Lambda=diag(\lambda_1, \lambda_2, …, \lambda_n)$ of eigenvalues.
PCA (principal component analysis)
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.
Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($||w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.
- First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.
- Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.
- Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.
- Singular value decomposition (SVD)
- The SVD of $X=U \Sigma W^T$ where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.
- Computational details of PCA:
- Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (dimensions) as you like;
- Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
- Calculate the covariance matrix between the variables of the normalized dataset;
- Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
- Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
PCA Properties
- For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times q$ matrix, and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
- $y=B'x$ and the $tr(\Sigma_y)$ are minimized by taking $B=A_q^*$ where $A_q^*$ consists of the last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;
- $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tends to become smaller as $k$ increases, whereas the elements of $\alpha_k$ tends to stay about the same size because of the normalization constraint $\alpha_k \alpha_k'=1,$ for $k=1,2,…,p$.
PCA Limitations
- The results of PCA depend on the scaling of the variables.
- The applicability of PCA is limited by certain assumptions made in its derivation.
PCA in R
require(graphics) ## The variances of the variables in the ## USArrests data vary by orders of magnitude, so scaling is appropriate # This data set contains statistics, in arrests per 100,000 residents for assault, # murder, and rape in each of the 50 US states in 1973. Also given is the percent of # the population living in urban areas. help(USArrests) head(USArrests)
pc.cr <- princomp(USArrests) # inappropriate Call: princomp(x = USArrests) Standard deviations: Comp.1 Comp.2 Comp.3 Comp.4 82.890847 14.069560 6.424204 2.45783 princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE) Call: princomp(x = USArrests, cor = TRUE) Standard deviations: Comp.1 Comp.2 Comp.3 Comp.4 1.5748783 0.9948694 0.5971291 0.4164494 ## Similar, but different: ## The standard deviations differ by a factor of sqrt(49/50) summary(pc.cr <- princomp(USArrests, cor = TRUE)) Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938 Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752 Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000 loadings(pc.cr) # note that blank entries are small but not zero ## The signs of the columns are arbitrary Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Murder -0.536 0.418 -0.341 0.649 Assault -0.583 0.188 -0.268 -0.743 UrbanPop -0.278 -0.873 -0.378 0.134 Rape -0.543 -0.167 0.818 Comp.1 Comp.2 Comp.3 Comp.4 SS loadings 1.00 1.00 1.00 1.00 Proportion Var 0.25 0.25 0.25 0.25 Cumulative Var 0.25 0.50 0.75 1.00 plot(pc.cr) # shows a screeplot. # The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal # components respectively. biplot(pc.cr) ## shows the plot of PCA in a different format
From the chart above, we can see the distribution of the variance attributed to different variables in the four principal components.
PCA using SOCR Analyses
This SOCR activity illustrates the use of PCA.
ICA (independent component analysis)
ICA is a computational method that separates a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
- ICA Assumptions:
- The source signals are independent of each other.
- The distributions of the values in each source signal are non-Gaussian.
- Independence: The sources of signals are assumed to be independent, but their signal mixtures are not independent because they share the same source signals;
- Normality: Based on CLT, the distribution of the sum of independent random variables approximates a Gaussian distribution;
- Complexity: The temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
- ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA cannot identify the actual number of source signals nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$, and the components are denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $w$, $s=Wx$, into maximally independent components $s$ measured by some functions of independence. There are alternative models for ICA:
- Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
- Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
- Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
- Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse.
- ICA in R using package fastICA. This example demonstrates how to separate two mixed independent uniform variables:
library(fastICA) S <- matrix(runif(10000), 5000, 2) A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE) X <- S %*% A # In R, "*" and "%*%" indicate "scalar" and matrix multiplication, respectively! a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1, method = "C", row.norm = FALSE, maxit = 200, tol = 0.0001, verbose = TRUE) par(mfrow = c(1, 2)) plot(a$\$ $X, main = "Pre-processed data") plot(a$\$ $S, main = "ICA components")
- Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5)) # cbind combines objects by rows and columns. #It takes a sequence of vector, matrix or data frames arguments and combines them by columns or rows, respectively. A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2) X <- S %*% A a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1, method = "R", row.norm = FALSE, maxit = 200, tol = 0.0001, verbose = TRUE) par(mfcol = c(2, 3)) plot(1:1000, S[,1 ], type = "l", main = "Original Signals", xlab = "", ylab = "") plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "") plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals", xlab = "", ylab = "") plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "") plot(1:1000, a$\$ $S[,1 ], type = "l", main = "ICA source estimates", xlab = "", ylab = "") plot(1:1000, a$\$ $S[, 2], type = "l", xlab = "", ylab = "")
Factor analysis (FA)
FA is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$ and $j \in \{1,…,k\}$ where $k<p$. We have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, we have $x-μ = LF+ε$, with $n$ observations; $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.
- Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using the letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in the above example. "Factor" indices will be indicated using the letters $p$, $q$ and $r$, with values running from $1$ to $N_p$ which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In the example above, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question are given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ is a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and sample variance $σ_a^2=\frac{1}{N_a} \sum_i {(x_{ai}-μ_a)^2}$. The factor analysis model is expressed by:
- $$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\ \vdots & & \vdots & & \vdots & & \vdots \\ z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i} \end{matrix}$$
In matrix form ($Z=LF+ϵ$), this model can be expressed as: $$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$ where $F_{1,i}$ is the $i^{th}$ student’s verbal intelligence, $F_{2,i}$ is the $i^{th}$ student’s mathematical intelligence; $l_{ap}$ are the factor loadings for the $a^{th}$ subject for $p=1,2$.
FA in R: using factanal()
# Maximum Likelihood Factor Analysis # entering raw data and extracting 3 factors, # with varimax rotation mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv") # mmreg.csv includes 600 observations and 8 variables. # The psychological variables are locus_of_control, self_concept and motivation. # The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science). # Additionally, the variable female is a zero-one indicator variable with the one indicating a female student. # We can get some basic descriptions of the entire data set by using summary. # To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset. summary(mydata) sapply(mydata, sd) fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix. # Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax". print(fit, digits=2, cutoff=.3, sort=TRUE) # plot factor 1 by factor 2 load <- fit$\$ $loadings[,1:2] plot(load,type="n") # set up plot text(load,labels=names(mydata),cex=.7) # add variable names # Principal Axis Factor Analysis library(psych) fit <- factor.pa(mydata, nfactors=3, rotation="varimax") fit # print results # Determine Number of Factors to Extract library(nFactors) ev <- eigen(cor(mydata)) # get eigenvalues ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05) nS <- nScree(x=ev$\$ $values, aparallel=ap$\$ $eigen$\$ $qevpea) plotnScree(nS)
PCA, ICA, FA: Similarities and Differences
- PCA is closely related to factor analysis. The latter typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data.
- PCA is a variance-focused approach that seeks to reproduce the variable's total variance. In PCA, the components reflect both common and unique variance of the variable. It is generally preferred for the purposes of data reduction (i.e., translating variable space into optimal factor space) but not for detecting latent constructs or factors. Factor analysis is similar to principal component analysis in that factor analysis also involves linear combinations of variables.
- In contrast to PCA, factor analysis is a correlation-focused approach seeking to reproduce the correlations among variables. The factors represent the common variance of variables and exclude unique variance. Factor analysis is generally used when the purpose of the research is detection of underlying structure in the data structure (i.e., latent constructs or factors) or causal modeling.
Applications
Parkinson's disease case study
The table below includes data for N=33 Parkinson's disease (PD) cases (3001, …, 3808) from the Parkinson’s Progression Markers Initiative (PPMI) database. We are interested in examining the relationships between imaging biomarkers, derived from the red nucleus in the Substantia Nigra (SN), and MDS-UPDRS assessment measures for Parkinson's Diseases patients, Part I.A, Part I.B, Part II and Part III scores.
- Hypothesis: The a priori hypothesis is that there is a strong inverse correlation between the imaging biomarkers (voxel intensity ratios) and the MDS-UPDRS scores. In other words, an increase of the (average) voxel intensities (i.e., more intact nigrosome-1) would correspond with a decrease in UPDRS scores (i.e., less PD symptoms), and vice-versa, lower voxel intensity ratios would correspond with higher UPDRS scores.
- Data:
- The derived imaging biomarkers include:
- Top of SN Voxel Intensity Ratio obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1 region. Five random points from the ventral medial portion of the substantia nigra within the slice of interest are averaged. The mean for the nigrosome-1 was divided over the mean for the other ROI’s to generate the Top of SN Voxel Intensity ratios bopth on the left and right side.
- Side of SN Voxel Intensity Ratio obtained by averaging five random points, taken from the left side within the axial slice, from the nigrosome-1. Again, five random points from the outer swallow tail of the substantia nigra within the slice of interest are averaged on the left and right side
- The Movement Disorder Society (MDS)-Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): includes Part I (Non-motor Experiences of Daily Living); Part II (Motor Experiences of Daily Living); Part III (Motor Examination); and Part IV (Motor Complications) which was excluded as it was only conducted for PPMI subjects who had started PD medication.
Patient_ID | Top_of_SN_Voxel_Intensity_Ratio | Side_of_SN_Voxel_Intensity_Ratio | Part_IA | Part_IB | Part_II | Part_III |
---|---|---|---|---|---|---|
3001 | 1.188075152 | 0.940778688 | 0 | 8 | 2 | 12 |
3002 | 1.621806155 | 1.123298088 | 3 | 5 | 15 | 17 |
3003 | 1.058309086 | 1.015158507 | 1 | 11 | 6 | 29 |
3004 | 2.027838499 | 1.31348223 | 0 | 2 | 0 | 2 |
3006 | 1.754682932 | 1.064932943 | 1 | 2 | 4 | 22 |
3008 | 1.399353481 | 1.15022502 | 1 | 7 | 2 | 10 |
3010 | 1.327449894 | 0.930592788 | 2 | 9 | 11 | 19 |
3011 | 1.765207347 | 1.241295983 | 1 | 0 | 0 | 0 |
3012 | 1.565801729 | 1.329574405 | 4 | 7 | 6 | 20 |
3014 | 1.231222196 | 1.038040254 | 0 | 8 | 3 | 36 |
3016 | 1.560849827 | 1.305709046 | 0 | 2 | 0 | 1 |
3018 | 1.2011895 | 0.963859444 | 2 | 5 | 5 | 19 |
3020 | 1.693136427 | 1.010054879 | 2 | 13 | 10 | 20 |
3021 | 1.334255775 | 1.027113745 | 1 | 7 | 5 | 12 |
3023 | 1.221884984 | 0.949202296 | 4 | 5 | 4 | 30 |
3024 | 1.406800428 | 0.995810978 | 0 | 3 | 9 | 12 |
3029 | 1.74564323 | 1.22107546 | 0 | 1 | 2 | 3 |
3188 | 1.795850966 | 1.381062457 | 0 | 5 | 0 | 0 |
3551 | 1.811029249 | 1.160219407 | 0 | 2 | 0 | 0 |
3554 | 1.611463658 | 1.184391119 | 3 | 4 | 1 | 15 |
3361 | 2.149437904 | 1.197802706 | 0 | 1 | 2 | 9 |
3161 | 1.801208214 | 1.061680928 | 0 | 6 | 3 | 16 |
3767 | 1.484570315 | 1.208752123 | 3 | 4 | 6 | 21 |
3305 | 1.415250395 | 1.116697914 | 0 | 1 | 1 | 12 |
3307 | 1.79439256 | 1.111008744 | 1 | 8 | 17 | 26 |
3308 | 1.398475885 | 0.985062613 | 1 | 6 | 1 | 6 |
3309 | 1.151021722 | 0.963016059 | 3 | 6 | 8 | 29 |
3311 | 1.298993676 | 1.102174959 | 0 | 1 | 0 | 0 |
3314 | 1.371292596 | 1.214219683 | 1 | 2 | 0 | 0 |
3322 | 1.437867168 | 0.977680828 | 0 | 0 | 0 | 2 |
3352 | 1.84019943 | 0.960479831 | 6 | 13 | 10 | 29 |
3586 | 1.722158512 | 1.157249933 | 0 | 0 | 0 | 0 |
3808 | 1.355532247 | 1.111517624 | 1 | 8 | 2 | 13 |
- R-script analyses of the PPMI data:
dataset <- read.csv('C:\\Users\\Desktop\\PPMI_data.csv', header = TRUE) # Patient_ID, Top_of_SN_Voxel_Intensity_Ratio, Side_of_SN_Voxel_Intensity_Ratio # Part_IA, Part_IB, Part_II, Part_III # remove the first columns (Patient ID number) dataset <- dataset[,-1] attach(dataset)
# form the multivariate response variable, Y Y <- cbind(Part_IA, Part_IB, Part_II, Part_III) # form the predictors vector, X X <- cbind(Top_of_SN_Voxel_Intensity_Ratio,Side_of_SN_Voxel_Intensity_Ratio) summary(dataset) head(dataset)
# some diagnostic plots par(mfrow=c(2,3)) plot(sort(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio)) hist(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio) plot(density(dataset$\$ $Top_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE)) plot(sort(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio)) hist(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio) plot(density(dataset$\$ $Side_of_SN_Voxel_Intensity_Ratio,na.rm=FALSE))
# Try MANOVA first to explore predictive value of imaging biomarkers on clinical PD scores # MMLR Analysis # mmlr.model <- lm(Y ~ X) # Fit the multivariate model # summary(manova(mmlr.model)) # summary (manova(mmlr.model), test='Wilks') # manova(mmlr.model) # mmlr.model
# Fit linear regression models Fit_IA <- lm(Part_IA ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio) Fit_IA summary(Fit_IA) Fit_IB <- lm(Part_IB ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio) Fit_IB summary(Fit_IB) Fit_III <- lm(Part_III ~ Top_of_SN_Voxel_Intensity_Ratio + Side_of_SN_Voxel_Intensity_Ratio) Fit_III summary(Fit_III)
par(mar = rep(3, 4)) plot(Fit_IA); # text(0.5,0.5,"First title",cex=2,font=2) plot(Fit_IB); # text(0.5,0.5,"Second title",cex=2,font=2) plot(Fit_III); # text(0.5,0.5,"Third title",cex=2,font=2)
# PCA/FA Analyses par(mfrow=c(1,1)) pca.model <- princomp(dataset, cor=TRUE) summary(pca.model) # print variance accounted for loadings(pca.model) # pc loadings (i.e., igenvector columns) Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Top_of_SN_Voxel_Intensity_Ratio -0.256 -0.713 -0.373 -0.105 0.477 -0.221 Side_of_SN_Voxel_Intensity_Ratio -0.386 -0.472 0.357 0.433 -0.558 Part_IA 0.383 -0.373 0.710 -0.320 0.238 0.227 Part_IB 0.460 -0.112 0.794 0.292 0.226 Part_II 0.425 -0.342 -0.464 -0.262 -0.534 0.365 Part_III 0.498 -0.183 -0.844 plot(pca.model, type="lines") # scree plot pca.model$\$ $scores # the principal components par(mfrow=c(1,1)) biplot(pca.model)
library("FactoMineR", lib.loc="~/R/win-library/3.1") pca.model <- PCA(dataset) pca.model
# Factor Analysis #install.packages("factanal") fa.model <- factanal(dataset, 2, rotation="varimax") print(fa.model, digits=2, cutoff=.3, sort=TRUE) # plot factor 1 by factor 2 load <- fa.model$\$ $loadings[,1:2] plot(load,type="n") # set up plot text(load, labels=names(fa.model),cex=.7) # add variable names # factanal(x = dataset, factors = 2, rotation = "varimax") Uniquenesses: Top_of_SN_Voxel_Intensity_Ratio 0.02 Side_of_SN_Voxel_Intensity_Ratio 0.53 Part_IA 0.57 Part_IB 0.41 Part_II 0.39 Part_III 0.22 Loadings: Factor1 Factor2 Part_IA 0.65 Part_IB 0.73 Part_II 0.78 Part_III 0.82 -0.32 Top_of_SN_Voxel_Intensity_Ratio 0.99 Side_of_SN_Voxel_Intensity_Ratio -0.42 0.54 Factor1 Factor2 SS loadings 2.41 1.44 Proportion Var 0.40 0.24 Cumulative Var 0.40 0.64
- This SOCR Activity demonstrates the utilization of a SOCR analysis package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and demonstrates how to use PCA and read and interpret the outcome. It introduces students to inputting data in the correct format, reading the results of PCA and interpreting the resulting transformed data.
- This article presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, is recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants block-scaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.
- This article introduces the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It relates the order determination of an autoregressive model to the determination of the number of factors in a maximum likelihood factor analysis. The use of the AIC criterion in factor analysis is particularly interesting when it used with a Bayesian model. This observation reveals that AIC can be more widely applied than only to the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads to the handling of the problem of improper solutions by introducing a natural prior distribution of factor loadings.
- This article contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields in recent years. The basic independent component model is a semi-parametric model that assumes that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate the unmixing matrix so the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (i.e., consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so-called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.
Software
Problems
R Example 1
# Install package ‘fastICA’ > library(fastICA) # Using the SOCR 1981-2006 CPI Data # (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex), # save the table in ASCII text file CPI_Data.dat. # Note the "dir" (folder) where you saved the data and reference it below > CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T)) > # compare PCA and FA analyses > X <- CPI_Data[,-1] > pc.cor <- princomp(X,cor=T) > summary(pc.cor )
Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770 Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895 Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000 > ica <- fastICA(X,n.comp=7) > names(ica)
[1] "X" "K" "W" "A" "S" # X: pre-processed data matrix (whitened/sphered data) # K: pre-whitening matrix that projects data onto the first n.comp # principal components. # W: estimated un-mixing matrix (XW = S) # A: estimated mixing matrix (X = SA) # S: estimated source matrix (factor scores, $\Theta$ in the notes) > windows() > biplot(pc.cor) > S <- ica$\$ $S > dimnames(S) <- list(dimnames(X)1,paste("Cmp.",1:7,sep="")) > A <- ica$\$ $A > dimnames(A) <- list(dimnames(X)2,paste("Cmp.",1:7,sep="")) > windows() > biplot(S[,1:2],A[,1:2]) > loadings(pc.cor)
Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576 Fuel_Oil -0.351 0.547 -0.198 0.312 Bananas -0.373 -0.415 0.258 0.365 0.393 0.578 Tomatoes -0.369 -0.320 0.357 -0.738 -0.294 Orange_Juice -0.324 -0.311 -0.871 -0.119 Beef -0.424 0.220 -0.216 0.721 -0.449 Gasoline -0.380 0.479 -0.216 0.161 Comp.7 Electricity -0.131 Fuel_Oil -0.657 Bananas Tomatoes Orange_Juice 0.108 Beef Gasoline 0.733 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 Comp.7 SS loadings 1.000 Proportion Var 0.143 Cumulative Var 1.000 > field <- function(x) { substr(paste(x," ",sep=""),1,6) } > A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2))) > dimnames(A.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep="")) > write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
Cmp.1 Cmp.2 Cmp.3 Cmp.4 Electricity Fuel_Oil Bananas -2.59 Tomatoes -2.9 Orange_Juice -2.66 Beef Gasoline > L <- pc.cor$loadings > L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2))) > dimnames(L.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep="")) > write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
Cmp.1 Cmp.2 Cmp.3 Cmp.4 Electricity -0.41 0.51 Fuel_Oil -0.35 0.55 Bananas -0.37 -0.41 0.37 Tomatoes -0.37 -0.32 0.36 -0.74 Orange_Juice -0.32 -0.31 -0.87 Beef -0.42 Gasoline -0.38 0.48
References
- A tutorial on Principal Components Analysis
- Sampling / Steven K. Thompson
- Sampling theory and methods / S. Sampath
- Factor Analysis details
Appendix
Year | Electricity | Fuel_Oil | Bananas | Tomatoes | Orange_Juice | Beef | Gasoline |
---|---|---|---|---|---|---|---|
1981 | 31.552 | 1.15 | 0.343 | 0.792 | 1.141 | 1.856 | 1.269 |
1982 | 36.006 | 1.254 | 0.364 | 0.763 | 1.465 | 1.794 | 1.341 |
1983 | 37.184 | 1.194 | 0.332 | 0.726 | 1.418 | 1.756 | 1.214 |
1984 | 38.6 | 1.122 | 0.344 | 0.854 | 1.408 | 1.721 | 1.2 |
1985 | 38.975 | 1.078 | 0.35 | 0.697 | 1.685 | 1.711 | 1.145 |
1986 | 40.223 | 1.126 | 0.337 | 1.104 | 1.756 | 1.662 | 1.19 |
1987 | 40.022 | 0.817 | 0.374 | 0.871 | 1.512 | 1.694 | 0.868 |
1988 | 40.195 | 0.89 | 0.394 | 0.797 | 1.638 | 1.736 | 0.947 |
1989 | 40.828 | 0.883 | 0.429 | 1.735 | 1.868 | 1.806 | 0.944 |
1990 | 41.663 | 1.259 | 0.438 | 0.912 | 1.817 | 1.907 | 1.09 |
1991 | 43.226 | 1.235 | 0.428 | 0.936 | 2.005 | 1.996 | 1.304 |
1992 | 44.501 | 0.985 | 0.426 | 1.141 | 1.879 | 1.926 | 1.135 |
1993 | 46.959 | 0.969 | 0.44 | 1.604 | 1.677 | 1.97 | 1.182 |
1994 | 48.2 | 0.919 | 0.503 | 1.323 | 1.674 | 1.892 | 1.109 |
1995 | 48.874 | 0.913 | 0.463 | 1.103 | 1.583 | 1.847 | 1.19 |
1996 | 48.538 | 1.007 | 0.497 | 1.213 | 1.577 | 1.799 | 1.186 |
1997 | 49.245 | 1.136 | 0.473 | 1.452 | 1.737 | 1.85 | 1.318 |
1998 | 46.401 | 0.966 | 0.489 | 1.904 | 1.601 | 1.818 | 1.186 |
1999 | 45.061 | 0.834 | 0.49 | 1.443 | 1.753 | 1.834 | 1.031 |
2000 | 45.207 | 1.189 | 0.5 | 1.414 | 1.823 | 1.903 | 1.356 |
2001 | 47.472 | 1.509 | 0.509 | 1.451 | 1.863 | 2.037 | 1.525 |
2002 | 47.868 | 1.123 | 0.526 | 1.711 | 1.876 | 2.152 | 1.209 |
2003 | 47.663 | 1.396 | 0.512 | 1.472 | 1.848 | 2.131 | 1.557 |
2004 | 49.159 | 1.508 | 0.485 | 1.66 | 1.957 | 2.585 | 1.635 |
2005 | 50.847 | 1.859 | 0.49 | 2.162 | 1.872 | 2.478 | 1.866 |
2006 | 57.223 | 2.418 | 0.505 | 1.621 | 1.853 | 2.607 | 2.359 |
- SOCR Home page: http://www.socr.umich.edu
Translate this page: