Difference between revisions of "SMHS PCA ICA FA"
(→Motivation) |
(→PCA (principal component analysis)) |
||
Line 11: | Line 11: | ||
===Theory=== | ===Theory=== | ||
====PCA (principal component analysis)==== | ====PCA (principal component analysis)==== | ||
− | PCA is a statistical technique | + | PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets. |
− | Consider a data matrix $X_{n\times p}$ with column-wise zero | + | Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($\\w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$. |
* First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$. | * First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$. | ||
Line 19: | Line 19: | ||
* Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$. | * Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$. | ||
− | * Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional | + | * Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible. |
* Singular value decomposition (SVD) | * Singular value decomposition (SVD) | ||
− | : The SVD of $X=U \Sigma W^T$ where $\Sigma$ is | + | : The SVD of $X=U \Sigma W^T$ where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value. |
* Computational details of PCA: | * Computational details of PCA: | ||
− | ** Begin with a dataset | + | ** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (dimensions) as you like; |
− | ** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation | + | ** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$ |
** Calculate the covariance matrix between the variables of the normalized dataset; | ** Calculate the covariance matrix between the variables of the normalized dataset; | ||
− | ** Calculate the | + | ** Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1); |
− | **(5) Choose the most significant principal component, which is simply the | + | **(5) Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue. |
====PCA Properties==== | ====PCA Properties==== |
Revision as of 10:53, 1 September 2014
Contents
Scientific Methods for Health Sciences - Dimensionality Reduction: PCA, ICA, FA
Overview
- PCA (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
- ICA (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
- Factor analysis is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.
Motivation
Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?
Theory
PCA (principal component analysis)
PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.
Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($\\w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.
- First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.
- Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.
- Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.
- Singular value decomposition (SVD)
- The SVD of $X=U \Sigma W^T$ where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.
- Computational details of PCA:
- Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (dimensions) as you like;
- Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
- Calculate the covariance matrix between the variables of the normalized dataset;
- Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
- (5) Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
PCA Properties
- For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times q$ matrix and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
- $y=B'x$, and the $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$ where $A_q^*$ consists of last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;
- $\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tends to become smaller as SkS increases, whereas the elements of $\lambda_k$ tends to stay about the same size because $\alpha_k \alpha_k'=1,$ for $k=1,2,…,p$.
PCA Limitations
- The results of PCA depend on the scaling of the variables; (
- The applicability of PCA is limited by certain assumptions made in its derivation.
PCA in R
require(graphics) ## The variances of the variables in the ## USArrests data vary by orders of magnitude, so scaling is appropriate (pc.cr <- princomp(USArrests)) # inappropriate Call: princomp(x = USArrests) Standard deviations: Comp.1 Comp.2 Comp.3 Comp.4 82.890847 14.069560 6.424204 2.45783 princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE) Call: princomp(x = USArrests, cor = TRUE) Standard deviations: Comp.1 Comp.2 Comp.3 Comp.4 1.5748783 0.9948694 0.5971291 0.4164494 ## Similar, but different: ## The standard deviations differ by a factor of sqrt(49/50) summary(pc.cr <- princomp(USArrests, cor = TRUE)) Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938 Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752 Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000 loadings(pc.cr) # note that blank entries are small but not zero ## The signs of the columns are arbitrary Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Murder -0.536 0.418 -0.341 0.649 Assault -0.583 0.188 -0.268 -0.743 UrbanPop -0.278 -0.873 -0.378 0.134 Rape -0.543 -0.167 0.818 Comp.1 Comp.2 Comp.3 Comp.4 SS loadings 1.00 1.00 1.00 1.00 Proportion Var 0.25 0.25 0.25 0.25 Cumulative Var 0.25 0.50 0.75 1.00 plot(pc.cr) # shows a screeplot. # The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal # components respectively. biplot(pc.cr) ## shows the plot of PCA in a different format
From the chart above, we can see the distribution variance attributed to different variables on the four principal components.
PCA using SOCR Analyses
This SOCR Activity illustrates the use of PCA.
ICA (independent component analysis)
ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
- ICA Assumptions:
- The source signals are independent of each other;
- The distribution of the values in each source signals are non-Gaussian.
- Independence: the source of signals are assumed to be independent but their signal mixture are not independent because they share the same source signals;
- Normality: based on CLT, the distribution of a sum of independent random variables approximate a Gaussian distribution;
- Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.
- ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA can’t identify the actual number of source signals nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $w$, $s=Wx$, into maximally independent components $s$ measured by some functions of independence. There are alternative models for ICA:
- Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
- Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
- Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
- Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse.
- ICA in R using package fastICA. This example shows un-mixing two mixed independent uniforms:
library(fastICA) S <- matrix(runif(10000), 5000, 2) A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE) X <- S %*% A # In R, "*" and "%*%" indicate "scalar" and matrix multiplication, respectively! a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1, method = "C", row.norm = FALSE, maxit = 200, tol = 0.0001, verbose = TRUE) par(mfrow = c(1, 2)) plot(a$\$ $X, main = "Pre-processed data") plot(a$\$ $S, main = "ICA components")
- Another example of un-mixing two independent signals is shown below:
S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5)) # cbind combines objects by rows and columns. It takes a sequence of vector, matrix or data frames arguments and combines them by columns or rows, respectively. A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2) X <- S %*% A a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1, method = "R", row.norm = FALSE, maxit = 200, tol = 0.0001, verbose = TRUE) par(mfcol = c(2, 3)) plot(1:1000, S[,1 ], type = "l", main = "Original Signals", xlab = "", ylab = "") plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "") plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals", xlab = "", ylab = "") plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "") plot(1:1000, a$\$ $S[,1 ], type = "l", main = "ICA source estimates", xlab = "", ylab = "") plot(1:1000, a$\$ $S[, 2], type = "l", xlab = "", ylab = "")
Factor analysis (FA)
FA is a statistical method, which describes variability among observed correlated variables in terms of potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$ and $j \in \{1,…,k\}$ where $k<p$. We have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, we have $x-μ = LF+ε$, with $n$ observations, we have $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.
- Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in the above example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$ which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In the example above, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question are given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ is a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and sample variance $σ_a^2=\frac{1}{N_a} \sum_i {(x_{ai}-μ_a)^2}$. The factor analysis model is expressed by:
- $$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\ \vdots & & \vdots & & \vdots & & \vdots \\ z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i} \end{matrix}$$
In matrix form ($Z=LF+ϵ$), this model can be expressed as: $$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$ where $F_{1,i}$ is the $i^{th}$ student’s verbal intelligence, $F_{2,i}$ is the $i^{th}$ student’s mathematical intelligence; $l_{ap}$ are the factor loadings for the $a^{th}$ subject for $p=1,2$.
FA in R: using factanal()
# Maximum Likelihood Factor Analysis # entering raw data and extracting 3 factors, # with varimax rotation mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv") # mmreg.csv includes 600 observations and 8 variables. # The psychological variables are locus_of_control, self_concept and motivation. # The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science). # Additionally, the variable female is a zero-one indicator variable with the one indicating a female student. # We can get some basic descriptions of the entire data set by using summary. # To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset. summary(mydata) sapply(mydata, sd) fit <- factanal(mydata, 3, rotation="varimax") # mydata can be a raw data matrix or a covariance matrix. # Pairwise deletion of missing data is used. Rotation can be "varimax" or "promax". print(fit, digits=2, cutoff=.3, sort=TRUE) # plot factor 1 by factor 2 load <- fit$\$ $loadings[,1:2] plot(load,type="n") # set up plot text(load,labels=names(mydata),cex=.7) # add variable names # Principal Axis Factor Analysis library(psych) fit <- factor.pa(mydata, nfactors=3, rotation="varimax") fit # print results # Determine Number of Factors to Extract library(nFactors) ev <- eigen(cor(mydata)) # get eigenvalues ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05) nS <- nScree(x=ev$\$ $values, aparallel=ap$\$ $eigen$\$ $qevpea) plotnScree(nS)
PCA, ICA, FA: Similarities and Differences
- PCA is closely related to factor analysis. The later typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data.
- PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when detect the latent construct or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.
- Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors “represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.
Applications
- This SOCR Activity demonstrated the utilization of SOCR analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and shows how to use PCA and how to read and interpret the outcome. It introduced students to input data in the correct format, read the results of PCA and make interpretation of the resulting transformed data.
- This article presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.
- This article introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.
- This article contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.
Software
- SOCR ANalyses
- SOCR_EduMaterials_AnalysisActivities_PCA SOCR PCA Activity
- R princomp package
- R fastICA package and documentation
Problems
R Example 1
# Install package ‘fastICA’ > library(fastICA) # Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex), # save the table in ASCII text file CPI_Data.dat. Note the "dir" (folder) where you saved the data and reference it below > CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T)) > # compare PCA and FA analyses > X <- CPI_Data[,-1] > pc.cor <- princomp(X,cor=T) > summary(pc.cor ) Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Standard deviation 2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770 Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895 Cumulative Proportion 0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000 > ica <- fastICA(X,n.comp=7) > names(ica) [1] "X" "K" "W" "A" "S" # X: pre-processed data matrix (whitened/sphered data) # K: pre-whitening matrix that projects data onto the first n.comp # principal components. # W: estimated un-mixing matrix (XW = S) # A: estimated mixing matrix (X = SA) # S: estimated source matrix (factor scores, $\Theta$ in the notes) > windows() > biplot(pc.cor) > S <- ica$\$ $S > dimnames(S) <- list(dimnames(X)1,paste("Cmp.",1:7,sep="")) > A <- ica$\$ $A > dimnames(A) <- list(dimnames(X)2,paste("Cmp.",1:7,sep="")) > windows() > biplot(S[,1:2],A[,1:2]) > loadings(pc.cor) Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Electricity -0.415 -0.227 0.164 0.512 -0.373 -0.576 Fuel_Oil -0.351 0.547 -0.198 0.312 Bananas -0.373 -0.415 0.258 0.365 0.393 0.578 Tomatoes -0.369 -0.320 0.357 -0.738 -0.294 Orange_Juice -0.324 -0.311 -0.871 -0.119 Beef -0.424 0.220 -0.216 0.721 -0.449 Gasoline -0.380 0.479 -0.216 0.161 Comp.7 Electricity -0.131 Fuel_Oil -0.657 Bananas Tomatoes Orange_Juice 0.108 Beef Gasoline 0.733 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 Comp.7 SS loadings 1.000 Proportion Var 0.143 Cumulative Var 1.000 > field <- function(x) { substr(paste(x," ",sep=""),1,6) } > A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2))) > dimnames(A.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep="")) > write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA) Cmp.1 Cmp.2 Cmp.3 Cmp.4 Electricity Fuel_Oil Bananas -2.59 Tomatoes -2.9 Orange_Juice -2.66 Beef Gasoline > L <- pc.cor$loadings > L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2))) > dimnames(L.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep="")) > write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T) Cmp.1 Cmp.2 Cmp.3 Cmp.4 Electricity -0.41 0.51 Fuel_Oil -0.35 0.55 Bananas -0.37 -0.41 0.37 Tomatoes -0.37 -0.32 0.36 -0.74 Orange_Juice -0.32 -0.31 -0.87 Beef -0.42 Gasoline -0.38 0.48
References
- A tutorial on Principal Components Analysis
- Sampling / Steven K. Thompson
- Sampling theory and methods / S. Sampath
- Factor Analysis details
Appendix
Year | Electricity | Fuel_Oil | Bananas | Tomatoes | Orange_Juice | Beef | Gasoline |
---|---|---|---|---|---|---|---|
1981 | 31.552 | 1.15 | 0.343 | 0.792 | 1.141 | 1.856 | 1.269 |
1982 | 36.006 | 1.254 | 0.364 | 0.763 | 1.465 | 1.794 | 1.341 |
1983 | 37.184 | 1.194 | 0.332 | 0.726 | 1.418 | 1.756 | 1.214 |
1984 | 38.6 | 1.122 | 0.344 | 0.854 | 1.408 | 1.721 | 1.2 |
1985 | 38.975 | 1.078 | 0.35 | 0.697 | 1.685 | 1.711 | 1.145 |
1986 | 40.223 | 1.126 | 0.337 | 1.104 | 1.756 | 1.662 | 1.19 |
1987 | 40.022 | 0.817 | 0.374 | 0.871 | 1.512 | 1.694 | 0.868 |
1988 | 40.195 | 0.89 | 0.394 | 0.797 | 1.638 | 1.736 | 0.947 |
1989 | 40.828 | 0.883 | 0.429 | 1.735 | 1.868 | 1.806 | 0.944 |
1990 | 41.663 | 1.259 | 0.438 | 0.912 | 1.817 | 1.907 | 1.09 |
1991 | 43.226 | 1.235 | 0.428 | 0.936 | 2.005 | 1.996 | 1.304 |
1992 | 44.501 | 0.985 | 0.426 | 1.141 | 1.879 | 1.926 | 1.135 |
1993 | 46.959 | 0.969 | 0.44 | 1.604 | 1.677 | 1.97 | 1.182 |
1994 | 48.2 | 0.919 | 0.503 | 1.323 | 1.674 | 1.892 | 1.109 |
1995 | 48.874 | 0.913 | 0.463 | 1.103 | 1.583 | 1.847 | 1.19 |
1996 | 48.538 | 1.007 | 0.497 | 1.213 | 1.577 | 1.799 | 1.186 |
1997 | 49.245 | 1.136 | 0.473 | 1.452 | 1.737 | 1.85 | 1.318 |
1998 | 46.401 | 0.966 | 0.489 | 1.904 | 1.601 | 1.818 | 1.186 |
1999 | 45.061 | 0.834 | 0.49 | 1.443 | 1.753 | 1.834 | 1.031 |
2000 | 45.207 | 1.189 | 0.5 | 1.414 | 1.823 | 1.903 | 1.356 |
2001 | 47.472 | 1.509 | 0.509 | 1.451 | 1.863 | 2.037 | 1.525 |
2002 | 47.868 | 1.123 | 0.526 | 1.711 | 1.876 | 2.152 | 1.209 |
2003 | 47.663 | 1.396 | 0.512 | 1.472 | 1.848 | 2.131 | 1.557 |
2004 | 49.159 | 1.508 | 0.485 | 1.66 | 1.957 | 2.585 | 1.635 |
2005 | 50.847 | 1.859 | 0.49 | 2.162 | 1.872 | 2.478 | 1.866 |
2006 | 57.223 | 2.418 | 0.505 | 1.621 | 1.853 | 2.607 | 2.359 |
- SOCR Home page: http://www.socr.umich.edu
Translate this page: