Difference between revisions of "SMHS PCA ICA FA"

Revision as of 11:53, 1 September 2014

Scientific Methods for Health Sciences - Dimensionality Reduction: PCA, ICA, FA

Overview

PCA (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation.
ICA (independent component analysis) is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.
Factor analysis is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables. It is related to PCA but they are not identical. In this section, we are going to introduce these three commonly used statistical tools and illustrate their application in various studies with examples and R code samples.

Motivation

Suppose we have a set of observable correlated random variables, and we want to reduce the dimensionality of the data into a reasonable new set. How can we achieve this dimensionality reduction? Principal component analysis, independent component analysis and factor analysis may be the answers here. How does each of them work? What are the differences among those statistical methods; what are their strengths and weaknesses? How can we decide on the best method for a specific dataset?

Theory

PCA (principal component analysis)

PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.

Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($\\w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.

First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.

Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.

Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.

Singular value decomposition (SVD)

The SVD of $X=U \Sigma W^T$ where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.

Computational details of PCA:
- Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (dimensions) as you like;
- Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$ be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
- Calculate the covariance matrix between the variables of the normalized dataset;
- Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
- (5) Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.

PCA Properties

For any integer $q$, $1 \leq q \leq p$, consider the orthogonal linear transformation $y=B'x$, where $y$ is a $q$-element vector and $B'$ is a $q\times q$ matrix and let $\Sigma_y =B'\Sigma B$ be the variance-covariance matrix for $y$. Then the trace of $\Sigma_y$, $tr(\Sigma_y)$, is maximized by taking $B=A_q$, where $A_q$ consists of the first $q$ columns of $A$;
$y=B'x$, and the $tr(\Sigma_y)$ is minimized by taking $B=A_q^*$ where $A_q^*$ consists of last $q$ columns of $A$. The last few principal components are not simply unstructured left-overs after removing the important ones;
$\Sigma = \lambda_1 \alpha_1 \alpha_1'+\lambda_2 \alpha_2 \alpha_2'+⋯+\lambda_p \alpha_p \alpha_p'$. Given that $var(x_j)=\sum_{k=1}^p {\lambda_k \alpha_{kj}^2 }$, the elements of $\lambda_k \alpha_k \alpha_k'$ tends to become smaller as SkS increases, whereas the elements of $\lambda_k$ tends to stay about the same size because $\alpha_k \alpha_k'=1,$ for $k=1,2,…,p$.

PCA Limitations

The results of PCA depend on the scaling of the variables; (
The applicability of PCA is limited by certain assumptions made in its derivation.

PCA in R

require(graphics)
## The variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
(pc.cr <- princomp(USArrests))  # inappropriate

Call:
princomp(x = USArrests)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4 
82.890847 14.069560  6.424204  2.45783

princomp(USArrests, cor = TRUE) # =^= prcomp(USArrests, scale=TRUE)

Call:
princomp(x = USArrests, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4 
1.5748783 0.9948694 0.5971291 0.4164494

## Similar, but different:
## The standard deviations differ by a factor of sqrt(49/50)
summary(pc.cr <- princomp(USArrests, cor = TRUE))

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4
Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000

loadings(pc.cr)  # note that blank entries are small but not zero
## The signs of the columns are arbitrary


Loadings:
         Comp.1 Comp.2 Comp.3 Comp.4
Murder   -0.536  0.418 -0.341  0.649
Assault  -0.583  0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378  0.134
Rape     -0.543 -0.167  0.818   

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

plot(pc.cr) # shows a screeplot.
 
# The histogram distribution presents a vivid picture of the variance attributable to the first four significant principal  
# components respectively.
biplot(pc.cr)  ## shows the plot of PCA in a different format

From the chart above, we can see the distribution variance attributed to different variables on the four principal components.

PCA using SOCR Analyses

This SOCR Activity illustrates the use of PCA.

ICA (independent component analysis)

ICA is a computational method separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other.

ICA Assumptions:
- The source signals are independent of each other;
- The distribution of the values in each source signals are non-Gaussian.
- Independence: the source of signals are assumed to be independent but their signal mixture are not independent because they share the same source signals;
- Normality: based on CLT, the distribution of a sum of independent random variables approximate a Gaussian distribution;
- Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal.

ICA maximizes the statistical independence of the estimated components to find the independent components. In general, ICA can’t identify the actual number of source signals nor can it identify the proper scaling of the source signals. Suppose the data is represented by the random vector $x=(x_1,x_2,…,x_m )^t$ and the components denoted as $s=(s_1,s_2,…,s_n )^t$. We need to transform the observed data $x$ using a linear transformation $w$, $s=Wx$, into maximally independent components $s$ measured by some functions of independence. There are alternative models for ICA:

Linear noiseless ICA: the components $x_i$ of the data $x=(x_1,x_2,…,x_m )^t$ are generated as a sum of the independent components $s_k$, for $k=1,…,n$; $x_i=a_{i,1} s_1 + ⋯ + a_{i,k} s_k + ⋯ + a_{i,n} s_n$, weighted by the mixing weights $a_{i,k}$.
Linear noisy ICA: with additional assumption of zero-mean and uncorrelated Gaussian noise $n \sim N(0,diag(\Sigma))$, the ICA model takes the form $x=As+n$.
Nonlinear ICA: the mixing of the sources is not necessarily linear. Using a nonlinear mixing function $f(.|θ)$ with parameter $\theta$ the nonlinear ICA model is $x=f(s│θ)+n$.
Identifiability: the independent components are identifiable up to a permutation and scaling of the sources, which requires: (1) at most one of the sources $s_k$ is Gaussian; (2) the number of observed mixtures, $m$, must be at least as large as the number of estimated components: $n$ such that $n\leq m$, i.e., the mixing matrix $A$ must be of full rank in order to have inverse.
ICA in R using package fastICA. This example shows un-mixing two mixed independent uniforms:

library(fastICA)
S <- matrix(runif(10000), 5000, 2)
A <- matrix(c(1, 1, -1, 3), 2, 2, byrow = TRUE)
X <- S %*% A # In R,  "*" and "%*%" indicate "scalar" and matrix multiplication, respectively!

a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                 method = "C", row.norm = FALSE, maxit = 200,
                 tol = 0.0001, verbose = TRUE)
par(mfrow = c(1, 2))
plot(a$\$ $X, main = "Pre-processed data")
plot(a$\$ $S, main = "ICA components")

Another example of un-mixing two independent signals is shown below:

S <- cbind(sin((1:1000)/20), rep((((1:200)-100)/100), 5))
# cbind combines objects by rows and columns. It takes a sequence of vector, matrix or data frames arguments and combines them by columns or rows, respectively. 

A <- matrix(c(0.291, 0.6557, -0.5439, 0.5572), 2, 2)
X <- S %*% A
a <- fastICA(X, 2, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                 method = "R", row.norm = FALSE, maxit = 200,
                 tol = 0.0001, verbose = TRUE)
par(mfcol = c(2, 3))
plot(1:1000, S[,1 ], type = "l", main = "Original Signals",
         xlab = "", ylab = "")
plot(1:1000, S[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, X[,1 ], type = "l", main = "Mixed Signals",
         xlab = "", ylab = "")
plot(1:1000, X[,2 ], type = "l", xlab = "", ylab = "")
plot(1:1000, a$\$ $S[,1 ], type = "l", main = "ICA source estimates",
         xlab = "", ylab = "")
plot(1:1000, a$\$ $S[, 2], type = "l", xlab = "", ylab = "")

Factor analysis (FA)

FA is a statistical method, which describes variability among observed correlated variables in terms of potentially lower number of unobserved variables. Consider a set of $p$ observable random variables, $x_1,x_2,…,x_p$ with means $μ_1,μ_2,…,μ_p$, respectively. Suppose for some unknown constants $l_{i,j}$ and $k$ unobserved random variables $F_j$, where $i\in \{1,…,p\}$ and $j \in \{1,…,k\}$ where $k<p$. We have $x_i-μ_i=l_{i,1} F_1 + ⋯ +l_{i,k} F_k + ε_i$, where $ε_i$ are independently distributed error terms with mean zero and finite variance. In matrix form, we have $x-μ = LF+ε$, with $n$ observations, we have $x$ is a $p\times n$ matrix, $L$ is a $p \times k$ matrix and $F$ is $k\times n$ matrix. Assumptions: (1) $ε$ and $F$ are independent; (2) $E(F)=0$; (3) $cov(F)=I$ to make sure the factors are uncorrelated. Solutions to the equations above yield the factors $F$ and the loading matrix, $L$.

Example: In the following, matrices will be indicated by indexed variables. "Subject" indices will be indicated using letters $a$, $b$ and $c$, with values running from $1$ to $N_a$, which is equal to $10$ in the above example. "Factor" indices will be indicated using letters $p$, $q$ and $r$, with values running from $1$ to $N_p$ which is equal to $2$ in the above example. "Instance" or "sample" indices will be indicated using letters $i$, $j$ and $k$, with values running from $1$ to $N_i$. In the example above, if a sample of $N_i=1000$ students responded to the $N_a=10$ questions, the $i^{th}$ student's score for the $a^{th}$ question are given by $x_{ai}$. The purpose of factor analysis is to characterize the correlations between the variables $x_a$ of which the $x_{ai}$ is a particular instance, or set of observations. To ensure that all variables are on equal footing, they are standardized: $z_{ai}=\frac{x_{ai}- μ_a}{σ_a}$, where $μ_a$ is the sample mean and sample variance $σ_a^2=\frac{1}{N_a} \sum_i {(x_{ai}-μ_a)^2}$. The factor analysis model is expressed by:

$$\begin{matrix}z_{1,i} & = & \ell_{1,1}F_{1,i} & + & \ell_{1,2}F_{2,i} & + & \varepsilon_{1,i} \\ \vdots & & \vdots & & \vdots & & \vdots \\ z_{10,i} & = & \ell_{10,1}F_{1,i} & + & \ell_{10,2}F_{2,i} & + & \varepsilon_{10,i} \end{matrix}$$

In matrix form ($Z=LF+ϵ$), this model can be expressed as: $$z_{ai}=\sum_p \ell_{ap}F_{pi}+\varepsilon_{ai},$$ where $F_{1,i}$ is the $i^{th}$ student’s verbal intelligence, $F_{2,i}$ is the $i^{th}$ student’s mathematical intelligence; $l_{ap}$ are the factor loadings for the $a^{th}$ subject for $p=1,2$.

FA in R: using factanal()

# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors, 
# with varimax rotation 

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/mmreg.csv")
# mmreg.csv includes 600 observations and 8 variables. 
# The psychological variables are locus_of_control, self_concept and motivation. 
# The academic variables are standardized tests in reading (read), writing (write), math (math) and science (science). 
# Additionally, the variable female is a zero-one indicator variable with the one indicating a female student.

# We can get some basic descriptions of the entire data set by using summary. 
# To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
summary(mydata)

sapply(mydata, sd)

fit <- factanal(mydata, 3, rotation="varimax")  # mydata can be a raw data matrix or a covariance matrix. 
# Pairwise deletion  of missing data is used. Rotation can be "varimax" or "promax".

print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2 
load <- fit$\$ $loadings[,1:2] 
plot(load,type="n") # set up plot 
text(load,labels=names(mydata),cex=.7) # add variable names

# Principal Axis Factor Analysis
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotation="varimax")
fit # print results

# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata), rep=100,cent=.05)
nS <- nScree(x=ev$\$ $values, aparallel=ap$\$ $eigen$\$ $qevpea)
plotnScree(nS)

PCA, ICA, FA: Similarities and Differences

PCA is closely related to factor analysis. The later typically incorporates more domain specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. Principal components create variables that are linear combinations of the original variables. The new variables have the property that the variables are all orthogonal. The principal components can be used to find clusters in a set of data.

PCA is a variance-focused approach seeking to reproduce the total variable variance, in which components reflect both common and unique variance of the variable. It is generally preferred for purposes of data reduction (i.e., translating variable space into optimal factor space) but not when detect the latent construct or factors. Factor analysis is similar to principal component analysis, in that factor analysis also involves linear combinations of variables.

Different from PCA, factor analysis is a correlation-focused approach seeking to reproduce the inter-correlations among variables, in which the factors “represent the common variance of variables, excluding unique variance. Factor analysis is generally used when the research purpose is detecting data structure (i.e., latent constructs or factors) or causal modeling.

Applications

This SOCR Activity demonstrated the utilization of SOCR analyses package for statistical computing in the SOCR environment. It presents a general introduction to PCA and the theoretical background of this statistical tool and shows how to use PCA and how to read and interpret the outcome. It introduced students to input data in the correct format, read the results of PCA and make interpretation of the resulting transformed data.

This article presents a general introduction to PCA. Principal component analysis of a data matrix extracts the dominant patterns in the matrix in terms of a complementary set of score and loading plots. It is the responsibility of the data analyst to formulate the scientific issue at hand in terms of PC projections, PLS regressions, etc. Ask yourself, or the investigator, why the data matrix was collected, and for what purpose the experiments and measurements were made. Specify before the analysis what kinds of patterns you would expect and what you would find exciting. The results of the analysis depend on the scaling of the matrix, which therefore must be specified. Variance scaling, where each variable is scaled to unit variance, can be recommended for general use, provided that almost constant variables are left unscaled. Combining different types of variables warrants blockscaling. In the initial analysis, look for outliers and strong groupings in the plots, indicating that the data matrix perhaps should be “polished” or whether disjoint modeling is the proper course. For plotting purposes, two or three principal components are usually sufficient, but for modeling purposes the number of significant components should be properly determined, e.g. by cross-validation. Use the resulting principal components to guide your continued investigation or chemical experimentation, not as an end in itself.

This article introduced the Akaike Information Criterion (AIC) to extend the method of maximum likelihood to the multi-model situation. It related the successful experience of the order determination of an autoregressive model to the determination of the number of factors in the maximum likelihood factor analysis. The use of the AIC criterion in the factor analysis is particularly interesting when it is viewed as the choice of a Bayesian model. This observation shows that the area of application of AIC can be much wider than the conventional i.i.d. type models on which the original derivation of the criterion was based. The observation of the Bayesian structure of the factor analysis model leads us to the handling of the problem of improper solution by introducing a natural prior distribution of factor loadings.

This article contains a good introduction to the application of ICA. Independent component models have gained increasing interest in various fields of applications in recent years. The basic independent component model is a semi-parametric model assuming that a p-variate observed random vector is a linear transformation of an unobserved vector of p independent latent variables. This linear transformation is given by an unknown mixing matrix, and one of the main objectives of independent component analysis (ICA) is to estimate an unmixing matrix by means of which the latent variables can be recovered. In this article, we discuss the basic independent component model in detail, define the concepts and analysis tools carefully, and consider two families of ICA estimates. The statistical properties (consistency, asymptotic normality, efficiency, robustness) of the estimates can be analyzed and compared via the so called gain matrices. Some extensions of the basic independent component model, such as models with additive noise or models with dependent observations, are briefly discussed. The article ends with a short example.

Software

Problems

R Example 1

# Install package ‘fastICA’

> library(fastICA)
# Using the SOCR 1981-2006 CPI Data (http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex), 
# save the table in ASCII text file CPI_Data.dat. Note the "dir" (folder) where you saved the data and reference it below
> CPI_Data <- as.matrix(read.table("/dir/CPI_Data.dat",header=T))

> # compare PCA and FA analyses
> X <- CPI_Data[,-1]
> pc.cor <- princomp(X,cor=T)
> summary(pc.cor )
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5     Comp.6      Comp.7
Standard deviation     2.1348817 1.1696678 0.71243186 0.54364890 0.38449985 0.31956304 0.145200770
Proportion of Variance 0.6511029 0.1954461 0.07250845 0.04222202 0.02112002 0.01458865 0.003011895
Cumulative Proportion  0.6511029 0.8465490 0.91905742 0.96127944 0.98239946 0.99698811 1.000000000

> ica <- fastICA(X,n.comp=7)
> names(ica)
[1] "X" "K" "W" "A" "S"
#  X: pre-processed data matrix (whitened/sphered data)
#  K: pre-whitening matrix that projects data onto the first n.comp
#     principal components.
#  W: estimated un-mixing matrix (XW = S)
#  A: estimated mixing matrix (X = SA)
#  S: estimated source matrix (factor scores, $\Theta$ in the notes)

> windows()
> biplot(pc.cor)

> S <- ica$\$ $S
> dimnames(S) <- list(dimnames(X)1,paste("Cmp.",1:7,sep=""))
> A <- ica$\$ $A
> dimnames(A) <- list(dimnames(X)2,paste("Cmp.",1:7,sep=""))
> windows()
> biplot(S[,1:2],A[,1:2])

> loadings(pc.cor) 

Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Electricity  -0.415 -0.227  0.164  0.512 -0.373 -0.576
Fuel_Oil     -0.351  0.547               -0.198  0.312
Bananas      -0.373 -0.415  0.258  0.365  0.393  0.578
Tomatoes     -0.369 -0.320  0.357 -0.738 -0.294       
Orange_Juice -0.324 -0.311 -0.871        -0.119       
Beef         -0.424  0.220        -0.216  0.721 -0.449
Gasoline     -0.380  0.479               -0.216  0.161
             Comp.7
Electricity  -0.131
Fuel_Oil     -0.657
Bananas            
Tomatoes           
Orange_Juice  0.108
Beef               
Gasoline      0.733

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143
Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857
               Comp.7
SS loadings     1.000
Proportion Var  0.143
Cumulative Var  1.000

> field <- function(x) { substr(paste(x,"           ",sep=""),1,6) }
> A.str <- ifelse(abs(A)<2,field(" "),field(round(A,2)))
> dimnames(A.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep=""))
> write.table(A.str[,1:4],"",quote=F,row.names=T,col.names=NA)
 		Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity                            
Fuel_Oil                            
Bananas 	-2.59                      
Tomatoes 	-2.9                       
Orange_Juice 	-2.66                      
Beef                            
Gasoline                            

> L <- pc.cor$loadings
> L.str <- ifelse(abs(L)<0.3,field(" "),field(round(L,2)))
> dimnames(L.str) <- list(dimnames(X)2,paste("Cmp.",1:7,sep=""))
> write.table(L.str[,1:4],"",quote=F,row.names=T,col.names=T)
		Cmp.1 Cmp.2 Cmp.3 Cmp.4
Electricity 	-0.41                0.51  
Fuel_Oil 	-0.35  0.55                
Bananas 	-0.37  -0.41         0.37  
Tomatoes 	-0.37  -0.32  0.36   -0.74 
Orange_Juice 	-0.32  -0.31  -0.87        
Beef 		-0.42                      
Gasoline 	-0.38  0.48

References

Appendix

SOCR 1981-2006 CPI Dataset

Year	Electricity	Fuel_Oil	Bananas	Tomatoes	Orange_Juice	Beef	Gasoline
1981	31.552	1.15	0.343	0.792	1.141	1.856	1.269
1982	36.006	1.254	0.364	0.763	1.465	1.794	1.341
1983	37.184	1.194	0.332	0.726	1.418	1.756	1.214
1984	38.6	1.122	0.344	0.854	1.408	1.721	1.2
1985	38.975	1.078	0.35	0.697	1.685	1.711	1.145
1986	40.223	1.126	0.337	1.104	1.756	1.662	1.19
1987	40.022	0.817	0.374	0.871	1.512	1.694	0.868
1988	40.195	0.89	0.394	0.797	1.638	1.736	0.947
1989	40.828	0.883	0.429	1.735	1.868	1.806	0.944
1990	41.663	1.259	0.438	0.912	1.817	1.907	1.09
1991	43.226	1.235	0.428	0.936	2.005	1.996	1.304
1992	44.501	0.985	0.426	1.141	1.879	1.926	1.135
1993	46.959	0.969	0.44	1.604	1.677	1.97	1.182
1994	48.2	0.919	0.503	1.323	1.674	1.892	1.109
1995	48.874	0.913	0.463	1.103	1.583	1.847	1.19
1996	48.538	1.007	0.497	1.213	1.577	1.799	1.186
1997	49.245	1.136	0.473	1.452	1.737	1.85	1.318
1998	46.401	0.966	0.489	1.904	1.601	1.818	1.186
1999	45.061	0.834	0.49	1.443	1.753	1.834	1.031
2000	45.207	1.189	0.5	1.414	1.823	1.903	1.356
2001	47.472	1.509	0.509	1.451	1.863	2.037	1.525
2002	47.868	1.123	0.526	1.711	1.876	2.152	1.209
2003	47.663	1.396	0.512	1.472	1.848	2.131	1.557
2004	49.159	1.508	0.485	1.66	1.957	2.585	1.635
2005	50.847	1.859	0.49	2.162	1.872	2.478	1.866
2006	57.223	2.418	0.505	1.621	1.853	2.607	2.359

SOCR Home page: http://www.socr.umich.edu

Translate this page:

(default)	Deutsch	Español	Français	Italiano	Português	日本語	България	الامارات العربية المتحدة	Suomi	इस भाषा में	Norge
한국어	中文	繁体中文	Русский	Nederlands	Ελληνικά	Hrvatska	Česká republika	Danmark	Polska	România	Sverige

@@ Line 11: / Line 11: @@
 ===Theory===
 ====PCA (principal component analysis)====
-PCA is a statistical technique, which uses an orthogonal transformation to convert a set of observations of possibly correlated variables into set of values variables, which are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of true eigenvector-based multivariate analyses to reveal the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has found application many fields, including computer networks and image processing, and is a powerful method for finding patterns high-dimension dataset.
+PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of variables that are linearly uncorrelated. The resulting uncorrelated variables are called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for the remaining variability. PCA is the simplest of the true eigenvector-based multivariate analyses; it reveals the internal structure of the data in a way that best explains the variance in the data. It is a useful statistical technique that has been applied in many fields, including computer networks and image processing, and is a powerful method for finding patterns in high-dimensional datasets.
-Consider a data matrix $X_{n\times p}$ with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (variable). Mathematically, the transformation is defined by a set of $p$ dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($\\w_{(k)}||=1$), that map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$ in such a way that the individual variable of $t$ considered over the data set successively inherit the maximum possible variance from $x$.
+Consider a data matrix $X_{n\times p}$ with column-wise empirical means of zero (i.e., the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of data-element (i.e., variable). Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights $w_{(k)}=(w_1,w_2,…,w_p)_{(k)}$, constrained to be unitary ($\\w_{(k)}||=1$), which map each row vector $x_i$ of $X$ to a new vector of principal component scores $t_{(i)}=(t_1,t_2,…,t_p)_{(i)}$, given by $t_{k(i)}=x_{(i)} w_{(k)}$. The mapping occurs such that the individual elements of $t$ considered over the data set successively inherit the maximum possible variance from $x$.
 * First component: the first loading vector $w_{(1)}= \arg\max_{||w||=1} {\sum_i{(t_1)_{(i)}^2 }}= \arg\max_{||w||=1} {\sum_i{(x_{(i)} w)^2 }}$, in matrix form: $w_{(1)}=\arg\max_{||w||=1} {||Xw||^2 }=\arg\max_{||w||=1} {w^T X^T Xw}$.
@@ Line 19: / Line 19: @@
 * Further components: the $k^{th}$ component can be found by subtracting the first $k-1$ principal components from $X$. $\hat{X}_{k-1} = X-\sum_{s=1}^{k-1}{X w_{(s)} w_{(s)}^T }$ and finding the loading vector involves extracting the maximum variance from this new data matrix: $w_{(k)}=\arg\max_{||w||=1} {||\bar{X}_{k-1} w||^2 } = \arg\max {\frac{w^T \hat{X}_{k-1}^T \hat{X}_{k-1} w}{w^T w}}.$ This gives the remaining eigenvectors of $X^T X$ with the maximum values for the quantity in brackets given by other corresponding eigenvalues. The full principal components decomposition of $X$ can therefore be given as $T=XW$ where $W$ is a $p\times p$ matrix whose columns are the eigenvectors of $X^T X$.
-* Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional dataset while still retaining as much of the variance in the dataset as possible.
+* Dimensionality reduction: The faithful transformation $T = X W$ maps a data vector $x_{(i)}$ from an original space of $p$ variables to a new space of $p$ (other) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ loading vectors, gives the truncated transformation $T_L=XW_L$ where $T_L$ has $n$ rows but only $L$ columns. This dimensionality reduction is very useful for visualizing and processing high-dimensional datasets while still retaining as much of the variance in the dataset as possible.
 * Singular value decomposition (SVD)
-: The SVD of $X=U \Sigma W^T$ where $\Sigma$ is a $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.
+: The SVD of $X=U \Sigma W^T$ where $\Sigma$ is an $n\times p$ rectangular diagonal matrix of positive numbers $\sigma_{(k)}$, the singular values of $X$; $U$ is an $n\times n$ matrix, the columns of which are orthogonal until vector of length $n$ called the left singular vectors of $X$; and $W$ is a $p\times p$ whose columns are orthogonal until vector of length $p$ and called the right singular vectors of $X$. With factorization, the matrix $X^T X=W \Sigma U^T U \Sigma W^T = W \Sigma^2 W^T$; with singular value decomposition the score matrix $T=XW=U\Sigma W^T W=U\Sigma$, where each column of $T$ is given by one of the left singular vectors of $X$ multiplied by the corresponding singular value.
 * Computational details of PCA:
-** Begin with a dataset, which contains at least two dimensions (variables). The dataset can contain as many observations (dimensions) as possible;
+** Begin with a dataset that contains at least two dimensions (i.e., variables). The dataset can contain as many observations (dimensions) as you like;
-** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation in a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $X$ ̅be the average of the $n$ $X$ observations, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then the normalized dataset would be: for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
+** Normalize the observations for each variable. To do this, simply subtract the mean (average) from each observation within a given variable. For example, let $X$ and $Y$ be the two variables from the original dataset, with variable $X$ containing observations $X_1,X_2,…,X_n$, and variable $Y$ containing observations $Y_1,Y_2,…,Y_n$. Let $\bar{X}$  be the average of the $n$ observations of $X$, i.e. $\bar{X}= \frac{X_1+X_2+⋯+X_n}{n}$, and similarly let $\bar{Y}= \frac{Y_1+Y_2+⋯+Y_n}{n}$ be the average of the $Y$ observations. Then, the normalized dataset would be, for variable $X$: $\{X_1-\bar{X}, X_2-\bar{X}, …,X_n-\bar{X}\}$ and for variable $Y$: $\{Y_1-\bar{Y},Y_2-\bar{Y}, …,Y_n-\bar{Y}\}$
 ** Calculate the covariance matrix between the variables of the normalized dataset;
-** Calculate the Eigenvalue and Eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
+** Calculate the eigenvalue and eigenvectors of the covariance matrix (the eigenvectors must be normalized to a length of 1);
-**(5) Choose the most significant principal component, which is simply the Eigenvector with the highest Eigenvalue.
+**(5) Choose the most significant principal component, which is simply the eigenvector with the highest eigenvalue.
 ====PCA Properties====

Difference between revisions of "SMHS PCA ICA FA"

Revision as of 11:53, 1 September 2014

Contents

Scientific Methods for Health Sciences - Dimensionality Reduction: PCA, ICA, FA

Overview

Motivation

Theory

PCA (principal component analysis)

PCA Properties

PCA Limitations

PCA in R

PCA using SOCR Analyses

ICA (independent component analysis)

Factor analysis (FA)

FA in R: using factanal()

PCA, ICA, FA: Similarities and Differences

Applications

Software

Problems

R Example 1

References

Appendix

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools