Scientific Methods for Health Sciences - Partial Correlation
Overview
Partial correlation measures the degree of association between two random variables after removing the effect of a set of controlling random variables; that is, it measures the association that remains after certain factors are controlled for. In this lecture, we present a general introduction to partial correlation, illustrate its application with examples and its calculation in R, and discuss the application of the partial autocorrelation function in time series analysis.
Motivation
We have discussed calculating the correlation between two random variables, which measures their statistical dependence. What if we want to measure the dependence between two random variables after removing the effect of a set of controlling variables? What is the difference between partial correlation and ordinary correlation? How does adjusting for other variables influence the relationship between these two variables?
Theory
1) Partial correlation: the partial correlation between $X$ and $Y$ given a set of $n$ controlling variables $Z = \left \{ Z_{1}, Z_{2}, \cdots, Z_{n} \right \}$, written $\rho_{XY\cdot Z}$, is the correlation between the residuals $R_{X}$ and $R_{Y}$ resulting from the linear regression of $X$ on $Z$ and of $Y$ on $Z$, respectively. The first-order partial correlation is the difference between the ordinary correlation and the product of the two removable correlations, divided by the product of the coefficients of alienation (i.e., $\sqrt{1-\rho^{2}}$) of the removable correlations.
2) Calculation of partial correlation: there are three commonly used methods to calculate partial correlation.
- Using linear regression: a straightforward way to calculate partial correlation is to solve the two associated linear regression problems, obtain the residuals, and compute the correlation between the residuals.
- Denote the i.i.d. samples of some joint probability distribution over $X, Y$ and $Z$ as $x_{i}, y_{i}$ and $z_{i}$. Solve the linear regression problems by finding the $n$-dimensional coefficient vectors $w^{*} _{X} = \mathrm{argmin}_{w} \left \{ \sum_{i=1}^{N} (x_{i}-\left \langle w,z_{i} \right \rangle)^{2} \right \}$ and $w^{*} _{Y} = \mathrm{argmin}_{w} \left \{ \sum_{i=1}^{N} (y_{i}-\left \langle w,z_{i} \right \rangle)^{2} \right \},$ with $N$ being the number of samples and $\left \langle v, w \right \rangle$ the scalar product between the vectors $v$ and $w$. Note: in some implementations, the regression includes a constant term, so that the matrix $z$ would have an additional column of ones.
- The residuals are then $r_{X,i} = x_{i} - \left \langle w^{*}_{X} , z_{i} \right \rangle$ and $r_{Y,i} = y_{i} - \left \langle w^{*}_{Y} , z_{i} \right \rangle.$
- The sample partial correlation is $\hat{ \rho}_{XY \cdot Z} = \frac {N \sum_{i=1}^{N} r_{X,i} r_{Y,i} - \sum_{i=1}^{N} r_{X,i} \sum_{i=1}^{N} r_{Y,i}}{ \sqrt{ N \sum_{i=1}^{N} r^{2}_{X,i}-\left(\sum_{i=1}^{N} r_{X,i}\right)^{2}} \sqrt{ N \sum_{i=1}^{N} r^{2}_{Y,i}-\left(\sum_{i=1}^{N} r_{Y,i}\right)^{2}}}$.
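This residual-based computation takes only a few lines of R. Below is a minimal sketch using simulated data (the vectors x, y and z are hypothetical, not from any data set used in this lecture):

set.seed(1)
## simulated data: a single control z influencing both x and y
z <- rnorm(50)
x <- 2*z + rnorm(50)
y <- -z + rnorm(50)
## regress x and y on z; lm() includes the constant term noted above
r.x <- residuals(lm(x ~ z))
r.y <- residuals(lm(y ~ z))
## the sample partial correlation of x and y given z is the correlation of the residuals
cor(r.x, r.y)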
- Using a recursive formula: since the linear regression method requires substantial computation, a recursive formula can be used instead. Notice that the $n$th-order partial correlation (i.e., with $|Z|=n$) can be computed from three $(n-1)$th-order partial correlations. The 0th-order partial correlation $\rho_{XY\cdot \emptyset}$ is defined to be the ordinary correlation coefficient $\rho_{XY}.$
- This holds for any $Z_{0} \in Z$: $\rho_{XY\cdot Z} =\frac{\rho_{XY\cdot Z\setminus \left \{ Z_{0} \right \}}-\rho_{XZ_{0}\cdot Z\setminus \left \{ Z_{0} \right \}}\rho_{Z_{0}Y\cdot Z\setminus \left \{ Z_{0} \right \}}}{\sqrt{1-\rho^{2}_{XZ_{0}\cdot Z\setminus \left \{ Z_{0} \right \}}} \sqrt{1-\rho^{2}_{Z_{0}Y\cdot Z\setminus \left \{ Z_{0} \right \}}}}$.
- A naive recursive implementation of this computation has exponential time complexity. However, the computation has the overlapping-subproblems property, so using dynamic programming, or simply caching the results of the recursive calls, yields a complexity of $O(n^{3})$.
- Note: when $Z$ is a single variable, this reduces to $\rho_{XY\cdot Z} =\frac{\rho_{XY}-\rho_{XZ}\rho_{ZY}}{\sqrt{1-\rho^{2}_{XZ}} \sqrt{1-\rho^{2}_{ZY}}}.$
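In the single-control case the formula is a one-liner in R. Below is a minimal sketch (pcor1 is a hypothetical helper, not part of any package):

## first-order partial correlation of x and y given a single control z
pcor1 <- function(x, y, z) {
  r.xy <- cor(x, y); r.xz <- cor(x, z); r.zy <- cor(z, y)
  (r.xy - r.xz*r.zy) / (sqrt(1 - r.xz^2) * sqrt(1 - r.zy^2))
}
## this should agree with the residual-based computation sketched above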
- Using matrix inversion: another method computes, in $O(n^{3})$ time, all partial correlations between any pair of variables $X_{i}$ and $X_{j}$ of a set $V$ of cardinality $n$, given all the others, i.e., given $V \setminus \left \{X_{i}, X_{j} \right \}$. If the correlation matrix $\Omega$ is positive definite, and therefore invertible, we define $P=\Omega ^{-1}$ and have $\rho_{X_{i}X_{j}\cdot V\setminus \left \{X_{i}, X_{j} \right \}} = -\frac {p_{ij}}{\sqrt{p_{ii}p_{jj}}}$.
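A minimal R sketch of the matrix-inversion method (partial.cor.matrix is a hypothetical helper; it assumes the sample correlation matrix is invertible):

## all pairwise partial correlations, each controlling for every remaining variable
partial.cor.matrix <- function(data) {
  P <- solve(cor(data))                 # P = Omega^{-1}
  pc <- -P / sqrt(diag(P) %o% diag(P))  # rho_ij = -p_ij / sqrt(p_ii * p_jj)
  diag(pc) <- 1                         # the diagonal is 1 by convention
  pc
}

Applied to the stackloss data used in the R example below, this should reproduce the $estimate matrix returned by the ppcor package's pcor() function.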
Interpretation:
Geometrically, let three variables $X$, $Y$ and $Z$, where $X$ is the independent variable (IV), $Y$ is the dependent variable (DV), and $Z$ is the ‘control’ or ‘extra’ variable, be chosen from a joint probability distribution over $n$ variables $V$. Further, let $v_{j}$, $1 \leq j \leq N,$ be $N$ $n$-dimensional i.i.d. samples taken from the joint probability distribution over $V$. We then consider the $N$-dimensional vectors $x$, $y$ and $z$ formed by these samples. The residual $R_{X}$ from the linear regression of $X$ on $Z$, viewed as an $N$-dimensional vector $r_{X}$, has a zero scalar product with the vector $z$ generated by $Z$. This indicates that the residual vector lies on a hyperplane $S_{Z}$ that is perpendicular to $z$.
For conditional independence testing, under the assumption that all involved variables are multivariate Gaussian, the partial correlation $\rho_{XY\cdot Z}$ is zero if and only if $X$ is conditionally independent of $Y$ given $Z$; this equivalence does not hold in general. To test whether a sample partial correlation $\hat{ \rho}_{XY\cdot Z}$ vanishes, Fisher’s z-transform of the partial correlation can be used: $z(\hat{ \rho}_{XY\cdot Z}) = \frac{1}{2} \ln\left(\frac{1+ \hat{ \rho}_{XY\cdot Z}}{1- \hat{ \rho}_{XY\cdot Z}}\right).$ The null hypothesis is $H_{0}: \rho_{XY\cdot Z}=0$ vs. $H_{A}: \rho_{XY\cdot Z} \neq 0.$ We reject $H_{0}$ at significance level $\alpha$ if $\sqrt{N-|Z|-3} \cdot |z(\hat{ \rho}_{XY\cdot Z})| > \Phi ^{-1} (1- \alpha /2)$, where $\Phi(\cdot)$ is the cumulative distribution function of a Gaussian distribution with zero mean and unit standard deviation, and $N$ is the sample size.
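A minimal R sketch of this test (pcor.z.test is a hypothetical helper; rho.hat is a sample partial correlation, N the sample size, and k = |Z| the number of controlling variables):

## Fisher z-test of H_0: rho_{XY.Z} = 0
pcor.z.test <- function(rho.hat, N, k, alpha = 0.05) {
  z <- 0.5 * log((1 + rho.hat) / (1 - rho.hat))   # Fisher's z-transform
  ## reject H_0 when sqrt(N - k - 3) * |z| exceeds the Gaussian quantile
  sqrt(N - k - 3) * abs(z) > qnorm(1 - alpha/2)   # TRUE = reject H_0
}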
Semi-partial correlation:
The semi-partial (or part) correlation is similar to the partial correlation; both measure variance after certain factors are controlled for. The difference is that the semi-partial correlation holds the third variable constant for either $X$ or $Y$, but not both, while the partial correlation holds it constant for both. The semi-partial correlation can be regarded as more practically relevant, since it is scaled to (i.e., relative to) the total variability in the dependent (response) variable. At the same time, it is less theoretically useful because it is less precise about the unique contribution of the independent variable. The semi-partial correlation of $X$ with $Y$ is always less than or equal to the partial correlation of $X$ with $Y$ in absolute value.
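The ppcor package used in the R example below also provides spcor() and spcor.test() for semi-partial correlations. A brief sketch (which of the two variables the control is removed from follows the package's convention; see ?spcor.test):

library(ppcor)
## semi-partial correlation of air flow with water temperature,
## with acid concentration removed from only one of the two variables
with(stackloss, spcor.test(Air.Flow, Water.Temp, Acid.Conc.))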
- In time series analysis, the partial autocorrelation function (PACF) of a time series is defined, for lag $h$, as $\phi (h) = \rho_{X_{0}X_{h}\cdot \left \{X_{1}, \cdots,X_{h-1} \right \} }$. It is important in identifying the extent of the lag in an autoregressive model. The partial autocorrelation of an $AR(p)$ process is zero at lag $p+1$ and greater.
- Given a time series $z_{t}$, the partial autocorrelation of lag $k$, $\alpha (k)$, is the autocorrelation between $z_{t}$ and $z_{t+k}$ with the linear dependence on $z_{t+1}$ through $z_{t+k-1}$ removed. Equivalently, it is the autocorrelation between $z_{t}$ and $z_{t+k}$ that is not accounted for by lags $1$ to $k-1$, inclusive: $\alpha (1) = Cor(z_{t}, z_{t+1})$ and $\alpha (k) = Cor(z_{t+k}- P_{t,k}(z_{t+k}), z_{t}-P_{t,k}(z_{t}))$ for $k \geq 2$, where $P_{t,k}(x)$ is the projection of $x$ onto the space spanned by $z_{t+1},\cdots, z_{t+k-1}.$
- One looks for the point on the plot where the partial autocorrelations for all higher lags are essentially zero. Placing on the plot an indication of the sampling uncertainty of the sample PACF is helpful for this purpose; this is usually constructed on the basis that the true value of the PACF, at any given positive lag, is zero. An approximate test that a given partial correlation is zero (at the 5% significance level) is given by comparing the sample partial autocorrelations against the critical region with upper and lower limits given by $\pm 1.96 / \sqrt{n}$, where $n$ is the record length (number of points) of the time series being analyzed. This approximation relies on the assumption that the record length is moderately large (say $n>30$) and that the underlying process has finite second moment.
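R's built-in pacf() function plots the sample PACF together with these approximate bounds. A minimal sketch on a simulated series (the AR(2) coefficients are arbitrary choices for illustration):

set.seed(1)
## simulate an AR(2) series; its PACF should be near zero beyond lag 2
x <- arima.sim(model = list(ar = c(0.5, 0.2)), n = 200)
## pacf() draws the +/- 1.96/sqrt(n) bounds as dashed lines by default
pacf(x, main = 'Sample PACF of a simulated AR(2) series')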
Applications
This article proposed using partial correlation analysis to construct approximate Undirected Dependency Graphs from large-scale biochemical data. The method enables a distinction between direct and indirect interactions of biochemical compounds, thereby inferring the underlying network topology. The method was first thoroughly evaluated with a large set of simulated data; the results indicate that the approach has good statistical power and a low false discovery rate, even in the presence of noise in the data. The method was then applied to an existing data set of yeast gene expression. Several small gene networks were inferred and found to contain genes known to be collectively involved in particular biochemical processes. Some of these networks also contain uncharacterized ORFs, leading to hypotheses about their functions.
This article compared the empirical type I error and power of different permutation techniques that can be used for partial correlation analysis involving three data vectors, and for partial Mantel tests. The partial Mantel test is a form of first-order partial correlation analysis involving three distance matrices, widely used in fields such as population genetics, ecology, anthropology, psychometry and sociology. The methods compared are the following:
- (1) permute the objects in one of the vectors (or matrices)
- (2) permute the residuals of a null model
- (3) correlate residualized vector 1 (or matrix A) with residualized vector 2 (or matrix B); permute one of the residualized vectors (or matrices)
- (4) permute the residuals of a full model
In the partial correlation study, the results were compared to those of the parametric t-test, which provides a reference under normality. Simulations were carried out to measure the type I error and power of these permutation methods, using normal and non-normal data, without and with an outlier. There were 10,000 simulations for each situation (100,000 when n = 5); 999 permutations were produced per test where permutations were used. The recommended testing procedures are the following (a permutation-test sketch of method (1) follows this list):
- (a) In partial correlation analysis, most methods can be used most of the time. The parametric t-test should not be used with highly skewed data. Permutation of the raw data should be avoided only when highly skewed data are combined with outliers in the covariable. Methods implying permutation of residuals, which are known to have only asymptotically exact significance levels, should not be used when highly skewed data are combined with small sample size.
- (b) In partial Mantel tests, method 2 can always be used, except when highly skewed data are combined with small sample size.
- (c) With small sample sizes, one should carefully examine the data before partial correlation or partial Mantel analysis. For highly skewed data, permutation of the raw data has correct type I error in the absence of outliers. When highly skewed data are combined with outliers in the covariable vector or matrix, it is still recommended to use permutation of the raw data.
- (d) Method 3 should never be used.
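As an illustration of method (1), permuting the raw data of one vector, here is a minimal sketch of a permutation test for a first-order partial correlation (the data are simulated and pcor.xy.z is a hypothetical helper):

set.seed(1)
## simulated vectors: a control z influencing both x and y
z <- rnorm(30); x <- z + rnorm(30); y <- z + rnorm(30)
## partial correlation of x and y given z, via residuals
pcor.xy.z <- function(x, y, z) cor(residuals(lm(x ~ z)), residuals(lm(y ~ z)))
obs <- pcor.xy.z(x, y, z)                           # observed statistic
perm <- replicate(999, pcor.xy.z(sample(x), y, z))  # method (1): permute the raw x values
## two-sided permutation p-value (the observed value counts as one permutation)
(sum(abs(perm) >= abs(obs)) + 1) / (999 + 1)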
Software
R Example:
Consider the built-in R data set stackloss, which contains measurements on operations in a plant that oxidizes ammonia $(NH_{3})$ to make nitric acid $(HNO_{3})$. We want to calculate the partial correlation between air flow speed and water temperature while controlling for acid concentration. The idea is to first fit a linear regression with air flow as the target and acid concentration as the predictor, and compute the residuals $e_{1}$; then fit a linear regression with water temperature as the target and acid concentration as the predictor, and compute the residuals $e_{2}$; and finally compute the correlation coefficient between the two sets of residuals, which is the partial correlation between air flow and water temperature controlling for the effect of acid concentration.
data(stackloss)
stackloss
Air Flow | Water Temp | Acid Concentration | Stack Loss | |
1 | 80 | 27 | 89 | 42 |
2 | 80 | 27 | 88 | 37 |
3 | 75 | 25 | 90 | 37 |
4 | 62 | 24 | 87 | 28 |
5 | 62 | 22 | 87 | 18 |
6 | 62 | 23 | 87 | 18 |
7 | 62 | 24 | 93 | 19 |
8 | 62 | 24 | 93 | 20 |
9 | 58 | 23 | 87 | 15 |
10 | 58 | 18 | 80 | 14 |
11 | 58 | 18 | 89 | 14 |
12 | 58 | 17 | 88 | 13 |
13 | 58 | 18 | 82 | 11 |
14 | 58 | 19 | 93 | 12 |
15 | 50 | 18 | 89 | 8 |
16 | 50 | 18 | 86 | 7 |
17 | 50 | 19 | 72 | 8 |
18 | 50 | 19 | 79 | 8 |
19 | 50 | 20 | 80 | 9 |
20 | 56 | 20 | 82 | 15 |
21 | 70 | 20 | 91 | 15 |
summary(stackloss)
Air Flow | Water Temp | Acid Concentration | Stack Loss
Min. :50.00 | Min. :17.0 | Min. :72.00 | Min. : 7.00 |
1st Qu.:56.00 | 1st Qu.:18.0 | 1st Qu.:82.00 | 1st Qu.:11.00 |
Median :58.00 | Median :20.0 | Median :87.00 | Median :15.00 |
Mean :60.43 | Mean :21.1 | Mean :86.29 | Mean :17.52 |
3rd Qu.:62.00 | 3rd Qu.:24.0 | 3rd Qu.:89.00 | 3rd Qu.:19.00 |
Max. :80.00 | Max. :27.0 | Max. :93.00 | Max. :42.00 |
attach(stackloss)
## run linear regression with air ~ acid
lr1 <- lm(Air.Flow ~ Acid.Conc.)
res1 <- residuals(lr1)
## run linear regression with water ~ acid
lr2 <- lm(Water.Temp ~ Acid.Conc.)
res2 <- residuals(lr2)
## plot res1 vs. res2 to see if there is a linear trend
plot(res2, res1, main='Plot of residuals from two linear regression models fitted', xlab='Residuals of Water ~ Acid', ylab='Residuals of Air ~ Acid')
## calculate the partial correlation
par.corr <- cor(res1, res2, method='spearman')
## calculate the test statistic and p-value of the Spearman correlation coefficient
n <- length(Air.Flow)
n
[1] 21
test.stat <- par.corr*sqrt((n-3)/(1-par.corr^2))
test.stat
[1] 3.162624
p.value <- 2*pt(abs(test.stat), n-3, lower.tail=F)
p.value
[1] 0.005387061
Hence, we have significant evidence to reject the null hypothesis and conclude that air flow and water temperature are significantly correlated, controlling for acid concentration, at the 5% level of significance.
Note: in this method, for Spearman’s rho the sampling distribution of the test statistic is $T= \hat{\rho}_{p} \sqrt{(n-2-k) / (1- \hat{\rho}^{2}_{p})}$, where $T \sim t_{n-2-k}$ and $k$ is the number of controlling variables (here $k=1$).
For the same question, we now use the ppcor package in R:
library(ppcor)
## calculate pairwise partial correlations for each pair of variables given the others;
## the method can be pearson, kendall or spearman
pcor(stackloss)
$estimate
Air.Flow | Water.Temp | Acid.Conc. | stack.loss | |
Air.Flow | 1.0000000 | -0.1693960 | 0.3838007 | 0.7896593 |
Water.Temp | -0.1693960 | 1.0000000 | 0.1490278 | 0.6492456 |
Acid.Conc. | 0.3838007 | 0.1490278 | 1.0000000 | -0.2297477 |
stack.loss | 0.7896593 | 0.6492456 | -0.2297477 | 1.0000000 |
$p.value
Air.Flow | Water.Temp | Acid.Conc. | stack.loss | |
Air.Flow | 0.000000e+00 | 0.4785233722 | 0.08658525 | 1.116809e-07 |
Water.Temp | 4.785234e-01 | 0.0000000000 | 0.53433865 | 4.322516e-04 |
Acid.Conc. | 8.658525e-02 | 0.5343386526 | 0.00000000 | 3.303994e-01 |
stack.loss | 1.116809e-07 | 0.0004322516 | 0.33039937 | 0.000000e+00 |
$statistic
Air.Flow | Water.Temp | Acid.Conc. | stack.loss | |
Air.Flow | 0.0000000 | -0.7086795 | 1.7136923 | 5.3066130 |
Water.Temp | -0.7086795 | 0.0000000 | 0.6213967 | 3.5195672 |
Acid.Conc. | 1.7136923 | 0.6213967 | 0.0000000 | -0.9733098 |
stack.loss | 5.3066130 | 3.5195672 | -0.9733098 | 0.0000000 |
$n
[1] 21

$gp
[1] 2

$method
[1] "pearson"
pcor.test(Air.Flow, Water.Temp, Acid.Conc.)
   estimate      p.value statistic  n gp  Method
1 0.7356413 4.073273e-06  4.607608 21  1 pearson
pcor.test(Air.Flow, Water.Temp, Acid.Conc., method='spearman')
   estimate      p.value statistic  n gp   Method
1 0.6974934 3.634416e-05   4.12957 21  1 spearman
pcor.test(Air.Flow, Water.Temp, c(Acid.Conc., stack.loss))
## calculate the partial correlation between air flow and water temperature
## while controlling for acid concentration and stack loss
   estimate      p.value statistic  n gp  Method
1 0.7762696 1.471018e-14  7.690029 42  1 pearson
Note: c(Acid.Conc., stack.loss) concatenates the two controls into a single vector of length 42, which is why the output reports n = 42 and gp = 1. To control for both variables jointly, pass them as separate columns, e.g., pcor.test(Air.Flow, Water.Temp, cbind(Acid.Conc., stack.loss)), which should report n = 21 and gp = 2.
Problems
y.data <- data.frame(hl=c(7,15,19,15,21,22,57,15,20,18),
    disp=c(0.000,0.964,0.000,0.000,0.921,0.000,0.000,1.006,0.000,1.011),
    deg=c(9,2,3,4,1,3,1,3,6,1),
    BC=c(1.78e-02,1.05e-06,1.37e-05,7.18e-03,0.00e+00,0.00e+00,0.00e+00,4.48e-03,2.10e-06,0.00e+00))
- 1) Use this dataset to calculate the partial correlation between ‘hl’ and ‘disp’ given ‘deg’, using the linear regression method as well as the ppcor package. State and compare your results.
- 2) Use this dataset to calculate the partial correlation between ‘hl’ and ‘disp’ given ‘deg’ and ‘BC’, using the ppcor package introduced in the lecture. State your conclusions.
- SOCR Home page: http://www.socr.umich.edu