Difference between revisions of "SMHS PartialCorrelation"

From SOCR
Jump to: navigation, search
(R Example:)
Line 1: Line 1:
 
==[[SMHS| Scientific Methods for Health Sciences]] - Partial Correlation ==
 
==[[SMHS| Scientific Methods for Health Sciences]] - Partial Correlation ==
  
===Overview===  
+
===Overview===
Partial correlation measures the degree of association between two random variables after removing the effect of set of controlling random variables. It measures variance after certain factors are controlled for. In this lecture, we are going to present a general introduction to partial correlation and illustrate its application with examples and its calculation in R. We are also going to discuss the application of partial correlation function in time series analysis.
+
'''Partial correlation''' measures the degree of association between two random variables after removing the linear effects of one or more controlling variables. It quantifies the unique relationship between two variables while statistically controlling for potential confounding factors. Partial correlation is fundamental in:
 +
- Identifying direct relationships in multivariate systems
 +
- Controlling for confounding variables in observational studies
 +
- Network analysis and graphical models
 +
- Time series analysis (partial autocorrelation)
  
 
===Motivation===
 
===Motivation===
We have discussed about calculating the correlation between two random variables, which measures the statistical relationships involving dependence. What if we want to measure the dependence relationship between two random variables after removing the set of controlling variables? What is the difference between partial correlation and correlation in general? How would this adjustment for other variables influence the relationship between these two variables?
+
Consider a study examining the relationship between exercise frequency and cholesterol levels. Age is known to affect both variables: older individuals tend to exercise less and have higher cholesterol. Simple correlation between exercise and cholesterol would be confounded by age. Partial correlation addresses this by:
 +
 
 +
1. '''Removing spurious correlations''' due to common causes
 +
 
 +
2. '''Identifying direct relationships''' in complex systems
 +
 
 +
3. '''Testing conditional independence''' in graphical models
 +
 
 +
4. '''Decomposing multivariate relationships''' into direct and indirect effects
  
 
===Theory===
 
===Theory===
1) Partial correlation: the partial correlation between $\it X$ and $\it Y$ given a set of $n$ controlling variables $Z = \left \{ Z_{1}, Z_{2}, \cdots, Z_{n} \right \}$ , written $\rho _{XYZ}$, is the correlation between the residuals $R_{X}$ and $R_{Y}$ resulting from the linear regression of $X$ with $Z$ and $Y$ with $Z$ respectively. The first-order partial correlation is the difference between a correlation and the product of the removable correlations divided by the product of the coefficients of alienation of the removable correlations.
 
  
2) Calculation of partial correlation: there are three most commonly used methods to calculate partial correlation.  
+
====1) Mathematical Foundations====
 +
 
 +
=====Definition=====
 +
The partial correlation between variables \(X\) and \(Y\) given a set of controlling variables \(Z = \{Z_1, Z_2, \ldots, Z_n\}\) is defined as:
 +
<math>
 +
\rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ}\rho_{YZ}}{\sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)}}
 +
</math>
 +
for a single controlling variable \(Z\), and more generally as:
 +
<math>
 +
\rho_{XY \cdot Z} = \frac{\text{Cov}(X_{\perp Z}, Y_{\perp Z})}{\sqrt{\text{Var}(X_{\perp Z})\text{Var}(Y_{\perp Z})}}
 +
</math>
 +
where \(X_{\perp Z}\) and \(Y_{\perp Z}\) are the residuals from regressing \(X\) and \(Y\) on \(Z\).
 +
 
 +
=====Matrix Formulation=====
 +
For a set of variables with covariance matrix \(\boldsymbol{\Sigma}\), let \(\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}\) be the precision matrix. The partial correlation between \(X_i\) and \(X_j\) controlling for all other variables is:
 +
<math>
 +
\rho_{X_i X_j \cdot V\setminus\{X_i,X_j\}} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\omega_{jj}}}
 +
</math>
 +
where \(\omega_{ij}\) are elements of \(\boldsymbol{\Omega}\).
 +
 
 +
This relationship reveals that '''zero partial correlation implies conditional independence''' for jointly Gaussian variables:
 +
<math>
 +
\rho_{X_i X_j \cdot V\setminus\{X_i,X_j\}} = 0 \iff X_i \perp\!\!\!\perp X_j \mid V\setminus\{X_i,X_j\}
 +
</math>
 +
 
 +
=====Geometric Interpretation=====
 +
In vector space terminology, partial correlation corresponds to the cosine of the angle between the residual vectors after projecting \(X\) and \(Y\) onto the subspace spanned by \(Z\):
 +
<math>
 +
\rho_{XY\cdot Z} = \cos(\theta_{R_X R_Y})
 +
</math>
 +
where \(R_X = X - P_Z(X)\) and \(R_Y = Y - P_Z(Y)\) are residuals, and \(P_Z\) denotes projection onto the span of \(Z\).
 +
 
 +
====2) Calculation Methods====
 +
 
 +
=====Method 1: Residual Regression=====
 +
1. Regress \(X\) on \(Z\): \(X = \boldsymbol{\beta}_X^\top Z + \varepsilon_X\)
 +
2. Regress \(Y\) on \(Z\): \(Y = \boldsymbol{\beta}_Y^\top Z + \varepsilon_Y\)
 +
3. Compute correlation between residuals: \(\rho_{XY\cdot Z} = \text{Cor}(\varepsilon_X, \varepsilon_Y)\)
  
*Using linear regression: solving the two associated linear regression problems to get the residuals and calculate the correlation between the residuals is a straightforward way to calculate partial correlation.
+
The sample estimate is:
*Denote the i.i.d. samples of some joint probability distribution over $X, Y\, and\, Z$ as $x_{i}, y_{i} and z_{i},$ solving the linear regression problem by finding the n-dimension vectors $w^{*} _{X} = argmin_{x} \left \{ \sum_{i=1}^{N} (x_{i}-\left \langle w,z_{i} \right \rangle)^{2} \right \}, w^{*} _{Y} = argmin_{x} \left \{ \sum_{i=1}^{N} (y_{i}-\left \langle w,z_{i} \right \rangle)^{2} \right \},$ with $N$ being the number of samples and $\left \langle v, w \right \rangle$ the scalar product between the vectors $v$ and $w$. Note: in some implementations, the regression includes a constant term, so that the matrix $z$ would have an additional column of ones.
+
<math>
*The residuals are then $r_{X,i} = x_{i} - \left \langle w^{*}_{X} , z_{i} \right \rangle, r_{Y,i} = y_{i} - \left \rangle w^{*}_{Y} , z_{i} \right \rangle.$
+
\hat{\rho}_{XY\cdot Z} = \frac{\sum_{i=1}^n \hat{\varepsilon}_{X,i}\hat{\varepsilon}_{Y,i}}{\sqrt{\sum_{i=1}^n \hat{\varepsilon}_{X,i}^2 \sum_{i=1}^n \hat{\varepsilon}_{Y,i}^2}}
*The sample partial correlation is $\hat{ \rho}_{XY \dot Z} = \frac {N \sum_{i=1}^{N} r_{X,i} r_{Y,i} - \sum-{i=1}^{N} r_{X,i} r_{Y,i}}{ \sqrt{ N \sum_{i=1}^{N} r^{2}_{X,i}-(\sum_{i=1}^{N} r_{X,i})^{2}} \sqrt{ N \sum_{i=1}^{N} r^{2}_{Y,i}-(\sum_{i=1}^{N} r_{Y,i})^{2}}}$.
+
</math>
*Using recursive formula: given that the method using linear regression requires lots of calculation, we can refer to the recursive formula. Notice that the nth-order partial correlation (i.e., with $|Z|=n$) can be easily computed from three (n-1)th-order partial correlation. The 0th-order partial correlation $\rho_{XY_{O}}$ is defined to be the regular correlation coefficient $\rho_{XY}.$
 
**This holds for any $Z_{0} \in Z: \rho_{XY. Z} =\frac{\rho_{XY.Z\setminus \left \{ Z_{0} \right \}}-\rho_{XZ_{0}.Z\setminus \left \{ Z_{0} \right \}}\rho_{Z_{0}Y.Z\setminus \left \{ Z_{0} \right \}}}{\sqrt{1-\rho^{2}_{XZ_{0}.Z\setminus \left \{ Z_{0} \right \}}} \sqrt{1-\rho^{2}_{Z_{0}Y.Z\setminus \left \{ Z_{0} \right \}}}}$.
 
**A recursive implementation of this computation will generate an exponential time complexity. However, this computation has overlapping subproblems property, such that using dynamic programming or simply caching the results of the recursive calls generates a complexity of $O(n^{3})$.
 
**Note: when Z is a single variable, this reduces to $\rho_{XY. Z} =\frac{\rho_{XY}-\rho_{XZ}\rho_{ZY}}{\sqrt{1-\rho^{2}_{XZ}} \sqrt{1-\rho^{2}_{ZY}}}.$
 
*Using matrix inversion: in $O(n^{3})$ time, another method allows all partial correlations to be computed between any variables $X_{i}$ and $X_{j}$ of a set of $V$ of cardinality $n$ given all others, i.e., $V \setminus \left \{X_{i}, X_{j} \right \},$ if the correlation matrix is positive and therefore invertible, we define $P=\Omega ^{-1},$ and have $\rho_{X_{i}X_{j}.V\setminus \left \{X_{i}, X_{j} \right \}} = \frac {p_{ij}}{\sqrt{p_{ii}p_{jj}}}$.
 
  
==Interpretation:==  
+
=====Method 2: Recursive Formula=====
In Geometrical, let three variables $X, Y and\, Z$ where $x$ is the independent variable $(IV)$, $y$ is the dependent variable $(DV)$, and $Z$ is the ‘control’ or ‘extra variable’ be chosen from a joint probability distribution over $n$ variables $V$. Further let $v_{j}$, $I \leq I \leq N,$ be $N$ n-dimensional i.i.d. samples taken from the joint probability distribution over $V$. We then consider the N-dimensional vectors $x$, and $z$. The residuals $R_{X}$ comes from the linear regression of $X$ using $Z$, which can be considered as an N-dimensional vector $r_{X}$, has a zero scalar product with the vector $z$ generated by $Z$. This indicates that the residuals vector lives on a hyperplane $S_{Z}$ that is perpendicular to $Z$.  
+
For nested sets of controlling variables \(Z \subset W\):
 +
<math>
 +
\rho_{XY\cdot W} = \frac{\rho_{XY\cdot Z} - \rho_{XW_n\cdot Z\setminus\{W_n\}}\rho_{YW_n\cdot Z\setminus\{W_n\}}}{\sqrt{(1-\rho_{XW_n\cdot Z\setminus\{W_n\}}^2)(1-\rho_{YW_n\cdot Z\setminus\{W_n\}}^2)}}
 +
</math>
 +
where \(W_n\) is any variable in \(W\setminus Z\).
  
In conditional independence test, with the assumption that all involved variables are multivariate Gaussian, the partial correlation $\rho_{XY.Z}$ is zero if and only if $X$ is conditionally independent from $Y$ given $Z$, which does not hold in general cases. To test if a sample partial correlation $\hat{ \rho}_{XY.Z}$ vanishes, Fisher’s z-transform of the partial correlation can be used: $z(\hat{ \rho}_{XY.Z}) = 1/2 \ln(\frac{1+ \hat{ \rho}_{XY.Z}}{1- \hat{ \rho}_{XY.Z}}).$ The null hypothesis is $H_{o}: \hat{ \rho}_{XY.Z}=0, vs. H_{A}: \hat{ \rho}_{XY.Z} \neq 0.$ We reject $H_{o}4$ with significance level $4\alpha if \sqrt{N-|Z|-3} * |z(\hat{ \rho}_{XY.Z})| > \Phi ^{-1} (1- \alpha /2)$, where $\Phi ()$ is the cumulative distribution function of a Gaussian distribution with zero mean and standard deviation 1. $N$ is the sample size.  
+
This allows efficient computation with time complexity \(O(p^3)\) for \(p\) variables using dynamic programming.
  
==Semi-partial correlation:==  
+
=====Method 3: Matrix Inversion=====
Similar to partial correlation, both of which measure variance after certain factors are controlled for, but the former holds the third variable constant for either X or Y, while the latter holds the third variable constant for both. The semi-partial (or part) correlation can be regarded as more practically relevant since it is scaled to (i.e., relative to) the total variability in the dependent (response) variable. Meanwhile, it is less theoretically useful because it is less precise about the unique contribution of the independent variable. Semi-partial correlation of X with Y is always less than or equal to the partial correlation of X with Y.
+
From the precision matrix \(\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}\):
 +
<math>
 +
\hat{\rho}_{X_i X_j \cdot V\setminus\{X_i,X_j\}} = -\frac{\hat{\omega}_{ij}}{\sqrt{\hat{\omega}_{ii}\hat{\omega}_{jj}}}
 +
</math>
 +
This method is particularly efficient for computing all pairwise partial correlations simultaneously.
  
*In time series analysis, the partial correlation function(PACF)of a time series is defined for lag $h$, as $\Phi (h) = \rho_{X_{0}X_{h}. \left \{X_{1}, \cdots,X_{h-1} \right \} }$. It is important in identifying the extent of lag in an autoregressive model. The partial autocorrelation of an $AR(p)$ process is zero at lag $p+1$ and greater.
+
<pre>
 +
# R function implementing all three methods
 +
calculate_partial_correlation <- function(X, Y, Z, method = "inversion") {
 +
  n <- length(X)
 +
 
 +
  if (method == "regression") {
 +
    # Method 1: Residual regression
 +
    res_X <- resid(lm(X ~ Z))
 +
    res_Y <- resid(lm(Y ~ Z))
 +
    return(cor(res_X, res_Y))
 +
   
 +
  } else if (method == "recursive") {
 +
    # Method 2: Recursive formula (for single Z)
 +
    if (is.null(dim(Z))) { # Single controlling variable
 +
      r_xy <- cor(X, Y)
 +
      r_xz <- cor(X, Z)
 +
      r_yz <- cor(Y, Z)
 +
      return((r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)))
 +
    } else {
 +
      # For multiple Z, use recursive approach
 +
      pcor_matrix <- pcor(Z)$estimate
 +
      # Extract relevant partial correlations
 +
      # (This is simplified; full implementation would handle recursion)
 +
    }
 +
   
 +
  } else if (method == "inversion") {
 +
    # Method 3: Matrix inversion
 +
    data_matrix <- cbind(X, Y, Z)
 +
    sigma <- cov(data_matrix)
 +
    omega <- solve(sigma)
 +
    pcor <- -omega[1,2] / sqrt(omega[1,1] * omega[2,2])
 +
    return(pcor)
 +
  }
 +
}
 +
</pre>
  
*Given a time series $z_{t}$, the partial autocorrelation of lag $k$, $\alpha (k)$, is the autocorrelation between $z_{t}$ and $z_{t+k}$ with the linear dependence of $z_{t+1}$ through to $z_{t+k-1}$ removed, it is also the autocorrelation between $z_{t}$ and $z_{t+k}$ that is not accounted for by lags $1$ to $k-1$, inclusive. $\alpha (1) = Cor(z_{t}, z_{t+1}), \alpha (k) = Cor(z_{t+k}- P_{t,k}(z_{t+k}), z_{t}-P_{t,k}(z_{t}))$, for $k \geq 2, where P_{t,k(x)}$ is the projection of $x$ onto the space spanned by $z_{t+1},\cdots, z_{t+k-1}.$
+
====3) Statistical Inference====
  
*One looks for the point on the plot where the partial autocorrelations for all higher lags are essentially zero. Placing on the plot an indication of the sampling uncertainty of the sample PACF is helpful for this purpose: this is usually constructed on the basis that the true value of the PACF, at any given positive lag, is zero. This can be formalized as described below. An approximate test that a given partial correlation is zero (at a 5% significance level) is given by comparing the sample partial autocorrelations against the critical region with upper and lower limits given by $\pm 1.95 \setminus \sqrt{n}$, where $n$ is the record length (number of points) of the time-series being analyzed. This approximation relies on the assumption that the record length is moderately large (say $n>30$) and that the underlying process has finite second moment.
+
=====Hypothesis Testing=====
 +
Test \(H_0: \rho_{XY\cdot Z} = 0\) vs \(H_1: \rho_{XY\cdot Z} \neq 0\).
 +
 
 +
'''Fisher's z-transform''':
 +
<math>
 +
z = \frac{1}{2}\ln\left(\frac{1+\hat{\rho}}{1-\hat{\rho}}\right) \sim N\left(\frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right), \frac{1}{n-|Z|-3}\right)
 +
</math>
 +
 
 +
Test statistic:
 +
<math>
 +
T = \frac{z - \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)}{\sqrt{1/(n-|Z|-3)}} \sim N(0,1)
 +
</math>
 +
Under \(H_0: \rho = 0\), this simplifies to \(T = z\sqrt{n-|Z|-3}\).
 +
 
 +
'''Exact t-test''' (for normally distributed data):
 +
<math>
 +
t = \hat{\rho}\sqrt{\frac{n-|Z|-2}{1-\hat{\rho}^2}} \sim t_{n-|Z|-2}
 +
</math>
 +
 
 +
=====Confidence Intervals=====
 +
Using Fisher's transform:
 +
<math>
 +
\text{CI}_{1-\alpha} = \left[\tanh\left(z - \frac{z_{1-\alpha/2}}{\sqrt{n-|Z|-3}}\right), \tanh\left(z + \frac{z_{1-\alpha/2}}{\sqrt{n-|Z|-3}}\right)\right]
 +
</math>
 +
where \(\tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}\) is the hyperbolic tangent.
 +
 
 +
=====Power Analysis=====
 +
The required sample size to detect partial correlation \(\rho\) with power \(1-\beta\) at level \(\alpha\):
 +
<math>
 +
n = |Z| + 3 + \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{z(\rho)}\right)^2
 +
</math>
 +
where \(z(\rho) = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)\).
 +
 
 +
<pre>
 +
# R functions for inference
 +
partial_correlation_test <- function(X, Y, Z, method = "fisher") {
 +
  n <- length(X)
 +
  k <- if(is.null(dim(Z))) 1 else ncol(Z)
 +
  r <- calculate_partial_correlation(X, Y, Z, method = "regression")
 +
 
 +
  if (method == "fisher") {
 +
    # Fisher's z-transform
 +
    z <- 0.5 * log((1 + r) / (1 - r))
 +
    se <- 1 / sqrt(n - k - 3)
 +
    z_score <- z / se
 +
    p_value <- 2 * pnorm(-abs(z_score))
 +
   
 +
    # Confidence interval
 +
    z_lower <- z - qnorm(0.975) * se
 +
    z_upper <- z + qnorm(0.975) * se
 +
    ci <- c(tanh(z_lower), tanh(z_upper))
 +
   
 +
  } else if (method == "t_test") {
 +
    # Exact t-test (assumes normality)
 +
    t_stat <- r * sqrt((n - k - 2) / (1 - r^2))
 +
    p_value <- 2 * pt(-abs(t_stat), df = n - k - 2)
 +
    ci <- cor.test(X, Y)$conf.int  # Approximation
 +
  }
 +
 
 +
  return(list(
 +
    estimate = r,
 +
    statistic = if(method == "fisher") z_score else t_stat,
 +
    p_value = p_value,
 +
    confidence_interval = ci,
 +
    n = n,
 +
    df = n - k - 2
 +
  ))
 +
}
 +
 
 +
# Power calculation
 +
partial_correlation_power <- function(rho, n, k, alpha = 0.05) {
 +
  # rho: true partial correlation
 +
  # n: sample size
 +
  # k: number of controlling variables
 +
 
 +
  z_rho <- 0.5 * log((1 + rho) / (1 - rho))
 +
  se <- 1 / sqrt(n - k - 3)
 +
 
 +
  # Non-centrality parameter
 +
  lambda <- z_rho / se
 +
 
 +
  # Critical value under H0
 +
  z_crit <- qnorm(1 - alpha/2)
 +
 
 +
  # Power
 +
  power <- pnorm(-z_crit - lambda) + 1 - pnorm(z_crit - lambda)
 +
  return(power)
 +
}
 +
</pre>
 +
 
 +
====4) Semi-Partial (Part) Correlation====
 +
 
 +
The '''semi-partial correlation''' between \(X\) and \(Y\) controlling for \(Z\) from \(X\) only is:
 +
<math>
 +
r_{X(Y\cdot Z)} = \frac{r_{XY} - r_{XZ}r_{YZ}}{\sqrt{1 - r_{YZ}^2}}
 +
</math>
 +
Equivalently, it's the correlation between \(X\) and the residuals of \(Y\) regressed on \(Z\):
 +
<math>
 +
r_{X(Y\cdot Z)} = \text{Cor}(X, Y - \hat{Y}_Z)
 +
</math>
 +
 
 +
'''Key differences from partial correlation''':
 +
1. '''Denominator''': Semi-partial uses total variance of \(Y\), not residual variance
 +
2. '''Interpretation''': Proportion of total variance in \(Y\) uniquely explained by \(X\)
 +
3. '''Range''': \(|r_{X(Y\cdot Z)}| \leq |r_{XY\cdot Z}| \leq |r_{XY}|\)
 +
4. '''Asymmetry''': \(r_{X(Y\cdot Z)} \neq r_{Y(X\cdot Z)}\) generally
 +
 
 +
====5) Partial Autocorrelation in Time Series====
 +
 
 +
The '''partial autocorrelation function (PACF)''' at lag \(h\) is:
 +
<math>
 +
\phi(h) = \rho_{X_t X_{t+h} \cdot \{X_{t+1}, \ldots, X_{t+h-1}\}}
 +
</math>
 +
 
 +
For an AR(\(p\)) process, \(\phi(h) = 0\) for \(h > p\). The PACF is estimated via:
 +
1. '''Durbin-Levinson algorithm''': Recursive computation
 +
2. '''Regression approach''': \(\phi(h) = \) coefficient of \(X_{t-h}\) in regression of \(X_t\) on \(X_{t-1}, \ldots, X_{t-h}\)
 +
3. '''Matrix inversion''': Using the autocovariance matrix
 +
 
 +
'''Hypothesis testing''': Under \(H_0: \phi(h) = 0\),
 +
<math>
 +
\sqrt{n}\hat{\phi}(h) \sim N(0,1)
 +
</math>
 +
Approximate 95% confidence bands: \(\pm 1.96/\sqrt{n}\).
  
 
===Applications===
 
===Applications===
  
[http://bioinformatics.oxfordjournals.org/content/20/18/3565.short  This article] proposed to use a partial correlation analysis to construct approximate Undirected Dependency Graphs from large-scale biochemical data. This method enables a distinction between direct and indirect interactions of biochemical compounds, thereby inferring the underlying network topology. The method is first thoroughly evaluated with a large set of simulated data. Results indicate that the approach has good statistical power and a low False Discovery Rate even in the presence of noise in the data. We then applied the method to an existing data set of yeast gene expression. Several small gene networks were inferred and found to contain genes known to be collectively involved in particular biochemical processes. In some of these networks there are also uncharacterized ORFs present, which lead to hypotheses about their functions.
+
####Example 1: Medical Research - Controlling for Confounders
 +
<pre>
 +
# Load necessary libraries
 +
library(ppcor)
 +
library(ggplot2)
 +
library(GGally)
 +
 
 +
# Simulate medical data: Cholesterol (Y), Exercise (X), Age (Z1), BMI (Z2)
 +
set.seed(123)
 +
n <- 200
 +
age <- rnorm(n, mean = 50, sd = 10)
 +
bmi <- rnorm(n, mean = 25, sd = 4)
 +
exercise <- 5 - 0.05*age - 0.1*bmi + rnorm(n, sd = 2)
 +
cholesterol <- 200 + 0.8*age + 0.5*bmi - 0.7*exercise + rnorm(n, sd = 15)
 +
 
 +
medical_data <- data.frame(cholesterol, exercise, age, bmi)
 +
 
 +
cat("=== Medical Research Example ===\n")
 +
cat("Research question: Relationship between exercise and cholesterol,\n")
 +
cat("controlling for age and BMI.\n\n")
 +
 
 +
# 1. Simple correlations
 +
cat("1. Simple (Marginal) Correlations:\n")
 +
cor_matrix <- cor(medical_data)
 +
print(cor_matrix)
 +
 
 +
# 2. Partial correlation (exercise ~ cholesterol | age, bmi)
 +
cat("\n2. Partial Correlation Analysis:\n")
 +
pcor_result <- pcor.test(medical_data$exercise, medical_data$cholesterol,
 +
                        medical_data[, c("age", "bmi")])
 +
print(pcor_result)
 +
 
 +
# 3. Compare with semi-partial correlation
 +
cat("\n3. Semi-partial Correlation:\n")
 +
# Regress cholesterol on age and BMI
 +
lm_chol <- lm(cholesterol ~ age + bmi, data = medical_data)
 +
residuals_chol <- resid(lm_chol)
 +
semi_partial <- cor(medical_data$exercise, residuals_chol)
 +
cat("Semi-partial correlation (exercise with cholesterol residuals):",
 +
    round(semi_partial, 4), "\n")
 +
 
 +
# 4. Visualization
 +
par(mfrow = c(2, 2))
 +
 
 +
# Scatterplot matrix
 +
plot(medical_data[, 1:4], main = "Scatterplot Matrix")
 +
 
 +
# Partial correlation network
 +
if (require(qgraph)) {
 +
  pcor_network <- pcor(medical_data)$estimate
 +
  qgraph(pcor_network, layout = "spring",
 +
        labels = colnames(medical_data),
 +
        title = "Partial Correlation Network")
 +
}
 +
 
 +
# Comparison of correlations
 +
cor_types <- data.frame(
 +
  Type = c("Simple", "Partial", "Semi-partial"),
 +
  Value = c(cor(medical_data$exercise, medical_data$cholesterol),
 +
            pcor_result$estimate,
 +
            semi_partial)
 +
)
 +
 
 +
ggplot(cor_types, aes(x = Type, y = Value, fill = Type)) +
 +
  geom_bar(stat = "identity") +
 +
  ylim(-1, 1) +
 +
  labs(title = "Comparison of Correlation Measures",
 +
      subtitle = "Exercise vs Cholesterol",
 +
      y = "Correlation Coefficient") +
 +
  theme_minimal()
 +
 
 +
par(mfrow = c(1, 1))
 +
 
 +
# 5. Sensitivity analysis: How partial correlation changes with different controls
 +
cat("\n4. Sensitivity Analysis:\n")
 +
cat("Partial correlations with different sets of controls:\n")
  
 +
controls_list <- list(
 +
  "None" = NULL,
 +
  "Age only" = medical_data[, "age", drop = FALSE],
 +
  "BMI only" = medical_data[, "bmi", drop = FALSE],
 +
  "Age + BMI" = medical_data[, c("age", "bmi")]
 +
)
  
[http://www.tandfonline.com/doi/abs/10.1080/00949650008812035#.U-JiZxZTWdA  This article] compared empirical type I error and power of different permutation techniques that can be used for partial correlation analysis involving three data vectors and for partial Mantel tests. The partial Mantel test is a form of first-order partial correlation analysis involving three distance matrices, which is widely used in such fields as population genetics, ecology, anthropology, psychometry and sociology. The methods compared are the following:
+
for (name in names(controls_list)) {
*(1) permute the objects in one of the vectors (or matrices)
+
  if (is.null(controls_list[[name]])) {
*(2) permute the residuals of a null model
+
    r <- cor(medical_data$exercise, medical_data$cholesterol)
*(3)correlate residualized vector 1 (or matrix A) to residualized vector 2 (or matrix B); permute one of the residualized vectors (or matrices)  
+
  } else {
*(4) permute the residuals of a full model
+
    r <- calculate_partial_correlation(
 +
      medical_data$exercise,
 +
      medical_data$cholesterol,
 +
      controls_list[[name]],
 +
      method = "regression"
 +
    )
 +
  }
 +
  cat(sprintf("%-15s: r = %.4f\n", name, r))
 +
}
 +
</pre>
  
In the partial correlation study, the results were compared to those of the parametric t-test which provides a reference under normality. Simulations were carried out to measure the type I error and power of these permutation methods, using normal and non-normal data, without and with an outlier. There were 10 000 simulations for each situation (100 000 when n = 5); 999 permutations were produced per test where permutations were used. The recommended testing procedures are the following:(a) In partial correlation analysis, most methods can be used most of the time. The parametric t-test should not be used with highly skewed data. Permutation of the raw data should be avoided only when highly skewed data are combined with outliers in the covariable. Methods implying permutation of residuals, which are known to only have asymptotically exact significance levels, should not be used when highly skewed data are combined with small sample size. (b) In partial Mantel tests, method 2 can always be used, except when highly skewed data are combined with small sample size. (c) With small sample sizes, one should carefully examine the data before partial correlation or partial Mantel analysis. For highly skewed data, permutation of the raw data has correct type I error in the absence of outliers. When highly skewed data are combined with outliers in the covariable vector or matrix, it is still recommended to use the permutation of raw data. (d) Method 3 should never be used.
+
####Example 2: Gene Expression Network Analysis
 +
<pre>
 +
# Using gene expression data to demonstrate partial correlation in networks
 +
library(huge)
  
==Software==
+
# Simulate gene expression data with network structure
[http://cran.r-project.org/web/packages/ppcor/ppcor.pdf  Package ppcor]
+
set.seed(456)
+
n_genes <- 20
[http://www.yilab.gatech.edu/pcor.html Partial Correlation]
+
n_samples <- 100
  
==R Example:==
+
# Generate precision matrix with sparse structure
Consider the built-in data set in R called the stackloss, which contains measurements on operations in a plant that oxidized ammonia $(NH_{3})$ to make nitric acid $(HNO_{3})$ and we want to calculate the partial correlation between air flow speed and water temperature while controlling for acid concentration. So the idea would be to first formulate a linear regression with air flow as target and acid concentration as the predictor and calculate the residuals $e_{1}$; then formulate a linear regression with water temperature as target and acid concentration as target and calculate the residuals, $e_{2}$; finally calculate the correlation coefficient between the residuals from the first two steps and that is the partial correlation between air flow and water temperature while controlling for the effect of acid concentration.
+
prec_matrix <- diag(n_genes)
 +
for(i in 1:(n_genes-1)) {
 +
  for(j in (i+1):n_genes) {
 +
    if(runif(1) < 0.1) { # 10% connections
 +
      prec_matrix[i,j] <- prec_matrix[j,i] <- 0.3
 +
    }
 +
  }
 +
}
  
data(stackloss)
+
# Ensure positive definiteness
stackloss
+
diag(prec_matrix) <- abs(min(eigen(prec_matrix)$values)) + 1.1
  
<center>
+
# Generate multivariate normal data
{| class="wikitable" style="text-align:center; width 35%" border="1"
+
cov_matrix <- solve(prec_matrix)
|-
+
gene_data <- MASS::mvrnorm(n_samples, mu = rep(0, n_genes), Sigma = cov_matrix)
| || Air Flow|| Water Temp|| Acid Concentration|| Stack Loss
+
colnames(gene_data) <- paste0("Gene_", 1:n_genes)
|-
 
|1 ||80|| 27|| 89|| 42
 
|-
 
|2|| 80|| 27|| 88|| 37
 
|-
 
|3|| 75|| 25|| 90|| 37
 
|-
 
|4|| 62|| 24|| 87|| 28
 
|-
 
|5|| 62|| 22|| 87|| 18
 
|-
 
|6|| 62|| 23|| 87|| 18
 
|-
 
|7|| 62|| 24 ||93 ||19
 
|-
 
|8|| 62 ||24|| 93|| 20
 
|-
 
|9|| 58|| 23|| 87|| 15
 
|-
 
|10|| 58|| 18|| 80|| 14
 
|-
 
|11|| 58|| 18|| 89|| 14
 
|-
 
|12|| 58|| 17|| 88|| 13
 
|-
 
|13 ||58 ||18|| 82|| 11
 
|-
 
|14|| 58 ||19|| 93 ||12
 
|-
 
|15|| 50|| 18|| 89|| 8
 
|-
 
|16|| 50|| 18|| 86 ||7
 
|-
 
|17|| 50|| 19|| 72|| 8
 
|-
 
|18|| 50|| 19|| 79|| 8
 
|-
 
|19|| 50|| 20|| 80|| 9
 
|-
 
|20|| 56|| 20|| 82|| 15
 
|-
 
|21|| 70|| 20|| 91|| 15
 
|-
 
|}
 
</center>
 
  
 +
cat("=== Gene Expression Network Analysis ===\n")
  
summary(stackloss)
+
# 1. Simple correlation network
 +
simple_cor <- cor(gene_data)
 +
cat("\n1. Simple correlation density:",
 +
    mean(abs(simple_cor[upper.tri(simple_cor)])), "\n")
  
<center>
+
# 2. Partial correlation network (Graphical Lasso)
{| class="wikitable" style="text-align:center; width 35%" border="1"
+
if (require(huge)) {
|-
+
  huge_result <- huge(gene_data, method = "glasso")
|    Air Flow   ||    Water Temp  ||   Acid Concentration  ||    stackloss
+
  huge_select <- huge.select(huge_result, criterion = "ebic")
|-
+
 
| Min.   :50.00 ||  Min.   :17.0  || Min.  :72.00 ||  Min: 7.00
+
  # Estimated precision matrix
|-
+
  omega_est <- as.matrix(huge_select$opt.icov)
|1st Qu.:56.00 ||  1st Qu.:18.0 ||  1st Qu.:82.00  || 1st Qu.:11.00 
+
 
|-
+
   # Convert to partial correlations
| Median :58.00||  Median :20.0||  Median :87.00 ||  Median :15.00
+
   pcor_network <- -cov2cor(omega_est)
|- 
+
  diag(pcor_network) <- 1
| Mean   :60.43 ||  Mean   :21.1 ||  Mean   :86.29 ||  Mean   :17.52
+
    
|- 
+
   cat("2. Partial correlation density:",
|3rd Qu.:62.00 ||  3rd Qu.:24.0 ||  3rd Qu.:89.00||   3rd Qu.:19.00
+
      mean(abs(pcor_network[upper.tri(pcor_network)])), "\n")
|- 
+
    
|Max:80.00  || Max.  :27.0||   Max.   :93.00 ||  Max.   :42.00
+
  # Compare networks
|-
+
  cat("\n3. Network Comparison:\n")
|}
+
  cat("Number of edges (simple correlation > 0.3):",
</center>
+
      sum(abs(simple_cor[upper.tri(simple_cor)]) > 0.3), "\n")
 +
  cat("Number of edges (partial correlation > 0.3):",
 +
      sum(abs(pcor_network[upper.tri(pcor_network)]) > 0.3), "\n")
 +
 
 +
   # Visualize both networks
 +
   par(mfrow = c(1, 2))
 +
    
 +
   # Simple correlation network
 +
  qgraph(simple_cor, layout = "spring",
 +
        maximum = 0.5, minimum = -0.5,
 +
        title = "Simple Correlation Network")
 +
 
 +
  # Partial correlation network
 +
  qgraph(pcor_network, layout = "spring",
 +
        maximum = 0.5, minimum = -0.5,
 +
        title = "Partial Correlation Network")
 +
 
 +
   par(mfrow = c(1, 1))
 +
 
 +
  # 3. Test specific partial correlations
 +
   cat("\n4. Hypothesis Testing for Specific Gene Pairs:\n")
 +
    
 +
   # Test Gene_1 and Gene_2 controlling for others
 +
   test_result <- pcor.test(gene_data[, "Gene_1"],
 +
                          gene_data[, "Gene_2"],
 +
                          gene_data[, setdiff(colnames(gene_data),
 +
                                            c("Gene_1", "Gene_2"))])
 +
  print(test_result)
 +
}
 +
</pre>
  
 +
####Example 3: Time Series - Partial Autocorrelation Function
 +
<pre>
 +
# Time series analysis with PACF
 +
cat("=== Time Series Analysis: Partial Autocorrelation ===\n")
  
attach(stackloss)
+
# Generate AR(2) process: X_t = 0.5*X_{t-1} - 0.3*X_{t-2} + ε_t
## run linear regression with air ~ acid
+
set.seed(789)
lr1 <- lm(Air.Flow ~ Acid.Conc.)
+
n <- 500
res1 <- residuals(lr1)
+
epsilon <- rnorm(n)
## run linear regression with water ~ acid
+
x <- numeric(n)
lr2 <- lm(Water.Temp ~ Acid.Conc.)
+
x[1] <- epsilon[1]
res2 <- residuals(lr2)
+
x[2] <- 0.5*x[1] + epsilon[2]
## plot res1 vs. res2 to see if there is a linear trend
 
plot(res2, res1, main='Plot of residuals from two linear regression models
 
fitted',xlab='Residuals of Water ~ Acid',ylab='Residuals of Air ~ Acid')
 
  
 +
for(t in 3:n) {
 +
  x[t] <- 0.5*x[t-1] - 0.3*x[t-2] + epsilon[t]
 +
}
  
 +
# 1. Compute PACF using built-in function
 +
pacf_result <- pacf(x, lag.max = 20, plot = FALSE)
  
[[Image:Chapt3 Partial Corr 1.png]]
+
cat("\n1. PACF values (lags 1-10):\n")
 +
print(pacf_result$acf[1:10])
  
 +
# 2. Manual calculation via regression (for understanding)
 +
cat("\n2. Manual PACF calculation via regression:\n")
  
## calculate for partial correlation
+
manual_pacf <- function(x, lag) {
par.corr <- cor(res1,res2,method='spearman')
+
  if (lag == 1) {
## calculate the test statstics and p-value of the Spearman correlation coefficient
+
    return(cor(x[-1], x[-length(x)]))
n <- length(Air.Flow)
+
  } else {
21
+
    # Create design matrix
test.stat <- par.corr*sqrt((n-3)/(1-par.corr^2))
+
    n <- length(x)
test.stat
+
    X <- matrix(NA, nrow = n - lag, ncol = lag)
3.162624
+
    for (i in 1:lag) {
p.value <- 2*pt(abs(test.stat),n-3,lower.tail=F)
+
      X[, i] <- x[(lag+1-i):(n-i)]
  p.value
+
    }
0.005387061
+
   
 +
    # Fit regression and get coefficient for most recent lag
 +
    y <- x[(lag+1):n]
 +
    coefs <- lm(y ~ X)$coefficients
 +
    return(coefs[lag + 1]) # +1 for intercept
 +
  }
 +
}
  
Hence, we have significant evidence to reject the null hypothesis and conclude that Air Flow and Water Temperature are significantly correlated when controlling for acid concentration at 5% level of significance.
+
for (h in 1:5) {
 +
  cat(sprintf("Lag %d: PACF = %.4f (built-in) vs %.4f (manual)\n",
 +
              h, pacf_result$acf[h], manual_pacf(x, h)))
 +
}
  
 +
# 3. Visual comparison
 +
par(mfrow = c(2, 2))
  
Note that: in this method, we used for Spearman’s rho, the sampling distribution is $T= \hat{\rho}_{p} \sqrt{(n-2-k) \setminus (1- \hat{\rho}^{2}_{p})}, where T \sim t_{n-2-k}.$
+
# Time series plot
 +
plot(x, type = "l", main = "AR(2) Time Series",
 +
    xlab = "Time", ylab = "Value")
  
 +
# ACF plot
 +
acf(x, lag.max = 20, main = "Autocorrelation Function")
  
Same question and we use the package of ppcor in R this time:
+
# PACF plot
 +
pacf(x, lag.max = 20, main = "Partial Autocorrelation Function")
  
library(ppcor)
+
# Compare ACF and PACF
pcor(stackloss)   ##
+
plot(1:20, acf(x, lag.max = 20, plot = FALSE)$acf[-1],
calculate pairwise partial correlations for each pair of variables given others,  
+
    type = "h", col = "blue", lwd = 2,
methods could be pearson, kendall or spearman
+
    xlab = "Lag", ylab = "Correlation",
 +
    main = "ACF vs PACF", ylim = c(-0.5, 0.6))
 +
points(1:20 + 0.2, pacf_result$acf,
 +
      type = "h", col = "red", lwd = 2)
 +
legend("topright", legend = c("ACF", "PACF"),
 +
      col = c("blue", "red"), lwd = 2)
  
 +
par(mfrow = c(1, 1))
  
 +
# 4. Model identification
 +
cat("\n3. Model Identification:\n")
 +
cat("Based on PACF cutoff after lag 2, suggest AR(2) model.\n")
 +
cat("PACF significant at lags 1 and 2, insignificant thereafter.\n")
 +
</pre>
  
 +
===Advanced Topics===
  
$estimate$
+
####1) Regularized Partial Correlation
<center>
+
<pre>
{|class="wikitable" style="text-align:center; width 35%" border="1"
+
# Regularized estimation for high-dimensional data
|-
+
library(glasso)
| ||  Air.Flow ||Water.Temp|| Acid.Conc.|| stack.loss
 
|-
 
|Air.Flow ||  1.0000000|| -0.1693960|| 0.3838007 || 0.7896593
 
|-
 
|Water.Temp ||-0.1693960 || 1.0000000 || 0.1490278 || 0.6492456
 
|-
 
|Acid.Conc.||  0.3838007  ||0.1490278  ||1.0000000 ||-0.2297477
 
|-
 
|stack.loss || 0.7896593 || 0.6492456|| -0.2297477 || 1.0000000
 
|-
 
|}
 
</center>
 
  
 +
regularized_partial_correlation <- function(data, lambda = 0.1) {
 +
  # Graphical Lasso for sparse precision matrix estimation
 +
  S <- cov(data)
 +
  glasso_result <- glasso(S, rho = lambda)
 +
 
 +
  # Convert to partial correlations
 +
  omega <- glasso_result$wi
 +
  pcor_matrix <- -cov2cor(omega)
 +
  diag(pcor_matrix) <- 1
 +
 
 +
  return(list(
 +
    precision_matrix = omega,
 +
    partial_correlation = pcor_matrix,
 +
    lambda = lambda
 +
  ))
 +
}
  
 +
# Example with high-dimensional data
 +
set.seed(101)
 +
p <- 50  # Number of variables
 +
n <- 30  # Number of observations (n < p)
  
$p.value$
+
high_dim_data <- matrix(rnorm(n * p), n, p)
<center>
 
{| class="wikitable" style="text-align:center; width 35%" border="1"
 
|-
 
| ||          Air.Flow ||  Water.Temp ||Acid.Conc.||  stack.loss
 
|-
 
|Air.Flow ||  0.000000e+00 ||0.4785233722|| 0.08658525 ||1.116809e-07
 
|-
 
|Water.Temp|| 4.785234e-01|| 0.0000000000|| 0.53433865 ||4.322516e-04
 
|-
 
|Acid.Conc.|| 8.658525e-02|| 0.5343386526|| 0.00000000|| 3.303994e-01
 
|-
 
|stack.loss|| 1.116809e-07|| 0.0004322516|| 0.33039937|| 0.000000e+00
 
|-
 
|}
 
</center>
 
  
 +
cat("=== Regularized Partial Correlation (n < p) ===\n")
 +
cat("Dimensions:", n, "samples ×", p, "variables\n")
  
 +
# Standard correlation matrix is singular
 +
try_cor <- try(cor(high_dim_data), silent = TRUE)
 +
if (inherits(try_cor, "try-error")) {
 +
  cat("Standard correlation matrix is singular (n < p).\n")
 +
}
  
$statistic$
+
# Regularized estimation
<center>
+
reg_result <- regularized_partial_correlation(high_dim_data, lambda = 0.5)
{| class="wikitable" style="text-align:center; width 35%" border="1"
 
|-
 
| ||            Air.Flow ||Water.Temp|| Acid.Conc. ||stack.loss
 
|-
 
|Air.Flow  || 0.0000000|| -0.7086795 || 1.7136923 || 5.3066130
 
|-
 
|Water.Temp|| -0.7086795||  0.0000000||  0.6213967||  3.5195672
 
|-
 
|Acid.Conc.||  1.7136923||  0.6213967 || 0.0000000|| -0.9733098
 
|-
 
|stack.loss||  5.3066130 || 3.5195672 ||-0.9733098 || 0.0000000
 
|-
 
|}
 
</center>
 
  
 +
cat("\nRegularized partial correlation matrix computed successfully.\n")
 +
cat("Sparsity:",
 +
    mean(reg_result$partial_correlation[upper.tri(reg_result$partial_correlation)] == 0),
 +
    "proportion of zero partial correlations.\n")
 +
</pre>
  
$n
+
####2) Bayesian Partial Correlation
[1] 21
+
<pre>
 +
# Bayesian approach to partial correlation
 +
library(rstan)
  
$gp
+
bayesian_partial_correlation <- function(X, Y, Z, n_iter = 2000) {
[1] 2
+
  # Prepare data for Stan
 +
  stan_data <- list(
 +
    N = length(X),
 +
    X = X,
 +
    Y = Y,
 +
    Z = if(is.null(dim(Z))) matrix(Z, ncol = 1) else as.matrix(Z),
 +
    K = if(is.null(dim(Z))) 1 else ncol(Z)
 +
  )
 +
 
 +
  # Stan model for partial correlation
 +
  stan_model_code <- "
 +
  data {
 +
    int<lower=1> N;
 +
    vector[N] X;
 +
    vector[N] Y;
 +
    matrix[N, K] Z;
 +
    int<lower=1> K;
 +
  }
 +
  parameters {
 +
    real beta_X0;
 +
    real beta_Y0;
 +
    vector[K] beta_XZ;
 +
    vector[K] beta_YZ;
 +
    real<lower=-1, upper=1> rho_resid;
 +
    real<lower=0> sigma_X;
 +
    real<lower=0> sigma_Y;
 +
  }
 +
  model {
 +
    // Priors
 +
    beta_X0 ~ normal(0, 10);
 +
    beta_Y0 ~ normal(0, 10);
 +
    beta_XZ ~ normal(0, 5);
 +
    beta_YZ ~ normal(0, 5);
 +
    rho_resid ~ uniform(-1, 1);
 +
    sigma_X ~ cauchy(0, 2.5);
 +
    sigma_Y ~ cauchy(0, 2.5);
 +
   
 +
    // Residual covariance
 +
    for (n in 1:N) {
 +
      real mu_X = beta_X0 + Z[n] * beta_XZ;
 +
      real mu_Y = beta_Y0 + Z[n] * beta_YZ;
 +
      [X[n], Y[n]]' ~ multi_normal([mu_X, mu_Y]',
 +
                                  [[sigma_X^2, rho_resid*sigma_X*sigma_Y],
 +
                                    [rho_resid*sigma_X*sigma_Y, sigma_Y^2]]);
 +
    }
 +
  }
 +
  generated quantities {
 +
    // Partial correlation
 +
    real partial_corr = rho_resid;
 +
  }
 +
  "
 +
 
 +
  # Fit model
 +
  fit <- stan(model_code = stan_model_code,
 +
              data = stan_data,
 +
              iter = n_iter,
 +
              chains = 4)
 +
 
 +
  # Extract results
 +
  samples <- extract(fit)
 +
 
 +
  return(list(
 +
    posterior_mean = mean(samples$partial_corr),
 +
    posterior_median = median(samples$partial_corr),
 +
    credible_interval = quantile(samples$partial_corr, c(0.025, 0.975)),
 +
    posterior_samples = samples$partial_corr
 +
  ))
 +
}
  
$method
+
# Example
[1] "pearson"
+
cat("\n=== Bayesian Partial Correlation ===\n")
 +
bayes_result <- bayesian_partial_correlation(
 +
  medical_data$exercise,
 +
  medical_data$cholesterol,
 +
  medical_data[, c("age", "bmi")],
 +
  n_iter = 1000
 +
)
  
pcor.test(Air.Flow,Water.Temp,Acid.Conc.)
+
print(bayes_result)
  estimate      p.value statistic  n gp  Method
+
</pre>
1 0.7356413 4.073273e-06  4.607608 21  1 pearson
 
  
pcor.test(Air.Flow,Water.Temp,Acid.Conc.,method='spearman')
+
####3) Causal Inference and Partial Correlation
    estimate      p.value statistic  n gp  Method
+
<pre>
1 0.6974934 3.634416e-05  4.12957 21  1 spearman
+
# Partial correlation in causal inference
 +
library(pcalg)
  
pcor.test(Air.Flow,Water.Temp,c(Acid.Conc.,stack.loss))
+
causal_partial_correlation <- function(data) {
#calculate partial correlation between air flow and water temperature while controlling for
+
  # Estimate causal structure using PC algorithm
air concentration and stack loss
+
  suffStat <- list(C = cor(data), n = nrow(data))
 +
  pc_fit <- pc(suffStat, indepTest = gaussCItest,
 +
              p = ncol(data), alpha = 0.05)
 +
 
 +
  # Extract partial correlations for edges in the graph
 +
  adj_matrix <- as(pc_fit, "matrix")
 +
 
 +
  # Compute partial correlations for connected pairs
 +
  p <- ncol(data)
 +
  pcor_matrix <- matrix(0, p, p)
 +
 
 +
  for (i in 1:(p-1)) {
 +
    for (j in (i+1):p) {
 +
      if (adj_matrix[i, j] != 0) {
 +
        # Find separating set
 +
        sep_set <- pc_fit@sepset[[i]][[j]]
 +
        if (length(sep_set) > 0) {
 +
          pcor <- pcor.test(data[, i], data[, j],
 +
                          data[, sep_set])$estimate
 +
          pcor_matrix[i, j] <- pcor_matrix[j, i] <- pcor
 +
        }
 +
      }
 +
    }
 +
  }
 +
 
 +
  return(list(
 +
    graph = pc_fit,
 +
    adjacency = adj_matrix,
 +
    partial_correlations = pcor_matrix
 +
  ))
 +
}
  
estimate      p.value statistic  n gp  Method
+
# Example with simulated causal structure
1 0.7762696 1.471018e-14  7.690029 42  1 pearson
+
set.seed(2023)
 +
n_causal <- 100
 +
causal_data <- data.frame(
 +
  A = rnorm(n_causal),
 +
  B = 0.5 * rnorm(n_causal) + 0.3 * A + rnorm(n_causal, sd = 0.1),
 +
  C = 0.4 * rnorm(n_causal) + 0.2 * A + 0.3 * B + rnorm(n_causal, sd = 0.1),
 +
  D = 0.3 * rnorm(n_causal) + 0.1 * B + 0.2 * C + rnorm(n_causal, sd = 0.1)
 +
)
  
 +
cat("=== Causal Inference with Partial Correlation ===\n")
 +
causal_result <- causal_partial_correlation(causal_data)
  
===Problems===
+
cat("Estimated causal graph (adjacency matrix):\n")
 +
print(causal_result$adjacency)
  
y.data <- data.frame(hl=c(7,15,19,15,21,22,57,15,20,18),
+
cat("\nPartial correlations for direct connections:\n")
disp=c( 0.000,0.964,0.000,0.000,0.921,0.000,0.000,1.006,0.000,1.011),
+
print(causal_result$partial_correlations)
deg=c(9,2,3,4,1,3,1,3,6,1),
+
</pre>
BC=c(1.78 e-02,1.05e-06,1.37e-05,7.18e-03,0.00e+00,0.00e+00,0.00e+00,4.48e-03,2.10e-06,0.00e+00))
 
  
*1) Apply this dataset and calculate the partial correlation between ‘hl’ and ‘disp’ given ‘deg’ using the linear regression method as well as the ppcor package. State and compare your results.
+
===Software Implementation===
  
*2) Apply this dataset and calculate the partial correlation between ‘hl’ and ‘disp’ given ‘deg’ and ‘BC” using the ppcor package introduced in the lecture. Sate your conclusions.
+
<pre>
 +
# Comprehensive partial correlation analysis function
 +
run_partial_correlation_analysis <- function(data, var1, var2, controls,
 +
                                            method = "standard",
 +
                                            inference = TRUE,
 +
                                            visualization = TRUE) {
 +
 
 +
  cat("=== PARTIAL CORRELATION ANALYSIS ===\n\n")
 +
  cat("Variable 1:", var1, "\n")
 +
  cat("Variable 2:", var2, "\n")
 +
  cat("Control variables:", paste(controls, collapse = ", "), "\n")
 +
  cat("Method:", method, "\n\n")
 +
 
 +
  # Extract variables
 +
  X <- data[[var1]]
 +
  Y <- data[[var2]]
 +
  Z <- if(length(controls) > 0) data[, controls, drop = FALSE] else NULL
 +
 
 +
  # 1. Simple correlation
 +
  simple_cor <- cor(X, Y)
 +
  cat("1. Simple correlation:", round(simple_cor, 4), "\n")
 +
 
 +
  # 2. Partial correlation
 +
  if (method == "standard") {
 +
    if (!is.null(Z)) {
 +
      pcor_result <- pcor.test(X, Y, Z)
 +
      cat("\n2. Partial correlation:", round(pcor_result$estimate, 4), "\n")
 +
      if (inference) {
 +
        cat("  p-value:", format.pval(pcor_result$p.value, digits = 4), "\n")
 +
        cat("  95% CI: [",
 +
            round(pcor_result$conf.int[1], 4), ", ",
 +
            round(pcor_result$conf.int[2], 4), "]\n", sep = "")
 +
      }
 +
    } else {
 +
      cat("\nNo control variables specified.\n")
 +
    }
 +
   
 +
  } else if (method == "regularized") {
 +
    # Regularized partial correlation
 +
    all_data <- cbind(X, Y, Z)
 +
    reg_result <- regularized_partial_correlation(all_data, lambda = 0.1)
 +
    pcor_value <- reg_result$partial_correlation[1, 2]
 +
    cat("\n2. Regularized partial correlation:", round(pcor_value, 4), "\n")
 +
    cat("  Regularization parameter lambda:", reg_result$lambda, "\n")
 +
  }
 +
 
 +
  # 3. Semi-partial correlation
 +
  if (!is.null(Z)) {
 +
    # Regress Y on Z
 +
    formula_y <- as.formula(paste(var2, "~", paste(controls, collapse = "+")))
 +
    lm_y <- lm(formula_y, data = data)
 +
    residuals_y <- resid(lm_y)
 +
   
 +
    semi_partial <- cor(X, residuals_y)
 +
    cat("\n3. Semi-partial correlation:", round(semi_partial, 4), "\n")
 +
    cat("  (Correlation of", var1, "with residuals of", var2, "after controlling for Z)\n")
 +
  }
 +
 
 +
  # 4. Visualization
 +
  if (visualization) {
 +
    par(mfrow = c(2, 2))
 +
   
 +
    # Scatterplot
 +
    plot(X, Y, main = paste("Scatterplot:", var1, "vs", var2),
 +
        xlab = var1, ylab = var2, pch = 19, col = "blue")
 +
   
 +
    # Residual plots if controls specified
 +
    if (!is.null(Z)) {
 +
      # Residuals after controlling
 +
      res_X <- resid(lm(as.formula(paste(var1, "~", paste(controls, collapse = "+"))),
 +
                      data = data))
 +
      res_Y <- resid(lm(formula_y, data = data))
 +
     
 +
      plot(res_X, res_Y,
 +
          main = paste("Residual Scatterplot\n(Controlling for",
 +
                      paste(controls, collapse = ", "), ")"),
 +
          xlab = paste("Residuals of", var1),
 +
          ylab = paste("Residuals of", var2),
 +
          pch = 19, col = "red")
 +
     
 +
      # Comparison plot
 +
      cor_types <- data.frame(
 +
        Type = factor(c("Simple", "Partial", "Semi-partial"),
 +
                    levels = c("Simple", "Partial", "Semi-partial")),
 +
        Value = c(simple_cor,
 +
                if(exists("pcor_result")) pcor_result$estimate else NA,
 +
                semi_partial)
 +
      )
 +
     
 +
      barplot(cor_types$Value, names.arg = cor_types$Type,
 +
              ylim = c(-1, 1), col = c("blue", "red", "green"),
 +
              main = "Comparison of Correlation Measures",
 +
              ylab = "Correlation Coefficient")
 +
      abline(h = 0, lty = 2)
 +
    }
 +
   
 +
    # Network visualization if multiple variables
 +
    if (ncol(data) >= 3) {
 +
      pcor_matrix <- pcor(data)$estimate
 +
      qgraph(pcor_matrix, layout = "spring",
 +
            labels = colnames(data),
 +
            title = "Partial Correlation Network",
 +
            maximum = 0.8)
 +
    }
 +
   
 +
    par(mfrow = c(1, 1))
 +
  }
 +
 
 +
  # Return results
 +
  results <- list(
 +
    simple_correlation = simple_cor,
 +
    partial_correlation = if(exists("pcor_result")) pcor_result$estimate else NA,
 +
    semi_partial_correlation = if(exists("semi_partial")) semi_partial else NA,
 +
    p_value = if(exists("pcor_result")) pcor_result$p.value else NA
 +
  )
 +
 
 +
  return(results)
 +
}
 +
 
 +
# Example usage with mtcars dataset
 +
cat("\n=== EXAMPLE: Partial Correlation with mtcars ===\n")
 +
data(mtcars)
 +
 
 +
# Analyze relationship between mpg and hp, controlling for wt and cyl
 +
results <- run_partial_correlation_analysis(
 +
  data = mtcars,
 +
  var1 = "mpg",
 +
  var2 = "hp",
 +
  controls = c("wt", "cyl"),
 +
  method = "standard",
 +
  inference = TRUE,
 +
  visualization = TRUE
 +
)
 +
</pre>
 +
 
 +
===Common Issues and Solutions===
 +
 
 +
####1) Multicollinearity Among Controls
 +
<pre>
 +
# Check for multicollinearity in control variables
 +
check_multicollinearity <- function(Z) {
 +
  if (is.null(dim(Z))) Z <- as.matrix(Z)
 +
 
 +
  # Variance Inflation Factor (VIF)
 +
  if (require(car)) {
 +
    # Create a dummy regression to compute VIF
 +
    dummy_y <- rnorm(nrow(Z))
 +
    dummy_lm <- lm(dummy_y ~ Z)
 +
    vif_values <- vif(dummy_lm)
 +
   
 +
    cat("Variance Inflation Factors (VIF):\n")
 +
    print(vif_values)
 +
   
 +
    # Rule of thumb: VIF > 10 indicates problematic multicollinearity
 +
    problematic <- vif_values > 10
 +
    if (any(problematic)) {
 +
      cat("\nWarning: High multicollinearity detected in control variables.\n")
 +
      cat("Consider removing or combining variables with VIF > 10.\n")
 +
    }
 +
  }
 +
 
 +
  # Condition number
 +
  if (ncol(Z) > 1) {
 +
    kappa_value <- kappa(cor(Z), exact = TRUE)
 +
    cat("\nCondition number of correlation matrix:", round(kappa_value, 2), "\n")
 +
    if (kappa_value > 30) {
 +
      cat("Warning: High condition number indicates multicollinearity.\n")
 +
    }
 +
  }
 +
}
 +
 
 +
# Example
 +
check_multicollinearity(mtcars[, c("wt", "cyl", "disp")])
 +
</pre>
 +
 
 +
####2) Missing Data Handling
 +
<pre>
 +
# Multiple approaches for missing data in partial correlation
 +
partial_correlation_missing <- function(X, Y, Z, method = "complete") {
 +
 
 +
  if (method == "complete") {
 +
    # Listwise deletion
 +
    complete_cases <- complete.cases(cbind(X, Y, Z))
 +
    X_comp <- X[complete_cases]
 +
    Y_comp <- Y[complete_cases]
 +
    Z_comp <- if(!is.null(dim(Z))) Z[complete_cases, ] else Z[complete_cases]
 +
   
 +
    return(pcor.test(X_comp, Y_comp, Z_comp))
 +
   
 +
  } else if (method == "pairwise") {
 +
    # Pairwise deletion (not recommended for partial correlation)
 +
    cat("Warning: Pairwise deletion can lead to inconsistent results in partial correlation.\n")
 +
   
 +
  } else if (method == "multiple_imputation") {
 +
    # Multiple imputation
 +
    if (require(mice)) {
 +
      data_complete <- cbind(X, Y, Z)
 +
      imp <- mice(data_complete, m = 5, printFlag = FALSE)
 +
     
 +
      # Fit model on each imputed dataset
 +
      fits <- with(imp, pcor.test(X, Y, Z))
 +
     
 +
      # Pool results
 +
      pooled <- pool(fits)
 +
      return(summary(pooled))
 +
    }
 +
  }
 +
}
 +
 
 +
# Example with missing data
 +
set.seed(303)
 +
n_missing <- 50
 +
X_miss <- rnorm(n_missing)
 +
Y_miss <- 0.5 * X_miss + rnorm(n_missing, sd = 0.5)
 +
Z_miss <- 0.3 * X_miss + 0.4 * Y_miss + rnorm(n_missing, sd = 0.3)
 +
 
 +
# Introduce missing values
 +
X_miss[sample(1:n_missing, 10)] <- NA
 +
Y_miss[sample(1:n_missing, 10)] <- NA
 +
Z_miss[sample(1:n_missing, 10)] <- NA
 +
 
 +
cat("=== Partial Correlation with Missing Data ===\n")
 +
cat("Complete cases method:\n")
 +
print(partial_correlation_missing(X_miss, Y_miss, Z_miss, method = "complete"))
 +
</pre>
 +
 
 +
===Problems and Exercises===
 +
 
 +
1. '''Conceptual Problems''':
 +
  a) Prove that for jointly Gaussian variables, zero partial correlation implies conditional independence
 +
  b) Derive the relationship between partial correlation and regression coefficients
 +
  c) Show that \(|r_{X(Y\cdot Z)}| \leq |r_{XY\cdot Z}| \leq |r_{XY}|\)
 +
 
 +
2. '''Applied Problems''':
 +
  a) Analyze the relationship between `mpg` and `qsec` in the `mtcars` dataset, controlling for `wt` and `hp`
 +
  b) Compute the partial correlation matrix for the `iris` dataset (sepal and petal measurements)
 +
  c) Conduct a power analysis for detecting partial correlation ρ = 0.3 with 3 control variables, α = 0.05, power = 0.80
 +
 
 +
3. '''Simulation Study''':
 +
<pre>
 +
# Simulation to understand partial correlation properties
 +
simulate_partial_correlation <- function(n_sim = 1000, n = 100,
 +
                                        true_rho = 0.5, n_controls = 2) {
 +
 
 +
  estimates <- numeric(n_sim)
 +
  p_values <- numeric(n_sim)
 +
  coverage <- numeric(n_sim)
 +
 
 +
  for (i in 1:n_sim) {
 +
    # Generate data with specified partial correlation
 +
    # Generate control variables
 +
    Z <- matrix(rnorm(n * n_controls), n, n_controls)
 +
   
 +
    # Generate residuals with specified correlation
 +
    sigma <- matrix(c(1, true_rho, true_rho, 1), 2, 2)
 +
    residuals <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma)
 +
   
 +
    # Generate X and Y
 +
    beta_X <- rnorm(n_controls)
 +
    beta_Y <- rnorm(n_controls)
 +
   
 +
    X <- Z %*% beta_X + residuals[, 1]
 +
    Y <- Z %*% beta_Y + residuals[, 2]
 +
   
 +
    # Estimate partial correlation
 +
    result <- pcor.test(X, Y, Z)
 +
    estimates[i] <- result$estimate
 +
    p_values[i] <- result$p.value
 +
   
 +
    # Check coverage of 95% CI
 +
    ci <- result$conf.int
 +
    coverage[i] <- (true_rho >= ci[1] & true_rho <= ci[2])
 +
  }
 +
 
 +
  cat("Simulation Results (n =", n_sim, "):\n")
 +
  cat("True partial correlation:", true_rho, "\n")
 +
  cat("Mean estimate:", mean(estimates), "\n")
 +
  cat("Bias:", mean(estimates) - true_rho, "\n")
 +
  cat("RMSE:", sqrt(mean((estimates - true_rho)^2)), "\n")
 +
  cat("Type I error rate (for true_rho = 0):", mean(p_values < 0.05), "\n")
 +
  cat("Coverage of 95% CI:", mean(coverage), "\n")
 +
 
 +
  return(data.frame(
 +
    estimate = estimates,
 +
    p_value = p_values,
 +
    coverage = coverage
 +
  ))
 +
}
 +
 
 +
# Run simulation
 +
sim_results <- simulate_partial_correlation(n_sim = 500, n = 50,
 +
                                          true_rho = 0.3, n_controls = 2)
 +
</pre>
  
 
===References===
 
===References===
[http://mirlyn.lib.umich.edu/Record/004199238 Statistical inference / George Casella, Roger L. Berger]
 
  
[http://mirlyn.lib.umich.edu/Record/004232056  Sampling / Steven K. Thompson. ]
+
1. Whittaker, J. (1990). *Graphical Models in Applied Multivariate Statistics*. Wiley.
 +
 
 +
2. Edwards, D. (2000). *Introduction to Graphical Modelling* (2nd ed.). Springer.
 +
 
 +
3. Pearl, J. (2009). *Causality: Models, Reasoning, and Inference* (2nd ed.). Cambridge University Press.
 +
 
 +
4. Koller, D., & Friedman, N. (2009). *Probabilistic Graphical Models: Principles and Techniques*. MIT Press.
  
[http://mirlyn.lib.umich.edu/Record/004133572  Sampling theory and methods / S. Sampath. ]
+
5. Bühlmann, P., & van de Geer, S. (2011). *Statistics for High-Dimensional Data: Methods, Theory and Applications*. Springer.
  
 +
===Online Resources===
  
 +
* [https://cran.r-project.org/web/views/Multivariate.html Multivariate Statistics in R]
 +
* [https://cran.r-project.org/web/packages/ppcor/index.html ppcor Package Documentation]
 +
* [https://www.socr.umich.edu SOCR Resources and Datasets]
 +
* [https://www.bnlearn.com/ Bayesian Network Learning]
  
 
<hr>
 
<hr>
* SOCR Home page: http://www.socr.umich.edu
+
* SOCR Home page: https://www.socr.umich.edu
  
 
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PartialCorrelation}}
 
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_PartialCorrelation}}

Revision as of 14:04, 9 December 2025

Scientific Methods for Health Sciences - Partial Correlation

Overview

Partial correlation measures the degree of association between two random variables after removing the linear effects of one or more controlling variables. It quantifies the unique relationship between two variables while statistically controlling for potential confounding factors. Partial correlation is fundamental in: - Identifying direct relationships in multivariate systems - Controlling for confounding variables in observational studies - Network analysis and graphical models - Time series analysis (partial autocorrelation)

Motivation

Consider a study examining the relationship between exercise frequency and cholesterol levels. Age is known to affect both variables: older individuals tend to exercise less and have higher cholesterol. Simple correlation between exercise and cholesterol would be confounded by age. Partial correlation addresses this by:

1. Removing spurious correlations due to common causes

2. Identifying direct relationships in complex systems

3. Testing conditional independence in graphical models

4. Decomposing multivariate relationships into direct and indirect effects

Theory

1) Mathematical Foundations

Definition

The partial correlation between variables \(X\) and \(Y\) given a set of controlling variables \(Z = \{Z_1, Z_2, \ldots, Z_n\}\) is defined as\[ \rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ}\rho_{YZ}}{\sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)}} \] for a single controlling variable \(Z\), and more generally as\[ \rho_{XY \cdot Z} = \frac{\text{Cov}(X_{\perp Z}, Y_{\perp Z})}{\sqrt{\text{Var}(X_{\perp Z})\text{Var}(Y_{\perp Z})}} \] where \(X_{\perp Z}\) and \(Y_{\perp Z}\) are the residuals from regressing \(X\) and \(Y\) on \(Z\).

Matrix Formulation

For a set of variables with covariance matrix \(\boldsymbol{\Sigma}\), let \(\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}\) be the precision matrix. The partial correlation between \(X_i\) and \(X_j\) controlling for all other variables is\[ \rho_{X_i X_j \cdot V\setminus\{X_i,X_j\}} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\omega_{jj}}} \] where \(\omega_{ij}\) are elements of \(\boldsymbol{\Omega}\).

This relationship reveals that zero partial correlation implies conditional independence for jointly Gaussian variables\[ \rho_{X_i X_j \cdot V\setminus\{X_i,X_j\}} = 0 \iff X_i \perp\!\!\!\perp X_j \mid V\setminus\{X_i,X_j\} \]

Geometric Interpretation

In vector space terminology, partial correlation corresponds to the cosine of the angle between the residual vectors after projecting \(X\) and \(Y\) onto the subspace spanned by \(Z\)\[ \rho_{XY\cdot Z} = \cos(\theta_{R_X R_Y}) \] where \(R_X = X - P_Z(X)\) and \(R_Y = Y - P_Z(Y)\) are residuals, and \(P_Z\) denotes projection onto the span of \(Z\).

2) Calculation Methods

Method 1: Residual Regression

1. Regress \(X\) on \(Z\): \(X = \boldsymbol{\beta}_X^\top Z + \varepsilon_X\) 2. Regress \(Y\) on \(Z\): \(Y = \boldsymbol{\beta}_Y^\top Z + \varepsilon_Y\) 3. Compute correlation between residuals: \(\rho_{XY\cdot Z} = \text{Cor}(\varepsilon_X, \varepsilon_Y)\)

The sample estimate is\[ \hat{\rho}_{XY\cdot Z} = \frac{\sum_{i=1}^n \hat{\varepsilon}_{X,i}\hat{\varepsilon}_{Y,i}}{\sqrt{\sum_{i=1}^n \hat{\varepsilon}_{X,i}^2 \sum_{i=1}^n \hat{\varepsilon}_{Y,i}^2}} \]

Method 2: Recursive Formula

For nested sets of controlling variables \(Z \subset W\)\[ \rho_{XY\cdot W} = \frac{\rho_{XY\cdot Z} - \rho_{XW_n\cdot Z\setminus\{W_n\}}\rho_{YW_n\cdot Z\setminus\{W_n\}}}{\sqrt{(1-\rho_{XW_n\cdot Z\setminus\{W_n\}}^2)(1-\rho_{YW_n\cdot Z\setminus\{W_n\}}^2)}} \] where \(W_n\) is any variable in \(W\setminus Z\).

This allows efficient computation with time complexity \(O(p^3)\) for \(p\) variables using dynamic programming.

Method 3: Matrix Inversion

From the precision matrix \(\boldsymbol{\Omega} = \boldsymbol{\Sigma}^{-1}\)\[ \hat{\rho}_{X_i X_j \cdot V\setminus\{X_i,X_j\}} = -\frac{\hat{\omega}_{ij}}{\sqrt{\hat{\omega}_{ii}\hat{\omega}_{jj}}} \] This method is particularly efficient for computing all pairwise partial correlations simultaneously.

# R function implementing all three methods
calculate_partial_correlation <- function(X, Y, Z, method = "inversion") {
  n <- length(X)
  
  if (method == "regression") {
    # Method 1: Residual regression
    res_X <- resid(lm(X ~ Z))
    res_Y <- resid(lm(Y ~ Z))
    return(cor(res_X, res_Y))
    
  } else if (method == "recursive") {
    # Method 2: Recursive formula (for single Z)
    if (is.null(dim(Z))) {  # Single controlling variable
      r_xy <- cor(X, Y)
      r_xz <- cor(X, Z)
      r_yz <- cor(Y, Z)
      return((r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)))
    } else {
      # For multiple Z, use recursive approach
      pcor_matrix <- pcor(Z)$estimate
      # Extract relevant partial correlations
      # (This is simplified; full implementation would handle recursion)
    }
    
  } else if (method == "inversion") {
    # Method 3: Matrix inversion
    data_matrix <- cbind(X, Y, Z)
    sigma <- cov(data_matrix)
    omega <- solve(sigma)
    pcor <- -omega[1,2] / sqrt(omega[1,1] * omega[2,2])
    return(pcor)
  }
}

3) Statistical Inference

Hypothesis Testing

Test \(H_0: \rho_{XY\cdot Z} = 0\) vs \(H_1: \rho_{XY\cdot Z} \neq 0\).

Fisher's z-transform\[ z = \frac{1}{2}\ln\left(\frac{1+\hat{\rho}}{1-\hat{\rho}}\right) \sim N\left(\frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right), \frac{1}{n-|Z|-3}\right) \]

Test statistic\[ T = \frac{z - \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)}{\sqrt{1/(n-|Z|-3)}} \sim N(0,1) \] Under \(H_0: \rho = 0\), this simplifies to \(T = z\sqrt{n-|Z|-3}\).

Exact t-test (for normally distributed data)\[ t = \hat{\rho}\sqrt{\frac{n-|Z|-2}{1-\hat{\rho}^2}} \sim t_{n-|Z|-2} \]

Confidence Intervals

Using Fisher's transform\[ \text{CI}_{1-\alpha} = \left[\tanh\left(z - \frac{z_{1-\alpha/2}}{\sqrt{n-|Z|-3}}\right), \tanh\left(z + \frac{z_{1-\alpha/2}}{\sqrt{n-|Z|-3}}\right)\right] \] where \(\tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}\) is the hyperbolic tangent.

Power Analysis

The required sample size to detect partial correlation \(\rho\) with power \(1-\beta\) at level \(\alpha\)\[ n = |Z| + 3 + \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{z(\rho)}\right)^2 \] where \(z(\rho) = \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)\).

# R functions for inference
partial_correlation_test <- function(X, Y, Z, method = "fisher") {
  n <- length(X)
  k <- if(is.null(dim(Z))) 1 else ncol(Z)
  r <- calculate_partial_correlation(X, Y, Z, method = "regression")
  
  if (method == "fisher") {
    # Fisher's z-transform
    z <- 0.5 * log((1 + r) / (1 - r))
    se <- 1 / sqrt(n - k - 3)
    z_score <- z / se
    p_value <- 2 * pnorm(-abs(z_score))
    
    # Confidence interval
    z_lower <- z - qnorm(0.975) * se
    z_upper <- z + qnorm(0.975) * se
    ci <- c(tanh(z_lower), tanh(z_upper))
    
  } else if (method == "t_test") {
    # Exact t-test (assumes normality)
    t_stat <- r * sqrt((n - k - 2) / (1 - r^2))
    p_value <- 2 * pt(-abs(t_stat), df = n - k - 2)
    ci <- cor.test(X, Y)$conf.int  # Approximation
  }
  
  return(list(
    estimate = r,
    statistic = if(method == "fisher") z_score else t_stat,
    p_value = p_value,
    confidence_interval = ci,
    n = n,
    df = n - k - 2
  ))
}

# Power calculation
partial_correlation_power <- function(rho, n, k, alpha = 0.05) {
  # rho: true partial correlation
  # n: sample size
  # k: number of controlling variables
  
  z_rho <- 0.5 * log((1 + rho) / (1 - rho))
  se <- 1 / sqrt(n - k - 3)
  
  # Non-centrality parameter
  lambda <- z_rho / se
  
  # Critical value under H0
  z_crit <- qnorm(1 - alpha/2)
  
  # Power
  power <- pnorm(-z_crit - lambda) + 1 - pnorm(z_crit - lambda)
  return(power)
}

4) Semi-Partial (Part) Correlation

The semi-partial correlation between \(X\) and \(Y\) controlling for \(Z\) from \(X\) only is\[ r_{X(Y\cdot Z)} = \frac{r_{XY} - r_{XZ}r_{YZ}}{\sqrt{1 - r_{YZ}^2}} \] Equivalently, it's the correlation between \(X\) and the residuals of \(Y\) regressed on \(Z\)\[ r_{X(Y\cdot Z)} = \text{Cor}(X, Y - \hat{Y}_Z) \]

Key differences from partial correlation: 1. Denominator: Semi-partial uses total variance of \(Y\), not residual variance 2. Interpretation: Proportion of total variance in \(Y\) uniquely explained by \(X\) 3. Range: \(|r_{X(Y\cdot Z)}| \leq |r_{XY\cdot Z}| \leq |r_{XY}|\) 4. Asymmetry: \(r_{X(Y\cdot Z)} \neq r_{Y(X\cdot Z)}\) generally

5) Partial Autocorrelation in Time Series

The partial autocorrelation function (PACF) at lag \(h\) is\[ \phi(h) = \rho_{X_t X_{t+h} \cdot \{X_{t+1}, \ldots, X_{t+h-1}\}} \]

For an AR(\(p\)) process, \(\phi(h) = 0\) for \(h > p\). The PACF is estimated via: 1. Durbin-Levinson algorithm: Recursive computation 2. Regression approach: \(\phi(h) = \) coefficient of \(X_{t-h}\) in regression of \(X_t\) on \(X_{t-1}, \ldots, X_{t-h}\) 3. Matrix inversion: Using the autocovariance matrix

Hypothesis testing: Under \(H_0: \phi(h) = 0\), \( \sqrt{n}\hat{\phi}(h) \sim N(0,1) \) Approximate 95% confidence bands: \(\pm 1.96/\sqrt{n}\).

Applications

        1. Example 1: Medical Research - Controlling for Confounders
# Load necessary libraries
library(ppcor)
library(ggplot2)
library(GGally)

# Simulate medical data: Cholesterol (Y), Exercise (X), Age (Z1), BMI (Z2)
set.seed(123)
n <- 200
age <- rnorm(n, mean = 50, sd = 10)
bmi <- rnorm(n, mean = 25, sd = 4)
exercise <- 5 - 0.05*age - 0.1*bmi + rnorm(n, sd = 2)
cholesterol <- 200 + 0.8*age + 0.5*bmi - 0.7*exercise + rnorm(n, sd = 15)

medical_data <- data.frame(cholesterol, exercise, age, bmi)

cat("=== Medical Research Example ===\n")
cat("Research question: Relationship between exercise and cholesterol,\n")
cat("controlling for age and BMI.\n\n")

# 1. Simple correlations
cat("1. Simple (Marginal) Correlations:\n")
cor_matrix <- cor(medical_data)
print(cor_matrix)

# 2. Partial correlation (exercise ~ cholesterol | age, bmi)
cat("\n2. Partial Correlation Analysis:\n")
pcor_result <- pcor.test(medical_data$exercise, medical_data$cholesterol, 
                        medical_data[, c("age", "bmi")])
print(pcor_result)

# 3. Compare with semi-partial correlation
cat("\n3. Semi-partial Correlation:\n")
# Regress cholesterol on age and BMI
lm_chol <- lm(cholesterol ~ age + bmi, data = medical_data)
residuals_chol <- resid(lm_chol)
semi_partial <- cor(medical_data$exercise, residuals_chol)
cat("Semi-partial correlation (exercise with cholesterol residuals):", 
    round(semi_partial, 4), "\n")

# 4. Visualization
par(mfrow = c(2, 2))

# Scatterplot matrix
plot(medical_data[, 1:4], main = "Scatterplot Matrix")

# Partial correlation network
if (require(qgraph)) {
  pcor_network <- pcor(medical_data)$estimate
  qgraph(pcor_network, layout = "spring", 
         labels = colnames(medical_data),
         title = "Partial Correlation Network")
}

# Comparison of correlations
cor_types <- data.frame(
  Type = c("Simple", "Partial", "Semi-partial"),
  Value = c(cor(medical_data$exercise, medical_data$cholesterol),
            pcor_result$estimate,
            semi_partial)
)

ggplot(cor_types, aes(x = Type, y = Value, fill = Type)) +
  geom_bar(stat = "identity") +
  ylim(-1, 1) +
  labs(title = "Comparison of Correlation Measures",
       subtitle = "Exercise vs Cholesterol",
       y = "Correlation Coefficient") +
  theme_minimal()

par(mfrow = c(1, 1))

# 5. Sensitivity analysis: How partial correlation changes with different controls
cat("\n4. Sensitivity Analysis:\n")
cat("Partial correlations with different sets of controls:\n")

controls_list <- list(
  "None" = NULL,
  "Age only" = medical_data[, "age", drop = FALSE],
  "BMI only" = medical_data[, "bmi", drop = FALSE],
  "Age + BMI" = medical_data[, c("age", "bmi")]
)

for (name in names(controls_list)) {
  if (is.null(controls_list[[name]])) {
    r <- cor(medical_data$exercise, medical_data$cholesterol)
  } else {
    r <- calculate_partial_correlation(
      medical_data$exercise, 
      medical_data$cholesterol,
      controls_list[[name]],
      method = "regression"
    )
  }
  cat(sprintf("%-15s: r = %.4f\n", name, r))
}
        1. Example 2: Gene Expression Network Analysis
# Using gene expression data to demonstrate partial correlation in networks
library(huge)

# Simulate gene expression data with network structure
set.seed(456)
n_genes <- 20
n_samples <- 100

# Generate precision matrix with sparse structure
prec_matrix <- diag(n_genes)
for(i in 1:(n_genes-1)) {
  for(j in (i+1):n_genes) {
    if(runif(1) < 0.1) {  # 10% connections
      prec_matrix[i,j] <- prec_matrix[j,i] <- 0.3
    }
  }
}

# Ensure positive definiteness
diag(prec_matrix) <- abs(min(eigen(prec_matrix)$values)) + 1.1

# Generate multivariate normal data
cov_matrix <- solve(prec_matrix)
gene_data <- MASS::mvrnorm(n_samples, mu = rep(0, n_genes), Sigma = cov_matrix)
colnames(gene_data) <- paste0("Gene_", 1:n_genes)

cat("=== Gene Expression Network Analysis ===\n")

# 1. Simple correlation network
simple_cor <- cor(gene_data)
cat("\n1. Simple correlation density:", 
    mean(abs(simple_cor[upper.tri(simple_cor)])), "\n")

# 2. Partial correlation network (Graphical Lasso)
if (require(huge)) {
  huge_result <- huge(gene_data, method = "glasso")
  huge_select <- huge.select(huge_result, criterion = "ebic")
  
  # Estimated precision matrix
  omega_est <- as.matrix(huge_select$opt.icov)
  
  # Convert to partial correlations
  pcor_network <- -cov2cor(omega_est)
  diag(pcor_network) <- 1
  
  cat("2. Partial correlation density:", 
      mean(abs(pcor_network[upper.tri(pcor_network)])), "\n")
  
  # Compare networks
  cat("\n3. Network Comparison:\n")
  cat("Number of edges (simple correlation > 0.3):", 
      sum(abs(simple_cor[upper.tri(simple_cor)]) > 0.3), "\n")
  cat("Number of edges (partial correlation > 0.3):", 
      sum(abs(pcor_network[upper.tri(pcor_network)]) > 0.3), "\n")
  
  # Visualize both networks
  par(mfrow = c(1, 2))
  
  # Simple correlation network
  qgraph(simple_cor, layout = "spring", 
         maximum = 0.5, minimum = -0.5,
         title = "Simple Correlation Network")
  
  # Partial correlation network
  qgraph(pcor_network, layout = "spring",
         maximum = 0.5, minimum = -0.5,
         title = "Partial Correlation Network")
  
  par(mfrow = c(1, 1))
  
  # 3. Test specific partial correlations
  cat("\n4. Hypothesis Testing for Specific Gene Pairs:\n")
  
  # Test Gene_1 and Gene_2 controlling for others
  test_result <- pcor.test(gene_data[, "Gene_1"], 
                          gene_data[, "Gene_2"],
                          gene_data[, setdiff(colnames(gene_data), 
                                             c("Gene_1", "Gene_2"))])
  print(test_result)
}
        1. Example 3: Time Series - Partial Autocorrelation Function
# Time series analysis with PACF
cat("=== Time Series Analysis: Partial Autocorrelation ===\n")

# Generate AR(2) process: X_t = 0.5*X_{t-1} - 0.3*X_{t-2} + ε_t
set.seed(789)
n <- 500
epsilon <- rnorm(n)
x <- numeric(n)
x[1] <- epsilon[1]
x[2] <- 0.5*x[1] + epsilon[2]

for(t in 3:n) {
  x[t] <- 0.5*x[t-1] - 0.3*x[t-2] + epsilon[t]
}

# 1. Compute PACF using built-in function
pacf_result <- pacf(x, lag.max = 20, plot = FALSE)

cat("\n1. PACF values (lags 1-10):\n")
print(pacf_result$acf[1:10])

# 2. Manual calculation via regression (for understanding)
cat("\n2. Manual PACF calculation via regression:\n")

manual_pacf <- function(x, lag) {
  if (lag == 1) {
    return(cor(x[-1], x[-length(x)]))
  } else {
    # Create design matrix
    n <- length(x)
    X <- matrix(NA, nrow = n - lag, ncol = lag)
    for (i in 1:lag) {
      X[, i] <- x[(lag+1-i):(n-i)]
    }
    
    # Fit regression and get coefficient for most recent lag
    y <- x[(lag+1):n]
    coefs <- lm(y ~ X)$coefficients
    return(coefs[lag + 1])  # +1 for intercept
  }
}

for (h in 1:5) {
  cat(sprintf("Lag %d: PACF = %.4f (built-in) vs %.4f (manual)\n",
              h, pacf_result$acf[h], manual_pacf(x, h)))
}

# 3. Visual comparison
par(mfrow = c(2, 2))

# Time series plot
plot(x, type = "l", main = "AR(2) Time Series",
     xlab = "Time", ylab = "Value")

# ACF plot
acf(x, lag.max = 20, main = "Autocorrelation Function")

# PACF plot
pacf(x, lag.max = 20, main = "Partial Autocorrelation Function")

# Compare ACF and PACF
plot(1:20, acf(x, lag.max = 20, plot = FALSE)$acf[-1],
     type = "h", col = "blue", lwd = 2,
     xlab = "Lag", ylab = "Correlation",
     main = "ACF vs PACF", ylim = c(-0.5, 0.6))
points(1:20 + 0.2, pacf_result$acf,
       type = "h", col = "red", lwd = 2)
legend("topright", legend = c("ACF", "PACF"),
       col = c("blue", "red"), lwd = 2)

par(mfrow = c(1, 1))

# 4. Model identification
cat("\n3. Model Identification:\n")
cat("Based on PACF cutoff after lag 2, suggest AR(2) model.\n")
cat("PACF significant at lags 1 and 2, insignificant thereafter.\n")

Advanced Topics

        1. 1) Regularized Partial Correlation
# Regularized estimation for high-dimensional data
library(glasso)

regularized_partial_correlation <- function(data, lambda = 0.1) {
  # Graphical Lasso for sparse precision matrix estimation
  S <- cov(data)
  glasso_result <- glasso(S, rho = lambda)
  
  # Convert to partial correlations
  omega <- glasso_result$wi
  pcor_matrix <- -cov2cor(omega)
  diag(pcor_matrix) <- 1
  
  return(list(
    precision_matrix = omega,
    partial_correlation = pcor_matrix,
    lambda = lambda
  ))
}

# Example with high-dimensional data
set.seed(101)
p <- 50  # Number of variables
n <- 30  # Number of observations (n < p)

high_dim_data <- matrix(rnorm(n * p), n, p)

cat("=== Regularized Partial Correlation (n < p) ===\n")
cat("Dimensions:", n, "samples ×", p, "variables\n")

# Standard correlation matrix is singular
try_cor <- try(cor(high_dim_data), silent = TRUE)
if (inherits(try_cor, "try-error")) {
  cat("Standard correlation matrix is singular (n < p).\n")
}

# Regularized estimation
reg_result <- regularized_partial_correlation(high_dim_data, lambda = 0.5)

cat("\nRegularized partial correlation matrix computed successfully.\n")
cat("Sparsity:", 
    mean(reg_result$partial_correlation[upper.tri(reg_result$partial_correlation)] == 0),
    "proportion of zero partial correlations.\n")
        1. 2) Bayesian Partial Correlation
# Bayesian approach to partial correlation
library(rstan)

bayesian_partial_correlation <- function(X, Y, Z, n_iter = 2000) {
  # Prepare data for Stan
  stan_data <- list(
    N = length(X),
    X = X,
    Y = Y,
    Z = if(is.null(dim(Z))) matrix(Z, ncol = 1) else as.matrix(Z),
    K = if(is.null(dim(Z))) 1 else ncol(Z)
  )
  
  # Stan model for partial correlation
  stan_model_code <- "
  data {
    int<lower=1> N;
    vector[N] X;
    vector[N] Y;
    matrix[N, K] Z;
    int<lower=1> K;
  }
  parameters {
    real beta_X0;
    real beta_Y0;
    vector[K] beta_XZ;
    vector[K] beta_YZ;
    real<lower=-1, upper=1> rho_resid;
    real<lower=0> sigma_X;
    real<lower=0> sigma_Y;
  }
  model {
    // Priors
    beta_X0 ~ normal(0, 10);
    beta_Y0 ~ normal(0, 10);
    beta_XZ ~ normal(0, 5);
    beta_YZ ~ normal(0, 5);
    rho_resid ~ uniform(-1, 1);
    sigma_X ~ cauchy(0, 2.5);
    sigma_Y ~ cauchy(0, 2.5);
    
    // Residual covariance
    for (n in 1:N) {
      real mu_X = beta_X0 + Z[n] * beta_XZ;
      real mu_Y = beta_Y0 + Z[n] * beta_YZ;
      [X[n], Y[n]]' ~ multi_normal([mu_X, mu_Y]', 
                                   [[sigma_X^2, rho_resid*sigma_X*sigma_Y],
                                    [rho_resid*sigma_X*sigma_Y, sigma_Y^2]]);
    }
  }
  generated quantities {
    // Partial correlation
    real partial_corr = rho_resid;
  }
  "
  
  # Fit model
  fit <- stan(model_code = stan_model_code, 
              data = stan_data, 
              iter = n_iter, 
              chains = 4)
  
  # Extract results
  samples <- extract(fit)
  
  return(list(
    posterior_mean = mean(samples$partial_corr),
    posterior_median = median(samples$partial_corr),
    credible_interval = quantile(samples$partial_corr, c(0.025, 0.975)),
    posterior_samples = samples$partial_corr
  ))
}

# Example
cat("\n=== Bayesian Partial Correlation ===\n")
bayes_result <- bayesian_partial_correlation(
  medical_data$exercise,
  medical_data$cholesterol,
  medical_data[, c("age", "bmi")],
  n_iter = 1000
)

print(bayes_result)
        1. 3) Causal Inference and Partial Correlation
# Partial correlation in causal inference
library(pcalg)

causal_partial_correlation <- function(data) {
  # Estimate causal structure using PC algorithm
  suffStat <- list(C = cor(data), n = nrow(data))
  pc_fit <- pc(suffStat, indepTest = gaussCItest,
               p = ncol(data), alpha = 0.05)
  
  # Extract partial correlations for edges in the graph
  adj_matrix <- as(pc_fit, "matrix")
  
  # Compute partial correlations for connected pairs
  p <- ncol(data)
  pcor_matrix <- matrix(0, p, p)
  
  for (i in 1:(p-1)) {
    for (j in (i+1):p) {
      if (adj_matrix[i, j] != 0) {
        # Find separating set
        sep_set <- pc_fit@sepset[[i]][[j]]
        if (length(sep_set) > 0) {
          pcor <- pcor.test(data[, i], data[, j], 
                           data[, sep_set])$estimate
          pcor_matrix[i, j] <- pcor_matrix[j, i] <- pcor
        }
      }
    }
  }
  
  return(list(
    graph = pc_fit,
    adjacency = adj_matrix,
    partial_correlations = pcor_matrix
  ))
}

# Example with simulated causal structure
set.seed(2023)
n_causal <- 100
causal_data <- data.frame(
  A = rnorm(n_causal),
  B = 0.5 * rnorm(n_causal) + 0.3 * A + rnorm(n_causal, sd = 0.1),
  C = 0.4 * rnorm(n_causal) + 0.2 * A + 0.3 * B + rnorm(n_causal, sd = 0.1),
  D = 0.3 * rnorm(n_causal) + 0.1 * B + 0.2 * C + rnorm(n_causal, sd = 0.1)
)

cat("=== Causal Inference with Partial Correlation ===\n")
causal_result <- causal_partial_correlation(causal_data)

cat("Estimated causal graph (adjacency matrix):\n")
print(causal_result$adjacency)

cat("\nPartial correlations for direct connections:\n")
print(causal_result$partial_correlations)

Software Implementation

# Comprehensive partial correlation analysis function
run_partial_correlation_analysis <- function(data, var1, var2, controls,
                                            method = "standard",
                                            inference = TRUE,
                                            visualization = TRUE) {
  
  cat("=== PARTIAL CORRELATION ANALYSIS ===\n\n")
  cat("Variable 1:", var1, "\n")
  cat("Variable 2:", var2, "\n")
  cat("Control variables:", paste(controls, collapse = ", "), "\n")
  cat("Method:", method, "\n\n")
  
  # Extract variables
  X <- data[[var1]]
  Y <- data[[var2]]
  Z <- if(length(controls) > 0) data[, controls, drop = FALSE] else NULL
  
  # 1. Simple correlation
  simple_cor <- cor(X, Y)
  cat("1. Simple correlation:", round(simple_cor, 4), "\n")
  
  # 2. Partial correlation
  if (method == "standard") {
    if (!is.null(Z)) {
      pcor_result <- pcor.test(X, Y, Z)
      cat("\n2. Partial correlation:", round(pcor_result$estimate, 4), "\n")
      if (inference) {
        cat("   p-value:", format.pval(pcor_result$p.value, digits = 4), "\n")
        cat("   95% CI: [", 
            round(pcor_result$conf.int[1], 4), ", ",
            round(pcor_result$conf.int[2], 4), "]\n", sep = "")
      }
    } else {
      cat("\nNo control variables specified.\n")
    }
    
  } else if (method == "regularized") {
    # Regularized partial correlation
    all_data <- cbind(X, Y, Z)
    reg_result <- regularized_partial_correlation(all_data, lambda = 0.1)
    pcor_value <- reg_result$partial_correlation[1, 2]
    cat("\n2. Regularized partial correlation:", round(pcor_value, 4), "\n")
    cat("   Regularization parameter lambda:", reg_result$lambda, "\n")
  }
  
  # 3. Semi-partial correlation
  if (!is.null(Z)) {
    # Regress Y on Z
    formula_y <- as.formula(paste(var2, "~", paste(controls, collapse = "+")))
    lm_y <- lm(formula_y, data = data)
    residuals_y <- resid(lm_y)
    
    semi_partial <- cor(X, residuals_y)
    cat("\n3. Semi-partial correlation:", round(semi_partial, 4), "\n")
    cat("   (Correlation of", var1, "with residuals of", var2, "after controlling for Z)\n")
  }
  
  # 4. Visualization
  if (visualization) {
    par(mfrow = c(2, 2))
    
    # Scatterplot
    plot(X, Y, main = paste("Scatterplot:", var1, "vs", var2),
         xlab = var1, ylab = var2, pch = 19, col = "blue")
    
    # Residual plots if controls specified
    if (!is.null(Z)) {
      # Residuals after controlling
      res_X <- resid(lm(as.formula(paste(var1, "~", paste(controls, collapse = "+"))), 
                       data = data))
      res_Y <- resid(lm(formula_y, data = data))
      
      plot(res_X, res_Y, 
           main = paste("Residual Scatterplot\n(Controlling for", 
                       paste(controls, collapse = ", "), ")"),
           xlab = paste("Residuals of", var1),
           ylab = paste("Residuals of", var2),
           pch = 19, col = "red")
      
      # Comparison plot
      cor_types <- data.frame(
        Type = factor(c("Simple", "Partial", "Semi-partial"),
                     levels = c("Simple", "Partial", "Semi-partial")),
        Value = c(simple_cor, 
                 if(exists("pcor_result")) pcor_result$estimate else NA,
                 semi_partial)
      )
      
      barplot(cor_types$Value, names.arg = cor_types$Type,
              ylim = c(-1, 1), col = c("blue", "red", "green"),
              main = "Comparison of Correlation Measures",
              ylab = "Correlation Coefficient")
      abline(h = 0, lty = 2)
    }
    
    # Network visualization if multiple variables
    if (ncol(data) >= 3) {
      pcor_matrix <- pcor(data)$estimate
      qgraph(pcor_matrix, layout = "spring", 
             labels = colnames(data),
             title = "Partial Correlation Network",
             maximum = 0.8)
    }
    
    par(mfrow = c(1, 1))
  }
  
  # Return results
  results <- list(
    simple_correlation = simple_cor,
    partial_correlation = if(exists("pcor_result")) pcor_result$estimate else NA,
    semi_partial_correlation = if(exists("semi_partial")) semi_partial else NA,
    p_value = if(exists("pcor_result")) pcor_result$p.value else NA
  )
  
  return(results)
}

# Example usage with mtcars dataset
cat("\n=== EXAMPLE: Partial Correlation with mtcars ===\n")
data(mtcars)

# Analyze relationship between mpg and hp, controlling for wt and cyl
results <- run_partial_correlation_analysis(
  data = mtcars,
  var1 = "mpg",
  var2 = "hp",
  controls = c("wt", "cyl"),
  method = "standard",
  inference = TRUE,
  visualization = TRUE
)

Common Issues and Solutions

        1. 1) Multicollinearity Among Controls
# Check for multicollinearity in control variables
check_multicollinearity <- function(Z) {
  if (is.null(dim(Z))) Z <- as.matrix(Z)
  
  # Variance Inflation Factor (VIF)
  if (require(car)) {
    # Create a dummy regression to compute VIF
    dummy_y <- rnorm(nrow(Z))
    dummy_lm <- lm(dummy_y ~ Z)
    vif_values <- vif(dummy_lm)
    
    cat("Variance Inflation Factors (VIF):\n")
    print(vif_values)
    
    # Rule of thumb: VIF > 10 indicates problematic multicollinearity
    problematic <- vif_values > 10
    if (any(problematic)) {
      cat("\nWarning: High multicollinearity detected in control variables.\n")
      cat("Consider removing or combining variables with VIF > 10.\n")
    }
  }
  
  # Condition number
  if (ncol(Z) > 1) {
    kappa_value <- kappa(cor(Z), exact = TRUE)
    cat("\nCondition number of correlation matrix:", round(kappa_value, 2), "\n")
    if (kappa_value > 30) {
      cat("Warning: High condition number indicates multicollinearity.\n")
    }
  }
}

# Example
check_multicollinearity(mtcars[, c("wt", "cyl", "disp")])
        1. 2) Missing Data Handling
# Multiple approaches for missing data in partial correlation
partial_correlation_missing <- function(X, Y, Z, method = "complete") {
  
  if (method == "complete") {
    # Listwise deletion
    complete_cases <- complete.cases(cbind(X, Y, Z))
    X_comp <- X[complete_cases]
    Y_comp <- Y[complete_cases]
    Z_comp <- if(!is.null(dim(Z))) Z[complete_cases, ] else Z[complete_cases]
    
    return(pcor.test(X_comp, Y_comp, Z_comp))
    
  } else if (method == "pairwise") {
    # Pairwise deletion (not recommended for partial correlation)
    cat("Warning: Pairwise deletion can lead to inconsistent results in partial correlation.\n")
    
  } else if (method == "multiple_imputation") {
    # Multiple imputation
    if (require(mice)) {
      data_complete <- cbind(X, Y, Z)
      imp <- mice(data_complete, m = 5, printFlag = FALSE)
      
      # Fit model on each imputed dataset
      fits <- with(imp, pcor.test(X, Y, Z))
      
      # Pool results
      pooled <- pool(fits)
      return(summary(pooled))
    }
  }
}

# Example with missing data
set.seed(303)
n_missing <- 50
X_miss <- rnorm(n_missing)
Y_miss <- 0.5 * X_miss + rnorm(n_missing, sd = 0.5)
Z_miss <- 0.3 * X_miss + 0.4 * Y_miss + rnorm(n_missing, sd = 0.3)

# Introduce missing values
X_miss[sample(1:n_missing, 10)] <- NA
Y_miss[sample(1:n_missing, 10)] <- NA
Z_miss[sample(1:n_missing, 10)] <- NA

cat("=== Partial Correlation with Missing Data ===\n")
cat("Complete cases method:\n")
print(partial_correlation_missing(X_miss, Y_miss, Z_miss, method = "complete"))

Problems and Exercises

1. Conceptual Problems:

  a) Prove that for jointly Gaussian variables, zero partial correlation implies conditional independence
  b) Derive the relationship between partial correlation and regression coefficients
  c) Show that \(|r_{X(Y\cdot Z)}| \leq |r_{XY\cdot Z}| \leq |r_{XY}|\)

2. Applied Problems:

  a) Analyze the relationship between `mpg` and `qsec` in the `mtcars` dataset, controlling for `wt` and `hp`
  b) Compute the partial correlation matrix for the `iris` dataset (sepal and petal measurements)
  c) Conduct a power analysis for detecting partial correlation ρ = 0.3 with 3 control variables, α = 0.05, power = 0.80

3. Simulation Study:

# Simulation to understand partial correlation properties
simulate_partial_correlation <- function(n_sim = 1000, n = 100, 
                                        true_rho = 0.5, n_controls = 2) {
  
  estimates <- numeric(n_sim)
  p_values <- numeric(n_sim)
  coverage <- numeric(n_sim)
  
  for (i in 1:n_sim) {
    # Generate data with specified partial correlation
    # Generate control variables
    Z <- matrix(rnorm(n * n_controls), n, n_controls)
    
    # Generate residuals with specified correlation
    sigma <- matrix(c(1, true_rho, true_rho, 1), 2, 2)
    residuals <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma)
    
    # Generate X and Y
    beta_X <- rnorm(n_controls)
    beta_Y <- rnorm(n_controls)
    
    X <- Z %*% beta_X + residuals[, 1]
    Y <- Z %*% beta_Y + residuals[, 2]
    
    # Estimate partial correlation
    result <- pcor.test(X, Y, Z)
    estimates[i] <- result$estimate
    p_values[i] <- result$p.value
    
    # Check coverage of 95% CI
    ci <- result$conf.int
    coverage[i] <- (true_rho >= ci[1] & true_rho <= ci[2])
  }
  
  cat("Simulation Results (n =", n_sim, "):\n")
  cat("True partial correlation:", true_rho, "\n")
  cat("Mean estimate:", mean(estimates), "\n")
  cat("Bias:", mean(estimates) - true_rho, "\n")
  cat("RMSE:", sqrt(mean((estimates - true_rho)^2)), "\n")
  cat("Type I error rate (for true_rho = 0):", mean(p_values < 0.05), "\n")
  cat("Coverage of 95% CI:", mean(coverage), "\n")
  
  return(data.frame(
    estimate = estimates,
    p_value = p_values,
    coverage = coverage
  ))
}

# Run simulation
sim_results <- simulate_partial_correlation(n_sim = 500, n = 50, 
                                          true_rho = 0.3, n_controls = 2)

References

1. Whittaker, J. (1990). *Graphical Models in Applied Multivariate Statistics*. Wiley.

2. Edwards, D. (2000). *Introduction to Graphical Modelling* (2nd ed.). Springer.

3. Pearl, J. (2009). *Causality: Models, Reasoning, and Inference* (2nd ed.). Cambridge University Press.

4. Koller, D., & Friedman, N. (2009). *Probabilistic Graphical Models: Principles and Techniques*. MIT Press.

5. Bühlmann, P., & van de Geer, S. (2011). *Statistics for High-Dimensional Data: Methods, Theory and Applications*. Springer.

Online Resources




Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif