SMHS ANCOVA

From SOCR
Revision as of 10:05, 9 December 2025 by Dinov (talk | contribs) (Common Issues and Solutions)
Jump to: navigation, search

Scientific Methods for Health Sciences - Analysis of Covariance (ANCOVA)

Overview

Analysis of Covariance (ANCOVA) is a statistical technique that blends Analysis of Variance (ANOVA) and regression analysis to assess whether population means of a dependent variable (DV) differ across levels of a categorical independent variable (IV) while controlling for the effects of continuous covariates (CVs). ANCOVA extends ANOVA by incorporating continuous predictors, which increases statistical power and reduces bias from preexisting group differences. This section provides a comprehensive treatment of ANCOVA, its multivariate extensions (MANCOVA), and practical implementation with R.

Motivation

Consider a clinical trial comparing two treatments for blood pressure reduction. Patients differ in baseline blood pressure, age, and BMI. Simple ANOVA comparing treatment groups would ignore these covariates, potentially biasing results. ANCOVA addresses this by: 1. Increasing statistical power by reducing within-group error variance 2. Adjusting for preexisting differences in nonequivalent groups 3. Reducing bias from confounding variables 4. Enabling more precise estimation of treatment effects

For multivariate outcomes (e.g., blood pressure, cholesterol, weight), Multivariate ANCOVA (MANCOVA) extends this framework to multiple DVs simultaneously.

Theory

1) ANOVA Review

One-way ANOVA

For \(k\) groups with \(n_i\) observations in group \(i\), the model is: \[ y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2) \] where: - \(y_{ij}\) is the \(j\)-th observation in group \(i\) - \(\mu\) is the overall mean - \(\tau_i\) is the treatment effect for group \(i\) (\(\sum_{i=1}^k \tau_i = 0\)) - \(\varepsilon_{ij}\) is the random error

The total sum of squares partitions as: \[ SST = SSB + SSW = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^k n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2 \]

The F-statistic tests \(H_0: \tau_1 = \tau_2 = \cdots = \tau_k\): \[ F = \frac{MSB}{MSW} = \frac{SSB/(k-1)}{SSW/(n-k)} \sim F_{k-1, n-k} \quad \text{under } H_0 \]

Two-way ANOVA

For factors A (a levels) and B (b levels) with r replicates: \[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \] where: - \(\alpha_i\) is the main effect of factor A - \(\beta_j\) is the main effect of factor B - \((\alpha\beta)_{ij}\) is the interaction effect - \(\varepsilon_{ijk} \sim N(0, \sigma^2)\)

Sum of squares decomposition: \[ SST = SSA + SSB + SSAB + SSE \]

2) ANCOVA Model

Basic Model

The ANCOVA model with one covariate and one factor: \[ y_{ij} = \mu + \tau_i + \beta(x_{ij} - \bar{x}_{\cdot\cdot}) + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2) \] where: - \(y_{ij}\) is the response for observation \(j\) in group \(i\) - \(\mu\) is the overall mean - \(\tau_i\) is the treatment effect (\(\sum \tau_i = 0\)) - \(\beta\) is the regression coefficient for the covariate \(x_{ij}\) - \(\bar{x}_{\cdot\cdot}\) is the overall mean of the covariate (centering reduces multicollinearity)

The adjusted group mean for group \(i\) is: \[ \mu_i^{adj} = \mu + \tau_i = \bar{y}_{i\cdot} - \beta(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}) \]

Matrix Formulation

For \(k\) groups and \(p\) covariates: \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I}) \] where: \[ \mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{Z} & \mathbf{C} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \mu \\ \boldsymbol{\tau} \\ \boldsymbol{\gamma} \end{bmatrix} \] - \(\mathbf{Z}\) is the design matrix for group indicators - \(\mathbf{C}\) is the matrix of covariates - \(\boldsymbol{\tau}\) are treatment effects - \(\boldsymbol{\gamma}\) are covariate coefficients

The least squares estimates: \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \]

Hypothesis Testing

1. Test for covariate effect: \(H_0: \beta = 0\) vs \(H_1: \beta \neq 0\) \[ F = \frac{MS_{reg}}{MSE} \sim F_{1, n-k-1} \]

2. Test for treatment effect adjusted for covariate: \(H_0: \tau_1 = \cdots = \tau_k = 0\) \[ F = \frac{MS_{trt|cov}}{MSE} \sim F_{k-1, n-k-1} \]

The adjusted treatment sum of squares: \[ SS_{trt|cov} = SS_{total} - SS_{cov} - SSE \]

3) MANOVA and MANCOVA

MANOVA Model

For \(p\) dependent variables: \[ \mathbf{Y}_{n\times p} = \mathbf{X}_{n\times q}\mathbf{B}_{q\times p} + \mathbf{E}_{n\times p}, \quad \text{vec}(\mathbf{E}) \sim N(\mathbf{0}, \mathbf{I}_n \otimes \boldsymbol{\Sigma}) \] where \(\boldsymbol{\Sigma}\) is the \(p\times p\) covariance matrix of errors.

The hypothesis matrix \(\mathbf{H}\) and error matrix \(\mathbf{E}\): \[ \mathbf{H} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} - \mathbf{Y}^\top\mathbf{1}(\mathbf{1}^\top\mathbf{1})^{-1}\mathbf{1}^\top\mathbf{Y} \] \[ \mathbf{E} = \mathbf{Y}^\top\mathbf{Y} - \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} \]

Test statistics based on eigenvalues \(\lambda_i\) of \(\mathbf{E}^{-1}\mathbf{H}\): 1. Wilks' Lambda: \(\Lambda = \frac{|\mathbf{E}|}{|\mathbf{H}+\mathbf{E}|} = \prod_{i=1}^s \frac{1}{1+\lambda_i}\) 2. Pillai's Trace: \(V = \text{tr}[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}] = \sum_{i=1}^s \frac{\lambda_i}{1+\lambda_i}\) 3. Hotelling-Lawley Trace: \(U = \text{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum_{i=1}^s \lambda_i\) 4. Roy's Largest Root: \(\theta = \frac{\lambda_1}{1+\lambda_1}\)

MANCOVA Model

Extends MANOVA with covariates: \[ \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{Z}\mathbf{\Gamma} + \mathbf{E} \] where \(\mathbf{Z}\) contains covariates and \(\mathbf{\Gamma}\) their coefficients.

4) Assumptions and Diagnostics

Key Assumptions: 1. Linearity: Relationship between DV and covariates is linear 2. Homogeneity of regression slopes: \(\beta\) is constant across groups 3. Normality: Residuals \(\varepsilon_{ij} \sim N(0, \sigma^2)\) 4. Homoscedasticity: Constant variance across groups 5. Independence: Observations are independent 6. No multicollinearity: Covariates not highly correlated

Diagnostic Tests:

# R function for ANCOVA diagnostics
check_ancova_assumptions <- function(model, data, group_var, covariate) {
  par(mfrow = c(2, 3))
  
  # 1. Normality of residuals
  residuals <- resid(model)
  qqnorm(residuals, main = "Q-Q Plot of Residuals")
  qqline(residuals)
  shapiro_test <- shapiro.test(residuals)
  cat("Shapiro-Wilk normality test: W =", shapiro_test$statistic, 
      "p =", shapiro_test$p.value, "\n")
  
  # 2. Homogeneity of variance
  plot(fitted(model), residuals, 
       xlab = "Fitted Values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, col = "red")
  
  # Levene's test (using car package)
  if (require(car)) {
    levene_test <- leveneTest(residuals ~ data[[group_var]])
    cat("Levene's test for homogeneity of variance: F =", 
        levene_test$F[1], "p =", levene_test$Pr[1], "\n")
  }
  
  # 3. Linearity check
  plot(data[[covariate]], residuals,
       xlab = covariate, ylab = "Residuals",
       main = "Residuals vs Covariate")
  abline(h = 0, col = "red")
  
  # 4. Homogeneity of regression slopes
  interaction_model <- lm(formula(paste("y ~", group_var, "*", covariate)), data = data)
  anova_interaction <- anova(model, interaction_model)
  cat("\nTest for homogeneity of slopes (interaction test):\n")
  print(anova_interaction)
  
  # 5. Multicollinearity (if multiple covariates)
  if (require(car)) {
    vif_values <- vif(model)
    cat("\nVariance Inflation Factors (VIF):\n")
    print(vif_values)
  }
  
  par(mfrow = c(1, 1))
}

Below is a more sophisticated function, check_ancova_assumptions_simple(), which is more robust to model configurations.

check_ancova_assumptions_simple <- function(model, data, group_var, covariate) {
  # Get response variable name from model formula
  response_var <- all.vars(formula(model))[1]
  
  cat("=== ANCOVA ASSUMPTION CHECKS ===\n")
  cat("Response variable:", response_var, "\n")
  cat("Group variable:", group_var, "\n")
  cat("Covariate:", covariate, "\n\n")
  
  # Set up plot layout
  par(mfrow = c(2, 3))
  
  # 1. Normality of residuals
  residuals <- resid(model)
  qqnorm(residuals, main = "Q-Q Plot of Residuals")
  qqline(residuals)
  shapiro_test <- shapiro.test(residuals)
  cat("1. Normality (Shapiro-Wilk): W =", round(shapiro_test$statistic, 4), 
      ", p =", format.pval(shapiro_test$p.value, digits = 4), "\n")
  
  # 2. Homogeneity of variance
  fitted_vals <- fitted(model)
  plot(fitted_vals, residuals, 
       xlab = "Fitted Values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, col = "red", lty = 2)
  
  # 3. Linearity check (residuals vs covariate)
  plot(data[[covariate]], residuals,
       xlab = covariate, ylab = "Residuals",
       main = paste("Residuals vs", covariate))
  abline(h = 0, col = "red", lty = 2)
  
  # 4. Homogeneity of regression slopes - FIXED: Properly nested comparison
  # Get all terms from the original model
  model_terms <- attr(terms(model), "term.labels")
  
  # Remove any existing interaction involving the group_var and covariate
  # Keep all other terms
  other_terms <- model_terms[!grepl(paste0(group_var, ":", covariate, "|", 
                                          covariate, ":", group_var), model_terms)]
  
  # Build the model formula without the interaction
  if (length(other_terms) > 0) {
    base_formula <- paste(response_var, "~", paste(other_terms, collapse = " + "))
  } else {
    base_formula <- paste(response_var, "~ 1")
  }
  
  # Build interaction model formula by adding the interaction term
  interaction_formula <- paste(base_formula, "+", group_var, "*", covariate)
  
  # Fit both models
  base_model <- lm(as.formula(base_formula), data = data)
  interaction_model <- lm(as.formula(interaction_formula), data = data)
  
  # Test for significant interaction
  interaction_test <- anova(base_model, interaction_model)
  cat("\n2. Homogeneity of slopes test:\n")
  print(interaction_test)
  
  # Check if the interaction is significant - FIXED: Handle NA p-values
  p_value <- interaction_test$`Pr(>F)`[2]
  
  if (!is.na(p_value)) {
    if (p_value < 0.05) {
      cat("\nWARNING: Significant interaction detected (p =", 
          format.pval(p_value, digits = 4), 
          "). Slopes are not homogeneous.\n")
      
      # Plot different slopes
      plot(data[[covariate]], data[[response_var]],
           col = as.numeric(data[[group_var]]),
           xlab = covariate, ylab = response_var,
           main = "Different Slopes by Group (Interaction Present)")
      
      # Add regression lines for each group
      groups <- unique(data[[group_var]])
      for (i in seq_along(groups)) {
        group_data <- data[data[[group_var]] == groups[i], ]
        if (nrow(group_data) > 1) {
          abline(lm(as.formula(paste(response_var, "~", covariate)), 
                   data = group_data), 
                 col = i, lwd = 2)
        }
      }
      
      legend("topright", legend = levels(data[[group_var]]), 
             col = 1:length(levels(data[[group_var]])), 
             lwd = 2, pch = 1)
    } else {
      cat("\nNo significant interaction (p =", 
          format.pval(p_value, digits = 4), 
          "). Homogeneity of slopes assumption is satisfied.\n")
    }
  } else {
    cat("\nNote: Could not calculate p-value for interaction test.\n")
    cat("Model comparison summary:\n")
    print(interaction_test)
  }
  
  # 5. Residual histogram
  hist(residuals, main = "Histogram of Residuals", 
       xlab = "Residuals", col = "lightblue")
  
  par(mfrow = c(1, 1))
  
  # Additional diagnostics
  cat("\n3. Additional Statistics:\n")
  cat("- Mean of residuals:", round(mean(residuals), 4), "\n")
  cat("- SD of residuals:", round(sd(residuals), 4), "\n")
  cat("- Max absolute residual:", round(max(abs(residuals)), 4), "\n")
  
  # Check for outliers (residuals > 3 SD)
  if (sd(residuals) > 0) {
    outlier_count <- sum(abs(scale(residuals)) > 3, na.rm = TRUE)
    cat("- Potential outliers (>3 SD):", outlier_count, "\n")
  }
  
  # Return diagnostic results
  return(list(
    shapiro_test = shapiro_test,
    interaction_test = interaction_test,
    p_value_interaction = p_value
  ))
}

Applications

Example 1: Clinical Trial with Baseline Adjustment

# Simulated clinical trial data
set.seed(123)
n <- 100
trial_data <- data.frame(
  patient_id = 1:n,
  treatment = factor(rep(c("Drug", "Placebo"), each = n/2)),
  baseline_bp = rnorm(n, mean = 150, sd = 15),
  age = sample(40:75, n, replace = TRUE),
  bmi = rnorm(n, mean = 28, sd = 4)
)

# Generate post-treatment BP with treatment effect and baseline correlation
trial_data$post_bp <- with(trial_data, 
  baseline_bp * 0.7 + 
  ifelse(treatment == "Drug", -15, -5) +
  (age - 60) * 0.2 +
  (bmi - 28) * 0.5 +
  rnorm(n, 0, 8)
)

cat("=== Basic ANOVA (ignoring covariates) ===\n")
anova_simple <- aov(post_bp ~ treatment, data = trial_data)
print(summary(anova_simple))

cat("\n=== ANCOVA (adjusting for baseline BP) ===\n")
ancova_model <- aov(post_bp ~ treatment + baseline_bp, data = trial_data)
print(summary(ancova_model))

cat("\n=== ANCOVA (multiple covariates) ===\n")
ancova_full <- lm(post_bp ~ treatment + baseline_bp + age + bmi, data = trial_data)
print(summary(ancova_full))

cat("\n=== Adjusted Means ===\n")
library(emmeans)
adj_means <- emmeans(ancova_full, specs = ~ treatment)
print(adj_means)

cat("\n=== Pairwise Comparisons with Adjustment ===\n")
pairwise_comparisons <- pairs(adj_means, adjust = "holm")
print(pairwise_comparisons)

# Diagnostic checks
cat("\n=== Model Diagnostics ===\n")
check_ancova_assumptions_simple(ancova_full, trial_data, "treatment", "baseline_bp")

Example 2: Educational Intervention Study

# Using built-in iris dataset to demonstrate MANCOVA
data(iris)
# Create a categorical variable and covariate
set.seed(456)
iris$treatment <- factor(sample(c("Method_A", "Method_B", "Control"), 
                                nrow(iris), replace = TRUE))
iris$pretest_score <- iris$Sepal.Length * 10 + rnorm(nrow(iris), 50, 5)

# MANCOVA with multiple DVs
cat("=== MANCOVA Example ===\n")
Y <- cbind(iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
manova_model <- manova(Y ~ treatment + pretest_score, data = iris)

cat("\n--- Wilks' Lambda Test ---\n")
print(summary(manova_model, test = "Wilks"))

cat("\n--- Pillai's Trace Test ---\n")
print(summary(manova_model, test = "Pillai"))

cat("\n--- Individual ANCOVAs ---\n")
for (i in 1:3) {
  cat("\nDV", i, ":\n")
  ancova_indiv <- lm(Y[, i] ~ treatment + pretest_score, data = iris)
  print(summary(ancova_indiv))
}

# Visualizing adjusted means
library(ggplot2)
library(dplyr)

# Calculate adjusted means using emmeans
if (require(emmeans)) {
  adj_means_plot <- list()
  for (i in 1:3) {
    model <- lm(Y[, i] ~ treatment + pretest_score, data = iris)
    emm <- emmeans(model, specs = ~ treatment)
    adj_means_plot[[i]] <- as.data.frame(emm) %>%
      mutate(DV = paste("DV", i))
  }
  
  plot_data <- bind_rows(adj_means_plot)
  ggplot(plot_data, aes(x = treatment, y = emmean, fill = treatment)) +
    geom_bar(stat = "identity", alpha = 0.7) +
    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
    facet_wrap(~ DV, scales = "free_y") +
    labs(title = "Adjusted Means with 95% Confidence Intervals",
         y = "Adjusted Mean", x = "Treatment Group") +
    theme_minimal()
}

Example 3: Real Dataset Analysis - Plant Growth

# Using PlantGrowth dataset with simulated covariate
data(PlantGrowth)
set.seed(789)
PlantGrowth$soil_quality <- rnorm(nrow(PlantGrowth), mean = 5, sd = 1) + 
  ifelse(PlantGrowth$group == "trt1", 0.5, 
         ifelse(PlantGrowth$group == "trt2", -0.5, 0))

cat("=== Plant Growth ANCOVA ===\n")
cat("Research question: Do treatments affect plant weight after controlling for soil quality?\n\n")

# EDA
cat("--- Exploratory Data Analysis ---\n")
cat("Group means (unadjusted):\n")
print(aggregate(weight ~ group, data = PlantGrowth, mean))

cat("\nCorrelation between weight and soil quality:", 
    cor(PlantGrowth$weight, PlantGrowth$soil_quality), "\n")

# Visualization
ggplot(PlantGrowth, aes(x = soil_quality, y = weight, color = group)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Plant Weight and Soil Quality by Treatment",
       x = "Soil Quality", y = "Plant Weight") +
  theme_minimal()

# ANCOVA analysis
ancova_plant <- lm(weight ~ group + soil_quality, data = PlantGrowth)
cat("\n--- ANCOVA Results ---\n")
print(summary(ancova_plant))

# Check assumptions
cat("\n--- Assumption Checks ---\n")
check_ancova_assumptions_simple(ancova_plant, PlantGrowth, "group", "soil_quality")

# Contrasts and post-hoc tests
cat("\n--- Post-hoc Comparisons ---\n")
library(multcomp)
contrasts <- glht(ancova_plant, linfct = mcp(group = "Tukey"))
print(summary(contrasts))

# Effect sizes
cat("\n--- Effect Sizes ---\n")
library(effectsize)
eta_squared <- eta_squared(ancova_plant, partial = TRUE)
print(eta_squared)

# Power analysis
cat("\n--- Power Analysis ---\n")
library(pwr)
# Calculate achieved power
f2 <- eta_squared$Eta2_partial[1] / (1 - eta_squared$Eta2_partial[1])
power_achieved <- pwr.f2.test(u = 2, v = 27, f2 = f2, sig.level = 0.05)$power
cat("Achieved power for treatment effect:", round(power_achieved, 3), "\n")

Advanced Topics

1) Nonparametric ANCOVA

# Quandrant test or rank-based ANCOVA
if (require(Rfit)) {
  cat("=== Rank-Based ANCOVA ===\n")
  rank_model <- rfit(weight ~ group + soil_quality, data = PlantGrowth)
  print(summary(rank_model))
}

2) Mixed Effects ANCOVA

# For repeated measures or clustered data
if (require(lme4)) {
  # Simulated longitudinal data
  set.seed(321)
  long_data <- data.frame(
    subject = rep(1:20, each = 3),
    time = rep(1:3, 20),
    treatment = factor(rep(sample(c("A", "B"), 20, replace = TRUE), each = 3)),
    baseline = rnorm(60, 100, 15),
    response = NA
  )
  
  # Generate responses
  long_data$response <- with(long_data,
    100 + 
    ifelse(treatment == "A", 5, -5) +
    0.7 * baseline +
    rnorm(60, 0, 10) +  # Residual error
    rep(rnorm(20, 0, 5), each = 3)  # Random intercept
  )
  
  # Linear mixed model (ANCOVA with random effects)
  mixed_model <- lmer(response ~ treatment + baseline + time + (1|subject), 
                      data = long_data)
  cat("\n=== Mixed Effects ANCOVA ===\n")
  print(summary(mixed_model))
  
  # Compare with standard ANCOVA
  standard_ancova <- lm(response ~ treatment + baseline + time, data = long_data)
  cat("\n--- Comparison: Standard vs Mixed ANCOVA ---\n")
  cat("Standard ANCOVA AIC:", AIC(standard_ancova), "\n")
  cat("Mixed model AIC:", AIC(mixed_model), "\n")
}

3) Power Analysis and Sample Size Planning

# Power analysis for ANCOVA
calculate_power_ancova <- function(k, n_per_group, f2, rho, alpha = 0.05) {
  # k: number of groups
  # n_per_group: sample size per group
  # f2: effect size (Cohen's f^2)
  # rho: correlation between covariate and DV
  
  N <- k * n_per_group
  df1 <- k - 1
  df2 <- N - k - 1  # -1 for covariate
  
  # Adjust f2 for covariate inclusion
  f2_adj <- f2 / (1 - rho^2)
  
  power <- pf(qf(1 - alpha, df1, df2), df1, df2, ncp = N * f2_adj, lower.tail = FALSE)
  return(power)
}

cat("=== ANCOVA Power Calculator ===\n")
# Example: 3 groups, 20 per group, medium effect (f2 = 0.15), rho = 0.5
power_example <- calculate_power_ancova(k = 3, n_per_group = 20, f2 = 0.15, rho = 0.5)
cat("Power for specified design:", round(power_example, 3), "\n")

# Sample size calculation for desired power
library(pwr)
ss <- pwr.f2.test(u = 2, f2 = 0.15, sig.level = 0.05, power = 0.80)
cat("Required sample size (per group for 3 groups):", ceiling(ss$v/3 + 3), "\n")

Software Implementation

# Comprehensive ANCOVA analysis function
run_ancova_analysis <- function(formula, data, group_var, covariates, 
                                pairwise = TRUE, diagnostics = TRUE) {
  
  # Fit model
  model <- lm(formula, data = data)
  
  cat("=== ANCOVA ANALYSIS REPORT ===\n\n")
  cat("Model formula:", deparse(formula), "\n")
  cat("Number of groups:", length(unique(data[[group_var]])), "\n")
  cat("Number of covariates:", length(covariates), "\n")
  cat("Total observations:", nrow(data), "\n\n")
  
  # Model summary
  cat("--- MODEL SUMMARY ---\n")
  print(summary(model))
  cat("\n")
  
  # ANOVA table
  cat("--- ANOVA TABLE ---\n")
  print(anova(model))
  cat("\n")
  
  # Assumption checks
  if (diagnostics) {
    cat("--- ASSUMPTION DIAGNOSTICS ---\n")
    check_ancova_assumptions_simple(model, data, group_var, covariates[1])
  }
  
  # Adjusted means and comparisons
  if (pairwise && require(emmeans)) {
    cat("\n--- ADJUSTED GROUP MEANS ---\n")
    adj_means <- emmeans(model, specs = as.formula(paste("~", group_var)))
    print(adj_means)
    
    cat("\n--- PAIRWISE COMPARISONS (Holm-adjusted) ---\n")
    pairwise_results <- pairs(adj_means, adjust = "holm")
    print(pairwise_results)
    
    # Plot adjusted means
    plot_data <- as.data.frame(adj_means)
    p <- ggplot(plot_data, aes_string(x = group_var, y = "emmean", fill = group_var)) +
      geom_bar(stat = "identity", alpha = 0.7) +
      geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
      labs(title = "Adjusted Group Means with 95% Confidence Intervals",
           y = "Adjusted Mean", x = group_var) +
      theme_minimal()
    print(p)
  }
  
  # Effect sizes
  if (require(effectsize)) {
    cat("\n--- EFFECT SIZES ---\n")
    es <- eta_squared(model, partial = TRUE)
    print(es)
  }
  
  return(model)
}

# Example usage with mtcars
cat("\n\n=== EXAMPLE: ANCOVA with mtcars dataset ===\n")
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Run comprehensive analysis
model_result <- run_ancova_analysis(
  formula = mpg ~ cyl + hp + wt,
  data = mtcars,
  group_var = "cyl",
  covariates = c("hp", "wt")
)

Common Issues and Solutions

Violation of Homogeneity of Slopes

# When slopes differ by group
check_slopes <- function(data, dv, group, covariate) {
  # Fit interaction model
  interaction_model <- lm(as.formula(paste(dv, "~", group, "*", covariate)), data = data)
  
  # Test interaction
  simple_model <- lm(as.formula(paste(dv, "~", group, "+", covariate)), data = data)
  anova_result <- anova(simple_model, interaction_model)
  
  cat("Test for homogeneity of regression slopes:\n")
  print(anova_result)
  
  if (anova_result$`Pr(>F)`[2] < 0.05) {
    cat("\nWARNING: Significant interaction detected. Slopes are not homogeneous.\n")
    cat("Consider:\n")
    cat("1. Reporting separate slopes for each group\n")
    cat("2. Using Johnson-Neyman technique to identify regions of significance\n")
    cat("3. Considering moderated regression analysis\n")
    
    # Visualize different slopes
    library(ggplot2)
    p <- ggplot(data, aes_string(x = covariate, y = dv, color = group)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(title = "Different Slopes by Group (Interaction Present)",
           subtitle = "ANCOVA assumption violated") +
      theme_minimal()
    print(p)
  } else {
    cat("\nNo significant interaction. Homogeneity of slopes assumption satisfied.\n")
  }
}

# Example
check_slopes(mtcars, "mpg", "cyl", "wt")

Dealing with Missing Data

# Multiple imputation for ANCOVA with missing covariate values
if (require(mice)) {
  cat("=== Multiple Imputation for ANCOVA ===\n")
  
  # Create dataset with missing values
  set.seed(123)
  missing_data <- PlantGrowth
  missing_data$soil_quality[sample(1:nrow(missing_data), 5)] <- NA
  
  # Perform multiple imputation
  imputed_data <- mice(missing_data, m = 5, method = "pmm", printFlag = FALSE)
  
  # Fit ANCOVA on each imputed dataset
  models <- with(imputed_data, lm(weight ~ group + soil_quality))
  
  # Pool results
  pooled_results <- pool(models)
  cat("Pooled coefficients:\n")
  print(summary(pooled_results))
}

Problems and Exercises

1. Conceptual Problems:

  a) Prove that the adjusted group mean in ANCOVA is unbiased when the covariate is uncorrelated with treatment assignment
  b) Derive the variance of the adjusted treatment effect estimator
  c) Show how ANCOVA increases power compared to ANOVA

2. Applied Problems:

  a) Analyze the SOCR [Consumer Price Index dataset](http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way) using ANCOVA
  b) Conduct a MANCOVA on the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris) with species as the factor and sepal length as covariate
  c) Perform power analysis for a planned study with 4 groups, expecting a medium effect (f² = 0.20), and correlation ρ = 0.6 between covariate and DV

3. Simulation Study:

# Simulation to demonstrate ANCOVA advantages
simulate_ancova_power <- function(n_sim = 1000, n_per_group = 30, 
                                  effect_size = 0.5, rho = 0.6) {
  
  power_anova <- numeric(n_sim)
  power_ancova <- numeric(n_sim)
  
  for (i in 1:n_sim) {
    # Simulate data
    group <- factor(rep(1:3, each = n_per_group))
    covariate <- rnorm(n_per_group * 3)
    
    # Generate response with treatment effect and covariate relationship
    response <- effect_size * as.numeric(group) + rho * covariate + 
                rnorm(n_per_group * 3, 0, sqrt(1 - rho^2))
    
    data <- data.frame(response, group, covariate)
    
    # Fit ANOVA
    anova_model <- aov(response ~ group, data = data)
    power_anova[i] <- summary(anova_model)[[1]]$`Pr(>F)`[1] < 0.05
    
    # Fit ANCOVA
    ancova_model <- aov(response ~ group + covariate, data = data)
    power_ancova[i] <- summary(ancova_model)[[1]]$`Pr(>F)`[1] < 0.05
  }
  
  cat("Simulation Results (n =", n_sim, "):\n")
  cat("ANOVA power:", mean(power_anova), "\n")
  cat("ANCOVA power:", mean(power_ancova), "\n")
  cat("Power increase:", mean(power_ancova) - mean(power_anova), "\n")
  
  return(data.frame(ANOVA = power_anova, ANCOVA = power_ancova))
}

# Run simulation
results <- simulate_ancova_power(n_sim = 500, n_per_group = 25, 
                                 effect_size = 0.4, rho = 0.5)

References

1. Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). *Designing Experiments and Analyzing Data: A Model Comparison Perspective* (3rd ed.). Routledge. 2. Rutherford, A. (2011). *ANOVA and ANCOVA: A GLM Approach* (2nd ed.). Wiley. 3. Tabachnick, B. G., & Fidell, L. S. (2018). *Using Multivariate Statistics* (7th ed.). Pearson. 4. Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. *Journal of Abnormal Psychology, 110*(1), 40-48. 5. Stevens, J. P. (2012). *Applied Multivariate Statistics for the Social Sciences* (5th ed.). Routledge.

        1. Online Resources####



Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif