Difference between revisions of "SMHS ANCOVA"

Latest revision as of 10:09, 9 December 2025

Scientific Methods for Health Sciences - Analysis of Covariance (ANCOVA)

Overview

Analysis of Covariance (ANCOVA) is a statistical technique that blends Analysis of Variance (ANOVA) and regression analysis to assess whether population means of a dependent variable (DV) differ across levels of a categorical independent variable (IV) while controlling for the effects of continuous covariates (CVs). ANCOVA extends ANOVA by incorporating continuous predictors, which increases statistical power and reduces bias from preexisting group differences. This section provides a comprehensive treatment of ANCOVA, its multivariate extensions (MANCOVA), and practical implementation with R.

Motivation

Consider a clinical trial comparing two treatments for blood pressure reduction. Patients differ in baseline blood pressure, age, and BMI. A simple ANOVA comparing treatment groups would ignore these covariates, which inflates the error variance and, when the groups are not equivalent at baseline, can bias the estimated treatment effect. ANCOVA addresses this by:
1. Increasing statistical power by reducing within-group error variance
2. Adjusting for preexisting differences in nonequivalent groups
3. Reducing bias from confounding variables
4. Enabling more precise estimation of treatment effects

For multivariate outcomes (e.g., blood pressure, cholesterol, weight), Multivariate ANCOVA (MANCOVA) extends this framework to multiple DVs simultaneously.

Theory

1) ANOVA Review

One-way ANOVA

For \(k\) groups with \(n_i\) observations in group \(i\), the model is:
\[
y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)
\]
where:
- \(y_{ij}\) is the \(j\)-th observation in group \(i\)
- \(\mu\) is the overall mean
- \(\tau_i\) is the treatment effect for group \(i\) (\(\sum_{i=1}^k \tau_i = 0\))
- \(\varepsilon_{ij}\) is the random error

The total sum of squares partitions as: \[ SST = SSB + SSW = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^k n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2 \]

The F-statistic tests \(H_0: \tau_1 = \tau_2 = \cdots = \tau_k = 0\), where \(n = \sum_{i=1}^k n_i\) is the total sample size: \[ F = \frac{MSB}{MSW} = \frac{SSB/(k-1)}{SSW/(n-k)} \sim F_{k-1, n-k} \quad \text{under } H_0 \]
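
The partition and F-statistic above can be checked numerically. The following is a minimal sketch, assuming the built-in PlantGrowth data as a stand-in example (the dataset choice and variable names are illustrative, not part of the derivation); it computes SSB, SSW, and F by hand and keeps the `aov()` table for cross-checking.

```r
# One-way ANOVA "by hand" on the built-in PlantGrowth data (illustrative choice)
data(PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

n <- length(y)                       # total number of observations
k <- nlevels(g)                      # number of groups
grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)    # \bar{y}_{i.}
group_sizes <- tapply(y, g, length)  # n_i

SSB <- sum(group_sizes * (group_means - grand_mean)^2)  # between groups
SSW <- sum((y - group_means[as.character(g)])^2)        # within groups
SST <- sum((y - grand_mean)^2)                          # total

F_stat  <- (SSB / (k - 1)) / (SSW / (n - k))
p_value <- pf(F_stat, k - 1, n - k, lower.tail = FALSE)

# Cross-check against R's built-in ANOVA table
aov_tab <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
```

Comparing `F_stat` with `aov_tab[["F value"]][1]` confirms that the hand computation and `aov()` agree.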

Two-way ANOVA

For factors A (a levels) and B (b levels) with r replicates:
\[
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
\]
where:
- \(\alpha_i\) is the main effect of factor A
- \(\beta_j\) is the main effect of factor B
- \((\alpha\beta)_{ij}\) is the interaction effect
- \(\varepsilon_{ijk} \sim N(0, \sigma^2)\)

Sum of squares decomposition: \[ SST = SSA + SSB + SSAB + SSE \]
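
As a quick numerical illustration of this decomposition, the sketch below uses the built-in ToothGrowth data (an assumed example: a balanced 2 x 3 design with 10 replicates per cell) and checks that the sums of squares reported by `aov()` add up to the total.

```r
# Two-way ANOVA decomposition on ToothGrowth (supplement type x dose)
data(ToothGrowth)
tg <- transform(ToothGrowth, dose = factor(dose))  # treat dose as a factor

fit <- aov(len ~ supp * dose, data = tg)
tab <- summary(fit)[[1]]

# Named vector: SSA (supp), SSB (dose), SSAB (supp:dose), SSE (Residuals)
SS  <- setNames(tab[["Sum Sq"]], trimws(rownames(tab)))
SST <- sum((tg$len - mean(tg$len))^2)
```

Because the design is balanced, the individual terms do not depend on the order in which the factors are entered.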

2) ANCOVA Model

Basic Model

The ANCOVA model with one covariate and one factor:
\[
y_{ij} = \mu + \tau_i + \beta(x_{ij} - \bar{x}_{\cdot\cdot}) + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)
\]
where:
- \(y_{ij}\) is the response for observation \(j\) in group \(i\)
- \(\mu\) is the overall mean
- \(\tau_i\) is the treatment effect (\(\sum \tau_i = 0\))
- \(\beta\) is the regression coefficient for the covariate \(x_{ij}\)
- \(\bar{x}_{\cdot\cdot}\) is the overall mean of the covariate (centering makes \(\mu + \tau_i\) the expected response of group \(i\) at the average covariate value)

The adjusted group mean for group \(i\), i.e. the group mean expected at the overall covariate mean, is estimated by: \[ \hat{\mu}_i^{adj} = \hat{\mu} + \hat{\tau}_i = \bar{y}_{i\cdot} - \hat{\beta}(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}) \]
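
The adjusted-mean formula can be verified against model predictions. This is a minimal sketch, assuming the built-in mtcars data with `cyl` as the factor and `wt` as the covariate (an illustrative choice); it computes the adjusted means from the raw group means and the fitted slope, and again by predicting each group at the overall covariate mean.

```r
# Adjusted group means: formula vs. prediction at the overall covariate mean
data(mtcars)
d   <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ cyl + wt, data = d)      # common-slope ANCOVA model

beta_hat <- coef(fit)[["wt"]]            # estimated common slope
ybar_i   <- tapply(d$mpg, d$cyl, mean)   # raw group means of the response
xbar_i   <- tapply(d$wt,  d$cyl, mean)   # group means of the covariate
xbar     <- mean(d$wt)                   # overall covariate mean

# Formula: ybar_i - beta_hat * (xbar_i - xbar)
adj_by_formula <- as.numeric(ybar_i - beta_hat * (xbar_i - xbar))

# Same quantity via predict(): every group evaluated at wt = xbar
new_d <- data.frame(cyl = factor(levels(d$cyl), levels = levels(d$cyl)),
                    wt  = xbar)
adj_by_predict <- as.numeric(predict(fit, newdata = new_d))
```

The two vectors coincide because the group indicators force the residuals to sum to zero within each group.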

Matrix Formulation

For \(k\) groups and \(p\) covariates:
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})
\]
where:
\[
\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{Z} & \mathbf{C} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \mu \\ \boldsymbol{\tau} \\ \boldsymbol{\gamma} \end{bmatrix}
\]
- \(\mathbf{Z}\) is the design matrix for group indicators
- \(\mathbf{C}\) is the matrix of covariates
- \(\boldsymbol{\tau}\) are treatment effects
- \(\boldsymbol{\gamma}\) are covariate coefficients

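The least squares estimates, assuming \(\mathbf{X}\) has full column rank (for example, after dropping one group indicator or imposing \(\sum \tau_i = 0\)): \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \]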
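
The matrix formula can be reproduced directly in R. The sketch below is illustrative only: it assumes the mtcars data, uses `model.matrix()` with R's default treatment coding (one group indicator dropped, so the design matrix has full column rank), and solves the normal equations.

```r
# Least squares via the normal equations, compared with lm()
data(mtcars)
d <- transform(mtcars, cyl = factor(cyl))

# Design matrix: intercept, group indicators (reference level dropped), covariates
X <- model.matrix(~ cyl + hp + wt, data = d)
y <- d$mpg

# Solve (X'X) beta = X'y without forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(mpg ~ cyl + hp + wt, data = d)
```

In practice `lm()` uses a QR decomposition rather than the normal equations, which is numerically more stable; the explicit solve is shown only to mirror the formula.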

Hypothesis Testing

1. Test for covariate effect: \(H_0: \beta = 0\) vs \(H_1: \beta \neq 0\) \[ F = \frac{MS_{reg}}{MSE} \sim F_{1, n-k-1} \]

2. Test for treatment effect adjusted for covariate: \(H_0: \tau_1 = \cdots = \tau_k = 0\) \[ F = \frac{MS_{trt|cov}}{MSE} \sim F_{k-1, n-k-1} \]

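The adjusted treatment sum of squares, where \(SS_{cov}\) is the sum of squares explained by regressing the response on the covariate alone and \(SSE\) is the residual sum of squares of the full ANCOVA model: \[ SS_{trt|cov} = SS_{total} - SS_{cov} - SSE \]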
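
The adjusted treatment test is equivalent to comparing a covariate-only model with the full ANCOVA model. A minimal sketch, assuming mtcars with `cyl` as the factor and `wt` as the covariate (illustrative names and data):

```r
# Adjusted treatment sum of squares via nested model comparison
data(mtcars)
d <- transform(mtcars, cyl = factor(cyl))

fit_cov  <- lm(mpg ~ wt,       data = d)   # covariate only
fit_full <- lm(mpg ~ wt + cyl, data = d)   # covariate + treatment

SS_total   <- sum((d$mpg - mean(d$mpg))^2)
SS_cov     <- SS_total - sum(resid(fit_cov)^2)   # explained by covariate alone
SSE        <- sum(resid(fit_full)^2)             # residual SS of the full model
SS_trt_adj <- SS_total - SS_cov - SSE            # treatment SS adjusted for covariate

k <- nlevels(d$cyl)
n <- nrow(d)
F_trt <- (SS_trt_adj / (k - 1)) / (SSE / (n - k - 1))
p_trt <- pf(F_trt, k - 1, n - k - 1, lower.tail = FALSE)

# The same test from R's model-comparison table
cmp <- anova(fit_cov, fit_full)
```

The second row of `cmp` reports the same sum of squares, F-statistic, and p-value.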

3) MANOVA and MANCOVA

MANOVA Model

For \(p\) dependent variables: \[ \mathbf{Y}_{n\times p} = \mathbf{X}_{n\times q}\mathbf{B}_{q\times p} + \mathbf{E}_{n\times p}, \quad \text{vec}(\mathbf{E}) \sim N(\mathbf{0}, \boldsymbol{\Sigma} \otimes \mathbf{I}_n) \] where \(\text{vec}\) stacks the columns of \(\mathbf{E}\), the rows of \(\mathbf{E}\) are independent, and \(\boldsymbol{\Sigma}\) is the \(p\times p\) covariance matrix of the errors within each observation.

The hypothesis matrix \(\mathbf{H}\) and error matrix \(\mathbf{E}\) (from here on, \(\mathbf{E}\) denotes the \(p\times p\) error sum-of-squares-and-cross-products matrix, not the \(n\times p\) matrix of model errors above): \[ \mathbf{H} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} - \mathbf{Y}^\top\mathbf{1}(\mathbf{1}^\top\mathbf{1})^{-1}\mathbf{1}^\top\mathbf{Y} \] \[ \mathbf{E} = \mathbf{Y}^\top\mathbf{Y} - \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} \]

Test statistics based on the eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_s\) of \(\mathbf{E}^{-1}\mathbf{H}\), where \(s\) is the number of nonzero eigenvalues (the smaller of \(p\) and the hypothesis degrees of freedom):
1. Wilks' Lambda: \(\Lambda = \frac{|\mathbf{E}|}{|\mathbf{H}+\mathbf{E}|} = \prod_{i=1}^s \frac{1}{1+\lambda_i}\)
2. Pillai's Trace: \(V = \text{tr}[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}] = \sum_{i=1}^s \frac{\lambda_i}{1+\lambda_i}\)
3. Hotelling-Lawley Trace: \(U = \text{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum_{i=1}^s \lambda_i\)
4. Roy's Largest Root: \(\theta = \frac{\lambda_1}{1+\lambda_1}\)
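
The four statistics can be computed directly from \(\mathbf{H}\) and \(\mathbf{E}\). The following sketch assumes the built-in iris data with Species as the only factor (an illustrative choice; the helper `proj()` is a hypothetical name introduced here) and keeps the values reported by `summary()` on a `manova()` fit for comparison.

```r
# MANOVA test statistics from the eigenvalues of E^{-1} H
data(iris)
Y <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width",
                        "Petal.Length", "Petal.Width")])
X   <- model.matrix(~ Species, data = iris)   # intercept + 2 group indicators
one <- matrix(1, nrow(Y), 1)

# Hypothetical helper: projection ("hat") matrix onto the column space of M
proj <- function(M) M %*% solve(crossprod(M), t(M))
P_X <- proj(X)
P_1 <- proj(one)

H <- t(Y) %*% (P_X - P_1) %*% Y               # hypothesis SSCP matrix
E <- t(Y) %*% (diag(nrow(Y)) - P_X) %*% Y     # error SSCP matrix

s      <- min(ncol(Y), ncol(X) - 1)           # number of nonzero eigenvalues
lambda <- sort(Re(eigen(solve(E) %*% H)$values), decreasing = TRUE)[seq_len(s)]

wilks     <- prod(1 / (1 + lambda))
pillai    <- sum(lambda / (1 + lambda))
hotelling <- sum(lambda)
roy_theta <- lambda[1] / (1 + lambda[1])

# Reference values from R's manova()
fit      <- manova(Y ~ Species, data = iris)
wilks_R  <- summary(fit, test = "Wilks")$stats["Species", "Wilks"]
pillai_R <- summary(fit, test = "Pillai")$stats["Species", "Pillai"]
```

Software packages differ in how they report Roy's statistic (some report \(\lambda_1\) itself rather than \(\theta\)), so it is worth checking the convention before comparing numbers.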

MANCOVA Model

Extends MANOVA with covariates: \[ \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{Z}\mathbf{\Gamma} + \mathbf{E} \] where \(\mathbf{Z}\) contains covariates and \(\mathbf{\Gamma}\) their coefficients.

4) Assumptions and Diagnostics

Key Assumptions:
1. Linearity: Relationship between DV and covariates is linear
2. Homogeneity of regression slopes: \(\beta\) is constant across groups
3. Normality: Residuals \(\varepsilon_{ij} \sim N(0, \sigma^2)\)
4. Homoscedasticity: Constant variance across groups
5. Independence: Observations are independent
6. No multicollinearity: Covariates not highly correlated

Diagnostic Tests:

# R function for ANCOVA diagnostics
check_ancova_assumptions <- function(model, data, group_var, covariate) {
  par(mfrow = c(2, 3))
  
  # 1. Normality of residuals
  residuals <- resid(model)
  qqnorm(residuals, main = "Q-Q Plot of Residuals")
  qqline(residuals)
  shapiro_test <- shapiro.test(residuals)
  cat("Shapiro-Wilk normality test: W =", shapiro_test$statistic, 
      "p =", shapiro_test$p.value, "\n")
  
  # 2. Homogeneity of variance
  plot(fitted(model), residuals, 
       xlab = "Fitted Values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, col = "red")
  
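  # Levene's test (using car package)
  if (require(car)) {
    levene_test <- leveneTest(residuals ~ factor(data[[group_var]]))
    # Index the result by its full column names instead of relying on partial matching
    cat("Levene's test for homogeneity of variance: F =", 
        levene_test[["F value"]][1], "p =", levene_test[["Pr(>F)"]][1], "\n")
  }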
  
  # 3. Linearity check
  plot(data[[covariate]], residuals,
       xlab = covariate, ylab = "Residuals",
       main = "Residuals vs Covariate")
  abline(h = 0, col = "red")
  
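  # 4. Homogeneity of regression slopes
  # Add the group-by-covariate interaction to the fitted model's own formula,
  # so the response name is not hard-coded and the two models are nested
  interaction_formula <- update(formula(model),
                                as.formula(paste(". ~ . +", paste0(group_var, ":", covariate))))
  interaction_model <- lm(interaction_formula, data = data)
  anova_interaction <- anova(model, interaction_model)
  cat("\nTest for homogeneity of slopes (interaction test):\n")
  print(anova_interaction)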
  
  # 5. Multicollinearity (if multiple covariates)
  if (require(car)) {
    vif_values <- vif(model)
    cat("\nVariance Inflation Factors (VIF):\n")
    print(vif_values)
  }
  
  par(mfrow = c(1, 1))
}

Below is a more robust version, check_ancova_assumptions_simple(), which reads the response name from the model formula, builds properly nested models for the slopes test, and returns the test results.

check_ancova_assumptions_simple <- function(model, data, group_var, covariate) {
  # Get response variable name from model formula
  response_var <- all.vars(formula(model))[1]
  
  cat("=== ANCOVA ASSUMPTION CHECKS ===\n")
  cat("Response variable:", response_var, "\n")
  cat("Group variable:", group_var, "\n")
  cat("Covariate:", covariate, "\n\n")
  
  # Set up plot layout
  par(mfrow = c(2, 3))
  
  # 1. Normality of residuals
  residuals <- resid(model)
  qqnorm(residuals, main = "Q-Q Plot of Residuals")
  qqline(residuals)
  shapiro_test <- shapiro.test(residuals)
  cat("1. Normality (Shapiro-Wilk): W =", round(shapiro_test$statistic, 4), 
      ", p =", format.pval(shapiro_test$p.value, digits = 4), "\n")
  
  # 2. Homogeneity of variance
  fitted_vals <- fitted(model)
  plot(fitted_vals, residuals, 
       xlab = "Fitted Values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, col = "red", lty = 2)
  
  # 3. Linearity check (residuals vs covariate)
  plot(data[[covariate]], residuals,
       xlab = covariate, ylab = "Residuals",
       main = paste("Residuals vs", covariate))
  abline(h = 0, col = "red", lty = 2)
  
  # 4. Homogeneity of regression slopes (comparison of properly nested models)
  # Get all terms from the original model
  model_terms <- attr(terms(model), "term.labels")
  
  # Remove any existing interaction involving the group_var and covariate
  # Keep all other terms
  other_terms <- model_terms[!grepl(paste0(group_var, ":", covariate, "|", 
                                          covariate, ":", group_var), model_terms)]
  
  # Build the model formula without the interaction
  if (length(other_terms) > 0) {
    base_formula <- paste(response_var, "~", paste(other_terms, collapse = " + "))
  } else {
    base_formula <- paste(response_var, "~ 1")
  }
  
  # Build interaction model formula by adding the interaction term
  interaction_formula <- paste(base_formula, "+", group_var, "*", covariate)
  
  # Fit both models
  base_model <- lm(as.formula(base_formula), data = data)
  interaction_model <- lm(as.formula(interaction_formula), data = data)
  
  # Test for significant interaction
  interaction_test <- anova(base_model, interaction_model)
  cat("\n2. Homogeneity of slopes test:\n")
  print(interaction_test)
  
  # Check whether the interaction is significant (guarding against NA p-values)
  p_value <- interaction_test$`Pr(>F)`[2]
  
  if (!is.na(p_value)) {
    if (p_value < 0.05) {
      cat("\nWARNING: Significant interaction detected (p =", 
          format.pval(p_value, digits = 4), 
          "). Slopes are not homogeneous.\n")
      
      # Plot different slopes; colors are indexed by factor level so that
      # points, regression lines, and legend all agree
      group_factor <- factor(data[[group_var]])
      plot(data[[covariate]], data[[response_var]],
           col = as.numeric(group_factor),
           xlab = covariate, ylab = response_var,
           main = "Different Slopes by Group (Interaction Present)")
      
      # Add regression lines for each group
      groups <- levels(group_factor)
      for (i in seq_along(groups)) {
        group_data <- data[group_factor == groups[i], ]
        if (nrow(group_data) > 1) {
          abline(lm(as.formula(paste(response_var, "~", covariate)), 
                   data = group_data), 
                 col = i, lwd = 2)
        }
      }
      
      legend("topright", legend = groups, 
             col = seq_along(groups), 
             lwd = 2, pch = 1)
    } else {
      cat("\nNo significant interaction (p =", 
          format.pval(p_value, digits = 4), 
          "). Homogeneity of slopes assumption is satisfied.\n")
    }
  } else {
    cat("\nNote: Could not calculate p-value for interaction test.\n")
    cat("Model comparison summary:\n")
    print(interaction_test)
  }
  
  # 5. Residual histogram
  hist(residuals, main = "Histogram of Residuals", 
       xlab = "Residuals", col = "lightblue")
  
  par(mfrow = c(1, 1))
  
  # Additional diagnostics
  cat("\n3. Additional Statistics:\n")
  cat("- Mean of residuals:", round(mean(residuals), 4), "\n")
  cat("- SD of residuals:", round(sd(residuals), 4), "\n")
  cat("- Max absolute residual:", round(max(abs(residuals)), 4), "\n")
  
  # Check for outliers (residuals > 3 SD)
  if (sd(residuals) > 0) {
    outlier_count <- sum(abs(scale(residuals)) > 3, na.rm = TRUE)
    cat("- Potential outliers (>3 SD):", outlier_count, "\n")
  }
  
  # Return diagnostic results
  return(list(
    shapiro_test = shapiro_test,
    interaction_test = interaction_test,
    p_value_interaction = p_value
  ))
}

Applications

Example 1: Clinical Trial with Baseline Adjustment

# Simulated clinical trial data
set.seed(123)
n <- 100
trial_data <- data.frame(
  patient_id = 1:n,
  treatment = factor(rep(c("Drug", "Placebo"), each = n/2)),
  baseline_bp = rnorm(n, mean = 150, sd = 15),
  age = sample(40:75, n, replace = TRUE),
  bmi = rnorm(n, mean = 28, sd = 4)
)

# Generate post-treatment BP with treatment effect and baseline correlation
trial_data$post_bp <- with(trial_data, 
  baseline_bp * 0.7 + 
  ifelse(treatment == "Drug", -15, -5) +
  (age - 60) * 0.2 +
  (bmi - 28) * 0.5 +
  rnorm(n, 0, 8)
)

cat("=== Basic ANOVA (ignoring covariates) ===\n")
anova_simple <- aov(post_bp ~ treatment, data = trial_data)
print(summary(anova_simple))

cat("\n=== ANCOVA (adjusting for baseline BP) ===\n")
ancova_model <- aov(post_bp ~ treatment + baseline_bp, data = trial_data)
print(summary(ancova_model))

cat("\n=== ANCOVA (multiple covariates) ===\n")
ancova_full <- lm(post_bp ~ treatment + baseline_bp + age + bmi, data = trial_data)
print(summary(ancova_full))

cat("\n=== Adjusted Means ===\n")
library(emmeans)
adj_means <- emmeans(ancova_full, specs = ~ treatment)
print(adj_means)

cat("\n=== Pairwise Comparisons with Adjustment ===\n")
pairwise_comparisons <- pairs(adj_means, adjust = "holm")
print(pairwise_comparisons)

# Diagnostic checks
cat("\n=== Model Diagnostics ===\n")
check_ancova_assumptions_simple(ancova_full, trial_data, "treatment", "baseline_bp")

Example 2: Educational Intervention Study

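# Illustration only: the built-in iris measurements stand in for study outcomes,
# with a randomly assigned "treatment" and a simulated pretest covariate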
data(iris)
# Create a categorical variable and covariate
set.seed(456)
iris$treatment <- factor(sample(c("Method_A", "Method_B", "Control"), 
                                nrow(iris), replace = TRUE))
iris$pretest_score <- iris$Sepal.Length * 10 + rnorm(nrow(iris), 50, 5)

# MANCOVA with multiple DVs
cat("=== MANCOVA Example ===\n")
Y <- cbind(iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
manova_model <- manova(Y ~ treatment + pretest_score, data = iris)

cat("\n--- Wilks' Lambda Test ---\n")
print(summary(manova_model, test = "Wilks"))

cat("\n--- Pillai's Trace Test ---\n")
print(summary(manova_model, test = "Pillai"))

cat("\n--- Individual ANCOVAs ---\n")
for (i in 1:3) {
  cat("\nDV", i, ":\n")
  ancova_indiv <- lm(Y[, i] ~ treatment + pretest_score, data = iris)
  print(summary(ancova_indiv))
}

# Visualizing adjusted means
library(ggplot2)
library(dplyr)

# Calculate adjusted means using emmeans
if (require(emmeans)) {
  adj_means_plot <- list()
  for (i in 1:3) {
    model <- lm(Y[, i] ~ treatment + pretest_score, data = iris)
    emm <- emmeans(model, specs = ~ treatment)
    adj_means_plot[[i]] <- as.data.frame(emm) %>%
      mutate(DV = paste("DV", i))
  }
  
  plot_data <- bind_rows(adj_means_plot)
  ggplot(plot_data, aes(x = treatment, y = emmean, fill = treatment)) +
    geom_bar(stat = "identity", alpha = 0.7) +
    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
    facet_wrap(~ DV, scales = "free_y") +
    labs(title = "Adjusted Means with 95% Confidence Intervals",
         y = "Adjusted Mean", x = "Treatment Group") +
    theme_minimal()
}

Example 3: Real Dataset Analysis - Plant Growth

# Using PlantGrowth dataset with simulated covariate
data(PlantGrowth)
set.seed(789)
PlantGrowth$soil_quality <- rnorm(nrow(PlantGrowth), mean = 5, sd = 1) + 
  ifelse(PlantGrowth$group == "trt1", 0.5, 
         ifelse(PlantGrowth$group == "trt2", -0.5, 0))

cat("=== Plant Growth ANCOVA ===\n")
cat("Research question: Do treatments affect plant weight after controlling for soil quality?\n\n")

# EDA
cat("--- Exploratory Data Analysis ---\n")
cat("Group means (unadjusted):\n")
print(aggregate(weight ~ group, data = PlantGrowth, mean))

cat("\nCorrelation between weight and soil quality:", 
    cor(PlantGrowth$weight, PlantGrowth$soil_quality), "\n")

# Visualization
ggplot(PlantGrowth, aes(x = soil_quality, y = weight, color = group)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Plant Weight and Soil Quality by Treatment",
       x = "Soil Quality", y = "Plant Weight") +
  theme_minimal()

# ANCOVA analysis
ancova_plant <- lm(weight ~ group + soil_quality, data = PlantGrowth)
cat("\n--- ANCOVA Results ---\n")
print(summary(ancova_plant))

# Check assumptions
cat("\n--- Assumption Checks ---\n")
check_ancova_assumptions_simple(ancova_plant, PlantGrowth, "group", "soil_quality")

# Contrasts and post-hoc tests
cat("\n--- Post-hoc Comparisons ---\n")
library(multcomp)
# (named tukey_contrasts to avoid masking base R's contrasts() function)
tukey_contrasts <- glht(ancova_plant, linfct = mcp(group = "Tukey"))
print(summary(tukey_contrasts))

# Effect sizes
cat("\n--- Effect Sizes ---\n")
library(effectsize)
eta_sq <- eta_squared(ancova_plant, partial = TRUE)
print(eta_sq)

# Power analysis
cat("\n--- Power Analysis ---\n")
library(pwr)
# Post-hoc ("achieved") power for the treatment effect; interpret with caution
f2 <- eta_sq$Eta2_partial[1] / (1 - eta_sq$Eta2_partial[1])
# u = k - 1 = 2 numerator df; v = residual df = 30 - 3 groups - 1 covariate = 26
power_achieved <- pwr.f2.test(u = 2, v = df.residual(ancova_plant), 
                              f2 = f2, sig.level = 0.05)$power
cat("Achieved power for treatment effect:", round(power_achieved, 3), "\n")

Advanced Topics

1) Nonparametric ANCOVA

# Rank-based ANCOVA (robust rank regression via the Rfit package)
if (require(Rfit)) {
  cat("=== Rank-Based ANCOVA ===\n")
  rank_model <- rfit(weight ~ group + soil_quality, data = PlantGrowth)
  print(summary(rank_model))
}

2) Mixed Effects ANCOVA

# For repeated measures or clustered data
if (require(lme4)) {
  # Simulated longitudinal data
  set.seed(321)
  long_data <- data.frame(
    subject = rep(1:20, each = 3),
    time = rep(1:3, 20),
    treatment = factor(rep(sample(c("A", "B"), 20, replace = TRUE), each = 3)),
    baseline = rep(rnorm(20, 100, 15), each = 3),  # one baseline value per subject
    response = NA
  )
  
  # Generate responses
  long_data$response <- with(long_data,
    100 + 
    ifelse(treatment == "A", 5, -5) +
    0.7 * baseline +
    rnorm(60, 0, 10) +  # Residual error
    rep(rnorm(20, 0, 5), each = 3)  # Random intercept
  )
  
  # Linear mixed model (ANCOVA with random effects)
  mixed_model <- lmer(response ~ treatment + baseline + time + (1|subject), 
                      data = long_data)
  cat("\n=== Mixed Effects ANCOVA ===\n")
  print(summary(mixed_model))
  
  # Compare with standard ANCOVA
  # (AIC is compared on maximum-likelihood fits; lmer() uses REML by default)
  standard_ancova <- lm(response ~ treatment + baseline + time, data = long_data)
  cat("\n--- Comparison: Standard vs Mixed ANCOVA ---\n")
  cat("Standard ANCOVA AIC:", AIC(standard_ancova), "\n")
  cat("Mixed model AIC (ML refit):", AIC(refitML(mixed_model)), "\n")
}

3) Power Analysis and Sample Size Planning

# Power analysis for ANCOVA
calculate_power_ancova <- function(k, n_per_group, f2, rho, alpha = 0.05) {
  # k: number of groups
  # n_per_group: sample size per group
  # f2: effect size (Cohen's f^2)
  # rho: correlation between covariate and DV
  
  N <- k * n_per_group
  df1 <- k - 1
  df2 <- N - k - 1  # -1 for covariate
  
  # Adjust f2 for covariate inclusion
  f2_adj <- f2 / (1 - rho^2)
  
  power <- pf(qf(1 - alpha, df1, df2), df1, df2, ncp = N * f2_adj, lower.tail = FALSE)
  return(power)
}

cat("=== ANCOVA Power Calculator ===\n")
# Example: 3 groups, 20 per group, medium effect (f2 = 0.15), rho = 0.5
power_example <- calculate_power_ancova(k = 3, n_per_group = 20, f2 = 0.15, rho = 0.5)
cat("Power for specified design:", round(power_example, 3), "\n")

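# Sample size calculation for desired power
# (pwr.f2.test() solves for the error df v; rho is not used here, so pass the
# covariate-adjusted f2 if the covariate's benefit should be reflected)
library(pwr)
ss <- pwr.f2.test(u = 2, f2 = 0.15, sig.level = 0.05, power = 0.80)
# Total N = v + k + 1 (k = 3 group parameters, 1 covariate), split across 3 groups
cat("Required sample size (per group for 3 groups):", ceiling((ss$v + 3 + 1) / 3), "\n")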

Software Implementation

# Comprehensive ANCOVA analysis function
run_ancova_analysis <- function(formula, data, group_var, covariates, 
                                pairwise = TRUE, diagnostics = TRUE) {
  
  # Fit model
  model <- lm(formula, data = data)
  
  cat("=== ANCOVA ANALYSIS REPORT ===\n\n")
  cat("Model formula:", deparse(formula), "\n")
  cat("Number of groups:", length(unique(data[[group_var]])), "\n")
  cat("Number of covariates:", length(covariates), "\n")
  cat("Total observations:", nrow(data), "\n\n")
  
  # Model summary
  cat("--- MODEL SUMMARY ---\n")
  print(summary(model))
  cat("\n")
  
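  # ANOVA table (sequential, Type I sums of squares: each term is adjusted only
  # for the terms listed before it, so list covariates before the group factor
  # to obtain the covariate-adjusted group test)
  cat("--- ANOVA TABLE ---\n")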
  print(anova(model))
  cat("\n")
  
  # Assumption checks
  if (diagnostics) {
    cat("--- ASSUMPTION DIAGNOSTICS ---\n")
    check_ancova_assumptions_simple(model, data, group_var, covariates[1])
  }
  
  # Adjusted means and comparisons
  if (pairwise && require(emmeans)) {
    cat("\n--- ADJUSTED GROUP MEANS ---\n")
    adj_means <- emmeans(model, specs = as.formula(paste("~", group_var)))
    print(adj_means)
    
    cat("\n--- PAIRWISE COMPARISONS (Holm-adjusted) ---\n")
    pairwise_results <- pairs(adj_means, adjust = "holm")
    print(pairwise_results)
    
    # Plot adjusted means (needs ggplot2; aes_string() is deprecated, so the
    # column named by group_var is accessed through the .data pronoun)
    if (require(ggplot2)) {
      plot_data <- as.data.frame(adj_means)
      p <- ggplot(plot_data, aes(x = .data[[group_var]], y = emmean, 
                                 fill = .data[[group_var]])) +
        geom_bar(stat = "identity", alpha = 0.7) +
        geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
        labs(title = "Adjusted Group Means with 95% Confidence Intervals",
             y = "Adjusted Mean", x = group_var, fill = group_var) +
        theme_minimal()
      print(p)
    }
  }
  
  # Effect sizes
  if (require(effectsize)) {
    cat("\n--- EFFECT SIZES ---\n")
    es <- eta_squared(model, partial = TRUE)
    print(es)
  }
  
  return(model)
}

# Example usage with mtcars
cat("\n\n=== EXAMPLE: ANCOVA with mtcars dataset ===\n")
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))

# Run comprehensive analysis
model_result <- run_ancova_analysis(
  formula = mpg ~ hp + wt + cyl,  # covariates first, so the sequential ANOVA test for cyl is covariate-adjusted
  data = mtcars,
  group_var = "cyl",
  covariates = c("hp", "wt")
)

Common Issues and Solutions

Violation of Homogeneity of Slopes

# When slopes differ by group
check_slopes <- function(data, dv, group, covariate) {
  # Fit interaction model
  interaction_model <- lm(as.formula(paste(dv, "~", group, "*", covariate)), data = data)
  
  # Test interaction
  simple_model <- lm(as.formula(paste(dv, "~", group, "+", covariate)), data = data)
  anova_result <- anova(simple_model, interaction_model)
  
  cat("Test for homogeneity of regression slopes:\n")
  print(anova_result)
  
  if (anova_result$`Pr(>F)`[2] < 0.05) {
    cat("\nWARNING: Significant interaction detected. Slopes are not homogeneous.\n")
    cat("Consider:\n")
    cat("1. Reporting separate slopes for each group\n")
    cat("2. Using Johnson-Neyman technique to identify regions of significance\n")
    cat("3. Considering moderated regression analysis\n")
    
    # Visualize different slopes (aes_string() is deprecated, so columns are
    # accessed by name through the .data pronoun)
    library(ggplot2)
    p <- ggplot(data, aes(x = .data[[covariate]], y = .data[[dv]], 
                          color = .data[[group]])) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +
      labs(title = "Different Slopes by Group (Interaction Present)",
           subtitle = "ANCOVA assumption violated",
           x = covariate, y = dv, color = group) +
      theme_minimal()
    print(p)
  } else {
    cat("\nNo significant interaction. Homogeneity of slopes assumption satisfied.\n")
  }
}

# Example
check_slopes(mtcars, "mpg", "cyl", "wt")

Dealing with Missing Data

# Multiple imputation for ANCOVA with missing covariate values
if (require(mice)) {
  cat("=== Multiple Imputation for ANCOVA ===\n")
  
  # Create dataset with missing values
  set.seed(123)
  missing_data <- PlantGrowth
  missing_data$soil_quality[sample(1:nrow(missing_data), 5)] <- NA
  
  # Perform multiple imputation
  imputed_data <- mice(missing_data, m = 5, method = "pmm", printFlag = FALSE)
  
  # Fit ANCOVA on each imputed dataset
  models <- with(imputed_data, lm(weight ~ group + soil_quality))
  
  # Pool results
  pooled_results <- pool(models)
  cat("Pooled coefficients:\n")
  print(summary(pooled_results))
}

Problems and Exercises

1. Conceptual Problems:

  a) Prove that the adjusted group mean in ANCOVA is unbiased when the covariate is uncorrelated with treatment assignment
  b) Derive the variance of the adjusted treatment effect estimator
  c) Show how ANCOVA increases power compared to ANOVA

2. Applied Problems:

  a) Analyze the SOCR Consumer Price Index dataset using ANCOVA
  b) Conduct a MANCOVA on the Iris dataset with species as the factor and sepal length as covariate
  c) Perform power analysis for a planned study with 4 groups, expecting an effect size of f² = 0.20 (between Cohen's medium, 0.15, and large, 0.35, benchmarks), and correlation ρ = 0.6 between covariate and DV

3. Simulation Study:

# Simulation to demonstrate ANCOVA advantages
simulate_ancova_power <- function(n_sim = 1000, n_per_group = 30, 
                                  effect_size = 0.5, rho = 0.6) {
  
  power_anova <- numeric(n_sim)
  power_ancova <- numeric(n_sim)
  
  for (i in 1:n_sim) {
    # Simulate data
    group <- factor(rep(1:3, each = n_per_group))
    covariate <- rnorm(n_per_group * 3)
    
    # Generate response with treatment effect and covariate relationship
    response <- effect_size * as.numeric(group) + rho * covariate + 
                rnorm(n_per_group * 3, 0, sqrt(1 - rho^2))
    
    data <- data.frame(response, group, covariate)
    
    # Fit ANOVA
    anova_model <- aov(response ~ group, data = data)
    power_anova[i] <- summary(anova_model)[[1]]$`Pr(>F)`[1] < 0.05
    
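    # Fit ANCOVA (covariate entered first, so the sequential test for group
    # in the second row is adjusted for the covariate)
    ancova_model <- aov(response ~ covariate + group, data = data)
    power_ancova[i] <- summary(ancova_model)[[1]]$`Pr(>F)`[2] < 0.05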
  }
  
  cat("Simulation Results (n =", n_sim, "):\n")
  cat("ANOVA power:", mean(power_anova), "\n")
  cat("ANCOVA power:", mean(power_ancova), "\n")
  cat("Power increase:", mean(power_ancova) - mean(power_anova), "\n")
  
  return(data.frame(ANOVA = power_anova, ANCOVA = power_ancova))
}

# Run simulation
results <- simulate_ancova_power(n_sim = 500, n_per_group = 25, 
                                 effect_size = 0.4, rho = 0.5)

References

1. Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). *Designing Experiments and Analyzing Data: A Model Comparison Perspective* (3rd ed.). Routledge.

2. Rutherford, A. (2011). *ANOVA and ANCOVA: A GLM Approach* (2nd ed.). Wiley.

3. Tabachnick, B. G., & Fidell, L. S. (2018). *Using Multivariate Statistics* (7th ed.). Pearson.

4. Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. *Journal of Abnormal Psychology, 110*(1), 40-48.

5. Stevens, J. P. (2012). *Applied Multivariate Statistics for the Social Sciences* (5th ed.). Routledge.

Online Resources

SOCR Home page: https://www.socr.umich.edu


@@ Line 2: / Line 2: @@
 ===Overview===
-Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) is the common method applied to blend ANOVA and regression and evaluate whether population means of a dependent variance (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV). Multivariate analysis of variance (MANOVA) is a statistical test procedure for comparing multivariate means of several groups, which is a generalized form of ANOVA. Similar to MANOVA, MANOCVA (multivariate analysis of covariance) is an extension of ANCOVA and is designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. In this section, we review ANOVA, ANCOVA, MANOVA and MANCOVA and illustrate their application with examples.
+'''Analysis of Covariance (ANCOVA)''' is a statistical technique that blends [[SMHS_ANOVA|Analysis of Variance (ANOVA)]] and [[SMHS_Regression|regression analysis]] to assess whether population means of a dependent variable (DV) differ across levels of a categorical independent variable (IV) while controlling for the effects of continuous covariates (CVs). ANCOVA extends ANOVA by incorporating continuous predictors, which increases statistical power and reduces bias from preexisting group differences. This section provides a comprehensive treatment of ANCOVA, its multivariate extensions (MANCOVA), and practical implementation with R.
 ===Motivation===
-We have talked about analysis of variance (ANOVA). ANCOVA is similar to ANOVA and deals with covariance instead. What if we have more than one dependent variable and studied on multivariate observations? What if we want to see if the interactions among dependent variables or changes in the independent variables influence the dependent variable? Then we will need to use the extension of ANOVA and ANCOVA, MANOVA and MACOVA respectively. So the question would be how do those methods work and what kind of conclusions we can drawn from those methods?
+Consider a clinical trial comparing two treatments for blood pressure reduction. Patients differ in baseline blood pressure, age, and BMI. Simple ANOVA comparing treatment groups would ignore these covariates, potentially biasing results. ANCOVA addresses this by:
+. '''Increasing statistical power''' by reducing within-group error variance
+. '''Adjusting for preexisting differences''' in nonequivalent groups
+. '''Reducing bias''' from confounding variables
+. '''Enabling more precise estimation''' of treatment effects
+For multivariate outcomes (e.g., blood pressure, cholesterol, weight), '''Multivariate ANCOVA (MANCOVA)''' extends this framework to multiple DVs simultaneously.
 ===Theory===
-*==ANOVA==
+====1) ANOVA Review====
-Analysis of Variance (ANOVA) is the common method applied to analyze the differences between group means. In ANOVA, we divide the observed variance into components attributed to different sources of variation.
-**One-way ANOVA: we expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.
+=====One-way ANOVA=====
-**Notations first: $y_{ij}$ is the measurement from group $i$, observation index $j$; $k$ is the number of groups; $n_{i}$ is the number of observations in group $i$; $n$ is the total number of observations and $n=n_{1}+n_{2}+⋯+n_{k}$. The group mean for group $i$ is $\bar y_{i}. =\frac{\sum_{j=1}^n_{i}y_{ij}}{n_{i}},$ the grand mean is $\bar y = \bar y_{..}=\frac{\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}y_{ij}} {n}$.
+For \(k\) groups with \(n_i\) observations in group \(i\), the model is:
-**Difference between the means (compare each group mean to the grand mean): total variance $SST(total)=\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}$, degrees of freedom $df(total)=n-1$; difference between each group mean and grand mean: $SST(between)=\sum_{i=}^{k}n_{i} \left(\bar y_{i.} \bar y_{..}\right )^{2}$, degrees of freedom $df(between)=k-1$; Sum square due to error (combination of variation within group): $SSE=\sum_{i=1}^{k} \sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$, degrees of freedom $df(within)=n-k$.
+\[
+y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)
+\]
+where:
+- \(y_{ij}\) is the \(j\)-th observation in group \(i\)
+- \(\mu\) is the overall mean
+- \(\tau_i\) is the treatment effect for group \(i\) (\(\sum_{i=1}^k \tau_i = 0\))
+- \(\varepsilon_{ij}\) is the random error
+The total sum of squares partitions as:
+\[
+SST = SSB + SSW = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^k n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2
+\]
+The F-statistic tests \(H_0: \tau_1 = \tau_2 = \cdots = \tau_k = 0\) (all group means are equal):
+\[
+F = \frac{MSB}{MSW} = \frac{SSB/(k-1)}{SSW/(n-k)} \sim F_{k-1, n-k} \quad \text{under } H_0
+\]
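+The partition and the F-statistic above can be checked numerically. The following is a minimal sketch on simulated data; the group labels, group means, and sample sizes are illustrative assumptions, not values from any dataset in this section.
+<pre>
+# Hypothetical data: 3 groups of 10 observations (simulated for illustration only)
+set.seed(1)
+g <- factor(rep(c("A", "B", "C"), each = 10))
+y <- rnorm(30, mean = c(10, 12, 15)[as.integer(g)], sd = 2)
+n <- length(y)
+k <- nlevels(g)
+grand_mean  <- mean(y)
+group_means <- tapply(y, g, mean)
+group_sizes <- tapply(y, g, length)
+# Sums of squares, term by term as in the partition SST = SSB + SSW
+SSB <- sum(group_sizes * (group_means - grand_mean)^2)   # between groups
+SSW <- sum((y - group_means[as.character(g)])^2)         # within groups
+SST <- sum((y - grand_mean)^2)                           # total
+# F = MSB / MSW with (k - 1, n - k) degrees of freedom
+F_stat  <- (SSB / (k - 1)) / (SSW / (n - k))
+p_value <- pf(F_stat, k - 1, n - k, lower.tail = FALSE)
+# Cross-check against the built-in one-way ANOVA
+F_aov <- summary(aov(y ~ g))[[1]][["F value"]][1]
+print(all.equal(SSB + SSW, SST))
+print(all.equal(F_stat, F_aov))
+</pre>
+Both comparisons should print TRUE, since ''aov()'' computes the same decomposition.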
+=====Two-way ANOVA=====
+For factors A (\(a\) levels) and B (\(b\) levels) with \(r\) replicates per cell:
+\[
+y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
+\]
+where:
+- \(\alpha_i\) is the main effect of factor A
+- \(\beta_j\) is the main effect of factor B
+- \((\alpha\beta)_{ij}\) is the interaction effect
+- \(\varepsilon_{ijk} \sim N(0, \sigma^2)\)
+Sum of squares decomposition:
+\[
+SST = SSA + SSB + SSAB + SSE
+\]
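+As a rough illustration of this decomposition, the sketch below fits a balanced two-way layout with ''aov()'' and confirms that the sums of squares add up to the total. The factor names, effect sizes, and number of replicates are assumptions made for the example.
+<pre>
+# Hypothetical balanced design: 2 x 3 factorial with r = 5 replicates per cell
+set.seed(7)
+d <- expand.grid(A = factor(c("a1", "a2")),
+                 B = factor(c("b1", "b2", "b3")),
+                 rep = 1:5)
+d$y <- 10 + ifelse(d$A == "a2", 2, 0) + c(0, 1, 3)[as.integer(d$B)] + rnorm(nrow(d))
+# y ~ A * B expands to A + B + A:B (main effects plus interaction)
+fit <- aov(y ~ A * B, data = d)
+tab <- summary(fit)[[1]]
+print(tab)
+# SSA + SSB + SSAB + SSE should equal the total sum of squares
+SST <- sum((d$y - mean(d$y))^2)
+print(all.equal(sum(tab[["Sum Sq"]]), SST))
+</pre>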
+====2) ANCOVA Model====
+=====Basic Model=====
+The ANCOVA model with one covariate and one factor:
+\[
+y_{ij} = \mu + \tau_i + \beta(x_{ij} - \bar{x}_{\cdot\cdot}) + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)
+\]
+where:
+- \(y_{ij}\) is the response for observation \(j\) in group \(i\)
+- \(\mu\) is the overall mean
+- \(\tau_i\) is the treatment effect (\(\sum \tau_i = 0\))
+- \(\beta\) is the regression coefficient for the covariate \(x_{ij}\)
+- \(\bar{x}_{\cdot\cdot}\) is the overall mean of the covariate (centering makes \(\mu + \tau_i\) interpretable as the mean of group \(i\) at the average covariate value)
+The adjusted group mean for group \(i\) is:
+\[
+\hat{\mu}_i^{adj} = \hat{\mu} + \hat{\tau}_i = \bar{y}_{i\cdot} - \hat{\beta}(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})
+\]
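+The adjusted-mean formula can be verified against a fitted model. The sketch below uses simulated data with hypothetical variable names (''group'', ''x'', ''y'') and shows that the formula agrees with the model prediction for each group at the overall covariate mean.
+<pre>
+# Hypothetical two-group data with a covariate imbalance between groups
+set.seed(2)
+d <- data.frame(group = factor(rep(c("ctrl", "trt"), each = 20)))
+d$x <- rnorm(40, mean = ifelse(d$group == "trt", 52, 48), sd = 5)
+d$y <- 5 + 0.8 * d$x + ifelse(d$group == "trt", 3, 0) + rnorm(40, sd = 2)
+# ANCOVA with a common slope for the covariate
+fit <- lm(y ~ group + x, data = d)
+beta_hat <- coef(fit)[["x"]]
+# Formula from the text: ybar_i - beta_hat * (xbar_i - xbar_overall)
+ybar_i <- tapply(d$y, d$group, mean)
+xbar_i <- tapply(d$x, d$group, mean)
+adj_by_formula <- ybar_i - beta_hat * (xbar_i - mean(d$x))
+# Same quantity obtained as a prediction at the overall covariate mean
+new_d <- data.frame(group = factor(levels(d$group), levels = levels(d$group)),
+                    x = mean(d$x))
+adj_by_predict <- predict(fit, newdata = new_d)
+print(cbind(raw = ybar_i, adjusted = adj_by_formula))
+print(all.equal(as.numeric(adj_by_formula), as.numeric(adj_by_predict)))
+</pre>
+The two computations agree because, with a common slope, the fitted line for each group passes through that group's point of means \((\bar{x}_{i\cdot}, \bar{y}_{i\cdot})\).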
+=====Matrix Formulation=====
+For \(k\) groups and \(p\) covariates:
+\[
+\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})
+\]
+where:
+\[
+\mathbf{X} = \begin{bmatrix}
+\mathbf{1} & \mathbf{Z} & \mathbf{C}
+\end{bmatrix}, \quad
+\boldsymbol{\beta} = \begin{bmatrix}
+\mu \\ \boldsymbol{\tau} \\ \boldsymbol{\gamma}
+\end{bmatrix}
+\]
+- \(\mathbf{Z}\) is the design matrix for group indicators
+- \(\mathbf{C}\) is the matrix of covariates
+- \(\boldsymbol{\tau}\) are treatment effects
+- \(\boldsymbol{\gamma}\) are covariate coefficients
+The least squares estimates:
+\[
+\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}
+\]
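+A minimal sketch of the normal equations, assuming simulated data with three groups and two covariates (all names are hypothetical). Note that \(\mathbf{X}^\top\mathbf{X}\) is singular if the intercept and all \(k\) group indicators are included, so a constraint is needed; ''model.matrix()'' uses treatment coding (one indicator dropped) by default rather than the sum-to-zero constraint.
+<pre>
+# Hypothetical data: k = 3 groups, p = 2 covariates
+set.seed(3)
+d <- data.frame(group = factor(rep(c("A", "B", "C"), each = 15)),
+                x1 = rnorm(45), x2 = rnorm(45))
+d$y <- 2 + c(0, 1, -1)[as.integer(d$group)] + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(45)
+# Design matrix [1, Z, C]: intercept, group indicators (one dropped), covariates
+X <- model.matrix(~ group + x1 + x2, data = d)
+# Solve (X'X) beta = X'y instead of forming the inverse explicitly
+beta_hat <- solve(crossprod(X), crossprod(X, d$y))
+# Cross-check against lm()
+fit <- lm(y ~ group + x1 + x2, data = d)
+print(cbind(normal_eq = as.numeric(beta_hat), lm = coef(fit)))
+print(all.equal(as.numeric(beta_hat), as.numeric(coef(fit))))
+</pre>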
+=====Hypothesis Testing=====
+1. '''Test for covariate effect''': \(H_0: \beta = 0\) vs \(H_1: \beta \neq 0\)
+\[
+F = \frac{MS_{reg}}{MSE} \sim F_{1, n-k-1}
+\]
+2. '''Test for treatment effect adjusted for covariate''': \(H_0: \tau_1 = \cdots = \tau_k = 0\)
+\[
+F = \frac{MS_{trt|cov}}{MSE} \sim F_{k-1, n-k-1}
+\]
+The adjusted treatment sum of squares, with the covariate entered before the treatment factor (sequential sums of squares), is:
+\[
+SS_{trt|cov} = SS_{total} - SS_{cov} - SSE
+\]
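+The adjusted treatment test can be reproduced by comparing two nested models. The sketch below assumes the covariate is entered before the treatment factor (sequential sums of squares) and uses simulated data with hypothetical names.
+<pre>
+# Hypothetical data: k = 3 groups, one covariate
+set.seed(4)
+d <- data.frame(group = factor(rep(c("A", "B", "C"), each = 12)))
+d$x <- rnorm(36, mean = 50, sd = 10)
+d$y <- 10 + 0.6 * d$x + c(0, 2, 4)[as.integer(d$group)] + rnorm(36, sd = 3)
+cov_only <- lm(y ~ x, data = d)           # covariate only
+full     <- lm(y ~ x + group, data = d)   # covariate plus treatment
+SS_total   <- sum((d$y - mean(d$y))^2)
+SS_cov     <- SS_total - sum(resid(cov_only)^2)
+SSE        <- sum(resid(full)^2)
+SS_trt_adj <- SS_total - SS_cov - SSE
+# F with (k - 1, n - k - 1) degrees of freedom
+n <- nrow(d)
+k <- nlevels(d$group)
+F_trt <- (SS_trt_adj / (k - 1)) / (SSE / (n - k - 1))
+# The nested-model comparison gives the same F statistic
+cmp <- anova(cov_only, full)
+print(cmp)
+print(all.equal(F_trt, cmp$F[2]))
+</pre>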
+====3) MANOVA and MANCOVA====
-With ANOVA decomposition, we have $\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{..}\right)^{2}=\sum_{i=}^{k}n_{i}\left(\bar y_{i.} - \bar y_{..}\right )^{2}+\sum_{i=1}^{k}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar y_{i.}\right)^{2}$,that is $ST(total)=SST(between)+SSE(within)\,and\, df(total)=df(between)+df(within).$
+=====MANOVA Model=====
+For \(p\) dependent variables:
-*Calculations:
+\[
-ANOVA table:
+\mathbf{Y}_{n\times p} = \mathbf{X}_{n\times q}\mathbf{B}_{q\times p} + \mathbf{E}_{n\times p}, \quad \text{vec}(\mathbf{E}) \sim N(\mathbf{0}, \mathbf{I}_n \otimes \boldsymbol{\Sigma})
-<center>
+\]
-{| class="wikitable" style="text-align:center; width:45%" border="1"
+where \(\boldsymbol{\Sigma}\) is the \(p\times p\) covariance matrix of errors.
-|-
-|Variance source||Degree of freedom (df)|| Sum of squares (SS) ||Mean sum of squares (MS)|| F-statistics||P-value
-|-
-|Treatment effect (between group)||$k-1$||$\sum_{i=1}^{k}n_{i} \left(\bar y_{i.} -\bar y_{..}\right)^{2}$ || $\frac{SST(between)}{df(between)}$ || $F_{0}=\frac{MST(between)}{MSE(within)}$ || $p(F_{(df(between),df(within)}>F_{0}$
-|-
-|Error (within group)|| $n-k$ ||$\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{i.}\right)^{2}$  ||	$\frac{SSE(within)}{df(within)}$ || ||
-|-
-|Total || $n-1$ || $\sum_{i=1}^{k} \sum_{j=1}^{n_{i}} \left(y_{ij}-\bar y_{..}\right)^{2}$ ||  ||  ||
-|-
-|}
-</center>
-*==ANOVA hypotheses(general form)==
+The hypothesis matrix \(\mathbf{H}\) and error matrix \(\mathbf{E}\):
-$H_{o}:\mu_{1}=\mu_{2}=⋯=\mu_{k}; H_{a}:\mu_{i}≠μ_{j}$ for some $i≠j$.  The test statistics: $F_{0}=\frac{MST(between)}{MSE(within)}$, if $F_{0}$ is large, then there is a lot between group variation, relative to the within group variation. Therefore, the discrepancies between the group means are large compared to the variability within the groups (error). That is large $F_{0}$ provides strong evidence against $H_{0}$.
+\[
-**ANOVA conditions: valid if (1) design conditions: all groups of observations represent random samples from their population respectively. Plus, all the observations within each group are independent of each other; (2) population conditions: the k population distributions must be approximately normal. If sample size is large, the normality condition is less crucial. Plus, the standard deviations of all populations are equal, which can be slightly relaxed when $0.5≤\frac{\sigma_{i}} {\sigma_{j}}≤2,$ for all $i$ and $j$, none of the population variance is twice larger than any of the other ones.
+\mathbf{H} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} - \mathbf{Y}^\top\mathbf{1}(\mathbf{1}^\top\mathbf{1})^{-1}\mathbf{1}^\top\mathbf{Y}
+\]
+\[
+\mathbf{E} = \mathbf{Y}^\top\mathbf{Y} - \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}
+\]
-*==Two-way ANOVA==
+Test statistics based on eigenvalues \(\lambda_i\) of \(\mathbf{E}^{-1}\mathbf{H}\):
-We focus on decomposing the variance of a dataset into independent (orthogonal) components when we have two grouping factors.
+1. '''Wilks' Lambda''': \(\Lambda = \frac{|\mathbf{E}|}{|\mathbf{H}+\mathbf{E}|} = \prod_{i=1}^s \frac{1}{1+\lambda_i}\)
-**Notations first: two-way model: $y_{ijk}=\mu+\tau_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk}$,for all $1≤i≤a,1≤j≤b$ and $1≤k≤r. y_{ijk}$ is the A-factor level $i$, and B-factor level $j,$ observation-index $k$ measurement; $k$ is the number of replications; $a_{i}$ is the number of A-factor observations at level $i,a=a_{1}+⋯+a_{i}; b_{j}$ is the number of B-factor observations at level $j,b=b_{1}+⋯+b_{J}; N$ is the total number of observations and $N=a*a*b.$ Here $\mu$ is the overall mean response, $\tau_{i}$ is the effect due to the $i^{th}$ level of factor A, $\beta_{j}$ is the effect due to the $j^{th}$ level of factor B, and $\gamma_{ij}$ is the effect due to any interaction between the $i^{th}$ level of factor A and $j^{th}$ level of factor B. The mean for A-factor group mean at level $i$ and B-factor at level $j$ is $\bar y_{ij}.=\frac{\sum_{k=1}^{r}y_{ijk}}{r}$, the grand mean is $\bar y=\bar y_{…}=\frac{\sum_{k=1}^{r}\sum_{i=1}^{a}\sum_{j=1}^{b} y_{ijk}} {n}$, we have $SST(total)=SS(A)+SS(B)+SS(AB)+SSE.$
+2. '''Pillai's Trace''': \(V = \text{tr}[\mathbf{H}(\mathbf{H}+\mathbf{E})^{-1}] = \sum_{i=1}^s \frac{\lambda_i}{1+\lambda_i}\)
+3. '''Hotelling-Lawley Trace''': \(U = \text{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum_{i=1}^s \lambda_i\)
+4. '''Roy's Largest Root''': \(\theta = \frac{\lambda_1}{1+\lambda_1}\), where \(\lambda_1\) is the largest eigenvalue
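+The four statistics can be computed directly from \(\mathbf{H}\) and \(\mathbf{E}\). The following is a sketch on simulated data (two outcomes, three groups; all names and effect sizes are assumptions). Wilks' Lambda and Pillai's trace are compared with ''summary()'' of a ''manova()'' fit; Roy's statistic is not compared because software conventions differ (some packages report \(\lambda_1\) itself rather than \(\theta\)).
+<pre>
+# Hypothetical data: p = 2 outcomes, 3 groups of 20
+set.seed(5)
+g <- factor(rep(c("A", "B", "C"), each = 20))
+Y <- cbind(y1 = rnorm(60, mean = c(0, 0.5, 1.0)[as.integer(g)]),
+           y2 = rnorm(60, mean = c(0, 0.3, 0.8)[as.integer(g)]))
+# Projection matrices onto the column spaces of X and of the intercept
+X   <- model.matrix(~ g)
+one <- matrix(1, nrow(Y), 1)
+P_X <- X %*% solve(crossprod(X)) %*% t(X)
+P_1 <- one %*% solve(crossprod(one)) %*% t(one)
+# Hypothesis and error SSCP matrices, as defined in the text
+H <- t(Y) %*% (P_X - P_1) %*% Y
+E <- t(Y) %*% (diag(nrow(Y)) - P_X) %*% Y
+# Eigenvalues of E^{-1} H (real in theory; Re() guards against numerical noise)
+lambda <- Re(eigen(solve(E) %*% H)$values)
+wilks     <- prod(1 / (1 + lambda))
+pillai    <- sum(lambda / (1 + lambda))
+hotelling <- sum(lambda)
+roy_theta <- max(lambda) / (1 + max(lambda))
+# Cross-check two of the statistics against manova()
+fit <- manova(Y ~ g)
+wilks_r  <- unname(summary(fit, test = "Wilks")$stats["g", "Wilks"])
+pillai_r <- unname(summary(fit, test = "Pillai")$stats["g", "Pillai"])
+print(all.equal(wilks, wilks_r))
+print(all.equal(pillai, pillai_r))
+</pre>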
-*==Hypotheses:==
+=====MANCOVA Model=====
-**Null hypotheses: (1) the population means of the first factor are equal, which is like the one-way ANOVA for the row factor; (2) the population means of the second factor are equal, which is like the one-way ANOVA for the column factor; (3) there is no interaction between the two factors, which is similar to performing a test for independence with contingency tables.
+Extends MANOVA with covariates:
-**Factors: factor A and factor B are independent variables in two-way ANOVA.
+\[
-**Treatment groups: formed by making all possible combinations of two factors. For example, if the factor A has 3 levels and factor B has 5 levels, then there will be $3*5=15$ different treatment groups.
+\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{Z}\mathbf{\Gamma} + \mathbf{E}
-**Main effect: involves the dependent variable one at a time. The interaction is ignored for this part.
+\]
-**Interaction effect: the effect that one factor has on the other factor. The degree of freedom is the product of the two degrees of freedom of each factor.
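+where \(\mathbf{Z}\) contains the covariates and \(\mathbf{\Gamma}\) their coefficients. Note that \(\mathbf{Z}\) plays a different role here than in the ANCOVA matrix formulation above, where it denoted the group indicators.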
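+In R, a MANCOVA of this form can be sketched with ''manova()'' by listing the covariate before the grouping factor, so that the group test is adjusted for the covariate (''summary()'' of a ''manova'' fit reports sequential tests). The data and variable names below are simulated and hypothetical.
+<pre>
+# Hypothetical data: two outcomes, one covariate, two groups of 25
+set.seed(6)
+d <- data.frame(group = factor(rep(c("A", "B"), each = 25)), x = rnorm(50))
+d$y1 <- 1.0 * d$x + ifelse(d$group == "B", 0.8, 0) + rnorm(50)
+d$y2 <- 0.5 * d$x + ifelse(d$group == "B", 0.5, 0) + rnorm(50)
+# Covariate first, then group: the group effect is tested after adjusting for x
+fit <- manova(cbind(y1, y2) ~ x + group, data = d)
+print(summary(fit, test = "Pillai"))
+# Coefficient matrix: rows are (Intercept), x, groupB; columns are y1, y2
+coef_mat <- coef(fit)
+print(coef_mat)
+</pre>
+The rows of ''coef_mat'' correspond to the stacked blocks \(\mathbf{B}\) (intercept and group contrast) and \(\mathbf{\Gamma}\) (covariate), one column per dependent variable.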
-**Calculations:
-ANOVA table:
-<center>
-{| class="wikitable" style="text-align:center; width:45%" border="1"
-|-
-|Variance source||Degree of freedom (df)||Sum of squares (SS)||	Mean sum of squares (MS)||F-statistics||P-value
-|-
-|Main effect A|| $a-1$ ||$SS(A)=rb\Sigma_{i=1}^{a} n_{i} (\bar y_{i..} -\bar y_{...} )^{2}$ ||$SS(A)/df(A)$||$F_{0}=\frac{MS(A)}{MSE}$	||$p(F_{(df(B),df(E)})>F_{0}$
-|-
-|Main effect B	|| $b-1$||$SS(B)=ra\Sigma_{j=1}^{b} n_{i} (\bar y_{.j.}-\bar y_{...})^{2}$ || $SS(B)/df(B)$||	$F_{0}=\frac{MS(B)}{MSE}$||$p(F_{(df(AB),df(E)} )>F_{0}$
-|-
-|A vs. B interaction||$(a-1)(b-1)$||$SS(AB)=r\Sigma_{i=1}^{a} \Sigma_{j=1}^{b} ((\bar y_{ij.}-\bar y_{i..}) +\bar y_{.j.} -\bar y_{...} )^{2}$ ||$SS(AB)/df(AB)$||$F_{0}=\frac{MS(AB)} {MSE}$||
-|-
-|Error ||$N-ab$||$SSE=\Sigma_{k=1}^{r} \Sigma_{i=1}^{a} \Sigma_{j=1}^{b} (y_{ijk}-\bar y_{ij.})^{2}$||$SSE/df(error)$||  ||
-|-
-|Total||$N-1$||$SST=\Sigma_{k=1}^{r}\Sigma_{i=1}^{a} \Sigma_{j=1}^{b}(y_{ijk}-\bar y_{...})^{2}$|| || ||
-|-
-|}
-</center>
-**ANOVA conditions: valid if (1) the population from which the samples were obtained must be normally or approximately normally distributed; (2) the samples must be independent; (3) the variances of the populations must be equal; (4) the groups must have the same sample size.
+====4) Assumptions and Diagnostics====
-*==ANCOVA==
+'''Key Assumptions:'''
-Analysis of Covariance is the common method applied to blend ANOVA and regression and evaluate whether population means of a dependent variance (DV) are equal across levels of a categorical independent variable (IV) while statistically controlling for the effects of other continuous variables (CV).
+1. '''Linearity''': The relationship between the DV and each covariate is linear
-**Assumptions of ANCOVA: (1) normality of residuals; (2) homogeneity of variance for error; (3) Homogeneity of regression slopes, regression lines should be parallel among groups; (4) Linearity of regression; (5) independence of error terms.
+2. '''Homogeneity of regression slopes''': \(\beta\) is constant across groups (no group-by-covariate interaction)
-**Increase statistical power: ANCOVA reducing the with-in group error variance and increase statistical power. Use the F-test to evaluate difference between groups by dividing the explained variance between groups by the unexplained variance within the groups. $F=\frac{MS between}{MSwithin}$. If this value is greater than the critical value, then there is significant difference between groups. The influence of CVs is grouped into the denominator. When we control for the effect of CVs on the DV, remove it from the denominator making F bigger, thereby increased our power to find a significant effect if one exists.
+3. '''Normality''': Residuals \(\varepsilon_{ij} \sim N(0, \sigma^2)\)
-**Adjusting preexisting difference in nonequivalent groups: correct for initial group difference that exists on DV among several intact groups. In this case, CVs are used to adjust scores and make participants more similar than without the CV since the participants cannot be made equal through random assignment. CV may be so intimately related to the IV that removing the variance on the DV associated with CV would remove considerable variance on DV, which will make the result meaningless.
+4. '''Homoscedasticity''': The error variance is constant across groups
-**Conduct ANCOVA: (1) Test multicollinearity: if a CV is highly related to another CV, then it won’t adjust the DV over and above the other CV. One or the other should be removed since they are statistically redundant. (2) Test the homogeneity of variance assumption: Levene’s test of equality of error variances. (3) Test of the homogeneity of regression slopes assumption: tested by testing if the CV significantly interacts with the IV by running an ANCOVA model including both the IV and the CV*IV interaction term in the model. If the interaction term is significant, then we should not perform ANCOVA. Instead assess group difference on DV at particular level of CV. (4) Run ANCOVA analysis: if the interaction is not significant, then rerun the ANCOVA without the interaction term. In this analysis, use the adjust means and adjusted MSerror. (5) Follow-up analyses: if there was a significant main effect, then there is a significant difference between the levels of one IV, ignoring all other factors. To find out exactly which levels are significant from others, use the same follow-up tests for ANOVA.
+5. '''Independence''': Observations are independent
+6. '''No multicollinearity''': Covariates are not highly correlated with one another
-*==MANOVA==
+'''Diagnostic Tests:'''
-Multivariate analysis of variance or multiple analysis of variance is a statistical test procedure for comparing multivariate means of several groups. MANOVA is a generalized form of ANOVA.
+<pre>
-Relationship with ANOVA:
+# R function for ANCOVA diagnostics
-**MANOVA is an extension of ANOVA, though, unlike ANOVA, it uses the variance-covariance between variables in testing the statistical significance of the mean difference. It is similar to ANOVA, but allows adding of interval independents as covariates. Several specific use-cases for MANOVA: (1) to compare groups formed by categorical independent variables on group differences in a set of dependent variables;  (2) to use lack of difference for a set of dependent variables as a criterion for reducing a set of independent variables to a smaller, more easily modeled number of variables; (3) to identify the independent variables which differentiate a set of dependent variables the most.
+check_ancova_assumptions <- function(model, data, group_var, covariate) {
-**Analogous to ANOVA, MANOVA is based on the product of model variance matrix, $\Sigma_{model}$ and inverse of the error variance matrix, $\Sigma_{res}^{-1}, or A=\Sigma_{model} * \Sigma_{res}^{-1}$. The hypothesis that $\Sigma_{model} = \Sigma_{residual}$ implies that the product $A \sim I$. Invariance considerations imply the MANOVA statistic should be a measure of magnitude of the singular value decomposition of this matrix product, there is no unique choice owing to the multi-dimensional nature of the alternative hypothesis.
+  par(mfrow = c(2, 3))
-**MANOVA calculations closely resemble the ANOVA calculations, except that they are in vector and matrix forms. Assume that instead of a single dependent variance in the one-way ANOVA, there are three dependent variables as in our neuroimaging example above. Under the null hypothesis, it is assumed that scores on the three variables for each of the four groups are sampled from a tri-variate normal distribution mean vector $\mu =(\mu_{1}, \mu_{2}, \mu_{3})^{T}$ and variance-covariance matrix $\Sigma=\bigl(\begin{smallmatrix} \sigma_{1}^{2} & \rho_{1,2}\sigma_{1}\sigma{2}  &\rho_{1,3}\sigma_{1}\sigma{3} \\ \rho_{2,1}\sigma_{2}\sigma_{1} & \sigma_{2}^{2}  & \rho_{2,3}\sigma_{2}\sigma{3} \\\rho_{3,1}\sigma_{3}\sigma_{1}&\rho_{3,2}\sigma_{3}\sigma_{2} &\sigma_{3}^{2}\end{smallmatrix}\bigr)$. Where the covariance between variables 1 and 2 is expressed in terms of their correlation $(\rho_{1,2})$ and individual variances $(\sigma_{1}$ and $\sigma_{2})$. Under the null hypothesis, the scores for all subjects in groups 1, 2 and 3 are sampled from the same distribution.
-**Example: a $2*2$ factorial design with medication as one factor and type of therapy as the second factor. The matrix of the data looks includes the patient ID, the drug-treatment (vitamin-E or Placebo), Therapy (Cognitive/physical), MMSE, CDR, Imaging. It's better when the study the design is balanced with equal numbers of patients in all four conditions, as this avoid potential problems of sample-size-driven effects (e.g., variance estimates). Recall that a univariate ANOVA (on any single outcome measure) would contain three types of effects -- a main effect for therapy, a mean effect for medication, and an interaction between therapy and medication. Similarly, MANOVA will contain the same three effects: main effects: (1) Therapy: The univariate ANOVA main effect for therapy tells whether the physical vs. cognitive therapy groups have different means, irrespective of their medications. The MANOVA main effect for therapy tells whether the physical vs. cognitive therapy group have different mean vectors irrespective of their medication. The vectors in this case are the $(3*1)$ column vectors of means (MMSE, CDR and Imaging); (2) Medication: The univariate ANOVA for medication tells whether the placebo group has a different mean from the Vitamin-E group irrespective of the therapy type. The MANOVA main effect for medication tells whether the placebo group has a different mean vector from the VItamin-E group irrespective of therapy; interaction effects: (3) The univariate ANOVA interaction tells whether the four means for a single variable differ from the value predicted from knowledge of the main effects of therapy and medication. The MANOVA interaction term tells whether the four mean vectors differ from the vector predicted from knowledge of the main effects of therapy and medication.
+  # 1. Normality of residuals
-**Variance partitioning: MANOVA has the same properties as an ANOVA. The only difference is that an ANOVA deals with a $(1*1)$ mean vector for any group (as the response is univatiate). While a MANOVA deals with a $(k*1)$ vector for any group, $k$ being the number of dependent variables, 3 in our example. The variance-covariance matrix for 1 variable is a $(1*1)$ matrix that has only one element, the variance of the variable. What is the variance-covariance matrix for $k$ variables is a $(k*k)$ matrix with the variances on the diagonal and the covariances representing the off diagonal elements. The ANOVA partitions the $(1*1)$ covariance matrix into a part due to error and a part due to the researcher-specified hypotheses (the two main effects and the interaction term). That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}.$ Likewise, MANOVA partitions its $(k*k)$ covariance matrix into a part due to research-hypotheses and a part due to error. Thus, in out example, MANOVA will have a $(3*3)$ covariance matrix for total variability, a $(3*3)$ covariance matrix due to therapy, a $(3*3)$ covariance matrix due to medication, a $(3*3)$ covariance matrix due to the interaction of therapy with medication, and finally a $(3*3)$ covariance matrix for the error. That is: $V_{total} = V_{therapy} + V_{medication} + V_{therapy*medication} + V_{error}$. Now,$V$ stands for the appropriate $(3*3)$ matrix, as opposed to $(1*1)$ value, as in ANOVA. The second equation is the matrix-form of the first one. Here is how we interpret these matrices. The error matrix looks like:
+  residuals <- resid(model)
-<center>
+  qqnorm(residuals, main = "Q-Q Plot of Residuals")
-{| class="wikitable" style="text-align:center; width:45%" border="1"
+  qqline(residuals)
-|-
+  shapiro_test <- shapiro.test(residuals)
-|	||MMSE	||CDR	||Imaging
+  cat("Shapiro-Wilk normality test: W =", shapiro_test$statistic,
-|-
+      "p =", shapiro_test$p.value, "\n")
-|MMSE||	$V_{error1}$||	COV(error1, error2)||	COV(error1, error3)
-|-
+  # 2. Homogeneity of variance
-|CDR||	COV(error2, error1)||$V_{error2}$||	COV(error2, error1)
+  plot(fitted(model), residuals,
-|-
+       xlab = "Fitted Values", ylab = "Residuals",
-|Imaging||	COV(error3, error1)||	COV(error3, error2)	||$V_{error3}$
+       main = "Residuals vs Fitted")
-|-
+  abline(h = 0, col = "red")
-|}
-</center>
+  # Levene's test (requires the car package; skipped if it is not installed)
+  if (requireNamespace("car", quietly = TRUE)) {
+    levene_test <- car::leveneTest(residuals ~ factor(data[[group_var]]))
+    cat("Levene's test for homogeneity of variance: F =",
+        levene_test[["F value"]][1], "p =", levene_test[["Pr(>F)"]][1], "\n")
+  }
+  # 3. Linearity check
+  plot(data[[covariate]], residuals,
+       xlab = covariate, ylab = "Residuals",
+       main = "Residuals vs Covariate")
+  abline(h = 0, col = "red")
+  # 4. Homogeneity of regression slopes
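+  response_var <- all.vars(formula(model))[1]  # response name taken from the fitted model, not hard-coded
+  interaction_model <- lm(as.formula(paste(response_var, "~", group_var, "*", covariate)), data = data)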
+  anova_interaction <- anova(model, interaction_model)
+  cat("\nTest for homogeneity of slopes (interaction test):\n")
+  print(anova_interaction)
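+  # 5. Multicollinearity (VIF is only defined for models with at least two terms)
+  if (requireNamespace("car", quietly = TRUE) &&
+      length(attr(terms(model), "term.labels")) > 1) {
+    vif_values <- car::vif(model)
+    cat("\nVariance Inflation Factors (VIF):\n")
+    print(vif_values)
+  }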
+  par(mfrow = c(1, 1))
+}
+</pre>
-*Common statistics are summarized based on the root (eigenvalues) \lambda_{p} of the A matrix: (1) Samuel Stanley Wilks’, $\Lambda_{Wilks}=\prod_{1,\cdots,p} (1/(1+\lambda_{p}))=det(I+A)^{-1}=det(\Sigma_{res})/det(\Sigma_{res}+\Sigma_{model})$ distributed as lambda $(\Lambda)$; (2) The Pillai-M.S. Bartlett trace, $\Lambda_{Pillai}=\sum_{1,\cdots,p}(1/(1+\lambda_{p}))=tr((I+A)^{-1})$; (3) The Lawley-Hotelling trace, $\Lambda_{LH}=\sum_{1,\cdots,p}(\lambda_{p})=tr(A)$; (4) Roy’s greatest root, $\Lambda_{Roy}=max_{p}(\lambda_{p})=\left \| A \right \|_{\infty}$. The 4 major types of MANOVA test, the statistical power of these tests follow: $Pillai’s > Wilk’s > Hotelling’s > Roy’s Robustness$.
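+The function ''check_ancova_assumptions_simple()'' below is a more defensive variant: it reads the response name from the model formula, builds a properly nested pair of models for the homogeneity-of-slopes test, handles a missing p-value, and returns the test results as a list.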
-**Let the $A$ statistic be the ratio of the sum of squares for an hypothesis and the sum of squares for error. Let $H$ denote the hypothesis sums of squares and cross products matrix, and let $E$ denote the error sums of squares and cross products matrix. The multivariate $A$ statistic is the matrix $A = HE_{-1}.$ Notice how mean squares (that is, covariance matrices) disappear from MANOVA just as they did for ANOVA. All hypothesis tests may be performed on the matrix $A$. Note also that because both $H$ and $E$ are symmetric, $HE^{-1}=E^{-1} H$. This is one special case where the order of matrix multiplication does not matter.
-**All MANOVA tests are made on $A=E^{-1}H$. There are four different multivariate tests that are made on this matrix. Each of the four test statistics has its own associated $F$ ratio. In some cases the four tests give an exact $F$ ratio for testing the null hypothesis and in other cases the $F$ ratio is only approximate. The reason for four different statistics and for approximations is that the MANOVA calculations may be complicated in some cases (i.e., the sampling distribution of the $F$ statistic in some multivariate cases would be difficult to compute exactly.) Suppose there are $k$ dependent variables in the MANOVA, and let $\lambda I$ denote the ith eigenvalue of $A=E^{-1}H$.
-**Wilk’s $\Lambda:\, 1- \Lambda$ is an index of variance explained by the model, $\eta^{2}$ is a measure of effect size analogous to $R^{2}$ in regression. $Wilk’s\: \Lambda$ is the pooled ratio of error variance to effect variance plus error variance: $\Lambda=\frac{|E|}{|H+E|}=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.
-**Pillai’s Trace: Pillai’s criterion is the pooled effect variances. $Pillai’s\: trace=trace[H(H+E)^{-1}]=\prod_{i=1}^{k}\frac{1}{1+\lambda_{i}}$.
-**Hotelling’s Trace: the pooled ratio of effect variance to error variance:$trace(A)=trace[HE^{-1}]=\sum_{i=1}^{k}\lambda_{i}$.
-**Roy’s largest root: gives an upper bound for the F statistic, $Roy’s\: largest \,root=max(\hat{\lambda_{i}})$.
-*==MANCOVA==
+<pre>
-A multivariate analysis of covariance MANCOVA is a statistical an extension of ANCOVA and is designed for cases where there is more than one dependent variable and where the control of concomitant continuous independent variables is required. The process of characterizing a covariate in a data source allows the reduction of the magnitude of the error term, represented in the MANCOVA design as $MS_{error}$. The MANCOVA allows the characterization of the difference in group means in regards to a linear combination of multiple dependent variables, while simultaneously controlling for covariates.
+check_ancova_assumptions_simple <- function(model, data, group_var, covariate) {
-**Assumptions: (1) normality, for each group, each dependent variable follows a normal distribution and any linear combination of dependent variables are normally distributed; (2) independence of observations from all other observations; (3) homogeneity of variance: each dependent variable demonstrate similar levels of variance across each independent variable; (4) homogeneity of covariance: the intercorrelation matrix between dependent variables equals to each other across all levels of independent variable.
+  # Get response variable name from model formula
-**Covariate represents the source of variance that has not been controlled in the experiment and is believed to affect the dependent variable. And methods like ANCOVA and MANCOVA aim to remove the effects of such uncontrolled variation in order to increase statistical power and to ensure an accurate measurement of the true relationship between independent and dependent variables.
+  response_var <- all.vars(formula(model))[1]
-**In some studies with covariates it happens that the F value actually becomes smaller (less significant) after including covariates in the design. This indicates that the covariates are not only correlated with the dependent variable, but also with the between-groups factors.
+  cat("=== ANCOVA ASSUMPTION CHECKS ===\n")
+  cat("Response variable:", response_var, "\n")
+  cat("Group variable:", group_var, "\n")
+  cat("Covariate:", covariate, "\n\n")
+  # Set up plot layout
+  par(mfrow = c(2, 3))
+  # 1. Normality of residuals
+  residuals <- resid(model)
+  qqnorm(residuals, main = "Q-Q Plot of Residuals")
+  qqline(residuals)
+  shapiro_test <- shapiro.test(residuals)
+  cat("1. Normality (Shapiro-Wilk): W =", round(shapiro_test$statistic, 4),
+      ", p =", format.pval(shapiro_test$p.value, digits = 4), "\n")
+  # 2. Homogeneity of variance
+  fitted_vals <- fitted(model)
+  plot(fitted_vals, residuals,
+       xlab = "Fitted Values", ylab = "Residuals",
+       main = "Residuals vs Fitted")
+  abline(h = 0, col = "red", lty = 2)
+  # 3. Linearity check (residuals vs covariate)
+  plot(data[[covariate]], residuals,
+       xlab = covariate, ylab = "Residuals",
+       main = paste("Residuals vs", covariate))
+  abline(h = 0, col = "red", lty = 2)
+  # 4. Homogeneity of regression slopes (uses a properly nested model comparison)
+  # Get all terms from the original model
+  model_terms <- attr(terms(model), "term.labels")
+  # Remove any existing interaction involving the group_var and covariate
+  # Keep all other terms
+  other_terms <- model_terms[!grepl(paste0(group_var, ":", covariate, "|",
+                                          covariate, ":", group_var), model_terms)]
+  # Build the model formula without the interaction
+  if (length(other_terms) > 0) {
+    base_formula <- paste(response_var, "~", paste(other_terms, collapse = " + "))
+  } else {
+    base_formula <- paste(response_var, "~ 1")
+  }
+  # Build interaction model formula by adding the interaction term
+  interaction_formula <- paste(base_formula, "+", group_var, "*", covariate)
+  # Fit both models
+  base_model <- lm(as.formula(base_formula), data = data)
+  interaction_model <- lm(as.formula(interaction_formula), data = data)
+  # Test for significant interaction
+  interaction_test <- anova(base_model, interaction_model)
+  cat("\n2. Homogeneity of slopes test:\n")
+  print(interaction_test)
+  # Check whether the interaction is significant (guarding against NA p-values)
+  p_value <- interaction_test$`Pr(>F)`[2]
+  if (!is.na(p_value)) {
+    if (p_value < 0.05) {
+      cat("\nWARNING: Significant interaction detected (p =",
+          format.pval(p_value, digits = 4),
+          "). Slopes are not homogeneous.\n")
+      # Plot different slopes; colors follow factor level order throughout
+      grp <- factor(data[[group_var]])
+      plot(data[[covariate]], data[[response_var]],
+           col = as.integer(grp),
+           xlab = covariate, ylab = response_var,
+           main = "Different Slopes by Group (Interaction Present)")
+      # Add regression lines for each group (level i is drawn in color i)
+      groups <- levels(grp)
+      for (i in seq_along(groups)) {
+        group_data <- data[which(grp == groups[i]), ]
+        if (nrow(group_data) > 1) {
+          abline(lm(as.formula(paste(response_var, "~", covariate)),
+                   data = group_data),
+                 col = i, lwd = 2)
+        }
+      }
+      legend("topright", legend = groups,
+             col = seq_along(groups),
+             lwd = 2, pch = 1)
+    } else {
+      cat("\nNo significant interaction (p =",
+          format.pval(p_value, digits = 4),
+          "). Homogeneity of slopes assumption is satisfied.\n")
+    }
+  } else {
+    cat("\nNote: Could not calculate p-value for interaction test.\n")
+    cat("Model comparison summary:\n")
+    print(interaction_test)
+  }
+  # 5. Residual histogram
+  hist(residuals, main = "Histogram of Residuals",
+       xlab = "Residuals", col = "lightblue")
+  par(mfrow = c(1, 1))
+  # Additional diagnostics
+  cat("\n3. Additional Statistics:\n")
+  cat("- Mean of residuals:", round(mean(residuals), 4), "\n")
+  cat("- SD of residuals:", round(sd(residuals), 4), "\n")
+  cat("- Max absolute residual:", round(max(abs(residuals)), 4), "\n")
+  # Check for outliers (residuals > 3 SD)
+  if (sd(residuals) > 0) {
+    outlier_count <- sum(abs(scale(residuals)) > 3, na.rm = TRUE)
+    cat("- Potential outliers (>3 SD):", outlier_count, "\n")
+  }
+  # Return diagnostic results
+  return(list(
+    shapiro_test = shapiro_test,
+    interaction_test = interaction_test,
+    p_value_interaction = p_value
+  ))
+}
+</pre>
 ===Applications===
-[http://rer.sagepub.com/content/68/3/350.short   This article] examined articles published in several prominent educational journals to investigate the use of data analysis tools by researchers in four research paradigms: between-subjects univariate designs, between-subjects multivariate designs, repeated measures designs, and covariance designs. In addition to examining specific details pertaining to the research design (e.g., sample size, group size equality/inequality) and methods employed for data analysis, the authors also catalogued whether (a) validity assumptions were examined, (b) effect size indices were reported, (c) sample sizes were selected on the basis of power considerations, and (d) appropriate textbooks and/or articles were cited to communicate the nature of the analyses that were performed. The present analyses imply that researchers rarely verify that validity assumptions are satisfied and that, accordingly, they typically use analyses that are nonrobust to assumption violations. In addition, researchers rarely report effect size statistics, nor do they routinely perform power analyses to determine sample size requirements. Recommendations are offered to rectify these shortcomings.
+====Example 1: Clinical Trial with Baseline Adjustment====
+<pre>
+# Simulated clinical trial data
+set.seed(123)
+n <- 100
+trial_data <- data.frame(
+  patient_id = 1:n,
+  treatment = factor(rep(c("Drug", "Placebo"), each = n/2)),
+  baseline_bp = rnorm(n, mean = 150, sd = 15),
+  age = sample(40:75, n, replace = TRUE),
+  bmi = rnorm(n, mean = 28, sd = 4)
+)
-[http://www.tandfonline.com/doi/abs/10.1080/03610919808813485#.U-aSFhZTWdA  This article] investigated the performance of ANOVA, MANOVA, WLS, and GEE for repeated ordinal data with small sample sizes. Repeated ordinal outcomes are common in behavioral and medical sciences. Due to the familiarity, simplicity and robustness of ANOVA methodology, this approach has been frequently used for repeated ordinal data. Weighted least squares (WLS) and generalized estimating equations (GEE) are usually the procedures of choice for repeated ordinal data since, unlike ANOVA, they generally make no or few untenable assumptions. However, these methods are based on asymptotic results and their properties are not well understood for small samples. Moreover, few software packages have procedures for implementing these methods. For a design with two groups and four time points, the simulation results indicated that ANOVA with the Huynh-Feldt adjustment performed matrix, known as sphericity, or the H‐F condition, is a sufficient condition for the usual F tests to be valid.
+# Generate post-treatment BP with treatment effect and baseline correlation
+trial_data$post_bp <- with(trial_data,
+  baseline_bp * 0.7 +
+  ifelse(treatment == "Drug", -15, -5) +
+  (age - 60) * 0.2 +
+  (bmi - 28) * 0.5 +
+  rnorm(n, 0, 8)
+)
-===Software===
+cat("=== Basic ANOVA (ignoring covariates) ===\n")
+anova_simple <- aov(post_bp ~ treatment, data = trial_data)
+print(summary(anova_simple))
-RCODE:
+cat("\n=== ANCOVA (adjusting for baseline BP) ===\n")
-fit <- aov(y ~ A, data = mydata)  #one way ANOVA (completely randomized design)
+ancova_model <- aov(post_bp ~ baseline_bp + treatment, data = trial_data)  # covariate first: aov() uses sequential SS, so treatment is tested after adjusting for baseline
-fit <- aov(y ~ A+B, data=mydata)  # randomized block design where B is the blocking factor
+print(summary(ancova_model))
-fit <- aov(y ~ A+B+A*B, data=mydata)  ## two way factorial design
-fit <- aov(y ~ A+x, data=mydata)  ## analysis of covariance
-## for within subject designs, the data frame has to be rearranged for each measurement on a subject to be a separate observation
-fit <- aov(y ~ A+Error(subject/A), data=mydata)  ## one within factor
-fit <- aov(y ~(w1*w2*B1*B2)+Error(Subject/(W1*W2))+(B1*B2),data=mydata)  # two within factors W1 and W2, two between factors B1 and B2.
-## 2*2 factorial MANOVA with 3 dependent variables
-Y <- cbind(y1,y2,y3)
-fit <- manova(Y ~ A*B)
-===Problems===
+cat("\n=== ANCOVA (multiple covariates) ===\n")
+ancova_full <- lm(post_bp ~ treatment + baseline_bp + age + bmi, data = trial_data)
+print(summary(ancova_full))
-Use data of the CPI (consumer price index) for food, housing, transportation and medical care from 1981 to 2007 to do a two-way analysis of the covariance in R. We take ‘Month’ as one factor and the item the CPI measured on as another factor and did a 2* 2 factorial design. The data are linked at [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way  Consumer Price Index].
+cat("\n=== Adjusted Means ===\n")
+library(emmeans)
+adj_means <- emmeans(ancova_full, specs = ~ treatment)
+print(adj_means)
- In R:
+cat("\n=== Pairwise Comparisons with Adjustment ===\n")
- CPI <- read.csv('/Users/yufangli/Desktop/CPI_Food.csv',header=T)
+pairwise_comparisons <- pairs(adj_means, adjust = "holm")
- attach(CPI)
+print(pairwise_comparisons)
- summary(CPI)
- Month <- factor(Month)
- CPI_Item <- factor(CPI_Item)
- fit <- aov(CPI_Value ~ Month + CPI_Item + CPI_Item*Month, data=CPI)
- fit
- Call:
- aov(formula = CPI_Value ~ Month + CPI_Item + CPI_Item * Month,
- data = CPI)
- Terms:
+# Diagnostic checks
-                    Month  CPI_Item Month:CPI_Item Residuals
+cat("\n=== Model Diagnostics ===\n")
- Sum of Squares     3282.6 1078702.9          706.8 2987673.8
+check_ancova_assumptions_simple(ancova_full, trial_data, "treatment", "baseline_bp")
- Deg. of Freedom        11         3             33      1248
+</pre>
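The adjusted means reported by `emmeans` can be reproduced by hand for a single-covariate model. The following is a minimal sketch on separately simulated data (the data frame `d` and its columns `group`, `x`, `y` are illustrative names, not part of the example above): each raw group mean is shifted by the pooled slope times the distance between that group's covariate mean and the grand covariate mean.

```r
# Sketch: ANCOVA-adjusted group means computed by hand (base R only).
# Simulated data; all names here are illustrative assumptions.
set.seed(1)
n <- 60
d <- data.frame(
  group = factor(rep(c("Drug", "Placebo"), each = n / 2)),
  x = rnorm(n, mean = 150, sd = 15)            # covariate (e.g., a baseline measure)
)
d$y <- 0.7 * d$x + ifelse(d$group == "Drug", -15, -5) + rnorm(n, sd = 8)

fit <- lm(y ~ group + x, data = d)
b <- unname(coef(fit)["x"])                    # pooled within-group slope

raw_means <- tapply(d$y, d$group, mean)        # unadjusted group means
x_means   <- tapply(d$x, d$group, mean)        # covariate mean in each group
adj_means <- raw_means - b * (x_means - mean(d$x))

# The same quantity obtained by predicting at the grand covariate mean
nd <- data.frame(
  group = factor(levels(d$group), levels = levels(d$group)),
  x = mean(d$x)
)
pred <- predict(fit, newdata = nd)

print(cbind(raw = raw_means, adjusted = adj_means, predicted = pred))
```

The two routes agree because, with group indicators in the model, residuals sum to zero within each group, so the fitted line for each group passes through that group's mean point.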
- Residual standard error: 48.92821
+====Example 2: Educational Intervention Study====
- Estimated effects may be unbalanced
+<pre>
+# Using built-in iris dataset to demonstrate MANCOVA
+data(iris)
+# Create a categorical variable and covariate
+set.seed(456)
+iris$treatment <- factor(sample(c("Method_A", "Method_B", "Control"),
+                                nrow(iris), replace = TRUE))
+iris$pretest_score <- iris$Sepal.Length * 10 + rnorm(nrow(iris), 50, 5)
- summary(fit)
+# MANCOVA with multiple DVs
-                 Df  Sum Sq  Mean Sq  F value Pr(>F)
+cat("=== MANCOVA Example ===\n")
- Month            11    3283     298     0.125       1
+Y <- cbind(iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
- CPI_Item          3 1078703  359568 150.197 <2e-16 ***
+manova_model <- manova(Y ~ pretest_score + treatment, data = iris)  # covariate first: tests are sequential, so treatment is adjusted for pretest_score
- Month:CPI_Item   33     707      21   0.009      1
- Residuals      1248   2987674    2394
----
- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- fit2 <- aov(CPI_Value ~ CPI_Item, data=CPI)
+cat("\n--- Wilks' Lambda Test ---\n")
- fit2
+print(summary(manova_model, test = "Wilks"))
- Call:
-   aov(formula = CPI_Value ~ CPI_Item, data = CPI)
- Terms:
+cat("\n--- Pillai's Trace Test ---\n")
-                CPI_Item Residuals
+print(summary(manova_model, test = "Pillai"))
- Sum of Squares   1078703   2991663
- Deg. of Freedom        3      1292
- Residual standard error: 48.11994
+cat("\n--- Individual ANCOVAs ---\n")
- Estimated effects may be unbalanced
+for (i in 1:3) {
- summary(fit2)
+  cat("\nDV", i, ":\n")
-              Df  Sum Sq Mean Sq F value Pr(>F)
+  ancova_indiv <- lm(Y[, i] ~ treatment + pretest_score, data = iris)
- CPI_Item       3 1078703  359568   155.3 <2e-16 ***
+  print(summary(ancova_indiv))
- Residuals   1292 2991663    2316
+}
----
- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Seems like the CPI_Item is an important factor while month is not a significant factor of CPI based on the dataset we got.
+# Visualizing adjusted means
+library(ggplot2)
+library(dplyr)
+# Calculate adjusted means using emmeans
+if (require(emmeans)) {
+  adj_means_plot <- list()
+  for (i in 1:3) {
+    model <- lm(Y[, i] ~ treatment + pretest_score, data = iris)
+    emm <- emmeans(model, specs = ~ treatment)
+    adj_means_plot[[i]] <- as.data.frame(emm) %>%
+      mutate(DV = paste("DV", i))
+  }
+  plot_data <- bind_rows(adj_means_plot)
+  ggplot(plot_data, aes(x = treatment, y = emmean, fill = treatment)) +
+    geom_bar(stat = "identity", alpha = 0.7) +
+    geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
+    facet_wrap(~ DV, scales = "free_y") +
+    labs(title = "Adjusted Means with 95% Confidence Intervals",
+         y = "Adjusted Mean", x = "Treatment Group") +
+    theme_minimal()
+}
+</pre>
-===References===
+====Example 3: Real Dataset Analysis - Plant Growth====
-[http://www.statsoft.com/Textbook/ANOVA-MANOVA  ANOVA-MANOVA]
+<pre>
+# Using PlantGrowth dataset with simulated covariate
+data(PlantGrowth)
+set.seed(789)
+PlantGrowth$soil_quality <- rnorm(nrow(PlantGrowth), mean = 5, sd = 1) +
+  ifelse(PlantGrowth$group == "trt1", 0.5,
+         ifelse(PlantGrowth$group == "trt2", -0.5, 0))
+cat("=== Plant Growth ANCOVA ===\n")
+cat("Research question: Do treatments affect plant weight after controlling for soil quality?\n\n")
+# EDA
+cat("--- Exploratory Data Analysis ---\n")
+cat("Group means (unadjusted):\n")
+print(aggregate(weight ~ group, data = PlantGrowth, mean))
+cat("\nCorrelation between weight and soil quality:",
+    cor(PlantGrowth$weight, PlantGrowth$soil_quality), "\n")
+# Visualization
+ggplot(PlantGrowth, aes(x = soil_quality, y = weight, color = group)) +
+  geom_point(size = 3) +
+  geom_smooth(method = "lm", se = FALSE) +
+  labs(title = "Relationship between Plant Weight and Soil Quality by Treatment",
+       x = "Soil Quality", y = "Plant Weight") +
+  theme_minimal()
+# ANCOVA analysis
+ancova_plant <- lm(weight ~ group + soil_quality, data = PlantGrowth)
+cat("\n--- ANCOVA Results ---\n")
+print(summary(ancova_plant))
+# Check assumptions
+cat("\n--- Assumption Checks ---\n")
+check_ancova_assumptions_simple(ancova_plant, PlantGrowth, "group", "soil_quality")
+# Contrasts and post-hoc tests
+cat("\n--- Post-hoc Comparisons ---\n")
+library(multcomp)
+contrasts <- glht(ancova_plant, linfct = mcp(group = "Tukey"))
+print(summary(contrasts))
+# Effect sizes
+cat("\n--- Effect Sizes ---\n")
+library(effectsize)
+eta_sq <- eta_squared(ancova_plant, partial = TRUE)  # avoid shadowing the eta_squared() function
+print(eta_sq)
+# Power analysis
+cat("\n--- Power Analysis ---\n")
+library(pwr)
+# Post hoc ("achieved") power for the group effect; interpret with caution
+f2 <- eta_sq$Eta2_partial[1] / (1 - eta_sq$Eta2_partial[1])
+# v is the residual df of the fitted model (30 - 3 groups - 1 covariate = 26)
+power_achieved <- pwr.f2.test(u = 2, v = df.residual(ancova_plant), f2 = f2, sig.level = 0.05)$power
+cat("Achieved power for treatment effect:", round(power_achieved, 3), "\n")
+</pre>
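Partial eta squared and Cohen's f² can also be computed directly from the ANOVA table, which makes the link between the two effect sizes explicit. The sketch below assumes sequential (Type I) sums of squares with the covariate entered first; `pg` and its `soil_quality` column are hypothetical stand-ins, simulated here so the block runs on its own.

```r
# Sketch: partial eta^2 and Cohen's f^2 for the group effect, by hand (base R).
data(PlantGrowth)
set.seed(789)
pg <- PlantGrowth
pg$soil_quality <- rnorm(nrow(pg), mean = 5, sd = 1)  # hypothetical covariate

fit <- lm(weight ~ soil_quality + group, data = pg)   # covariate entered first
tab <- anova(fit)                                     # sequential sums of squares

ss_group <- tab["group", "Sum Sq"]
ss_resid <- tab["Residuals", "Sum Sq"]

partial_eta2 <- ss_group / (ss_group + ss_resid)      # share of non-covariate variance due to group
f2 <- partial_eta2 / (1 - partial_eta2)               # algebraically equal to ss_group / ss_resid

cat("Partial eta^2:", round(partial_eta2, 3), " Cohen's f^2:", round(f2, 3), "\n")
```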
+===Advanced Topics===
+====1) Nonparametric ANCOVA====
+<pre>
+# Rank-based ANCOVA via Rfit (Quade's rank test is another nonparametric option)
+if (require(Rfit)) {
+  cat("=== Rank-Based ANCOVA ===\n")
+  rank_model <- rfit(weight ~ group + soil_quality, data = PlantGrowth)
+  print(summary(rank_model))
+}
+</pre>
+====2) Mixed Effects ANCOVA====
+<pre>
+# For repeated measures or clustered data
+if (require(lme4)) {
+  # Simulated longitudinal data
+  set.seed(321)
+  long_data <- data.frame(
+    subject = rep(1:20, each = 3),
+    time = rep(1:3, 20),
+    treatment = factor(rep(sample(c("A", "B"), 20, replace = TRUE), each = 3)),
+    baseline = rep(rnorm(20, 100, 15), each = 3),  # one baseline value per subject
+    response = NA
+  )
+  # Generate responses
+  long_data$response <- with(long_data,
+    ifelse(treatment == "A", 5, -5) +  # Treatment effect
+    0.7 * baseline +  # Baseline (covariate) effect
+    rnorm(60, 0, 10) +  # Residual error
+    rep(rnorm(20, 0, 5), each = 3)  # Random intercept
+  )
+  # Linear mixed model (ANCOVA with random effects)
+  mixed_model <- lmer(response ~ treatment + baseline + time + (1|subject),
+                      data = long_data)
+  cat("\n=== Mixed Effects ANCOVA ===\n")
+  print(summary(mixed_model))
+  # Compare with standard ANCOVA
+  standard_ancova <- lm(response ~ treatment + baseline + time, data = long_data)
+  cat("\n--- Comparison: Standard vs Mixed ANCOVA ---\n")
+  cat("Standard ANCOVA AIC:", AIC(standard_ancova), "\n")
+  cat("Mixed model AIC:", AIC(mixed_model), "\n")
+}
+</pre>
+====3) Power Analysis and Sample Size Planning====
+<pre>
+# Power analysis for ANCOVA
+calculate_power_ancova <- function(k, n_per_group, f2, rho, alpha = 0.05) {
+  # k: number of groups
+  # n_per_group: sample size per group
+  # f2: effect size (Cohen's f^2)
+  # rho: correlation between covariate and DV
+  N <- k * n_per_group
+  df1 <- k - 1
+  df2 <- N - k - 1  # -1 for covariate
+  # Adjust f2 for covariate inclusion
+  f2_adj <- f2 / (1 - rho^2)
+  power <- pf(qf(1 - alpha, df1, df2), df1, df2, ncp = N * f2_adj, lower.tail = FALSE)
+  return(power)
+}
+cat("=== ANCOVA Power Calculator ===\n")
+# Example: 3 groups, 20 per group, medium effect (f2 = 0.15), rho = 0.5
+power_example <- calculate_power_ancova(k = 3, n_per_group = 20, f2 = 0.15, rho = 0.5)
+cat("Power for specified design:", round(power_example, 3), "\n")
+# Sample size calculation for desired power
+library(pwr)
+ss <- pwr.f2.test(u = 2, f2 = 0.15, sig.level = 0.05, power = 0.80)
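+# Total N = v + k + 1 (k = 3 group parameters plus 1 covariate), split evenly across the 3 groups
+cat("Required sample size (per group for 3 groups):", ceiling((ss$v + 4) / 3), "\n")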
+</pre>
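The `f2 / (1 - rho^2)` adjustment used in the power function follows from a variance-reduction argument. As a sketch, assume a single covariate whose within-group correlation with the DV is \(\rho\), and randomized groups so that the covariate does not absorb any of the group effect (here \(\sigma^2_{\tau}\) denotes the variance of the group effects; the notation is introduced for this illustration):

\[ \sigma^2_{\text{adj}} = \sigma^2 (1 - \rho^2), \qquad f^2_{\text{adj}} = \frac{\sigma^2_{\tau}}{\sigma^2_{\text{adj}}} = \frac{f^2}{1 - \rho^2} \]

For the example above, \(f^2 = 0.15\) and \(\rho = 0.5\) give \(f^2_{\text{adj}} = 0.15 / 0.75 = 0.20\), so the noncentrality parameter is \(N \cdot f^2_{\text{adj}} = 60 \times 0.20 = 12\).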
+===Software Implementation===
+<pre>
+# Comprehensive ANCOVA analysis function
+run_ancova_analysis <- function(formula, data, group_var, covariates,
+                                pairwise = TRUE, diagnostics = TRUE) {
+  # Fit model
+  model <- lm(formula, data = data)
+  cat("=== ANCOVA ANALYSIS REPORT ===\n\n")
+  cat("Model formula:", deparse(formula), "\n")
+  cat("Number of groups:", length(unique(data[[group_var]])), "\n")
+  cat("Number of covariates:", length(covariates), "\n")
+  cat("Total observations:", nrow(data), "\n\n")
+  # Model summary
+  cat("--- MODEL SUMMARY ---\n")
+  print(summary(model))
+  cat("\n")
+  # ANOVA table
+  cat("--- ANOVA TABLE ---\n")
+  print(anova(model))
+  cat("\n")
+  # Assumption checks
+  if (diagnostics) {
+    cat("--- ASSUMPTION DIAGNOSTICS ---\n")
+    check_ancova_assumptions_simple(model, data, group_var, covariates[1])
+  }
+  # Adjusted means and comparisons
+  if (pairwise && require(emmeans)) {
+    cat("\n--- ADJUSTED GROUP MEANS ---\n")
+    adj_means <- emmeans(model, specs = as.formula(paste("~", group_var)))
+    print(adj_means)
+    cat("\n--- PAIRWISE COMPARISONS (Holm-adjusted) ---\n")
+    pairwise_results <- pairs(adj_means, adjust = "holm")
+    print(pairwise_results)
+    # Plot adjusted means
+    plot_data <- as.data.frame(adj_means)
+    p <- ggplot(plot_data, aes_string(x = group_var, y = "emmean", fill = group_var)) +
+      geom_bar(stat = "identity", alpha = 0.7) +
+      geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2) +
+      labs(title = "Adjusted Group Means with 95% Confidence Intervals",
+           y = "Adjusted Mean", x = group_var) +
+      theme_minimal()
+    print(p)
+  }
+  # Effect sizes
+  if (require(effectsize)) {
+    cat("\n--- EFFECT SIZES ---\n")
+    es <- eta_squared(model, partial = TRUE)
+    print(es)
+  }
+  return(model)
+}
+# Example usage with mtcars
+cat("\n\n=== EXAMPLE: ANCOVA with mtcars dataset ===\n")
+data(mtcars)
+mtcars$cyl <- factor(mtcars$cyl)
+mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
+# Run comprehensive analysis
+model_result <- run_ancova_analysis(
+  formula = mpg ~ cyl + hp + wt,
+  data = mtcars,
+  group_var = "cyl",
+  covariates = c("hp", "wt")
+)
+</pre>
+===Common Issues and Solutions===
+====Violation of Homogeneity of Slopes====
+<pre>
+# When slopes differ by group
+check_slopes <- function(data, dv, group, covariate) {
+  # Fit interaction model
+  interaction_model <- lm(as.formula(paste(dv, "~", group, "*", covariate)), data = data)
+  # Test interaction
+  simple_model <- lm(as.formula(paste(dv, "~", group, "+", covariate)), data = data)
+  anova_result <- anova(simple_model, interaction_model)
+  cat("Test for homogeneity of regression slopes:\n")
+  print(anova_result)
+  if (anova_result$`Pr(>F)`[2] < 0.05) {
+    cat("\nWARNING: Significant interaction detected. Slopes are not homogeneous.\n")
+    cat("Consider:\n")
+    cat("1. Reporting separate slopes for each group\n")
+    cat("2. Using Johnson-Neyman technique to identify regions of significance\n")
+    cat("3. Considering moderated regression analysis\n")
+    # Visualize different slopes
+    library(ggplot2)
+    p <- ggplot(data, aes_string(x = covariate, y = dv, color = group)) +
+      geom_point() +
+      geom_smooth(method = "lm", se = FALSE) +
+      labs(title = "Different Slopes by Group (Interaction Present)",
+           subtitle = "ANCOVA assumption violated") +
+      theme_minimal()
+    print(p)
+  } else {
+    cat("\nNo significant interaction. Homogeneity of slopes assumption satisfied.\n")
+  }
+}
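+# Example (cyl must be a factor so that it is treated as a grouping variable)
+mtcars$cyl <- factor(mtcars$cyl)
+check_slopes(mtcars, "mpg", "cyl", "wt")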
+</pre>
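When the slopes do differ, the first suggested remedy is to report a separate slope for each group. A minimal sketch of how those slopes can be read off an interaction model, using `mtcars` (the copy `cars` is a local name chosen here): the reference group's slope is the covariate coefficient, and every other group's slope adds its interaction term.

```r
# Sketch: group-specific covariate slopes from an interaction model (base R).
data(mtcars)
cars <- transform(mtcars, cyl = factor(cyl))

fit_int <- lm(mpg ~ cyl * wt, data = cars)
co <- coef(fit_int)

# Reference level (4 cylinders) uses the main-effect slope;
# the other levels add their interaction coefficients.
slopes <- c(
  "4" = unname(co["wt"]),
  "6" = unname(co["wt"] + co["cyl6:wt"]),
  "8" = unname(co["wt"] + co["cyl8:wt"])
)

# Cross-check: fit a simple regression within each group
sep <- sapply(split(cars, cars$cyl),
              function(g) unname(coef(lm(mpg ~ wt, data = g))["wt"]))

print(rbind(from_interaction_model = slopes, separate_fits = sep))
```

The two rows match because a model with a full group-by-covariate interaction gives each group its own intercept and slope.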
+====Dealing with Missing Data====
+<pre>
+# Multiple imputation for ANCOVA with missing covariate values
+if (require(mice)) {
+  cat("=== Multiple Imputation for ANCOVA ===\n")
+  # Create dataset with missing values
+  set.seed(123)
+  missing_data <- PlantGrowth
+  missing_data$soil_quality[sample(1:nrow(missing_data), 5)] <- NA
+  # Perform multiple imputation
+  imputed_data <- mice(missing_data, m = 5, method = "pmm", printFlag = FALSE)
+  # Fit ANCOVA on each imputed dataset
+  models <- with(imputed_data, lm(weight ~ group + soil_quality))
+  # Pool results
+  pooled_results <- pool(models)
+  cat("Pooled coefficients:\n")
+  print(summary(pooled_results))
+}
+</pre>
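The pooling step combines the per-imputation fits using Rubin's rules. The sketch below applies those rules to one coefficient by hand; `pool_rubin` is a hypothetical helper written for illustration, and the estimates and standard errors are made-up numbers rather than output from the model above. It omits the degrees-of-freedom calculation that a full implementation would also report.

```r
# Sketch: Rubin's rules for one coefficient across m imputed-data fits (base R).
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)                  # pooled point estimate
  w <- mean(se^2)                     # within-imputation variance
  b <- var(est)                       # between-imputation variance
  t_var <- w + (1 + 1 / m) * b        # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

# Made-up estimates and standard errors from m = 5 imputations
est <- c(0.42, 0.39, 0.45, 0.41, 0.43)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)

pooled <- pool_rubin(est, se)
print(pooled)
```

The pooled standard error is never smaller than the average within-imputation standard error, since the between-imputation term reflects the extra uncertainty due to the missing values.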
+===Problems and Exercises===
+1. '''Conceptual Problems''':
+   a) Prove that the adjusted group mean in ANCOVA is unbiased when the covariate is uncorrelated with treatment assignment
+   b) Derive the variance of the adjusted treatment effect estimator
+   c) Show how ANCOVA increases power compared to ANOVA
+2. '''Applied Problems''':
+   a) Analyze the SOCR [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index dataset] using ANCOVA
+   b) Conduct a MANCOVA on the [https://archive.ics.uci.edu/ml/datasets/iris Iris dataset] with species as the factor and sepal length as covariate
+   c) Perform power analysis for a planned study with 4 groups, expecting a medium effect (f² = 0.15), and correlation ρ = 0.6 between covariate and DV
-[http://en.wikipedia.org/wiki/Analysis_of_variance  ANOVA Wikipedia]
+3. '''Simulation Study''':
+<pre>
+# Simulation to demonstrate ANCOVA advantages
+simulate_ancova_power <- function(n_sim = 1000, n_per_group = 30,
+                                  effect_size = 0.5, rho = 0.6) {
+  power_anova <- numeric(n_sim)
+  power_ancova <- numeric(n_sim)
+  for (i in 1:n_sim) {
+    # Simulate data
+    group <- factor(rep(1:3, each = n_per_group))
+    covariate <- rnorm(n_per_group * 3)
+    # Generate response with treatment effect and covariate relationship
+    response <- effect_size * as.numeric(group) + rho * covariate +
+                rnorm(n_per_group * 3, 0, sqrt(1 - rho^2))
+    data <- data.frame(response, group, covariate)
+    # Fit ANOVA
+    anova_model <- aov(response ~ group, data = data)
+    power_anova[i] <- summary(anova_model)[[1]]$`Pr(>F)`[1] < 0.05
+    # Fit ANCOVA (covariate entered first so the group test is covariate-adjusted)
+    ancova_model <- aov(response ~ covariate + group, data = data)
+    power_ancova[i] <- summary(ancova_model)[[1]]$`Pr(>F)`[2] < 0.05
+  }
+  cat("Simulation Results (n =", n_sim, "):\n")
+  cat("ANOVA power:", mean(power_anova), "\n")
+  cat("ANCOVA power:", mean(power_ancova), "\n")
+  cat("Power increase:", mean(power_ancova) - mean(power_anova), "\n")
+  return(data.frame(ANOVA = power_anova, ANCOVA = power_ancova))
+}
-[http://wiki.stat.ucla.edu/socr/index.php/Probability_and_statistics_EBook#Chapter_XI:_Analysis_of_Variance_.28ANOVA.29  SOCR]
+# Run simulation
+results <- simulate_ancova_power(n_sim = 500, n_per_group = 25,
+                                 effect_size = 0.4, rho = 0.5)
+</pre>
-[http://en.wikipedia.org/wiki/Analysis_of_covariance  ANCOVA Wikipedia]
+===References===
-[http://en.wikipedia.org/wiki/Multivariate_analysis_of_variance  MANOVA Wikipedia]
+1. Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). *Designing Experiments and Analyzing Data: A Model Comparison Perspective* (3rd ed.). Routledge.
-[http://en.wikipedia.org/wiki/MANCOVA  MANCOVA Wikipedia]
+2. Rutherford, A. (2011). *ANOVA and ANCOVA: A GLM Approach* (2nd ed.). Wiley.
+3. Tabachnick, B. G., & Fidell, L. S. (2018). *Using Multivariate Statistics* (7th ed.). Pearson.
+4. Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. *Journal of Abnormal Psychology, 110*(1), 40-48.
+5. Stevens, J. P. (2012). *Applied Multivariate Statistics for the Social Sciences* (5th ed.). Routledge.
+===Online Resources===
+* [https://www.jstatsoft.org/article/view/v033i01 ANCOVA in R Tutorial]
+* [https://cran.r-project.org/web/packages/emmeans/vignettes/emmeans.html Estimated Marginal Means]
+* [https://cran.r-project.org/web/views/Multivariate.html Multivariate Statistics in R]
+* [https://www.socr.umich.edu SOCR Resources]
 <hr>
-* SOCR Home page: http://www.socr.umich.edu
+* SOCR Home page: https://www.socr.umich.edu
 {{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_ANCOVA}}