Difference between revisions of "SMHS GLM"

Revision as of 18:15, 8 December 2025

Scientific Methods for Health Sciences - Generalized Linear Modeling (GLM)

Overview

Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear regression that allows response variables to have error distribution models other than a normal distribution. GLM extends linear regression by allowing the linear model to be related to the response variable via a link function and enabling the variance of each measurement to be a function of its predicted value. This framework unifies statistical models including linear regression, logistic regression, and Poisson regression. Estimation methods include iteratively reweighted least squares for maximum likelihood estimation, Bayesian approaches, and least squares fitted to variance-stabilized responses.

Motivation

While linear regression models linear relationships between response and predictors, many real-world scenarios involve response variables that don't follow normal distributions. For example:

Binary outcomes (yes/no decisions) with probabilities bounded between 0 and 1
Count data (number of events) that follow Poisson distributions
Survival times that follow exponential or Weibull distributions

GLM provides a unified framework for these situations by allowing response variables from the exponential family of distributions.

Theory

1) GLM Components

A GLM consists of three components:

1. Random Component: The response variable $Y$ follows a distribution from the exponential family\[ f_Y(y|\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\} \]

  where \(\theta\) is the natural parameter and \(\phi\) is the dispersion parameter.

2. Systematic Component: The linear predictor $\eta$\[ \eta = \mathbf{X}\beta = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p \]

3. Link Function: $g(\cdot)$ that relates the mean $\mu = E[Y]$ to the linear predictor\[ g(\mu) = \eta \quad \text{or equivalently} \quad \mu = g^{-1}(\eta) \]

The variance function relates the variance to the mean\[ \text{Var}(Y) = a(\phi)V(\mu) \] where $V(\mu)$ is the variance function specific to the distribution.

2) Exponential Family Distributions

The exponential family includes many common distributions:

Distribution	Support	Natural Parameter $\theta$	$b(\theta)$	Canonical Link $g(\mu)$	Variance Function $V(\mu)$
Normal	$\mathbb{R}$	$\mu$	$\frac{\theta^2}{2}$	Identity: $\mu$	1
Binomial	$\{0,1,\ldots,n\}$	$\log\left(\frac{\mu}{n-\mu}\right)$	$n\log(1+e^\theta)$	Logit: $\log\left(\frac{\mu}{n-\mu}\right)$	$\mu\left(1-\frac{\mu}{n}\right)$
Poisson	$\mathbb{N}_0$	$\log(\mu)$	$e^\theta$	Log: $\log(\mu)$	$\mu$
Gamma	$\mathbb{R}^+$	$-\frac{1}{\mu}$	$-\log(-\theta)$	Inverse: $\frac{1}{\mu}$	$\mu^2$
Inverse Gaussian	$\mathbb{R}^+$	$-\frac{1}{2\mu^2}$	$-\sqrt{-2\theta}$	Inverse squared: $\frac{1}{\mu^2}$	$\mu^3$

3) Maximum Likelihood Estimation

For a GLM with $n$ independent observations, the log-likelihood is\[ \ell(\beta) = \sum_{i=1}^n \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \] where $\theta_i = \theta(\mu_i)$ and $\mu_i = g^{-1}(\mathbf{x}_i^\top\beta)$.

The score equations are\[ \mathbf{U}(\beta) = \frac{\partial\ell}{\partial\beta} = \mathbf{X}^\top\mathbf{W}(\mathbf{y} - \boldsymbol{\mu}) = 0 \] where $\mathbf{W} = \text{diag}\left\{\frac{1}{a(\phi)V(\mu_i)[g'(\mu_i)]^2}\right\}$.

The Fisher information matrix is\[ \mathcal{I}(\beta) = E\left[-\frac{\partial^2\ell}{\partial\beta\partial\beta^\top}\right] = \mathbf{X}^\top\mathbf{W}\mathbf{X} \]

Parameters are estimated via Iteratively Reweighted Least Squares (IRLS)\[ \beta^{(t+1)} = \beta^{(t)} + (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)} \] where $z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})g'(\mu_i^{(t)})$.

4) Deviance and Goodness-of-Fit

The deviance measures goodness-of-fit\[ D = 2[\ell(\text{saturated model}) - \ell(\text{fitted model})] \] For nested models $M_0 \subset M_1$\[ D_{M_0} - D_{M_1} \sim \chi^2_{df_{M_0} - df_{M_1}} \quad \text{under } H_0 \]

The scaled deviance is $D^* = D/\phi$, and Pearson's chi-square statistic is\[ \chi^2_P = \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \]

5) Model Diagnostics

Key diagnostic tools:

Pearson residuals: $r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}$
Deviance residuals: $r_i^D = \text{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}$
Leverage: $h_{ii}$ from the hat matrix $\mathbf{H} = \mathbf{W}^{1/2}\mathbf{X}(\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{1/2}$
Cook's distance: $D_i = \frac{r_i^2 h_{ii}}{p(1-h_{ii})}$

Applications

Example 1: Logistic Regression for Contraceptive Use

# Load and prepare data
# https://grodri.github.io/datasets/cuse.dat
cuse <- read.table("https://grodri.github.io/datasets/cuse.dat", header=TRUE)
cat("First few rows of dataset:\n")
print(head(cuse))

# Fit binomial GLM with logit link
model1 <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
              family = binomial, data = cuse)

cat("\n=== Model Summary ===\n")
summary(model1)

cat("\n=== Confidence Intervals ===\n")
confint(model1)

cat("\n=== Odds Ratios with 95% CI ===\n")
exp_coef <- exp(coef(model1))
exp_ci <- exp(confint(model1))
cbind(OR = exp_coef, exp_ci)

cat("\n=== Model Comparison (Likelihood Ratio Test) ===\n")
# Reduced model without education
model_reduced <- glm(cbind(using, notUsing) ~ age + wantsMore,
                     family = binomial, data = cuse)
anova(model_reduced, model1, test = "Chisq")

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model1, which = 1:4)

Example 2: Poisson Regression for Count Data

# Example with built-in R dataset: AIDS cases in Belgium
# Load necessary libraries
if (!require("MASS")) install.packages("MASS")
library(MASS)

# Load AIDS dataset
data(Aids2)
cat("AIDS dataset structure:\n")
str(Aids2)

# Prepare data: count of AIDS cases by year and state
aids_counts <- aggregate(cbind(count = 1:nrow(Aids2)) ~ age + state,
                         data = Aids2, FUN = length)

# Fit Poisson regression
model2 <- glm(count ~ age + state, family = poisson, data = aids_counts)

cat("\n=== Poisson Model Summary ===\n")
summary(model2)

cat("\n=== Check for Overdispersion ===\n")
# Pearson chi-square statistic
pearson_chi2 <- sum(residuals(model2, type = "pearson")^2)
df_resid <- df.residual(model2)
dispersion <- pearson_chi2 / df_resid
cat("Pearson χ²:", pearson_chi2, "\n")
cat("Dispersion parameter:", dispersion, "\n")
cat("p-value for H0: φ=1:", pchisq(pearson_chi2, df_resid, lower.tail = FALSE), "\n")

# If overdispersed, fit quasipoisson model
if (dispersion > 1.5) {
  cat("\n=== Fitting Quasi-Poisson Model (accounting for overdispersion) ===\n")
  model2_qp <- glm(count ~ age + state, family = quasipoisson, data = aids_counts)
  summary(model2_qp)
}

# Predict expected counts
cat("\n=== Predictions for First 10 Observations ===\n")
predictions <- predict(model2, type = "response", se.fit = TRUE)
pred_df <- data.frame(
  Observed = aids_counts$count[1:10],
  Predicted = predictions$fit[1:10],
  SE = predictions$se.fit[1:10],
  Lower = predictions$fit[1:10] - 1.96 * predictions$se.fit[1:10],
  Upper = predictions$fit[1:10] + 1.96 * predictions$se.fit[1:10]
)
print(pred_df)

Example 3: Gamma Regression for Positive Continuous Data

# Example with insurance claims data
if (!require("insuranceData")) install.packages("insuranceData")
library(insuranceData)
data(dataCar)

cat("Car insurance claims dataset:\n")
str(dataCar)

# Filter for positive claim amounts
claims_positive <- subset(dataCar, claimcst0 > 0)

# Fit Gamma GLM with log link (common for monetary amounts)
model3 <- glm(claimcst0 ~ agecat + area + veh_age,
              family = Gamma(link = "log"),
              data = claims_positive)

cat("\n=== Gamma Model Summary ===\n")
summary(model3)

cat("\n=== Checking Gamma Model Assumptions ===\n")
# Check residuals
res_gamma <- residuals(model3, type = "deviance")
par(mfrow = c(1, 2))
hist(res_gamma, main = "Deviance Residuals", xlab = "Residuals")
qqnorm(res_gamma, main = "Q-Q Plot of Residuals")
qqline(res_gamma)

# Scale-location plot
fitted_values <- fitted(model3)
plot(fitted_values, sqrt(abs(res_gamma)),
     xlab = "Fitted Values", ylab = "√|Deviance Residuals|",
     main = "Scale-Location Plot")
lines(lowess(fitted_values, sqrt(abs(res_gamma))), col = "red")

Software Implementation in R

Note that `family` is a function (a "closure"), not a list or an object with a $\$family$ component. In thecode, we're passing the family function itself (e.g., gaussian) to the $run\_glm\_analysis()$ function, and later accessing $family\$family$, which doesn’t exist—because gaussian is a function, not a fitted model object. In R, gaussian, binomial, poisson, etc., are functions that return family objects when called. We're passing the function, not the result of calling it. However, `glm()` internally calls `family()`, i.e., `gaussian()`, to get the actual family list object, which does have a $\$family$ element. Instead of checking $family\$family$, we extract the family name from the fitted model, not from the input argument. Specifically, after fitting the model, we use $model\$family\$family$, which is a character string like "gaussian", "binomial", etc.

# Comprehensive GLM analysis function
run_glm_analysis <- function(formula, data, family, link = NULL) {
  # Fit the model
  if (!is.null(link)) {
    model <- glm(formula, data = data, family = family(link = link))
  } else {
    model <- glm(formula, data = data, family = family)
  }
  
  # Model summary
  cat("=== MODEL SUMMARY ===\n")
  print(summary(model))
  
  # Confidence intervals
  cat("\n=== 95% CONFIDENCE INTERVALS ===\n")
  print(confint(model))
  
  # Goodness-of-fit tests
  cat("\n=== GOODNESS-OF-FIT ===\n")
  cat("Null Deviance:", model$null.deviance, "on", model$df.null, "df\n")
  cat("Residual Deviance:", model$deviance, "on", model$df.residual, "df\n")
  cat("AIC:", AIC(model), "\n")
  
  # Check for overdispersion (for Poisson and binomial families)
  family_name <- model$family$family  # ✅ Extract from fitted model
  if (family_name %in% c("poisson", "binomial", "quasipoisson", "quasibinomial")) {
    cat("\n=== OVERDISPERSION CHECK ===\n")
    dispersion <- model$deviance / model$df.residual
    cat("Dispersion parameter:", round(dispersion, 4), "\n")
    if (abs(dispersion - 1) > 0.1) {
      direction <- ifelse(dispersion > 1, "over", "under")
      cat("Note: Significant", direction, "dispersion detected\n")
    }
  }
  
  return(model)
}

# Example usage with mtcars dataset
cat("\n\n=== EXAMPLE: GAUSSIAN GLM (equivalent to linear regression) ===\n")
data(mtcars)
model_gaussian <- run_glm_analysis(
  formula = mpg ~ wt + hp + cyl,
  data = mtcars,
  family = gaussian,
  link = "identity"
)

# Compare with lm() for verification
cat("\n=== COMPARISON WITH lm() ===\n")
model_lm <- lm(mpg ~ wt + hp + cyl, data = mtcars)
print(summary(model_lm))

Common GLM Families and Links in R

Family	Default Link	Alternative Links	Typical Use
`gaussian()`	`identity`	`log`, `inverse`	Continuous, symmetric data
`binomial()`	`logit`	`probit`, `cauchit`, `log`, `cloglog`	Binary/count proportions
`poisson()`	`log`	`identity`, `sqrt`	Count data
`Gamma()`	`inverse`	`log`, `identity`	Positive continuous, right-skewed
`inverse.gaussian()`	`1/μ²`	`inverse`, `log`, `identity`	Positive continuous, highly skewed
`quasi()`	`identity`	User-defined	Overdispersed data

Practical Considerations

Model Selection:
- AIC (Akaike Information Criterion): $ \text{AIC} = -2\ell + 2p $
- BIC (Bayesian Information Criterion): $ \text{BIC} = -2\ell + p\log(n) $
- Cross-validation: Particularly useful for predictive performance.

Handling Overdispersion: For Poisson models: $ \text{Var}(Y) = \phi\mu $ where $\phi > 1$ indicates overdispersion

Solutions:

- Use quasi-Poisson model
- Use negative binomial distribution
- Include random effects.

Zero-Inflation: For count data with excess zeros, consider:
- Zero-inflated Poisson (ZIP) model
- Zero-inflated negative binomial (ZINB) model
- Hurdle models.

Advanced Topics

Mixed Effects GLM (GLMM)

Extension incorporating random effects\[ g(E[Y|b]) = \mathbf{X}\beta + \mathbf{Z}b \] where $ b \sim N(0, \mathbf{G}) $.

Implementation in R with lme4::glmer():

library(lme4)

# generate random data
n <- 500
id <- seq(n)
day <- 1:20
mydata <- expand.grid(id = id, day = day)
set.seed(1)
trt <- sample(c("control", "treat"), size = n, replace = TRUE)
sex <- sample(c("female", "male"), size = n, replace = TRUE)
mydata$trt <- trt[mydata$id]
mydata$sex <- sex[mydata$id]
mydata <- mydata[order(mydata$id, mydata$day),]
rownames(mydata) <- NULL
head(mydata, n = 10)

mydata$trtsex <- interaction(mydata$trt, mydata$sex)
probs <- c(0.40, 0.85, 0.30, 0.50)
names(probs) <- levels(mydata$trtsex)
mydata$p <- probs[mydata$trtsex]

set.seed(3)
r_probs <- rnorm(n = n, mean = 0, sd = 0.03)
mydata$random_p <- r_probs[mydata$id]
mydata$p <- mydata$p + mydata$random_p

# use the probabilities to generate zeroes and ones with the binom() function.
mydata$y <- rbinom(n = nrow(mydata), size = 1, prob = mydata$p)

# using the sim data, inspect the first few records.

head(mydata[c("id", "day", "trt", "sex", "p", "y")])



# Example with binary outcome and random intercept
m <- glmer(y ~ trt * sex + (1|id), data = mydata, family = binomial)
summary(m, corr = FALSE)

Bayesian GLM

Using Markov Chain Monte Carlo (MCMC) methods.

library(rstanarm)
# Bayesian logistic regression
bayes_model <- stan_glm(y ~ day +  trt + sex + p, # predictors,
                        family = binomial,
                        data = mydata,
                        prior = normal(0, 2.5),
                        prior_intercept = normal(0, 5))

print(bayes_model, digits = 2)

# Coefficient Plot (Forest Plot)

plot(bayes_model, "areas")  # density + intervals
# OR more customizable:
library(bayesplot)
mcmc_intervals(bayes_model, prob = 0.95) +
  ggplot2::labs(title = "Posterior Distributions of Coefficients")

# Posterior Predictive Checks (PPC)
# Check if the model can reproduce data like the observed
pp_check(bayes_model, plotfun = "hist", nreps = 100) +
  ggplot2::labs(title = "Posterior Predictive Check (Histogram of y_rep vs y)")

Another Bayesian Experiment.

library(rstanarm)
set.seed(123)

# Simulate data — explicitly make trt and sex into factors
n <- 500
mydata <- data.frame(
  day = sample(1:30, n, replace = TRUE),
  trt = factor(sample(c("A", "B"), n, replace = TRUE), levels = c("A", "B")),
  sex = factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")),
  p   = rnorm(n, mean = 50, sd = 10)
)

# True log-odds
logit_p <- -1 - 0.02 * mydata$day + 
           0.6 * (as.numeric(mydata$trt) - 1) +   # A=0, B=1
           0.3 * (as.numeric(mydata$sex) - 1) +    # F=0, M=1
           0.04 * mydata$p

mydata$y <- rbinom(n, size = 1, prob = plogis(logit_p))

# Fit model
bayes_model <- stan_glm(
  y ~ day + trt + sex + p,
  family = binomial,
  data = mydata,
  prior = normal(0, 2.5),
  prior_intercept = normal(0, 5),
  seed = 42,
  refresh = 0
)

# Create New Data
# Confirm variables are factors
str(mydata[c("trt", "sex")])

# Create prediction grid
newdata <- expand.grid(
  day = seq(min(mydata$day), max(mydata$day), length.out = 30),
  trt = levels(mydata$trt)[1],      # e.g., "A"
  sex = levels(mydata$sex)[1],      # e.g., "F"
  p   = median(mydata$p)
)

# Verify it worked
stopifnot(nrow(newdata) > 0)
str(newdata)

# Generate Predictions from the Bayesian Posterior Probability (after fitting the Bayesian Model):
# Posterior predicted probabilities
post_pred <- posterior_linpred(bayes_model, newdata = newdata, transform = TRUE)

# Summarize
pred_summary <- data.frame(
  day = newdata$day,
  prob = apply(post_pred, 2, median),
  lower = apply(post_pred, 2, quantile, 0.025),
  upper = apply(post_pred, 2, quantile, 0.975)
)

# Plot
library(ggplot2)
ggplot(pred_summary, aes(x = day, y = prob)) +
  geom_line(color = "blue") +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, fill = "blue") +
  labs(
    title = "Predicted Probability of Outcome by Day",
    subtitle = "Holding treatment = A, sex = F, and p = median",
    y = "P(y = 1)",
    x = "Day"
  ) +
  theme_minimal()

Problems

1. Conceptual Exercises:

  a) Derive the score equations for a Poisson GLM with log link
  b) Show that the binomial distribution with logit link is the canonical link
  c) Prove that the deviance for a normal GLM equals the residual sum of squares

2. Applied Problems:

  a) Analyze the Consumer Price Index data using appropriate GLM
  b) Model the Monetary Base data considering temporal autocorrelation
  c) Using the iris dataset, build a multinomial logistic regression to classify species

3. Practice Simulation Study:

   # Simulate data from a logistic regression model
   set.seed(123)
   n <- 1000
   x1 <- rnorm(n)
   x2 <- rnorm(n)
   beta <- c(0.5, 1, -0.5)
   linear_predictor <- beta[1] + beta[2]*x1 + beta[3]*x2
   probabilities <- plogis(linear_predictor)
   y <- rbinom(n, size = 1, prob = probabilities)
   
   # Fit model and evaluate performance
   sim_data <- data.frame(y = y, x1 = x1, x2 = x2)
   model_sim <- glm(y ~ x1 + x2, family = binomial, data = sim_data)
   
   # Calculate bias and MSE
   beta_hat <- coef(model_sim)
   bias <- beta_hat - beta
   mse <- mean((beta_hat - beta)^2)
   
   cat("True parameters:", beta, "\n")
   cat("Estimated parameters:", beta_hat, "\n")
   cat("Bias:", bias, "\n")
   cat("MSE:", mse, "\n")

References

1. McCullagh, P., & Nelder, J. A. (1989). *Generalized Linear Models* (2nd ed.). Chapman and Hall.

2. Dobson, A. J., & Barnett, A. G. (2018). *An Introduction to Generalized Linear Models* (4th ed.). CRC Press.

3. Agresti, A. (2015). *Foundations of Linear and Generalized Linear Models*. Wiley.

4. Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). *Regression: Models, Methods and Applications*. Springer.

5. Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R* (2nd ed.). Chapman and Hall/CRC.

Online Resources

SOCR Home page: https://www.socr.umich.edu

Translate this page:

(default)	Deutsch	Español	Français	Italiano	Português	日本語	България	الامارات العربية المتحدة	Suomi	इस भाषा में	Norge
한국어	中文	繁体中文	Русский	Nederlands	Ελληνικά	Hrvatska	Česká republika	Danmark	Polska	România	Sverige

Distribution	Support	Natural Parameter \(\theta\)	\(b(\theta)\)	Canonical Link \(g(\mu)\)	Variance Function \(V(\mu)\)
Normal	\(\mathbb{R}\)	\(\mu\)	\(\frac{\theta^2}{2}\)	Identity: \(\mu\)	1
Binomial	\(\{0,1,\ldots,n\}\)	\(\log\left(\frac{\mu}{n-\mu}\right)\)	\(n\log(1+e^\theta)\)	Logit: \(\log\left(\frac{\mu}{n-\mu}\right)\)	\(\mu\left(1-\frac{\mu}{n}\right)\)
Poisson	\(\mathbb{N}_0\)	\(\log(\mu)\)	\(e^\theta\)	Log: \(\log(\mu)\)	\(\mu\)
Gamma	\(\mathbb{R}^+\)	\(-\frac{1}{\mu}\)	\(-\log(-\theta)\)	Inverse: \(\frac{1}{\mu}\)	\(\mu^2\)
Inverse Gaussian	\(\mathbb{R}^+\)	\(-\frac{1}{2\mu^2}\)	\(-\sqrt{-2\theta}\)	Inverse squared: \(\frac{1}{\mu^2}\)	\(\mu^3\)