Difference between revisions of "SMHS GLM"

Latest revision as of 17:09, 13 March 2026

Scientific Methods for Health Sciences - Generalized Linear Modeling (GLM)

Overview

Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear regression that allows response variables to have error distribution models other than a normal distribution. GLM extends linear regression by allowing the linear model to be related to the response variable via a link function and enabling the variance of each measurement to be a function of its predicted value. This framework unifies statistical models including linear regression, logistic regression, and Poisson regression. Estimation methods include iteratively reweighted least squares for maximum likelihood estimation, Bayesian approaches, and least squares fitted to variance-stabilized responses.

Motivation

While linear regression models linear relationships between response and predictors, many real-world scenarios involve response variables that don't follow normal distributions. For example:

Binary outcomes (yes/no decisions) with probabilities bounded between 0 and 1
Count data (number of events) that follow Poisson distributions
Survival times that follow exponential or Weibull distributions

GLM provides a unified framework for these situations by allowing response variables from the exponential family of distributions.

Theory

1) GLM Components

A GLM consists of three components:

1. Random Component: The response variable $Y$ follows a distribution from the exponential family\[f_Y(y|\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\}\] where $\theta$ is the natural parameter and $\phi$ is the dispersion parameter.

2. Systematic Component: The linear predictor $\eta$\[\eta = \mathbf{X}\beta = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p\]

3. Link Function: $g(\cdot)$ that relates the mean $\mu = E[Y]$ to the linear predictor\[g(\mu) = \eta \quad \text{or equivalently} \quad \mu = g^{-1}(\eta)\]

The variance function relates the variance to the mean\[\text{Var}(Y) = a(\phi)V(\mu)\] where $V(\mu)$ is the variance function specific to the distribution.

2) Exponential Family Distributions

The exponential family includes many common distributions:

Distribution	Support	Natural Parameter $\theta$	$b(\theta)$	Canonical Link $g(\mu)$	Variance Function $V(\mu)$
Normal	$\mathbb{R}$	$\mu$	$\frac{\theta^2}{2}$	Identity: $\mu$	1
Binomial	$\{0,1,\ldots,n\}$	$\log\left(\frac{\mu}{n-\mu}\right)$	$n\log(1+e^\theta)$	Logit: $\log\left(\frac{\mu}{n-\mu}\right)$	$\mu\left(1-\frac{\mu}{n}\right)$
Poisson	$\mathbb{N}_0$	$\log(\mu)$	$e^\theta$	Log: $\log(\mu)$	$\mu$
Gamma	$\mathbb{R}^+$	$-\frac{1}{\mu}$	$-\log(-\theta)$	Inverse: $\frac{1}{\mu}$	$\mu^2$
Inverse Gaussian	$\mathbb{R}^+$	$-\frac{1}{2\mu^2}$	$-\sqrt{-2\theta}$	Inverse squared: $\frac{1}{\mu^2}$	$\mu^3$

3) Maximum Likelihood Estimation

For a GLM with $n$ independent observations, the log-likelihood is\[ \ell(\beta) = \sum_{i=1}^n \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \] where $\theta_i = \theta(\mu_i)$ and $\mu_i = g^{-1}(\mathbf{x}_i^\top\beta)$.

The score equations are\[ \mathbf{U}(\beta) = \frac{\partial\ell}{\partial\beta} = \mathbf{X}^\top\mathbf{W}(\mathbf{y} - \boldsymbol{\mu}) = 0 \] where $\mathbf{W} = \text{diag}\left\{\frac{1}{a(\phi)V(\mu_i)[g'(\mu_i)]^2}\right\}$.

The Fisher information matrix is\[ \mathcal{I}(\beta) = E\left[-\frac{\partial^2\ell}{\partial\beta\partial\beta^\top}\right] = \mathbf{X}^\top\mathbf{W}\mathbf{X} \]

Parameters are estimated via Iteratively Reweighted Least Squares (IRLS)\[ \beta^{(t+1)} = \beta^{(t)} + (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)} \] where $z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})g'(\mu_i^{(t)})$.

4) Deviance and Goodness-of-Fit

The deviance measures goodness-of-fit\[ D = 2[\ell(\text{saturated model}) - \ell(\text{fitted model})] \] For nested models $M_0 \subset M_1$\[ D_{M_0} - D_{M_1} \sim \chi^2_{df_{M_0} - df_{M_1}} \quad \text{under } H_0 \]

The scaled deviance is $D^* = D/\phi$, and Pearson's chi-square statistic is\[ \chi^2_P = \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \]

5) Model Diagnostics

Key diagnostic tools:

Pearson residuals: $r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}$
Deviance residuals: $r_i^D = \text{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}$
Leverage: $h_{ii}$ from the hat matrix $\mathbf{H} = \mathbf{W}^{1/2}\mathbf{X}(\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{1/2}$
Cook's distance: $D_i = \frac{r_i^2 h_{ii}}{p(1-h_{ii})}$

Applications

Example 1: Logistic Regression for Contraceptive Use

# Load and prepare data
# https://grodri.github.io/datasets/cuse.dat
cuse <- read.table("https://grodri.github.io/datasets/cuse.dat", header=TRUE)
cat("First few rows of dataset:\n")
print(head(cuse))

# Fit binomial GLM with logit link
model1 <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
              family = binomial, data = cuse)

cat("\n=== Model Summary ===\n")
summary(model1)

cat("\n=== Confidence Intervals ===\n")
confint(model1)

cat("\n=== Odds Ratios with 95% CI ===\n")
exp_coef <- exp(coef(model1))
exp_ci <- exp(confint(model1))
cbind(OR = exp_coef, exp_ci)

cat("\n=== Model Comparison (Likelihood Ratio Test) ===\n")
# Reduced model without education
model_reduced <- glm(cbind(using, notUsing) ~ age + wantsMore,
                     family = binomial, data = cuse)
anova(model_reduced, model1, test = "Chisq")

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model1, which = 1:4)

Result Interpretation:

Let's explicate how to interpret the parameter estimates in this GLM model. Specifically, as this is a bivariate outcome, the estimates are not correlations

glm(formula = cbind(using, notUsing) ~ age + education + wantsMore, 
    family = binomial, data = cuse)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.8082     0.1590  -5.083 3.71e-07 ***
age25-29       0.3894     0.1759   2.214  0.02681 *  
age30-39       0.9086     0.1646   5.519 3.40e-08 ***
age40-49       1.1892     0.2144   5.546 2.92e-08 ***
educationlow  -0.3250     0.1240  -2.620  0.00879 ** 
wantsMoreyes  -0.8330     0.1175  -7.091 1.33e-12 ***

In a Generalized Linear Model (GLM) with a binomial family and a logit link (the default for R), the coefficients represent log-odds ratios. Since our outcome is structured as

   cbind(using, notUsing),

we are modeling the probability of "using" (success) versus "not using" (failure).

That is, the model interpretation is in terms of Log-Odds. In this model, the relationship between the covariate predictors and the outcome is defined by the logit function

$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \dots + \beta_kX_k. $

The estimate Sign matters:

Positive Estimate: As the predictor increases (or if the category is present), the probability of "using" increases.

Negative Estimate: As the predictor increases, the probability of "using" decreases.

The estimate Magnitude is interpreted via the "Exponentiation Trick". Because humans don't think naturally in "log-odds" terms, we usually exponentiate the coefficients ($e^\beta$) to get the Odds Ratios (OR).

Variable	Estimate (β)	Odds Ratio (e^β)	Interpretation
age40-49	1.1892	≈ 3.28	Women aged 40-49 have 3.28 times the odds of using contraception compared to the reference age group (likely <25).
wantsMoreyes	-0.8330	≈ 0.43	Women who want more children have 57% lower odds (1 - 0.43) of using contraception than those who don't.

Specific Breakdown of the Results:

Age (Categorical): R has treated age as a factor. The baseline (reference) group is the youngest group (under 25).

As age increases ($25-29 \ \to 30-39\ \to 40-49$), the coefficients become increasingly positive ($0.38 \to 0.90 \to 1.18$).

Meaning: Older women in this dataset are significantly more likely to use contraception than the youngest group.

Education:

Estimate: -0.3250

Meaning: Being in the low education group is associated with a decrease in the log-odds of using contraception compared to the high group. Specifically, their odds are about 28\% lower ($e^{-0.325} \approx 0.72$).

Wants More Children:

Estimate: -0.8330

Meaning: This is a strong negative predictor. If a woman wants more children, the probability of her using contraception drops significantly.

We can have a quick "reality check" on the measuring units. Unlike a standard correlation (which is bounded between -1 and 1), these estimates can be any real number.

An estimate of 0 means the variable has no effect on the odds (Odds Ratio = 1).

The z-value tells us how many standard errors the estimate is from zero. Since all p-values are small ($< 0.05$), all these predictors are statistically significant.

Example 2: Poisson Regression for Count Data

# Example with built-in R dataset: AIDS cases in Belgium
# Load necessary libraries
if (!require("MASS")) install.packages("MASS")
library(MASS)

# Load AIDS dataset
data(Aids2)
cat("AIDS dataset structure:\n")
str(Aids2)

# Prepare data: count of AIDS cases by year and state
aids_counts <- aggregate(cbind(count = 1:nrow(Aids2)) ~ age + state,
                         data = Aids2, FUN = length)

# Fit Poisson regression
model2 <- glm(count ~ age + state, family = poisson, data = aids_counts)

cat("\n=== Poisson Model Summary ===\n")
summary(model2)

cat("\n=== Check for Overdispersion ===\n")
# Pearson chi-square statistic
pearson_chi2 <- sum(residuals(model2, type = "pearson")^2)
df_resid <- df.residual(model2)
dispersion <- pearson_chi2 / df_resid
cat("Pearson χ²:", pearson_chi2, "\n")
cat("Dispersion parameter:", dispersion, "\n")
cat("p-value for H0: φ=1:", pchisq(pearson_chi2, df_resid, lower.tail = FALSE), "\n")

# If overdispersed, fit quasipoisson model
if (dispersion > 1.5) {
  cat("\n=== Fitting Quasi-Poisson Model (accounting for overdispersion) ===\n")
  model2_qp <- glm(count ~ age + state, family = quasipoisson, data = aids_counts)
  summary(model2_qp)
}

# Predict expected counts
cat("\n=== Predictions for First 10 Observations ===\n")
predictions <- predict(model2, type = "response", se.fit = TRUE)
pred_df <- data.frame(
  Observed = aids_counts$count[1:10],
  Predicted = predictions$fit[1:10],
  SE = predictions$se.fit[1:10],
  Lower = predictions$fit[1:10] - 1.96 * predictions$se.fit[1:10],
  Upper = predictions$fit[1:10] + 1.96 * predictions$se.fit[1:10]
)
print(pred_df)

Example 3: Gamma Regression for Positive Continuous Data

# Example with insurance claims data
if (!require("insuranceData")) install.packages("insuranceData")
library(insuranceData)
data(dataCar)

cat("Car insurance claims dataset:\n")
str(dataCar)

# Filter for positive claim amounts
claims_positive <- subset(dataCar, claimcst0 > 0)

# Fit Gamma GLM with log link (common for monetary amounts)
model3 <- glm(claimcst0 ~ agecat + area + veh_age,
              family = Gamma(link = "log"),
              data = claims_positive)

cat("\n=== Gamma Model Summary ===\n")
summary(model3)

cat("\n=== Checking Gamma Model Assumptions ===\n")
# Check residuals
res_gamma <- residuals(model3, type = "deviance")
par(mfrow = c(1, 2))
hist(res_gamma, main = "Deviance Residuals", xlab = "Residuals")
qqnorm(res_gamma, main = "Q-Q Plot of Residuals")
qqline(res_gamma)

# Scale-location plot
fitted_values <- fitted(model3)
plot(fitted_values, sqrt(abs(res_gamma)),
     xlab = "Fitted Values", ylab = "√|Deviance Residuals|",
     main = "Scale-Location Plot")
lines(lowess(fitted_values, sqrt(abs(res_gamma))), col = "red")

Software Implementation in R

Note that `family` is a function (a "closure"), not a list or an object with a $\$family$ component. In thecode, we're passing the family function itself (e.g., gaussian) to the $run\_glm\_analysis()$ function, and later accessing $family\$family$, which doesn’t exist—because gaussian is a function, not a fitted model object. In R, gaussian, binomial, poisson, etc., are functions that return family objects when called. We're passing the function, not the result of calling it. However, `glm()` internally calls `family()`, i.e., `gaussian()`, to get the actual family list object, which does have a $\$family$ element. Instead of checking $family\$family$, we extract the family name from the fitted model, not from the input argument. Specifically, after fitting the model, we use $model\$family\$family$, which is a character string like "gaussian", "binomial", etc.

# Comprehensive GLM analysis function
run_glm_analysis <- function(formula, data, family, link = NULL) {
  # Fit the model
  if (!is.null(link)) {
    model <- glm(formula, data = data, family = family(link = link))
  } else {
    model <- glm(formula, data = data, family = family)
  }
  
  # Model summary
  cat("=== MODEL SUMMARY ===\n")
  print(summary(model))
  
  # Confidence intervals
  cat("\n=== 95% CONFIDENCE INTERVALS ===\n")
  print(confint(model))
  
  # Goodness-of-fit tests
  cat("\n=== GOODNESS-OF-FIT ===\n")
  cat("Null Deviance:", model$null.deviance, "on", model$df.null, "df\n")
  cat("Residual Deviance:", model$deviance, "on", model$df.residual, "df\n")
  cat("AIC:", AIC(model), "\n")
  
  # Check for overdispersion (for Poisson and binomial families)
  family_name <- model$family$family  # ✅ Extract from fitted model
  if (family_name %in% c("poisson", "binomial", "quasipoisson", "quasibinomial")) {
    cat("\n=== OVERDISPERSION CHECK ===\n")
    dispersion <- model$deviance / model$df.residual
    cat("Dispersion parameter:", round(dispersion, 4), "\n")
    if (abs(dispersion - 1) > 0.1) {
      direction <- ifelse(dispersion > 1, "over", "under")
      cat("Note: Significant", direction, "dispersion detected\n")
    }
  }
  
  return(model)
}

# Example usage with mtcars dataset
cat("\n\n=== EXAMPLE: GAUSSIAN GLM (equivalent to linear regression) ===\n")
data(mtcars)
model_gaussian <- run_glm_analysis(
  formula = mpg ~ wt + hp + cyl,
  data = mtcars,
  family = gaussian,
  link = "identity"
)

# Compare with lm() for verification
cat("\n=== COMPARISON WITH lm() ===\n")
model_lm <- lm(mpg ~ wt + hp + cyl, data = mtcars)
print(summary(model_lm))

Common GLM Families and Links in R

Family	Default Link	Alternative Links	Typical Use
`gaussian()`	`identity`	`log`, `inverse`	Continuous, symmetric data
`binomial()`	`logit`	`probit`, `cauchit`, `log`, `cloglog`	Binary/count proportions
`poisson()`	`log`	`identity`, `sqrt`	Count data
`Gamma()`	`inverse`	`log`, `identity`	Positive continuous, right-skewed
`inverse.gaussian()`	`1/μ²`	`inverse`, `log`, `identity`	Positive continuous, highly skewed
`quasi()`	`identity`	User-defined	Overdispersed data

Practical Considerations

Model Selection:
- AIC (Akaike Information Criterion): $ \text{AIC} = -2\ell + 2p $
- BIC (Bayesian Information Criterion): $ \text{BIC} = -2\ell + p\log(n) $
- Cross-validation: Particularly useful for predictive performance.

Handling Overdispersion: For Poisson models: $ \text{Var}(Y) = \phi\mu $ where $\phi > 1$ indicates overdispersion

Solutions:

- Use quasi-Poisson model
- Use negative binomial distribution
- Include random effects.

Zero-Inflation: For count data with excess zeros, consider:
- Zero-inflated Poisson (ZIP) model
- Zero-inflated negative binomial (ZINB) model
- Hurdle models.

Advanced Topics

Mixed Effects GLM (GLMM)

Extension incorporating random effects\[ g(E[Y|b]) = \mathbf{X}\beta + \mathbf{Z}b \] where $ b \sim N(0, \mathbf{G}) $.

Implementation in R with lme4::glmer():

library(lme4)

# generate random data
n <- 500
id <- seq(n)
day <- 1:20
mydata <- expand.grid(id = id, day = day)
set.seed(1)
trt <- sample(c("control", "treat"), size = n, replace = TRUE)
sex <- sample(c("female", "male"), size = n, replace = TRUE)
mydata$trt <- trt[mydata$id]
mydata$sex <- sex[mydata$id]
mydata <- mydata[order(mydata$id, mydata$day),]
rownames(mydata) <- NULL
head(mydata, n = 10)

mydata$trtsex <- interaction(mydata$trt, mydata$sex)
probs <- c(0.40, 0.85, 0.30, 0.50)
names(probs) <- levels(mydata$trtsex)
mydata$p <- probs[mydata$trtsex]

set.seed(3)
r_probs <- rnorm(n = n, mean = 0, sd = 0.03)
mydata$random_p <- r_probs[mydata$id]
mydata$p <- mydata$p + mydata$random_p

# use the probabilities to generate zeroes and ones with the binom() function.
mydata$y <- rbinom(n = nrow(mydata), size = 1, prob = mydata$p)

# using the sim data, inspect the first few records.

head(mydata[c("id", "day", "trt", "sex", "p", "y")])



# Example with binary outcome and random intercept
m <- glmer(y ~ trt * sex + (1|id), data = mydata, family = binomial)
summary(m, corr = FALSE)

Bayesian GLM

Using Markov Chain Monte Carlo (MCMC) methods.

library(rstanarm)
# Bayesian logistic regression
bayes_model <- stan_glm(y ~ day +  trt + sex + p, # predictors,
                        family = binomial,
                        data = mydata,
                        prior = normal(0, 2.5),
                        prior_intercept = normal(0, 5))

print(bayes_model, digits = 2)

# Coefficient Plot (Forest Plot)

plot(bayes_model, "areas")  # density + intervals
# OR more customizable:
library(bayesplot)
mcmc_intervals(bayes_model, prob = 0.95) +
  ggplot2::labs(title = "Posterior Distributions of Coefficients")

# Posterior Predictive Checks (PPC)
# Check if the model can reproduce data like the observed
pp_check(bayes_model, plotfun = "hist", nreps = 100) +
  ggplot2::labs(title = "Posterior Predictive Check (Histogram of y_rep vs y)")

Another Bayesian Experiment.

library(rstanarm)
set.seed(123)

# Simulate data — explicitly make trt and sex into factors
n <- 500
mydata <- data.frame(
  day = sample(1:30, n, replace = TRUE),
  trt = factor(sample(c("A", "B"), n, replace = TRUE), levels = c("A", "B")),
  sex = factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")),
  p   = rnorm(n, mean = 50, sd = 10)
)

# True log-odds
logit_p <- -1 - 0.02 * mydata$day + 
           0.6 * (as.numeric(mydata$trt) - 1) +   # A=0, B=1
           0.3 * (as.numeric(mydata$sex) - 1) +    # F=0, M=1
           0.04 * mydata$p

mydata$y <- rbinom(n, size = 1, prob = plogis(logit_p))

# Fit model
bayes_model <- stan_glm(
  y ~ day + trt + sex + p,
  family = binomial,
  data = mydata,
  prior = normal(0, 2.5),
  prior_intercept = normal(0, 5),
  seed = 42,
  refresh = 0
)

# Create New Data
# Confirm variables are factors
str(mydata[c("trt", "sex")])

# Create prediction grid
newdata <- expand.grid(
  day = seq(min(mydata$day), max(mydata$day), length.out = 30),
  trt = levels(mydata$trt)[1],      # e.g., "A"
  sex = levels(mydata$sex)[1],      # e.g., "F"
  p   = median(mydata$p)
)

# Verify it worked
stopifnot(nrow(newdata) > 0)
str(newdata)

# Generate Predictions from the Bayesian Posterior Probability (after fitting the Bayesian Model):
# Posterior predicted probabilities
post_pred <- posterior_linpred(bayes_model, newdata = newdata, transform = TRUE)

# Summarize
pred_summary <- data.frame(
  day = newdata$day,
  prob = apply(post_pred, 2, median),
  lower = apply(post_pred, 2, quantile, 0.025),
  upper = apply(post_pred, 2, quantile, 0.975)
)

# Plot
library(ggplot2)
ggplot(pred_summary, aes(x = day, y = prob)) +
  geom_line(color = "blue") +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, fill = "blue") +
  labs(
    title = "Predicted Probability of Outcome by Day",
    subtitle = "Holding treatment = A, sex = F, and p = median",
    y = "P(y = 1)",
    x = "Day"
  ) +
  theme_minimal()

Gamma GLM vs. Beta Regression

The Gamma GLM and Beta Regression models are designed for very different types of "continuous" data. The fundamental difference lies in the domain (the range of possible values) and the variance structure of the data they expect.

The Gamma Model (glm)

The Gamma distribution is used for strictly positive, continuous data $(0, \infty)$. It is appropriate for data that is "skewed" to the right, where the variance increases as the mean increases. If a variable, like qol_score is a raw score (e.g., 0 to 100), the link, link = "log", ensures the predicted values remain positive.

In a Gamma GLM with a log link, we model the mean $\mu$ as

$\ln(\mu_i) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k.$

The probability density function (PDF) for the Gamma distribution is

$f(y; \alpha, \beta) = \frac{\beta^\alpha y^{\alpha-1} e^{-\beta y}}{\Gamma(\alpha)} \quad \text{for } y > 0.$

The Beta Model (betareg)

The Beta distribution, is used specifically for rates, proportions, or scores constrained between 0 and 1, $(0, 1)$. Unlike the Gamma model, which can go off to infinity, the Beta distribution is "boxed in" this interval. For instance, if qol_scaled is normalized, e.g., by dividing QoL score by its maximum value, it's values are forced into $(0, 1)$ interval. Beta distribution shape is veryflexible. It can be U-shaped, J-shaped, or symmetric, making it perfect for "ceiling" or "floor" effects in Quality of Life scores.

Beta regression typically uses a logit link by default to keep predictions between 0 and 1\[\ln\left(\frac{\mu_i}{1-\mu_i}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k.\]

The PDF for the Beta distribution (parameterized by mean $\mu$ and precision $\phi$) is\[f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma((1-\mu)\phi)} y^{\mu\phi-1} (1-y)^{(1-\mu)\phi-1} \quad \text{for } y \in (0, 1).\]

Key Differences between Gamma GLM and Beta Regression

Feature	Gamma Model (model4)	Beta Model (model_beta)
Outcome Range	$(0, \infty)$ (Positive numbers)	$(0, 1)$ (Proportions/Scaled)
Typical Use	Costs, rainfall, raw test scores	Percentages, scaled QoL indices
Interpretation	Log-link: Coefficients are multiplicative	Logit-link: Coefficients are odds ratios
Boundaries	Can handle very large values	Naturally handles "ceiling effects"

If the outcome dependent variable (DV) qol_score is a raw sum that could theoretically be much higher, stick with Gamma. But when qol_score is a bounded instrument (like a visual analog scale from 0–1) with many observations scoring near the very top or bottom, Beta may be superior because it understands the "walls" of the scale.

library(readr)
 
 # REMOTE URL
 url_wide <- "https://socr.umich.edu/docs/uploads/2026/SOCR_CRDS_CompAnalysis_CaseStudy_wide.csv"
 url_long <- "https://socr.umich.edu/docs/uploads/2026/SOCR_CRDS_CompAnalysis_CaseStudy_long.csv"
 
 # Import: Reading the Wide data
 df_wide <- read_csv(url_wide, col_types = cols(
     patient_id      = col_factor(),    # Prevents math on IDs
     age             = col_double(),
     gender          = col_character(),
     treatment       = col_character(),
     smoking_history = col_character(),
     baseline_fvc    = col_double(),
     fvc_0           = col_double(),
     fvc_6           = col_double(),
     fvc_12          = col_double(),
     qol_score       = col_double(),
     walk_dist       = col_double(),
     success         = col_integer(),   # 0 or 1 is best as integer
     hospital_visits = col_integer(),
     physician_notes = col_character()
 ))
                                                                                   
 # Import: Reading the Long data
 df_long <- read_csv(url_long, col_types = cols(
     patient_id = col_factor(),  
     age = col_double(),
     gender = col_character(),
     timepoint = col_factor(levels = c("Month 0", "Month 6", "Month 12"))
 ))

df_wide$qol_scaled <- df_wide$qol_score / (max(df_wide$qol_score) + 0.0001) 
 
 # Use the betareg package
 library(betareg)

model4 <- glm(qol_score ~ treatment + gender + age + baseline_fvc + success, family = Gamma(link = "log"), data = df_wide)
summary(model4)
model_beta <- betareg(qol_scaled ~ treatment + gender + baseline_fvc + success + age, data = df_wide)
summary(model_beta )

Problems

1. Conceptual Exercises:

  a) Derive the score equations for a Poisson GLM with log link
  b) Show that the binomial distribution with logit link is the canonical link
  c) Prove that the deviance for a normal GLM equals the residual sum of squares

2. Applied Problems:

  a) Analyze the Consumer Price Index data using appropriate GLM
  b) Model the Monetary Base data considering temporal autocorrelation
  c) Using the iris dataset, build a multinomial logistic regression to classify species

3. Practice Simulation Study:

   # Simulate data from a logistic regression model
   set.seed(123)
   n <- 1000
   x1 <- rnorm(n)
   x2 <- rnorm(n)
   beta <- c(0.5, 1, -0.5)
   linear_predictor <- beta[1] + beta[2]*x1 + beta[3]*x2
   probabilities <- plogis(linear_predictor)
   y <- rbinom(n, size = 1, prob = probabilities)
   
   # Fit model and evaluate performance
   sim_data <- data.frame(y = y, x1 = x1, x2 = x2)
   model_sim <- glm(y ~ x1 + x2, family = binomial, data = sim_data)
   
   # Calculate bias and MSE
   beta_hat <- coef(model_sim)
   bias <- beta_hat - beta
   mse <- mean((beta_hat - beta)^2)
   
   cat("True parameters:", beta, "\n")
   cat("Estimated parameters:", beta_hat, "\n")
   cat("Bias:", bias, "\n")
   cat("MSE:", mse, "\n")

References

1. McCullagh, P., & Nelder, J. A. (1989). *Generalized Linear Models* (2nd ed.). Chapman and Hall.

2. Dobson, A. J., & Barnett, A. G. (2018). *An Introduction to Generalized Linear Models* (4th ed.). CRC Press.

3. Agresti, A. (2015). *Foundations of Linear and Generalized Linear Models*. Wiley.

4. Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). *Regression: Models, Methods and Applications*. Springer.

5. Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R* (2nd ed.). Chapman and Hall/CRC.

Online Resources

SOCR Home page: https://www.socr.umich.edu

Translate this page:

(default)	Deutsch	Español	Français	Italiano	Português	日本語	България	الامارات العربية المتحدة	Suomi	इस भाषा में	Norge
한국어	中文	繁体中文	Русский	Nederlands	Ελληνικά	Hrvatska	Česká republika	Danmark	Polska	România	Sverige

Distribution	Support	Natural Parameter \(\theta\)	\(b(\theta)\)	Canonical Link \(g(\mu)\)	Variance Function \(V(\mu)\)
Normal	\(\mathbb{R}\)	\(\mu\)	\(\frac{\theta^2}{2}\)	Identity: \(\mu\)	1
Binomial	\(\{0,1,\ldots,n\}\)	\(\log\left(\frac{\mu}{n-\mu}\right)\)	\(n\log(1+e^\theta)\)	Logit: \(\log\left(\frac{\mu}{n-\mu}\right)\)	\(\mu\left(1-\frac{\mu}{n}\right)\)
Poisson	\(\mathbb{N}_0\)	\(\log(\mu)\)	\(e^\theta\)	Log: \(\log(\mu)\)	\(\mu\)
Gamma	\(\mathbb{R}^+\)	\(-\frac{1}{\mu}\)	\(-\log(-\theta)\)	Inverse: \(\frac{1}{\mu}\)	\(\mu^2\)
Inverse Gaussian	\(\mathbb{R}^+\)	\(-\frac{1}{2\mu^2}\)	\(-\sqrt{-2\theta}\)	Inverse squared: \(\frac{1}{\mu^2}\)	\(\mu^3\)