Difference between revisions of "SMHS GLM"

From SOCR
Jump to: navigation, search
(Software)
Line 1: Line 1:
 
==[[SMHS| Scientific Methods for Health Sciences]] - Generalized Linear Modeling (GLM) ==
 
==[[SMHS| Scientific Methods for Health Sciences]] - Generalized Linear Modeling (GLM) ==
 
  
 
===Overview===
 
===Overview===
Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear regression, which allows for response variables that have error distribution models other than a normal distribution. It generalizes linear regression by allowing the linear model to be related to the response variable via a link function and allowing the magnitude of the variance of each measurement to be function of its predicted value. GLM is a way of unifying statistical models like linear regression logistic regression and Poisson regression. Methods like iteratively reweighted least squares method for maximum likelihood estimation, Bayesian approaches and least squares fitted to variance stabilized responses are proposed for GLM. In this lecture, we are going to present a general introduction to GLM including the assumptions, model applied and illustrate the application of GLM with examples presented.
+
Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear regression that allows response variables to have error distribution models other than a normal distribution. GLM extends linear regression by allowing the linear model to be related to the response variable via a link function and enabling the variance of each measurement to be a function of its predicted value. This framework unifies statistical models including linear regression, logistic regression, and Poisson regression. Estimation methods include iteratively reweighted least squares for maximum likelihood estimation, Bayesian approaches, and least squares fitted to variance-stabilized responses.
  
 
===Motivation===
 
===Motivation===
We have discussed about linear regression, which deal with the linear relationship between response and the predictors. What if the response variable does not follow a normal distribution? For example, a model that predicts the probability of making a yes/no choice is not suitable as a linear-response model given that the probabilities are bounded on both ends. Or a study that aims to predict each decrease in 10 degrees Fahrenheit leads to 1000 fewer people going on vacation in the beach is unlikely to generalize well over small nor large beaches. In many cases, the linear regression model won’t apply and GLM has to be applied for it allows the response variables that have arbitrary distribution other than normal distribution to vary linearly with the predicted values.
+
While linear regression models linear relationships between response and predictors, many real-world scenarios involve response variables that don't follow normal distributions. For example:
 +
* Binary outcomes (yes/no decisions) with probabilities bounded between 0 and 1
 +
* Count data (number of events) that follow Poisson distributions
 +
* Survival times that follow exponential or Weibull distributions
 +
 
 +
GLM provides a unified framework for these situations by allowing response variables from the exponential family of distributions.
  
 
===Theory===
 
===Theory===
*1) GLM components: (1) a probability distribution from the exponential family; (2) a linear predictor $\eta=X\beta$ (3) a link function $g$ such that $E[Y]=\mu=g^{-1}(\eta)$ The dependent variable $Y$ in GLM is assumed to be generated from a particular distribution in the exponential family, a large range of probability distributions include normal, binomial, Poisson and gamma distribution. The mean, $\mu$, of the distribution depends on the independent variables $X:E[Y]=\mu=g^{-1} (X\beta),$ where $E[Y]$ is the expected value of $Y, X\beta$ is the linear predictor, a linear combination of unknown parameters, $\beta$, $g$ is the linked function. The variance is typically a function $V$ of the mean: $Var(Y)=V(\mu)=V(g^{-1} (X\beta)).$  It is convenient if $V$ follows from the exponential family distribution but it may simply be that the variance is a function of the predicted value. The unknown parameters, $\beta$, are typically estimated with maximum likelihood, maximum quasi-likelihood, or Bayesian techniques.
 
  
*2)Probability distribution: the over-dispersed exponential family of distribution is a generalization of exponential family and exponential dispersion model of distributions and includes those by $\theta$ and $\tau$, whose density function can be expressed as $f_{Y} (y│\theta,\tau)=h(y,\tau)exp\left(\frac{b(theta)^{T} T(y)-A(\theta)}{(d{\tau})}\right)$ where $\tau$ is the dispersion parameter and is usually related to the variance of the distribution. For scalar $Y$ and $\theta$, this reduces to $f_{Y} (y│\theta,\tau)=h(y,\tau)exp\left(\frac{b(theta)^{T} T(y)-A(\theta)}{(d{\tau})}\right)$, $\theta$ is related to the mean of the distribution. If $b(\theta)$ is the identity function then the distribution is said to be in canonical form (natural form).  
+
====1) GLM Components====
 +
A GLM consists of three components:
 +
 
 +
1. **Random Component**: The response variable \(Y\) follows a distribution from the exponential family:
 +
  \[
 +
  f_Y(y|\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\}
 +
  \]
 +
  where \(\theta\) is the natural parameter and \(\phi\) is the dispersion parameter.
  
*3) Linear predictor and link function: the quantity which incorporate the information about the independent variables into the model. $\eta=X\beta$ relates the expected value of the data (predictor). $\eta$ is expressed as linear combination of unknown parameters $\beta$. The coefficients of the linear combination are represented as the matrix of independent variables $X$. The link function provides the relationship between the linear predictor and the mean of the distribution function. The following table lists some commonly used exponential family distribution and data that typically used for along with the canonical link functions and their inverses.
+
2. **Systematic Component**: The linear predictor \(\eta\):
 +
  \[
 +
  \eta = \mathbf{X}\beta = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p
 +
  \]
  
 +
3. **Link Function**: \(g(\cdot)\) that relates the mean \(\mu = E[Y]\) to the linear predictor:
 +
  \[
 +
  g(\mu) = \eta \quad \text{or equivalently} \quad \mu = g^{-1}(\eta)
 +
  \]
  
 +
The variance function relates the variance to the mean:
 +
\[
 +
\text{Var}(Y) = a(\phi)V(\mu)
 +
\]
 +
where \(V(\mu)\) is the variance function specific to the distribution.
 +
 +
====2) Exponential Family Distributions====
 +
The exponential family includes many common distributions:
  
 
<center>
 
<center>
{| class="wikitable" style="text-align:center; width:45%" border="1"
+
{| class="wikitable" style="text-align:center; width:80%" border="1"
 
|-
 
|-
|Distribution|| Support of distribution|| Typical uses|| Link name|| Link function|| Mean function
+
! Distribution !! Support !! Natural Parameter \(\theta\) !! \(b(\theta)\) !! Canonical Link \(g(\mu)\) !! Variance Function \(V(\mu)\)
 
|-
 
|-
|Normal|| $(-\infty,+\infty)$ real|| Linear response data|| Identity||$X\beta=\mu$|| $\mu =X\beta$
+
| Normal || \(\mathbb{R}\) || \(\mu\) || \(\frac{\theta^2}{2}\) || Identity: \(\mu\) || 1
|-
 
|Exponential|| $(0,+\infty)$ real||Exponential response data, scale parameters||Inverse||$X\beta=-\mu^{-1}$ ||$\mu =(X\beta)^{-1}$
 
|-
 
|Gamma||$(0,+\infty)$ real||Exponential response data, scale parameters||Inverse||$X\beta=-\mu^{-1}$ ||$\mu =(X\beta)^{-1}$
 
|-
 
|Inverse Gaussian || $(0,+\infty)$ real|| ||Inverse || $X\beta=-\mu^{-2}$ || $\mu =(X\beta)^{-\frac{1} {2}}$
 
 
|-
 
|-
|Poisson||integer $(0,\infty)$ ||Count of occurrences in fixed time/space||log||$X\beta=1n_{\mu}$|| $\mu = exp(X\beta)$
+
| Binomial || \(\{0,1,\ldots,n\}\) || \(\log\left(\frac{\mu}{n-\mu}\right)\) || \(n\log(1+e^\theta)\) || Logit: \(\log\left(\frac{\mu}{n-\mu}\right)\) || \(\mu\left(1-\frac{\mu}{n}\right)\)
 
|-
 
|-
|Bernoulli||integer [0,1]|| Outcome of single yes/no occurrence||logit||$X\beta=ln(\frac{\mu}{1-\mu})$||$\mu =\frac{exp(X\beta)}{1+exp(X\beta)}$
+
| Poisson || \(\mathbb{N}_0\) || \(\log(\mu)\) || \(e^\theta\) || Log: \(\log(\mu)\) || \(\mu\)
 
|-
 
|-
|Binomial ||integer [0,N]||Count of # of ‘yes’ in N yes/no occurrences||logit||$X\beta=ln(\frac{\mu}{1-\mu})$|| $\mu =\frac{exp(X\beta)}{1+exp(X\beta)}$
+
| Gamma || \(\mathbb{R}^+\) || \(-\frac{1}{\mu}\) || \(-\log(-\theta)\) || Inverse: \(\frac{1}{\mu}\) || \(\mu^2\)
|-
 
|Categorical || integer [0,K],K-vector of integers [0,1]||Outcome of single K-way occurrence||logit||$X\beta=ln(\frac{\mu}{1-\mu})$|| $\mu =\frac{exp(X\beta)}{1+exp(X\beta)}$
 
|-
 
|Multinomial||K-vector of integers: [0,1]|| Count of occurrences of different types (1,…,K) out of N total K-way occurrences||logit||$X\beta=ln(\frac{\mu}{1-\mu})$|| $\mu =\frac {exp(X\beta)}{1+exp(X\beta)}$
 
 
|-
 
|-
 +
| Inverse Gaussian || \(\mathbb{R}^+\) || \(-\frac{1}{2\mu^2}\) || \(-\sqrt{-2\theta}\) || Inverse squared: \(\frac{1}{\mu^2}\) || \(\mu^3\)
 
|}
 
|}
 
</center>
 
</center>
 
In the cases of the exponential and gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. Particularly, the linear predictor may be negative, which would give an impossible negative mean. When maximizing the likelihood, precautions must be taken to avoid this. An alternative is to use a non-canonical link function.
 
  
 +
====3) Maximum Likelihood Estimation====
 +
For a GLM with \(n\) independent observations, the log-likelihood is:
 +
\[
 +
\ell(\beta) = \sum_{i=1}^n \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi)
 +
\]
 +
where \(\theta_i = \theta(\mu_i)\) and \(\mu_i = g^{-1}(\mathbf{x}_i^\top\beta)\).
 +
 +
The score equations are:
 +
\[
 +
\mathbf{U}(\beta) = \frac{\partial\ell}{\partial\beta} = \mathbf{X}^\top\mathbf{W}(\mathbf{y} - \boldsymbol{\mu}) = 0
 +
\]
 +
where \(\mathbf{W} = \text{diag}\left\{\frac{1}{a(\phi)V(\mu_i)[g'(\mu_i)]^2}\right\}\).
 +
 +
The Fisher information matrix is:
 +
\[
 +
\mathcal{I}(\beta) = E\left[-\frac{\partial^2\ell}{\partial\beta\partial\beta^\top}\right] = \mathbf{X}^\top\mathbf{W}\mathbf{X}
 +
\]
 +
 +
Parameters are estimated via **Iteratively Reweighted Least Squares (IRLS)**:
 +
\[
 +
\beta^{(t+1)} = \beta^{(t)} + (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)}
 +
\]
 +
where \(z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})g'(\mu_i^{(t)})\).
 +
 +
====4) Deviance and Goodness-of-Fit====
 +
The **deviance** measures goodness-of-fit:
 +
\[
 +
D = 2[\ell(\text{saturated model}) - \ell(\text{fitted model})]
 +
\]
 +
For nested models \(M_0 \subset M_1\):
 +
\[
 +
D_{M_0} - D_{M_1} \sim \chi^2_{df_{M_0} - df_{M_1}} \quad \text{under } H_0
 +
\]
 +
 +
The **scaled deviance** is \(D^* = D/\phi\), and **Pearson's chi-square statistic** is:
 +
\[
 +
\chi^2_P = \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)}
 +
\]
  
Note also that in the case of the Bernoulli, binomial, categorical and multinomial distributions, the support of the distributions is not the same type of data as the parameter being predicted. In all of these cases, the predicted parameter is one or more probabilities, i.e. real numbers in the range $[0,1]$. The resulting model is known as logistic regression (or multinomial logistic regression in the case that K-way rather than binary values are being predicted).
+
====5) Model Diagnostics====
 +
Key diagnostic tools:
 +
* **Pearson residuals**: \(r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}\)
 +
* **Deviance residuals**: \(r_i^D = \text{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}\)
 +
* **Leverage**: \(h_{ii}\) from the hat matrix \(\mathbf{H} = \mathbf{W}^{1/2}\mathbf{X}(\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{1/2}\)
 +
* **Cook's distance**: \(D_i = \frac{r_i^2 h_{ii}}{p(1-h_{ii})}\)
  
 +
===Applications===
  
 +
====Example 1: Logistic Regression for Contraceptive Use====
  
For the Bernoulli and binomial distributions, the parameter is a single probability, indicating the likelihood of occurrence of a single event. The Bernoulli still satisfies the basic condition of the generalized linear model in that, even though a single outcome will always be either $0$ or $1$, the expected value will nonetheless be a real-valued probability, i.e. the probability of occurrence of a "yes" (or $1$) outcome. Similarly, in a binomial distribution, the expected value is ''Np'', i.e. the expected proportion of "yes" outcomes will be the probability to be predicted.
+
<pre>
 +
# Load and prepare data
 +
cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", header=TRUE)
 +
cat("First few rows of dataset:\n")
 +
print(head(cuse))
  
 +
# Fit binomial GLM with logit link
 +
model1 <- glm(cbind(Using, NotUsing) ~ age + education + wantsMore,
 +
              family = binomial, data = cuse)
  
For categorical and multinomial distributions, the parameter to be predicted is a $K$-vector of probabilities, with the further restriction that all probabilities must add up to $1$. Each probability indicates the likelihood of occurrence of one of the $K$ possible values. For the multinomial distribution, and for the vector form of the categorical distribution, the expected values of the elements of the vector can be related to the predicted probabilities similarly to the binomial and Bernoulli distributions.
+
cat("\n=== Model Summary ===\n")
 +
summary(model1)
  
 +
cat("\n=== Confidence Intervals ===\n")
 +
confint(model1)
  
*4) Fitting the model:
+
cat("\n=== Odds Ratios with 95% CI ===\n")
**Maximum likelihood: can be found using iteratively reweighted least squares algorithm with $\beta^{t+1}=\beta^{t}+J^{-1} (\beta^{(t)}) u(\beta^{(t)})$, where $J(\beta^{(t)})$ is the observed information matrix (the negative of the Hessian matrix) and $u(\beta^{(t)})$ s the score function. With Fisher’s scoring method: $\beta^{(t+1)}=\beta^{(t)} +I^{-1} (\beta^{(t)})u(\beta^{(t)})$, where $where I^{-1} (\beta^{(t)})$is the Fisher information matrix.
+
exp_coef <- exp(coef(model1))
**Bayesian method: to approximate the posterior distribution, usually using Laplace approximation $(\int_{a}^{b}e^{Mf(x)}dx)$ or Markov Chain Monte Carlo method such as Gibbs sampling.
+
exp_ci <- exp(confint(model1))
 +
cbind(OR = exp_coef, exp_ci)
  
*5) Advantages of GLM: (1) can perform data analysis within and between subjects without the need to average the data itself; (2) allows you to counterbalance random stimuli orders; (3) allows you to exclude segments of runs with artifacts; (4) can perform more sophisticated analysis (e.g., two factor ANOVA with interactions); (5) easier to work with.
+
cat("\n=== Model Comparison (Likelihood Ratio Test) ===\n")
 +
# Reduced model without education
 +
model_reduced <- glm(cbind(Using, NotUsing) ~ age + wantsMore,
 +
                    family = binomial, data = cuse)
 +
anova(model_reduced, model1, test = "Chisq")
  
===Applications===
+
# Diagnostic plots
 +
par(mfrow = c(2, 2))
 +
plot(model1, which = 1:4)
 +
</pre>
  
[http://biomet.oxfordjournals.org/content/73/1/13.short  This article] proposed an extension of generalized linear models to the analysis of longitudinal data. We introduce a class of estimating equations that give consistent estimates of the regression parameters and of their variance under mild assumptions about the time dependence. The estimating equations are derived without specifying the joint distribution of a subject's observations yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory is presented for the general class of estimators. Specific cases in which we assume independence, m-dependence and exchangeable correlation structures from each subject are discussed. Efficiency of the proposed estimators in two simple situations is considered. The approach is closely related to quasi-likelihood.
+
====Example 2: Poisson Regression for Count Data====
  
[http://biomet.oxfordjournals.org/content/78/4/719.short This article] proposed a conceptually very simple but general algorithm for the estimation of the fixed effects, random effects, and components of dispersion in generalized linear models with random effects. Conditions are described under which the algorithm yields approximate maximum likelihood or quasi-maximum likelihood estimates of the fixed effects and dispersion components, and approximate empirical Bayes estimates of the random effects. The algorithm is applied to two data sets to illustrate the estimation of components of dispersion and the modelling of over-dispersion.
+
<pre>
 +
# Example with built-in R dataset: AIDS cases in Belgium
 +
# Load necessary libraries
 +
if (!require("MASS")) install.packages("MASS")
 +
library(MASS)
  
 +
# Load AIDS dataset
 +
data(Aids2)
 +
cat("AIDS dataset structure:\n")
 +
str(Aids2)
  
===Software===
+
# Prepare data: count of AIDS cases by year and state
GLM in R:
+
aids_counts <- aggregate(cbind(count = 1:nrow(Aids2)) ~ year + state,
glm(formula, family=familytype(link=linkfunction),data= ):
+
                        data = Aids2, FUN = length)
# list of family and default link functions:
 
  
<center>
+
# Fit Poisson regression
{| class="wikitable" style="text-align:center; width:45%" border="1"
+
model2 <- glm(count ~ year + state, family = poisson, data = aids_counts)
|-
 
|Family|| Default Like Function
 
|-
 
|Binomial|| (link=’logit’)
 
|-
 
|Gaussian|| (link=identity)
 
|-
 
|Gamma|| (link=inverse)
 
|-
 
|Inverse gamma ||(link=${1}/mu^{2}$)
 
|-
 
|Poisson|| (link=’log’)
 
|-
 
|Quasi ||(link=’identity’,variance=’constant’)
 
|-
 
|Quasibinomial|| (link=’logit’)
 
|-
 
|Quasipoisson|| (link=’log’)
 
|-
 
|}
 
</center>
 
  
# Logistic regression is useful when predict a binary outcome from a set
+
cat("\n=== Poisson Model Summary ===\n")
of continuous predictor  variables.
+
summary(model2)
# F as a binary factor and x1-x3 are continuous predictors
 
fit <- glm(F~x1+x2+x3, data=mydata, family=binomial())
 
summary(fit)  # display the results
 
confint(fit)  # 95% CI for the coefficients
 
predict(fit,type=’response’)  # predicted values
 
residuals(fit,type=’deviance’) # residuals
 
  
: See these [http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html Logistic Regression R Examples].
+
cat("\n=== Check for Overdispersion ===\n")
 +
# Pearson chi-square statistic
 +
pearson_chi2 <- sum(residuals(model2, type = "pearson")^2)
 +
df_resid <- df.residual(model2)
 +
dispersion <- pearson_chi2 / df_resid
 +
cat("Pearson χ²:", pearson_chi2, "\n")
 +
cat("Dispersion parameter:", dispersion, "\n")
 +
cat("p-value for H0: φ=1:", pchisq(pearson_chi2, df_resid, lower.tail = FALSE), "\n")
  
## Use Poisson regression when predicting an outcome variable representing
+
# If overdispersed, fit quasipoisson model
counts from a set of continuous predictor variables
+
if (dispersion > 1.5) {
# Y is a count and x1-x3 are continuous predictors
+
  cat("\n=== Fitting Quasi-Poisson Model (accounting for overdispersion) ===\n")
fit <- glm(Y~x1+x2+x3, data=mydata, family=poisson())
+
  model2_qp <- glm(count ~ year + state, family = quasipoisson, data = aids_counts)
summary(fit)   # display results
+
  summary(model2_qp)
 +
}
  
 +
# Predict expected counts
 +
cat("\n=== Predictions for First 10 Observations ===\n")
 +
predictions <- predict(model2, type = "response", se.fit = TRUE)
 +
pred_df <- data.frame(
 +
  Observed = aids_counts$count[1:10],
 +
  Predicted = predictions$fit[1:10],
 +
  SE = predictions$se.fit[1:10],
 +
  Lower = predictions$fit[1:10] - 1.96 * predictions$se.fit[1:10],
 +
  Upper = predictions$fit[1:10] + 1.96 * predictions$se.fit[1:10]
 +
)
 +
print(pred_df)
 +
</pre>
  
### TO illustrate the GLM with the following example
+
====Example 3: Gamma Regression for Positive Continuous Data====
cuse <-read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",header=TRUE)
 
cuse
 
  
<center>
+
<pre>
{| class="wikitable" style="text-align:center; width:45%" border="1"
+
# Example with insurance claims data
|-
+
if (!require("insuranceData")) install.packages("insuranceData")
| ||    Age|| Education|| Wants More || Not Using|| Using
+
library(insuranceData)
|-
+
data(dataCar)
|1 ||  <25 ||      low      ||yes  ||    53 ||    6
 
|-
 
|2  ||  <25 ||      low ||      no  ||    10  ||  4
 
|-
 
|3  ||  <25  ||    high    ||  yes  ||    212  || 52
 
|-
 
|4  ||  <25  ||  high    ||    no  ||    50  || 10
 
|-
 
|5 || 25-29  ||    low ||      yes ||      60  ||  14
 
|-
 
|6 || 25-29  ||    low  ||      no ||      19  || 10
 
|-
 
|7 || 25-29  ||    high  ||    yes  ||    155||    54
 
|-
 
|8 || 25-29  ||    high  ||      no ||      65  ||  27
 
|-
 
|9 || 30-39 ||      low ||      yes ||    112 ||  33
 
|-
 
|10|| 30-39 ||      low ||      no  ||    77 ||  80
 
|-
 
|11|| 30-39  ||    high  ||    yes  ||  118  ||  46
 
|-
 
|12 ||30-39 ||    high||        no ||      68||    78
 
|-
 
|13 ||40-49 ||      low  ||    yes  ||    35 ||    6
 
|-
 
|14|| 40-49||      low    ||    no  ||    46||    48
 
|-
 
|15|| 40-49    ||  high ||      yes ||      8||  8 
 
|-
 
|16 ||40-49  ||  high  ||    no    ||  12 ||  31
 
|}
 
</center>
 
  
attach(cuse)
+
cat("Car insurance claims dataset:\n")
lrfit <- glm( cbind(using, notUsing) ~ + age + education + wantsMore ,
+
str(dataCar)
family = binomial)     #  run the GLM on age, education, wantsMore
 
lrfit
 
  
Call: glm(formula = cbind(using, notUsing) ~ +age + education + wantsMore,  
+
# Filter for positive claim amounts
    family = binomial)
+
claims_positive <- subset(dataCar, claimcst0 > 0)
  
Coefficients:
+
# Fit Gamma GLM with log link (common for monetary amounts)
(Intercept)     age25-29      age30-39      age40-49  educationlow 
+
model3 <- glm(claimcst0 ~ agecat + area + veh_age,
    -0.8082        0.3894        0.9086        1.1892      -0.3250 
+
              family = Gamma(link = "log"),
 +
              data = claims_positive)
  
wantsMoreyes 
+
cat("\n=== Gamma Model Summary ===\n")
    -0.8330 
+
summary(model3)
  
Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
+
cat("\n=== Checking Gamma Model Assumptions ===\n")
Null Deviance:     165.8
+
# Check residuals
Residual Deviance: 29.92 AIC: 113.4
+
res_gamma <- residuals(model3, type = "deviance")
 +
par(mfrow = c(1, 2))
 +
hist(res_gamma, main = "Deviance Residuals", xlab = "Residuals")
 +
qqnorm(res_gamma, main = "Q-Q Plot of Residuals")
 +
qqline(res_gamma)
  
### Now relevel to change the base category and define our own indicator variables
+
# Scale-location plot
with high education and women who want no more children:
+
fitted_values <- fitted(model3)
 +
plot(fitted_values, sqrt(abs(res_gamma)),
 +
    xlab = "Fitted Values", ylab = "√|Deviance Residuals|",
 +
    main = "Scale-Location Plot")
 +
lines(lowess(fitted_values, sqrt(abs(res_gamma))), col = "red")
 +
</pre>
  
noMore <- wantsMore == "no"
+
===Software Implementation in R====
hiEduc <- education == "high"
 
glm( cbind(using,notUsing) ~ age + hiEduc + noMore, family=binomial)
 
Call:  glm(formula = cbind(using, notUsing) ~ age + hiEduc + noMore,
 
family = binomial)
 
  
Coefficients:
+
<pre>
(Intercept)    age25-29     age30-39    age40-49   hiEducTRUE   noMoreTRUE 
+
# Comprehensive GLM analysis function
-1.9662      0.3894      0.9086      1.1892      0.3250       0.8330 
+
run_glm_analysis <- function(formula, data, family, link = NULL) {
 +
  # Fit the model
 +
  if (!is.null(link)) {
 +
     model <- glm(formula, data = data, family = family(link = link))
 +
  } else {
 +
     model <- glm(formula, data = data, family = family)
 +
  }
 +
 
 +
  # Model summary
 +
  cat("=== MODEL SUMMARY ===\n")
 +
  print(summary(model))
 +
 
 +
  # Confidence intervals
 +
  cat("\n=== 95% CONFIDENCE INTERVALS ===\n")
 +
  print(confint(model))
 +
    
 +
   # Goodness-of-fit tests
 +
  cat("\n=== GOODNESS-OF-FIT ===\n")
 +
  cat("Null Deviance:", model$null.deviance, "on", model$df.null, "df\n")
 +
  cat("Residual Deviance:", model$deviance, "on", model$df.residual, "df\n")
 +
  cat("AIC:", AIC(model), "\n")
 +
 
 +
  # Check for overdispersion (for Poisson and binomial)
 +
  if (family$family %in% c("poisson", "binomial", "quasipoisson", "quasibinomial")) {
 +
    cat("\n=== OVERDISPERSION CHECK ===\n")
 +
    n <- nrow(data)
 +
    p <- length(coef(model))
 +
    dispersion <- model$deviance / model$df.residual
 +
    cat("Dispersion parameter:", dispersion, "\n")
 +
    if (abs(dispersion - 1) > 0.1) {
 +
       cat("Note: Significant", ifelse(dispersion > 1, "over", "under"), "dispersion detected\n")
 +
    }
 +
  }
 +
 
 +
  # Return model for further analysis
 +
  return(model)
 +
}
  
Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
+
# Example usage with mtcars dataset
Null Deviance:     165.8
+
cat("\n\n=== EXAMPLE: GAUSSIAN GLM (equivalent to linear regression) ===\n")
Residual Deviance: 29.92 AIC: 113.4
+
data(mtcars)
 +
model_gaussian <- run_glm_analysis(
 +
  formula = mpg ~ wt + hp + cyl,
 +
  data = mtcars,
 +
  family = gaussian,
 +
  link = "identity"
 +
)
  
1-pchisq(29.92,10)
+
# Compare with lm() for verification
0.0008828339
+
cat("\n=== COMPARISON WITH lm() ===\n")
## we see that the residual deviance of 29.92 on 10 degree of freedom is highly significant,
+
model_lm <- lm(mpg ~ wt + hp + cyl, data = mtcars)
a better model could be introducing an interaction term of age and desire for no more children:
+
print(summary(model_lm))
lrfit <- glm( cbind(using,notUsing) ~ age * noMore + hiEduc , family=binomial)
+
</pre>
lrfit
 
  
Call:  glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc,
+
===Common GLM Families and Links in R===
    family = binomial)
 
  
Coefficients:
+
<center>
        (Intercept)             age25-29            age30-39 
+
{| class="wikitable" style="text-align:center; width:80%" border="1"
          -1.80317              0.39460              0.54666 
+
|-
          age40-49          noMoreTRUE          hiEducTRUE 
+
! Family !! Default Link !! Alternative Links !! Typical Use
            0.57952              0.06622              0.34065 
+
|-
age25-29:noMoreTRUE  age30-39:noMoreTRUE  age40-49:noMoreTRUE 
+
| <code>gaussian()</code> || <code>identity</code> || <code>log</code>, <code>inverse</code> || Continuous, symmetric data
            0.25918              1.11266              1.36167 
+
|-
 +
| <code>binomial()</code> || <code>logit</code> || <code>probit</code>, <code>cauchit</code>, <code>log</code>, <code>cloglog</code> || Binary/count proportions
 +
|-
 +
| <code>poisson()</code> || <code>log</code> || <code>identity</code>, <code>sqrt</code> || Count data
 +
|-
 +
| <code>Gamma()</code> || <code>inverse</code> || <code>log</code>, <code>identity</code> || Positive continuous, right-skewed
 +
|-
 +
| <code>inverse.gaussian()</code> || <code>1/μ²</code> || <code>inverse</code>, <code>log</code>, <code>identity</code> || Positive continuous, highly skewed
 +
|-
 +
| <code>quasi()</code> || <code>identity</code> || User-defined || Overdispersed data
 +
|}
 +
</center>
  
Degrees of Freedom: 15 Total (i.e. Null);  7 Residual
+
===Practical Considerations====
Null Deviance:     165.8
 
Residual Deviance: 12.63 AIC: 102.1
 
  
## now the model’s deviance of 12.63 on 7 degree of freedom is not significant at
+
####1) Model Selection
the conventional five per cent level, so there is no evidence against this model.
+
* **AIC (Akaike Information Criterion)**: \( \text{AIC} = -2\ell + 2p \)
The summary function offers more detailed information about this model
+
* **BIC (Bayesian Information Criterion)**: \( \text{BIC} = -2\ell + p\log(n) \)
 +
* **Cross-validation**: Particularly useful for predictive performance
  
 +
####2) Handling Overdispersion
 +
For Poisson models: \( \text{Var}(Y) = \phi\mu \) where \(\phi > 1\) indicates overdispersion
 +
Solutions:
 +
* Use quasi-Poisson model
 +
* Use negative binomial distribution
 +
* Include random effects
  
summary(lrfit)
+
####3) Zero-Inflation
Call:
+
For count data with excess zeros, consider:
glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc,
+
* Zero-inflated Poisson (ZIP) model
    family = binomial)
+
* Zero-inflated negative binomial (ZINB) model
 +
* Hurdle models
  
Deviance Residuals:
+
===Advanced Topics====
      Min        1Q    Median        3Q      Max 
 
-1.30027  -0.66163  -0.03286  0.81945  1.73851 
 
  
Coefficients:
+
####1) Mixed Effects GLM (GLMM)
                    Estimate Std. Error z value Pr(>|z|)  
+
Extension incorporating random effects:
(Intercept)         -1.80317    0.18018 -10.008  < 2e-16 ***
+
\[
age25-29            0.39460    0.20145  1.959  0.05013 . 
+
g(E[Y|b]) = \mathbf{X}\beta + \mathbf{Z}b
age30-39            0.54666    0.19842  2.755  0.00587 **
+
\]
age40-49            0.57952    0.34742  1.668  0.09530 . 
+
where \( b \sim N(0, \mathbf{G}) \).
noMoreTRUE          0.06622    0.33071  0.200  0.84130   
 
hiEducTRUE          0.34065    0.12577  2.709  0.00676 **
 
age25-29:noMoreTRUE  0.25918    0.40975  0.633  0.52704   
 
age30-39:noMoreTRUE  1.11266    0.37404  2.975  0.00293 **
 
age40-49:noMoreTRUE  1.36167    0.48433  2.811  0.00493 **
 
---
 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
  
(Dispersion parameter for binomial family taken to be 1)
+
Implementation in R with <code>lme4::glmer()</code>:
 +
<pre>
 +
library(lme4)
 +
# Example with binary outcome and random intercept
 +
glmm_model <- glmer(outcome ~ treatment + (1|subject),
 +
                    family = binomial, data = mydata)
 +
</pre>
  
Null deviance: 165.77  on 15  degrees of freedom
+
####2) Bayesian GLM
Residual deviance:  12.63  on  7  degrees of freedom
+
Using Markov Chain Monte Carlo (MCMC) methods:
AIC: 102.14
+
<pre>
 +
library(rstanarm)
 +
# Bayesian logistic regression
 +
bayes_model <- stan_glm(outcome ~ predictors,
 +
                        family = binomial,
 +
                        data = mydata,
 +
                        prior = normal(0, 2.5),
 +
                        prior_intercept = normal(0, 5))
 +
</pre>
  
Number of Fisher Scoring iterations: 4
+
===Problems====
  
===Problems===
+
1. **Conceptual Exercises**:
Repeat the GLM analyses above using the following SOCR datasets
+
  a) Derive the score equations for a Poisson GLM with log link
*[http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index 3 Way]
+
  b) Show that the binomial distribution with logit link is the canonical link
*[http://wiki.socr.umich.edu/index.php/SOCR_Data_MonetaryBase1959_2009 Monetary Base]
+
  c) Prove that the deviance for a normal GLM equals the residual sum of squares
  
===References===
+
2. **Applied Problems**:
 +
  a) Analyze the [http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index] data using appropriate GLM
 +
  b) Model the [http://wiki.socr.umich.edu/index.php/SOCR_Data_MonetaryBase1959_2009 Monetary Base] data considering temporal autocorrelation
 +
  c) Using the <code>iris</code> dataset, build a multinomial logistic regression to classify species
  
[http://en.wikipedia.org/wiki/Generalized_linear_model  GLM Wikipedia]
+
3. **Simulation Study**:
 +
  <pre>
 +
  # Simulate data from a logistic regression model
 +
  set.seed(123)
 +
  n <- 1000
 +
  x1 <- rnorm(n)
 +
  x2 <- rnorm(n)
 +
  beta <- c(0.5, 1, -0.5)
 +
  linear_predictor <- beta[1] + beta[2]*x1 + beta[3]*x2
 +
  probabilities <- plogis(linear_predictor)
 +
  y <- rbinom(n, size = 1, prob = probabilities)
 +
 
 +
  # Fit model and evaluate performance
 +
  sim_data <- data.frame(y = y, x1 = x1, x2 = x2)
 +
  model_sim <- glm(y ~ x1 + x2, family = binomial, data = sim_data)
 +
 
 +
  # Calculate bias and MSE
 +
  beta_hat <- coef(model_sim)
 +
  bias <- beta_hat - beta
 +
  mse <- mean((beta_hat - beta)^2)
 +
 
 +
  cat("True parameters:", beta, "\n")
 +
  cat("Estimated parameters:", beta_hat, "\n")
 +
  cat("Bias:", bias, "\n")
 +
  cat("MSE:", mse, "\n")
 +
  </pre>
  
[http://data.princeton.edu/wws509/notes/a2.pdf  GLM Theory]
+
===References====
  
 +
1. McCullagh, P., & Nelder, J. A. (1989). *Generalized Linear Models* (2nd ed.). Chapman and Hall.
 +
2. Dobson, A. J., & Barnett, A. G. (2018). *An Introduction to Generalized Linear Models* (4th ed.). CRC Press.
 +
3. Agresti, A. (2015). *Foundations of Linear and Generalized Linear Models*. Wiley.
 +
4. Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). *Regression: Models, Methods and Applications*. Springer.
 +
5. Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R* (2nd ed.). Chapman and Hall/CRC.
  
 +
====Online Resources====
 +
* [https://en.wikipedia.org/wiki/Generalized_linear_model GLM Wikipedia]
 +
* [https://data.princeton.edu/wws509/notes/a2.pdf GLM Theory Notes]
 +
* [https://www.jstatsoft.org/article/view/v015i12 GLM in R Tutorial]
 +
* [https://cran.r-project.org/web/views/SocialSciences.html R Resources for Social Sciences]
  
 
<hr>
 
<hr>
* SOCR Home page: http://www.socr.umich.edu
+
* SOCR Home page: https://www.socr.umich.edu
  
 
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_GLM}}
 
{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_GLM}}

Revision as of 16:19, 8 December 2025

Scientific Methods for Health Sciences - Generalized Linear Modeling (GLM)

Overview

Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear regression that allows response variables to have error distribution models other than a normal distribution. GLM extends linear regression by allowing the linear model to be related to the response variable via a link function and enabling the variance of each measurement to be a function of its predicted value. This framework unifies statistical models including linear regression, logistic regression, and Poisson regression. Estimation methods include iteratively reweighted least squares for maximum likelihood estimation, Bayesian approaches, and least squares fitted to variance-stabilized responses.

Motivation

While linear regression models linear relationships between response and predictors, many real-world scenarios involve response variables that don't follow normal distributions. For example:

  • Binary outcomes (yes/no decisions) with probabilities bounded between 0 and 1
  • Count data (number of events) that follow Poisson distributions
  • Survival times that follow exponential or Weibull distributions

GLM provides a unified framework for these situations by allowing response variables from the exponential family of distributions.

Theory

1) GLM Components

A GLM consists of three components:

1. **Random Component**: The response variable \(Y\) follows a distribution from the exponential family:

  \[
   f_Y(y|\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\}
   \]
  where \(\theta\) is the natural parameter and \(\phi\) is the dispersion parameter.

2. **Systematic Component**: The linear predictor \(\eta\):

  \[
   \eta = \mathbf{X}\beta = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p
   \]

3. **Link Function**: \(g(\cdot)\) that relates the mean \(\mu = E[Y]\) to the linear predictor:

  \[
   g(\mu) = \eta \quad \text{or equivalently} \quad \mu = g^{-1}(\eta)
   \]

The variance function relates the variance to the mean: \[ \text{Var}(Y) = a(\phi)V(\mu) \] where \(V(\mu)\) is the variance function specific to the distribution.

2) Exponential Family Distributions

The exponential family includes many common distributions:

Distribution Support Natural Parameter \(\theta\) \(b(\theta)\) Canonical Link \(g(\mu)\) Variance Function \(V(\mu)\)
Normal \(\mathbb{R}\) \(\mu\) \(\frac{\theta^2}{2}\) Identity: \(\mu\) 1
Binomial \(\{0,1,\ldots,n\}\) \(\log\left(\frac{\mu}{n-\mu}\right)\) \(n\log(1+e^\theta)\) Logit: \(\log\left(\frac{\mu}{n-\mu}\right)\) \(\mu\left(1-\frac{\mu}{n}\right)\)
Poisson \(\mathbb{N}_0\) \(\log(\mu)\) \(e^\theta\) Log: \(\log(\mu)\) \(\mu\)
Gamma \(\mathbb{R}^+\) \(-\frac{1}{\mu}\) \(-\log(-\theta)\) Inverse: \(\frac{1}{\mu}\) \(\mu^2\)
Inverse Gaussian \(\mathbb{R}^+\) \(-\frac{1}{2\mu^2}\) \(-\sqrt{-2\theta}\) Inverse squared: \(\frac{1}{\mu^2}\) \(\mu^3\)

3) Maximum Likelihood Estimation

For a GLM with \(n\) independent observations, the log-likelihood is: \[ \ell(\beta) = \sum_{i=1}^n \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \] where \(\theta_i = \theta(\mu_i)\) and \(\mu_i = g^{-1}(\mathbf{x}_i^\top\beta)\).

The score equations are: \[ \mathbf{U}(\beta) = \frac{\partial\ell}{\partial\beta} = \mathbf{X}^\top\mathbf{W}(\mathbf{y} - \boldsymbol{\mu}) = 0 \] where \(\mathbf{W} = \text{diag}\left\{\frac{1}{a(\phi)V(\mu_i)[g'(\mu_i)]^2}\right\}\).

The Fisher information matrix is: \[ \mathcal{I}(\beta) = E\left[-\frac{\partial^2\ell}{\partial\beta\partial\beta^\top}\right] = \mathbf{X}^\top\mathbf{W}\mathbf{X} \]

Parameters are estimated via **Iteratively Reweighted Least Squares (IRLS)**: \[ \beta^{(t+1)} = \beta^{(t)} + (\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{(t)}\mathbf{z}^{(t)} \] where \(z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)})g'(\mu_i^{(t)})\).

4) Deviance and Goodness-of-Fit

The **deviance** measures goodness-of-fit: \[ D = 2[\ell(\text{saturated model}) - \ell(\text{fitted model})] \] For nested models \(M_0 \subset M_1\): \[ D_{M_0} - D_{M_1} \sim \chi^2_{df_{M_0} - df_{M_1}} \quad \text{under } H_0 \]

The **scaled deviance** is \(D^* = D/\phi\), and **Pearson's chi-square statistic** is: \[ \chi^2_P = \sum_{i=1}^n \frac{(y_i - \hat{\mu}_i)^2}{V(\hat{\mu}_i)} \]

5) Model Diagnostics

Key diagnostic tools:

  • **Pearson residuals**: \(r_i^P = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}\)
  • **Deviance residuals**: \(r_i^D = \text{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}\)
  • **Leverage**: \(h_{ii}\) from the hat matrix \(\mathbf{H} = \mathbf{W}^{1/2}\mathbf{X}(\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}^{1/2}\)
  • **Cook's distance**: \(D_i = \frac{r_i^2 h_{ii}}{p(1-h_{ii})}\)

Applications

Example 1: Logistic Regression for Contraceptive Use

# Load and prepare data
cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", header=TRUE)
cat("First few rows of dataset:\n")
print(head(cuse))

# Fit binomial GLM with logit link
model1 <- glm(cbind(Using, NotUsing) ~ age + education + wantsMore,
              family = binomial, data = cuse)

cat("\n=== Model Summary ===\n")
summary(model1)

cat("\n=== Confidence Intervals ===\n")
confint(model1)

cat("\n=== Odds Ratios with 95% CI ===\n")
exp_coef <- exp(coef(model1))
exp_ci <- exp(confint(model1))
cbind(OR = exp_coef, exp_ci)

cat("\n=== Model Comparison (Likelihood Ratio Test) ===\n")
# Reduced model without education
model_reduced <- glm(cbind(Using, NotUsing) ~ age + wantsMore,
                     family = binomial, data = cuse)
anova(model_reduced, model1, test = "Chisq")

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model1, which = 1:4)

Example 2: Poisson Regression for Count Data

# Example with built-in R dataset: AIDS cases in Belgium
# Load necessary libraries
if (!require("MASS")) install.packages("MASS")
library(MASS)

# Load AIDS dataset
data(Aids2)
cat("AIDS dataset structure:\n")
str(Aids2)

# Prepare data: count of AIDS cases by year and state
aids_counts <- aggregate(cbind(count = 1:nrow(Aids2)) ~ year + state,
                         data = Aids2, FUN = length)

# Fit Poisson regression
model2 <- glm(count ~ year + state, family = poisson, data = aids_counts)

cat("\n=== Poisson Model Summary ===\n")
summary(model2)

cat("\n=== Check for Overdispersion ===\n")
# Pearson chi-square statistic
pearson_chi2 <- sum(residuals(model2, type = "pearson")^2)
df_resid <- df.residual(model2)
dispersion <- pearson_chi2 / df_resid
cat("Pearson χ²:", pearson_chi2, "\n")
cat("Dispersion parameter:", dispersion, "\n")
cat("p-value for H0: φ=1:", pchisq(pearson_chi2, df_resid, lower.tail = FALSE), "\n")

# If overdispersed, fit quasipoisson model
if (dispersion > 1.5) {
  cat("\n=== Fitting Quasi-Poisson Model (accounting for overdispersion) ===\n")
  model2_qp <- glm(count ~ year + state, family = quasipoisson, data = aids_counts)
  summary(model2_qp)
}

# Predict expected counts
cat("\n=== Predictions for First 10 Observations ===\n")
predictions <- predict(model2, type = "response", se.fit = TRUE)
pred_df <- data.frame(
  Observed = aids_counts$count[1:10],
  Predicted = predictions$fit[1:10],
  SE = predictions$se.fit[1:10],
  Lower = predictions$fit[1:10] - 1.96 * predictions$se.fit[1:10],
  Upper = predictions$fit[1:10] + 1.96 * predictions$se.fit[1:10]
)
print(pred_df)

Example 3: Gamma Regression for Positive Continuous Data

# Example with insurance claims data
if (!require("insuranceData")) install.packages("insuranceData")
library(insuranceData)
data(dataCar)

cat("Car insurance claims dataset:\n")
str(dataCar)

# Filter for positive claim amounts
claims_positive <- subset(dataCar, claimcst0 > 0)

# Fit Gamma GLM with log link (common for monetary amounts)
model3 <- glm(claimcst0 ~ agecat + area + veh_age,
              family = Gamma(link = "log"),
              data = claims_positive)

cat("\n=== Gamma Model Summary ===\n")
summary(model3)

cat("\n=== Checking Gamma Model Assumptions ===\n")
# Check residuals
res_gamma <- residuals(model3, type = "deviance")
par(mfrow = c(1, 2))
hist(res_gamma, main = "Deviance Residuals", xlab = "Residuals")
qqnorm(res_gamma, main = "Q-Q Plot of Residuals")
qqline(res_gamma)

# Scale-location plot
fitted_values <- fitted(model3)
plot(fitted_values, sqrt(abs(res_gamma)),
     xlab = "Fitted Values", ylab = "√|Deviance Residuals|",
     main = "Scale-Location Plot")
lines(lowess(fitted_values, sqrt(abs(res_gamma))), col = "red")

Software Implementation in R=

# Comprehensive GLM analysis function
run_glm_analysis <- function(formula, data, family, link = NULL) {
  # Fit the model
  if (!is.null(link)) {
    model <- glm(formula, data = data, family = family(link = link))
  } else {
    model <- glm(formula, data = data, family = family)
  }
  
  # Model summary
  cat("=== MODEL SUMMARY ===\n")
  print(summary(model))
  
  # Confidence intervals
  cat("\n=== 95% CONFIDENCE INTERVALS ===\n")
  print(confint(model))
  
  # Goodness-of-fit tests
  cat("\n=== GOODNESS-OF-FIT ===\n")
  cat("Null Deviance:", model$null.deviance, "on", model$df.null, "df\n")
  cat("Residual Deviance:", model$deviance, "on", model$df.residual, "df\n")
  cat("AIC:", AIC(model), "\n")
  
  # Check for overdispersion (for Poisson and binomial)
  if (family$family %in% c("poisson", "binomial", "quasipoisson", "quasibinomial")) {
    cat("\n=== OVERDISPERSION CHECK ===\n")
    n <- nrow(data)
    p <- length(coef(model))
    dispersion <- model$deviance / model$df.residual
    cat("Dispersion parameter:", dispersion, "\n")
    if (abs(dispersion - 1) > 0.1) {
      cat("Note: Significant", ifelse(dispersion > 1, "over", "under"), "dispersion detected\n")
    }
  }
  
  # Return model for further analysis
  return(model)
}

# Example usage with mtcars dataset
cat("\n\n=== EXAMPLE: GAUSSIAN GLM (equivalent to linear regression) ===\n")
data(mtcars)
model_gaussian <- run_glm_analysis(
  formula = mpg ~ wt + hp + cyl,
  data = mtcars,
  family = gaussian,
  link = "identity"
)

# Compare with lm() for verification
cat("\n=== COMPARISON WITH lm() ===\n")
model_lm <- lm(mpg ~ wt + hp + cyl, data = mtcars)
print(summary(model_lm))

Common GLM Families and Links in R

Family Default Link Alternative Links Typical Use
gaussian() identity log, inverse Continuous, symmetric data
binomial() logit probit, cauchit, log, cloglog Binary/count proportions
poisson() log identity, sqrt Count data
Gamma() inverse log, identity Positive continuous, right-skewed
inverse.gaussian() 1/μ² inverse, log, identity Positive continuous, highly skewed
quasi() identity User-defined Overdispersed data

Practical Considerations=

        1. 1) Model Selection
  • **AIC (Akaike Information Criterion)**: \( \text{AIC} = -2\ell + 2p \)
  • **BIC (Bayesian Information Criterion)**: \( \text{BIC} = -2\ell + p\log(n) \)
  • **Cross-validation**: Particularly useful for predictive performance
        1. 2) Handling Overdispersion

For Poisson models: \( \text{Var}(Y) = \phi\mu \) where \(\phi > 1\) indicates overdispersion Solutions:

  • Use quasi-Poisson model
  • Use negative binomial distribution
  • Include random effects
        1. 3) Zero-Inflation

For count data with excess zeros, consider:

  • Zero-inflated Poisson (ZIP) model
  • Zero-inflated negative binomial (ZINB) model
  • Hurdle models

Advanced Topics=

        1. 1) Mixed Effects GLM (GLMM)

Extension incorporating random effects: \[ g(E[Y|b]) = \mathbf{X}\beta + \mathbf{Z}b \] where \( b \sim N(0, \mathbf{G}) \).

Implementation in R with lme4::glmer():

library(lme4)
# Example with binary outcome and random intercept
glmm_model <- glmer(outcome ~ treatment + (1|subject),
                    family = binomial, data = mydata)
        1. 2) Bayesian GLM

Using Markov Chain Monte Carlo (MCMC) methods:

library(rstanarm)
# Bayesian logistic regression
bayes_model <- stan_glm(outcome ~ predictors,
                        family = binomial,
                        data = mydata,
                        prior = normal(0, 2.5),
                        prior_intercept = normal(0, 5))

Problems=

1. **Conceptual Exercises**:

  a) Derive the score equations for a Poisson GLM with log link
  b) Show that the binomial distribution with logit link is the canonical link
  c) Prove that the deviance for a normal GLM equals the residual sum of squares

2. **Applied Problems**:

  a) Analyze the Consumer Price Index data using appropriate GLM
  b) Model the Monetary Base data considering temporal autocorrelation
  c) Using the iris dataset, build a multinomial logistic regression to classify species

3. **Simulation Study**:

   # Simulate data from a logistic regression model
   set.seed(123)
   n <- 1000
   x1 <- rnorm(n)
   x2 <- rnorm(n)
   beta <- c(0.5, 1, -0.5)
   linear_predictor <- beta[1] + beta[2]*x1 + beta[3]*x2
   probabilities <- plogis(linear_predictor)
   y <- rbinom(n, size = 1, prob = probabilities)
   
   # Fit model and evaluate performance
   sim_data <- data.frame(y = y, x1 = x1, x2 = x2)
   model_sim <- glm(y ~ x1 + x2, family = binomial, data = sim_data)
   
   # Calculate bias and MSE
   beta_hat <- coef(model_sim)
   bias <- beta_hat - beta
   mse <- mean((beta_hat - beta)^2)
   
   cat("True parameters:", beta, "\n")
   cat("Estimated parameters:", beta_hat, "\n")
   cat("Bias:", bias, "\n")
   cat("MSE:", mse, "\n")
   

References=

1. McCullagh, P., & Nelder, J. A. (1989). *Generalized Linear Models* (2nd ed.). Chapman and Hall. 2. Dobson, A. J., & Barnett, A. G. (2018). *An Introduction to Generalized Linear Models* (4th ed.). CRC Press. 3. Agresti, A. (2015). *Foundations of Linear and Generalized Linear Models*. Wiley. 4. Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). *Regression: Models, Methods and Applications*. Springer. 5. Wood, S. N. (2017). *Generalized Additive Models: An Introduction with R* (2nd ed.). Chapman and Hall/CRC.

Online Resources




Translate this page:

(default)
Uk flag.gif

Deutsch
De flag.gif

Español
Es flag.gif

Français
Fr flag.gif

Italiano
It flag.gif

Português
Pt flag.gif

日本語
Jp flag.gif

България
Bg flag.gif

الامارات العربية المتحدة
Ae flag.gif

Suomi
Fi flag.gif

इस भाषा में
In flag.gif

Norge
No flag.png

한국어
Kr flag.gif

中文
Cn flag.gif

繁体中文
Cn flag.gif

Русский
Ru flag.gif

Nederlands
Nl flag.gif

Ελληνικά
Gr flag.gif

Hrvatska
Hr flag.gif

Česká republika
Cz flag.gif

Danmark
Dk flag.gif

Polska
Pl flag.png

România
Ro flag.png

Sverige
Se flag.gif