SMHS GLM
Scientific Methods for Health Sciences - Generalized Linear Modeling (GLM)
Overview
Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear regression that allows for response variables with error distributions other than the normal distribution. It generalizes linear regression by allowing the linear model to be related to the response variable via a link function, and by allowing the variance of each measurement to be a function of its predicted value. GLM unifies statistical models such as linear regression, logistic regression, and Poisson regression. Methods such as iteratively reweighted least squares for maximum likelihood estimation, Bayesian approaches, and least squares fitted to variance-stabilized responses have been proposed for fitting GLMs. In this lecture, we present a general introduction to GLM, including its assumptions and model components, and illustrate its application with examples.
Motivation
We have discussed linear regression, which deals with a linear relationship between the response and the predictors. What if the response variable does not follow a normal distribution? For example, a model that predicts the probability of making a yes/no choice is not suitable as a linear-response model, since probabilities are bounded on both ends. Likewise, a model predicting that each 10-degree-Fahrenheit decrease in temperature leads to 1,000 fewer people visiting the beach is unlikely to generalize well over both small and large beaches. In such cases the linear regression model does not apply, and GLM should be used instead, since it allows response variables with distributions other than the normal to be related to the predictors through a linear model.
Theory
- 1) GLM components: (1) a probability distribution from the exponential family; (2) a linear predictor $\eta=X\beta$; (3) a link function $g$ such that $E[Y]=\mu=g^{-1}(\eta)$. The dependent variable $Y$ in GLM is assumed to be generated from a particular distribution in the exponential family, a large class of probability distributions that includes the normal, binomial, Poisson, and gamma distributions. The mean, $\mu$, of the distribution depends on the independent variables $X$: $E[Y]=\mu=g^{-1}(X\beta)$, where $E[Y]$ is the expected value of $Y$, $X\beta$ is the linear predictor (a linear combination of the unknown parameters $\beta$), and $g$ is the link function. The variance is typically a function $V$ of the mean: $Var(Y)=V(\mu)=V(g^{-1}(X\beta))$. It is convenient if $V$ follows from an exponential family distribution, but it may simply be that the variance is a function of the predicted value. The unknown parameters, $\beta$, are typically estimated with maximum likelihood, maximum quasi-likelihood, or Bayesian techniques.
- 2) Probability distribution: the over-dispersed exponential family of distributions is a generalization of the exponential family and the exponential dispersion model of distributions, and includes those distributions, parameterized by $\theta$ and $\tau$, whose density functions can be expressed as $f_{Y}(y\mid\theta,\tau)=h(y,\tau)\exp\left(\frac{b(\theta)^{T}T(y)-A(\theta)}{d(\tau)}\right),$ where $\tau$ is the dispersion parameter, usually related to the variance of the distribution. For scalar $Y$ and $\theta$, this reduces to $f_{Y}(y\mid\theta,\tau)=h(y,\tau)\exp\left(\frac{b(\theta)T(y)-A(\theta)}{d(\tau)}\right),$ where $\theta$ is related to the mean of the distribution. If $b(\theta)$ is the identity function, then the distribution is said to be in canonical (natural) form.
- 3) Linear predictor and link function: the linear predictor is the quantity that incorporates the information about the independent variables into the model; $\eta=X\beta$ is related to the expected value of the response. $\eta$ is expressed as a linear combination of the unknown parameters $\beta$, with coefficients given by the matrix of independent variables $X$. The link function provides the relationship between the linear predictor and the mean of the distribution function. The following table lists some commonly used exponential family distributions and the data they are typically used for, along with their canonical link functions and inverses.
| Distribution | Support of distribution | Typical uses | Link name | Link function | Mean function |
|---|---|---|---|---|---|
| Normal | real: $(-\infty,+\infty)$ | Linear-response data | Identity | $X\beta=\mu$ | $\mu=X\beta$ |
| Exponential | real: $(0,+\infty)$ | Exponential-response data, scale parameters | Negative inverse | $X\beta=-\mu^{-1}$ | $\mu=-(X\beta)^{-1}$ |
| Gamma | real: $(0,+\infty)$ | Exponential-response data, scale parameters | Negative inverse | $X\beta=-\mu^{-1}$ | $\mu=-(X\beta)^{-1}$ |
| Inverse Gaussian | real: $(0,+\infty)$ |  | Inverse squared | $X\beta=\mu^{-2}$ | $\mu=(X\beta)^{-\frac{1}{2}}$ |
| Poisson | integer: $0,1,2,\ldots$ | Count of occurrences in a fixed amount of time/space | Log | $X\beta=\ln(\mu)$ | $\mu=\exp(X\beta)$ |
| Bernoulli | integer: $\{0,1\}$ | Outcome of a single yes/no occurrence | Logit | $X\beta=\ln\left(\frac{\mu}{1-\mu}\right)$ | $\mu=\frac{\exp(X\beta)}{1+\exp(X\beta)}$ |
| Binomial | integer: $0,1,\ldots,N$ | Count of # of "yes" occurrences out of $N$ yes/no occurrences | Logit | $X\beta=\ln\left(\frac{\mu}{1-\mu}\right)$ | $\mu=\frac{\exp(X\beta)}{1+\exp(X\beta)}$ |
| Categorical | an integer in $[0,K)$, or a $K$-vector of $0/1$ entries with exactly one $1$ | Outcome of a single $K$-way occurrence | Logit | $X\beta=\ln\left(\frac{\mu}{1-\mu}\right)$ | $\mu=\frac{\exp(X\beta)}{1+\exp(X\beta)}$ |
| Multinomial | $K$-vector of integers in $[0,N]$ summing to $N$ | Count of occurrences of each of $K$ types out of $N$ total $K$-way occurrences | Logit | $X\beta=\ln\left(\frac{\mu}{1-\mu}\right)$ | $\mu=\frac{\exp(X\beta)}{1+\exp(X\beta)}$ |
In the cases of the exponential and gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. In particular, the linear predictor may be positive, which would give an impossible negative mean. When maximizing the likelihood, precautions must be taken to avoid this. An alternative is to use a non-canonical link function.
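For example, fitting gamma-distributed responses with a log link (a common non-canonical choice) keeps the fitted mean positive for any value of the linear predictor. A minimal R sketch, where mydata, y, and x are hypothetical names, not part of this lecture's examples:

fit <- glm(y ~ x, data = mydata, family = Gamma(link = "log"))  # log link avoids negative means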
Note also that in the case of the Bernoulli, binomial, categorical and multinomial distributions, the support of the distributions is not the same type of data as the parameter being predicted. In all of these cases, the predicted parameter is one or more probabilities, i.e. real numbers in the range $[0,1]$. The resulting model is known as logistic regression (or multinomial logistic regression in the case that K-way rather than binary values are being predicted).
For the Bernoulli and binomial distributions, the parameter is a single probability, indicating the likelihood of occurrence of a single event. The Bernoulli still satisfies the basic condition of the generalized linear model in that, even though a single outcome will always be either $0$ or $1$, the expected value will nonetheless be a real-valued probability, i.e. the probability of occurrence of a "yes" (or $1$) outcome. Similarly, in a binomial distribution, the expected value is Np, i.e. the expected proportion of "yes" outcomes will be the probability to be predicted.
For categorical and multinomial distributions, the parameter to be predicted is a $K$-vector of probabilities, with the further restriction that all probabilities must add up to $1$. Each probability indicates the likelihood of occurrence of one of the $K$ possible values. For the multinomial distribution, and for the vector form of the categorical distribution, the expected values of the elements of the vector can be related to the predicted probabilities similarly to the binomial and Bernoulli distributions.
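As a quick numerical illustration of the logit link and mean function from the table above (the linear-predictor values here are arbitrary):

eta <- c(-2, 0, 3)   # arbitrary linear-predictor values
mu  <- plogis(eta)   # inverse logit: exp(eta) / (1 + exp(eta)), always a probability in (0, 1)
qlogis(mu)           # logit: log(mu / (1 - mu)) recovers eta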
- 4) Fitting the model:
- Maximum likelihood: the estimates can be found using an iteratively reweighted least squares algorithm with Newton-Raphson updates $\beta^{(t+1)}=\beta^{(t)}+J^{-1}(\beta^{(t)})u(\beta^{(t)})$, where $J(\beta^{(t)})$ is the observed information matrix (the negative of the Hessian matrix) and $u(\beta^{(t)})$ is the score function; or with Fisher's scoring method, $\beta^{(t+1)}=\beta^{(t)}+I^{-1}(\beta^{(t)})u(\beta^{(t)})$, where $I(\beta^{(t)})$ is the Fisher information matrix (see the sketch after this list).
- Bayesian methods: approximate the posterior distribution, usually via the Laplace approximation (which approximates integrals of the form $\int_{a}^{b}e^{Mf(x)}\,dx$) or Markov chain Monte Carlo methods such as Gibbs sampling.
- 5) Advantages of GLM: (1) it can perform data analysis within and between subjects without the need to average the data itself; (2) it allows you to counterbalance random stimulus orders; (3) it allows you to exclude segments of runs with artifacts; (4) it can perform more sophisticated analyses (e.g., two-factor ANOVA with interactions); (5) it is easier to work with.
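To make the Fisher scoring update concrete, here is a minimal R sketch of iteratively reweighted least squares for logistic regression. The simulated design matrix X, outcome y, and coefficient values are illustrative assumptions, not part of this lecture's datasets:

# simulate data: design matrix with intercept, binary outcome from a logistic model
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))
beta.true <- c(-0.5, 1.0, -2.0)
y <- rbinom(n, 1, plogis(X %*% beta.true))

beta <- rep(0, ncol(X))               # starting values
for (iter in 1:25) {
  eta  <- X %*% beta                  # linear predictor
  mu   <- plogis(eta)                 # inverse logit (mean function)
  W    <- as.vector(mu * (1 - mu))    # variance function = Fisher weights
  u    <- t(X) %*% (y - mu)           # score function u(beta)
  info <- t(X) %*% (W * X)            # Fisher information matrix I(beta)
  step <- solve(info, u)              # I^{-1}(beta) u(beta)
  beta <- beta + step                 # Fisher scoring update
  if (max(abs(step)) < 1e-8) break
}
cbind(beta, coef(glm(y ~ X[, -1], family = binomial)))  # compare with R's glm()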
Applications
Liang and Zeger (1986, Biometrika) proposed an extension of generalized linear models to the analysis of longitudinal data. They introduce a class of estimating equations that gives consistent estimates of the regression parameters and of their variances under mild assumptions about the time dependence. The estimating equations are derived without specifying the joint distribution of a subject's observations, yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory is presented for the general class of estimators, and specific cases assuming independence, m-dependence, and exchangeable correlation structures for each subject are discussed. The efficiency of the proposed estimators in two simple situations is considered. The approach is closely related to quasi-likelihood.
Schall (1991, Biometrika) proposed a conceptually simple but general algorithm for the estimation of the fixed effects, random effects, and components of dispersion in generalized linear models with random effects. Conditions are described under which the algorithm yields approximate maximum likelihood or quasi-maximum likelihood estimates of the fixed effects and dispersion components, and approximate empirical Bayes estimates of the random effects. The algorithm is applied to two data sets to illustrate the estimation of components of dispersion and the modelling of over-dispersion.
Software
GLM in R:
glm(formula, family = familytype(link = linkfunction), data = )
# list of families and their default link functions:

| Family | Default Link Function |
|---|---|
| binomial | (link = "logit") |
| gaussian | (link = "identity") |
| Gamma | (link = "inverse") |
| inverse.gaussian | (link = "1/mu^2") |
| poisson | (link = "log") |
| quasi | (link = "identity", variance = "constant") |
| quasibinomial | (link = "logit") |
| quasipoisson | (link = "log") |
# Logistic regression is useful for predicting a binary outcome from a set of continuous predictor variables.
# F is a binary factor and x1-x3 are continuous predictors
fit <- glm(F ~ x1 + x2 + x3, data = mydata, family = binomial())
summary(fit)                        # display the results
confint(fit)                        # 95% CI for the coefficients
predict(fit, type = "response")     # predicted values
residuals(fit, type = "deviance")   # residuals
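Because the logit link models log-odds, exponentiating the fitted coefficients converts them to odds ratios; a short follow-up to the sketch above:

exp(coef(fit))      # odds ratios
exp(confint(fit))   # 95% CI on the odds-ratio scale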
- See these Logistic Regression R Examples.
## Use Poisson regression when predicting an outcome variable representing counts from a set of continuous predictor variables
# Y is a count and x1-x3 are continuous predictors
fit <- glm(Y ~ x1 + x2 + x3, data = mydata, family = poisson())
summary(fit)   # display results
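Count data are often overdispersed relative to the Poisson assumption (variance larger than the mean). A quick diagnostic, under the same hypothetical variable names as above, is to compare the residual deviance with its degrees of freedom and, if the ratio is well above 1, refit with the quasipoisson family:

deviance(fit) / df.residual(fit)   # a ratio well above 1 suggests overdispersion
fit.q <- glm(Y ~ x1 + x2 + x3, data = mydata, family = quasipoisson())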
### To illustrate GLM, consider the following contraceptive-use data:
cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", header = TRUE)
cuse
|  | age | education | wantsMore | notUsing | using |
|---|---|---|---|---|---|
| 1 | <25 | low | yes | 53 | 6 |
| 2 | <25 | low | no | 10 | 4 |
| 3 | <25 | high | yes | 212 | 52 |
| 4 | <25 | high | no | 50 | 10 |
| 5 | 25-29 | low | yes | 60 | 14 |
| 6 | 25-29 | low | no | 19 | 10 |
| 7 | 25-29 | high | yes | 155 | 54 |
| 8 | 25-29 | high | no | 65 | 27 |
| 9 | 30-39 | low | yes | 112 | 33 |
| 10 | 30-39 | low | no | 77 | 80 |
| 11 | 30-39 | high | yes | 118 | 46 |
| 12 | 30-39 | high | no | 68 | 78 |
| 13 | 40-49 | low | yes | 35 | 6 |
| 14 | 40-49 | low | no | 46 | 48 |
| 15 | 40-49 | high | yes | 8 | 8 |
| 16 | 40-49 | high | no | 12 | 31 |
attach(cuse)
# regress contraceptive use on age, education, and desire for more children
lrfit <- glm(cbind(using, notUsing) ~ age + education + wantsMore, family = binomial)
lrfit
Call:  glm(formula = cbind(using, notUsing) ~ age + education + wantsMore, family = binomial)
Coefficients:
 (Intercept)      age25-29      age30-39      age40-49  educationlow  wantsMoreyes
     -0.8082        0.3894        0.9086        1.1892       -0.3250       -0.8330
Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
Null Deviance:      165.8
Residual Deviance:  29.92        AIC: 113.4
### Now relevel to change the base categories, defining our own indicator variables for high education and for wanting no more children:
noMore <- wantsMore == "no"
hiEduc <- education == "high"
glm(cbind(using, notUsing) ~ age + hiEduc + noMore, family = binomial)

Call:  glm(formula = cbind(using, notUsing) ~ age + hiEduc + noMore, family = binomial)
Coefficients:
(Intercept)     age25-29     age30-39     age40-49   hiEducTRUE   noMoreTRUE
    -1.9662       0.3894       0.9086       1.1892       0.3250       0.8330
Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
Null Deviance:      165.8
Residual Deviance:  29.92        AIC: 113.4
1 - pchisq(29.92, 10)
[1] 0.0008828339
## The residual deviance of 29.92 on 10 degrees of freedom is highly significant, so this model fits poorly; a better model adds an interaction between age and desire for no more children:
lrfit <- glm(cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial)
lrfit
Call: glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial)
Coefficients:
        (Intercept)             age25-29             age30-39
           -1.80317              0.39460              0.54666
           age40-49           noMoreTRUE           hiEducTRUE
            0.57952              0.06622              0.34065
age25-29:noMoreTRUE  age30-39:noMoreTRUE  age40-49:noMoreTRUE
            0.25918              1.11266              1.36167
Degrees of Freedom: 15 Total (i.e. Null);  7 Residual
Null Deviance:      165.8
Residual Deviance:  12.63        AIC: 102.1
## Now the model's deviance of 12.63 on 7 degrees of freedom is not significant at the conventional five percent level, so there is no evidence against this model. The summary function provides more detailed information about the fit:
summary(lrfit)

Call:
glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.30027  -0.66163  -0.03286   0.81945   1.73851
Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)         -1.80317    0.18018 -10.008  < 2e-16 ***
age25-29             0.39460    0.20145   1.959  0.05013 .
age30-39             0.54666    0.19842   2.755  0.00587 **
age40-49             0.57952    0.34742   1.668  0.09530 .
noMoreTRUE           0.06622    0.33071   0.200  0.84130
hiEducTRUE           0.34065    0.12577   2.709  0.00676 **
age25-29:noMoreTRUE  0.25918    0.40975   0.633  0.52704
age30-39:noMoreTRUE  1.11266    0.37404   2.975  0.00293 **
age40-49:noMoreTRUE  1.36167    0.48433   2.811  0.00493 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 165.77  on 15  degrees of freedom
Residual deviance:  12.63  on  7  degrees of freedom
AIC: 102.14
Number of Fisher Scoring iterations: 4
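A likelihood-ratio (analysis of deviance) test makes the comparison between the additive and interaction models explicit: the residual deviance drops from 29.92 to 12.63, a difference of 17.29 on 3 degrees of freedom. A minimal sketch, assuming the additive fit is saved as lrfit0 (a name not used above):

lrfit0 <- glm(cbind(using, notUsing) ~ age + hiEduc + noMore, family = binomial)
anova(lrfit0, lrfit, test = "Chisq")   # likelihood-ratio test for the age:noMore interaction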
Problems
Repeat the GLM analyses above using the following SOCR datasets.
References
- SOCR Home page: http://www.socr.umich.edu