Difference between revisions of "SMHS GLM"
(→Software) |
(→Problems) |
||
Line 254: | Line 254: | ||
===Problems=== | ===Problems=== | ||
Repeat the GLM analyses above using the following SOCR datasets | Repeat the GLM analyses above using the following SOCR datasets | ||
− | * | + | *[http://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_021808_ConsumerPriceIndex3Way Consumer Price Index 3 Way] |
− | * | + | *[http://wiki.socr.umich.edu/index.php/SOCR_Data_MonetaryBase1959_2009 Monetary Base] |
− | + | ||
− | |||
===References=== | ===References=== | ||
Revision as of 14:20, 12 December 2014
Contents
Scientific Methods for Health Sciences - Generalized Linear Modeling (GLM)
Overview
Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear regression, which allows for response variables that have error distribution models other than a normal distribution. It generalizes linear regression by allowing the linear model to be related to the response variable via a link function and allowing the magnitude of the variance of each measurement to be function of its predicted value. GLM is a way of unifying statistical models like linear regression logistic regression and Poisson regression. Methods like iteratively reweighted least squares method for maximum likelihood estimation, Bayesian approaches and least squares fitted to variance stabilized responses are proposed for GLM. In this lecture, we are going to present a general introduction to GLM including the assumptions, model applied and illustrate the application of GLM with examples presented.
Motivation
We have discussed about linear regression, which deal with the linear relationship between response and the predictors. What if the response variable does not follow a normal distribution? For example, a model that predicts the probability of making a yes/no choice is not suitable as a linear-response model given that the probabilities are bounded on both ends. Or a study that aims to predict each decrease in 10 degrees Fahrenheit leads to 1000 fewer people going on vacation in the beach is unlikely to generalize well over small nor large beaches. In many cases, the linear regression model won’t apply and GLM has to be applied for it allows the response variables that have arbitrary distribution other than normal distribution to vary linearly with the predicted values.
Theory
- 1) GLM components: (1) a probability distribution from the exponential family; (2) a linear predictor $\eta=X\beta$ (3) a link function $g$ such that $E[Y]=\mu=g^{-1}(\eta)$ The dependent variable $Y$ in GLM is assumed to be generated from a particular distribution in the exponential family, a large range of probability distributions include normal, binomial, Poisson and gamma distribution. The mean, $\mu$, of the distribution depends on the independent variables $X:E[Y]=\mu=g^{-1} (X\beta),$ where $E[Y]$ is the expected value of $Y, X\beta$ is the linear predictor, a linear combination of unknown parameters, $\beta$, $g$ is the linked function. The variance is typically a function $V$ of the mean: $Var(Y)=V(\mu)=V(g^{-1} (X\beta)).$ It is convenient if $V$ follows from the exponential family distribution but it may simply be that the variance is a function of the predicted value. The unknown parameters, $\beta$, are typically estimated with maximum likelihood, maximum quasi-likelihood, or Bayesian techniques.
- 2)Probability distribution: the over-dispersed exponential family of distribution is a generalization of exponential family and exponential dispersion model of distributions and includes those by $\theta$ and $\tau$, whose density function can be expressed as $f_{Y} (y│\theta,\tau)=h(y,\tau)exp\left(\frac{b(theta)^{T} T(y)-A(\theta)}{(d{\tau})}\right)$ where $\tau$ is the dispersion parameter and is usually related to the variance of the distribution. For scalar $Y$ and $\theta$, this reduces to $f_{Y} (y│\theta,\tau)=h(y,\tau)exp\left(\frac{b(theta)^{T} T(y)-A(\theta)}{(d{\tau})}\right)$, $\theta$ is related to the mean of the distribution. If $b(\theta)$ is the identity function then the distribution is said to be in canonical form (natural form).
- 3) Linear predictor and link function: the quantity which incorporate the information about the independent variables into the model. $\eta=X\beta$ relates the expected value of the data (predictor). $\eta$ is expressed as linear combination of unknown parameters $\beta$. The coefficients of the linear combination are represented as the matrix of independent variables $X$. The link function provides the relationship between the linear predictor and the mean of the distribution function. The following table lists some commonly used exponential family distribution and data that typically used for along with the canonical link functions and their inverses.
Distribution | Support of distribution | Typical uses | Link name | Link function | Mean function |
Normal | $(-\infty,+\infty)$ real | Linear response data | Identity | $X\beta=\mu$ | $\mu =X\beta$ |
Exponential | $(0,+\infty)$ real | Exponential response data, scale parameters | Inverse | $X\beta=-\mu^{-1}$ | $\mu =(X\beta)^{-1}$ |
Gamma | $(0,+\infty)$ real | Exponential response data, scale parameters | Inverse | $X\beta=-\mu^{-1}$ | $\mu =(X\beta)^{-1}$ |
Inverse Gaussian | $(0,+\infty)$ real | Inverse | $X\beta=-\mu^{-2}$ | $\mu =(X\beta)^{-\frac{1} {2}}$ | |
Poisson | integer $(0,\infty)$ | Count of occurrences in fixed time/space | log | $X\beta=1n_{\mu}$ | $\mu = exp(X\beta)$ |
Bernoulli | integer [0,1] | Outcome of single yes/no occurrence | logit | $X\beta=ln(\frac{\mu}{1-\mu})$ | $\mu =\frac{exp(X\beta)}{1+exp(X\beta)}$ |
Binomial | integer [0,N] | Count of # of ‘yes’ in N yes/no occurrences | logit | $X\beta=ln(\frac{\mu}{1-\mu})$ | $\mu =\frac{exp(X\beta)}{1+exp(X\beta)}$ |
Categorical | integer [0,K],K-vector of integers [0,1] | Outcome of single K-way occurrence | logit | $X\beta=ln(\frac{\mu}{1-\mu})$ | $\mu =\frac{exp(X\beta)}{1+exp(X\beta)}$ |
Multinomial | K-vector of integers: [0,1] | Count of occurrences of different types (1,…,K) out of N total K-way occurrences | logit | $X\beta=ln(\frac{\mu}{1-\mu})$ | $\mu =\frac {exp(X\beta)}{1+exp(X\beta)}$ |
In the cases of the exponential and gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. Particularly, the linear predictor may be negative, which would give an impossible negative mean. When maximizing the likelihood, precautions must be taken to avoid this. An alternative is to use a non-canonical link function.
Note also that in the case of the Bernoulli, binomial, categorical and multinomial distributions, the support of the distributions is not the same type of data as the parameter being predicted. In all of these cases, the predicted parameter is one or more probabilities, i.e. real numbers in the range $[0,1]$. The resulting model is known as logistic regression (or multinomial logistic regression in the case that K-way rather than binary values are being predicted).
For the Bernoulli and binomial distributions, the parameter is a single probability, indicating the likelihood of occurrence of a single event. The Bernoulli still satisfies the basic condition of the generalized linear model in that, even though a single outcome will always be either $0$ or $1$, the expected value will nonetheless be a real-valued probability, i.e. the probability of occurrence of a "yes" (or $1$) outcome. Similarly, in a binomial distribution, the expected value is Np, i.e. the expected proportion of "yes" outcomes will be the probability to be predicted.
For categorical and multinomial distributions, the parameter to be predicted is a $K$-vector of probabilities, with the further restriction that all probabilities must add up to $1$. Each probability indicates the likelihood of occurrence of one of the $K$ possible values. For the multinomial distribution, and for the vector form of the categorical distribution, the expected values of the elements of the vector can be related to the predicted probabilities similarly to the binomial and Bernoulli distributions.
- 4) Fitting the model:
- Maximum likelihood: can be found using iteratively reweighted least squares algorithm with $\beta^{t+1}=\beta^{t}+J^{-1} (\beta^{(t)}) u(\beta^{(t)})$, where $J(\beta^{(t)})$ is the observed information matrix (the negative of the Hessian matrix) and $u(\beta^{(t)})$ s the score function. With Fisher’s scoring method: $\beta^{(t+1)}=\beta^{(t)} +I^{-1} (\beta^{(t)})u(\beta^{(t)})$, where $where I^{-1} (\beta^{(t)})$is the Fisher information matrix.
- Bayesian method: to approximate the posterior distribution, usually using Laplace approximation $(\int_{a}^{b}e^{Mf(x)}dx)$ or Markov Chain Monte Carlo method such as Gibbs sampling.
- 5) Advantages of GLM: (1) can perform data analysis within and between subjects without the need to average the data itself; (2) allows you to counterbalance random stimuli orders; (3) allows you to exclude segments of runs with artifacts; (4) can perform more sophisticated analysis (e.g., two factor ANOVA with interactions); (5) easier to work with.
Applications
This article proposed an extension of generalized linear models to the analysis of longitudinal data. We introduce a class of estimating equations that give consistent estimates of the regression parameters and of their variance under mild assumptions about the time dependence. The estimating equations are derived without specifying the joint distribution of a subject's observations yet they reduce to the score equations for multivariate Gaussian outcomes. Asymptotic theory is presented for the general class of estimators. Specific cases in which we assume independence, m-dependence and exchangeable correlation structures from each subject are discussed. Efficiency of the proposed estimators in two simple situations is considered. The approach is closely related to quasi-likelihood.
This article proposed a conceptually very simple but general algorithm for the estimation of the fixed effects, random effects, and components of dispersion in generalized linear models with random effects. Conditions are described under which the algorithm yields approximate maximum likelihood or quasi-maximum likelihood estimates of the fixed effects and dispersion components, and approximate empirical Bayes estimates of the random effects. The algorithm is applied to two data sets to illustrate the estimation of components of dispersion and the modelling of over-dispersion.
Software
GLM in R:
glm(formula, family=familytype(link=linkfunction),data= ): # list of family and default link functions:
Family | Default Like Function |
Binomial | (link=’logit’) |
Gaussian | (link=identity) |
Gamma | (link=inverse) |
Inverse gamma | (link=${1}/mu^{2}$) |
Poisson | (link=’log’) |
Quasi | (link=’identity’,variance=’constant’) |
Quasibinomial | (link=’logit’) |
Quasipoisson | (link=’log’) |
# Logistic regression is useful when predict a binary outcome from a set of continuous predictor variables. # F as a binary factor and x1-x3 are continuous predictors fit <- glm(F~x1+x2+x3, data=mydata, family=binomial()) summary(fit) # display the results confint(fit) # 95% CI for the coefficients predict(fit,type=’response’) # predicted values residuals(fit,type=’deviance’) # residuals
## Use Poisson regression when predicting an outcome variable representing counts form a set of continuous predictor variables # Y is a count and x1-x3 are continuous predictors fit <- glm(Y~x1+x2+x3, data=mydata, family=poisson()) summary(fit) # display results
### TO illustrate the GLM with the following example cuse <-read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",header=TRUE) cuse
Age | Education | Wants More | Not Using | Using | |
1 | <25 | low | yes | 53 | 6 |
2 | <25 | low | no | 10 | 4 |
3 | <25 | high | yes | 212 | 52 |
4 | <25 | high | no | 50 | 10 |
5 | 25-29 | low | yes | 60 | 14 |
6 | 25-29 | low | no | 19 | 10 |
7 | 25-29 | high | yes | 155 | 54 |
8 | 25-29 | high | no | 65 | 27 |
9 | 30-39 | low | yes | 112 | 33 |
10 | 30-39 | low | no | 77 | 80 |
11 | 30-39 | high | yes | 118 | 46 |
12 | 30-39 | high | no | 68 | 78 |
13 | 40-49 | low | yes | 35 | 6 |
14 | 40-49 | low | no | 46 | 48 |
15 | 40-49 | high | yes | 8 | 8 |
16 | 40-49 | high | no | 12 | 31 |
attach(cuse) lrfit <- glm( cbind(using, notUsing) ~ + age + education + wantsMore , family = binomial) # run the GLM on age, education, wantsMore lrfit
Call: glm(formula = cbind(using, notUsing) ~ +age + education + wantsMore, family = binomial)
Coefficients: (Intercept) age25-29 age30-39 age40-49 educationlow -0.8082 0.3894 0.9086 1.1892 -0.3250
wantsMoreyes -0.8330
Degrees of Freedom: 15 Total (i.e. Null); 10 Residual Null Deviance: 165.8 Residual Deviance: 29.92 AIC: 113.4
### Now relevel to change the base category and define our own indicator variables with high education and women who want no more children:
noMore <- wantsMore == "no" hiEduc <- education == "high" glm( cbind(using,notUsing) ~ age + hiEduc + noMore, family=binomial) Call: glm(formula = cbind(using, notUsing) ~ age + hiEduc + noMore, family = binomial)
Coefficients: (Intercept) age25-29 age30-39 age40-49 hiEducTRUE noMoreTRUE -1.9662 0.3894 0.9086 1.1892 0.3250 0.8330
Degrees of Freedom: 15 Total (i.e. Null); 10 Residual Null Deviance: 165.8 Residual Deviance: 29.92 AIC: 113.4
1-pchisq(29.92,10) 0.0008828339 ## we see that the residual deviance of 29.92 on 10 degree of freedom is highly significant, a better model could be introducing an interaction term of age and desire for no more children: lrfit <- glm( cbind(using,notUsing) ~ age * noMore + hiEduc , family=binomial) lrfit
Call: glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial)
Coefficients: (Intercept) age25-29 age30-39 -1.80317 0.39460 0.54666 age40-49 noMoreTRUE hiEducTRUE 0.57952 0.06622 0.34065 age25-29:noMoreTRUE age30-39:noMoreTRUE age40-49:noMoreTRUE 0.25918 1.11266 1.36167
Degrees of Freedom: 15 Total (i.e. Null); 7 Residual Null Deviance: 165.8 Residual Deviance: 12.63 AIC: 102.1
## now the model’s deviance of 12.63 on 7 degree of freedom is not significant at the conventional five per cent level, so there is no evidence against this model. The summary function offers more detailed information about this model
summary(lrfit) Call: glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial)
Deviance Residuals: Min 1Q Median 3Q Max -1.30027 -0.66163 -0.03286 0.81945 1.73851
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.80317 0.18018 -10.008 < 2e-16 *** age25-29 0.39460 0.20145 1.959 0.05013 . age30-39 0.54666 0.19842 2.755 0.00587 ** age40-49 0.57952 0.34742 1.668 0.09530 . noMoreTRUE 0.06622 0.33071 0.200 0.84130 hiEducTRUE 0.34065 0.12577 2.709 0.00676 ** age25-29:noMoreTRUE 0.25918 0.40975 0.633 0.52704 age30-39:noMoreTRUE 1.11266 0.37404 2.975 0.00293 ** age40-49:noMoreTRUE 1.36167 0.48433 2.811 0.00493 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 165.77 on 15 degrees of freedom Residual deviance: 12.63 on 7 degrees of freedom AIC: 102.14
Number of Fisher Scoring iterations: 4
Problems
Repeat the GLM analyses above using the following SOCR datasets
References
- SOCR Home page: http://www.socr.umich.edu
Translate this page: