SMHS MLR
Scientific Methods for Health Sciences - Multiple Linear Regression (MLR)
Overview
Multiple Linear Regression (MLR) is a class of statistical models and procedures that relates a single quantitative dependent (response) variable to two or more quantitative independent (predictor) variables. The goal of the MLR fitting procedure is to estimate all of the regression coefficients from the data using least-squares fitting. In this section, we will discuss multiple linear regression and illustrate its application with examples.
Motivation
We are already very familiar with simple linear regression, where a linear model is applied to the case of one dependent variable and one explanatory variable (predictor). However, it is often the case that the dependent variable is influenced by more than one predictor. So, what if we have more than one explanatory variable? How can we apply a linear model when there is more than one predictor in the study? Multiple linear regression is the answer here.
Theory
- MLR: In linear regression, data are modeled using linear predictor functions, and the unknown model parameters are estimated from the sample data. The term sometimes refers to a model in which some quantile of the conditional distribution of $y$ given $X$ is expressed as a linear function of $X$. This model is widely applied in a variety of studies and is one of the most commonly used models in statistical analysis. Multiple linear regression is the special case of linear regression in which two or more explanatory variables are included in the model. MLR should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted rather than a single scalar variable. In fact, nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Hence, MLR is of significant importance in statistical modeling.
The matrix form of MLR:
$y=X\beta+\varepsilon$
$y=\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},\quad X=\begin{pmatrix} X_1^T \\ \vdots \\ X_n^T \end{pmatrix}=\begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix},\quad \beta=\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}$
is the p-dimensional parameter vector, the elements of which are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables
$\varepsilon=\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$
is the error term, also called the disturbance term or noise. $X$ is called the design matrix; usually a constant is included as one of the regressors, for example $x_{i1}=1$ for $i=1,2,\ldots,n$.
The corresponding element of $\beta$ is called the intercept. Many statistical inference procedures for linear models require an intercept to be present, so it is often included even when theoretical considerations suggest that its value should be zero. The $x_{ij}$ may be viewed either as random variables, which we simply observe, or as predetermined fixed values, which we can choose; however, different approaches to asymptotic analysis are used in these two situations.
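To make the notation concrete, the following Python sketch (assuming NumPy is available; the predictor names, parameter values, and noise level are hypothetical) constructs a design matrix whose first column is the constant regressor $x_{i1}=1$ and generates a response according to $y=X\beta+\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100                               # number of observations
x1 = rng.normal(size=n)               # first predictor (simulated)
x2 = rng.normal(size=n)               # second predictor (simulated)

# Design matrix X: the first column is the constant regressor x_i1 = 1,
# so the corresponding element of beta is the intercept.
X = np.column_stack([np.ones(n), x1, x2])

beta = np.array([2.0, 0.5, -1.3])     # hypothetical true parameter vector
eps = rng.normal(scale=0.7, size=n)   # error (disturbance) term

y = X @ beta + eps                    # the MLR model y = X*beta + epsilon
```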
- Assumptions: there are a number of assumptions about the predictor variables, the response variable, and their relationship in the standard linear regression model. The following are the major assumptions made in standard situations:
- Weak exogeneity: the predictor variables x can be treated as fixed values instead of random variables.
- Linearity: the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
- Constant variance (homoscedasticity): different responses have the same error variance regardless of their values. This assumption is invalid in practice if the response varies over a wide scale. Under heteroscedasticity, the fit averages over the distinguishable variances around the points to obtain a single variance that does not accurately represent the variability along the regression line.
- Independence of the errors: the errors of the response variables should be uncorrelated with each other. Methods such as generalized least squares are capable of handling correlated errors, though they typically require significantly more data unless some sort of regularization is used to bias the model toward assuming uncorrelated errors.
- Lack of multicollinearity in predictors: the design matrix $X$ should have full column rank $p$; otherwise we have multicollinearity among the predictor variables. A quick diagnostic for this assumption is sketched below.
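A minimal sketch of such a diagnostic, assuming NumPy and a design matrix X laid out as above (the function name and the variance-inflation-factor computation are illustrative, not a prescribed procedure):

```python
import numpy as np

def check_multicollinearity(X):
    """Check the full-column-rank assumption and report variance inflation factors."""
    n, p = X.shape
    full_rank = np.linalg.matrix_rank(X) == p   # assumption: rank(X) = p

    vifs = []
    for j in range(1, p):                        # skip the intercept column
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns; a large VIF signals collinearity.
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        vifs.append(1.0 / (1.0 - r2))            # diverges as column j becomes collinear
    return full_rank, vifs
```

For example, calling check_multicollinearity(X) on the simulated design matrix above should report full rank and VIFs near 1.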
- Regression interpretation: with multiple covariates in the model, we are fitting a model of $y$ with respect to all the covariates. A fitted linear regression model can be used to identify the relationship between a single predictor variable $x_{j}$ and the response variable $y$ when all other predictors are held fixed. Specifically, the interpretation of $\beta_{j}$ is the expected change in $y$ for a one-unit change in $x_{j}$, holding the other covariates fixed. This is sometimes called the unique effect of $x_{j}$ on $y$.
It is possible for the unique effect to be nearly zero even when the marginal effect is very large, which may imply that some other covariates capture all of the information in $x_{j}$, so that once those variables are in the model, $x_{j}$ contributes nothing to the variation in $y$. Conversely, the unique effect of $x_{j}$ can be large while its marginal effect is nearly zero. This can happen when the other covariates explain a great deal of the variation in $y$, but they mainly explain that variation in a way that is complementary to what is captured by $x_{j}$. Under these conditions, including the other variables in the model reduces the part of the variability in $y$ that is unrelated to $x_{j}$, thereby strengthening the apparent relationship with $x_{j}$.
The term ‘held fixed’ may depend on how the values of the predictor variables arise. When we set the values of the predictors according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been ‘held fixed’ by us.
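The contrast between a marginal effect and a unique effect can be illustrated with a small simulation, sketched below in Python (assuming NumPy; the correlated predictors and coefficient values are hypothetical). The slope of $x_1$ from a simple regression is large because $x_1$ proxies for $x_2$, while its coefficient in the multiple regression is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # x2 is strongly correlated with x1
y = 1.0 + 2.0 * x2 + rng.normal(size=n)    # y actually depends only on x2

# Marginal effect of x1: simple regression of y on x1 alone
X_marg = np.column_stack([np.ones(n), x1])
b_marg, *_ = np.linalg.lstsq(X_marg, y, rcond=None)

# Unique effect of x1: multiple regression, holding x2 fixed
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print("marginal slope of x1:", b_marg[1])   # large, since x1 proxies for x2
print("unique slope of x1:  ", b_full[1])   # near zero once x2 is in the model
```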
- Estimation in MLR: methods to estimate the parameters in MLR and to make inferences based on the fitted linear regression. Some commonly used estimation methods are introduced in the following:
- Least-squares estimation:
- OLS (ordinary least squares): the simplest and most commonly used estimator. The aim of OLS is to choose the coefficients that minimize the sum of squared residuals:
$\hat{\beta}=\arg\min_{\beta}\sum_{i=1}^{n}\left(y_i-x_i^{T}\beta\right)^{2}=(X^{T}X)^{-1}X^{T}y,$
where the closed-form solution holds when $X$ has full column rank.
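A minimal computational sketch of this estimator, assuming NumPy and the X, y arrays from the earlier snippets (np.linalg.lstsq is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: minimize the sum of squared residuals."""
    # Solves the least-squares problem; equivalent to the normal equations
    # beta_hat = (X^T X)^{-1} X^T y when X has full column rank.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    residuals = y - fitted
    rss = float(residuals @ residuals)   # the minimized sum of squared residuals
    return beta_hat, rss

# Example: beta_hat, rss = ols_fit(X, y)   # should roughly recover (2.0, 0.5, -1.3)
```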
- SOCR Home page: http://www.socr.umich.edu