# SMHS MLR

## Scientific Methods for Health Sciences - Multiple Linear Regression (MLR)

### Overview

Multiple Linear Regression is a class of statistical models and procedures that relate one quantitative dependent variable to two or more quantitative independent variables and model the relationship between them. The goal of the MLR computing procedure is to estimate all of the model coefficients from the data using least-squares fitting. In this section, we will discuss multiple linear regression and illustrate its application with examples.

### Motivation

We are already very familiar with simple linear regression, where a linear model relates one dependent variable to one explanatory variable (predictor). However, it is often the case that the dependent variable is influenced by more than one predictor. So, what if we have more than one explanatory variable? How can we apply a linear model when there is more than one predictor in the study? Multiple linear regression is the answer here.

### Theory

• MLR: In linear regression, data are modeled using linear predictor functions, and the unknown model parameters are estimated from the sample data. The term sometimes refers to a model in which some quantile of the conditional distribution of $y$ given $X$ is expressed as a linear function of $X$. This model is widely applied in a variety of studies and is one of the most commonly used models in statistical analysis. Multiple linear regression is the special case of linear regression in which more than one explanatory variable is included in the model. MLR should be distinguished from multivariate linear regression, where multiple correlated dependent variables are predicted rather than a single scalar variable. In fact, nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Hence, MLR is of significant importance in statistical modeling.

The matrix form of MLR:

$y=X\beta+\varepsilon$, where
$$y=\begin{pmatrix}y_{1}\\ y_{2}\\ \vdots\\ y_{n}\end{pmatrix},\quad X=\begin{pmatrix}X_{1}^{T}\\ \vdots\\ X_{n}^{T}\end{pmatrix}=\begin{pmatrix}x_{11} & \dots & x_{1p}\\ x_{21} & \dots & x_{2p}\\ \vdots & & \vdots \\ x_{n1} & \dots & x_{np}\end{pmatrix},\quad \beta=\begin{pmatrix}\beta_{1}\\ \beta_{2}\\ \vdots\\ \beta_{p}\end{pmatrix}$$

is the p-dimensional parameter vector, the elements of which are interpreted as the partial derivatives of the dependent variable with respect to the various independent variables

$$\varepsilon=\begin{pmatrix}\varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{n}\end{pmatrix}$$

is the error term, disturbance term, or sometimes noise. $X$ is also called the design matrix; usually a constant is included as one of the regressors, for example $x_{i1}=1$ for $i=1,2,\dots,n$.

The corresponding element of $\beta$ is called the intercept. Many statistical inference procedures for linear models require an intercept to be present, so it is often included even if theoretical considerations suggest that its value should be zero. The $x_{ij}$ may be viewed either as random variables, which we simply observe, or as predetermined fixed values which we can choose; however, different approaches to asymptotic analysis are used in these two situations.
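As a concrete sketch of the design matrix with a constant column and the least-squares fit it supports (the simulated data, sample size and coefficient values below are illustrative assumptions, not from the text):

```python
import numpy as np

# Simulate a small dataset: n = 50 observations, an intercept column
# plus two hypothetical predictors x1 and x2.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])   # x_{i1} = 1 gives the intercept
beta_true = np.array([2.0, 0.5, -1.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Least-squares estimate of beta (lstsq is used instead of forming
# (X^T X)^{-1} explicitly, for numerical stability).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to beta_true
```

Because the constant column is included, the first element of `beta_hat` is the intercept described above.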

• Assumptions: a number of assumptions are made about the predictor variables, the response variable and their relationship in the standard linear regression model. The following are the major assumptions made in standard situations:
• Weak exogeneity: the predictor variables x can be treated as fixed values instead of random variables.
• Linearity: the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.
• Constant variance (homoscedasticity): different response variables have the same variance in their errors, regardless of their values. This assumption is invalid in practice if the response variables vary over a wide scale. Under heteroscedasticity, averaging the clearly distinguishable variances around the fitted points into a single variance inaccurately represents the variance around the regression line.
• Independence of the errors: the errors of the response variables should be uncorrelated with each other. Methods like generalized least squares are capable of handling correlated errors, although they typically require significantly more data unless some sort of regularization is used to bias the model towards assuming uncorrelated errors.
• Lack of multicollinearity in the predictors: the design matrix $X$ should have full column rank $p$; otherwise we have multicollinearity among the predictor variables.

• Regression interpretation: with multiple covariates in the model, we fit a model of $y$ with respect to all the covariates. A fitted linear regression model can be used to identify the relationship between a single predictor variable $x_{j}$ and the response variable $y$ when all other predictors are held fixed. Specifically, the interpretation of $\beta_{j}$ is the expected change in $y$ for a one-unit change in $x_{j}$, holding the other covariates fixed. This is sometimes called the unique effect of $x_{j}$ on $y$.

It is possible for the unique effect to be nearly zero even when the marginal effect is very large; this may imply that some other covariate captures all the information in $x_{j}$, so that once that variable is in the model, $x_{j}$ contributes nothing to the variation in $y$. Conversely, the unique effect of $x_{j}$ can be large while its marginal effect is nearly zero. This can happen when the other covariates explain a great deal of the variation in $y$, but mainly in a way that is complementary to what is captured by $x_{j}$. Under these conditions, including the other variables in the model reduces the part of the variability in $y$ that is unrelated to $x_{j}$, thereby strengthening the apparent relationship with $x_{j}$.

The term ‘held fixed’ may depend on how the values of the predictor variables arise. When we set the values of the predictors according to a study design, the comparisons of interest may literally correspond to comparisons among units whose predictor variables have been ‘held fixed’ by us.
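The distinction between a predictor's marginal effect and its unique effect can be demonstrated numerically. In this sketch (the data-generating model is an illustrative assumption), x2 carries essentially all of the information in x1, so x1 has a large marginal effect but a near-zero unique effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Two highly correlated predictors: x2 contains the information in x1.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
y = 3.0 * x2 + rng.normal(scale=0.1, size=n)   # y truly depends on x2 only

# Marginal effect of x1: simple regression of y on x1 alone.
X_marg = np.column_stack([np.ones(n), x1])
b_marg, *_ = np.linalg.lstsq(X_marg, y, rcond=None)

# Unique effect of x1: multiple regression with x2 also in the model.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print(b_marg[1])   # large marginal effect, close to 3
print(b_full[1])   # unique effect near 0: x2 captures x1's information
```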

Estimation in MLR: there are various methods to estimate the parameters in MLR and to make inference based on the fitted linear regression. Some commonly used estimation methods are introduced below:

• Least-squares estimation:
• OLS (ordinary least squares): the simplest and most commonly used estimator. OLS minimizes the sum of squared residuals, giving $\hat\beta=(X^{T}X)^{-1}X^{T}y=\left(\sum x_{i} x_{i}^{T}\right)^{-1}\left(\sum x_{i} y_{i}\right)$. This estimator is unbiased and consistent if the errors have finite variance and are uncorrelated with the regressors: $E[x_{i} \varepsilon_{i}]=0$.
• Partial Least Squares (PLS): instead of identifying hyperplanes of minimum variance between the response and independent variables, PLS looks for an optimal linear regression model (a bilinear factor model) by projecting the predicted variables and the observable variables to a new space, i.e., the $X$ and $Y$ data are projected to new spaces. Partial least squares discriminant analysis (PLS-DA) is a PLS variant used when $Y$ is categorical.
• PLS is used to find the fundamental relations between two matrices ($X$, design, and $Y$, response), where a latent variable approach is used to model the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the $X$ space that explains the maximum multidimensional variance direction in the $Y$ space. PLS regression is particularly suited when the matrix of predictors has more variables than observations, and when there is multi-collinearity among $X$ values, situations in which classical regression may be inappropriate.

The PLS model of multivariate PLS is $$X = T P^{\top} + E$$ $$Y = U Q^{\top} + F,$$ where $X$ is an $n \times m$ matrix of predictors, $Y$ is an $n \times p$ matrix of responses; $T$ and $U$ are $n \times l$ matrices that are, respectively, projections of $X$ (the X score, component or factor matrix) and projections of $Y$ (the Y scores); $P$ and $Q$ are, respectively, $m \times l$ and $p \times l$ orthogonal loading matrices; and matrices $E$ and $F$ are the error terms, assumed to be independent and identically distributed random normal variables. The decompositions of $X$ and $Y$ are made so as to maximize the covariance between $T$ and $U$.

There are various PLS algorithms for estimating the factor and loading matrices $T, U, P$ and $Q$. Most of them construct estimates of the linear regression between $X$ and $Y$ as $Y = X \tilde{B} + \tilde{B}_0$. Specialized PLS algorithms may be used in the case where $Y$ is a column vector. One example of a computationally tractable algorithm for the vector $Y$ case is PLS1. It estimates $T$ as an orthonormal matrix. Pseudocode of PLS1 is expressed below (capital and lower case letters indicate matrices and vectors):

function PLS1($X, y, l$)
  $X^{(0)} \gets X$
  $w^{(0)} \gets X^T y/||X^Ty||$, an initial estimate of $w$
  $t^{(0)} \gets X w^{(0)}$
  for $k$ = 0 to $l-1$
    $t_k \gets {t^{(k)}}^T t^{(k)}$ (note this is a scalar)
    $t^{(k)} \gets t^{(k)} / t_k$
    $p^{(k)} \gets {X^{(k)}}^T t^{(k)}$
    $q_k \gets {y}^T t^{(k)}$ (note this is a scalar)
    if $q_k$ = 0
      $l \gets k$, break the for loop
    if $k < l-1$
      $X^{(k+1)} \gets X^{(k)} - t_k t^{(k)} {p^{(k)}}^T$
      $w^{(k+1)} \gets {X^{(k+1)}}^T y$
      $t^{(k+1)} \gets X^{(k+1)}w^{(k+1)}$
  end for
  define $W$ to be the matrix with columns $w^{(0)},w^{(1)},\dots,w^{(l-1)}$; form the matrix $P$ and the vector $q$ analogously
  $B \gets W {(P^T W)}^{-1} q$
  $B_0 \gets q_0 - {P^{(0)}}^T B$
  return $B, B_0$


This form of the PLS1 algorithm does not require centering of the input $X$ and $Y$, as this is performed implicitly by the algorithm. This algorithm features 'deflation' of the matrix $X$ (subtraction of $t_k t^{(k)} {p^{(k)}}^T$), but deflation of the vector $y$ is not performed, as it is not necessary (it can be proved that deflating $y$ yields the same results as not deflating.) The user-supplied variable $l$ is the limit on the number of latent factors in the regression; if it equals the rank of the matrix $X$, the algorithm will yield the least squares regression estimates for $B$ and $B_0$.
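A direct transcription of the PLS1 pseudocode above into Python/numpy may look as follows (a sketch, not a tuned implementation; variable names mirror the pseudocode):

```python
import numpy as np

def pls1(X, y, l):
    """PLS1 for a single response vector y, following the pseudocode above.

    Returns (B, B0) such that y is approximated by X @ B + B0."""
    n, m = X.shape
    W = np.zeros((m, l))
    P = np.zeros((m, l))
    q = np.zeros(l)
    Xk = X.copy()                    # X^(0)
    w = X.T @ y
    w = w / np.linalg.norm(w)        # w^(0), the initial estimate of w
    t = X @ w                        # t^(0)
    for k in range(l):
        tk = t @ t                   # scalar t_k
        t = t / tk
        W[:, k] = w
        P[:, k] = Xk.T @ t           # p^(k)
        q[k] = y @ t                 # q_k (scalar)
        if q[k] == 0:
            l = k
            break
        if k < l - 1:
            Xk = Xk - tk * np.outer(t, P[:, k])   # deflate X
            w = Xk.T @ y             # w^(k+1), not re-normalized
            t = Xk @ w               # t^(k+1)
    W, P, q = W[:, :l], P[:, :l], q[:l]
    B = W @ np.linalg.solve(P.T @ W, q)
    B0 = q[0] - P[:, 0] @ B
    return B, B0
```

As stated above, when $l$ equals the rank of $X$ the returned $B$ and $B_0$ match the least-squares regression estimates.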

• GLS (generalized least squares): an extension of ordinary least squares that allows efficient estimation of $\beta$ when heteroscedasticity, or correlations, or both are present among the errors, as long as the form of the error covariance is known independently of the data. To handle these situations, GLS minimizes a weighted analogue of the sum of squared residuals from OLS regression, where the weight for the $i^{th}$ case is inversely proportional to $var(\varepsilon_{i})$: $\hat\beta=(X^{T} \Omega^{-1} X)^{-1} X^{T} \Omega^{-1} y$, where $\Omega$ is the covariance matrix of the errors.
• Iteratively reweighted least squares (IRLS): used when heteroscedasticity, or correlations, or both are present among the error terms, but little is known about the covariance structure of the errors independently of the data. In the first iteration, OLS or GLS with a provisional covariance structure is carried out and residuals are obtained; an improved estimate of the covariance structure of the errors is then computed; a subsequent GLS iteration is then performed using this estimate of the error structure to define the weights.
• Instrumental variables regression (IV): performed when the regressors are correlated with the errors: $\hat\beta=(X^{T} Z(Z^{T}Z)^{-1} Z^{T} X)^{-1} X^{T} Z(Z^{T}Z)^{-1} Z^{T} y$, where the $z_{j}$ are instrumental variables that satisfy $E[z_{j} \epsilon_{j}]=0$ and $Z$ is the matrix of the instruments.
• Optimal instruments regression: an extension of classical IV regression to the situation where $E[\epsilon_{j} \mid z_{j}]=0$.
• Total least squares (TLS): an approach to least squares estimation of the linear regression model, which treats the covariates and response variable in a more geometrically symmetric manner than OLS.
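The GLS estimator above can be sketched with a known diagonal error covariance (the heteroscedastic data-generating process below is an illustrative assumption):

```python
import numpy as np

# GLS sketch: errors are heteroscedastic with a known diagonal covariance
# Omega; weighting by Omega^{-1} downweights the noisy observations.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
sigma = np.where(x > 0, 3.0, 0.3)          # error sd depends on x
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=sigma)

Omega_inv = np.diag(1.0 / sigma**2)

# beta_hat = (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
# Both estimators are unbiased here, but GLS has smaller variance
# under heteroscedasticity because it exploits the known weights.
```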

Maximum-likelihood estimation and related techniques:

• Maximum likelihood estimation: performed when the distribution of the error terms is known to belong to a certain parametric family $f_{\theta}$ of probability distributions. When the error distribution is normal with zero mean and variance $\theta$, the resulting estimate is identical to OLS. GLS estimates are maximum likelihood estimates when $\varepsilon$ follows a multivariate normal distribution with a known covariance matrix.
• Ridge regression: deliberately introduces bias into the estimation of the parameters in order to reduce the variability of the estimates. The resulting estimates have lower mean squared error than the OLS estimates, particularly when multicollinearity is present.
• Least absolute deviation (LAD): a robust estimation technique, which is less sensitive to the presence of outliers than OLS but less efficient than OLS when no outliers are present. It is equivalent to maximum likelihood estimation under a Laplace distribution model for $\epsilon$.
• Adaptive estimation: applied when the error terms are assumed to be independent of the regressors, $\epsilon_{i} \perp x_{i}$; the optimal estimator is then the two-step MLE, where the first step non-parametrically estimates the distribution of the error term.
• Other estimation methods include Bayesian linear regression, quantile regression and mixed models.
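The bias-variance trade-off behind ridge regression can be illustrated with nearly collinear predictors (the penalty value lam and the simulated data are illustrative assumptions). Ridge adds $\lambda I$ to $X^{T}X$ before solving, which stabilizes the estimate under multicollinearity:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)    # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=n)

# Ridge estimate: beta_hat = (X^T X + lam I)^{-1} X^T y
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
# The penalty splits the shared signal between the two collinear
# columns, keeping both coefficients near [1, 1] instead of letting
# them grow large with opposite signs as an unpenalized fit can.
```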

Extensions of MLR:

• General linear models: consider the situation where the response variable $Y$ is not a scalar but a vector. Conditional linearity $E(y \mid x)=Bx$ is still assumed, with a matrix $B$ replacing the vector $\beta$ of the classical linear regression model. Multivariate analogues of OLS and GLS have been developed.
• Heteroscedastic models: allow for heteroscedasticity, i.e., the errors for different responses may have different variances. For example, weighted least squares is a method for estimating linear regression models when the responses have different error variances, possibly with correlated errors.
• GLM (generalized linear models): a framework for modeling a response variable $y$ that is bounded or discrete, for example when modeling positive quantities that vary over a large scale, or when modeling categorical or ordinal data.
• Hierarchical linear models (multi-level regression): organize the data into a hierarchy of regressions. They are often applied where the data have a natural hierarchical structure, for example when students are nested in classes, classes in schools, and schools in countries.
• Errors-in-variables models: extend the traditional linear regression model to allow the predictor variables $X$ to be observed with error. This error causes standard estimators of $\beta$ to become biased.

### Applications

This article presented SOCR Multiple Linear Regression, which allows one or more independent variables. In the linear model, the error is assumed to follow a standard normal distribution. The article demonstrated the use of the SOCR Analyses package for statistical computing, particularly MLR, and illustrated how to read the output results. The included example is based on a dataset from the statistical program R.

This article shows students how to predict median income from multiple explanatory variables using SOCR. It illustrates this prediction with specific steps, using age, proportion of homeowners and proportion of whites in the population as predictors of median income.

This article described multiple linear regression, a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome. Important steps in using this approach include estimation and inference, variable selection in model building, and assessing model fit. The special cases of regression with interactions among the variables, polynomial regression, regressions with categorical (grouping) variables, and separate slopes models are also covered. Examples in microbiology are used throughout.

This article proposed and discussed fitting techniques that are said to be resistant when the result is not greatly altered if a small fraction of the data is altered, and robust of efficiency when their statistical efficiency remains high under conditions more realistic than the utopian case of Gaussian distributions with errors of equal variance. These properties are particularly important in the formative stages of model building, when the form of the response is not known exactly.