SMHS SLR

From SOCR
Revision as of 17:23, 5 December 2025 by Dinov (talk | contribs)

Scientific Methods for Health Sciences - Correlation and Simple Linear Regression (SLR)

Overview

In scientific research, we often analyze the relationship between two or more variables to understand underlying processes. While univariate analysis describes a single variable, bivariate analysis explores the association between two variables—typically an independent variable (\(X\)) and a dependent variable (\(Y\)).

This module focuses on two fundamental techniques:

  • Correlation: Quantifies the strength and direction of the linear association between two variables.
  • Simple Linear Regression (SLR): Models the relationship mathematically, allowing us to predict \(Y\) based on \(X\) by fitting a straight line to the data.

Common applications include studying the association between final exam scores and class participation, or physiological traits such as body weight and lung capacity.

Correlation

Theory and Definition

The correlation coefficient (denoted \(\rho\) for the population and \(r\) for the sample) measures the strength and direction of the linear relationship between two continuous variables. It is bounded by\[-1 \le \rho \le 1\].

The linear relationship between \(X\) and \(Y\) is commonly summarized by five parameters: the means (\(\mu_X, \mu_Y\)), the standard deviations (\(\sigma_X, \sigma_Y\)), and the correlation coefficient \(\rho(X,Y)\).

Interpretation of \(\rho\):

  • \(\rho = 1\): Perfect positive linear correlation (all points lie exactly on an upward-sloping line).
  • \(\rho = -1\): Perfect negative linear correlation (all points lie exactly on a downward-sloping line).
  • \(\rho = 0\): No linear correlation (points form a random cloud; note: nonlinear relationships may still exist).
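The caveat in the last bullet can be illustrated with a minimal sketch. The data below are hypothetical (not from the text): \(y\) is an exact quadratic function of \(x\), yet the Pearson correlation is zero because the relationship is not linear.

```r
# Hypothetical data: x is symmetric about zero, y is an exact function of x
x <- -5:5
y <- x^2

# Pearson correlation only captures linear association,
# so it is zero here (up to floating-point error) despite perfect dependence
r <- cor(x, y)
print(r)
```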

Mathematical Definition (Population): The population correlation is the covariance normalized by the product of the standard deviations\[\rho(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{E[(X-\mu_{X})(Y-\mu_{Y})]}{\sigma_{X}\sigma_{Y}}\].

Equivalently\[\rho(X,Y) = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^{2})-E^{2}(X)}\sqrt{E(Y^{2})-E^{2}(Y)}}\].
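The equivalence of the two forms can be checked by expanding the expectation in the numerator and the variances in the denominator. A short sketch, using only linearity of expectation:

```latex
\begin{aligned}
\operatorname{cov}(X,Y) &= E[(X-\mu_X)(Y-\mu_Y)] \\
                        &= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y \\
                        &= E(XY) - E(X)E(Y), \\
\sigma_X^2 &= E[(X-\mu_X)^2] = E(X^2) - E^2(X), \qquad
\sigma_Y^2 = E(Y^2) - E^2(Y).
\end{aligned}
```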

Sample Correlation (Pearson’s \(r\))

In practice, we estimate \(\rho\) using a sample of paired observations \(\{(x_1, y_1), \dots, (x_n, y_n)\}\). The sample correlation replaces population moments with sample statistics\[r = \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_{i}-\bar{x}}{s_{x}} \right) \left( \frac{y_{i}-\bar{y}}{s_{y}} \right)\].

Computationally, this is often expressed as\[r = \frac{n \sum x_{i} y_{i}-\sum x_{i}\sum y_{i}} {\sqrt{n\sum x_{i}^{2} -(\sum x_{i})^{2}} \sqrt{ n\sum y_{i}^{2}-(\sum y_{i})^{2}}}\].
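As a quick check that the two sample formulas agree, the sketch below computes \(r\) three ways on a small set of hypothetical paired observations (the values are illustrative only, not from the text) and compares them with R's built-in `cor()`.

```r
# Hypothetical paired observations (illustrative values only)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.8, 5.1, 4.9, 7.2)
n <- length(x)

# (1) Sum of products of z-scores, divided by n - 1
r_z <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# (2) Computational form based on raw sums
r_sums <- (n * sum(x * y) - sum(x) * sum(y)) /
  (sqrt(n * sum(x^2) - sum(x)^2) * sqrt(n * sum(y^2) - sum(y)^2))

# (3) Built-in Pearson correlation
r_builtin <- cor(x, y)

print(c(z_scores = r_z, raw_sums = r_sums, builtin = r_builtin))
```

All three values should coincide up to floating-point error. The raw-sums form can lose precision when the values are large relative to their spread, which is one reason to prefer `cor()` in practice.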

Inference on Correlation

To assess whether an observed correlation reflects a true association in the population, we test\[H_0: \rho = 0 \quad \text{vs.} \quad H_a: \rho \ne 0\].

  • Test statistic\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\],

which, under \(H_0\) and assuming the data are approximately bivariate normal, follows a Student’s \(t\)-distribution with \(n - 2\) degrees of freedom.
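The test can be sketched by hand and compared with `cor.test()` from base R. The data are hypothetical and serve only to show that the manual statistic and the built-in one agree.

```r
# Hypothetical paired observations (illustrative values only)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.8, 5.1, 4.9, 7.2)
n <- length(x)
r <- cor(x, y)

# Test statistic and two-sided p-value from the t distribution with n - 2 df
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# Built-in equivalent (Pearson is the default method)
ct <- cor.test(x, y, method = "pearson")

print(c(t_manual = t_stat, t_builtin = unname(ct$statistic)))
print(c(p_manual = p_val, p_builtin = ct$p.value))
```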

Comparing Two Independent Correlations (Fisher’s Z-Transformation): Because the sampling distribution of \(r\) is skewed when \(\rho \ne 0\), we use Fisher’s transformation to compare correlations from two independent samples (\(r_1\) and \(r_2\))\[z' = \frac{1}{2} \ln \left( \frac{1+r}{1-r} \right)\].

The transformed value \(z'\) is approximately normally distributed with variance \(\frac{1}{n - 3}\).

To test \(H_0: \rho_1 = \rho_2\), compute\[Z = \frac{z'_1 - z'_2}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}}\].

Under \(H_0\), \(Z \sim N(0,1)\).
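The variance approximation \(\frac{1}{n-3}\) can be checked informally by simulation. The sketch below is an illustration under assumed settings (\(\rho = 0.6\), \(n = 30\), 5000 replicates, all chosen arbitrarily): it draws bivariate normal samples, applies Fisher's transformation via `atanh()`, and compares the empirical variance of \(z'\) with the theoretical value.

```r
set.seed(1)
rho  <- 0.6    # assumed population correlation (arbitrary choice)
n    <- 30     # assumed sample size (arbitrary choice)
reps <- 5000   # number of simulated samples

z_vals <- replicate(reps, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # corr(x, y) = rho by construction
  atanh(cor(x, y))                           # atanh(r) = 0.5 * log((1 + r) / (1 - r))
})

print(c(simulated_var = var(z_vals), theoretical_var = 1 / (n - 3)))
```

The two variances should be close but not identical, since \(\frac{1}{n-3}\) is itself an approximation.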

Simple Linear Regression (SLR)

Model Theory

Simple linear regression models the expected value of a dependent variable \(Y\) as a linear function of an independent variable \(X\)\[Y = \alpha + \beta X + \epsilon\],

where:

  • \(\alpha\) is the intercept (expected value of \(Y\) when \(X = 0\)),
  • \(\beta\) is the slope (expected change in \(Y\) per one-unit increase in \(X\)),
  • \(\epsilon\) is the random error term, assumed to have mean zero.

Least Squares Estimation

The "best-fit" line is obtained by the least squares method, which minimizes the sum of squared residuals\[SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \big(y_i - (a + b x_i)\big)^2\].

The sample estimates are\[b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \frac{s_y}{s_x}\], \(a = \bar{y} - b\bar{x}\).

Key Properties of the Least Squares Line:

  1. The line always passes through the centroid (\(\bar{x}, \bar{y}\)).
  2. The sum of the residuals is zero\[\sum (y_i - \hat{y}_i) = 0\].
  3. The estimators \(a\) and \(b\) are unbiased under standard assumptions.
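The closed-form estimates and the first two properties can be verified numerically. This is a minimal sketch on hypothetical data (illustrative values only), comparing the hand-computed \(a\) and \(b\) with `lm()`.

```r
# Hypothetical paired observations (illustrative values only)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.8, 5.1, 4.9, 7.2)

# Closed-form least squares estimates
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# Equivalent slope via r * s_y / s_x
b_alt <- cor(x, y) * sd(y) / sd(x)

# Compare with lm()
fit <- lm(y ~ x)
print(rbind(manual = c(a, b), lm = unname(coef(fit))))

# Property 1: the fitted line passes through the centroid
print(a + b * mean(x) - mean(y))   # zero up to rounding

# Property 2: the residuals sum to zero
print(sum(y - (a + b * x)))        # zero up to rounding
```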

Assumptions of SLR

For valid statistical inference (confidence intervals, hypothesis tests), the following assumptions should hold:

  • Linearity: The true relationship between \(X\) and \(Y\) is linear.
  • Independence: Observations are independent (e.g., no repeated measures).
  • Normality: The residuals are approximately normally distributed.
  • Homoscedasticity: The variance of residuals is constant across all values of \(X\).

Diagnostic plots (residuals vs. fitted, Q–Q plot) are used to assess these assumptions.
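A minimal sketch of these checks follows. The data are simulated under an assumed linear model with normal errors (the coefficients, sample size, and seed are arbitrary). The Shapiro-Wilk test is an optional numerical supplement to the Q–Q plot and is not part of the original text.

```r
set.seed(42)
x <- runif(60, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(60, sd = 1)   # assumed true model, for illustration only
fit <- lm(y ~ x)

res         <- residuals(fit)
fitted_vals <- fitted(fit)

op <- par(mfrow = c(1, 2))
# Linearity and homoscedasticity: look for no pattern and roughly constant spread
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)
# Normality: points should fall close to the reference line
qqnorm(res)
qqline(res)
par(op)

# Optional numerical check of residual normality
sw <- shapiro.test(res)
print(sw$p.value)
```

A small Shapiro-Wilk p-value would suggest departure from normality. With large samples the test flags even minor deviations, so the plots usually deserve more weight.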

Inference on the Slope

We commonly test whether \(X\) is a significant predictor of \(Y\)\[H_0: \beta = 0 \quad \text{vs.} \quad H_a: \beta \ne 0\].

  • Standard error of the slope\[SE_b = \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}}\],

where \(s_{y|x} = \sqrt{SSE/(n-2)}\) is the residual standard error.

  • Test statistic\[t = \frac{b}{SE_b}\], with \(n - 2\) degrees of freedom.
  • Confidence interval for \(\beta\)\[b \pm t^* \cdot SE_b\],

where \(t^*\) is the critical value from the \(t\)-distribution with \(n - 2\) degrees of freedom for the chosen confidence level.
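The formulas above can be traced step by step and compared with the output of `summary()` and `confint()`. The sketch uses hypothetical data (illustrative values only) and a 95% confidence level.

```r
# Hypothetical paired observations (illustrative values only)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 3.8, 5.1, 4.9, 7.2)
n <- length(x)

fit <- lm(y ~ x)
b   <- unname(coef(fit)["x"])

# Residual standard error and standard error of the slope
s_yx <- sqrt(sum(residuals(fit)^2) / (n - 2))
se_b <- s_yx / sqrt(sum((x - mean(x))^2))

# Test statistic, two-sided p-value, and 95% confidence interval
t_stat <- b / se_b
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)
t_crit <- qt(0.975, df = n - 2)
ci     <- c(lower = b - t_crit * se_b, upper = b + t_crit * se_b)

print(c(se = se_b, t = t_stat, p = p_val))
print(ci)

# Built-in equivalents for comparison
print(summary(fit)$coefficients["x", ])
print(confint(fit, "x", level = 0.95))
```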

Case Studies and R Implementation

Example 1: Body Fat and Age

Scenario: A study of 18 adults examining the relationship between age (\(X\)) and percent body fat (\(Y\)).

Data:

Age % Fat Age % Fat
23 9.5 53 34.7
23 27.9 53 42.0
27 7.8 54 29.1
27 17.8 56 32.5
39 31.4 57 30.3
41 25.9 58 33.0
45 27.4 58 33.8
49 25.2 60 41.1
50 31.1 61 34.5

R Analysis:

# Data entry
age <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
fat <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)

# Correlation
cor(age, fat)  # r ≈ 0.792

# Fit regression model
fit <- lm(fat ~ age)
summary(fit)

# Diagnostic plots
par(mfrow = c(2,2))
plot(fit)

Interpretation:

  • The sample correlation is \(r \approx 0.79\), indicating a strong positive linear relationship.
  • The estimated regression equation is\[\widehat{\text{Body Fat}} = 3.22 + 0.55 \times \text{Age}\].
  • The slope is statistically significant (\(p < 0.001\)), so age is a useful predictor of body fat.
  • The 95% confidence interval for the slope is approximately (0.32, 0.77).
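As a check on the quoted slope interval, the sketch below refits the model from the data listed above and extracts the coefficient estimates, the 95% confidence interval for the slope, and \(R^2\). It assumes only base R.

```r
# Data from Example 1
age <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
fat <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)

fit <- lm(fat ~ age)

# Estimated intercept and slope
print(coef(fit))

# 95% confidence interval for the slope
slope_ci <- confint(fit, "age", level = 0.95)
print(slope_ci)

# Proportion of variance in body fat explained by age
print(summary(fit)$r.squared)
```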

Example 2: Baseball Data (SLR and Prediction)

Scenario: Predicting the number of runs scored by a baseball team based on its batting average.

R Code:

batting <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,
             0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
runs <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)

cor(batting, runs)  # r ≈ 0.865
fit_bb <- lm(runs ~ batting)
summary(fit_bb)

Results:

  • Regression equation\[\widehat{\text{Runs}} = -706.2 + 5709.2 \times \text{Batting Avg}\].
  • \(R^2 = 0.749\), so about 75% of the variability in runs is explained by batting average.
  • Prediction for a team with batting average 0.280\[-706.2 + 5709.2 \times 0.280 \approx 892\] runs.
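The same prediction can be obtained with `predict()`. As an addition not in the original analysis, the sketch also requests a 95% prediction interval, which conveys how uncertain the forecast is for a single team.

```r
# Data from Example 2
batting <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,
             0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
runs <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)

fit_bb <- lm(runs ~ batting)

# New observation: the column name must match the predictor name in the model
new_team <- data.frame(batting = 0.280)

# Point prediction with a 95% prediction interval (columns: fit, lwr, upr)
pred <- predict(fit_bb, newdata = new_team, interval = "prediction", level = 0.95)
print(pred)
```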

Example 3: Comparing Correlations Across Groups

Scenario: Does the correlation between age and brain volume differ between two clinical groups?

  • Group 1\[n_1 = 27,\ r_1 = -0.753\]
  • Group 2\[n_2 = 27,\ r_2 = -0.495\]

Fisher’s Z-transformation\[z'_1 = \frac{1}{2} \ln\left(\frac{1 - 0.753}{1 + 0.753}\right) \approx -0.980\], \(z'_2 = \frac{1}{2} \ln\left(\frac{1 - 0.495}{1 + 0.495}\right) \approx -0.543\).

Test statistic\[Z = \frac{-0.980 + 0.543}{\sqrt{\frac{1}{24} + \frac{1}{24}}} = \frac{-0.437}{\sqrt{0.0833}} \approx -1.51\]

Two-sided \(p\)-value ≈ 0.13.

Conclusion: There is no statistically significant difference between the two correlations (\(\alpha = 0.05\)).
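The calculation can be reproduced in a few lines of R. This is a minimal sketch using the summary statistics given above. `atanh()` is the base R function for Fisher's transformation, and the variable names are arbitrary.

```r
# Summary statistics from Example 3
r1 <- -0.753; n1 <- 27
r2 <- -0.495; n2 <- 27

# Fisher's z' for each group: atanh(r) = 0.5 * log((1 + r) / (1 - r))
z1 <- atanh(r1)
z2 <- atanh(r2)

# Test statistic and two-sided p-value under the standard normal distribution
Z       <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
p_value <- 2 * pnorm(-abs(Z))

print(round(c(z1 = z1, z2 = z2, Z = Z, p = p_value), 3))
```

The result should show \(Z\) close to \(-1.5\) and a two-sided \(p\)-value of roughly 0.13, in line with the conclusion above.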

Review Questions

1. Correlation and Causation A positive correlation between variables \(X\) and \(Y\) implies that increasing \(X\) causes \(Y\) to increase.

  • (a) Always true
  • (b) Sometimes true
  • (c) Never true

Answer: (c) Never true. Correlation measures association, not causation. Confounding, reverse causality, or coincidence may explain the observed relationship.

2. Visualizing Correlation If the correlation between working out and body fat is exactly \(-1.0\), which statement is FALSE?

  • (a) Points lie on a perfect straight line.
  • (b) 100% of the variance is explained.
  • (c) The slope of the best-fit line is \(-1.0\).
  • (d) The best-fit line has a negative slope.

Answer: (c). While \(r = -1\) implies a perfect negative linear relationship, the numerical value of the slope depends on the scales of \(X\) and \(Y\). The slope is not necessarily \(-1\).

3. Least Squares Principle Which statement best describes the principle of "least squares"?

  • (a) Minimizes the sum of residuals.
  • (b) Minimizes the sum of squared residuals.
  • (c) Minimizes the distance between actual and predicted values.

Answer: (b). Least squares minimizes \(\sum (y_i - \hat{y}_i)^2\).

4. Prediction and Residuals Given the model\[\text{Fat} = 6.83 + 0.97 \times \text{Protein}\]. A burger has 20g protein and actually contains 20g fat.

  • Predicted fat = \(6.83 + 0.97 \times 20 = 26.23\)g
  • Residual = \(20 - 26.23 = -6.23\)g

Interpretation: The model overestimates fat content by 6.23g for this burger.
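The same arithmetic can be written as a short R sketch. The helper name `predict_fat` is hypothetical and simply encodes the model equation given in the question.

```r
# Hypothetical helper encoding the fitted model: Fat = 6.83 + 0.97 * Protein
predict_fat <- function(protein) 6.83 + 0.97 * protein

observed_fat  <- 20                  # grams of fat actually in the burger
predicted_fat <- predict_fat(20)     # predicted fat for 20 g protein

# Residual = observed - predicted; a negative value means the model overestimates
residual <- observed_fat - predicted_fat

print(c(predicted = predicted_fat, residual = residual))
```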

