Difference between revisions of "SMHS SLR"
(Created page with '== Scientific Methods for Health Sciences - Parametric Inference == <hr> * SOCR Home page: http://www.socr.umich.edu {{translate|pageName=http://wiki.socr.umich.ed…') |
m (→Inference on Correlation) |
||
| (74 intermediate revisions by 4 users not shown) | |||
| Line 1: | Line 1: | ||
| − | ==[[SMHS| Scientific Methods for Health Sciences]] - | + | == [[SMHS|Scientific Methods for Health Sciences]] - Correlation and Simple Linear Regression (SLR) == |
| + | === Overview === | ||
| + | In scientific research, we often analyze the relationship between two or more variables to understand underlying processes. While univariate analysis describes a single variable, bivariate analysis explores the association between two variables—typically an independent variable (<math>X</math>) and a dependent variable (<math>Y</math>). | ||
| + | This module focuses on two fundamental techniques: | ||
| + | * Correlation: Quantifies the strength and direction of the linear association between two variables. | ||
| + | * Simple Linear Regression (SLR): Models the relationship mathematically, allowing us to predict <math>Y</math> based on <math>X</math> by fitting a straight line to the data. | ||
| + | Common applications include studying the association between final exam scores and class participation, or physiological traits such as body weight and lung capacity. | ||
| + | |||
| + | === Correlation === | ||
| + | ==== Theory and Definition ==== | ||
| + | The correlation coefficient (denoted <math>\rho</math> for the population and <math>r</math> for the sample) measures the strength and direction of the linear relationship between two continuous variables. It is bounded by: | ||
| + | <math>-1 \le \rho \le 1</math>. | ||
| + | |||
| + | The relationship is summarized by the means (<math>\mu_X, \mu_Y</math>), standard deviations (<math>\sigma_X, \sigma_Y</math>), and the correlation coefficient <math>\rho(X,Y)</math>. | ||
| + | |||
| + | Interpretation of <math>\rho</math>: | ||
| + | * <math>\rho = 1</math>: Perfect positive linear correlation (all points lie exactly on an upward-sloping line). | ||
| + | * <math>\rho = -1</math>: Perfect negative linear correlation (all points lie exactly on a downward-sloping line). | ||
| + | * <math>\rho = 0</math>: No linear correlation (points form a random cloud; note: nonlinear relationships may still exist). | ||
| + | |||
| + | Mathematical Definition (Population): | ||
| + | The population correlation is the covariance normalized by the product of the standard deviations: | ||
| + | <math>\rho(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{E[(X-\mu_{X})(Y-\mu_{Y})]}{\sigma_{X}\sigma_{Y}}</math>. | ||
| + | |||
| + | Equivalently: | ||
| + | <math>\rho(X,Y) = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^{2})-E^{2}(X)}\sqrt{E(Y^{2})-E^{2}(Y)}}</math>. | ||
| + | |||
| + | ==== Sample Correlation (Pearson’s <math>r</math>) ==== | ||
| + | In practice, we estimate <math>\rho</math> using a sample of paired observations <math>\{(x_1, y_1), \dots, (x_n, y_n)\}</math>. The sample correlation replaces population moments with sample statistics: | ||
| + | |||
| + | <math>r = \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_{i}-\bar{x}}{s_{x}} \right) \left( \frac{y_{i}-\bar{y}}{s_{y}} \right)</math>. | ||
| + | |||
| + | Computationally, this is often expressed as: | ||
| + | <math>r = \frac{n \sum x_{i} y_{i}-\sum x_{i}\sum y_{i}} {\sqrt{n\sum x_{i}^{2} -(\sum x_{i})^{2}} \sqrt{ n\sum y_{i}^{2}-(\sum y_{i})^{2}}}</math>. | ||
| + | |||
| + | ==== Inference on Correlation ==== | ||
| + | To assess whether an observed correlation reflects a true association in the population, we test: | ||
| + | <math>H_0: \rho = 0 \quad \text{vs.} \quad H_a: \rho \ne 0.</math> | ||
| + | |||
| + | * Test statistic: | ||
| + | <math>t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},</math> | ||
| + | which follows a Student’s <math>t</math>-distribution with <math>n - 2</math> degrees of freedom. | ||
| + | |||
| + | Comparing Two Independent Correlations (Fisher’s Z-Transformation): | ||
| + | Because the sampling distribution of <math>r</math> is skewed when <math>\rho \ne 0</math>, we use Fisher’s transformation to compare correlations from two independent samples (<math>r_1</math> and <math>r_2</math>): | ||
| + | |||
| + | <math>z' = \frac{1}{2} \ln \left( \frac{1+r}{1-r} \right)\equiv \operatorname{atanh}(r).</math> | ||
| + | |||
| + | The <math>\operatorname{atanh}()</math> function (the ''inverse hyperbolic tangent'') addresses the fact that | ||
| + | raw correlation coefficients are poorly suited to standard testing. Because <math>-1\leq r\leq 1</math>, | ||
| + | the distribution of <math>r</math> becomes heavily skewed as the underlying correlation approaches either boundary. | ||
| + | In addition, the standard error of <math>r</math> depends on the underlying correlation itself, which violates the assumptions of many statistical tests, e.g., the Z-test or t-test. | ||
| + | |||
| + | The transformation maps the open interval <math>(-1, 1)</math> onto | ||
| + | <math>(-\infty, \infty)</math>. It stretches correlation values near the boundaries, | ||
| + | makes the skewed distribution of <math>r</math> approximately Normal (Gaussian), | ||
| + | and stabilizes the variance: the transformed value <math>z'</math> is approximately normally distributed | ||
| + | with variance <math>\frac{1}{n-3}</math>, regardless of the value of <math>\rho</math>. | ||
| + | |||
| + | To test <math>H_0: \rho_1 = \rho_2</math>, compute: | ||
| + | <math>Z = \frac{z'_1 - z'_2}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}}.</math> | ||
| + | |||
| + | Under <math>H_0</math>, <math>Z \sim N(0,1).</math> | ||
| + | |||
| + | === Simple Linear Regression (SLR) === | ||
| + | ==== Model Theory ==== | ||
| + | Simple linear regression models the expected value of a dependent variable <math>Y</math> as a linear function of an independent variable <math>X</math>: | ||
| + | |||
| + | <math>Y = \alpha + \beta X + \epsilon</math>, | ||
| + | |||
| + | where: | ||
| + | * <math>\alpha</math> is the intercept (expected value of <math>Y</math> when <math>X = 0</math>), | ||
| + | * <math>\beta</math> is the slope (change in <math>Y</math> per one-unit increase in <math>X</math>), | ||
| + | * <math>\epsilon</math> is the random error term, assumed to have mean zero. | ||
| + | |||
| + | ==== Least Squares Estimation ==== | ||
| + | The "best-fit" line is obtained by the least squares method, which minimizes the sum of squared residuals: | ||
| + | |||
| + | <math>SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \big(y_i - (a + b x_i)\big)^2</math>. | ||
| + | |||
| + | The sample estimates are: | ||
| + | <math>b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \frac{s_y}{s_x}</math>, | ||
| + | <math>a = \bar{y} - b\bar{x}</math>. | ||
| + | |||
| + | Key Properties of the Least Squares Line: | ||
| + | # The line always passes through the centroid (<math>\bar{x}, \bar{y}</math>). | ||
| + | # The sum of the residuals is zero: <math>\sum (y_i - \hat{y}_i) = 0</math>. | ||
| + | # The estimators <math>a</math> and <math>b</math> are unbiased under standard assumptions. | ||
| + | |||
| + | ==== Assumptions of SLR ==== | ||
| + | For valid statistical inference (confidence intervals, hypothesis tests), the following assumptions should hold: | ||
| + | * Linearity: The true relationship between <math>X</math> and <math>Y</math> is linear. | ||
| + | * Independence: Observations are independent (e.g., no repeated measures). | ||
| + | * Normality: The residuals are approximately normally distributed. | ||
| + | * Homoscedasticity: The variance of residuals is constant across all values of <math>X</math>. | ||
| + | |||
| + | Diagnostic plots (residuals vs. fitted, Q–Q plot) are used to assess these assumptions. | ||
| + | |||
| + | ==== Inference on the Slope ==== | ||
| + | We commonly test whether <math>X</math> is a significant predictor of <math>Y</math>: | ||
| + | <math>H_0: \beta = 0 \quad \text{vs.} \quad H_a: \beta \ne 0</math>. | ||
| + | |||
| + | * Standard error of the slope: | ||
| + | <math>SE_b = \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}}</math>, | ||
| + | where <math>s_{y|x}</math> is the residual standard error. | ||
| + | |||
| + | * Test statistic: | ||
| + | <math>t = \frac{b}{SE_b}</math>, with <math>n - 2</math> degrees of freedom. | ||
| + | |||
| + | * Confidence interval for <math>\beta</math>: | ||
| + | <math>b \pm t^* \cdot SE_b</math>, | ||
| + | where <math>t^*</math> is the critical value from the <math>t</math>-distribution. | ||
| + | |||
| + | === Case Studies and R Implementation === | ||
| + | ==== Example 1: Body Fat and Age ==== | ||
| + | Scenario: A study of 18 adults examining the relationship between age (<math>X</math>) and percent body fat (<math>Y</math>). | ||
| + | |||
| + | Data: | ||
| + | {| class="wikitable" style="text-align:center;" | ||
| + | |- | ||
| + | ! Age !! % Fat !! Age !! % Fat | ||
| + | |- | ||
| + | | 23 || 9.5 || 53 || 34.7 | ||
| + | |- | ||
| + | | 23 || 27.9 || 53 || 42.0 | ||
| + | |- | ||
| + | | 27 || 7.8 || 54 || 29.1 | ||
| + | |- | ||
| + | | 27 || 17.8 || 56 || 32.5 | ||
| + | |- | ||
| + | | 39 || 31.4 || 57 || 30.3 | ||
| + | |- | ||
| + | | 41 || 25.9 || 58 || 33.0 | ||
| + | |- | ||
| + | | 45 || 27.4 || 58 || 33.8 | ||
| + | |- | ||
| + | | 49 || 25.2 || 60 || 41.1 | ||
| + | |- | ||
| + | | 50 || 31.1 || 61 || 34.5 | ||
| + | |} | ||
| + | |||
| + | R Analysis: | ||
| + | <pre> | ||
| + | # Data entry | ||
| + | age <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61) | ||
| + | fat <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5) | ||
| + | |||
| + | # Correlation | ||
| + | cor(age, fat) # r ≈ 0.792 | ||
| + | |||
| + | # Fit regression model | ||
| + | fit <- lm(fat ~ age) | ||
| + | summary(fit) | ||
| + | |||
| + | # Diagnostic plots | ||
| + | par(mfrow = c(2,2)) | ||
| + | plot(fit) | ||
| + | </pre> | ||
| + | |||
| + | Interpretation: | ||
| + | * The sample correlation is <math>r \approx 0.79</math>, indicating a strong positive linear relationship. | ||
| + | * The estimated regression equation is: | ||
| + | <math>\widehat{\text{Body Fat}} = 3.22 + 0.55 \times \text{Age}</math>. | ||
| + | * The slope is statistically significant (<math>p < 0.001</math>), so age is a useful predictor of body fat. | ||
| + | * The 95% confidence interval for the slope is approximately (0.32, 0.77). | ||
| + | |||
| + | ==== Example 2: Baseball Data (SLR and Prediction) ==== | ||
| + | Scenario: Predicting the number of runs scored by a baseball team based on its batting average. | ||
| + | |||
| + | R Code: | ||
| + | <pre> | ||
| + | batting <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257, | ||
| + | 0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251) | ||
| + | runs <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708) | ||
| + | |||
| + | cor(batting, runs) # r ≈ 0.866 | ||
| + | fit_bb <- lm(runs ~ batting) | ||
| + | summary(fit_bb) | ||
| + | </pre> | ||
| + | |||
| + | Results: | ||
| + | * Regression equation: | ||
| + | <math>\widehat{\text{Runs}} = -706.2 + 5709.2 \times \text{Batting Avg}</math>. | ||
| + | * <math>R^2 = 0.749</math>, so about 75% of the variability in runs is explained by batting average. | ||
| + | * Prediction for a team with batting average 0.280: | ||
| + | <math>-706.2 + 5709.2 \times 0.280 \approx 892</math> runs. | ||
| + | |||
| + | ==== Example 3: Comparing Correlations Across Groups ==== | ||
| + | Scenario: Does the correlation between age and brain volume differ between two clinical groups? | ||
| + | |||
| + | * Group 1: <math>n_1 = 27,\ r_1 = -0.753</math> | ||
| + | * Group 2: <math>n_2 = 27,\ r_2 = -0.495</math> | ||
| + | |||
| + | Fisher’s Z-transformation: | ||
| + | <math>z'_1 = \frac{1}{2} \ln\left(\frac{1 - 0.753}{1 + 0.753}\right) \approx -0.980</math> | ||
| + | <math>z'_2 = \frac{1}{2} \ln\left(\frac{1 - 0.495}{1 + 0.495}\right) \approx -0.543</math> | ||
| + | |||
| + | Test statistic: | ||
| + | <math>Z = \frac{-0.980 + 0.543}{\sqrt{\frac{1}{24} + \frac{1}{24}}} = \frac{-0.437}{\sqrt{0.0833}} \approx -1.51</math> | ||
| + | |||
| + | Two-sided <math>p</math>-value ≈ 0.130. | ||
| + | |||
| + | Conclusion: There is no statistically significant difference between the two correlations (<math>\alpha = 0.05</math>). | ||
| + | |||
| + | === Review Questions === | ||
| + | 1. Correlation and Causation | ||
| + | A positive correlation between variables <math>X</math> and <math>Y</math> implies that increasing <math>X</math> causes <math>Y</math> to increase. | ||
| + | * (a) Always true | ||
| + | * (b) Sometimes true | ||
| + | * (c) Never true | ||
| + | |||
| + | Answer: (c) Never true. Correlation alone measures association and does not establish causation, even when a causal link happens to exist. Confounding, reverse causality, or coincidence may explain the observed relationship. | ||
| + | |||
| + | 2. Visualizing Correlation | ||
| + | If the correlation between working out and body fat is exactly <math>-1.0</math>, which statement is FALSE? | ||
| + | * (a) Points lie on a perfect straight line. | ||
| + | * (b) 100% of the variance is explained. | ||
| + | * (c) The slope of the best-fit line is <math>-1.0</math>. | ||
| + | * (d) The best-fit line has a negative slope. | ||
| + | |||
| + | Answer: (c). While <math>r = -1</math> implies a perfect negative linear relationship, the numerical value of the slope depends on the scales of <math>X</math> and <math>Y</math>. The slope is not necessarily <math>-1</math>. | ||
| + | |||
| + | 3. Least Squares Principle | ||
| + | Which statement best describes the principle of "least squares"? | ||
| + | * (a) Minimizes the sum of residuals. | ||
| + | * (b) Minimizes the sum of squared residuals. | ||
| + | * (c) Minimizes the distance between actual and predicted values. | ||
| + | |||
| + | Answer: (b). Least squares minimizes <math>\sum (y_i - \hat{y}_i)^2</math>. | ||
| + | |||
| + | 4. Prediction and Residuals | ||
| + | Given the model: <math>\text{Fat} = 6.83 + 0.97 \times \text{Protein}</math>. | ||
| + | A burger has 20g protein and actually contains 20g fat. | ||
| + | * Predicted fat = <math>6.83 + 0.97 \times 20 = 26.23</math>g | ||
| + | * Residual = <math>20 - 26.23 = -6.23</math>g | ||
| + | |||
| + | Interpretation: The model overestimates fat content by 6.23g for this burger. | ||
| + | |||
| + | |||
| + | ===References=== | ||
| + | |||
| + | * [https://sda.statisticalcomputing.org/learning SDA App, see the Learning Modules] | ||
| + | * [[Probability_and_statistics_EBook#Chapter_X:_Correlation_and_Regression | SOCR Probability and Statistics EBook, Correlation and Regression Chapter]] | ||
| + | * Altman, D. G. (1991). ''Practical Statistics for Medical Research''. | ||
| + | * Dunn, G. (1989). ''Design and Analysis of Reliability Studies''. | ||
<hr> | <hr> | ||
| − | * SOCR Home page: | + | * SOCR Home page: https://socr.umich.edu |
| − | {{translate|pageName= | + | {{translate|pageName=https://wiki.socr.umich.edu/index.php?title=SMHS_SLR}} |
Latest revision as of 21:29, 21 February 2026
Scientific Methods for Health Sciences - Correlation and Simple Linear Regression (SLR)
Overview
In scientific research, we often analyze the relationship between two or more variables to understand underlying processes. While univariate analysis describes a single variable, bivariate analysis explores the association between two variables—typically an independent variable (\(X\)) and a dependent variable (\(Y\)).
This module focuses on two fundamental techniques:
- Correlation: Quantifies the strength and direction of the linear association between two variables.
- Simple Linear Regression (SLR): Models the relationship mathematically, allowing us to predict \(Y\) based on \(X\) by fitting a straight line to the data.
Common applications include studying the association between final exam scores and class participation, or physiological traits such as body weight and lung capacity.
Correlation
Theory and Definition
The correlation coefficient (denoted \(\rho\) for the population and \(r\) for the sample) measures the strength and direction of the linear relationship between two continuous variables. It is bounded by \(-1 \le \rho \le 1\).
The relationship is summarized by the means (\(\mu_X, \mu_Y\)), standard deviations (\(\sigma_X, \sigma_Y\)), and the correlation coefficient \(\rho(X,Y)\).
Interpretation of \(\rho\):
- \(\rho = 1\): Perfect positive linear correlation (all points lie exactly on an upward-sloping line).
- \(\rho = -1\): Perfect negative linear correlation (all points lie exactly on a downward-sloping line).
- \(\rho = 0\): No linear correlation (points form a random cloud; note: nonlinear relationships may still exist).
Mathematical Definition (Population): The population correlation is the covariance normalized by the product of the standard deviations: \[\rho(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{E[(X-\mu_{X})(Y-\mu_{Y})]}{\sigma_{X}\sigma_{Y}}.\]
Equivalently: \[\rho(X,Y) = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^{2})-E^{2}(X)}\sqrt{E(Y^{2})-E^{2}(Y)}}.\]
Sample Correlation (Pearson’s \(r\))
In practice, we estimate \(\rho\) using a sample of paired observations \(\{(x_1, y_1), \dots, (x_n, y_n)\}\). The sample correlation replaces population moments with sample statistics: \[r = \frac{1}{n-1} \sum_{i=1}^n \left( \frac{x_{i}-\bar{x}}{s_{x}} \right) \left( \frac{y_{i}-\bar{y}}{s_{y}} \right).\]
Computationally, this is often expressed as: \[r = \frac{n \sum x_{i} y_{i}-\sum x_{i}\sum y_{i}} {\sqrt{n\sum x_{i}^{2} -(\sum x_{i})^{2}} \sqrt{ n\sum y_{i}^{2}-(\sum y_{i})^{2}}}.\]
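As a quick check that the two expressions for \(r\) agree, the following minimal R sketch evaluates both on a small set of hypothetical paired observations (the vectors `x` and `y` are made up for illustration and do not come from this module) and compares them with R's built-in `cor()`.

```r
# Hypothetical paired observations, used only for illustration
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.7)
y <- c(2.0, 2.9, 4.2, 4.9, 6.1, 6.8)
n <- length(x)

# Standardized-score form: sum of products of z-scores, divided by n - 1
r_z <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Computational form based on raw sums
r_sums <- (n * sum(x * y) - sum(x) * sum(y)) /
  (sqrt(n * sum(x^2) - sum(x)^2) * sqrt(n * sum(y^2) - sum(y)^2))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

print(c(z_form = r_z, sum_form = r_sums, builtin = r_builtin))
```

All three values should coincide up to floating-point error, because `sd()` uses the same \(n-1\) divisor as the sample formula.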
Inference on Correlation
To assess whether an observed correlation reflects a true association in the population, we test: \[H_0: \rho = 0 \quad \text{vs.} \quad H_a: \rho \ne 0.\]
- Test statistic: \[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},\]
which, under \(H_0\), follows a Student’s \(t\)-distribution with \(n - 2\) degrees of freedom.
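The test can be sketched in R as follows. The data are hypothetical and serve only to show that the hand-computed statistic matches what `cor.test()` reports for a two-sided Pearson test.

```r
# Hypothetical sample, used only for illustration
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.7, 7.3, 8.9)
y <- c(2.0, 2.9, 4.2, 3.1, 6.1, 5.0, 6.8, 6.2)
n <- length(x)

r <- cor(x, y)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Built-in test (Pearson, two-sided by default)
ct <- cor.test(x, y)

print(c(t_manual = t_stat, t_builtin = unname(ct$statistic)))
print(c(p_manual = p_value, p_builtin = ct$p.value))
```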
Comparing Two Independent Correlations (Fisher’s Z-Transformation): Because the sampling distribution of \(r\) is skewed when \(\rho \ne 0\), we use Fisher’s transformation to compare correlations from two independent samples (\(r_1\) and \(r_2\)): \[z' = \frac{1}{2} \ln \left( \frac{1+r}{1-r} \right)\equiv \operatorname{atanh}(r).\]
The \(\operatorname{atanh}()\) function (the inverse hyperbolic tangent) addresses the fact that raw correlation coefficients are poorly suited to standard testing. Because \(-1\leq r\leq 1\), the distribution of \(r\) becomes heavily skewed as the underlying correlation approaches either boundary. In addition, the standard error of \(r\) depends on the underlying correlation itself, which violates the assumptions of many statistical tests, e.g., the Z-test or t-test.
The transformation maps the open interval \((-1, 1)\) onto \((-\infty, \infty)\). It stretches correlation values near the boundaries, makes the skewed distribution of \(r\) approximately Normal (Gaussian), and stabilizes the variance: the transformed value \(z'\) is approximately normally distributed with variance \(\frac{1}{n - 3}\), regardless of the value of \(\rho\).
To test \(H_0: \rho_1 = \rho_2\), compute: \[Z = \frac{z'_1 - z'_2}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}}.\]
Under \(H_0\), \(Z \sim N(0,1).\)
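The variance-stabilizing property can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true correlation `rho`, the sample size `n`, the number of replicates `reps`, and the seed are arbitrary illustrative choices, and the data are drawn from a bivariate normal model.

```r
set.seed(2026)   # arbitrary seed so the sketch is reproducible
rho  <- 0.6      # assumed true correlation
n    <- 30       # assumed sample size
reps <- 5000     # number of simulated samples

# Draw one bivariate normal sample of size n and return its Pearson r
simulate_r <- function() {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  cor(x, y)
}

r_sim <- replicate(reps, simulate_r())
z_sim <- atanh(r_sim)   # Fisher z-transformation of each simulated r

# Empirical variance of z' compared with the approximation 1 / (n - 3)
print(c(var_z = var(z_sim), approx = 1 / (n - 3)))
# Empirical mean of z' compared with atanh(rho)
print(c(mean_z = mean(z_sim), atanh_rho = atanh(rho)))
```

Rerunning the sketch with a different `rho` should leave the variance of `z_sim` close to \(1/(n-3)\), which is the point of the transformation.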
Simple Linear Regression (SLR)
Model Theory
Simple linear regression models the expected value of a dependent variable \(Y\) as a linear function of an independent variable \(X\): \[Y = \alpha + \beta X + \epsilon,\]
where:
- \(\alpha\) is the intercept (expected value of \(Y\) when \(X = 0\)),
- \(\beta\) is the slope (change in \(Y\) per one-unit increase in \(X\)),
- \(\epsilon\) is the random error term, assumed to have mean zero.
Least Squares Estimation
The "best-fit" line is obtained by the least squares method, which minimizes the sum of squared residuals: \[SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \big(y_i - (a + b x_i)\big)^2.\]
The sample estimates are: \[b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \frac{s_y}{s_x}, \qquad a = \bar{y} - b\bar{x}.\]
Key Properties of the Least Squares Line:
- The line always passes through the centroid (\(\bar{x}, \bar{y}\)).
- The sum of the residuals is zero: \(\sum (y_i - \hat{y}_i) = 0\).
- The estimators \(a\) and \(b\) are unbiased under standard assumptions.
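The closed-form estimates and the first two properties can be checked numerically. The sketch below uses hypothetical data (the vectors `x` and `y` are illustrative only) and compares the hand-computed intercept and slope with the coefficients returned by `lm()`.

```r
# Hypothetical data, used only for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Closed-form least squares estimates
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# Equivalent slope expression: r * s_y / s_x
b_alt <- cor(x, y) * sd(y) / sd(x)

# Reference fit with lm()
fit <- lm(y ~ x)
print(rbind(manual = c(intercept = a, slope = b),
            lm     = unname(coef(fit))))

# Property 1: the fitted line passes through the centroid (x-bar, y-bar)
centroid_gap <- (a + b * mean(x)) - mean(y)

# Property 2: the residuals sum to zero (up to floating-point error)
residual_sum <- sum(y - (a + b * x))

print(c(centroid_gap = centroid_gap, residual_sum = residual_sum))
```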
Assumptions of SLR
For valid statistical inference (confidence intervals, hypothesis tests), the following assumptions should hold:
- Linearity: The true relationship between \(X\) and \(Y\) is linear.
- Independence: Observations are independent (e.g., no repeated measures).
- Normality: The residuals are approximately normally distributed.
- Homoscedasticity: The variance of residuals is constant across all values of \(X\).
Diagnostic plots (residuals vs. fitted, Q–Q plot) are used to assess these assumptions.
Inference on the Slope
We commonly test whether \(X\) is a significant predictor of \(Y\): \[H_0: \beta = 0 \quad \text{vs.} \quad H_a: \beta \ne 0.\]
- Standard error of the slope: \[SE_b = \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}},\]
where \(s_{y|x} = \sqrt{SSE/(n-2)}\) is the residual standard error.
- Test statistic: \(t = \frac{b}{SE_b}\), with \(n - 2\) degrees of freedom.
- Confidence interval for \(\beta\): \[b \pm t^* \cdot SE_b,\]
where \(t^*\) is the critical value from the \(t\)-distribution with \(n - 2\) degrees of freedom.
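A minimal R sketch of these slope calculations, on hypothetical data, is given here. It computes the standard error, test statistic, \(p\)-value, and 95% confidence interval by hand and then prints the corresponding output of `summary()` and `confint()` for comparison.

```r
# Hypothetical data, used only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 3.8, 6.4, 7.1, 10.5, 11.6, 14.9, 15.2)
n <- length(x)

fit <- lm(y ~ x)
b   <- unname(coef(fit)["x"])

# Residual standard error: sqrt(SSE / (n - 2))
s_yx <- sqrt(sum(residuals(fit)^2) / (n - 2))

# Standard error of the slope, t statistic, and two-sided p-value
se_b    <- s_yx / sqrt(sum((x - mean(x))^2))
t_stat  <- b / se_b
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval: b +/- t* x SE_b
t_crit <- qt(0.975, df = n - 2)
ci     <- c(lower = b - t_crit * se_b, upper = b + t_crit * se_b)

print(c(SE = se_b, t = t_stat, p = p_value))
print(ci)

# Built-in results for comparison
print(summary(fit)$coefficients["x", ])
print(confint(fit, "x", level = 0.95))
```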
Case Studies and R Implementation
Example 1: Body Fat and Age
Scenario: A study of 18 adults examining the relationship between age (\(X\)) and percent body fat (\(Y\)).
Data:
| Age | % Fat | Age | % Fat |
|---|---|---|---|
| 23 | 9.5 | 53 | 34.7 |
| 23 | 27.9 | 53 | 42.0 |
| 27 | 7.8 | 54 | 29.1 |
| 27 | 17.8 | 56 | 32.5 |
| 39 | 31.4 | 57 | 30.3 |
| 41 | 25.9 | 58 | 33.0 |
| 45 | 27.4 | 58 | 33.8 |
| 49 | 25.2 | 60 | 41.1 |
| 50 | 31.1 | 61 | 34.5 |
R Analysis:
# Data entry
age <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
fat <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,32.5,30.3,33,33.8,41.1,34.5)
# Correlation
cor(age, fat) # r ≈ 0.792
# Fit regression model
fit <- lm(fat ~ age)
summary(fit)
# Diagnostic plots
par(mfrow = c(2,2))
plot(fit)
Interpretation:
- The sample correlation is \(r \approx 0.79\), indicating a strong positive linear relationship.
- The estimated regression equation is: \[\widehat{\text{Body Fat}} = 3.22 + 0.55 \times \text{Age}.\]
- The slope is statistically significant (\(p < 0.001\)), so age is a useful predictor of body fat.
- The 95% confidence interval for the slope is approximately (0.32, 0.77).
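The quantities in this interpretation can be read directly from the fitted model. The sketch below re-enters the Example 1 data so that it runs on its own and uses the standard `coef()`, `confint()`, and `summary()` extractors. The object name `fit` follows the analysis above.

```r
# Example 1 data (age in years, percent body fat)
age <- c(23,23,27,27,39,41,45,49,50,53,53,54,56,57,58,58,60,61)
fat <- c(9.5,27.9,7.8,17.8,31.4,25.9,27.4,25.2,31.1,34.7,42,29.1,
         32.5,30.3,33,33.8,41.1,34.5)

fit <- lm(fat ~ age)

# Estimated intercept and slope of the regression line
print(round(coef(fit), 2))

# 95% confidence interval for the slope
print(round(confint(fit, "age", level = 0.95), 2))

# Two-sided p-value for H0: beta = 0
print(summary(fit)$coefficients["age", "Pr(>|t|)"])
```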
Example 2: Baseball Data (SLR and Prediction)
Scenario: Predicting the number of runs scored by a baseball team based on its batting average.
R Code:
batting <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,
0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
runs <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)
cor(batting, runs) # r ≈ 0.866
fit_bb <- lm(runs ~ batting)
summary(fit_bb)
Results:
- Regression equation: \[\widehat{\text{Runs}} = -706.2 + 5709.2 \times \text{Batting Avg}.\]
- \(R^2 = 0.749\), so about 75% of the variability in runs is explained by batting average.
- Prediction for a team with batting average 0.280: \(-706.2 + 5709.2 \times 0.280 \approx 892\) runs.
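The same prediction can be obtained with `predict()`, which can also attach a prediction interval to convey the uncertainty for a single new team. This is a sketch rather than part of the original analysis. The data frame name `new_team` is introduced here for illustration, and the 95% level is an assumed choice.

```r
# Example 2 data
batting <- c(0.294,0.278,0.278,0.270,0.274,0.271,0.263,0.257,
             0.267,0.265,0.256,0.254,0.246,0.266,0.262,0.251)
runs <- c(968,938,925,887,825,810,807,798,793,792,764,752,740,738,731,708)

fit_bb <- lm(runs ~ batting)

# New observation: a team with batting average 0.280
new_team <- data.frame(batting = 0.280)

# Point prediction from the fitted line
pred_point <- predict(fit_bb, newdata = new_team)
print(pred_point)

# Point prediction with a 95% prediction interval for one new team
pred_int <- predict(fit_bb, newdata = new_team,
                    interval = "prediction", level = 0.95)
print(pred_int)
```

A prediction interval is wider than a confidence interval for the mean response, because it also accounts for the scatter of individual teams around the line.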
Example 3: Comparing Correlations Across Groups
Scenario: Does the correlation between age and brain volume differ between two clinical groups?
- Group 1: \(n_1 = 27,\ r_1 = -0.753\)
- Group 2: \(n_2 = 27,\ r_2 = -0.495\)
Fisher’s Z-transformation: \[z'_1 = \frac{1}{2} \ln\left(\frac{1 - 0.753}{1 + 0.753}\right) \approx -0.980\] \[z'_2 = \frac{1}{2} \ln\left(\frac{1 - 0.495}{1 + 0.495}\right) \approx -0.543\]
Test statistic: \[Z = \frac{-0.980 + 0.543}{\sqrt{\frac{1}{24} + \frac{1}{24}}} = \frac{-0.437}{\sqrt{0.0833}} \approx -1.51\]
Two-sided \(p\)-value ≈ 0.130.
Conclusion: There is no statistically significant difference between the two correlations (\(\alpha = 0.05\)).
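For readers who want to reproduce the arithmetic, a minimal R sketch follows. It uses only the sample sizes and correlations given above. The variable names are illustrative, and `atanh()` is base R's inverse hyperbolic tangent.

```r
# Summary statistics from Example 3
r1 <- -0.753; n1 <- 27
r2 <- -0.495; n2 <- 27

# Fisher z-transformation of each correlation
z1 <- atanh(r1)
z2 <- atanh(r2)

# Standard error of the difference z1 - z2 for independent samples
se_diff <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))

# Z statistic and two-sided p-value under H0: rho1 = rho2
Z <- (z1 - z2) / se_diff
p <- 2 * pnorm(abs(Z), lower.tail = FALSE)

print(round(c(z1 = z1, z2 = z2, Z = Z, p = p), 3))
```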
Review Questions
1. Correlation and Causation
A positive correlation between variables \(X\) and \(Y\) implies that increasing \(X\) causes \(Y\) to increase.
- (a) Always true
- (b) Sometimes true
- (c) Never true
Answer: (c) Never true. Correlation alone measures association and does not establish causation, even when a causal link happens to exist. Confounding, reverse causality, or coincidence may explain the observed relationship.
2. Visualizing Correlation
If the correlation between working out and body fat is exactly \(-1.0\), which statement is FALSE?
- (a) Points lie on a perfect straight line.
- (b) 100% of the variance is explained.
- (c) The slope of the best-fit line is \(-1.0\).
- (d) The best-fit line has a negative slope.
Answer: (c). While \(r = -1\) implies a perfect negative linear relationship, the numerical value of the slope depends on the scales of \(X\) and \(Y\). The slope is not necessarily \(-1\).
3. Least Squares Principle
Which statement best describes the principle of "least squares"?
- (a) Minimizes the sum of residuals.
- (b) Minimizes the sum of squared residuals.
- (c) Minimizes the distance between actual and predicted values.
Answer: (b). Least squares minimizes \(\sum (y_i - \hat{y}_i)^2\).
4. Prediction and Residuals
Given the model: \(\text{Fat} = 6.83 + 0.97 \times \text{Protein}\). A burger has 20g protein and actually contains 20g fat.
- Predicted fat = \(6.83 + 0.97 \times 20 = 26.23\)g
- Residual = \(20 - 26.23 = -6.23\)g
Interpretation: The model overestimates fat content by 6.23g for this burger.
References
- SDA App, see the Learning Modules
- SOCR Probability and Statistics EBook, Correlation and Regression Chapter
- Altman, D. G. (1991). Practical Statistics for Medical Research.
- Dunn, G. (1989). Design and Analysis of Reliability Studies.
- SOCR Home page: https://socr.umich.edu