Simple Linear Regression Tutorial
Simple Linear Regression Tutorial Using LA Neighborhoods Data
Data: We will be using the LA Neighborhoods Data for this tutorial.
Goal: Our goal is to use SOCR to predict median household income from a single explanatory variable. In this example, the explanatory variable is median age.
Step 1: First, we will import the data into the SOCR Simple Regression Analysis Activity. Head to http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_LA_Neighborhoods_Data#Data_Source and find the table with the data. Select all of the data, and press Ctrl+C (Command+C on Macs) to copy it.
Step 2: Next, head to http://socr.ucla.edu/htmls/SOCR_Analyses.html, and find the Simple Regression Analysis Activity in the drop-down menu.
Step 3: Now click the “PASTE” button under the drop-down menu. You should now see the data in the window.
Step 4: Click on the “MAPPING” tab, and add Income to the dependent variable list and Age to the independent variable list.
Step 5: Click “CALCULATE”. You will now be taken to the “RESULTS” tab.
Here you can see the regression equation, \(R^2\), individual residuals, and also mean and standard deviation for both variables.
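The quantities on the “RESULTS” tab can also be reproduced by hand. The following is a minimal sketch of ordinary least squares for one predictor; it is not SOCR's own code. The function name `fit_simple_regression` and the toy numbers are assumptions made for illustration, and the toy numbers are not the LA Neighborhoods data.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared, residuals).
    """
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squares of x and sum of cross products of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus fitted value
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared, residuals

# Made-up toy values for illustration only (NOT the LA Neighborhoods data)
toy_age = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_income = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope, r_squared, residuals = fit_simple_regression(toy_age, toy_income)
print(f"Income = {intercept:.3f} + {slope:.3f} * Age, R^2 = {r_squared:.4f}")
```

To apply this to the real data, replace the two toy lists with the Age and Income columns copied in Step 1.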
Step 6: Click “GRAPH”. This tab shows the scatterplot of Income vs. Age. We see an upward trend: as median age increases, so does median household income. There are also residual plots and the Normal Q-Q plot.
Step 7: We want to check that the assumptions of linear regression are met.
Assumption 1: There is a linear relationship between the independent variable (age) and the dependent variable (income)
- How to check: Make a scatter plot of income and age
- How to fix: Apply a transformation (for example, Log(y) vs. x). If no transformation helps, the relationship is not linear.
Assumption Met
Assumption 2: The variance is constant
- How to check: Look at the plot of residuals vs. predicted values. Make sure there is no pattern, such as the residuals getting larger as the predicted values increase.
- How to fix: Log-transform the variables, or fix the underlying violations of independence or linearity.
Assumption Mostly Met: there is a slight increase in the residuals at the high end of age
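The visual check for constant variance can be backed by a rough number. The sketch below is an informal heuristic, not a formal test and not something SOCR reports: it compares the spread of the residuals in the upper half of the predicted values with the spread in the lower half. The function name `residual_spread_ratio` and the toy values are assumptions made for illustration.

```python
from statistics import pstdev

def residual_spread_ratio(fitted, residuals):
    """Ratio of residual spread (upper half of fitted values / lower half).

    A value near 1 is consistent with constant variance; a value well
    above 1 suggests the residuals grow with the predicted values.
    Needs at least 4 points and nonzero spread in the lower half.
    """
    pairs = sorted(zip(fitted, residuals))  # order by predicted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[-half:]]
    return pstdev(high) / pstdev(low)

# Made-up toy values for illustration only (NOT the LA Neighborhoods data)
toy_fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fanning_residuals = [0.1, -0.1, 0.1, -1.0, 1.0, -1.0]  # spread grows
steady_residuals = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]   # spread is constant

fanning_ratio = residual_spread_ratio(toy_fitted, fanning_residuals)
steady_ratio = residual_spread_ratio(toy_fitted, steady_residuals)
print(f"fanning: {fanning_ratio:.2f}, steady: {steady_ratio:.2f}")
```

A ratio only slightly above 1 would match the mild increase seen in the residual plot; there is no fixed cutoff, so the plot remains the primary check.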
Assumption 3: Errors are normally distributed
- How to check: Look at the Normal Q-Q plot (the points should lie close to a straight line)
- How to fix: Remove outliers, if applicable. A nonlinear transformation may be needed.
Assumption Met
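To see what the Normal Q-Q plot is drawing, the following sketch computes its points directly: the sorted, standardized residuals paired with the matching quantiles of the standard normal distribution. The plotting positions \((i + 0.5)/n\) are one common convention and an assumption here; SOCR may use a different one. The function name `normal_qq_points` and the toy residuals are also illustrative.

```python
from statistics import NormalDist, mean, pstdev

def normal_qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs."""
    n = len(residuals)
    center, scale = mean(residuals), pstdev(residuals)
    # Standardize, then sort from smallest to largest
    ordered = sorted((r - center) / scale for r in residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i + 0.5) / n for i = 0 .. n-1 (assumed convention)
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Made-up toy residuals for illustration only
toy_residuals = [-1.2, -0.4, 0.0, 0.5, 1.1]
qq_points = normal_qq_points(toy_residuals)
for theoretical_q, sample_q in qq_points:
    print(f"{theoretical_q:7.3f}  {sample_q:7.3f}")
```

If the errors are roughly normal, the two columns should be close to each other, which is the "straight line" seen in the plot.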
Conclusions
There is no major violation of the linear regression assumptions, so we proceed with our analysis.
We can see from the results tab that the regression equation is:
Income = -74549.596 + 4096.055 age
Income is the predicted value, -74549.596 is the intercept, 4096.055 is the slope, and age is the independent variable.
The linear model states that for every 1-year increase in median age, the predicted median household income increases by $4,096.06.
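As a quick arithmetic check, the fitted equation can be evaluated directly. The sketch below uses the intercept and slope reported above; the function name `predict_income` and the median age of 35 are hypothetical choices for illustration. Note that the intercept corresponds to a median age of 0, which is far outside the range of the data, so it should not be interpreted as a realistic income.

```python
# Coefficients from the regression equation reported on the RESULTS tab
INTERCEPT = -74549.596
SLOPE = 4096.055

def predict_income(median_age):
    """Predicted median household income for a given median age."""
    return INTERCEPT + SLOPE * median_age

example_age = 35  # hypothetical median age, chosen for illustration
predicted = predict_income(example_age)
one_year_change = predict_income(example_age + 1) - predicted

print(f"Predicted income at median age {example_age}: ${predicted:,.2f}")
print(f"Change for one additional year: ${one_year_change:,.2f}")
```

The one-year change equals the slope regardless of the starting age, which is exactly what the interpretation of the slope says.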