Difference between revisions of "Simple Linear Regression Tutorial"
Latest revision as of 21:37, 28 July 2011
SOCR_EduMaterials_AnalysesActivities - Simple Linear Regression Tutorial
Simple Linear Regression Tutorial Using LA Neighborhoods Data
Data: We will be using the LA Neighborhoods Data for this tutorial.
Goal: Our goal is to predict the median income using one explanatory variable by using SOCR. In this example, we will use the age variable.
Step 1: First, we will import the data into the SOCR Simple Regression Analysis Activity. Head to LA Neighborhoods Data (http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_LA_Neighborhoods_Data#Data_Source) and find the table with the data. Select all of the data, and press Ctrl+C (Command+C on Macs) to copy it.
Step 2: Next, head to http://socr.ucla.edu/htmls/SOCR_Analyses.html, and find the Simple Regression Analysis Activity in the drop-down menu.
Step 3: Now click the “PASTE” button under the drop-down menu. You should now see the data in the window.
Step 4: Click on the “MAPPING” tab. This is where we define our dependent and independent variables. The dependent variable is the one we want to predict, and the independent variable is the one we use to make the prediction. In this example, we add Income to the dependent variable list and Age to the independent variable list.
Step 5: Click “CALCULATE”. You will now be taken to the “RESULTS” tab.
Here you can see the regression equation, \(R^2\), the individual residuals, and the mean and standard deviation of both variables.
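For readers who want to see what kind of calculation sits behind these results, here is a minimal Python sketch of an ordinary least-squares fit. The `ages` and `incomes` values are made-up placeholders, not the LA Neighborhoods Data, and the function name `fit_simple_regression` is our own; SOCR's internal implementation may differ.

```python
from statistics import mean


def fit_simple_regression(x, y):
    """Return (intercept, slope, r_squared, residuals) for the line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x, and sum of cross-products of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus predicted value.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return intercept, slope, r_squared, residuals


# Hypothetical placeholder values (NOT the tutorial's data set).
ages = [25.0, 30.0, 35.0, 40.0, 45.0]
incomes = [30000.0, 48000.0, 70000.0, 88000.0, 112000.0]

intercept, slope, r_squared, residuals = fit_simple_regression(ages, incomes)
print(f"Income = {intercept:.3f} + {slope:.3f} * Age")
print(f"R^2 = {r_squared:.4f}")
```

With the real data pasted in place of the placeholder lists, the printed equation should be comparable to the one shown in the RESULTS tab.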
Step 6: Click “GRAPH”. Here is the scatterplot of Income vs Age. We see the upward trend: As median age increases, so does median household income.
There are also residual plots and the Normal Q-Q plot.
Step 7: We want to check that the assumptions of linear regression are met.
Assumption 1: There is a linear relationship between the independent variable (age) and the dependent variable (income).
- How to check: Make a scatter plot of income and age
- How to fix: Try a transformation (for example, log(y) vs. x). If no transformation helps, conclude that the relationship is not linear.
Assumption Met
Assumption 2: The variance is constant
- How to check: Look at plot of residuals vs. predicted values. Make sure there is not a pattern, such as the residuals getting larger as the predicted values increase.
- How to fix: Take the log of one or both variables, or fix the underlying independence or linearity problems.
Slight increase of residuals at the high end of age
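The visual check above can be complemented with a rough numeric one. The sketch below, in Python, compares the spread of the residuals in the lower and upper halves of the predicted values; a ratio far from 1 hints at non-constant variance. The function name `residual_spread_ratio` and the sample numbers are hypothetical, and this is an informal diagnostic, not a formal test or a SOCR feature.

```python
from statistics import pstdev


def residual_spread_ratio(fitted, residuals):
    """Spread of residuals for high fitted values divided by spread for low ones.

    Assumes at least four observations and that the lower half of the
    residuals is not constant (otherwise the division is undefined).
    """
    # Order the residuals by their predicted (fitted) value.
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    lower = [e for _, e in pairs[:half]]
    upper = [e for _, e in pairs[-half:]]
    return pstdev(upper) / pstdev(lower)


# Hypothetical values chosen so that the residuals fan out as fitted values grow.
fitted = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
residuals = [1.0, -1.0, 1.0, -3.0, 3.0, -3.0]

ratio = residual_spread_ratio(fitted, residuals)
print(f"upper/lower residual spread ratio: {ratio:.2f}")
```

A ratio only slightly above 1 would match a finding like the one noted here: a slight increase in residual size at the high end.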
Assumption 3: Errors are normally distributed
- How to check: Normal Q-Q plot (the points should lie close to a straight line)
- How to fix: Take out outliers, if applicable. A non-linear transformation may be needed.
Assumption Met
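To make the Normal Q-Q check concrete, here is a small Python sketch that builds the points of such a plot by hand: each sorted residual is paired with the standard normal quantile at the plotting position (i + 0.5)/n. The plotting-position formula is one common convention and may not be the one SOCR uses; the residual values and the helper names `normal_qq_points` and `pearson_r` are hypothetical.

```python
from statistics import NormalDist, mean


def normal_qq_points(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def pearson_r(pairs):
    """Correlation between the two coordinates; near 1 means a near-straight line."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical residuals (NOT taken from the tutorial's output).
sample_residuals = [-2.1, -0.9, -0.3, 0.2, 0.8, 2.3]

points = normal_qq_points(sample_residuals)
qq_correlation = pearson_r(points)
print(f"Q-Q correlation: {qq_correlation:.4f}")
```

The correlation is only a summary; the plot itself remains the better guide, since a few outlying points can bend the ends of the line without changing the correlation much.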
Conclusions
There is no major violation of the linear regression assumptions, so we proceed with our analysis:
We can see from the results tab that the regression equation is:
Income = -74549.596 + 4096.055 age
Income is the predicted value, -74549.596 is the intercept, 4096.055 is the slope, and age is the independent variable.
The linear model states that for every 1-year increase in median age, the predicted median household income increases by $4,096.06.
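As a quick illustration of how to read the equation, the Python sketch below plugs ages into the fitted line. The intercept and slope are the values reported above; the function name `predict_income` and the example ages of 35 and 36 are our own choices for illustration.

```python
# Coefficients reported in the RESULTS tab of the tutorial.
INTERCEPT = -74549.596
SLOPE = 4096.055


def predict_income(median_age):
    """Predicted median household income for a given median age."""
    return INTERCEPT + SLOPE * median_age


# Example ages are arbitrary illustrations, one year apart.
income_at_35 = predict_income(35.0)
income_at_36 = predict_income(36.0)
one_year_change = income_at_36 - income_at_35

print(f"Predicted income at median age 35: {income_at_35:.3f}")
print(f"Change for one extra year of age: {one_year_change:.3f}")
```

The difference between any two predictions one year apart equals the slope. Because the intercept is negative, the equation gives negative incomes for very low ages, so predictions are best limited to the range of ages present in the data.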