Difference between revisions of "AP Statistics Curriculum 2007 GLM Predict"

From SOCR
 
m (Text replacement - "{{translate|pageName=http://wiki.stat.ucla.edu/socr/" to ""{{translate|pageName=http://wiki.socr.umich.edu/")
 
(9 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
==[[AP_Statistics_Curriculum_2007 | General Advance-Placement (AP) Statistics Curriculum]] - Variation and Prediction Intervals ==
 
==[[AP_Statistics_Curriculum_2007 | General Advance-Placement (AP) Statistics Curriculum]] - Variation and Prediction Intervals ==
  
=== Linear Modeling - Variation and Prediction Intervals ===
+
=== Inference on Linear Models ===
Example on how to attach images to Wiki documents in included below (this needs to be replaced by an appropriate figure for this section)!
 
<center>[[Image:AP_Statistics_Curriculum_2007_IntroVar_Dinov_061407_Fig1.png|500px]]</center>
 
  
===Approach===
+
Suppose we have again ''n'' pairs ''(X,Y)'', {<math>X_1, X_2, X_3, \cdots, X_n</math>} and {<math>Y_1, Y_2, Y_3, \cdots, Y_n</math>}, of observations of the same process. In the [[AP_Statistics_Curriculum_2007_GLM_Regress |previous section, we discussed how to fit a line to the data]]. The main question is how to determine the best line?
Models & strategies for solving the problem, data understanding & inference.  
 
  
* TBD
+
====[[AP_Statistics_Curriculum_2007_GLM_Corr#Airfare_Example |Airfare Example]]====
 +
We can see from the [[SOCR_EduMaterials_Activities_ScatterChart |scatterplot]] that [[AP_Statistics_Curriculum_2007_GLM_Corr#Airfare_Example | greater distance is associated with higher airfare]]. In other words, airports that tend to be further from Baltimore tend to have more expensive airfare. To decide on the best fitting line, we use the '''least-squares method''' to fit the least squares (regression) line.
  
===Model Validation===
+
<center>[[Image:SOCR_EBook_Dinov_GLM_Regr_021708_Fig1.jpg|500px]]</center>
Checking/affirming underlying assumptions.
 
  
* TBD
+
====Confidence Interval Estimating of the Slope and Intercept of Linear Model====
 +
The parameters (''a'' and ''b'') of the linear regression line, <math>Y = a + bX</math>, are estimated using [http://en.wikipedia.org/wiki/Ordinary_Least_Squares  Least Squares]. The least squares technique finds the line that minimizes the sum of the squares of the regression '''residuals''', <math>\hat{\varepsilon_i}=\hat{y}_{i}-y_i</math>, <math> \sum_{i=1}^N {\hat{\varepsilon_i}^2} = \sum_{i=1}^N (\hat{y}_{i}-y_i)^2 </math>, where <math>y_i</math> and <math>\hat{y}_{i}=a+bx_i</math> are the observed and the predicted values of ''Y'' for <math>x_i</math>.
  
===Computational Resources: Internet-based SOCR Tools===
+
The minimization problem can be solved using calculus, by finding the first order partial derivatives and setting them equal to zero. The solution gives the slope and y-intercept of the regressions line:
* TBD
 
  
===Examples===
+
* Regression line Slope:
Computer simulations and real observed data.  
+
: <math> \hat{b} = \frac {\sum_{i=1}^{N}  (x_{i} - \bar{x})(y_{i} - \bar{y}) }  {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} </math>
 +
: <math> \hat{b} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}}  {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2}  = \rho_{X,Y} \frac {s_y}{s_x} </math>, where [[AP_Statistics_Curriculum_2007_GLM_Corr |<math>\rho_{X,Y}</math> is the correlation coefficient]].
  
* TBD
+
* Y-intercept:
+
: <math> \hat{a} = \bar{y} - \hat{b} \bar{x} </math>
===Hands-on activities===
 
Step-by-step practice problems.
 
  
* TBD
+
If the error terms are Normally distributed, the estimate of the slope coefficient has a normal distribution with mean equals to '''b''' and '''standard error''' given by:
 +
 
 +
: <math> s_ \hat{b} = \sqrt { {1\over (N-2)} \frac {\sum_{i=1}^N \hat{\varepsilon_i}^2} {\sum_{i=1}^N (x_i - \bar{x})^2} }</math>.
 +
 
 +
* A '''confidence interval''' for ''b'' can be created using a [[AP_Statistics_Curriculum_2007_StudentsT | T-distribution with N-2 degrees of freedom]]:
 +
 
 +
:<math> [ \hat{b} - s_ \hat{b} t_{(\alpha/2, N-2)},\hat{b} + s_ \hat{b} t_{(\alpha/2, N-2)}] </math>
 +
 
 +
In other words, if there is an 1 mile increase in distance the airfare will go up by between $0.054 and $0.180.
 +
 
 +
* '''Significance testing''':  If X is not useful for predicting Y, then the true slope is zero. In a hypothesis test ,our status quo null hypothesis would be that there is no relationship between X and Y
 +
 
 +
: Hypotheses: <math>H_o: b = 0</math> vs. <math>H_1: b \not= 0</math> (or <math>H_1: b > 0</math>  or <math>H_1: b < 0</math>).
 +
 
 +
: Test-statistics: <math>t_o={b-0\over SE(b)}</math>, where <math>t_o \sim t_{(df=n-2)}</math> is the [[AP_Statistics_Curriculum_2007_StudentsT |T-Distribution]].
 +
 
 +
====Example====
 +
For the [[AP_Statistics_Curriculum_2007_GLM_Corr#Airfare_Example | distance vs. airfare example]], we can compute the standard error of the slope coefficient (''b''), SE(b)
 +
: <math>SE(b)={37.83 \over \sqrt{1786499}}=0.0283</math>.
 +
 
 +
* Then a 95% confidence interval for '''b''' is given by:
 +
 
 +
: CI(b): <math>b \pm t_{(\alpha/2, df=10)}SE(b)=0.11738 \pm 2.228\times 0.02832=[0.054 , 0.180].</math>
 +
 
 +
* Significance testing:
 +
: <math>t_o={b-0\over SE(b)}={0.11738-0 \over 0.02832}=4.145</math> and <math>p-value =0.002</math>.
 +
 
 +
===Earthquake Example===
 +
 
 +
Use the [[SOCR_Data_Dinov_021708_Earthquakes | SOCR Earthquake Dataset]] to formulate and test a research hypothesis about the slope of the best-leaner fit between the [http://nationalatlas.gov/articles/mapping/a_latlong.html Longitude] and the [http://nationalatlas.gov/articles/mapping/a_latlong.html Latitude] of the California Earthquakes since 1900. You can see the [http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Earthquake5Data_GoogleMap.html SOCR Geomap of these Earthquakes]. The image below shows how to use the [[SOCR_EduMaterials_AnalysisActivities_SLR |Simple Linear regression]] in [http://socr.ucla.edu/htmls/SOCR_Analyses.html SOCR Analyses] to calculate the regression line and make inference on the slope.
 +
 
 +
<center>[[Image:SOCR_EBook_Dinov_GLM_Regr_021708_Fig2.jpg|500px]]</center>
  
 
<hr>
 
<hr>
===References===
+
 
* TBD
+
===[[EBook_Problems_GLM_Predict|Problems]]===
  
 
<hr>
 
<hr>
 
* SOCR Home page: http://www.socr.ucla.edu
 
* SOCR Home page: http://www.socr.ucla.edu
  
{{translate|pageName=http://wiki.stat.ucla.edu/socr/index.php?title=AP_Statistics_Curriculum_2007_GML_Predict}}
+
"{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=AP_Statistics_Curriculum_2007_GML_Predict}}

Latest revision as of 15:34, 3 March 2020

General Advance-Placement (AP) Statistics Curriculum - Variation and Prediction Intervals

Inference on Linear Models

Suppose we again have n pairs (X,Y), {\(X_1, X_2, X_3, \cdots, X_n\)} and {\(Y_1, Y_2, Y_3, \cdots, Y_n\)}, of observations of the same process. In the previous section, we discussed how to fit a line to the data. The main question is how to determine the best line.

Airfare Example

We can see from the scatterplot that greater distance is associated with higher airfare. In other words, airports that are farther from Baltimore tend to have more expensive airfares. To decide on the best-fitting line, we use the least-squares method to fit the least-squares (regression) line.

SOCR EBook Dinov GLM Regr 021708 Fig1.jpg

Confidence Interval Estimation of the Slope and Intercept of the Linear Model

The parameters (a and b) of the linear regression line, \(Y = a + bX\), are estimated using Least Squares. The least squares technique finds the line that minimizes the sum of the squares of the regression residuals, \(\hat{\varepsilon_i}=\hat{y}_{i}-y_i\), that is, it minimizes \( \sum_{i=1}^N {\hat{\varepsilon_i}^2} = \sum_{i=1}^N (\hat{y}_{i}-y_i)^2 \), where \(y_i\) and \(\hat{y}_{i}=a+bx_i\) are the observed and the predicted values of Y for \(x_i\), and N = n is the number of observed pairs.

The minimization problem can be solved using calculus, by finding the first-order partial derivatives with respect to a and b and setting them equal to zero. The solution gives the slope and y-intercept of the regression line:

  • Regression line Slope:

\[ \hat{b} = \frac {\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y}) } {\sum_{i=1}^{N} (x_{i} - \bar{x}) ^2} \]

\[ \hat{b} = \frac {\sum_{i=1}^{N} {(x_{i}y_{i})} - N \bar{x} \bar{y}} {\sum_{i=1}^{N} (x_{i})^2 - N \bar{x}^2} = \rho_{X,Y} \frac {s_y}{s_x}, \]

where \(\rho_{X,Y}\) is the correlation coefficient.

  • Y-intercept:

\[ \hat{a} = \bar{y} - \hat{b} \bar{x} \]
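The slope and intercept formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small data set (`x_demo`, `y_demo`) are hypothetical and are not part of the airfare data. It also checks numerically that the deviation form of the slope agrees with the correlation form, \(\rho_{X,Y} s_y / s_x\).

```python
import math


def least_squares_line(x, y):
    """Return (a_hat, b_hat) for the least-squares line Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula (deviation form).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


def slope_from_correlation(x, y):
    """Return the slope computed as r * s_y / s_x (correlation form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    r = s_xy / math.sqrt(s_xx * s_yy)   # sample correlation coefficient
    s_x = math.sqrt(s_xx / (n - 1))     # sample standard deviation of x
    s_y = math.sqrt(s_yy / (n - 1))     # sample standard deviation of y
    return r * s_y / s_x


# Hypothetical demonstration data (not the airfare data).
x_demo = [1, 2, 3, 4, 5]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]

a_demo, b_demo = least_squares_line(x_demo, y_demo)
print("intercept:", a_demo, "slope:", b_demo)
print("slope via correlation:", slope_from_correlation(x_demo, y_demo))
```

Both slope computations should agree up to floating-point rounding, because the factors of \(N-1\) in \(s_x\) and \(s_y\) cancel.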

If the error terms are Normally distributed, the estimate of the slope coefficient has a normal distribution with mean equal to b and standard error given by:

\[ s_{\hat{b}} = \sqrt { {1\over (N-2)} \frac {\sum_{i=1}^N \hat{\varepsilon_i}^2} {\sum_{i=1}^N (x_i - \bar{x})^2} }. \]

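  • A confidence interval for b can be created using a T-distribution with N-2 degrees of freedom:

\[ [ \hat{b} - s_{\hat{b}} t_{(\alpha/2, N-2)},\ \hat{b} + s_{\hat{b}} t_{(\alpha/2, N-2)}] \]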

For the airfare example (worked out in the Example below), this means that, with 95% confidence, each 1-mile increase in distance is associated with an increase in airfare of between $0.054 and $0.180.
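As a rough illustration of the standard error and interval formulas, the following Python sketch computes \(\hat{b}\), its standard error, and a confidence interval from raw data. The function name, the demonstration data, and the choice to pass the critical value in by hand are all assumptions made for this sketch; the critical value 3.182 is the tabulated \(t_{(0.025,\ df=3)}\) for the five hypothetical points used here.

```python
import math


def slope_confidence_interval(x, y, t_crit):
    """Return (b_hat, se_b, (low, high)) for the regression slope.

    t_crit is the T-distribution critical value t_(alpha/2, N-2),
    looked up by the caller (for example, from a t table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    # Sum of squared residuals (the sign convention does not matter once squared).
    sse = sum((a_hat + b_hat * xi - yi) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope: sqrt( SSE / (N-2) / S_xx ).
    se_b = math.sqrt(sse / (n - 2) / s_xx)
    margin = t_crit * se_b
    return b_hat, se_b, (b_hat - margin, b_hat + margin)


# Hypothetical demonstration data (N = 5, so df = N - 2 = 3).
x_ci = [1, 2, 3, 4, 5]
y_ci = [2.1, 3.9, 6.2, 7.8, 10.1]

b_ci, se_ci, (low_ci, high_ci) = slope_confidence_interval(x_ci, y_ci, t_crit=3.182)
print("slope:", b_ci, "SE:", se_ci, "95% CI:", (low_ci, high_ci))
```

Passing the critical value explicitly keeps the sketch free of third-party dependencies; a statistics library could supply it instead.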

  • Significance testing: If X is not useful for predicting Y, then the true slope is zero. In a hypothesis test, our status quo null hypothesis would be that there is no relationship between X and Y.
Hypotheses: \(H_o: b = 0\) vs. \(H_1: b \not= 0\) (or \(H_1: b > 0\) or \(H_1: b < 0\)).
Test statistic: \(t_o={b-0\over SE(b)}\), where, under \(H_o\), \(t_o \sim t_{(df=n-2)}\) follows the T-Distribution with n-2 degrees of freedom.

Example

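For the distance vs. airfare example, we can compute the standard error of the slope coefficient (b), SE(b). Following the standard error formula above, the numerator is the residual standard error, \(\sqrt{\sum_{i=1}^N \hat{\varepsilon_i}^2 / (N-2)}\), and the denominator is \(\sqrt{\sum_{i=1}^N (x_i - \bar{x})^2}\):

\[SE(b)={37.83 \over \sqrt{1786499}}=0.0283.\]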

  • Then a 95% confidence interval for b is given by:

CI(b): \[b \pm t_{(\alpha/2, df=10)}SE(b)=0.11738 \pm 2.228\times 0.02832=[0.054 , 0.180].\]
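  • Significance testing:

\[t_o={b-0\over SE(b)}={0.11738-0 \over 0.02832}=4.145\] and p-value \(= 0.002\). Since the p-value is below 0.05, we reject \(H_o: b = 0\) at the 5% level and conclude that distance is a significant linear predictor of airfare.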
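The arithmetic in this example can be re-traced with a short Python sketch that uses only the summary numbers quoted above (0.11738, 37.83, 1786499, and the critical value 2.228 for df = 10). The helper names are hypothetical, and the two-sided p-value is approximated by numerically integrating the T-distribution density with Simpson's rule, which is an implementation choice for this sketch rather than the method used by SOCR.

```python
import math


def t_pdf(t, df):
    """Density of Student's T-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_obs, df, steps=2000):
    """Approximate P(|T| >= |t_obs|) using Simpson's rule (steps must be even)."""
    upper = abs(t_obs)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3      # P(0 <= T <= |t_obs|)
    return 1 - 2 * area_zero_to_t       # both tails, by symmetry


# Summary values quoted in the airfare example.
b_hat = 0.11738
se_b = 37.83 / math.sqrt(1786499)
t_crit = 2.228      # t_(0.025, df=10), as used in the example
df = 10

ci_low = b_hat - t_crit * se_b
ci_high = b_hat + t_crit * se_b
t_obs = (b_hat - 0) / se_b
p_value = two_sided_p_value(t_obs, df)

print("SE(b):", round(se_b, 4))
print("95% CI:", (round(ci_low, 3), round(ci_high, 3)))
print("t_o:", round(t_obs, 3), "p-value:", round(p_value, 4))
```

Because the sketch carries the unrounded standard error through every step, the test statistic it prints can differ from 4.145 in the third decimal place; the interval and the p-value agree with the values above after rounding.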

Earthquake Example

Use the SOCR Earthquake Dataset to formulate and test a research hypothesis about the slope of the best linear fit between the Longitude and the Latitude of the California Earthquakes since 1900. You can see the SOCR Geomap of these Earthquakes. The image below shows how to use the Simple Linear Regression in SOCR Analyses to calculate the regression line and make inferences about the slope.

SOCR EBook Dinov GLM Regr 021708 Fig2.jpg

Problems

