

(29 intermediate revisions by 2 users not shown) 
Line 1: 
−  This is a General Advanced-Placement (AP) Statistics Curriculum EBook
 +  #REDIRECT [[Probability and statistics EBook]] 
−   
−  ==[[AP_Statistics_Curriculum_2007_Preface | Preface]]==

−  This is an Internet-based EBook for an advanced-placement (AP) statistics educational curriculum. The EBook was initially developed by the UCLA [[SOCR | Statistics Online Computational Resource (SOCR)]]; however, all statistics instructors, researchers and educators are encouraged to contribute to this effort and improve the content of these learning materials.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Format | Format]]===

−  Follow the instructions in [[AP_Statistics_Curriculum_2007_Format | this page]] to expand, revise or improve the materials in this EBook.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Usage | Learning and Instructional Usage]]===

−  This section describes the means of traversing, searching, discovering and utilizing the SOCR Statistics EBook resources in formal curricula or informal learning settings.
 
−   
−  ==Chapter I: Introduction to Statistics==
 
−  ===[[AP_Statistics_Curriculum_2007_IntroVar | The Nature of Data and Variation]]===

−  Natural phenomena in real life are unpredictable, and even carefully designed experiments are bound to generate data that vary because of intrinsic (internal to the system) or extrinsic (due to the ambient environment) effects.

−  How many natural processes or phenomena in real life can we describe that have an exact mathematical closed-form description and are completely deterministic? How do we model the rest of the processes that are unpredictable and have random characteristics?
 
−   
−  ===[[AP_Statistics_Curriculum_2007_IntroUses | Uses and Abuses of Statistics]]===

−  Statistics is the science of variation, randomness and chance. As such, statistics is different from other sciences, where the processes being studied obey exact deterministic mathematical laws. Statistics provides quantitative inference represented as long-run probability values, confidence or prediction intervals, odds, chances, etc., which may ultimately be subjected to varying interpretations. The phrase ''Uses and Abuses of Statistics'' refers to the notion that in some cases statistical results may be used as evidence for seemingly opposite theses. However, most of the time, common [http://en.wikipedia.org/wiki/Logic principles of logic] allow us to disambiguate the obtained statistical inference.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_IntroDesign | Design of Experiments]]===

−  Design of experiments is the blueprint for planning a study or experiment, performing the data collection protocol and controlling the study parameters for accuracy and consistency. Data, or information, is typically collected in regard to a specific process or phenomenon being studied to investigate the effects of some controlled variables (independent variables or predictors) on other observed measurements (responses or dependent variables). Both types of variables are associated with specific observational units (living beings, components, objects, materials, etc.).
 
−   
−  ===[[AP_Statistics_Curriculum_2007_IntroTools | Statistics with Tools (Calculators and Computers)]]===

−  All methods for data analysis, understanding or visualizing are based on models that often have compact analytical representations (e.g., formulas, symbolic equations, etc.). Models are used to study processes theoretically. Empirical validations of the utility of models are achieved by inputting data and executing tests of the models. This validation step may be done manually, by computing the model prediction or model inference from recorded measurements. This process may be possible by hand, but only for small numbers of observations (<10). In practice, we write (or use existing) algorithms and computer programs that automate these calculations for better efficiency, accuracy and consistency in applying models to larger datasets.
 
−   
−  ==Chapter II: Describing, Exploring, and Comparing Data==
 
−  ===[[AP_Statistics_Curriculum_2007_EDA_DataTypes | Types of Data]]===

−  There are two important concepts in any data analysis: '''Population''' and '''Sample'''.

−  Each of these may generate data of two major types: '''Quantitative''' or '''Qualitative''' measurements.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_EDA_Freq | Summarizing Data with Frequency Tables]]===

−  There are two important ways to describe a data set (sample from a population): '''Graphs''' or '''Tables'''.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_EDA_Pics | Pictures of Data]]===
 
−  There are many different ways to display and graphically visualize data. These graphical techniques facilitate the understanding of the dataset and enable the selection of an appropriate statistical methodology for the analysis of the data.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_EDA_Center | Measures of Central Tendency]]===

−  There are three main features of populations (or sample data) that are always critical in understanding and interpreting their distributions: '''Center''', '''Spread''' and '''Shape'''. The main measures of centrality are '''Mean''', '''Median''' and '''Mode(s)'''.
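As a minimal sketch of the three measures of centrality, the snippet below uses Python's standard-library `statistics` module. The `sample` values are made up for illustration and do not come from the EBook.

```python
import statistics

# Hypothetical sample (illustration only, not EBook data).
sample = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(sample)        # arithmetic average
median = statistics.median(sample)    # average of the two middle values when n is even
modes = statistics.multimode(sample)  # list of the most frequent value(s)

print(mean, median, modes)
```

Because `multimode` returns a list, it also covers samples with more than one mode.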
 
−   
−  ===[[AP_Statistics_Curriculum_2007_EDA_Var | Measures of Variation]]===
 
−  There are many measures of (population or sample) spread, e.g., the range, the variance, the standard deviation, mean absolute deviation, etc. These are used to assess the dispersion or variation in the population.
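The measures of spread listed above can be sketched in a few lines of Python. The data are hypothetical, and the variance shown is the sample version (dividing by n − 1), which is an assumption about which estimator is wanted.

```python
import statistics

# Hypothetical sample (illustration only).
sample = [2, 3, 3, 5, 7, 10]

data_range = max(sample) - min(sample)         # largest minus smallest value
sample_variance = statistics.variance(sample)  # sample variance, divides by n - 1
sample_sd = statistics.stdev(sample)           # square root of the sample variance
center = statistics.mean(sample)
mean_abs_dev = sum(abs(x - center) for x in sample) / len(sample)  # mean absolute deviation

print(data_range, sample_variance, sample_sd, mean_abs_dev)
```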
 
−   
−  ===[[AP_Statistics_Curriculum_2007_EDA_Shape | Measures of Shape]]===

−  The '''shape''' of a distribution can usually be determined by looking at a histogram of a (representative) sample from that population; [[AP_Statistics_Curriculum_2007_EDA_Pics | Frequency Plots, Dot Plots or Stem and Leaf Displays]] may be helpful.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_EDA_Statistics | Statistics]]===

−  Variables can be summarized using statistics, that is, functions of data samples.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_EDA_Plots | Graphs and Exploratory Data Analysis]]===
 
−  Graphical visualization and interrogation of data are critical components of any reliable method for statistical modeling, analysis and interpretation of data.
 
−   
−  ==Chapter III: Probability==
 
−  Probability is important in many studies and disciplines because measurements, observations and findings are often influenced by variation. In addition, probability theory provides the theoretical groundwork for statistical inference.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Prob_Basics | Fundamentals]]===
 
−  Some fundamental concepts of probability theory include random events, sampling, types of probabilities, event manipulations and axioms of probability.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Prob_Rules | Rules for Computing Probabilities]]===

−  There are many important rules for computing probabilities of composite events. These include conditional probability, statistical independence, multiplication and addition rules, the law of total probability and Bayes' rule.
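To illustrate how the law of total probability and Bayes' rule combine, here is a small sketch for a screening test. All three input probabilities are hypothetical numbers chosen for illustration.

```python
# Hypothetical screening-test numbers (assumptions for illustration only).
p_disease = 0.01            # P(D): prior probability of the condition
p_pos_given_disease = 0.95  # P(+ | D)
p_pos_given_healthy = 0.05  # P(+ | not D)

# Law of total probability: P(+) = P(+|D) P(D) + P(+|not D) P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: P(D | +) = P(+|D) P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(p_pos, p_disease_given_pos)
```

With these inputs the posterior stays well below one half even though the test is fairly accurate, because the condition is rare.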
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Prob_Simul | Probabilities Through Simulations]]===

−  Many experimental settings require probability computations of complex events. Such calculations may be carried out exactly, using theoretical models, or approximately, using estimation or simulations.
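A simple way to see the exact and simulated routes side by side is to estimate the chance that two fair dice sum to seven. The sketch below is one possible simulation; the function name, seed and trial count are arbitrary choices for illustration.

```python
import random

def estimate_sum_is_seven(trials, seed=0):
    """Estimate P(sum of two fair dice == 7) by repeated simulation."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / trials

exact = 6 / 36  # six favourable outcomes out of 36 equally likely pairs
estimate = estimate_sum_is_seven(100_000)
print(exact, estimate)
```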
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Prob_Count | Counting]]===

−  There are many useful counting principles (including permutations and combinations) to compute the number of ways that certain arrangements of objects can be formed. This allows counting-based estimation of probabilities of complex events.
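As a brief sketch, Python's `math.perm` and `math.comb` (Python 3.8 or later) compute permutations and combinations directly. The card-hand example is an illustration chosen here, not one taken from the EBook.

```python
import math

arrangements = math.perm(5, 3)  # ordered selections of 3 objects out of 5
selections = math.comb(5, 3)    # unordered selections of 3 objects out of 5

# Counting-based probability: exactly two aces in a 5-card hand
# drawn from a standard 52-card deck.
hands = math.comb(52, 5)
two_ace_hands = math.comb(4, 2) * math.comb(48, 3)
p_two_aces = two_ace_hands / hands

print(arrangements, selections, p_two_aces)
```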
 
−   
−  ==Chapter IV: Probability Distributions==
 
−  There are two basic types of processes that we observe in nature: '''Discrete''' and '''Continuous'''. We begin by discussing several important discrete random processes, emphasizing the different distributions, expectations, variances and applications. In the [[AP_Statistics_Curriculum_2007#Chapter_V:_Normal_Probability_Distribution | next chapter]], we will discuss their continuous counterparts. The complete list of all [[About_pages_for_SOCR_Distributions | SOCR Distributions is available here]].
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Distrib_RV | Random Variables]]===

−  To simplify the calculations of probabilities, we will define the concept of a '''random variable''', which will allow us to study various processes uniformly, with the same mathematical and computational techniques.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Distrib_MeanVar | Expectation (Mean) and Variance]]===

−  The expectation and the variance for any discrete random variable or process are important measures of [[AP_Statistics_Curriculum_2007#Measures_of_Central_Tendency | Centrality]] and [[AP_Statistics_Curriculum_2007#Measures_of_Variation | Dispersion]].
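For a concrete case, the expectation and variance of a fair six-sided die can be computed directly from its probability mass function. The sketch below uses exact fractions; the die is an illustrative choice.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die (illustrative example).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# E[X] = sum of x * P(X = x)
expectation = sum(x * p for x, p in pmf.items())

# Var(X) = sum of (x - E[X])^2 * P(X = x)
variance = sum((x - expectation) ** 2 * p for x, p in pmf.items())

print(expectation, variance)
```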
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Distrib_Binomial | Bernoulli and Binomial Experiments]]===
 
−  The Bernoulli and Binomial processes provide the simplest models for discrete random experiments.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Distrib_Dists | Geometric, Hypergeometric and Negative Binomial]]===

−  The Geometric, Hypergeometric and Negative Binomial distributions provide computational models for calculating probabilities for a large number of experiments and random variables. This section presents the theoretical foundations and the applications of each of these discrete distributions.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Distrib_Poisson | Poisson Distribution]]===

−  The Poisson distribution models many different discrete processes where the probability of the observed phenomenon is constant in time or space. The Poisson distribution may also be used as an approximation to the Binomial distribution.
 
−   
−  ==Chapter V: Normal Probability Distribution==
 
−  The Normal Distribution is perhaps the most important model for studying quantitative phenomena in the natural and behavioral sciences; this is due to the [[AP_Statistics_Curriculum_2007_Limits_CLT | Central Limit Theorem]]. Many numerical measurements (e.g., weight, time, etc.) can be well approximated by the normal distribution.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Normal_Std | The Standard Normal Distribution]]===

−  The Standard Normal Distribution is the simplest version (zero-mean, unit-standard-deviation) of the (General) Normal Distribution. Yet, it is perhaps the most frequently used version because many tables and computational resources are explicitly available for calculating probabilities.
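As one example of such a computational resource, Python's standard library (3.8 or later) provides `statistics.NormalDist`, whose default parameters give the standard normal. The cut-off values below are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # mean 0, standard deviation 1 by default

p_below = Z.cdf(1.96)                   # P(Z <= 1.96)
p_within_one_sd = Z.cdf(1) - Z.cdf(-1)  # P(-1 < Z < 1)

print(p_below, p_within_one_sd)
```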
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Normal_Prob | Nonstandard Normal Distribution: Finding Probabilities]]===
 
−  In practice, the mechanisms underlying natural phenomena may be unknown, yet the use of the normal model can be theoretically justified in many situations to compute critical and probability values for various processes.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Normal_Critical | Nonstandard Normal Distribution: Finding Scores (critical values)]]===

−  In addition to being able to compute probability (p) values, we often need to estimate the critical values of the Normal Distribution for a given p-value.
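Finding a critical value is the inverse of finding a probability. A minimal sketch with `statistics.NormalDist.inv_cdf` follows; the mean of 100 and standard deviation of 15 are hypothetical parameters.

```python
from statistics import NormalDist

# Critical value of the standard normal with 97.5% of the area below it.
z_crit = NormalDist().inv_cdf(0.975)

# Same question for a nonstandard normal (hypothetical mean and SD).
X = NormalDist(mu=100, sigma=15)
x_crit = X.inv_cdf(0.975)

# The two answers are linked by x = mu + sigma * z.
print(z_crit, x_crit, 100 + 15 * z_crit)
```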
 
−   
−  ==Chapter VI: Relations Between Distributions==
 
−  In this chapter, we will explore the relations between different distributions. This knowledge will help us in two ways:
 
−  * Some inter-distribution relations will enable us to compute difficult probabilities using reasonable approximations.

−  * They will help us identify appropriate probability models and graphical and statistical analysis tools for data interpretation.

−  The complete list of all [[About_pages_for_SOCR_Distributions | SOCR Distributions is available here]].
 
− 
 
−  ===[[AP_Statistics_Curriculum_2007_Limits_CLT | The Central Limit Theorem]]===

−  The exploration of the relation between different distributions begins with the study of the '''sampling distribution of the sample average'''. This will demonstrate the universally important role of the Normal distribution.
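The sampling distribution of the sample average can be explored by simulation. The sketch below draws repeated samples from a skewed (exponential) population and summarizes the resulting sample means; the population, sample size and seed are all illustrative assumptions.

```python
import random
import statistics

def sample_means(n, reps, seed=0):
    """Return `reps` sample averages, each from n Exponential(1) draws."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

n = 30
means = sample_means(n, reps=5000)

mean_of_means = statistics.fmean(means)  # should be near the population mean, 1
sd_of_means = statistics.stdev(means)    # should be near 1 / sqrt(n)
print(mean_of_means, sd_of_means, 1 / n ** 0.5)
```

A histogram of `means` would look far more symmetric and bell-shaped than the exponential population it was drawn from.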
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Limits_LLN | Law of Large Numbers]]===

−  Suppose we track the relative frequency of occurrence of an event whose probability of being observed in each experiment is ''p''. If we repeat the same experiment over and over, the ratio of the observed frequency of that event to the total number of repetitions converges towards ''p'' as the number of experiments increases. Why is that and why is this important?
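This convergence can be observed in a short simulation. In the sketch below, the value of ''p'', the numbers of repetitions and the seed are arbitrary illustrative choices.

```python
import random

def relative_frequency(p, n, seed=0):
    """Relative frequency of an event of probability p in n independent trials."""
    rng = random.Random(seed)
    successes = sum(rng.random() < p for _ in range(n))
    return successes / n

p = 0.3
freqs = {n: relative_frequency(p, n) for n in (10, 1_000, 100_000)}
for n, freq in freqs.items():
    print(n, freq, abs(freq - p))
```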
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Limits_Norm2Bin | Normal Distribution as Approximation to Binomial Distribution]]===

−  The Normal Distribution provides a valuable approximation to the Binomial when the sample sizes are large and the probabilities of success and failure are not close to zero.
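A quick numerical check of this approximation is sketched below for hypothetical values n = 100 and p = 0.5. The 0.5 added to the cut-off is the usual continuity correction, which is an assumption of this sketch and not something stated above.

```python
import math
from statistics import NormalDist

n, p, k = 100, 0.5, 55  # hypothetical Binomial parameters and cut-off

# Exact Binomial probability P(X <= k).
exact = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal approximation with matching mean and SD, plus continuity correction.
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p))).cdf(k + 0.5)

print(exact, approx)
```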
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Limits_Poisson2Bin | Poisson Approximation to Binomial Distribution]]===

−  The Poisson distribution provides an approximation to the Binomial Distribution when the sample sizes are large and the probability of success or failure is close to zero.
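The sketch below compares the two probability mass functions for a hypothetical large n and small p, using a Poisson rate of λ = np.

```python
import math

n, p = 1000, 0.003  # hypothetical: many trials, rare success
lam = n * p         # matching Poisson rate

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Largest pointwise gap between the two distributions over k = 0..10.
max_gap = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(11))
print(max_gap)
```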
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Limits_Bin2HyperG | Binomial Approximation to HyperGeometric]]===

−  The Binomial Distribution is much simpler to compute than the Hypergeometric, and can be used as an approximation when the population size is large (relative to the sample size) and the probability of success is not close to zero.
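As a rough illustration, the snippet below compares the two probabilities for a hypothetical population of 10,000 items containing 3,000 successes, sampled 10 at a time without replacement.

```python
import math

# Hypothetical setup: population N with K successes, sample of size n.
N, K, n, k = 10_000, 3_000, 10, 3

# Hypergeometric P(X = k): sampling without replacement.
hyper = math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# Binomial P(X = k) with p = K / N: sampling with replacement.
p = K / N
binom = math.comb(n, k) * p**k * (1 - p) ** (n - k)

print(hyper, binom)
```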
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Limits_Norm2Poisson | Normal Approximation to Poisson]]===

−  The Poisson can be approximated fairly well by the Normal Distribution when λ is large.
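The sketch below checks this for a hypothetical λ = 100 by comparing the exact Poisson cumulative probability with a Normal of mean λ and standard deviation √λ. The continuity correction of 0.5 is an assumption of this sketch.

```python
import math
from statistics import NormalDist

lam, k = 100, 110  # hypothetical rate and cut-off

# Exact P(X <= k), building each term from the previous one:
# P(X = i) = P(X = i - 1) * lam / i, starting from P(X = 0) = exp(-lam).
term = math.exp(-lam)
exact = term
for i in range(1, k + 1):
    term *= lam / i
    exact += term

approx = NormalDist(mu=lam, sigma=math.sqrt(lam)).cdf(k + 0.5)
print(exact, approx)
```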
 
−   
−  ==Chapter VII: Point and Interval Estimates==
 
−  Estimation of population parameters is critical in many applications. Estimation is most frequently carried out in terms of point estimates or interval (range) estimates for population parameters that are of interest.
 
− 
 
−  ===[[AP_Statistics_Curriculum_2007_Estim_L_Mean | Estimating a Population Mean: Large Samples]]===

−  This section discusses how to find point and interval estimates when the sample sizes are large.
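As a minimal sketch of a large-sample interval estimate, the snippet below computes a 95% confidence interval of the form x̄ ± z* · s/√n. The summary statistics are hypothetical, and using a Normal critical value assumes the sample is large enough for that approximation.

```python
import math
from statistics import NormalDist

# Hypothetical summary statistics (illustration only).
x_bar, s, n = 52.0, 8.0, 64

z_star = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence
margin = z_star * s / math.sqrt(n)    # margin of error

ci = (x_bar - margin, x_bar + margin)  # x_bar is the point estimate
print(ci)
```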
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Estim_S_Mean | Estimating a Population Mean: Small Samples]]===

−  Next, we discuss point and interval estimates when the sample sizes are small. Naturally, the point estimates are less precise and the interval estimates produce wider intervals, compared to the case of large samples.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_StudentsT | Student's T distribution]]===

−  The Student's T-Distribution arises in the problem of estimating the mean of a normally distributed population when the sample size is small and the population variance is unknown.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Estim_Proportion | Estimating a Population Proportion]]===

−  The Normal Distribution is an appropriate model for proportions when the sample size is large enough. In this section we demonstrate how to obtain point and interval estimates for a population proportion.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Estim_Var | Estimating a Population Variance]]===
 
−  In many processes and experiments, controlling the amount of variance is of critical importance. Thus the ability to assess variation, using point and interval estimates, facilitates our ability to make inference, revise manufacturing protocols, improve clinical trials, etc.
 
−   
−  ==Chapter VIII: Hypothesis Testing==
 
−  Hypothesis Testing is a statistical technique for decision making regarding populations or processes based on experimental data. It quantifies the possibility that chance alone might be responsible for the observed discrepancy between a theoretical model and the empirical observations.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Hypothesis_Basics | Fundamentals of Hypothesis Testing]]===

−  In this section, we define the core terminology necessary to discuss Hypothesis Testing (Null and Alternative Hypotheses, Type I and II errors, Sensitivity, Specificity, Statistical Power, etc.).
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Hypothesis_L_Mean | Testing a Claim about a Mean: Large Samples]]===

−  Having already seen how to construct point and interval estimates for the population mean in the large-sample case, we now show how to do hypothesis testing in the same situation.
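A large-sample test of a claim about a mean can be sketched as a z-test. The null value and summary statistics below are hypothetical, and the two-sided alternative is an assumption of this example.

```python
import math
from statistics import NormalDist

# Hypothetical data summary and null-hypothesis mean (illustration only).
x_bar, mu0, s, n = 52.0, 50.0, 8.0, 64

# Test statistic: how many standard errors x_bar lies from mu0.
z = (x_bar - mu0) / (s / math.sqrt(n))

# Two-sided p-value under the standard normal reference distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(z, p_value)
```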
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Hypothesis_S_Mean | Testing a Claim about a Mean: Small Samples]]===

−  We continue with the discussion on inference for the population mean for small samples.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Hypothesis_Proportion | Testing a Claim about a Proportion]]===

−  When the sample size is large, the sampling distribution of the sample proportion <math>\hat{p}</math> is approximately Normal, by the [[AP_Statistics_Curriculum_2007_Limits_CLT | CLT]]. This helps us formulate hypothesis testing protocols and compute the appropriate statistics and p-values to assess significance.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Hypothesis_Var | Testing a Claim about a Standard Deviation or Variance]]===

−  Significance Testing for the variation or the standard deviation of a process, natural phenomenon or an experiment is of paramount importance in many fields. This section provides the details for formulating testable hypotheses, computation and inference on assessing variation.
 
−   
−  ==Chapter IX: Inferences from Two Samples==
 
−  In this chapter, we continue our pursuit and study of significance testing in the case of having two populations. This expands the possible applications of one-sample hypothesis testing we saw in the [[EBook#Chapter_VIII:_Hypothesis_Testing | previous chapter]].
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Infer_2Means_Dep | Inferences about Two Means: Dependent Samples]]===

−  We need to clearly identify whether samples we compare are dependent or independent in all study designs. In this section, we discuss one specific dependent-samples case: paired samples.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Infer_2Means_Indep | Inferences about Two Means: Independent Samples]]===
 
−  Independent samples designs refer to experiments or observations where all measurements are individually independent from each other within their groups and the groups are independent. In this section we discuss inference based on independent samples.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Infer_BiVar | Comparing Two Variances]]===
 
−  In this section we compare variances (or standard deviations) of two populations using randomly sampled data.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Infer_2Proportions | Inferences about Two Proportions]]===
 
−  This section presents the significance testing and inference on equality of proportions from two independent populations.
 
−   
−  ==Chapter X: Correlation and Regression==
 
−  Many scientific applications involve the analysis of relationships between two or more variables involved in a process of interest. We begin with the simplest of all situations where bivariate data (X and Y) are measured for a process and we are interested in determining the association, relation or an appropriate model for these observations (e.g., fitting a straight line to the pairs of (X,Y) data).
 
−   
−  ===[[AP_Statistics_Curriculum_2007_GLM_Corr | Correlation]]===
 
−  The correlation between X and Y represents the first bivariate model of association which may be used to make predictions.
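As a small sketch, the Pearson correlation coefficient can be computed from its definition, r = S_xy / √(S_xx · S_yy). The paired values below are hypothetical.

```python
import math

# Hypothetical paired observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sums of squares and cross-products about the means.
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

r = s_xy / math.sqrt(s_xx * s_yy)  # Pearson correlation, always in [-1, 1]
print(r)
```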
 
−   
−  ===[[AP_Statistics_Curriculum_2007_GLM_Regress | Regression]]===
 
−  We are now ready to discuss the modeling of linear relations between two variables using regression analysis. This section demonstrates this methodology for the SOCR California Earthquake dataset.
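A least-squares line can be fitted with two formulas: slope = S_xy / S_xx and intercept = ȳ − slope · x̄. The sketch below uses small hypothetical data, not the SOCR California Earthquake dataset.

```python
# Hypothetical paired observations (NOT the SOCR earthquake data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)

slope = s_xy / s_xx                  # least-squares slope
intercept = mean_y - slope * mean_x  # the fitted line passes through (mean_x, mean_y)

predicted = [intercept + slope * a for a in x]
print(slope, intercept, predicted)
```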
 
−   
−  ===[[AP_Statistics_Curriculum_2007_GLM_Predict | Variation and Prediction Intervals]]===
 
−  In this section, we discuss point and interval estimates about the slope of linear models.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_GLM_MultLin | Multiple Regression]]===
 
−  Now we are interested in determining linear regressions, multilinear models, of the relationships between one dependent variable Y and many independent variables <math>X_i</math>.
 
−   
−  ==Chapter XI: Analysis of Variance (ANOVA)==
 
−  ===[[AP_Statistics_Curriculum_2007_ANOVA_1Way | One-Way ANOVA]]===
 
−  We now expand our inference methods to study and compare k independent samples. In this case, we will be decomposing the entire variation in the data into independent components.
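The decomposition of variation can be verified numerically: the total sum of squares equals the between-group plus the within-group sum of squares. The three groups below are hypothetical.

```python
# Hypothetical data: k = 3 independent samples (illustration only).
groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)

ss_total = sum((x - grand_mean) ** 2 for x in all_values)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

k, n = len(groups), len(all_values)
# F statistic: between-group mean square over within-group mean square.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print(ss_total, ss_between, ss_within, f_stat)
```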
 
−   
−  ===[[AP_Statistics_Curriculum_2007_ANOVA_2Way | Two-Way ANOVA]]===

−  Now we focus on decomposing the variance of a dataset into (independent/orthogonal) components when we have two (grouping) factors. This procedure is called Two-Way Analysis of Variance.
 
−   
−  ==Chapter XII: Non-Parametric Inference==

−  To be valid, many statistical methods impose (parametric) requirements about the format, parameters and distributions of the data to be analysed. For instance, the [[AP_Statistics_Curriculum_2007_Infer_2Means_Indep | independent T-test]] requires that the distributions of the two samples are Normal. Non-parametric (distribution-free) statistical methods do not make such assumptions and are often useful in practice, albeit [[AP_Statistics_Curriculum_2007_Hypothesis_Basics | less powerful]].
 
−   
−  ===[[AP_Statistics_Curriculum_2007_NonParam_2MedianPair | Differences of Medians (Centers) of Two Paired Samples]]===

−  The '''sign test''' and the '''Wilcoxon signed rank test''' are the simplest non-parametric tests which are also alternatives to the [[AP_Statistics_Curriculum_2007_Infer_2Means_Dep | one-sample and paired T-test]]. These tests are applicable for paired designs where the data need not be Normally distributed.
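A sign test needs only the signs of the paired differences. The sketch below computes a two-sided p-value from the Binomial(n, 1/2) distribution; the differences are hypothetical, and dropping zero differences is a common convention assumed here.

```python
import math

# Hypothetical paired differences (after minus before), illustration only.
diffs = [1.2, 0.5, -0.3, 2.1, 0.8, 1.5, -0.1, 0.9, 1.1, 0.4]

nonzero = [d for d in diffs if d != 0]  # zero differences are dropped
n = len(nonzero)
n_pos = sum(d > 0 for d in nonzero)

# Under the null hypothesis the number of positive signs is Binomial(n, 1/2).
k = max(n_pos, n - n_pos)
tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n
p_value = min(1.0, 2 * tail)  # two-sided p-value

print(n_pos, n, p_value)
```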
 
−   
−  ===[[AP_Statistics_Curriculum_2007_NonParam_2MedianIndep | Differences of Medians (Centers) of Two Independent Samples]]===

−  The Wilcoxon-Mann-Whitney (WMW) Test (also known as Mann-Whitney U test, Mann-Whitney-Wilcoxon test, or Wilcoxon rank-sum test) is a non-parametric test for assessing whether two samples come from the same distribution.
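The statistic behind this test can be sketched by direct pair counting: U for the first sample is the number of pairs in which its value exceeds a value from the second sample, with ties counted as one half. The function name and data below are hypothetical, and the sketch stops at the statistic without computing a p-value.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) by comparing every x value with every y value."""
    u_x = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )
    u_y = len(x) * len(y) - u_x  # the two statistics always sum to len(x) * len(y)
    return u_x, u_y

# Hypothetical independent samples (illustration only).
sample_x = [1.1, 2.3, 3.5]
sample_y = [2.0, 4.1, 5.2, 6.3]

u_x, u_y = mann_whitney_u(sample_x, sample_y)
print(u_x, u_y)
```

A value of U far from half the number of pairs suggests that one sample tends to be larger than the other.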
 
−   
−  ===[[AP_Statistics_Curriculum_2007_NonParam_2PropIndep | Differences of Proportions of Two Samples]]===

−  Depending upon whether the samples are dependent or independent, we use different statistical tests.
 
−   
−  ===[[AP_Statistics_Curriculum_2007_NonParam_ANOVA | Differences of Means of Several Independent Samples]]===
 
−  Overview TBD
 
−   
−  ===[[AP_Statistics_Curriculum_2007_NonParam_VarIndep | Differences of Variances of Two Independent Samples]]===
 
−  Overview TBD
 
−   
−  ==Chapter XIII: Multinomial Experiments and Contingency Tables==
 
−  ===[[AP_Statistics_Curriculum_2007_Contingency_Fit | Multinomial Experiments: Goodness-of-Fit]]===
 
−  Overview TBD
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Contingency_Indep | Contingency Tables: Independence and Homogeneity]]===
 
−  Overview TBD
 
−   
−  ==Chapter XIV: Statistical Process Control==
 
−  ===[[AP_Statistics_Curriculum_2007_Control_MeanVar | Control Charts for Variation and Mean]]===
 
−  Overview TBD
 
−   
−  ===[[AP_Statistics_Curriculum_2007_Control_Attrib | Control Charts for Attributes]]===
 
−  Overview TBD
 
−   
−  ==Chapter XV: Survival/Failure Analysis==
 
−  Overview TBD
 
−   
−  ==Chapter XVI: Multivariate Statistical Analyses==
 
−  ===[[AP_Statistics_Curriculum_2007_MultiVar_ANOVA | Multivariate Analysis of Variance]]===
 
−  Overview TBD
 
−   
−  ===[[AP_Statistics_Curriculum_2007_MultiVar_LinRegression | Multiple Linear Regression]]===
 
−  Overview TBD
 
−   
−  ===[[AP_Statistics_Curriculum_2007_MultiVar_Logistic | Logistic Regression]]===
 
−  Overview TBD
 
−   
−  ===[[AP_Statistics_Curriculum_2007_MultiVar_LogLinear | Log-Linear Regression]]===
 
−  Overview TBD
 
−   
−  ===[[AP_Statistics_Curriculum_2007_MultiVar_ANCOVA | Multivariate Analysis of Covariance]]===
 
−  Overview TBD
 
−   
−  ==Chapter XVII: Time Series Analysis==
 
−  Overview TBD
 
−   
−   
−  <hr>
 
−  * SOCR Home page: http://www.socr.ucla.edu
 
−   
−  {{translate|pageName=http://wiki.stat.ucla.edu/socr/index.php?title=AP_Statistics_Curriculum_2007}}
 