Difference between revisions of "Scientific Methods for Health Sciences"

Latest revision as of 14:12, 31 January 2026

This SMHS Ebook is coupled with the supporting Interactive SDA Curriculum and the Dynamic Health & Data Analytics Learning Modules

SOCR Wiki: Scientific Methods for Health Sciences

Scientific Methods for Health Sciences EBook

Preface

The Scientific Methods for Health Sciences (SMHS) EBook (ISBN: 978-0-9829949-1-7) is designed to support a 4-course training curriculum emphasizing the fundamentals, applications and practice of scientific methods specifically for graduate students in the health sciences.

Format

Follow the instructions on this page to expand, revise or improve the materials in this EBook.

Learning and Instructional Usage

This section describes the means of traversing, searching, discovering and utilizing the SMHS EBook resources in both formal and informal learning settings.

Copyrights

The SMHS EBook is a freely and openly accessible electronic book developed by SOCR and the general health sciences community.

Chapter I: Fundamentals

Exploratory Data Analysis, Plots and Charts

Review of data types, exploratory data analyses and graphical representation of information.

Ubiquitous Variation

There are many ways to quantify variability, which is present in all natural processes.

Parametric Inference

Foundations of parametric (model-based) statistical inference.

Probability Theory

Random variables, stochastic processes, and events are the core concepts necessary to define likelihoods of certain outcomes or results to be observed. We define event manipulations and present the fundamental principles of probability theory including conditional probability, total and Bayesian probability laws, and various combinatorial ideas.

Odds Ratio/Relative Risk

The relative risk, RR (a measure of dependence comparing two probabilities in terms of their ratio), and the odds ratio, OR (a ratio of two odds, where each odds is the fraction of a probability and its complement), are widely applicable in many healthcare studies.
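As an illustration, the following minimal Python sketch computes both measures from a 2x2 table. The function names and the counts are hypothetical and are not taken from the EBook; the cell layout (a, b, c, d) is an assumption stated in the comments.

```python
def relative_risk(a, b, c, d):
    """Risk of the outcome in the exposed group divided by the risk in the unexposed group.

    Assumed 2x2 layout (hypothetical naming):
      a = exposed with outcome,   b = exposed without outcome,
      c = unexposed with outcome, d = unexposed without outcome.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds of the outcome in the exposed group divided by the odds in the unexposed group."""
    return (a / b) / (c / d)


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed subjects have the outcome.
rr = relative_risk(20, 80, 10, 90)   # (20/100) / (10/100)
or_ = odds_ratio(20, 80, 10, 90)     # (20/80) / (10/90)
print(f"RR = {rr:.2f}, OR = {or_:.2f}")
```

With these counts the two measures differ (RR is 2, OR is 2.25), which shows why they should not be used interchangeably unless the outcome is rare.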

Centrality, Variability and Shape

Three main features of sample data are commonly reported as critical in understanding and interpreting the population, or process, the data represents. These include Center, Spread and Shape. The main measures of centrality are Mean, Median and Mode(s). Common measures of variability include the range, the variance, the standard deviation, and mean absolute deviation. The shape of a (sample or population) distribution is an important characterization of the process and its intrinsic properties.
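The measures of center and spread listed above can be sketched with Python's standard `statistics` module. The data values are hypothetical, and the mean absolute deviation is computed around the mean, which is one of several common conventions.

```python
import statistics

data = [2, 3, 3, 4, 5, 7, 11]  # hypothetical sample

# Measures of centrality
mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)      # list of all most-frequent values

# Measures of variability
data_range = max(data) - min(data)
variance = statistics.variance(data)    # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)        # square root of the sample variance
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

print(mean, median, modes, data_range, variance, std_dev, mean_abs_dev)
```

In this sample the mean (5) exceeds the median (4), a simple hint that the distribution is skewed to the right.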

Probability Distributions

Probability distributions are mathematical models for processes that we observe in nature. Although there are different types of distributions, they have common features and properties that make them useful in various scientific applications. This section presents the Bernoulli, Binomial, Multinomial, Geometric, Hypergeometric, Negative binomial, Negative multinomial, Poisson, and Normal distributions, as well as the concept of the moment generating function.

Resampling and Simulation

Resampling is a technique for estimation of sample statistics (e.g., medians, percentiles) by using subsets of the available data or by randomly drawing from the data with replacement. Simulation is a computational technique that imitates the behavior of a real-world process or system over time, without waiting for it to happen by chance.
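One common resampling scheme is the bootstrap. The sketch below, a minimal example rather than the EBook's own implementation, estimates a percentile interval for the median by drawing samples with replacement; the function name, the data, the number of resamples and the seed are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_median_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median (illustrative sketch)."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    n = len(data)
    # Draw n values with replacement, record the median, and repeat.
    medians = sorted(
        statistics.median(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


sample = [2.1, 3.4, 3.9, 4.0, 4.2, 4.8, 5.5, 6.1, 7.3, 9.0]  # hypothetical data
low, high = bootstrap_median_ci(sample)
print(f"95% bootstrap interval for the median: [{low}, {high}]")
```

The interval width reflects how much the median would vary across repeated samples of the same size, using only the observed data.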

Design of Experiments

Design of experiments (DOE) is a technique for systematic and rigorous problem solving that applies data collection principles to ensure the generation of valid, supportable and reproducible conclusions.

Intro to Epidemiology

Epidemiology is the study of the distribution and determinants of disease frequency in human populations. This section presents the basic epidemiology concepts. More advanced epidemiological methodologies are discussed in the next chapter. This section also presents the Positive and Negative Predictive Values (PPV/NPV).
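The predictive values depend on disease prevalence as well as on the test itself. The following sketch derives PPV and NPV from sensitivity, specificity and prevalence; the function name and the numeric inputs are hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from test characteristics and disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)   # P(disease | positive test)
    npv = true_neg / (true_neg + false_neg)   # P(no disease | negative test)
    return ppv, npv


# Hypothetical test: 90% sensitivity, 95% specificity, 1% prevalence.
ppv, npv = predictive_values(0.90, 0.95, 0.01)
print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")
```

Under these assumed inputs the PPV is only about 0.15 even though the test is fairly accurate, because true positives are outnumbered by false positives when the disease is rare.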

Experiments vs. Observational Studies

Experimental and observational studies have different characteristics and are useful in complementary investigations of association and causality.

Estimation

Estimation is a method of using sample data to approximate the values of specific population parameters of interest, such as the population mean, variability or 97th percentile. Estimated parameters are expected to be interpretable, accurate and optimal, in some form.

Hypothesis Testing

Hypothesis testing is a quantitative decision-making technique for examining the characteristics (e.g., centrality, span) of populations or processes based on observed experimental data. In this section we discuss inference about a mean, mean differences (both small and large samples), a proportion or differences of proportions and differences of variances.

Statistical Power, Sensitivity and Specificity

The fundamental concepts of type I (false-positive) and type II (false-negative) errors lead to the important study-specific notions of statistical power, sample size, effect size, sensitivity and specificity.

Data Management

All modern data-driven scientific inquiries demand deep understanding of tabular, ASCII, binary, streaming, and cloud data management, processing and interpretation.

Bias and Precision

Bias and precision are two important and complementary characteristics of estimated parameters that quantify the accuracy and variability of approximated quantities.

Association and Causality

An association is a relationship between two, or more, measured quantities that renders them statistically dependent so that the occurrence of one does affect the probability of the other. A causal relation is a specific type of association between an event (the cause) and a second event (the effect) that is considered to be a consequence of the first event.

Rate-of-change

Rate of change is a technical indicator describing the rate at which one quantity changes in relation to another quantity.

Clinical vs. Statistical Significance

Statistical significance addresses the question of whether or not the results of a statistical test meet an accepted quantitative criterion, whereas clinical significance answers the question of whether the observed difference between two treatments (e.g., new and old therapy) found in the study is large enough to alter clinical practice.

Correction for Multiple Testing

Multiple testing refers to analytical protocols involving testing of several (typically more than two) hypotheses. Multiple testing studies require correction for the type I error (false-positive) rate, which can be done using Bonferroni's method or Tukey’s procedure to control the family-wise error rate (FWER), or using procedures that control the false discovery rate (FDR).
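To make the difference between the two kinds of correction concrete, here is a minimal sketch of the Bonferroni method and of the Benjamini-Hochberg step-up procedure, a standard FDR-controlling method that is swapped in here for illustration. The function names and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis whose rank is at most max_rank.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            reject[i] = True
    return reject


p = [0.001, 0.012, 0.025, 0.041, 0.20]  # hypothetical p-values
print(bonferroni(p))          # compares each p-value with 0.05 / 5 = 0.01
print(benjamini_hochberg(p))  # compares ranked p-values with 0.01, 0.02, ...
```

On these values Bonferroni rejects only the first hypothesis, while Benjamini-Hochberg rejects the first three, reflecting the less stringent error criterion of FDR control.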

Chapter II: Applied Inference

Epidemiology

This section expands the Epidemiology Introduction from the previous chapter. Here we will discuss numbers needed to treat and various likelihoods related to genetic association studies, including linkage and association, LOD scores and Hardy-Weinberg equilibrium.

Correlation and Regression (ρ and slope inference, 1-2 samples)

Studies of correlations between two, or more, variables and regression modeling are important in many scientific inquiries. The simplest such situation is exploring the association and correlation of bivariate data ($X$ and $Y$).

ROC Curve

The receiver operating characteristic (ROC) curve is a graphical tool for investigating the performance of a binary classifier system as its discrimination threshold varies. We also discuss the concepts of positive and negative predictive values.

ANOVA

Analysis of Variance (ANOVA) is a statistical method for examining the differences between group means. ANOVA is a generalization of the t-test for more than 2 groups. It splits the observed variance into components attributed to different sources of variation.

Non-parametric inference

Nonparametric inference involves a class of methods for descriptive and inferential statistics that are not based on parametrized families of probability distributions, which is the basis of the parametric inference we discussed earlier. This section presents the Sign test, Wilcoxon Signed Rank test, Wilcoxon-Mann-Whitney test, the McNemar test, the Kruskal-Wallis test, and the Fligner-Killeen test.

Instrument Performance Evaluation: Cronbach's α

Cronbach’s alpha (α) is a measure of internal consistency used to estimate the reliability of a cumulative psychometric test.
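A minimal sketch of the usual formula, α = k/(k-1) · (1 - Σ item variances / variance of total scores), is shown below. The function name and the item scores are hypothetical and serve only as an illustration.

```python
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    items: list of k lists, where items[j][i] is the score of respondent i on item j.
    """
    k = len(items)
    item_variances = [statistics.variance(scores) for scores in items]
    # Total score of each respondent across all k items.
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical 3-item instrument answered by 5 respondents.
scores = [
    [2, 4, 3, 5, 4],
    [3, 5, 3, 4, 5],
    [2, 5, 4, 5, 3],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Values closer to 1 indicate that the items vary together, that is, higher internal consistency.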

Measurement Reliability and Validity

Measures of Validity include: Construct validity (extent to which the operation actually measures what the theory intends to), Content validity (the extent to which the content of the test matches the content associated with the construct), Criterion validity (the correlation between the test and a variable representative of the construct), experimental validity (validity of design of experimental research studies). Similarly, there are many alternate strategies to assess instrument Reliability (or repeatability) -- test-retest reliability, administering different versions of an assessment tool to the same group of individuals, inter-rater reliability, internal consistency reliability.

Survival Analysis

Survival analysis is used for analyzing longitudinal data on the occurrence of events (e.g., death, injury, onset of illness, recovery from illness). In this section we discuss data structure, survival/hazard functions, parametric versus semi-parametric regression techniques and an introduction to Kaplan-Meier methods (non-parametric).

Decision Theory

Decision theory helps determine the optimal course of action among a number of alternatives, when consequences cannot be forecasted with certainty. There are different types of loss-functions and decision principles (e.g., frequentist vs. Bayesian).

CLT/LLNs – limiting results and misconceptions

The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are the first and second fundamental laws of probability. The CLT states that, under certain conditions, the arithmetic mean of a sufficiently large number of independent random variables is approximately normally distributed. The LLN states that when the same experiment is performed a large number of times, the average of the results should be close to the expected value, and tends to get closer as the number of trials increases.
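Both limiting results can be explored by simulation. The sketch below uses rolls of a fair die (expected value 3.5, variance 35/12) as an assumed example; the sample sizes, the number of repetitions and the seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible


def sample_mean(n):
    """Mean of n simulated rolls of a fair six-sided die."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n)])


# Law of Large Numbers: the sample mean approaches 3.5 as n grows.
small = sample_mean(10)
large = sample_mean(100_000)

# Central Limit Theorem: means of n = 50 rolls are approximately normal,
# centered at 3.5 with standard deviation sqrt(35/12) / sqrt(50).
means = [sample_mean(50) for _ in range(2000)]
clt_mean = statistics.fmean(means)
clt_sd = statistics.stdev(means)
theory_sd = (35 / 12) ** 0.5 / 50 ** 0.5

print(f"mean of 10 rolls: {small:.3f}, mean of 100000 rolls: {large:.3f}")
print(f"sd of sample means: {clt_sd:.3f} (theory: {theory_sd:.3f})")
```

A histogram of `means` would show the familiar bell shape even though a single die roll is uniformly distributed.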

Association Tests

There are alternative methods to measure association between two quantities (e.g., relative risk, risk ratio, efficacy, prevalence ratio). This section also includes details on Chi-square tests for association and goodness-of-fit, Fisher’s exact test, randomized controlled trials (RCT), and external and internal validity.

Bayesian Inference

Bayes’ rule connects the theories of conditional and compound probability and provides a way to update probability estimates for a hypothesis as additional evidence is observed.

PCA/ICA/Factor Analysis

Principal component analysis is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through an orthogonal transformation. Independent component analysis is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other. Factor analysis is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables.

Point/Interval Estimation (CI) – MoM, MLE

Estimation of population parameters is critical in many applications. In statistics, estimation is commonly accomplished in terms of point-estimates or interval-estimates for specific (unknown) population parameters of interest. The method of moments (MOM) and maximum likelihood estimation (MLE) techniques are used frequently in practice. In this section, we also lay the foundations for expectation maximization and Gaussian mixture modeling.

Study/Research Critiques

The scientific rigor in published literature, grant proposals and general reports needs to be assessed and scrutinized to minimize errors in data extraction and meta-analysis. Reporting biases present significant obstacles to collecting of relevant information on the effectiveness of an intervention, strength of relations between variables, or causal associations.

Common mistakes and misconceptions in using probability and statistics, identifying potential assumption violations, and avoiding them

Chapter III: Linear Modeling

Multiple Linear Regression (MLR)

Multiple Linear Regression encapsulates a family of statistical analyses for modeling single or multiple independent variables and one dependent variable. MLR computationally estimates all of the effects of each independent variable (coefficients) based on the data using least squares fitting.

Generalized Linear Modeling (GLM)

Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear multivariate regression, which allows for response variables that have error distribution models other than a normal distribution. GLM unifies statistical models like linear regression, logistic regression and Poisson regression.

Analysis of Covariance (ANCOVA)

Analysis of Variance (ANOVA) is a common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression to evaluate whether population means of a dependent variable are equal across levels of a categorical independent variable, while statistically controlling for the effects of other continuous variables.

Multivariate Analysis of Variance (MANOVA)

A generalized form of ANOVA is the multivariate analysis of variance (MANOVA), which is a statistical procedure for comparing multivariate means of several groups.

Multivariate Analysis of Covariance (MANCOVA)

Similar to MANOVA, the multivariate analysis of covariance (MANCOVA) is an extension of ANCOVA that is designed for cases where there is more than one dependent variable and when a control of concomitant continuous independent variables is present.

Repeated measures Analysis of Variance (rANOVA)

Repeated measures are used in situations when the same objects/units/entities take part in all conditions of an experiment. Given that there are multiple measures on the same subject, we have to control for the correlation between them. Repeated measures ANOVA (rANOVA) is the equivalent of the one-way ANOVA, but for related, not independent, groups. It is also referred to as within-subject ANOVA or ANOVA for correlated samples.

Partial Correlation

Partial correlation measures the degree of association between two random variables while controlling for the effects of certain other factors or variables.

Time Series Analysis

Time series data is a sequence of data points measured at successive points in time. Time series analysis is a technique used in a variety of studies involving temporal measurements and tracking metrics.

Fixed, Randomized and Mixed Effect Models

Fixed effect models are statistical models that represent the observed quantities in terms of explanatory variables (covariates) treated as non-random, while random effect models assume that the dataset being analyzed consists of a hierarchy of different populations whose differences relate to that hierarchy. Mixed effect models consist of both fixed effects and random effects. For random effect models and mixed models, either all or part of the explanatory variables are treated as if they arise from random causes.

Hierarchical Linear Models (HLM)

Hierarchical linear models (also called multilevel models) are statistical models of parameters that vary at more than one level. These are generalizations of linear models and are widely applied in various studies, especially for research designs where data for participants are organized at more than one level.

Multi-Model Inference

Multi-Model Inference involves model selection of a relationship between $Y$ (response) and predictors $X_1, X_2, ..., X_n$ that is simple, effective and retains good predictive power, as measured by the SSE, AIC or BIC.
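As a hedged illustration of how these criteria trade off fit against complexity, the sketch below uses the least-squares forms AIC = n·ln(SSE/n) + 2k and BIC = n·ln(SSE/n) + k·ln(n), which hold up to an additive constant under the assumption of Gaussian errors. The function name and the two candidate models are hypothetical.

```python
import math


def aic_bic_from_sse(sse, n, k):
    """AIC and BIC for a least-squares fit, up to an additive constant.

    Assumes Gaussian errors; sse = sum of squared errors,
    n = number of observations, k = number of estimated parameters.
    """
    fit_term = n * math.log(sse / n)
    aic = fit_term + 2 * k
    bic = fit_term + k * math.log(n)
    return aic, bic


# Hypothetical candidates fitted to the same n = 100 observations.
aic_a, bic_a = aic_bic_from_sse(sse=250.0, n=100, k=2)  # simpler model
aic_b, bic_b = aic_bic_from_sse(sse=240.0, n=100, k=5)  # larger model, slightly lower SSE
print(f"Model A: AIC = {aic_a:.2f}, BIC = {bic_a:.2f}")
print(f"Model B: AIC = {aic_b:.2f}, BIC = {bic_b:.2f}")
```

Lower values are preferred. Here the small reduction in SSE does not offset the penalty for three extra parameters, so both criteria favor the simpler model.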

Mixture Modeling

Mixture modeling is a probabilistic modeling technique for representing the presence of sub-populations within an overall population, without requiring that an observed data set identifies the sub-population to which an individual observation belongs.

Surveys

Survey methodologies involve data collection using questionnaires designed to improve the number of responses and the reliability of the responses in the surveys. The ultimate goal is to make statistical inferences about the population, which would depend strongly on the survey questions provided. The commonly used survey methods include polls, public health surveys, market research surveys, censuses and so on.

Longitudinal Data

Longitudinal data represent data collected from a population over a given time period where the same subjects are measured at multiple points in time. Longitudinal data analyses are widely used statistical techniques in many health science fields.

Generalized Estimating Equations (GEE) Models

Generalized estimating equations (GEE) are a method for parameter estimation when fitting generalized linear models with a possibly unknown correlation between outcomes. GEE provides a general approach for analyzing discrete and continuous responses with marginal models and is a popular alternative to maximum likelihood estimation (MLE).

Model Fitting and Model Quality (KS-test)

The Kolmogorov-Smirnov Test (K-S test) is a nonparametric test commonly applied to test for the equality of continuous, one-dimensional probability distributions. This test can be used to compare one sample against a reference probability distribution (one-sample K-S test) or to compare two samples (two-sample K-S test).
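The two-sample K-S statistic is the largest absolute difference between the two empirical cumulative distribution functions (ECDFs). The following sketch computes only the statistic, not the p-value; the function name and the data are hypothetical.

```python
from bisect import bisect_right


def ks_two_sample_statistic(x, y):
    """Maximum absolute difference between the ECDFs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both ECDFs are step functions that jump only at observed values,
    # so it is enough to compare them at every observed value.
    for v in xs + ys:
        ecdf_x = bisect_right(xs, v) / n   # fraction of x values <= v
        ecdf_y = bisect_right(ys, v) / m   # fraction of y values <= v
        d = max(d, abs(ecdf_x - ecdf_y))
    return d


# Hypothetical samples
d_shifted = ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_same = ks_two_sample_statistic([1, 2, 3, 4], [1, 2, 3, 4])
print(d_shifted, d_same)
```

A statistic near 0 indicates similar empirical distributions, and a value near 1 indicates nearly non-overlapping samples; a full test would compare the statistic against its null distribution.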

Chapter IV: Special Topics

Data Simulation

This section demonstrates the core principles of simulating multivariate datasets.

Linear Modeling

This section is a review of linear modeling.

Scientific Visualization

This section discusses how and why we should "look" at data.

Methods for Studying Heterogeneity of Treatment Effects, Case-Studies of Comparative Effectiveness Research

This section discusses methods for studying heterogeneity of treatment effects and case-studies of comparative effectiveness research.

Big-Data/Big-Science

This section discusses structural equation modeling and generalized estimating equation modeling. Furthermore, it discusses statistical validation, cross validation, classification, and prediction.

Missing data

Many research studies encounter incomplete (missing) data that require special handling (e.g., teleprocessing, statistical analysis, visualization). There are a variety of methods (e.g., multiple imputation) to deal with missing data, detect missingness, impute the data, analyze the completed dataset and compare the characteristics of the raw and imputed data.

Genotype-Environment-Phenotype associations

Medical imaging

Data Networks

Adaptive Clinical Trials

Databases/registries

Meta-analyses

Causality/Causal Inference, SEM

Classification methods

Time-Series Analysis

In this section, we will discuss Time Series Analysis, which represents a class of statistical methods applicable for series data aiming to extract meaningful information, trend and characterization of the process using observed longitudinal data.

Scientific Validation

Geographic Information Systems (GIS)

Rasch measurement model/analysis

MCMC sampling for Bayesian inference

Network Analysis

References


@@ Line 1: / Line 1: @@
-'''The Scientific Methods for Health Sciences EBook is still under active development. It is expected to be complete by Sept 01, 2014, when this banner will be removed.'''
+{| cellspacing="5" cellpadding="0" style="margin:0em 0em 1em 0em; border:1px solid #1DA0E7; background:#B3DDF4;width:100%"
+| [https://sda.statisticalcomputing.org/learning This SMHS Ebook is coupled with the supporting Interactive SDA Curriculum and the Dynamic Health & Data Analytics Learning Modules]
+|}
 == [[Main_Page | SOCR Wiki]]: Scientific Methods for Health Sciences ==
 [[Image:SMHS_EBook.png|250px|thumbnail|right| Scientific Methods for Health Sciences EBook]]
-Electronic book (EBook) on Scientific Methods for Health Sciences (coming up ...)
 ==[[SMHS_Preface| Preface]]==
-The ''Scientific Methods for Health Sciences (SMHS) EBook'' is designed to support a 4-course training of scientific methods for graduate students in the health sciences.
+The ''Scientific Methods for Health Sciences (SMHS) EBook'' (ISBN: 978-0-9829949-1-7) is designed to support a [http://www.socr.umich.edu/people/dinov/SMHS_Courses.html 4-course training curriculum] emphasizing the fundamentals, applications and practice of scientific methods specifically for graduate students in the health sciences.
 ===[[SMHS_Format| Format]]===
@@ Line 17: / Line 16: @@
 ===[[SMHS_copyright | Copyrights]]===
-The SMHS EBook is a freely and openly accessible electronic book developed by SOCR and the general community.
+The SMHS EBook is a freely and openly accessible electronic book developed by [[SOCR]] and the general health sciences community.
 ==Chapter I: Fundamentals==
 ===[[SMHS_EDA | Exploratory Data Analysis, Plots and Charts]]===
 Review of data types, exploratory data analyses and graphical representation of information.
-===Overview===
-* ''What is data?'' Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just description of things (meta-data). Data types can be divided into two big categories of quantitative (numerical information) and qualitative data (descriptive information).
-*''Quantitative data'' is anything that can be expressed as a number, or quantified. For example, the scores on a math test or weight of girls in the fourth grade are both quantitative data. Quantitative data (discrete or continuous) is often referred to as the measurable data and this type of data allows scientists to perform various arithmetic operations, such as addition, multiplication, functional-evaluation, or to find parameters of a population. There are two major types of quantitative data: discrete and continuous.
-**Discrete data results from either a finite, or infinite but countable, possible options for the values present in a given discrete data set and the values of this data type can constitute a sequence of isolated or separated points on the real number line.
-**Continuous quantitative data results from infinite and dense possible values that the observations can take on.
-*'''Qualitative''' data cannot be expressed as a number. Examples may be gender, religious preference. Categorical data (qualitative or nominal) results from placing individuals into groups or categories. Ordinal and qualitative categorical data types both fall into this category.
-In statistics, exploratory data analysis (EDA) is an approach to analyze data sets to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:
-* [[SOCR_EduMaterials_Activities_BoxPlot|Box plot]], [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]; Multi-vari chart; Run chart; Pareto chart; [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]; Stem-and-leaf plot;
-* Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
-* [http://en.wikipedia.org/wiki/Median_polish Median polish], [http://en.wikipedia.org/wiki/Trimean Trimean].
-===Motivation===
-The feel of data comes clearly from the application of various graphical techniques, which serves as a perfect window to human perspective and sense. The primary goal of EDA is to maximize the analyst’s insight into a data set and into the underlying structure of the data set. To get a feel for the data, it is not enough for the analyst to know what is in the data, he or she must also know what is not in the data, and the only way to do that is to draw on our own pattern recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The main objectives of EDA are to:
-* Suggest hypotheses about the causes of observed phenomena;
-* Assess (parametric) assumptions on which statistical inference will be based;
-* Support the selection of appropriate statistical tools and techniques;
-* Provide a basis for further data collection through surveys and experiments.
-===Theory===
-Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.
-* [[SOCR_EduMaterials_Activities_BoxPlot|Box-and-Whisker plot]]: It is an efficient way for presenting data, especially for comparing multiple groups of data. In the box plot, we can mark-off the five-number summary of a data set (minimum, 25<sup>th</sup> percentile, median, 75<sup>th</sup> percentile, maximum). The box contains the 50% of the data. The upper edge of the box represents the 75<sup>th</sup> percentile, while the lower edge is the 25<sup>th</sup> percentile. The median is represented by a line drawn in the middle of the box. If the median is not in the middle of the box then the data are skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers. Outliers are observations below \( Q_1-1.5(IQR) \) or above \( Q_3+1.5(IQR) \), where \( Q_1 \) is the 25<sup>th</sup> percentile, \( Q_3 \) is the 75<sup>th</sup> percentile, and \( IQR=Q_3-Q_1 \) (the interquartile range). The advantage of a box plot is that it provides graphically the location and the spread of the data set, it provides an idea about the skewness of the data set, and can provide a comparison between variables by constructing a side-by-side box plots.
-[[Image:SMHS EDA Gallaway 07012014 Fig1a.png|500px]]
-* [[SOCR_EduMaterials_Activities_Histogram_Graphs|Histogram]]: It is a graphical visualization of tabulated frequencies or counts of data within equal spaced partition of the range of the data. It shows what proportion of measurements that fall into each of the categories defined by the partition of the data range space.
-[[Image:UMHS Gallawaay 07012014 Fig2.PNG|500px]]
-Comment: Comparing the two series in the histogram above, we can easily tell that the pattern of series 2 is more obvious than that of series 1. Our intuition may come from the fact that series 1 has more extreme values across the five days: for example, the values for Jan 1st and Jan 3rd are extremely high (almost 55 for Jan 1st), while that for Jan 4th is almost -12. However, the values for series 2 are all above 0 and fluctuate between 5 and 20.
-[[Image:UMHS Gallaway 07012014 Fig3.PNG|500px]]
-Comment: The Dot chart above gives a clear picture of the values of all the data points and makes the fundamental measurements easily readable. We can tell that most of the values of the data fluctuate between 1 and 7 with mean 3.9 and median 4. There are two obvious outliers valued -2 and 10.
-* [[SOCR_EduMaterials_Activities_ScatterChart|Scatter plot]]: It uses Cartesian coordinates to display values for two variables for a set of data, which is displayed as a collection of points. The value of variable is determined by the position on the horizontal and vertical axis.
-[[Image:UMHS Gallaway 07012014 Fig4.PNG|500px]]
-Comment: The x and y axes display values for two variables and all the data points drawn in the chart are coordinates indicating a pair of values for both variables.
-For the first series, all the data points lie on and above the diagonal line so with increasing x variable, the paired y variable increases faster or equal to x variable. We can infer a positive linear association between X and Y.
-For the second series, most data located along the line except for two outliers (4,8) and (1,5). So for most data points, with increasing x variable, the paired y variable decreases slower or equal to x. We may infer a negative linear association between X and Y.
-For the third series, we can’t draw a line association between X and Y, instead, a quadratic pattern would work better here.
-* [[SOCR_EduMaterials_Activities_QQChart|QQ Plot]]: (Quantile - Quantile Plot) The observed values are plotted against theoretical quantiles in QQ charts. A line of good fit is drawn to show the behavior of the data values against the theoretical distribution. If F() is a cumulative distribution function, then a quantile (q), also known as a percentile, is defined as a solution to the equation \(F(q)=p\),that is \(q=F^{-1}(p)\).
-[[Image:UMHS Gallaway 07012014 Fig5.PNG|500px]]
-Comment: From the chart above, we can see that the data follows a normal distribution in general given all the data points (noted in red) located along side the line. However, the data doesn’t follow a normal distribution tightly because there are data points located pretty far from the line. We can also infer that the sampled data may not be representative enough of the population because of the limited size of the sample.
-Median polish: An EDA procedure proposed by John Tukey. It finds an additively fit model for data in a two-way layout table of the form row effect + column effect + overall mean. It is an iterative algorithm for removing any trends by computing medians for various coordinates on the spatial domain D. The median polish algorithm assumes: m(s) = grand effect + row(s) + column(s).
-*Steps in Algorithm:
-**Take the median of each row and record the value to the side of the row. Subtract the row median from each point in that particular row.
-**Compute the median of the row medians, and record the value as the grand effect. Subtract this grand effect from each of the row medians.
-**Take the median of each column and record the value beneath the column. Subtract the column median from each point in that particular column.
-**Compute the median of the column medians, and add the value to the current grand effect. Subtract this addition to the grand effect from each of the column medians.
-**Repeat steps 1-4 until no changes occur with the row or column medians.
-* Trimean: It is a measure of a probability distribution’s location defined as a weighted average of the distribution’s median and its two quartiles. It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. And it is a remarkably efficient estimator of population mean especially for large data set (say more than 100 points) from a symmetric population.
-<center> \( \frac{Q_1+2Q_2+Q_3}{4} \)  </center>
-) Applications
-.1) [http://www.itl.nist.gov/div898/handbook/eda/eda.htm This article] gives a thorough introduction to EDA. It covers the basic concepts, objectives and techniques of EDA, and includes case studies of its application. The case studies cover eight kinds of charts for univariate data, reliability and multi-factor studies. The article gives specific examples with background, output and interpretation of results, and is a good source for studying EDA and charts.
-.2 [http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf This article] starts with a general introduction to data analysis and works on EDA using examples applied in different graphical analysis. This article should serve as a more basic and general introduction to this concept and would be a good start of studying EDA and charts.
-.3) The [http://wiki.stat.ucla.edu/socr/index.php/SOCR_HTML5_Expansion_MotionCharts SOCR Motion Charts Project enables complex data visualization, see the [http://socr.umich.edu/HTML5/MotionChart/ SOCR MotionChart webapp]. The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.
-Now, we want to explore the relation between two variables in the [[SOCR_Data_Dinov_010309_HousingPriceIndex| dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama over 2000 to 2006]]. First, how did the UR in Alabama change from 2000 to 2006?
-[[Image:UMHS Gallaway 07012014 Fig6.PNG|800px]]
-From this chart, we can see that the UR of Alabama increases from 2000 to 2003 and then decreases sharply from 2004 to 2006. How did the UR behave in states from other parts of the country over the same period?
-[[Image:UMHS Gallaway 07012014 Fig7.PNG|800px]]
-They all seem to follow similar patterns over the years. Next, we look for patterns between UR and HPI in a single state, say Alabama, over the years.
-[[Image:UMHS Gallaway 07012014 Fig8.PNG|800px]]
-The chart above shows that the HPI of Alabama increases over the years, while the UR first increases and then drops sharply between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear.
-Similarly, if we extend the graph to all three states from different regions, we have the following chart:
-[[Image:UMHS Gallaway 07012014 Fig9.PNG|500px]]
-So, the question is: based on the chart, is there any association between UR and HPI across states?
-The motion chart, however, makes the study much more interesting by animating UR vs. HPI for 51 states from different areas over 2000 to 2006, so we can see how the values change over the years across all states. You are welcome to explore the data, using the link listed above, to see how the chart changes.
-5) Software
-http://www.socr.ucla.edu/htmls/SOCR_Charts.html
-http://www.r-bloggers.com/exploratory-data-analysis-useful-r-functions-for-exploring-a-data-frame/
-6) Problems
-6.1) Work on the problems in Uniform Random Numbers and Random Walk from the Case Studies part in http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm .
-Two random samples were taken to determine the difference in backpack load, in pounds, between seniors and freshmen. The summaries are as follows:
-<center>
-{| class="wikitable" style="text-align:center; width:75%" border="1"
-|-
-| Year|| Mean || SD || Median || Min ||Max || Range|| Count
-|-
-| Freshmen || 20.43 || 4.21 || 17.2 || 5.78 || 31.68 || 25.9 || 115
-|-
-| Senior || 18.67 || 4.21 || 18.67 || 5.31 || 27.66 || 22.35 ||157
-|-
-|}
-</center>
-6.2) Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose one answer.
-(a) Histograms
-(b) Dot Plots
-(c) Scatter Plots
-(d) Box Plots
-6.3) School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.
-(a) box plot
-(b) scatter plot
-(c) line plot
-(d) dot plot
-6.4) What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.
-(a) bar charts
-(b) side by side boxplot
-(c) histograms
-(d) pie charts
-6.5) There is a company in which a very small minority of males (3%) receive three times the median salary of males, and a very small minority of females (3%) receive one-third of the median salary of females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.
-(a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.
-(b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.
-(c) Need to have the actual data to compare the shape of the boxplots.
-(d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.
-6.6) A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds; heart beats per minute; smoker or non-smoker; single or married.
-He wants to examine the relationship between: 1) heart beats per minute and weight, and 2) smoking and marital status. Choose one answer.
-(a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.
-(b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.
-(c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.
-(d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.
-6.7) As part of an experiment in perception, 160 UCLA psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.
-(a) The histogram of times could be symmetrical and not normal with major outliers.
-(b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.
-(c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.
-(d) The histogram of times could be normal with no major outliers.
-*7) References
-* http://www.itl.nist.gov/div898/handbook/eda/eda.htm
-* http://mirlyn.lib.umich.edu/Record/000252958
-* http://mirlyn.lib.umich.edu/Record/012841334
-<hr>
-* SOCR Home page: http://www.socr.umich.edu
-{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_EDA}}
 ===[[SMHS_UbiquitousVariation | Ubiquitous Variation]]===
 There are many ways to quantify variability, which is present in all natural processes.
-'''IV. HS 850: Fundamentals'''
-'''Ubiquitous variation'''
-1) Overview: In the real world, variation exists in almost every data set. No matter how controlled the environment is in the protocol or the design, virtually any repeated measurement, observation, experiment, trial, or study is bound to generate data that vary because of intrinsic (internal to the system) or extrinsic (ambient environment) effects. The extent to which the observations differ from one another is called variation. Variation is an important concept in statistics, and measuring variability is of special importance in statistical inference. Measures of variation describe the extent to which data are dispersed or spread out. We will introduce several basic measures of variation commonly used in statistics (range, variance, standard deviation, and sum of squares), together with Chebyshev’s theorem and the empirical rule.
-2) Motivation:
-Variation is of significant importance in statistics and it is ubiquitous in data. Consider UCLA’s study of Alzheimer’s disease, which analyzed data from 31 Mild Cognitive Impairment (MCI) patients and 34 probable Alzheimer’s disease (AD) patients. The investigators made every attempt to control as many variables as possible. Yet the demographic information they collected on the subjects contained unavoidable variation. The same study found variation in the MMSE cognitive scores even within the same subject. The table below shows the demographic characteristics of the subjects included in this study, using the following notation: M (male), F (female), W (white), AA (African American), A (Asian).
-{| class="wikitable" style="text-align:center; width:75%" border="1"
-|-
-| Variable|| Alzheimer's disease || MCI || Test Statistics || Test Score || P-value
-|-
-| Age(years)|| 76.2 (8.3) range 52-89|| 73.7 (7.3) range 57-84|| Student’s T ||t <sub>0</sub> =1.284 || p=0.21
-|-
-| Gender(M:F)|| 15:19|| 15:16|| Proportion|| z <sub>0</sub>= -0.345 || p=0.733
-|-
-| Education(years)|| 14.0 (2.1) range 12-19 || 16.23 (2.7) range 12-20 || Wilcoxon rank sum || w <sub>0</sub> =773.0 || p<0.001
-|-
-| Race(W:AA:A)|| 29:1:4 || 26:2:3 || χ <sup>2</sup> <sub>(df=2)</sub> || χ <sup>2</sup> <sub>(df=2)</sub> =1.18 || p=0.55
-|-
-| MMSE|| 20.9 (6.3) range 4-29 || 28.2 (1.6) range 23-30 || Wilcoxon rank sum || w <sub>0</sub> =977.5 || p<0.001
-|}
-Once we accept that all natural phenomena are inherently variable and that no process is completely deterministic, we need measures of variation that tell us the extent to which the data are dispersed. Suppose, for instance, we flip a coin 50 times and get 15 heads and 35 tails. Under basic probability theory, assuming a fair coin, we would expect about 25 heads and 25 tails. So, what happened here? Now, suppose there are 100 students and each one flips the coin 50 times. What would you expect the results to look like?
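The coin-flipping question above can be explored with a short simulation. The sketch below is illustrative only: the function names, the seed, and the choice to summarize the 100 students by their minimum, maximum, and average head counts are assumptions made for this example. It also computes the exact binomial probability of seeing 15 or fewer heads in 50 flips of a fair coin.

```python
import random
from math import comb


def simulate_heads(n_students=100, n_flips=50, p=0.5, seed=2014):
    """Number of heads obtained by each student (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n_flips)) for _ in range(n_students)]


def prob_at_most(k, n=50, p=0.5):
    """Exact binomial probability of at most k heads in n flips."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


heads = simulate_heads()
print("fewest heads:", min(heads), "most heads:", max(heads))
print("average number of heads:", sum(heads) / len(heads))
print("P(at most 15 heads in 50 fair flips):", prob_at_most(15))
```

The individual counts scatter around 25 rather than all being equal to 25, which is the variation this section is about, while a result as extreme as 15 heads is rare for a fair coin.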
-3) Theory
-Measures of variation:
-* 3.1) Range: range is the simplest measure of variation and it is the difference between the largest value and the smallest. Range = Maximum – Minimum.
-Suppose the pulse rate of Jack varied from 70 to 76 while that of Tom varied from 58 to 79. Jack has a range of 76 – 70 = 6 and Tom has a range of 79 – 58 = 21. Hence, by the range measure, Tom's pulse rate varies much more than Jack's.
-A similar measure of variation covers (more or less) the middle 50 percent of the data. It is the interquartile range: Q <sub>3</sub> - Q <sub>1</sub>, where Q <sub>1</sub> and Q <sub>3</sub> are the first and third quartiles.
-*3.2) Variance: unlike the range, which involves only the largest and smallest values, the variance involves all the data values.
-** Population variance: \( \sigma^2=\frac{\sum (x-\mu)^2}{N} \), where \( \mu \) is the population mean of the data and N is the size of the population.
-** Unbiased estimate of the population variance (sample variance): \( s^2=\frac{\sum (x-\bar{x})^2}{n-1} \), where \( \bar{x} \) is the sample mean and n is the sample size.
-*3.3) Standard deviation: the square root of the variance. Because the deviations in the variance are squared, its units are the square of the original units; taking the square root returns the measure to the units of the original data values.
-** Population standard deviation: \( \sigma=\sqrt{\frac{\sum (x-\mu)^2}{N}} \), where \( \mu \) is the population mean of the data and N is the size of the population.
-** Sample standard deviation (the usual estimate of the population standard deviation): \( s=\sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} \), where \( \bar{x} \) is the sample mean and n is the sample size.
-Consider an example: a biologist found 8, 11, 7, 13, 10, 11, 7 and 9 contaminated mice in 8 groups. Calculate s.
-\( \bar{x}=\frac{8+11+7+13+10+11+7+9}{8}=9.5 \)
-{| class="wikitable" style="text-align:center; width:75%" border="1"
-|-
-| Group || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || Sum
-|-
-| \( x \) || 8 || 11 || 7 || 13 || 10 || 11 || 7 || 9 || 76
-|-
-| \( x-\bar{x} \) || -1.5 || 1.5 || -2.5 || 3.5 || 0.5 || 1.5 || -2.5 || -0.5 || 0
-|-
-| \( (x-\bar{x})^2 \) || 2.25 || 2.25 || 6.25 || 12.25 || 0.25 || 2.25 || 6.25 || 0.25 || 32
-|-
-|}
-\( s=\sqrt{\frac{\sum (x-\bar{x})^2}{n-1}}=\sqrt{\frac{32}{7}}\approx 2.14 \)
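The worked example above can be checked numerically. This is a minimal sketch using only Python's standard library; the variable names are chosen for this illustration, and `statistics.stdev` is used as an independent check because it also divides by n − 1.

```python
from statistics import mean, stdev

counts = [8, 11, 7, 13, 10, 11, 7, 9]  # contaminated mice in the 8 groups

x_bar = mean(counts)                                # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in counts)  # sum of squared deviations
s_squared = sum_sq_dev / (len(counts) - 1)          # sample variance
s = s_squared ** 0.5                                # sample standard deviation

print(x_bar, sum_sq_dev, round(s, 2))
print(round(stdev(counts), 2))  # library value, for comparison
```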
-*3.4) Sum of squares (shortcuts)
-The sum of the squares of the deviations from the mean is given a shortcut notation and several alternative formulas.
-\( SS(x)=\sum (x-\bar{x})^2 \)
-A little algebraic simplification gives: \( SS(x)=\sum x^2-\frac{(\sum x)^2}{n} \)
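The simplification can be written out step by step. Expanding the square and using \( \bar{x}=\frac{\sum x}{n} \) gives:

<center> \( SS(x)=\sum (x-\bar{x})^2=\sum x^2-2\bar{x}\sum x+n\bar{x}^2=\sum x^2-\frac{2(\sum x)^2}{n}+\frac{(\sum x)^2}{n}=\sum x^2-\frac{(\sum x)^2}{n} \) </center>

As a check against the contaminated-mice example, \( \sum x=76 \) and \( \sum x^2=754 \), so

<center> \( SS(x)=754-\frac{76^2}{8}=754-722=32 \) </center>

which agrees with the sum of squared deviations computed directly.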
-*3.5) Chebyshev’s Theorem: The proportion of values that fall within k standard deviations of the mean is at least \( 1-\frac{1}{k^2} \), where k is any number greater than 1. Here "within k standard deviations" means the interval from \( \bar{x}-ks \) to \( \bar{x}+ks \). Chebyshev’s theorem holds for any data set, whatever its distribution.
-*3.6) Empirical Rule: This rule applies only to bell-shaped (approximately normal) distributions. For such distributions: approximately 68% of the data values fall within one standard deviation of the mean; approximately 95% of the data values fall within two standard deviations of the mean; approximately 99.7% of the data values fall within three standard deviations of the mean.
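To compare the two rules, the sketch below evaluates Chebyshev's lower bound next to the exact normal coverage, and counts how many values of a small sample fall within k sample standard deviations of the mean. The function names are assumptions made for this illustration; the normal coverage uses the identity P(|Z| < k) = erf(k/√2).

```python
from math import erf, sqrt
from statistics import mean, stdev


def chebyshev_bound(k):
    """Chebyshev's lower bound on the proportion within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 1 - 1 / k**2


def normal_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return erf(k / sqrt(2))


def fraction_within(data, k):
    """Observed proportion of data within k sample standard deviations of the mean."""
    m, s = mean(data), stdev(data)
    return sum(abs(x - m) <= k * s for x in data) / len(data)


for k in (2, 3):
    print(k, chebyshev_bound(k), round(normal_within(k), 4))

mice = [8, 11, 7, 13, 10, 11, 7, 9]  # data from the standard deviation example
print(fraction_within(mice, 1.5), fraction_within(mice, 2))
```

Chebyshev's bound is deliberately conservative: it guarantees at least 75% within two standard deviations for any data set, while a normal distribution places about 95% there.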
-*4) Applications
-This article (http://www.nature.com/ng/journal/v39/n7s/full/ng2042.html), titled The Population Genetics of Structural Variation, discusses structural variation in the human genome. It summarizes recent advances, illustrates the diverse mutational origins of chromosomal rearrangements, and argues that their complexity necessitates a re-evaluation of existing population genetic methods. It starts with an introduction to genomic variants, including their biological significance and basic characteristics, which motivates the study of structural variation. It then reviews how knowledge of structural variation in the human genome has improved and ends with two important future challenges in the study of structural variation.
-*5) Software
-http://www.alcula.com/calculators/statistics/range/
-http://www.alcula.com/calculators/statistics/variance/
-http://easycalculation.com/statistics/standard-deviation.php
-*6) Problems
-6.1) Let X be a random variable with mean 80 and standard deviation 12. Find the mean and the standard deviation of the following variable: X - 20. Choose one answer.
-(a) Mean = 60, standard deviation = 144
-(b) Mean = 60, standard deviation = 12
-(c) Mean = 80, standard deviation = 12
-(d) Mean = 60, standard deviation = -8
-6.2) A physician collected data on 1000 patients to examine their heights. A statistician hired to look at the files noticed the typical height was about 60 inches, but found that one height was 720 inches. This is clearly an outlier. The physician is out of town and can't be contacted, but the statistician would like to have some preliminary descriptions of the data to present when the doctor returns. Which of the following best describes how the statistician should handle this outlier? Choose one answer.
-(a) The statistician should publish a paper on the emergence of a new race of giants.
-(b) The statistician should keep the data point in; each point is too valuable to drop one.
-(c) The statistician should drop the observation from the analysis because this is clearly a mistake; the person would be 60 feet tall.
-(d) The statistician should analyze the data twice, once with and once without this data point, and then compare how the point affects conclusions.
-(e) The statistician should drop the observation from the dataset because we can't analyze the data with it.
-6.3) Researchers do a study on the number of cars that a person owns. They think that the distribution of their data might be normal, even though the median is much smaller than the mean. They make a p-plot. What does it look like? Choose one answer.
-(a) It's not a straight line.
-(b) It's a bell curve.
-(c) It's a group of points clustered around the middle of the plot.
-(d) It's a straight line.
-6.4) Which of the following parameters is most sensitive to outliers? Choose one answer.
-(a) Standard deviation
-(b) Interquartile range
-(c) Mode
-(d) Median
-6.5) Which value given below is the best representative for the following data?
-2, 3, 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 10, 11. Choose one answer.
-(a) The weighted average of the two modes or (4*5 + 9*5)/10 = 6.5
-(b) No single number could represent this data set
-(c) The average of the two modes or (4 + 9) / 2 = 6.5
-(d) The mean or (2 + 3 + 4 + … + 10 + 11)/18 = 5.9
-(e) The median or (6 + 7)/2 = 6.5
-6.6) The following data were collected from a website for 121 schools and included these attributes about each institution: name, public or private institution, state, , cost of health insurance, resident tuition, resident fees, resident total expenses, nonresident tuition, nonresident fees, and nonresident total expenses in 2005. It was surprising that some medical schools charge no tuition for residents. However, other students pay about $20,000 in fees.
-{| class="wikitable" style="text-align:center; width:75%" border="1"
-|-
-|  || Min || Q1 || Median || Q3 || Max
-|-
-| Private || -$6,550 || $30,729 || $33,850 || $36,685 ||  $41,360
-|-
-| Public || $0 || $10,219 || $16,168 || $18,800 || $27,886
-|}
-On the same scale, use the 5-Number summary to construct two boxplots for the tuition for residents at 73 public and 48 private medical colleges. Use the data and plots to determine which statement about centers is true.
-(a) For private medical schools, the mean tuition of residents is greater than the median tuition for residents.
-(b) With these data, we cannot determine the relationship between mean and median tuition for residents.
-(c) For private medical schools, the mean tuition of residents is equal to the median tuition for residents.
-(d) For private medical schools, the mean tuition of residents is less than the median tuition for residents.
-6.7) Suppose that we create a new data set by doubling the highest value in a large data set of positive values. What statement is FALSE about the new data set? Choose one answer.
-(a) The mean increases
-(b) The standard deviation increases
-(c) The range increases
-(d) The median and interquartile range both increase
-6.8) Consider a large data set of positive values and multiply each value by 100. Which of the following statements is true? Choose one answer.
-(a) The mean, median, and standard deviation increase
-(b) The mean and median increase but the standard deviation is unchanged.
-(c) The standard deviation increases but the mean and median are unchanged.
-(d) The range and interquartile range are unchanged
-7) References
-http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_IntroVar
-http://mirlyn.lib.umich.edu/Record/004199238
-Answers: b, c, a, a, b, d, d, a.
-<hr>
-* SOCR Home page: http://www.socr.umich.edu
-{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_UbiquitousVariation}}
 ===[[SMHS_ParamInference | Parametric Inference]]===
@@ Line 417: / Line 33: @@
 ===[[SMHS_OR_RR | Odds Ratio/Relative Risk]]===
 The relative risk, RR, (a measure of dependence comparing two probabilities in terms of their ratio) and the odds ratio, OR, (the ratio of two odds, where each odds is the fraction of one probability and its complement) are widely applicable in many healthcare studies.
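As a small illustration of the two measures, the sketch below computes RR and OR from a 2×2 table of counts. The counts (20 of 100 exposed subjects and 10 of 100 unexposed subjects developing the outcome) are hypothetical and chosen only for this example, as are the function and argument names.

```python
def relative_risk(a, b, c, d):
    """RR from counts: a, b = exposed with/without outcome; c, d = unexposed with/without."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """OR from the same counts: odds of the outcome in the exposed over the unexposed."""
    odds_exposed = a / b
    odds_unexposed = c / d
    return odds_exposed / odds_unexposed


# Hypothetical counts, not taken from any study.
rr = relative_risk(20, 80, 10, 90)
odds = odds_ratio(20, 80, 10, 90)
print(round(rr, 2), round(odds, 2))
```

With these made-up counts the risk in the exposed group is twice that of the unexposed group, while the odds ratio is somewhat larger; the two measures are close only when the outcome is rare.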
+===[[SMHS_CenterSpreadShape | Centrality, Variability and Shape]]===
+Three main features of sample data are commonly reported as critical in understanding and interpreting the population, or process, the data represents. These include Center, Spread and Shape. The main measures of centrality are Mean, Median and Mode(s). Common measures of variability include the range, the variance, the standard deviation, and mean absolute deviation. The shape of a (sample or population) distribution is an important characterization of the process and its intrinsic properties.
 ===[[SMHS_ProbabilityDistributions | Probability Distributions]]===
-Probability distributions are mathematical models for processes that we observe in nature. Although there are different types of distributions, they have common features and properties that make them useful in various scientific applications.
+Probability distributions are mathematical models for processes that we observe in nature. Although there are different types of distributions, they have common features and properties that make them useful in various scientific applications. This section presents the Bernoulli, Binomial, Multinomial, Geometric, Hypergeometric, Negative Binomial, Negative Multinomial, Poisson, and Normal distributions, as well as the concept of the moment generating function.
 ===[[SMHS_ResamplingSimulation | Resampling and Simulation]]===
@@ Line 428: / Line 47: @@
 ===[[SMHS_IntroEpi | Intro to Epidemiology]]===
-Epidemiology is the study of the distribution and determinants of disease frequency in human populations. This section presents the basic epidemiology concepts. More advanced epidemiological methodologies are discussed in [[SMHS_Epidemiology|the next chapter]].
+Epidemiology is the study of the distribution and determinants of disease frequency in human populations. This section presents the basic epidemiology concepts. More advanced epidemiological methodologies are discussed in [[SMHS_Epidemiology|the next chapter]]. This section also presents the Positive and Negative Predictive Values (PPV/NPV).
 ===[[SMHS_ExpObsStudies | Experiments vs. Observational Studies]]===
@@ Line 437: / Line 56: @@
 ===[[SMHS_HypothesisTesting | Hypothesis Testing]]===
-Hypothesis testing is a quantitative decision-making technique for examining the characteristics (e.g., centrality, span) of populations or processes based on observed experimental data.
+Hypothesis testing is a quantitative decision-making technique for examining the characteristics (e.g., centrality, span) of populations or processes based on observed experimental data. In this section we discuss inference about a mean, mean differences (both small and large samples), a proportion or differences of proportions and differences of variances.
 ===[[SMHS_PowerSensitivitySpecificity | Statistical Power, Sensitivity and Specificity]]===
@@ Line 459: / Line 78: @@
 ===[[SMHS_CorrectionMultipleTesting | Correction for Multiple Testing]]===
 Multiple testing refers to analytical protocols involving testing of several (typically more than two) hypotheses. Multiple testing studies require correction for the type I error (false-positive) rate, which can be done using Bonferroni's method, Tukey’s procedure, family-wise error rate (FWER), or false discovery rate (FDR).
 ==Chapter II: Applied Inference==
 ===[[SMHS_Epidemiology| Epidemiology]]===
+This section expands the [[SMHS_IntroEpi|Epidemiology Introduction]] from the previous chapter. Here we will discuss numbers needed to treat and various likelihoods related to genetic association studies, including linkage and association, LOD scores and Hardy-Weinberg equilibrium.
 ===[[SMHS_SLR| Correlation and Regression (ρ and slope inference, 1-2 samples)]]===
+Studies of correlations between two, or more, variables and regression modeling are important in many scientific inquiries. The simplest such situation is exploring the association and correlation of bivariate data ($X$ and $Y$).
 ===[[SMHS_ROC| ROC Curve]]===
+The receiver operating characteristic (ROC) curve is a graphical tool for investigating the performance of a binary classifier system as its discrimination threshold varies. We also discuss the concepts of positive and negative predictive values.
 ===[[SMHS_ANOVA| ANOVA]]===
+Analysis of Variance (ANOVA) is a statistical method for examining the differences between group means. ANOVA is a generalization of the [[SMHS_HypothesisTesting|t-test]] for more than 2 groups. It splits the observed variance into components attributed to different sources of variation.
 ===[[SMHS_NonParamInference| Non-parametric inference]]===
+Nonparametric inference involves a class of methods for descriptive and inferential statistics that are not based on parametrized families of probability distributions, which is the basis of the [[SMHS_ParamInference|parametric inference we discussed earlier]]. This section presents the Sign test, Wilcoxon Signed Rank test, Wilcoxon-Mann-Whitney test, the McNemar test, the Kruskal-Wallis test, and the Fligner-Killeen test.
 ===[[SMHS_Cronbachs| Instrument Performance Evaluation: Cronbach's α]]===
+Cronbach’s alpha (α) is a measure of internal consistency used to estimate the reliability of a cumulative psychometric test.
 ===[[SMHS_ReliabilityValidity| Measurement Reliability and Validity]]===
+Measures of Validity include: Construct validity (extent to which the operation actually measures what the theory intends to), Content validity (the extent to which the content of the test matches the content associated with the construct), Criterion validity (the correlation between the test and a variable representative of the construct), experimental validity (validity of design of experimental research studies). Similarly, there are many alternative strategies to assess instrument Reliability (or repeatability) -- test-retest reliability, administering different versions of an assessment tool to the same group of individuals, inter-rater reliability, internal consistency reliability.
 ===[[SMHS_SurvivalAnalysis| Survival Analysis]]===
+Survival analysis is used for analyzing longitudinal data on the occurrence of events (e.g., death, injury, onset of illness, recovery from illness). In this section we discuss data structure, survival/hazard functions, parametric versus semi-parametric regression techniques, and an introduction to Kaplan-Meier methods (non-parametric).
 ===[[SMHS_DecisionTheory| Decision Theory]]===
+Decision theory helps determine the optimal course of action among a number of alternatives when consequences cannot be forecast with certainty. There are different types of loss-functions and decision principles (e.g., frequentist vs. Bayesian).
 ===[[SMHS_CLT_LLN| CLT/LLNs – limiting results and misconceptions]]===
+The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are the first and second fundamental laws of probability. The CLT states that, under certain conditions, the arithmetic mean of a sufficiently large number of independent random variables is approximately normally distributed. The LLN states that when the same experiment is performed a large number of times, the average of the results should be close to the expected value, and tends to get closer to it as the number of trials increases.
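Both results can be illustrated with a short simulation. The sketch below is an illustration under stated assumptions, not part of the SMHS materials: the function names, the seeds, the die-rolling experiment (expected value 3.5), and the choice of exponential variables with mean 1 and standard deviation 1 are all made up for this example.

```python
import random
from math import sqrt
from statistics import mean, stdev


def average_die_roll(n_rolls, seed=1):
    """LLN: the average of many fair die rolls should approach 3.5."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls


def sample_means(n_samples, sample_size, seed=2):
    """CLT: means of repeated samples from a skewed (exponential, mean 1) distribution."""
    rng = random.Random(seed)
    return [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]


for n in (10, 1000, 100000):
    print(n, average_die_roll(n))

means = sample_means(2000, 50)
standard_error = 1 / sqrt(50)  # population SD is 1 for this exponential
coverage = sum(abs(m - 1) <= 1.96 * standard_error for m in means) / len(means)
print(mean(means), stdev(means), standard_error, coverage)
```

Although the individual exponential values are strongly skewed, the sample means cluster symmetrically around 1 with a spread close to the standard error, and roughly 95% of them fall within 1.96 standard errors of the true mean, as a normal approximation would suggest.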
 ===[[SMHS_AssociationTests| Association Tests]]===
+There are alternative methods to measure the association between two quantities (e.g., relative risk, risk ratio, efficacy, prevalence ratio). This section also includes details on Chi-square tests for association and goodness-of-fit, Fisher’s exact test, randomized controlled trials (RCT), and external and internal validity.
 ===[[SMHS_BayesianInference| Bayesian Inference]]===
+Bayes’ rule connects the theories of conditional and compound probability and provides a way to update probability estimates for a hypothesis as additional evidence is observed.
 ===[[SMHS_PCA_ICA_FA| PCA/ICA/Factor Analysis]]===
+Principal component analysis is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through an orthogonal transformation. Independent component analysis is a computational tool to separate a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian signals and are statistically independent from each other. Factor analysis is a statistical method that describes variability among observed correlated variables in terms of a potentially lower number of unobserved variables.
 ===[[SMHS_CIs| Point/Interval Estimation (CI) – MoM, MLE]]===
+Estimation of population parameters is critical in many applications. In statistics, estimation is commonly accomplished in terms of point-estimates or interval-estimates for specific (unknown) population parameters of interest. The method of moments (MOM) and maximum likelihood estimation (MLE) techniques are used frequently in practice. In this section, we also lay the foundations for expectation maximization and Gaussian mixture modeling.
 ===[[SMHS_ResearchCritiques| Study/Research Critiques]]===
+The scientific rigor in published literature, grant proposals and general reports needs to be assessed and scrutinized to minimize errors in data extraction and meta-analysis. Reporting biases present significant obstacles to the collection of relevant information on the effectiveness of an intervention, strength of relations between variables, or causal associations.
 ===[[SMHS_CommonMistakesMisconceptions| Common mistakes and misconceptions in using probability and statistics, identifying potential assumption violations, and avoiding them]]===
@@ Line 496: / Line 129: @@
 ==Chapter III: Linear Modeling==
 ===[[SMHS_MLR | Multiple Linear Regression (MLR)]]===
+Multiple Linear Regression encapsulates a family of statistical analyses for modeling single or multiple independent variables and one dependent variable. MLR computationally estimates the effects of each independent variable (coefficients) based on the data using least-squares fitting.
 ===[[SMHS_GLM| Generalized Linear Modeling (GLM)]]===
+Generalized Linear Modeling (GLM) is a flexible generalization of ordinary linear multivariate regression, which allows for response variables that have error distribution models other than a normal distribution. GLM unifies statistical models like linear regression, logistic regression and Poisson regression.
 ===[[SMHS_ANCOVA| Analysis of Covariance (ANCOVA)]]===
-First, see the [[SMHS_ANOVA|ANOVA]] section above.
+Analysis of Variance ([[SMHS_ANOVA|ANOVA]]) is a common method applied to analyze the differences between group means. Analysis of Covariance (ANCOVA) blends ANOVA and regression to evaluate whether population means of a dependent variable are equal across levels of a categorical independent variable while statistically controlling for the effects of other continuous variables.
 ===[[SMHS_MANOVA| Multivariate Analysis of Variance (MANOVA)]]===
+A generalized form of [[SMHS_ANOVA|ANOVA]] is the multivariate analysis of variance (MANOVA), which is a statistical procedure for comparing multivariate means of several groups.
 ===[[SMHS_MANCOVA| Multivariate Analysis of Covariance (MANCOVA)]]===
+Similar to [[SMHS_MANOVA|MANOVA]], the multivariate analysis of covariance (MANCOVA) is an extension of [[SMHS_ANCOVA|ANCOVA]] that is designed for cases where there is more than one dependent variable and when a control of concomitant continuous independent variables is present.
 ===[[SMHS_rANOVA| Repeated measures Analysis of Variance (rANOVA)]]===
+Repeated measures are used in situations when the same objects/units/entities take part in all conditions of an experiment. Because there are multiple measures on the same subject, we have to control for the correlation between them. Repeated measures ANOVA (rANOVA) is the equivalent of the one-way [[SMHS_ANOVA|ANOVA]], but for related, not independent, groups. It is also referred to as within-subject ANOVA or ANOVA for correlated samples.
 ===[[SMHS_PartialCorrelation| Partial Correlation]]===
+Partial correlation measures the degree of association between two random variables by measuring variances controlling for certain other factors or variables.
-===[[SMHS_TimeSeriese| Time Series Analysis]]===
+===[[SMHS_TimeSeries| Time Series Analysis]]===
+Time series data is a sequence of data points measured at successive points in time. Time series analysis is a technique used in a variety of studies involving temporal measurements and tracking metrics.
 ===[[SMHS_FixedRandomMixedModels|Fixed, Randomized and Mixed Effect Models]]===
+Fixed effect models are statistical models that represent the observed quantities in terms of explanatory variables (covariates) treated as non-random, while random effect models assume that the dataset being analyzed consists of a hierarchy of different populations whose differences relate to that hierarchy. Mixed effect models include both fixed effects and random effects. In random effect and mixed models, all or some of the explanatory variables are treated as if they arise from random causes.
 ===[[SMHS_HLM| Hierarchical Linear Models (HLM)]]===
+Hierarchical linear models (also called multilevel models) are statistical models with parameters that vary at more than one level. They are generalizations of linear models and are widely applied, especially in research designs where participant data are organized at more than one level.
 ===[[SMHS_MultimodelInference|Multi-Model Inference]]===
+Multi-Model Inference involves selecting a model of the relationship between $Y$ (response) and predictors $X_1, X_2, ..., X_n$ that is simple, effective and retains good predictive power, as measured by the SSE, AIC or BIC.
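To make the SSE/AIC/BIC comparison concrete, here is a small sketch that scores two candidate models (intercept only versus a straight line) on hypothetical data. It assumes the common least-squares forms AIC = n ln(SSE/n) + 2k and BIC = n ln(SSE/n) + k ln(n), with additive constants dropped and k counting only the regression coefficients; other conventions also count the error variance as a parameter. All function names and data are illustrative assumptions.

```python
from math import log

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def sse(y, fitted):
    """Sum of squared errors between observations and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def aic(sse_value, n, k):
    # Gaussian least-squares form, constants dropped (assumed convention).
    return n * log(sse_value / n) + 2 * k

def bic(sse_value, n, k):
    # Same convention as aic(); penalty grows with log(n).
    return n * log(sse_value / n) + k * log(n)

# Hypothetical data: a linear trend plus a small alternating perturbation.
x = list(range(10))
noise = [0.5, -0.5] * 5
y = [1 + 2 * xi + ei for xi, ei in zip(x, noise)]
n = len(y)

mean_y = sum(y) / n
sse_null = sse(y, [mean_y] * n)                      # intercept-only model
intercept, slope = fit_line(x, y)
sse_line = sse(y, [intercept + slope * xi for xi in x])

scores = {
    "intercept only": (aic(sse_null, n, 1), bic(sse_null, n, 1)),
    "straight line": (aic(sse_line, n, 2), bic(sse_line, n, 2)),
}
best_by_aic = min(scores, key=lambda m: scores[m][0])
best_by_bic = min(scores, key=lambda m: scores[m][1])
print(best_by_aic, best_by_bic)
```

Lower AIC or BIC is better; both criteria trade goodness of fit (SSE) against the number of parameters, with BIC penalizing extra parameters more heavily as the sample size grows.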
 ===[[SMHS_MixtureModeling|Mixture Modeling]]===
+Mixture modeling is a probabilistic modeling technique for representing the presence of sub-populations within an overall population, without requiring that the observed data set identify the sub-population to which each individual observation belongs.
 ===[[SMHS_Surveys|Surveys]]===
+Survey methodologies involve data collection using questionnaires designed to improve the number of responses and the reliability of the responses in the surveys. The ultimate goal is to make statistical inferences about the population, which would depend strongly on the survey questions provided. The commonly used survey methods include polls, public health surveys, market research surveys, censuses and so on.
 ===[[SMHS_LongitudinalData|Longitudinal Data]]===
+Longitudinal data represent data collected from a population over a given time period where the same subjects are measured at multiple points in time. Longitudinal data analyses are widely used statistical techniques in many health science fields.
 ===[[SMHS_GEE| Generalized Estimating Equations (GEE) Models]]===
+Generalized estimating equations (GEE) are a method for parameter estimation when fitting [[SMHS_GLM|generalized linear models]] with a possibly unknown correlation between outcomes. GEE provides a general approach for analyzing discrete and continuous responses with marginal models and is a popular alternative to maximum likelihood estimation (MLE).
 ===[[SMHS_ModelFitting| Model Fitting and Model Quality (KS-test)]]===
+The Kolmogorov-Smirnov Test (K-S test) is a nonparametric test commonly applied to test for the equality of continuous, one-dimensional probability distributions. This test can be used to compare one sample against a reference probability distribution (one-sample K-S test) or to compare two samples (two-sample K-S test).
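The two-sample K-S statistic described above can be sketched in a few lines: it is the largest absolute gap between the empirical cumulative distribution functions (ECDFs) of the two samples. The function name <code>ks_two_sample_statistic</code> is an assumption for this example, and the sketch computes only the statistic, not a p-value.

```python
from bisect import bisect_right

def ks_two_sample_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_vals, t):
        # Fraction of observations less than or equal to t.
        return bisect_right(sorted_vals, t) / len(sorted_vals)

    # Both ECDFs are step functions that jump only at observed values,
    # so the maximum gap is attained at one of the pooled data points.
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in a + b)

# Hypothetical samples for illustration.
d_separated = ks_two_sample_statistic([1, 2, 3, 4], [5, 6, 7, 8])
d_interleaved = ks_two_sample_statistic([1, 3, 5, 7], [2, 4, 6, 8])
d_identical = ks_two_sample_statistic([1, 2, 3, 4], [1, 2, 3, 4])
print(d_separated, d_interleaved, d_identical)
```

A statistic near 1 indicates completely separated samples, while a value near 0 indicates nearly identical empirical distributions. The one-sample version replaces the second ECDF with the reference distribution's CDF.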
 ==Chapter IV: Special Topics==
-===Scientific Visualization===
+===[[SMHS_DataSimulation| Data Simulation ]]===
-===PCOR/CER methods Heterogeneity of Treatment Effects===
+This section demonstrates the core principles of simulating multivariate datasets.
-===Big-Data/Big-Science===
+===[[SMHS_LinearModeling| Linear Modeling ]]===
-===Missing data===
+This section is a review of linear modeling.
+===[[SMHS_SciVisualization| Scientific Visualization ]]===
+This section discusses how and why we should "look" at data.
+===[[SMHS_MethodsHeterogeneity| Methods for Studying Heterogeneity of Treatment Effects, Case-Studies of Comparative Effectiveness Research ]]===
+This section discusses methods for studying heterogeneity of treatment effects and case-studies of comparative effectiveness research.
+===[[SMHS_BigDataBigSci| Big-Data/Big-Science ]]===
+This section discusses structural equation modeling and generalized estimating equation (GEE) modeling. It also covers statistical validation, cross-validation, classification, and prediction.
+===[[SMHS_MissingData|Missing data]]===
+Many research studies encounter incomplete (missing) data that require special handling (e.g., teleprocessing, statistical analysis, visualization). There are a variety of methods (e.g., multiple imputation) to deal with missing data, detect missingness, impute the data, analyze the completed dataset and compare the characteristics of the raw and imputed data.
 ===Genotype-Environment-Phenotype associations===
 ===Medical imaging===
@@ Line 541: / Line 204: @@
 ===Causality/Causal Inference, SEM===
 ===Classification methods===
-===Time-series analysis===
+===[[SMHS_TimeSeriesAnalysis|Time-Series Analysis]]===
+This section discusses Time Series Analysis, a class of statistical methods for serial (time-ordered) data that aim to extract meaningful information, identify trends and characterize the underlying process using observed longitudinal data.
 ===Scientific Validation===
 ===Geographic Information Systems (GIS)===
@@ Line 548: / Line 214: @@
 ===Network Analysis===
+<hr>
+==References==
-<hr>
 * SOCR Home page: http://www.socr.umich.edu
+* [http://www.socr.umich.edu/people/dinov/SMHS_Courses.html Scientific Methods for Health Sciences (SMHS) Course Series]
+* [http://predictive.space/ Data Science and Predictive Analytics (DSPA)]
+* Dinov, ID. (2018) [http://www.springer.com/us/book/9783319723464 Data Science and Predictive Analytics: Biomedical and Health Applications using R, Springer (ISBN 978-3-319-72346-4)]
 {{translate|pageName=http://wiki.socr.umich.edu/index.php?title=Scientific_Methods_for_Health_Sciences}}