# SMHS CommonMistakesMisconceptions

## Scientific Methods for Health Sciences - Common Probability Mistakes and Statistics Misconceptions

### Overview

Statistical methods, mathematical abstractions and detailed physics or biomedical experiments all require a focused, structured and coordinated exposure to appropriate motivation, theoretical foundations and explicit applications. Stochastic reasoning is critical in many studies involving statistical, computational or modeling techniques for advanced data analytics. Because such studies involve multi-disciplinary interactions, broad background-knowledge requirements, and the transfer of concepts from one area or topic to another, various errors or misconceptions about statistical concepts can creep into the process and may negatively affect the final study conclusions.

### Motivation

Challenges associated with deep understanding of statistical methods may be rooted in:

• The fact that statistical concepts such as probability, correlation and percent often require proportional reasoning,
• The existence of false a priori intuitions,
• A perceived distaste for statistics, due to prior study of probability and statistics in a highly abstract and formal way.

Furthermore, probability and statistics are rather contemporary scientific fields, now well formulated using an established axiomatic system for probability. During the learning process, young scientists may experience difficulties in overcoming epistemological challenges rooted in the historical development of this knowledge. Presenting statistical concepts in isolation from the original applications that drove their global adoption may also affect learners' perception of the field. For instance, the concept of the mean has different interpretations when applied to the center of gravity, life expectancy, or an index number.

### Exemplary Misconceptions

#### EDA: Frequency tables and graphical representation of data

Many errors and difficulties in learning data-driven statistics relate to reading frequency tables and graphical representations. The ability to critically read data is a necessary component of numerical literacy. Examples of different levels of data comprehension include:

• Reading the data, where no interpretation is needed; only facts explicitly expressed in the graph or table are required.
• Reading within the data, which requires comparisons and the use of mathematical concepts and skills.
• Reading beyond the data, where an extension, prediction or inference is needed.

For example, when interpreting a scatter plot, ‘reading the data’ involves questions about the labeling of the plot, the interpretation of scales, or finding the value of one variable corresponding to a given value of the other. Deeper levels of reading involve interpreting the intensity of the co-variation, modeling the bivariate relationship by a linear function, or inferring whether the dependence may be direct, inverse or causal.

Common errors in graphs that hinder this kind of reading include:

• The scales of either or both the vertical and horizontal axes are omitted.
• The origin of coordinates is not specified.
• Insufficient divisions in scales on the axes are provided.
• The axes are not labelled.
• Scale-breaks are inserted.

#### Distribution Summaries

• The mean is an important summary statistic, but it is easy to misuse. A common error is to combine two weighted means as if they were simple means:
• There are ten people in an elevator, 3 women and 7 men. The average weight of the women is 120 pounds and the average weight of the men is 180 pounds. What is the average weight of the ten people in the elevator?
There are a number of ways to go wrong when applying the computational rule. For example, using (120+180)/2 = 150 is incorrect, as there are two distributions of different sizes involved; the group means must be weighted by group size, giving (3×120 + 7×180)/10 = 162 pounds.
• Suppose the average SAT score in the population of high school seniors in a large school system is 1,600. We choose a random sample of 6 students, and the first 5 students score 1300, 1400, 1600, 1400 and 1700. What is the expected score of the 6th student?
The correct answer is 1,600, the expected value in the population, not the average of the 5 sample scores.
Finally, explore the relation between the mean and other measures of centrality.
• Measures of spread: A frequent error in data analysis is to ignore the spread of the data. The standard deviation measures how strongly data depart from the central tendency. Many related terms are also in use (variation, dispersion, diversity, fluctuation, etc.), which are related but have different interpretations. Suppose we have two different sets of blocks, A and B, where the lengths of blocks in set A are: 10, 20, 30, 40, 50 and 60 cm, and the lengths of blocks in set B are: 10, 10, 10, 60, 60 and 60 cm. Which set is more variable? Measured by the standard deviation, set B is more variable, since its values lie farther from the common mean of 35 cm. Note that the concept of variability is different from 'unalikability', the latter indicating how much the values differ from each other (rather than differ from some fixed value such as the mean).
• Order statistics yield robust metrics that are not very sensitive to fluctuations in the data or to outliers, and many non-parametric methods are based on them. These methods require fewer assumptions, and so can be applied more widely than classical parametric inference techniques. Interpreting order statistics correctly poses both computational and conceptual challenges.
• The computation of median, quantiles and percentiles is taught with a different algorithm for data grouped in intervals than for non-grouped data. The decision of whether or not to group the data and the selection of the width of intervals is commonly made by the researcher performing the analysis.
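
The two numerical examples above (the elevator weights and the block lengths) can be checked with a short Python sketch. The helper names `combined_mean` and `share_of_unlike_pairs` are illustrative, and using the proportion of pairs with different values as a stand-in for 'unalikability' is an assumption made here, not a definition taken from this text.

```python
from itertools import combinations
from statistics import mean, pstdev


def combined_mean(group_means, group_sizes):
    """Mean of the pooled data, weighting each group mean by its group size."""
    total = sum(m * n for m, n in zip(group_means, group_sizes))
    return total / sum(group_sizes)


def share_of_unlike_pairs(values):
    """Proportion of distinct pairs whose two values differ.

    Assumption: one simple way to quantify 'unalikability'.
    """
    pairs = list(combinations(values, 2))
    return sum(x != y for x, y in pairs) / len(pairs)


# Elevator example: 3 women averaging 120 lb and 7 men averaging 180 lb.
naive_average = mean([120, 180])                   # ignores group sizes
pooled_average = combined_mean([120, 180], [3, 7])  # weights by group size

# Block example: both sets have mean 35 cm but differ in spread.
set_a = [10, 20, 30, 40, 50, 60]
set_b = [10, 10, 10, 60, 60, 60]
sd_a, sd_b = pstdev(set_a), pstdev(set_b)           # population standard deviations
unlike_a = share_of_unlike_pairs(set_a)
unlike_b = share_of_unlike_pairs(set_b)

print(f"naive average = {naive_average}, pooled average = {pooled_average}")
print(f"SD(A) = {sd_a:.2f}, SD(B) = {sd_b:.2f}")
print(f"unlike pairs: A = {unlike_a:.2f}, B = {unlike_b:.2f}")
```

Under these assumptions, set B has the larger standard deviation while set A has the larger share of unlike pairs, which illustrates why the two notions should not be conflated.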

#### Association

Association and regression extend functional dependence and allow us to model a diverse array of natural processes. Association generally refers to the existence of statistical dependence between two quantitative or qualitative variables. Correlation usually refers to a specific type of linear association between two quantitative variables. Neither of these concepts necessarily implies a cause-effect relationship, only the existence of a co-variation between the variables. Association studies involve contingency tables, linear regression, correlation between quantitative variables, and experimental design.

| Factors | Factor1_Level1 | Factor1_Level2 | Row_Total |
| --- | --- | --- | --- |
| Factor2_Level1 | a | b | a+b |
| Factor2_Level2 | c | d | c+d |
| Column_Total | a+c | b+d | a+b+c+d |

For example, in studies of 2x2 contingency tables, the absolute frequency contained in a cell, for example cell a (see table above), yields 3 different types of relative frequencies that can be calculated:
• unconditional relative frequency: [a/(a+b+c+d)]
• conditional relative frequencies given the rows: [a/(a+b)]
• conditional relative frequencies given the columns: [a/(a+c)]
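
As a minimal sketch, the three relative frequencies for cell a can be computed directly from the four cell counts. The function name `relative_frequencies` and the counts used below are hypothetical and chosen only for illustration.

```python
def relative_frequencies(a, b, c, d):
    """Three relative frequencies for cell a of a 2x2 contingency table.

    The cells follow the layout in the table above:
    first row (a, b), second row (c, d).
    """
    total = a + b + c + d
    return {
        "unconditional": a / total,    # a / (a+b+c+d)
        "given_row": a / (a + b),      # conditional on the first row
        "given_column": a / (a + c),   # conditional on the first column
    }


# Hypothetical counts, for illustration only.
freqs = relative_frequencies(a=20, b=30, c=10, d=40)
for name, value in freqs.items():
    print(f"{name}: {value:.3f}")
```

The same absolute frequency gives three different numbers, so any statement about "the proportion in cell a" needs to say which reference total it uses.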

#### Inference/Sampling

A sample provides partial information, a window, about the population and in this way increases our knowledge about the population. In general, a sample does not contain complete information about a process.

People often judge the likelihood of a sample by how well it represents some characteristics of the parent population. This leads to insensitivity to the sample size and over-confidence in small samples (known as 'belief in the law of small numbers').

Suppose a town has 2 universities, a larger one with about 3000 students and a smaller one, and in both about 50% of the graduates are female. If there are small variations in the gender of graduates each year, and each university recorded the years when ≥ 60% of the graduates were female, which university is expected to have recorded more such (gender-unbalanced) years?
(a) The larger university
(b) The smaller university
(c) Both universities about equally

Some people believe that the answer must be (c), about the same for both universities, because the proportion of female and male graduates is the same in both and they take this to be the only fact relevant to the probability of the required events. However, we need to account for the difference in sample sizes: probability theory implies more fluctuation in the value of a proportion in small samples than in large samples, so the smaller university, answer (b), is expected to record more such years.
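
The effect of sample size can be illustrated with an exact binomial calculation. This is a sketch under explicit assumptions: each graduate is treated as an independent draw with probability 1/2 of being female, the cohort of 3000 stands in for the larger university, and the cohort of 300 for the smaller university is a hypothetical value, since the text does not give its size.

```python
from math import comb


def prob_share_at_least(n, threshold_num=6, threshold_den=10):
    """Exact P(X / n >= threshold) for X ~ Binomial(n, 1/2).

    The threshold is given as a fraction (default 6/10) so that the
    cut-off count can be found with integer arithmetic only.
    """
    # Smallest count k with k / n >= threshold, i.e. the ceiling of threshold * n.
    k_min = -(-threshold_num * n // threshold_den)
    favourable = sum(comb(n, k) for k in range(k_min, n + 1))
    return favourable / 2 ** n


n_large = 3000  # cohort size taken from the example above
n_small = 300   # hypothetical cohort size for the smaller university

p_large = prob_share_at_least(n_large)
p_small = prob_share_at_least(n_small)

print(f"P(at least 60% female), n = {n_large}: {p_large:.3g}")
print(f"P(at least 60% female), n = {n_small}: {p_small:.3g}")
```

Under these assumptions, a year with at least 60% female graduates is far more likely in the smaller cohort, which is the point of answer (b).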

#### Test of hypothesis

The term test of hypothesis applies to a great number of statistical procedures (tests of differences between means, analysis of variance, non-parametric procedures, multivariate tests, etc.) that share a common ground: a set of basic concepts (null and alternative hypotheses, level of significance, power function, etc.) and some general procedures that are modified for particular cases. The correct application of these procedures involves many kinds of choices, including the sample size, the level of significance $\alpha$ and the appropriate test statistic.

Aspects that are often difficult to understand include:

• Determination of the null ($H_o$) and alternative/research ($H_1$) hypotheses,
• Distinction between type I (false-positive) and type II (false-negative) errors,
• Power and operating characteristic curves,
• Terminology used in stating the decision.

A key aspect in the correct application of a test of hypothesis is understanding the concept of level of significance, which is defined as the 'probability of rejecting the null hypothesis when it is true': $$\alpha = P(\text{Rejecting } H_o | H_o \text{ is true})$$

A common error is to interchange the event and the condition in this definition, mistakenly interpreting the level of significance as 'the probability that the null hypothesis is true, once the decision to reject it has been taken', i.e., $$P(H_o \text{ is true} | \text{Rejecting } H_o)$$
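
The difference between the two conditional probabilities can be made concrete with Bayes' rule. The sketch below is illustrative only: it assumes a prior probability that the null hypothesis is true and a known power, neither of which appears in the text, and the function name and numerical values are hypothetical.

```python
def prob_null_true_given_rejection(alpha, power, prior_null):
    """P(H_o is true | H_o rejected), computed with Bayes' rule.

    alpha      -- P(reject H_o | H_o is true), the level of significance
    power      -- P(reject H_o | H_o is false)
    prior_null -- assumed prior probability that H_o is true
    """
    reject_and_null_true = alpha * prior_null
    reject_and_null_false = power * (1 - prior_null)
    return reject_and_null_true / (reject_and_null_true + reject_and_null_false)


# Hypothetical values, for illustration only.
alpha = 0.05
posterior = prob_null_true_given_rejection(alpha=alpha, power=0.8, prior_null=0.9)

print(f"alpha = P(reject | H_o true)  = {alpha}")
print(f"P(H_o true | reject)          = {posterior:.2f}")
```

With these assumed inputs the second probability is 0.36, far from the significance level of 0.05, so the two quantities cannot be used interchangeably.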