# SMHS EDA

## Scientific Methods for Health Sciences - Exploratory Data Analysis (EDA), Charts and Plots

### Overview

• What is data? Data is a collection of facts, observations or information, such as values or measurements. Data can be numbers, measurements, or even just descriptions of things (meta-data). Data types fall into two broad categories: quantitative (numerical information) and qualitative (descriptive information).
• Quantitative data is anything that can be expressed as a number, or quantified. For example, the scores on a math test or the weights of girls in the fourth grade are both quantitative data. Quantitative data is often referred to as measurable data, and it allows scientists to perform arithmetic operations, such as addition, multiplication and functional evaluation, or to estimate parameters of a population. There are two major types of quantitative data: discrete and continuous.
• Discrete data takes values from a finite, or infinite but countable, set of possible options. The values of this data type form a sequence of isolated, separated points on the real number line.
• Continuous quantitative data takes values from an infinite and dense set of possible values.
• Qualitative data cannot be expressed as a number. Examples include gender and religious preference. Categorical data (qualitative or nominal) results from placing individuals into groups or categories. Ordinal and nominal categorical data types both fall into this category.

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics. Modern statistics regards the graphical visualization and interrogation of data as a critical component of any reliable method for statistical modeling, analysis and interpretation of data. Formally, there are two types of data analysis that should be employed in concert on the same set of data to make a valid and robust inference: graphical techniques and quantitative techniques. We will discuss many of these later, but below is a snapshot of EDA approaches:

• Box plot, Histogram; Multi-vari chart; Run chart; Pareto chart; Scatter plot; Stem-and-leaf plot;
• Parallel coordinates; Odds ratio; Multidimensional scaling; Targeted projection pursuit; Principal component analysis; Multi-linear PCA; Projection methods such as grand tour, guided tour and manual tour.
• Median polish, Trimean.

### Motivation

A feel for the data comes most clearly from applying various graphical techniques, which serve as a window onto the data for human perception and judgment. The primary goal of EDA is to maximize the analyst’s insight into a data set and its underlying structure. To get a feel for the data, it is not enough for the analyst to know what is in the data; he or she must also know what is not in the data. The only way to do that is to draw on our own pattern recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data. The main objectives of EDA are to:

• Suggest hypotheses about the causes of observed phenomena;
• Assess (parametric) assumptions on which statistical inference will be based;
• Support the selection of appropriate statistical tools and techniques;
• Provide a basis for further data collection through surveys and experiments.

### Theory

Many EDA techniques have been proposed, validated and adopted for various statistical methodologies. Here is an introduction to some of the frequently used EDA charts and the quantitative techniques.

• Box-and-Whisker plot: It is an efficient way of presenting data, especially for comparing multiple groups of data. In the box plot, we mark off the five-number summary of a data set (minimum, 25th percentile, median, 75th percentile, maximum). The box contains the middle 50% of the data. The upper edge of the box represents the 75th percentile, while the lower edge is the 25th percentile. The median is represented by a line drawn inside the box. If the median is not in the middle of the box, the data are skewed. The ends of the lines (whiskers) represent the minimum and maximum values of the data set, unless there are outliers. Outliers are observations below $$Q_1-1.5(IQR)$$ or above $$Q_3+1.5(IQR)$$, where $$Q_1$$ is the 25th percentile, $$Q_3$$ is the 75th percentile, and $$IQR=Q_3-Q_1$$ (the interquartile range). When there are outliers, they are plotted as individual points and the whiskers end at the most extreme observations that are not outliers. The advantages of a box plot are that it shows graphically the location and the spread of the data set, it gives an idea of the skewness of the data set, and it allows variables to be compared using side-by-side box plots.
• Histogram: It is a graphical visualization of tabulated frequencies or counts of data within an equally spaced partition of the range of the data. It shows what proportion of the measurements falls into each of the categories (bins) defined by the partition of the data range.
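
The quantities behind the box plot and the histogram described above can be computed directly. The following is a minimal sketch, not a definitive implementation: the function names and the `sample` data are hypothetical, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`, which is an assumption since other quartile conventions give slightly different values of $$Q_1$$ and $$Q_3$$.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of a list of numbers."""
    data = sorted(values)
    # Assumed quartile convention: "inclusive" interpolation.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


def outliers(values):
    """Return the observations below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def histogram_counts(values, bins):
    """Count observations in `bins` equal-width intervals over the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical: put everything in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value is assigned to the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample, used for illustration only.
sample = [-2, 1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 10]
print(five_number_summary(sample))  # (-2, 3.0, 4.0, 5.0, 10)
print(outliers(sample))             # [-2, 10]
print(histogram_counts(sample, 4))  # [1, 4, 6, 2]
```

Dividing each count by the number of observations gives the proportion of measurements in each bin.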

Comment: Comparing the two series in the histogram above, we can easily tell that the pattern of series 2 is more obvious than that of series 1. Our intuition may come from the fact that series 1 has more extreme values across the five days: for example, the values for Jan 1st and Jan 3rd are extremely high (almost 55 for Jan 1st), while the value for Jan 4th is almost -12. The values for series 2, however, are all above 0 and fluctuate between 5 and 20.

Comment: The dot chart above gives a clear picture of the values of all the data points and makes the basic summary measures easy to read. We can tell that most of the values fluctuate between 1 and 7, with mean 3.9 and median 4. There are two obvious outliers, with values -2 and 10.

• Scatter plot: It uses Cartesian coordinates to display the values of two variables for a set of data, shown as a collection of points. The value of one variable determines a point's position on the horizontal axis, and the value of the other determines its position on the vertical axis.

Comment: The x and y axes display values for two variables and all the data points drawn in the chart are coordinates indicating a pair of values for both variables.

For the first series, all the data points lie on or above the diagonal line, so as x increases, the paired y value increases at least as fast as x. We can infer a positive linear association between X and Y.

For the second series, most of the data points lie along the line, except for two outliers, (4,8) and (1,5). So for most data points, as x increases, the paired y value decreases at a rate slower than or equal to the rate at which x increases. We may infer a negative linear association between X and Y.

For the third series, we cannot fit a linear association between X and Y; a quadratic pattern would work better here.
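
The three patterns described in these comments can also be checked numerically. The sketch below uses the Pearson correlation coefficient, which is not part of the discussion above and is introduced here only as one common way to quantify linear association. The function name `pearson_r` and all data values are hypothetical and are not the series shown in the chart.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical series, for illustration only.
xs = [1, 2, 3, 4, 5]
increasing = [2, 4, 6, 8, 10]   # y grows linearly with x
decreasing = [10, 8, 6, 4, 2]   # y falls linearly as x grows
quadratic = [4, 1, 0, 1, 4]     # y = (x - 3)^2, a symmetric curve

print(pearson_r(xs, increasing))  # close to +1
print(pearson_r(xs, decreasing))  # close to -1
print(pearson_r(xs, quadratic))   # close to 0 despite a clear pattern
```

The last case is the reason a scatter plot is worth drawing before relying on a single number: a coefficient near zero rules out a linear association, not an association of any kind.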

• QQ Plot (Quantile-Quantile Plot): The observed values are plotted against theoretical quantiles. A line of good fit is drawn to show the behavior of the data values against the theoretical distribution. If $$F$$ is a cumulative distribution function, then a quantile $$q$$, also known as a percentile, is defined as a solution to the equation $$F(q)=p$$, that is, $$q=F^{-1}(p)$$.

Comment: From the chart above, we can see that the data roughly follow a normal distribution, since the data points (shown in red) lie alongside the line. However, the fit is not tight, because some data points lie fairly far from the line. We can also infer that the sampled data may not be representative enough of the population because of the limited sample size.
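
The pairing of observed values with theoretical quantiles $$q=F^{-1}(p)$$ can be sketched as follows for a normal reference distribution. This is a minimal illustration under stated assumptions: the function name `normal_qq_points` and the data are hypothetical, the normal distribution is fitted with the sample mean and standard deviation, and the plotting position $$p=(i-0.5)/n$$ is one of several conventions in use.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each sorted observation with a theoretical normal quantile.

    Returns a list of (theoretical_quantile, observed_value) tuples.
    Needs at least two observations that are not all equal.
    """
    data = sorted(values)
    n = len(data)
    # Reference distribution F: a normal fitted to the sample (assumption).
    fitted = NormalDist(mean(data), stdev(data))
    points = []
    for i, observed in enumerate(data, start=1):
        p = (i - 0.5) / n                      # assumed plotting position
        points.append((fitted.inv_cdf(p), observed))  # q = F^{-1}(p)
    return points


# Hypothetical measurements, for illustration only.
measurements = [2.1, 3.4, 1.9, 4.8, 3.0]
for theoretical, observed in normal_qq_points(measurements):
    print(f"{theoretical:6.3f}  {observed:6.3f}")
```

If the data come from the reference distribution, the printed pairs lie close to the line where the theoretical and observed values are equal.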

• Median polish: An EDA procedure proposed by John Tukey. It fits an additive model to data in a two-way layout table, of the form row effect + column effect + overall mean. It is an iterative algorithm for removing trends by computing medians for various coordinates on the spatial domain D. The median polish algorithm assumes: m(s) = grand effect + row(s) + column(s).

• Steps in the algorithm:
• Step 1: Take the median of each row and record the value to the side of the row. Subtract the row median from each point in that particular row.
• Step 2: Compute the median of the row medians, and record the value as the grand effect. Subtract this grand effect from each of the row medians.
• Step 3: Take the median of each column and record the value beneath the column. Subtract the column median from each point in that particular column.
• Step 4: Compute the median of the column medians, and add the value to the current grand effect. Subtract this same value from each of the column medians.
• Step 5: Repeat steps 1-4 until no changes occur with the row or column medians.
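
The steps above can be sketched in code as follows. This is one possible reading of the procedure rather than a reference implementation: the function name `median_polish`, the `max_iter` and `tol` parameters, and the example table are hypothetical. In repeated passes, the sketch accumulates each new row or column median into the running row and column effects, which is an assumption about how the recorded values are carried from one pass to the next.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-9):
    """Fit table[i][j] ~ grand + row_effects[i] + col_effects[j] by medians.

    Returns (grand, row_effects, col_effects, residuals).
    """
    residuals = [list(row) for row in table]
    n_rows, n_cols = len(residuals), len(residuals[0])
    grand = 0.0
    row_effects = [0.0] * n_rows
    col_effects = [0.0] * n_cols

    for _ in range(max_iter):
        # Step 1: subtract each row median from its row.
        row_medians = [median(row) for row in residuals]
        for i in range(n_rows):
            residuals[i] = [v - row_medians[i] for v in residuals[i]]
            row_effects[i] += row_medians[i]
        # Step 2: move the median of the row effects into the grand effect.
        shift = median(row_effects)
        grand += shift
        row_effects = [r - shift for r in row_effects]

        # Step 3: subtract each column median from its column.
        col_medians = [median(col) for col in zip(*residuals)]
        for i in range(n_rows):
            residuals[i] = [v - col_medians[j]
                            for j, v in enumerate(residuals[i])]
        col_effects = [c + m for c, m in zip(col_effects, col_medians)]
        # Step 4: move the median of the column effects into the grand effect.
        shift = median(col_effects)
        grand += shift
        col_effects = [c - shift for c in col_effects]

        # Step 5: stop once a full pass changes nothing.
        if all(abs(m) <= tol for m in row_medians + col_medians):
            break

    return grand, row_effects, col_effects, residuals


# Hypothetical table built as 10 + row effect + column effect.
table = [[7, 9, 11],
         [8, 10, 12],
         [9, 11, 13]]
grand, rows, cols, resid = median_polish(table)
print(grand, rows, cols)  # 10.0 [-1.0, 0.0, 1.0] [-2.0, 0.0, 2.0]
```

At every stage, each original entry equals the grand effect plus its row effect, its column effect and its residual, so what remains in the residuals is the part of the table the additive model does not explain.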

• Trimean: It is a measure of a probability distribution’s location, defined as a weighted average of the distribution’s median ($$Q_2$$) and its two quartiles ($$Q_1$$ and $$Q_3$$). It combines the median’s emphasis on center values with the midhinge’s attention to the extremes. It is a remarkably efficient estimator of the population mean, especially for large data sets (say, more than 100 points) from a symmetric population.
$$TM=\frac{Q_1+2Q_2+Q_3}{4}$$
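
As a short worked example of the formula, the sketch below computes the trimean of two small hypothetical data sets. The function name `trimean` is illustrative, and the quartiles again follow the "inclusive" convention of `statistics.quantiles`, which is an assumption since other quartile definitions give slightly different values.

```python
from statistics import mean, quantiles


def trimean(values):
    """Weighted average of the quartiles: (Q1 + 2*Q2 + Q3) / 4."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


# Hypothetical data, for illustration only.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print(trimean(clean), mean(clean))                # 5.0 5
print(trimean(with_outlier), mean(with_outlier))  # 5.0 14
```

Replacing the largest value with an extreme one leaves the quartiles, and therefore the trimean, unchanged, while the mean moves from 5 to 14.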

### Applications

4.1) This article presents a thorough introduction to EDA. It covers the basic concepts, objectives and techniques of EDA, and includes case studies of EDA applications. The case studies cover eight kinds of charts for univariate data, as well as reliability and multi-factor studies. The article gives specific examples with background, output and interpretation of results, and is a great resource for studying EDA and charts.

4.2) This article starts with a general introduction to data analysis and works through EDA using examples of different graphical analyses. It serves as a more basic and general introduction to the concept and is a good starting point for studying EDA and charts.

4.3) The SOCR Motion Charts Project enables complex data visualization; see the SOCR MotionChart webapp (http://socr.umich.edu/HTML5/MotionChart/). The SOCR Motion Charts provide an interactive infrastructure for discovery-based exploratory analysis of multivariate data.

Now, we want to explore the relation between two variables in the dataset: UR (Unemployment Rate) and HPI (Housing Price Index) in the state of Alabama from 2000 to 2006. First, how does the UR in Alabama change from 2000 to 2006?

From this chart, we can see that the UR of Alabama increases from 2000 to 2003, then decreases sharply from 2004 to 2006. So you may wonder: what is the UR for states in other parts of the country over the same period?

It seems that they all follow similar patterns over the years. So we want to study any existing patterns between UR and HPI in a single state, say Alabama, over the years.

The chart above shows that HPI increases over the years for Alabama, while UR first increases and then drops sharply between 2004 and 2006. If there is any association between UR and HPI, it appears to be quadratic rather than linear. Similarly, if we extend the graph to all three states from different regions, we have the following chart:

So, the question could be: is there any association between UR and HPI among all states based on the chart?

The motion chart, however, makes the study much more interesting by showing UR vs. HPI for 51 states from different areas, animated over 2000 to 2006. So we can get an idea of how the values change over the years among all states. You are welcome to play with the data, using the link listed above, to see how the chart changes.

### Problems

6.1) Work on the problems in Uniform Random Numbers and Random Walk from the Case Studies section at http://www.itl.nist.gov/div898/handbook/eda/section4/eda42.htm.

Two random samples were taken to determine the difference in backpack load, in pounds, between seniors and freshmen. The following are the summaries:

| Year | Mean | SD | Median | Min | Max | Range | Count |
|---|---|---|---|---|---|---|---|
| Freshmen | 20.43 | 4.21 | 17.2 | 5.78 | 31.68 | 25.9 | 115 |
| Senior | 18.67 | 4.21 | 18.67 | 5.31 | 27.66 | 22.35 | 157 |

6.2) Which of the following plots would be the most useful in comparing the two sets of backpack weights? Choose one answer.

(a) Histograms

(b) Dot Plots

(c) Scatter Plots

(d) Box Plots

6.3) School administrators are interested in examining the relationship between height and GPA. What type of plot should they use to display this relationship? Choose one answer.

(a) box plot

(b) scatter plot

(c) line plot

(d) dot plot

6.4) What would be the most appropriate plot for comparing the heights of the 8th graders from different ethnic backgrounds? Choose one answer.

(a) bar charts

(b) side by side boxplot

(c) histograms

(d) pie charts

6.5) There is a company in which a very small minority of males (3%) receive three times the median salary of males, and a very small minority of females (3%) receive one-third of the median salary of females. What do you expect the side-by-side boxplot of male and female salaries to look like? Choose one answer.

(a) Both boxplots will be skewed and the median line will not be in the middle of any of the boxes.

(b) Both boxplots will be skewed, in the case of the females the median line will be close to the top of the box and in the case of the males the median line will be closer to the bottom of the box.

(c) Need to have the actual data to compare the shape of the boxplots.

(d) Both boxplots will be skewed, in the case of males the median line will be close to the top of the box and for the females the median line will be closer to the bottom of the box.

6.6) A researcher has collected the following information on a random sample of 200 adults in the 40-50 age range: weight in pounds; heart beats per minute; smoker or non-smoker; single or married. He wants to examine the relationship between: 1) heart beats per minute and weight, and 2) smoking and marital status. Choose one answer.

(a) He should draw a scatter plot of heart beat and weight, and a segmented bar chart of smoking and marital status.

(b) He should draw a side by side boxplot of heart beat and weight and a scatterplot of smoking and marital status.

(c) He should draw a side by side boxplot of smoking and marital status and a segmented bar chart of heart beat and weight.

(d) He should draw a back to back stem and leaf plot of weight and heart beat and examine the cell frequencies in the contingency table for smoking by marital status.

6.7) As part of an experiment in perception, 160 UCLA psych students completed a task on identifying similar objects. On average, the students spent 8.25 minutes on the task, with a standard deviation of 2.4 minutes. However, the minimum time was 2.3 minutes and one student worked for almost 60 minutes. What is the best description of the histogram of times that students spent on this task? Choose one answer.

(a) The histogram of times could be symmetrical and not normal with major outliers.

(b) The histogram of times could be left skewed, and in case there are any outliers, it is likely that they will be smaller than the mean.

(c) The histogram of times could be right skewed, and in the case of any outliers, it is likely that they will be larger than the mean.

(d) The histogram of times could be normal with no major outliers.

### References