==[[SMHS| Scientific Methods for Health Sciences]] - Linear Modeling ==
The following sub-sections represent a blend of model-based and model-free scientific inference, forecasting and validity.

[[Image:SMHS_LinearModeling_Fig001.png|500px]]

===[[SMHS_LinearModeling_StatsSoftware|Statistical Software]]===

This section briefly describes the pros and cons of different statistical software platforms. The table below lists the annual number of GoogleScholar research article publications referencing R, SAS and SPSS.

<center>
{| class="wikitable" style="text-align:center; " border="1"
 
|-
 
!Year||R||SAS||SPSS
 
|-
 
|1995||8||8620||6450
 
|-
 
|1996||2||8670||7600
 
|-
 
|1997||6||10100||9930
 
|-
 
|1998||13||10900||14300
 
|-
 
|1999||26||12500||24300
 
|-
 
|2000||51||16800||42300
 
|-
 
|2001||133||22700||68400
 
|-
 
|2002||286||28100||88400
 
|-
 
|2003||627||40300||78600
 
|-
 
|2004||1180||51400||137000
 
|-
 
|2005||2180||58500||147000
 
|-
 
|2006||3430||64400||142000
 
|-
 
|2007||5060||62700||131000
 
|-
 
|2008||6960||59800||116000
 
|-
 
|2009||9220||52800||61400
 
|-
 
|2010||11300||43000||44500
 
|-
 
|2011||14600||32100||32000
 
|}
 
require(ggplot2)
require(reshape2)

# load the annual GoogleScholar publication counts for R, SAS and SPSS
Data_R_SAS_SPSS_Pubs <- read.csv('https://umich.instructure.com/files/522067/download?download_frd=1', header=T)
df <- data.frame(Data_R_SAS_SPSS_Pubs)

# convert to long format (one row per Year-Software pair)
df <- melt(df, id.vars='Year', variable.name='Software')

ggplot(data=df, aes(x=Year, y=value, colour=Software, group=Software)) +
  geom_line(size=2) + labs(x='Year', y='Citations')
 
  
===[[SMHS_LinearModeling_QC|Quality Control]]===

Discussion of data Quality Control (QC) and Quality Assurance (QA), which represent important components of data-driven modeling, analytics and visualization; a detailed treatment with R examples is included below.

===[[SMHS_LinearModeling_MLR |Multiple Linear Regression]]===

Review and demonstration of computing and visualizing the regression-model coefficients (effect-sizes), (fixed-effect) linear model assumptions, examination of residual plots, and independence.

[[Image:SMHS_LinearModeling_Fig002.png|500px]]

===[[SMHS_LinearModeling_LMM |Linear mixed effects analyses]]===

Scientific inference based on fixed and random effect models, assumptions, and mixed effects logistic regression.
 
 
 
 
 
 
 
===Quality Control===

'''Questions:'''
* Is the data what it's supposed to be (does it represent the study cohort/population)?
* How do we inspect the quality of the data?
 
 
 
Data Quality Control (QC) and Quality Assurance (QA) are important components of all modeling, analytics and visualization efforts, and should precede all subsequent data-processing steps. QC and QA may be performed manually or automatically. Statistical quality control involves quantitative methods for monitoring and controlling a process, or data derived from observing a natural phenomenon. For example, is there evidence in the plots below of a change in the mean of these processes?
 
 
 
# simulate data with base value of 100 w/ normally distributed error
 
# install.packages("qcc")
 
library(qcc)
 
demo.data.1 <- rep(100, 1000) + rnorm(1000, mean=0, sd=2)
 
qcc(demo.data.1, type="xbar.one", center=100, add.stats=FALSE,
 
    title="Simulation 1", xlab="Index")
 
 
 
[[Image:SMHS_LinearModeling_Fig003.png|500px]]
 
 
 
Now let’s introduce a trend
 
 
 
# first 800 points have base value of 100 w/ normally distributed error,
# next 100 points have base value of 105 w/ normally distributed error,
# last 100 points have base value of 110 w/ normally distributed error
M <- 110
SD <- 5
demo.data.2 <- c(rep(100, 800), rep(100+(M-100)/2, 100), rep(M, 100)) + rnorm(1000, mean=0, sd=SD)
qcc(demo.data.2, type="xbar.one", center=100, add.stats=FALSE,
    title="Simulation 2", xlab="Index")
 
 
 
[[Image:SMHS_LinearModeling_Fig004.png|500px]]
 
 
 
Our goal is to use statistical quality control to automatically identify issues with the data.  The qcc package in R provides methods for statistical quality control – given the data, it identifies candidate points as outliers based on the Shewhart Rules. Color-coding the data also helps point out irregular points.
 
 
 
The Shewhart control chart rules (dating back to the 1930's) are based on monitoring events that are unlikely to occur when the controlled process is stable. Occurrences of such atypical events are alarm signals suggesting that the stability of the process may be compromised, i.e., that the process has changed.
 
 
 
An instance of such an unlikely event is the situation when the upper/lower control limits (UCL or LCL) are exceeded. UCL and LCL are constructed as ±3σ limits, indicating that the process is under control within them. Additional warning limits (LWL and UWL) are constructed at ±2σ or ±σ. Other rules specifying events having low probability when the process is under control can be constructed:
 
 
 
1. One point exceeds the LCL/UCL.

2. Nine points above/below the central line.

3. Six consecutive points show an increasing/decreasing trend.

4. Difference of consecutive values alternates in sign for fourteen points.

5. Two out of three points exceed the LWL or UWL limits.

6. Four out of five points are above/below the central line and exceed the ±σ limit.

7. Fifteen points are within the ±σ limits.

8. Eight consecutive values are beyond the ±σ limits.
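As a minimal illustration (this helper is not part of the qcc package; the function name and thresholds below are chosen purely for demonstration), the first two rules can be checked directly on a numeric vector when the process center and standard deviation are known:

# minimal sketch: flag violations of rules 1 and 2 in a numeric vector x,
# given a known process center and standard deviation
check_rules_1_2 <- function(x, center, sd) {
  # rule 1: individual points beyond the +/- 3*sd control limits
  rule1 <- which(abs(x - center) > 3*sd)
  # rule 2: nine (or more) consecutive points on the same side of the central line
  side <- sign(x - center)
  runs <- rle(side)
  long <- which(runs$lengths >= 9 & runs$values != 0)
  rule2 <- cumsum(runs$lengths)[long]   # indices where such runs end
  list(rule1 = rule1, rule2 = rule2)
}

check_rules_1_2(demo.data.2, center = 100, sd = SD)

The qcc package performs such checks internally (see the "beyond limits" and "violating runs" summaries in its output); the sketch above is only meant to make the logic of the rules concrete.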
 
 
 
We can define training/testing (calibration/validation) datasets within qcc by providing the data we want to calibrate with as the first argument (demo.data.1), followed by the new data (demo.data.2) representing the test data.
 
 
 
# example using training (calibration) and testing (new) data
demo.data.1 <- rep(100, 1000) + rnorm(1000, mean=0, sd=2)
demo.data.2 <- c(rep(100, 800), rep(105, 100), rep(110, 100)) + rnorm(1000, mean=0, sd=2)
MyQC <- qcc(demo.data.1, newdata=demo.data.2, type="xbar.one", center=100, add.stats=FALSE, title="Simulation 1 vs. 2", xlab="Index")
plot(MyQC)   # optionally: plot(MyQC, chart.all=FALSE)
 
 
 
[[Image:SMHS_LinearModeling_Fig005.png|500px]]
 
 
 
# add warning limits at 2 std. deviations
MyQC2 <- qcc(demo.data.1, newdata=demo.data.2, type="xbar.one", center=100, add.stats=FALSE, title="Second Simulation 1 vs. 2", xlab="Index")
warn.limits <- limits.xbar(MyQC2$center, MyQC2$std.dev, MyQC2$sizes, 0.95)
plot(MyQC2, restore.par = FALSE)
abline(h = warn.limits, lty = 2, lwd = 2, col = "blue")
 
 
 
## limits.xbar(center, std.dev, sizes, conf)
* center = sample/group center statistic
* std.dev = within-group standard deviation
* sizes = sample sizes
* conf = a numeric value used to compute control limits, specifying the number of standard deviations (if conf > 1) or the confidence level (if 0 < conf < 1)
 
 
 
[[Image:SMHS_LinearModeling_Fig006.png|500px]]
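As a quick illustration of the conf argument (the numeric values below are arbitrary), a confidence level of about 0.9545 should reproduce essentially the same warning limits as specifying 2 standard deviations directly:

# conf > 1 is interpreted as a number of standard deviations
limits.xbar(center = 100, std.dev = 2, sizes = 1, conf = 2)

# 0 < conf < 1 is interpreted as a confidence level (0.9545 corresponds to ~2 SDs)
limits.xbar(center = 100, std.dev = 2, sizes = 1, conf = 0.9545)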
 
 
 
Natural processes may have errors that are non-normally distributed. However, using (appropriate) transformations we can often normalize the errors.
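For instance, here is a minimal sketch (with simulated log-normal data, unrelated to the examples above) showing how a log transformation can bring right-skewed measurements closer to normality before control charting:

# simulate right-skewed (log-normal) measurements
set.seed(1234)
skewed <- rlnorm(1000, meanlog = log(100), sdlog = 0.1)

# compare the distributional shape before and after the log transformation
qqnorm(skewed, main = "Raw (skewed) data")
qqline(skewed)
qqnorm(log(skewed), main = "Log-transformed data")
qqline(log(skewed))

# the transformed values can then be monitored with qcc as before
qcc(log(skewed), type = "xbar.one", add.stats = FALSE,
    title = "Control chart of log-transformed data", xlab = "Index")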
 
 
 
We can use thresholds to define zones in the data, where each zone represents, say, a one-standard-deviation band around the mean of the dataset.
 
 
 
find_zones <- function(x) {
 
  x.mean <- mean(x)
 
  x.sd <- sd(x)
 
  boundaries <- seq(-3, 3)
 
  # creates a set of zones for each point in x
 
  zones <- sapply(boundaries, function(i) {
 
    i * rep(x.sd, length(x))
 
  })
 
  zones + x.mean
 
}
 
head(find_zones(demo.data.2))
 
 
 
evaluate_zones <- function(x) {
 
  zones <- find_zones(x)
 
  colnames(zones) <- paste("zone", -3:3, sep="_")
 
  x.zones <- rowSums(x > zones) - 3
 
  x.zones
 
}
 
 
evaluate_zones(demo.data.2)
 
 
 
find_violations <- function(x.zones, i) {
  # rule4: the last 9 zone scores are all positive (or all negative),
  #        i.e., a long run on one side of the center
  values <- x.zones[max(i-8, 1):i]
  rule4 <- ifelse(all(values > 0), 1,
                  ifelse(all(values < 0), -1, 0))
  # rule3: at least 2 of the last 6 zone scores reach the outer zones (score >= 2 or <= -2)
  values <- x.zones[max(i-5, 1):i]
  rule3 <- ifelse(sum(values >= 2) >= 2, 1,
                  ifelse(sum(values <= -2) >= 2, -1, 0))
  # rule2: all of the last 4 zone scores are in the extreme zones (score >= 3 or <= -3)
  values <- x.zones[max(i-3, 1):i]
  rule2 <- ifelse(mean(values >= 3) >= 1, 1,
                  ifelse(mean(values <= -3) >= 1, -1, 0))
  # rule1: any of the last 4 zone scores falls in the extreme zones (score > 2 or < -2)
  values <- x.zones[max(i-3, 1):i]
  rule1 <- ifelse(any(values > 2), 1,
                  ifelse(any(values < -2), -1, 0))
  c("rule1"=rule1, "rule2"=rule2, "rule3"=rule3, "rule4"=rule4)
}

find_violations(evaluate_zones(demo.data.2), 20)
 
 
 
Now we can compute the rules for each point and assign a color to any violations.
 
 
 
library("plyr")
 
compute_violations <- function(x, start=1) {
 
  x.zones <- evaluate_zones(x)
 
  results <- ldply (start:length(x), function(i) {
 
    find_violations(x.zones, i)
 
  })
 
  results$\$$color <- ifelse(results$\$$rule1!=0, "pink",
 
                          ifelse(results$\$$rule2!=0, "red",
 
                                ifelse(results$\$$rule3!=0, "orange",
 
                                        ifelse(results$\$$rule4!=0, "yellow",
 
                                              "black"))))
 
  results
 
}
 
 
tail(compute_violations(demo.data.2))
 
 
 
Now let’s make a quality control chart.
 
 
 
# note: this user-defined plot.qcc masks the plot method for qcc objects provided by the qcc package
plot.qcc <- function(x, holdout) {
  my.qcc <- compute_violations(x, length(x) - holdout)
  bands <- find_zones(x)
  plot.data <- x[(length(x) - holdout):length(x)]
  plot(plot.data, col = my.qcc$color, type = 'b', pch = 19,
       ylim = c(min(bands), max(bands)),
       main = "QC Chart",
       xlab = "", ylab = "")
  # draw the 7 zone boundaries (mean - 3*SD through mean + 3*SD) as dashed lines
  for (i in 1:7) {
    lines(bands[, i], col = my.qcc$color[i], lwd = 0.75, lty = 2)
  }
}

demo.data.4 <- c(rep(10, 90), rep(11, 10)) + rnorm(100, mean=0, sd=0.5)
plot.qcc(demo.data.4, 100)
 
 
 
[[Image:SMHS_LinearModeling_Fig007.png|500px]]
 
 
 
Let's use the "Student's Sleep Data" (the built-in sleep dataset in R).
 
 
 
library("qcc")
 
attach(sleep)
 
q <- qcc.groups(extra, group)
 
dim(q)
 
obj_avg_test_1_2  <-  qcc(q[1:2,],  type="xbar")
 
obj_avg_test_1_5  <-  qcc(q[,1:5],  type="xbar")
 
summary(obj_avg_test_1_5)
 
obj_avg_test_train <-  qcc(q[,1:5],  type="xbar", newdata=q[,6:10])
 
 
 
[[Image:SMHS_LinearModeling_Fig008.png|500px]]
 
 
 
# How is this different from the previous (training/testing) chart?
obj_avg_new <-  qcc(q[,1:10],  type="xbar")
 
 
 
This control chart shows a solid horizontal line (the center), the upper and lower control limits (dashed lines), and the sample group statistics (e.g., means) drawn as a piece-wise line connecting the points. The bottom of the plot includes summary statistics, the number of points beyond the control limits, and the number of violating runs. If the process is "in control", we can use the estimated limits for monitoring prospective (new) data sampled from the same process/protocol. For instance,
 
 
 
obj_test_1_10_train_11_20  <-  qcc(sleep[1:10,],  type="xbar", newdata=sleep[11:20,])

plots the X-bar chart for the training (rows 1-10) and testing (rows 11-20) sleep data, where the statistics and the control limits are based on the first 10 (training) samples.
 
 
 
[[Image:SMHS_LinearModeling_Fig009.png|500px]]
 
 
 
Now try the range (R) QCC chart:

obj_R  <-  qcc(q[,1:5],  type="R", newdata=q[,6:10])
 
 
 
A control chart aims to enhance our ability to monitor and track a process proxied by the data. When special causes of variation (random or not) are present, the data may be considered "out of control". Corresponding action may need to be taken to identify, control for, or eliminate such causes. A process is declared to be "in control" if all data points are randomly spread out within the control limits. The Lower and Upper Control Limits (LCL, UCL) are usually computed as ±3σ from the center (e.g., the mean). These qcc default limits can be changed using the nsigmas argument, or by specifying the confidence level via the confidence.level argument.
 
 
 
 
 
[[Image:SMHS_LinearModeling_Fig10.png|500px]]
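For instance (reusing the simulated demo.data.1 from above), the default ±3σ limits can be tightened via nsigmas, or specified through a confidence level:

# control limits at 2 standard deviations instead of the default 3
qcc(demo.data.1, type = "xbar.one", center = 100, nsigmas = 2,
    add.stats = FALSE, title = "2-sigma limits", xlab = "Index")

# an (approximately) equivalent specification of the default via a confidence level
qcc(demo.data.1, type = "xbar.one", center = 100, confidence.level = 0.9973,
    add.stats = FALSE, title = "99.73% confidence limits", xlab = "Index")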
 
 
 
When the process is governed by a Gaussian distribution, the ±3σ limits correspond to a two-tailed probability of p=0.0027.
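This tail probability can be verified directly in R:

# two-tailed probability of falling outside the +/- 3-sigma limits under normality
2 * pnorm(-3)     # 0.002699796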
 
 
 
Finally, an operating characteristic (OC) curve shows the probability of not detecting a shift in the process (a false-negative, or type II error), i.e., the probability of erroneously accepting a process as being "in control" when, in fact, it is out of control.
 
 
 
# par(mfrow=c(1,1))
 
oc.curves(obj_test_1_10_train_11_20)
 
# oc.curves(obj_test_1_10_train_11_20, identify=TRUE) # to manually identify specific OC points
 
 
 
[[Image:SMHS_LinearModeling_Fig11.png|500px]]
 
 
 
The function oc.curves returns a matrix or a vector of probability values representing the type II errors for different sample sizes. See help(oc.curves) for all options; e.g., identify=TRUE allows you to interactively identify values on the plot.
 
 
 
Notice that the OC curve is "S"-shaped. As expected, the probability of acceptance decreases as the percent of non-conforming values increases. A small sub-sample, instead of inspecting the entire dataset, may be used to determine the quality of a process. We can accept the data as in control as long as the process percent non-conforming is below a predefined level.
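To make the type II error calculation behind such curves concrete, here is a small sketch for an X-bar chart with the usual ±3σ limits (the shift sizes and subgroup sizes below are arbitrary choices for illustration): the probability of not detecting a mean shift of δ process standard deviations with subgroups of size n is β = Φ(3 − δ√n) − Φ(−3 − δ√n).

# type II error (probability of NOT detecting a mean shift) for an X-bar chart with 3-sigma limits
beta <- function(delta, n) pnorm(3 - delta*sqrt(n)) - pnorm(-3 - delta*sqrt(n))

shifts <- seq(0, 3, by = 0.1)      # shift size, in units of the process SD
plot(shifts, beta(shifts, n = 5), type = "l", lwd = 2,
     xlab = "Process shift (SD units)", ylab = "Prob. of not detecting the shift",
     main = "Operating characteristic (OC) curves")
lines(shifts, beta(shifts, n = 1),  lty = 2)
lines(shifts, beta(shifts, n = 10), lty = 3)
legend("topright", legend = c("n = 5", "n = 1", "n = 10"), lty = 1:3, lwd = c(2, 1, 1))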
 
 
 
 
 
===Multiple Linear Regression===

'''Questions:'''
* Are there (linear) associations between predictors and a response variable(s)?
* When should we look for linear relations, and what assumptions play a role?
 
 
 
Let’s use some of the data included in the Appendix. Snapshots of the first few rows in the data are shown below:
 
 
 
[[Image:SMHS_LinearModeling_Fig12.png|300px]] [[Image:SMHS_LinearModeling_Fig13.png|400px]]
 
 
 
Eyeballing the data may suggest that Race "2" subjects may have higher "Weights" (first dataset), or that "Treatment A" yields higher "Observations" (second dataset). However, this cursory observation may be incorrect, as we may be looking at a small or exceptional sub-sample of the entire population. Data-driven linear modeling allows us to quantify such patterns and compute probability values expressing the strength of the evidence in the data for such conclusions. In a nutshell, we express relationships of interest (e.g., weight as a function of race, or observation as a function of treatment) using a linear function as an analytical representation of these inter-variable relations:
 
 
 
<BLOCKQUOTE>Weight ~ Race, Weight ~ Genotype, Obs ~ Treatment, Obs ~ Day, etc.<BR>
W = a + b*R.
</BLOCKQUOTE>
 
 
 
This "~" (tilde) notation implies "Weight predicted by Race" or "Observation as a linear function of Treatment". The "dependent variable" (a measurable response) on the left is predicted by the factor on the right, acting as an "independent variable", covariate, predictor, "explanatory variable", or "fixed effect".
 
 
 
Oftentimes inter-dependencies may not be perfect, deterministic, or rigid (unlike predicting the area of a disk knowing its radius, or predicting the 3D spatial location of a planet at a precise date and time). Weight may not be completely and uniquely determined by Race alone, as many different factors play a role (genetics, environment, aging, etc.). Even if we could measure all factors we can identify as potentially influencing Weight, there would still be intrinsic random variability in the observed Weight measurements, which cannot be controlled for. This intrinsic random variation may be captured and accounted for using a "random" term at the end:
 
 
 
<BLOCKQUOTE>Weight ~ Race + ε, where ε ~ D(m=0, s).
</BLOCKQUOTE>
 
 
 
Epsilon "ε" represents the error term of predicting Weight by Race alone; it summarizes the aggregate impact of all factors aside from Race that affect an individual's Weight but are experimentally uncontrollable or random. This formula is a schematic analytic representation of the linear model that we are going to estimate, quantify, and use for prediction and inference. The right-hand side splits the knowledge representation of Weight into two complementary components: a "fixed effect" for Race, which we understand and expect, and a "random effect" ("ε") capturing what we do not know well. (W ~ R represents the "structural" or "systematic" part of the linear model, and "ε" stands for the "random" or "probabilistic" part of the model.)
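A minimal sketch of this decomposition with simulated data (the group labels, effect sizes, and error variance below are invented purely for illustration):

set.seed(42)
# simulate a two-group Race factor and a Weight that depends on it plus random error
Race   <- factor(rep(c("1", "2"), each = 50))
Weight <- 20 + 5*(Race == "2") + rnorm(100, mean = 0, sd = 2)   # a = 20, b = 5, e ~ N(0, 2)

# fit the fixed-effect (structural) part; the residuals estimate the random part
fit <- lm(Weight ~ Race)
summary(fit)           # the intercept estimates a; the Race2 coefficient estimates b
head(residuals(fit))   # estimates of the random error term e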
 
 
 
 
 
.....
 
 
 
 
 
 
 
 
 
....
 
 
 
  
===[[SMHS_LinearModeling_MachineLearning|Machine Learning Algorithms]]===

Data modeling, training, testing, forecasting, prediction, and simulation.
  
 
<hr>

* SOCR Home page: http://www.socr.umich.edu

{{translate|pageName=http://wiki.socr.umich.edu/index.php?title=SMHS_LinearModeling}}
