
==Scientific Methods for Health Sciences - Linear Modeling==

===Statistical Software - Pros/Cons Comparison===

===Quality Control===

Questions:

* Is the data what it is supposed to be (i.e., does it represent the study cohort/population)?
* How can we inspect the quality of the data?

Data Quality Control (QC) and Quality Assurance (QA) are important components of all modeling, analytics and visualization efforts, and they precede all subsequent data-processing steps. QC and QA may be performed manually or automatically. Statistical quality control involves quantitative methods for monitoring and controlling a process, or the data derived from observing a natural phenomenon. For example, is there evidence in the plots below of a change in the mean of these processes?

 # simulate data with base value of 100 w/ normally distributed error
 # install.packages("qcc")
 library(qcc)
 demo.data.1 <- rep(100, 1000) + rnorm(1000, mean=0, sd=2)
 qcc(demo.data.1, type="xbar.one", center=100, add.stats=FALSE,
     title="Simulation 1", xlab="Index")

[[Image:SMHS_LinearModeling_Fig1.png|800px]]

Now let’s introduce a trend:

 # first 800 points have base value of 100 w/ normally distributed error,
 # next 100 points have base value of 105 w/ normally distributed error,
 # last 100 points have base value of 110 w/ normally distributed error
 M <- 110
 SD <- 5
 demo.data.2 <- c(rep(100, 800), rep(100+(M-100)/2, 100), rep(M, 100)) + rnorm(1000, mean=0, sd=SD)
 qcc(demo.data.2, type="xbar.one", center=100, add.stats=FALSE,
     title="Simulation 2", xlab="Index")

[[Image:SMHS_LinearModeling_Fig2.png|800px]]

Our goal is to use statistical quality control to automatically identify issues with the data. The qcc package in R provides methods for statistical quality control: given the data, it flags candidate outlier points based on the Shewhart rules. Color-coding the data also helps point out irregular points.

The Shewhart control chart rules (cf. 1930s) are based on monitoring events that are unlikely when the controlled process is stable. Incidences of such atypical events are alarm signals suggesting that the stability of the process may be compromised and that the process has changed.

An instance of such an unlikely event is the situation when the upper or lower control limit (UCL or LCL) is exceeded. The UCL and LCL are constructed as ±3σ limits, indicating that the process is in control within them. Additional warning limits (UWL and LWL) are constructed at ±2σ or ±1σ. Other rules specifying events that have low probability when the process is in control can be constructed (a short numeric illustration follows the list):

1. One point exceeds the control limits (LCL/UCL, ±3σ).

2. Nine consecutive points fall above/below the central line.

3. Six consecutive points show an increasing/decreasing trend.

4. The difference of consecutive values alternates in sign for fourteen points.

5. Two out of three points exceed the warning limits (LWL/UWL, ±2σ).

6. Four out of five points are above/below the central line and exceed the ±1σ limit.

7. Fifteen consecutive points are within the ±1σ limits.

8. Eight consecutive values are beyond the ±1σ limits.
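
For instance, rule 1 can be checked directly. The short sketch below (an illustration added here, reusing the simulated demo.data.1 from above) computes the ±3σ control and ±2σ warning limits by hand and counts rule-1 violations:

 # flag points beyond the ±3σ control limits (rule 1)
 x <- demo.data.1
 x.mean <- mean(x)
 x.sd   <- sd(x)
 UCL <- x.mean + 3*x.sd;  LCL <- x.mean - 3*x.sd    # control limits (rule 1)
 UWL <- x.mean + 2*x.sd;  LWL <- x.mean - 2*x.sd    # warning limits (rule 5)
 sum(x > UCL | x < LCL)   # for normal noise, roughly 0.3% of points exceed ±3σ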

We can define training/testing datasets within qcc by supplying the calibration data (demo.data.1) as the first parameter, followed by the new data (demo.data.2) representing the test set.

 # example using holdout/test sets
 demo.data.1 <- rep(100, 1000) + rnorm(1000, mean=0, sd=2)
 demo.data.2 <- c(rep(100, 800), rep(105, 100), rep(110, 100)) + rnorm(1000, mean=0, sd=2)
 MyQC <- qcc(demo.data.1, newdata=demo.data.2, type="xbar.one", center=100, add.stats=FALSE,
     title="Simulation 1 vs. 2", xlab="Index")
 plot(MyQC) # , chart.all=FALSE)

[[Image:SMHS_LinearModeling_Fig3.png|800px]]

 # add warning limits at 2 std. deviations
 MyQC2 <- qcc(demo.data.1, newdata=demo.data.2, type="xbar.one", center=100, add.stats=FALSE,
     title="Second Simulation 1 vs. 2", xlab="Index")
 warn.limits <- limits.xbar(MyQC2$center, MyQC2$std.dev, MyQC2$sizes, 0.95)
 plot(MyQC2, restore.par = FALSE)
 abline(h = warn.limits, lty = 2, lwd=2, col = "blue")
 ## limits.xbar(center, std.dev, sizes, conf)

The arguments of limits.xbar are:
* center = sample/group center statistic
* std.dev = within-group standard deviation
* sizes = sample sizes
* conf = a numeric value used to compute the control limits, specifying the number of standard deviations (if conf > 1) or the confidence level (if 0 < conf < 1)
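
Note the dual role of conf: a value greater than 1 is interpreted as a number of standard deviations, while a value between 0 and 1 is interpreted as a confidence level. A quick comparison (a minimal check, reusing the MyQC2 object from the block above) makes this concrete:

 # conf > 1: number of standard deviations;  0 < conf < 1: confidence level
 limits.xbar(MyQC2$center, MyQC2$std.dev, MyQC2$sizes, 2)     # exactly ±2 std. dev.
 limits.xbar(MyQC2$center, MyQC2$std.dev, MyQC2$sizes, 0.95)  # ±1.96 std. dev. (~95%)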

[[Image:SMHS_LinearModeling_Fig3.png|800px]]

Natural processes may have errors that are non-normally distributed. However, using (appropriate) transformations we can often normalize the errors.
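
As a simple illustration (with simulated data added here for demonstration), a log transformation normalizes right-skewed, log-normal errors:

 # right-skewed (log-normal) errors become normal on the log scale
 set.seed(1234)
 x.skew <- rlnorm(1000, meanlog = 0, sdlog = 0.5)
 qqnorm(x.skew, main = "Raw (skewed) errors");   qqline(x.skew)
 qqnorm(log(x.skew), main = "Log-transformed");  qqline(log(x.skew))
 shapiro.test(log(x.skew))   # large p-value: consistent with normality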

We can use thresholds to define zones in the data, where each zone spans, say, one standard deviation of the data.

 find_zones <- function(x) {
   x.mean <- mean(x)
   x.sd <- sd(x)
   boundaries <- seq(-3, 3)
   # creates a set of zones for each point in x
   zones <- sapply(boundaries, function(i) {
     i * rep(x.sd, length(x))
   })
   zones + x.mean
 }
 head(find_zones(demo.data.2))
 evaluate_zones <- function(x) {
   zones <- find_zones(x)
   colnames(zones) <- paste("zone", -3:3, sep="_")
   # zone index per point: number of sigma boundaries exceeded, shifted to -3..4
   x.zones <- rowSums(x > zones) - 3
   x.zones
 }

 evaluate_zones(demo.data.2)
 # approximate implementations of several Shewhart-type rules via the zone indices
 find_violations <- function(x.zones, i) {
   # rule 4: nine consecutive points on the same side of the central line
   values <- x.zones[max(i-8, 1):i]
   # rule4 <- ifelse(any(values > 0), 1,
   rule4 <- ifelse(all(values > 0), 1,
                   ifelse(all(values < 0), -1,
                          0))
   # rule 3: at least two of the last six points beyond the ±1σ limit (same side)
   values <- x.zones[max(i-5, 1):i]
   rule3 <- ifelse(sum(values >= 2) >= 2, 1,
                   ifelse(sum(values <= -2) >= 2, -1,
                          0))
   # rule 2: the last four points all beyond the ±2σ limit (same side)
   values <- x.zones[max(i-3, 1):i]
   rule2 <- ifelse(mean(values >= 3) >= 1, 1,
                   ifelse(mean(values <= -3) >= 1, -1,
                          0))
   # rule 1: any of the last four points beyond the ±2σ limit
   #values <- x.zones[]
   values <- x.zones[max(i-3, 1):i]
   rule1 <- ifelse(any(values > 2), 1,
                   ifelse(any(values < -2), -1,
                          0))
   c("rule1"=rule1, "rule2"=rule2, "rule3"=rule3, "rule4"=rule4)
 }

 find_violations(evaluate_zones(demo.data.2), 20)

Now we can compute the rules for each point and assign a color to any violations.
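
A minimal sketch of one way to do this (the aggregation and plotting choices below are illustrative assumptions, not code from the original page):

 # evaluate all four rules at every index; color violating points red
 x.zones <- evaluate_zones(demo.data.2)
 violations <- t(sapply(seq_along(x.zones), function(i) find_violations(x.zones, i)))
 flagged <- apply(violations != 0, 1, any)   # TRUE wherever any rule fires
 plot(demo.data.2, pch = 19, cex = 0.6, xlab = "Index",
      col = ifelse(flagged, "red", "black"),
      main = "Shewhart rule violations (red)")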



.....



....




