SMHS MissingData

From SOCR
Revision as of 12:03, 24 September 2014 by Dinov

Scientific Methods for Health Sciences - Missing Data

Motivation

In complete-case analysis using multiple regression modeling, cases with a missing response value are automatically excluded. This restricts the amount of information available to the analysis, especially if the model has many parameters that need to be estimated and many responses are missing. Alternatively, missing outcomes in a regression can be handled by modeling the outcome variable and using that model to impute the missing values at each iteration.

A more challenging situation in regression analysis involves missing values in predictor variables. The options here are to remove the missing values, impute them, or analytically model them by supplying distributions for the input variables.
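The cost of complete-case analysis can be seen in a small simulation. The sketch below uses made-up data (the variables `x`, `y` and `y_obs` are illustrative and unrelated to any study data) and relies on the default behaviour of `lm()`, which drops rows containing `NA`.

```r
# Illustrative simulation only: x, y and y_obs are hypothetical variables
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Remove 60 of the 200 responses (30%) completely at random
y_obs <- y
y_obs[sample(n, size = 60)] <- NA

# With the default na.action (na.omit), lm() silently drops rows with NA
fit_full <- lm(y ~ x)       # all cases
fit_cc   <- lm(y_obs ~ x)   # complete cases only

n_full <- nobs(fit_full)
n_cc   <- nobs(fit_cc)

# Fewer cases typically means a larger standard error for the slope
se_full <- summary(fit_full)$coefficients["x", "Std. Error"]
se_cc   <- summary(fit_cc)$coefficients["x", "Std. Error"]

print(c(n_full = n_full, n_cc = n_cc))
print(c(se_full = se_full, se_cc = se_cc))
```

Comparing `n_full` with `n_cc` shows how many cases the complete-case fit discards; with many predictors, each of which may be missing, the loss can be much larger.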

Theory

Types of data missingness

Knowing why and how data are missing is important for determining the appropriate protocol for handling the data. There are several categories of data missingness.

  • Missingness completely at random (MCAR). A variable is missing completely at random if the probability of missingness is the same for all units. For instance, each respondent decides whether to answer an income question by rolling a (fair) die and refusing to answer if a number > 3 turns up. When data are missing completely at random, throwing out the cases with missing data does not bias the inference.
  • Missingness at random (MAR). Most missingness is not completely random. For example, different non-response rates for whites and minorities on an income question may be due to socioeconomic factors. Missing at random is a more general assumption where the probability that a variable is missing depends only on available information. Suppose demographic variables (e.g., age, gender, race, education) are recorded for all the people in the study. Then income will be missing at random if the probability of non-response to this question depends only on these other, fully recorded variables. This process may be modeled by logistic regression with the outcome variable (Y) representing an indicator of missingness ($Y=1$ for observed cases and $Y=0$ for missing cases). When an outcome variable is missing at random, a regression model can exclude the missing cases (treat them as NA's), provided it controls for all the variables that affect the probability of missingness for the outcome. In our case, regression models of income would have to include predictors for ethnicity to avoid possible non-response bias.
  • Non-random missingness. When the missingness depends on unobserved predictors, the gaps in the observed data are non-random: they are due to information that may not be available and that may also be predictive of the missing values. For instance, highly-educated (high-income?) people may be less likely to respond to income questions, and a college degree may itself have predictive value for income. Another example is a new clinical treatment that causes side effects, which may lead to attrition of participants (patients drop out of the study) depending on their ability to deal with the side effects. Non-random missingness has to be explicitly modeled; otherwise, bias will creep into the scientific inference and affect the results of the study.
  • Missingness that depends on the missing value itself. If the probability of missingness depends on the (potentially missing) variable itself, the situation is a bit more interesting. For example, higher earners may be less likely to reveal their income. In these situations, the missing data have to be modeled, or accounted for by including more predictors in the missing-data

model to bring it closer to the missing-at-random situation. For example, whites and highly-educated participants may have higher-than-average incomes, and we can control for such predictors to correct for higher rates of non-response (missingness) among higher-income people. Sometimes, the missing-data situation may require predictive models that extrapolate beyond the range of the observed data.
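The mechanisms above can be contrasted in a short simulation. This is a minimal sketch on invented data: `educ`, `income` and the response probabilities are assumptions chosen for illustration, not estimates from any study. It compares complete-case means under three mechanisms, fits the logistic regression of the observation indicator (Y = 1 observed, Y = 0 missing), and checks that a regression controlling for the fully observed covariate is close to the full-data fit under MAR.

```r
# Hypothetical population: educ is fully observed, income is subject to non-response
set.seed(2014)
n <- 5000
educ   <- rbinom(n, 1, 0.4)                            # 1 = college degree
income <- exp(10 + 0.5 * educ + rnorm(n, sd = 0.5))

# Probability that income is OBSERVED under three mechanisms
p_mcar <- rep(0.5, n)                                  # same for every unit (the die roll)
p_mar  <- ifelse(educ == 1, 0.3, 0.7)                  # depends only on observed education
p_mnar <- plogis(-as.numeric(scale(log(income))))      # depends on income itself

obs_mcar <- rbinom(n, 1, p_mcar)
obs_mar  <- rbinom(n, 1, p_mar)
obs_mnar <- rbinom(n, 1, p_mnar)

# Complete-case mean of income under each mechanism
cc_mean <- function(obs) mean(income[obs == 1])
means <- c(full = mean(income),
           mcar = cc_mean(obs_mcar),
           mar  = cc_mean(obs_mar),
           mnar = cc_mean(obs_mnar))
print(round(means))

# Logistic regression of the observation indicator on the observed covariate
fit_miss <- glm(obs_mar ~ educ, family = binomial)
print(coef(fit_miss))

# Under MAR, a regression that controls for educ can use complete cases only
fit_full <- lm(log(income) ~ educ)
fit_cc   <- lm(log(income) ~ educ, subset = obs_mar == 1)
print(rbind(full = coef(fit_full), complete_case = coef(fit_cc)))
```

In this setup the naive complete-case mean stays close to the full-data mean only under MCAR, while the regression that includes `educ` recovers nearly the same coefficients from the MAR complete cases. No such correction is available in the last scenario without modeling the missingness itself.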

Example

Suppose we are interested in identifying patterns, relations and associations between demographic, clinical and cognitive variables in a cohort of traumatic brain injury (TBI) patients. The table below shows the raw data. Notice the missing values in this table. Imputing the missing data would allow us to use all cases in our analysis of the multivariate relations using the completed dataset.

id age sex mechanism field.gcs er.gcs icu.gcs worst.gcs 6m.gose 2013.gose skull.fx temp.injury surgery spikes.hr min.hr max.hr acute.sz late.sz ever.sz
1 19 Male Fall 10 10 10 10 5 5 0 1 1 . . . 1 1 1
2 55 Male Blunt . 3 3 3 5 7 1 1 1 168.74 14 757 0 1 1
3 24 Male Fall 12 12 8 8 7 7 1 0 0 37.37 0 351 0 0 0
4 57 Female Fall 4 4 6 4 3 3 1 1 1 4.35 0 59 0 0 0
5 54 Female Peds_vs_Auto 14 11 8 8 5 7 0 1 1 54.59 0 284 0 0 0
6 16 Female MVA 13 7 7 7 7 8 1 1 1 75.92 7 180 0 1 1
7 21 Male Fall 3 3 6 3 3 3 1 0 1 . . . 0 0 0
8 25 Male Fall 3 4 3 3 3 3 0 1 0 5.26 0 88 0 1 1
9 30 Male GSW 3 9 3 3 3 5 1 1 1 43.88 0 367 0 1 1
10 38 Male Fall 3 6 6 3 3 3 1 1 1 45.6 4 107 0 1 1
11 43 Male Blunt 8 7 7 7 6 7 1 0 1 7.76 0 72 0 0 0
12 40 Male Fall 12 14 14 12 7 8 0 1 1 26.64 0 125 0 0 0
13 21 Male MVA 12 13 9 9 7 7 1 0 1 . . . 0 1 1
14 35 Female MVA 6 5 6 5 5 7 1 1 0 65.14 0 655 1 1 1
15 59 Male Peds_vs_Auto 14 14 0 0 8 8 1 1 0 . . . 0 0 0
16 32 Male MCA 5 6 3 3 4 5 1 0 0 . . . 0 0 0
17 31 Male MVA 7 3 9 3 5 7 1 0 0 3.82 0 28 0 0 0
18 57 Male MVA 4 3 7 3 3 3 0 1 1 . . . 0 1 1
19 18 Male Blunt 4 3 6 3 5 3 1 1 1 . . . 0 1 1
20 48 Male Bike_vs_Auto 3 8 7 3 5 7 0 0 0 . . . 0 0 0
21 19 Male GSW 15 15 3 3 . 6 1 0 1 . . . 1 1 1
22 22 Male Fall 3 3 3 3 2 2 1 1 1 9.7 0 80 0 1 1
23 20 Male Peds_vs_Auto 15 14 13 13 5 8 1 1 1 . . . 0 1 1
24 41 Male MVA 3 3 6 3 3 7 1 0 0 . . . 0 0 0
25 27 Male MCA 15 13 6 6 6 7 1 0 1 . . . 0 0 0
26 23 Male MVA 14 14 7 7 4 7 1 1 1 . . . 0 0 0
27 36 Male MCA 3 3 3 3 5 6 0 0 0 . . . 0 1 1
28 83 Female Fall 14 14 9 9 . 5 0 1 1 208.92 42 641 1 1 1
29 26 Male MCA 5 7 5 5 6 7 0 1 0 . . . 0 0 0
30 21 Male Fall 14 14 14 14 5 7 0 1 1 294 30 1199 1 1 1
31 23 Male MCA 12 13 13 12 . 7 1 0 1 . . . 0 0 0
32 45 Male MCA 6 6 6 6 3 6 0 0 1 . . . 0 0 0
33 18 Male Bike_vs_Auto 8 8 7 7 7 7 0 0 0 7.14 0 20 0 1 1
34 34 Male MVA 7 7 3 3 4 6 0 1 1 47.73 0 226 0 1 1
35 19 Male MVA 3 7 7 3 7 8 0 0 0 97.43 0 300 0 0 0
36 77 Female Peds_vs_Auto 3 6 3 3 3 3 1 1 0 7.09 0 31 0 1 1
37 75 Male Bike_vs_Auto . . . . . 8 1 0 0 5.9 0 42 0 1 1
38 25 Male Fall 14 . 6 6 8 8 0 0 1 29.61 0 175 1 0 1
39 62 Female Fall 12 8 8 8 3 3 0 1 1 6.16 0 33 0 1 1
40 41 Male MCA 7 3 7 3 5 5 1 1 1 1.66 0 23 0 1 1
41 60 Male Bike_vs_Auto 3 12 7 3 3 5 1 1 0 3.8 0 12 0 1 1
42 29 Female Peds_vs_Auto 9 14 3 3 8 7 1 0 1 . . . 0 1 1
43 48 Male Blunt 12 12 11 11 6 7 0 0 1 5.39 0 43 0 0 0
44 41 Male Peds_vs_Auto 3 3 3 3 2 2 1 1 0 1.28 0 15 1 1 1
45 34 Male Fall 6 8 3 3 3 3 1 1 1 213.84 3 824 1 1 1
46 25 Female MVA 6 8 3 3 . 7 0 1 0 1.7 0 36 0 0 0
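Before imputing, it can help to tabulate where the gaps are. The sketch below reads only the first five rows of the table (copied verbatim, with "." marking a missing value) and counts missing entries with base R; the object names `raw`, `tbi`, `na_per_column` and `n_complete` are illustrative.

```r
# First five rows of the table above, whitespace-delimited, "." = missing
raw <- "id age sex mechanism field.gcs er.gcs icu.gcs worst.gcs 6m.gose 2013.gose skull.fx temp.injury surgery spikes.hr min.hr max.hr acute.sz late.sz ever.sz
1 19 Male Fall 10 10 10 10 5 5 0 1 1 . . . 1 1 1
2 55 Male Blunt . 3 3 3 5 7 1 1 1 168.74 14 757 0 1 1
3 24 Male Fall 12 12 8 8 7 7 1 0 0 37.37 0 351 0 0 0
4 57 Female Fall 4 4 6 4 3 3 1 1 1 4.35 0 59 0 0 0
5 54 Female Peds_vs_Auto 14 11 8 8 5 7 0 1 1 54.59 0 284 0 0 0"

# check.names = TRUE (the default) renames 6m.gose and 2013.gose
# to X6m.gose and X2013.gose, since R names cannot start with a digit
tbi <- read.table(text = raw, header = TRUE, na.strings = ".")

# Number of missing values per variable (showing only variables with gaps)
na_per_column <- colSums(is.na(tbi))
print(na_per_column[na_per_column > 0])

# Number of cases a complete-case analysis would retain
n_complete <- sum(complete.cases(tbi))
print(n_complete)
```

Applied to the whole table, the same two summaries show how many of the 46 patients would be lost to a complete-case analysis.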

The R code below illustrates the imputation of the raw data.

###########################################
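# example of multiple imputation (R mi package)
# See Docs: http://www.stat.ucla.edu/~yajima/Publication/mipaper.rev04.pdf
# Note: mi.info(), mi.preprocess(), bugs.mi(), mi.completed() and write.mi()
# belong to the early (pre-1.0) releases of mi; later releases use a different interface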
# 
# If you don't have data, simulate fake data
# set.seed(123)
# n <- 1000
# u1 <- rbinom(n, 1, .5); v1 <- log(rnorm(n, 5, 1)); x1 <- u1*exp(v1)
# u2 <- rbinom(n, 1, .5); v2 <- log(rnorm(n, 5, 1)); x2 <- u2*exp(v2)
# x3 <- rbinom(n, 1, prob=0.45); x4 <- ordered(rep(seq(1, 5),100)[sample(1:n, n)]); x5 <- rep(letters[1:10],10)[sample(1:n, n)]; x6 <- trunc(runif(n, 1, 10)); x7 <- rnorm(n); x8 <- factor(rep(seq(1,10),10)[sample(1:n, n)]); x9 <- runif(n, 0.1, .99); x10 <- rpois(n, 4); y <- x1 + x2 + x7 + x9 + rnorm(n)
# fakedata <- cbind.data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
# randomly create missing values
# dat <- mi:::.create.missing(fakedata, pct.mis=30)
##########################################################

library("mi")
# save the (raw) data from the table as a comma-separated text file "EpiBioSData.csv"
# missing values may appear as empty fields, "." or "NA"
EpiBiosData <- read.csv("~/EpiBioSData.csv", na.strings=c("",".","NA"))
# get information matrix of the data
inf <- mi.info(EpiBiosData)
# to update the variable type of a specific variable to mi.info
# inf <- update(inf, "type", list(x10="count"))
# run the imputation without data transformation
IMP <- mi(EpiBiosData, info=inf, check.coef.convergence=TRUE, add.noise=noise.control(post.run.iter=10))
#
# run the imputation with data transformation  
EpiBiosData.transformed <- mi.preprocess(EpiBiosData, inf)
#IMP <- mi(EpiBiosData.transformed, n.iter=6, check.coef.convergence=TRUE, add.noise=noise.control(post.run.iter=6))
#
# IMP <- mi(EpiBiosData.transformed, n.iter=6, add.noise=TRUE)
# no noise
# IMP <- mi(dat, info=inf, n.iter=6, add.noise=FALSE) ## NOT RUN
#
# convergence checking
converged(IMP, check = "data")  ## may return FALSE when n.iter is small
# converged(IMP, check = "coefs")
IMP.bugs1 <- bugs.mi(IMP, check = "data")    ## BUGS object to look at the R hat statistics
IMP.bugs2 <- bugs.mi(IMP, check = "coefs")   ## BUGS object to look at the R hat statistics
plot(IMP.bugs1)  ## visually check R.hat
#
# visually check the imputation
plot(IMP)
#
missing.pattern.plot(EpiBiosData, gray.scale = TRUE)
#
# to obtain the Completed/Imputed Dataset
IMP.EpiBiosData.all <- mi.completed(IMP)
#
# save results out
write.mi(IMP, format = "csv")
write.csv(IMP.EpiBiosData.all, "~/EpiBioS_MIData.csv")
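To make concrete what an imputation model does, here is a minimal base-R sketch of stochastic regression imputation on simulated data. It is not the algorithm used by the mi package: it handles a single incomplete variable, assumes the values are missing completely at random, and, as a deliberate simplification, ignores the uncertainty in the fitted coefficients. All names (`x`, `y`, `impute_once`, `completed`) are hypothetical.

```r
# Simulated data with one incomplete variable (illustration only)
set.seed(42)
n <- 300
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 90)] <- NA                  # 30% of y missing completely at random

miss <- is.na(y)

# Imputation model fitted to the observed cases
fit       <- lm(y ~ x, subset = !miss)
sigma_hat <- summary(fit)$sigma         # residual standard deviation

# One imputation: predicted value plus random noise for each missing y
impute_once <- function() {
  y_imp <- y
  pred  <- predict(fit, newdata = data.frame(x = x[miss]))
  y_imp[miss] <- pred + rnorm(sum(miss), sd = sigma_hat)
  y_imp
}

# Several completed versions of y, one per column
m <- 5
completed <- replicate(m, impute_once())

# Observed values are identical across columns; imputed values differ
print(dim(completed))
print(apply(completed, 2, mean))
```

The variation between the columns of `completed` reflects the uncertainty about the missing values, which is the reason for producing several imputed datasets rather than one.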




