SMHS MissingData

Scientific Methods for Health Sciences - Missing Data

Motivation

In complete-case analysis with multiple regression modeling, cases with a missing response value are typically excluded automatically. This restricts the amount of information available to the analysis, especially if the model has many parameters to estimate and many responses are missing. Alternatively, missing outcomes in a regression can be handled by modeling the outcome variable and imputing the missing values at each iteration.

A more challenging situation in regression analysis involves missing values in predictor variables. The options here are to remove the cases with missing values, impute the missing values, or model them analytically by supplying distributions for the input variables.

Theory

Types of data missingness

Knowing why and how data are missing is important for determining the appropriate protocol for handling the data. There are several categories of data missingness.

Missingness completely at random (MCAR). A variable is missing completely at random if the probability of missingness is the same for all units. For instance, suppose each respondent decides whether to answer an income question by rolling a fair die and refuses to answer if a number greater than 3 turns up. When data are missing completely at random, throwing out the cases with missing data does not bias the inference.
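The die-rolling example can be simulated in a few lines of base R. This is a minimal sketch under assumed values: the log-normal income distribution, the sample size, and the seed are hypothetical choices made for illustration and are not taken from any study data.

```r
# MCAR sketch: each respondent rolls a fair die and withholds income on 4, 5, or 6.
set.seed(2024)                                     # arbitrary seed, for reproducibility only
n <- 10000                                         # hypothetical sample size
income <- rlnorm(n, meanlog = 10.5, sdlog = 0.6)   # hypothetical "true" incomes
die <- sample(1:6, n, replace = TRUE)
income_obs <- ifelse(die > 3, NA, income)          # refuse to answer when the die shows > 3

full_mean <- mean(income)                          # mean if everyone had answered
cc_mean   <- mean(income_obs, na.rm = TRUE)        # complete-case estimate
rel_diff  <- abs(cc_mean - full_mean) / full_mean

cat("fraction missing:", mean(is.na(income_obs)), "\n")
cat("relative difference between complete-case and full mean:", rel_diff, "\n")
```

Because the die roll is unrelated to income, about half of the responses are lost, yet the complete-case mean differs from the full-data mean only by sampling noise.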

Missingness at random (MAR). Most missingness is not completely random. For example, different non-response rates for whites and minorities on an income question may be due to socioeconomic factors. Missing at random is a more general assumption, under which the probability that a variable is missing depends only on available information. Suppose demographic variables (e.g., age, gender, race, education) are recorded for all the people in the study. Then income is missing at random if the probability of non-response to this question depends only on these other, fully recorded variables. This process may be modeled by logistic regression, with the outcome variable (Y) an indicator of missingness ($Y=1$ for observed cases and $Y=0$ for missing cases). When an outcome variable is missing at random, a regression model can exclude the missing cases (treat them as NA's), provided it controls for all the variables that affect the probability of missingness for the outcome. In our case, regression models of income would have to include predictors for ethnicity to avoid possible non-response bias.
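The logistic-regression view of missingness described above can be sketched in base R. All quantities here are simulated under assumptions: the variables `age` and `college`, the coefficients of the response mechanism, and the income model are hypothetical and chosen only to make the mechanism visible.

```r
# MAR sketch: response to the income question depends only on a fully observed covariate.
set.seed(1)                                        # arbitrary seed
n <- 5000
age     <- round(runif(n, 20, 70))                 # fully recorded, unrelated to response here
college <- rbinom(n, 1, 0.4)                       # fully recorded education indicator
income  <- rlnorm(n, meanlog = 10 + 0.5 * college, sdlog = 0.5)

p_respond <- plogis(1.5 - 1.2 * college)           # assumed response mechanism
observed  <- rbinom(n, 1, p_respond)               # Y = 1 observed, Y = 0 missing
income_obs <- ifelse(observed == 1, income, NA)

# Model the missingness indicator with logistic regression
miss_model <- glm(observed ~ age + college, family = binomial)
print(round(coef(summary(miss_model)), 3))

# A regression of income that controls for the covariate driving missingness
# can simply drop the NA cases (lm does this by default)
income_model <- lm(log(income_obs) ~ college)
college_effect <- unname(coef(income_model)["college"])
college_miss   <- unname(coef(miss_model)["college"])
```

In this setup the fitted `college` coefficient in `miss_model` should be clearly negative (college graduates respond less often), while `income_model` still recovers the simulated education effect on log income because it conditions on `college`.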

Non-random missingness. When data missingness depends on unobserved predictors, the gaps in the observed data are non-random: they are due to information that is not available and that may also be predictive of the missing values. For instance, highly educated (and often high-income) people may be less likely to respond to income questions, so a college degree has predictive value for income. If education is not recorded for everyone, income is then no longer missing at random. Another example is a new clinical treatment that causes side effects, where attrition of participants (patients dropping out of the study) depends on their ability to tolerate those side effects. Non-random missingness has to be modeled explicitly; otherwise, bias creeps into the scientific inference and affects the results of the study.

Missingness that depends on the missing value itself. The situation is more difficult if the probability of missingness depends on the (potentially missing) variable itself. For example, higher earners may be less likely to reveal their income. In these situations, the missing data have to be modeled explicitly or accounted for by including more predictors in the missing-data model, bringing it closer to the missing-at-random situation. For example, whites and highly educated participants may have higher-than-average incomes, and we can control for such predictors to correct for higher rates of non-response (missingness) among higher-income people. Sometimes, the missing-data situation may require predictive models that extrapolate beyond the range of the observed data.
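The effect of missingness that depends on the missing value itself can be illustrated with a short base R simulation. This is a sketch under assumed values: the income distribution and the response mechanism (response probability decreasing in log income) are hypothetical.

```r
# Sketch: higher earners are less likely to report income, so complete cases under-represent them.
set.seed(7)                                        # arbitrary seed
n <- 10000
income <- rlnorm(n, meanlog = 10.5, sdlog = 0.6)   # hypothetical incomes

# Assumed mechanism: probability of responding falls as (log) income rises
p_respond <- plogis(2 - 1.5 * (log(income) - 10.5))
responded <- rbinom(n, 1, p_respond) == 1
income_obs <- ifelse(responded, income, NA)

full_mean <- mean(income)
cc_mean   <- mean(income_obs, na.rm = TRUE)        # complete-case estimate

cat("fraction missing:", mean(is.na(income_obs)), "\n")
cat("full-data mean:", round(full_mean), " complete-case mean:", round(cc_mean), "\n")
```

Unlike the MCAR case, the complete-case mean here is systematically too low, and nothing in the observed incomes alone reveals the size of the shortfall.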

Example

Suppose we are interested in identifying patterns, relations and associations between demographic, clinical and cognitive variables in a cohort of traumatic brain injury (TBI) patients. The table below shows the raw data; missing values are marked with a period (.). Imputing the missing data would allow us to use all cases in our analysis of the multivariate relations using the completed dataset.

id	age	sex	mechanism	field.gcs	er.gcs	icu.gcs	worst.gcs	6m.gose	2013.gose	skull.fx	temp.injury	surgery	spikes.hr	min.hr	max.hr	acute.sz	late.sz	ever.sz
1	19	Male	Fall	10	10	10	10	5	5	0	1	1	.	.	.	1	1	1
2	55	Male	Blunt	.	3	3	3	5	7	1	1	1	168.74	14	757	0	1	1
3	24	Male	Fall	12	12	8	8	7	7	1	0	0	37.37	0	351	0	0	0
4	57	Female	Fall	4	4	6	4	3	3	1	1	1	4.35	0	59	0	0	0
5	54	Female	Peds_vs_Auto	14	11	8	8	5	7	0	1	1	54.59	0	284	0	0	0
6	16	Female	MVA	13	7	7	7	7	8	1	1	1	75.92	7	180	0	1	1
7	21	Male	Fall	3	3	6	3	3	3	1	0	1	.	.	.	0	0	0
8	25	Male	Fall	3	4	3	3	3	3	0	1	0	5.26	0	88	0	1	1
9	30	Male	GSW	3	9	3	3	3	5	1	1	1	43.88	0	367	0	1	1
10	38	Male	Fall	3	6	6	3	3	3	1	1	1	45.6	4	107	0	1	1
11	43	Male	Blunt	8	7	7	7	6	7	1	0	1	7.76	0	72	0	0	0
12	40	Male	Fall	12	14	14	12	7	8	0	1	1	26.64	0	125	0	0	0
13	21	Male	MVA	12	13	9	9	7	7	1	0	1	.	.	.	0	1	1
14	35	Female	MVA	6	5	6	5	5	7	1	1	0	65.14	0	655	1	1	1
15	59	Male	Peds_vs_Auto	14	14	0	0	8	8	1	1	0	.	.	.	0	0	0
16	32	Male	MCA	5	6	3	3	4	5	1	0	0	.	.	.	0	0	0
17	31	Male	MVA	7	3	9	3	5	7	1	0	0	3.82	0	28	0	0	0
18	57	Male	MVA	4	3	7	3	3	3	0	1	1	.	.	.	0	1	1
19	18	Male	Blunt	4	3	6	3	5	3	1	1	1	.	.	.	0	1	1
20	48	Male	Bike_vs_Auto	3	8	7	3	5	7	0	0	0	.	.	.	0	0	0
21	19	Male	GSW	15	15	3	3	.	6	1	0	1	.	.	.	1	1	1
22	22	Male	Fall	3	3	3	3	2	2	1	1	1	9.7	0	80	0	1	1
23	20	Male	Peds_vs_Auto	15	14	13	13	5	8	1	1	1	.	.	.	0	1	1
24	41	Male	MVA	3	3	6	3	3	7	1	0	0	.	.	.	0	0	0
25	27	Male	MCA	15	13	6	6	6	7	1	0	1	.	.	.	0	0	0
26	23	Male	MVA	14	14	7	7	4	7	1	1	1	.	.	.	0	0	0
27	36	Male	MCA	3	3	3	3	5	6	0	0	0	.	.	.	0	1	1
28	83	Female	Fall	14	14	9	9	.	5	0	1	1	208.92	42	641	1	1	1
29	26	Male	MCA	5	7	5	5	6	7	0	1	0	.	.	.	0	0	0
30	21	Male	Fall	14	14	14	14	5	7	0	1	1	294	30	1199	1	1	1
31	23	Male	MCA	12	13	13	12	.	7	1	0	1	.	.	.	0	0	0
32	45	Male	MCA	6	6	6	6	3	6	0	0	1	.	.	.	0	0	0
33	18	Male	Bike_vs_Auto	8	8	7	7	7	7	0	0	0	7.14	0	20	0	1	1
34	34	Male	MVA	7	7	3	3	4	6	0	1	1	47.73	0	226	0	1	1
35	19	Male	MVA	3	7	7	3	7	8	0	0	0	97.43	0	300	0	0	0
36	77	Female	Peds_vs_Auto	3	6	3	3	3	3	1	1	0	7.09	0	31	0	1	1
37	75	Male	Bike_vs_Auto	.	.	.	.	.	8	1	0	0	5.9	0	42	0	1	1
38	25	Male	Fall	14	.	6	6	8	8	0	0	1	29.61	0	175	1	0	1
39	62	Female	Fall	12	8	8	8	3	3	0	1	1	6.16	0	33	0	1	1
40	41	Male	MCA	7	3	7	3	5	5	1	1	1	1.66	0	23	0	1	1
41	60	Male	Bike_vs_Auto	3	12	7	3	3	5	1	1	0	3.8	0	12	0	1	1
42	29	Female	Peds_vs_Auto	9	14	3	3	8	7	1	0	1	.	.	.	0	1	1
43	48	Male	Blunt	12	12	11	11	6	7	0	0	1	5.39	0	43	0	0	0
44	41	Male	Peds_vs_Auto	3	3	3	3	2	2	1	1	0	1.28	0	15	1	1	1
45	34	Male	Fall	6	8	3	3	3	3	1	1	1	213.84	3	824	1	1	1
46	25	Female	MVA	6	8	3	3	.	7	0	1	0	1.7	0	36	0	0	0
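Before imputing, it helps to see how the period-coded gaps are read into R and how much is missing per variable. The sketch below parses only four rows copied from the table (ids 1, 2, 3, and 21) from an inline string; the object names `raw` and `tbi` are illustrative choices, and the counts apply to this excerpt, not to the full table.

```r
# Parse a four-row excerpt of the table, treating "." as a missing value
raw <- "id age sex mechanism field.gcs er.gcs icu.gcs worst.gcs 6m.gose 2013.gose skull.fx temp.injury surgery spikes.hr min.hr max.hr acute.sz late.sz ever.sz
1 19 Male Fall 10 10 10 10 5 5 0 1 1 . . . 1 1 1
2 55 Male Blunt . 3 3 3 5 7 1 1 1 168.74 14 757 0 1 1
3 24 Male Fall 12 12 8 8 7 7 1 0 0 37.37 0 351 0 0 0
21 19 Male GSW 15 15 3 3 . 6 1 0 1 . . . 1 1 1"

# check.names = FALSE keeps column names such as "6m.gose" unchanged
tbi <- read.table(text = raw, header = TRUE, na.strings = ".",
                  check.names = FALSE, stringsAsFactors = FALSE)

na_per_column <- colSums(is.na(tbi))               # number of missing values per variable
print(na_per_column[na_per_column > 0])

complete_rows <- sum(complete.cases(tbi))          # rows usable in a complete-case analysis
cat("complete rows in this excerpt:", complete_rows, "of", nrow(tbi), "\n")
```

Even in this small excerpt, a complete-case analysis would keep only one of the four patients, which is the loss of information that imputation is meant to avoid.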

The R code below illustrates multiple imputation of the raw data.

###########################################
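# Example of multiple imputation (R mi package)
# Note: the calls below (mi.info, mi.preprocess, bugs.mi, mi.completed, write.mi)
# follow the older, pre-1.0 interface of mi; newer releases of the package use a
# different interface (e.g., missing_data.frame and complete)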
# See Docs: http://www.stat.ucla.edu/~yajima/Publication/mipaper.rev04.pdf
# 
# If you don't have data, simulate fake data
# set.seed(123)
# n <- 1000
# u1 <- rbinom(n, 1, .5); v1 <- log(rnorm(n, 5, 1)); x1 <- u1*exp(v1)
# u2 <- rbinom(n, 1, .5); v2 <- log(rnorm(n, 5, 1)); x2 <- u2*exp(v2)
# x3 <- rbinom(n, 1, prob=0.45); x4 <- ordered(rep(seq(1, 5),100)[sample(1:n, n)]); x5 <- rep(letters[1:10],10)[sample(1:n, n)]; x6 <- trunc(runif(n, 1, 10)); x7 <- rnorm(n); x8 <- factor(rep(seq(1,10),10)[sample(1:n, n)]); x9 <- runif(n, 0.1, .99); x10 <- rpois(n, 4); y <- x1 + x2 + x7 + x9 + rnorm(n)
# fakedata <- cbind.data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
# randomly create missing values
# dat <- mi:::.create.missing(fakedata, pct.mis=30)
##########################################################

library("mi")
# save the (raw) data from the table above as a comma-separated text file "EpiBioSData.csv"
# (missing values are coded as "." and are mapped to NA by na.strings below)
EpiBiosData <- read.csv("~/EpiBioSData.csv", na.strings=c("",".","NA"))
# get information matrix of the data
inf <- mi.info(EpiBiosData)
# to change the type of a specific variable in the mi.info object
# inf <- update(inf, "type", list(x10="count"))
# run the imputation without data transformation
IMP <- mi(EpiBiosData, info=inf, check.coef.convergence=TRUE, add.noise=noise.control(post.run.iter=10))
#
# run the imputation with data transformation  
EpiBiosData.transformed <- mi.preprocess(EpiBiosData, inf)
#IMP <- mi(EpiBiosData.transformed, n.iter=6, check.coef.convergence=TRUE, add.noise=noise.control(post.run.iter=6))
#
# IMP <- mi(EpiBiosData.transformed, n.iter=6, add.noise=TRUE)
# no noise
# IMP <- mi(dat, info=inf, n.iter=6, add.noise=FALSE) ## NOT RUN
#
# convergence checking
converged(IMP, check = "data")  ## this may return FALSE when n.iter is too small for the chains to converge
# converged(IMP, check = "coefs")
IMP.bugs1 <- bugs.mi(IMP, check = "data")    ## BUGS object to look at the R hat statistics
IMP.bugs2 <- bugs.mi(IMP, check = "coefs")   ## BUGS object to look at the R hat statistics
plot(IMP.bugs1)  ## visually check R.hat
#
# visually check the imputation
plot(IMP)
#
missing.pattern.plot(EpiBiosData, gray.scale = TRUE)
#
# to obtain the Completed/Imputed Dataset
IMP.EpiBiosData.all <- mi.completed(IMP)
#
# save results out
write.mi(IMP, format = "csv")
write.csv(IMP.EpiBiosData.all, "~/EpiBioS_MIData.csv")
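To build intuition for what a regression-based imputer does for a single variable, here is a minimal base R sketch on simulated data. It is not the algorithm used by the mi package, which handles many variables and variable types at once; the variables `x` and `y`, the linear model, and the number of deleted values are all hypothetical.

```r
# Sketch: one pass of stochastic regression imputation for a single outcome
set.seed(42)                                       # arbitrary seed
n <- 500
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)                          # simulated outcome
y_obs <- y
y_obs[sample(n, 100)] <- NA                        # delete 100 values completely at random

fit  <- lm(y_obs ~ x)                              # fitted on complete cases (NA rows dropped)
miss <- is.na(y_obs)
pred <- predict(fit, newdata = data.frame(x = x[miss]))

# Add residual noise so imputed values keep realistic variability
# instead of all lying exactly on the regression line
y_imp <- y_obs
y_imp[miss] <- pred + rnorm(sum(miss), mean = 0, sd = summary(fit)$sigma)

cat("remaining NAs after imputation:", sum(is.na(y_imp)), "\n")
```

Multiple imputation repeats this kind of draw several times to produce several completed datasets, so that the uncertainty about the imputed values carries through to the final analysis.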

SOCR Home page: http://www.socr.umich.edu
