Difference between revisions of "SMHS MissingData"

Latest revision as of 09:46, 23 May 2016

Scientific Methods for Health Sciences - Missing Data

Questions

Why is data usually incomplete?
What are the best strategies for dealing with missing data- ignore cases, replace them by some population derived values, or impute them?
What is the impact of data manipulations on the core scientific inference?

Overview

Many research studies encounter incomplete (missing) data that require special handling (e.g., processing, statistical analysis, visualization). There are a variety of methods (e.g., multiple imputation) to deal with missing data, detect missingness, impute the data, analyze the completed data-set and compare the characteristics of the raw and imputed data.

Multiple imputation involves 3 steps:

1) Impute: Create sets of plausible values for the missing observations that reflect uncertainty about the non-response model. Each of these sets of plausible values can be used to “fill-in” or complete the data-set.

2) Analyze: process each of these imputed data-sets using complete-data methods.

3) Combine: synthesize the results accounting for the uncertainty within each imputation round.

In a general regression setting, let’s denote the scalar, or vector valued, outcomes by Y, and the corresponding vector of predictors by X. For a given case (e.g., subject, unit), these quantities are either observed (obs) or missing (mis). Thus, Y_obs and X_obs represent the observed component of the outcome and the predictors; and Y_mis and X_mis denote the unobserved components of the outcome and predictors, respectively. Imputation involves the estimation of the regression parameters β governing the conditional distribution of Y given X: f(Y|X,β). The efficiency, bias and precision of the estimates are important in this process.

The type of missingness in the data is an important factor in the imputation process. The basic patterns of missingness include:

Missing completely at random (MCAR) assumes that the missing data is not related to any factor, known or unknown, in the study.

Missingness at random (MAR) assumes that the missingness depends only on observed quantities, which may include outcomes and/or predictors.

Non-ignorable missingness occurs when the missing data depends on unobserved quantities.

Multiple Imputation Protocol

(1) Imputation:

Generate a set of $m > 1$ plausible values for $Z_{mis}=(Y_{mis},X_{mis})$.

The imputation step relies upon assumptions regarding the cause of missingness in the dataset. The goal of the imputation is to account for the relationships between the unobserved and observed variables, while taking into account the uncertainty of the imputation. The commonly made MAR assumption for missing data is untestable without additional information. With MAR assumption we can generate imputations (Z_{1},Z_{2},...Z_{m}) from the distribution $f(Z_{mis}|Z_{obs})$, since after conditioning on $Z_{obs}$ the missingness is assumed to be random.

The missingness is monotone when the data matrix can be rearranged so that there is a hierarchy of missingness where observing a particular variable $Z_b$ for a subject implies that $Z_a$ is observed, for $a < b$. In monotone condition settings many imputation methods may be employed including (for continuous variables) propensity methods, predictive mean matching, and (for discrete variables) discriminant analysis or logistic regression. For non-monotonic missingness, Markov Chain Monte Carlo (MCMC) approaches may be used.

Each method has its own assumptions. For instance, predictive mean matching and MCMC approaches require multivariate normality. Predictive mean matching approach employs linear regression for the distribution of a partially observed variable, conditional on other factors. To impute the data using the predictive mean matching approach for a variable $Z_i$ with missing values, we fit a model using complete observations for $Z_1, . . . , Z_{i-1}$ (not monotonicity assumption!):

$E[Z_i|φ]=φ_0+φ_1Z_1+φ_2Z_2+...+φ_{i-1}Z_{i-1}$

Next, new parameters φ^* are randomly drawn from the distribution of the parameters (since these values are estimated instead of exactly known). For the 1≤l≤m imputation, the missing values are replaced by:

$Z_i^l= φ_0^* +φ_1^*Z_1 +φ_2^*Z_2+...+φ_{i-1}^*Z_{i-1}+σ^*ϵ$,

where σ^* is the estimate of variance from the model and ϵ is a Gaussian, N(0,1). This is the simple regression method. We can impute the observed value of $Z_i$ that is closest to predicted $\hat{Z}_i=Z_i^l$ in the dataset – this is the predictive mean matching method.

The propensity score method uses an alternative model for imputation where the values are imputed from observations that are equally likely to be missing, by fitting a logistic regression model for the missingness indicators. In some situations, propensity score imputation may lead to serious bias for missing covariates. For discrete incomplete variables, discriminant analysis or dichotomous logistic regression may be employed to impute values based on the estimated probability that a missing observation takes on a certain value.

(2) Analysis:

Analyze the m datasets using complete-case methods. This second step in the protocol involves carrying out the analysis of interest for each of the m imputed complete-observation datasets and storing the parameter vectors (e.g., effect-size coefficients, β’s) and their standard error estimates.

(3) Combination:

Combine the results from the m analyses and calculate the estimates of the within imputation and between imputation variability. If the imputation model is correct (assumptions are valid), these statistics account for the variability of the imputations and provide consistent estimates of the parameters and their standard errors.

Simulated Example

set.seed(123)
# create MCAR missing-data generator
create.missing <- function (data, pct.mis = 10)
{
n <- nrow(data)
J <- ncol(data)
if (length(pct.mis) == 1) {
n.mis <- rep((n * (pct.mis/100)), J)

}
else {
if (length(pct.mis) < J) 
stop("The length of missing does not equal to the column of the data")
n.mis <- n * (pct.mis/100)

}
   for (i in 1:ncol(data)) {
       if (n.mis[i] == 0) { # if column has nothing missing, do nothing.
           data[, i] <- data[, i]
       }
       else {
           data[sample(1:n, n.mis[i], replace = FALSE), i] <- NA
# For each given column (i), sample the row indices (1:n), 
# a number of indices to replace as “missing”, n.mis[i], “NA”,
 	  # without replacement
       }
   }
   return(as.data.frame(data))
}

# Simulate some real multivariate data sim_data={y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10}
# 
n <- 1000; u1 <- rbinom(n, 1, .5); v1 <- log(rnorm(n, 5, 1)); x1 <- u1*exp(v1)
u2 <- rbinom(n, 1, .5); v2 <- log(rnorm(n, 5, 1)); x2 <- u2*exp(v2)
x3 <- rbinom(n, 1, prob=0.45); x4 <- ordered(rep(seq(1, 5),n)[sample(1:n, n)]); x5 <- rep(letters[1:10],n)[sample(1:n, n)]; x6 <- trunc(runif(n, 1, 10)); x7 <- rnorm(n); x8 <- factor(rep(seq(1,10),n)[sample(1:n, n)]); x9 <-   runif(n, 0.1, .99); x10 <- rpois(n, 4); y <- x1 + x2 + x7 + x9 + rnorm(n)

# package the simulated data as a data frame object
sim_data <- cbind.data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

# randomly create missing values
sim_data_30pct_missing <- create.missing(sim_data, pct.mis=30); 
head(sim_data_30pct_missing); summary(sim_data_30pct_missing)

# install.packages("mi")
# install.packages("betareg")
library("betareg"); library("mi")

# get show the missing information matrix			
mdf <- missing_data.frame(sim_data_30pct_missing) 
show(mdf); mdf@patterns; image(mdf)   # remember the visual pattern of this MCAR

# Next try to impute the missing values …
imputations <- mi(sim_data_30pct_missing, n.iter=10, n.chains=3, verbose=TRUE)
hist(imputations)

# Extracts several multiply imputed data.frames from “imputations” object
data.frames <- complete(imputations, 3)

# Compare the summary stats for the original data (prior to introducing missing
# values) with missing data and the re-completed data following imputation
summary(sim_data); summary(sim_data_30pct_missing); summary(data.frames1);
lapply(data.frames, summary)

Original
	y		x1	x2	x3	x4	x5	x6	x7		x8	x9	x10
Min	:-3.046	Min	:0.000	Min.	:0.000	Min	:0.000	0.180555556	a	:100	Min	:1.000	Min.	:-2.880570	1	:100	Min.	:0.1024	Min.	:0.000
1st Qu.:2.788	1st Qu.:0.000	1st Qu.:0.000	1st Qu.:0.000	0.222222222	b	:100	1st Qu.:3.000	1st Qu.:-0.687578	2	:100	1st Qu.:0.3356	1st Qu.:2.000
Median :5.613	Median :3.186	Median :2.024	Median :0.000	0.263888889	c	:100	Median :5.000	Median :0.000836	3	:100	Median :0.5546	Median :4.000
Mean	:5.658	Mean	:2.626	Mean	:2.480	Mean	:0.459	0.305555556	d	:100	Mean	:4.998	Mean	:-0.020541	4	:100	Mean	:0.5542	Mean	:3.898
3rd Qu.:8.525	3rd Qu.:5.101	3rd Qu.:4.962	3rd Qu.:1.000	0.347222222	e	:100	3rd Qu.:7.000	3rd Qu.:0.635832	5	:100	3rd Qu.:0.7829	3rd Qu.:5.000
Max.	:17.263	Max.	:8.774	Max.	:8.085	Max	:1.000	f	:100	Max.	:9.000	Max.	:3.281171	6	:100	Max.	:0.9887	Max.	:12.000
Missing Data
	y		x1	x2	x3	x4	x5	x6	x7		x8	x9	x10	6	:100
Min.	:-3.046	Min	:0.000	Min.	:0.000	Min.	:0.0000	1	:144	h	:76	Min.	:1.000	Min.	:-2.8806	8	:77	Min.	:0.1024	Min.	:0.000
1st Qu.:2.768	1st Qu.:0.000	1st Qu.:0.000	1st Qu.:0.0000	2	:128	a	:72	1st Qu.:3.000	1st Qu.:-0.6974	4	:75	1st Qu.:0.3504	1st Qu.:2.000
Median :5.650	Median :3.127	Median :0.000	Median :0.0000	3	:143	d	:72	Median :5.000	Median :-0.0490	9	:72	Median :0.5590	Median :4.000
Mean	:5.699	Mean	:2.633	Mean	:2.455	Mean	:0.4586	4	:138	f	:72	Mean	:4.994	Mean	:-0.0261	6	:71	Mean	:0.5619	Mean	:3.853
3rd Qu.:8.552	3rd Qu.:5.088	3rd Qu.:4.966	3rd Qu.:1.0000	5	:147	g	:72	3rd Qu.:7.000	3rd Qu.:0.6379	10	:71	3rd Qu.:0.7843	3rd Qu.:5.000
Max.	:17.263	Max.	:8.774	Max.	:8.085	Max.	:1.0000	NA's:300	(Other):336	Max.	:9.000	Max.	:2.7863	(Other):334	Max.	:0.9887	Max.	:12.000
NA's	:300	NA's	:300	NA's	:300	NA's	:300		NA's	:300	NA's	:300	NA's	:300	NA's	:300	NA's	300	NA's	:300
Imputed
	y		x1	x2	x3	x4	x5	x6	x7		x8	x9	x10	missing y
Min.	:-4.646	Min.	:-3.687	Min.	:-4.672	0.381944444	0.199305556	a	:108	Min.	:-1.005	Min.	:-2.88057	4	:114	Min.	:0.04046	Min.	:-2.009	Mode: logical
1st Qu.:3.015	1st Qu.:0.000	1st Qu.:0.000	0.354166667	0.211111111	h	:108	1st Qu.:3.000	1st Qu.:-0.70985	9	:105	1st Qu.:0.36273	1st Qu.:2.352	FALSE:700
Median :5.741	Median :2.836	Median :2.564	0.259722222	d	:107	Median :5.000	Median :-0.03695	5	:103	Median :0.57188	Median :4.000	TRUE :300
Mean	:5.846	Mean	:2.649	Mean	:2.533	0.297916667	c	:104	Mean	:5.012	Mean	:-0.01444	3	:102	Mean	:0.56607	Mean	:3.858	NA's :0
3rd Qu.:8.667	3rd Qu.:4.978	3rd Qu.:4.871	0.351388889	j	:100	3rd Qu.:7.000	3rd Qu.:0.70157	6	:102	3rd Qu.:0.78259	3rd Qu.:5.000
Max.	:17.263	Max.	:9.979	Max.	:12.833		f	:99	Max.	:12.178	Max.	:2.78629	8	:102	Max.	:0.99670	Max.	:12.000

# Check imputation convergence (details provided below)
round(mipply(imputations, mean, to.matrix = TRUE), 3)
Rhats(imputations, statistic = "moments") # assess the convergence of MI algorithm

plot(imputations); hist(imputations); image(imputations); summary(imputations)

# Finally, pool over the m = 5 imputed datasets when we fit the “model”
# Pool from across the 4 chains – in order to estimate a linear regression model
model_results <- pool(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=imputations,  m=5 )
display (model_results); summary (model_results)  
# Report the summaries of the imputations
data.frames <- complete(imputations, 3)  	# extract the first 3 chains
lapply(data.frames, summary)

TBI Data Example

See these traumatic brain injury (TBI) data: http://wiki.socr.umich.edu/index.php/SMHS_MissingData#Raw_TBI_data

Ind	ID	Age	Sex	TBI	field.gcs	er.gcs	icu.gcs	worst.gcs	X6m.gose	X2013.gose	skull.fx	temp.injury	surgery	spikes.hr	min.hr	max.hr	acute.sz	late.sz	ever.sz
1	1	19	M	Fall	10	10	10	10	5	5	0	1	1	NA	NA	NA	1	1	1
2	2	55	M	Blunt	NA	3	3	3	5	7	1	1	1	168.74	14	757	0	1	1
3	3	24	M	Fall	12	12	8	8	7	7	1	0	0	37.37	0	351	0	0	0
4	4	57	F	Fall	4	4	6	4	3	3	1	1	1	4.35	0	59	0	0	0
5	5	54	F	Peds_vs_Auto	14	11	8	8	5	7	0	1	1	54.59	0	284	0	0	0
6	6	16	F	MVA	13	7	7	7	7	8	1	1	1	75.92	7	180	0	1	1

# Load the (raw) data from the table into a plain text file "08_EpiBioSData_Incomplete.csv"
TBI_Data <- read.csv("https://umich.instructure.com/files/720782/download?download_frd=1", na.strings=c("",".","NA"))
summary(TBI_Data)

# Get information matrix of the data
# (1) Convert to a missing_data.frame
# library("betareg"); library("mi")			
mdf <- missing_data.frame(TBI_Data) # warnings about missingness patterns
show(mdf); mdf@patterns

# (2) change things: mi::change() method changes the family, imputation method,
# size, type, and so forth of a missing variable. It’s called 
# before calling mi to affect how the conditional expectation of each 
# missing variable is modeled.
mdf <- change(mdf, y = "spikes.hr", what = "transformation", to = "identity")

Arg	Description
y	A character vector naming one or more missing variables within the missing_data.frame specified by the data argument, or a vector of integers or a logical vector indicating which missing_variables to change.
what	A character string naming what is to be changed, such as "family", "imputation_method", "size","transformation", "type", "link", or "model."
to	A character string naming what y should be changed to, such as one of the admissible families, imputation methods, transformations, or types. If missing, then possible choices for the "to" argument will be helpfully printed on the screen.

# (3) examine missingness patterns
summary(mdf); hist(mdf); 
image(mdf)

# (4) Perform initial imputation
imputations <- mi(mdf, n.iter=30, n.chains=5, verbose=TRUE)
hist(imputations)

# (5) Extracts several multiply imputed data.frames from “imputations” object
data.frames <- complete(imputations, 5)

# (6) Report a list of “summaries” for each element (imputation instance)
lapply(data.frames, summary)

# (6.a) To cast the imputed numbers as integers (not necessary, but may be useful)
indx <- sapply(data.frames5, is.numeric)  # get the indices of numeric columns
data.frames5[indx] <- lapply(data.frames5[indx], function(x) as.numeric(as.integer(x))) 		    # cast each value as integer

# (7) Save results out
write.csv(data.frames5, "C:\\Users\\Dinov\\Desktop\\TBI_MIData.csv")

# (8) Complete Data analytics functions:
# library("mi")
#lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()

# (8.1) Define Linear Regression for multiply imputed dataset – Also see Step (10)
##linear regression for each imputed data set - 5 regression models are fit
fit_lm1 <- glm(ever.sz ~ surgery + worst.gcs + factor(sex) + age, data.frames$\$$`chain:1`, family = "binomial"); summary(fit_lm1); display(fit_lm1)

 # Fit the appropriate model and pool the results (estimates over MI chains)
 model_results <- pool(ever.sz ~ surgery + worst.gcs + factor(sex) + age, family = "binomial", data=imputations,  m=5)
 display (model_results); summary (model_results)  

 # Report the summaries of the imputations
 data.frames <- complete(imputations, 3)  	# extract the first 3 chains
 lapply(data.frames, summary)

 # (9) Validation: we now verify whether enough iterations were conducted. 
 # Validation criteria demands that the mean of each completed variable should
 # be similar for each of the k chains (in this case k=5).
 # <b>mipply</b> is wrapper for <b>sapply</b> invoked on mi-class objects to compute the col means
 round(mipply(imputations, mean, to.matrix = TRUE), 3)

 # <b>Rhat</b> convergence statistics compares the variance between chains to the variance
 # across chains. 
 # <b>Rhat Values ~ 1.0</b> indicate likely convergence, 
 # <b>Rhat Values > 1.1</b> indicate that the chains should be run longer 
 # (use large number of iterations)
 Rhats(imputations, statistic = "moments") # assess the convergence of MI algorithm

 # When convergence is unstable, we can continue the iterations for all chains, e.g.
 imputations <- mi(imputations, n.iter=20) # add additional 20 iterations

 # To <b>plot</b> the produced mi results, for all missing_variables we can generate
 # a histogram of the observed, imputed, and completed data.
 # We can compare of the completed data to the fitted values implied by the model
 # for the completed data, by plotting binned residuals. 
 # <b>hist</b> function works similarly as plot. 
 # <b>image</b> function gives a sense of the missingness patterns in the data
 plot(imputations); hist(imputations); image(imputations); summary(imputations)

 # (10) Finally, pool over the m = 5 imputed datasets when we fit the “model”
 # Pool from across the 4 chains – in order to estimate a linear regression model
 # and impact ov various predictors

 <span style="background-color: #37FDFC">model_results <- <b><u>pool</u></b>(ever.sz ~ surgery + worst.gcs + factor(sex) + age, data =  imputations,  m =  5 ); display (model_results); summary (model_results)  </span>

 # Report the summaries of the imputations
 data.frames <- complete(imputations, 3)  	# extract the first 3 chains
 lapply(data.frames, summary)			# report summaries

==='"`UNIQ--h-9--QINU`"'Parkinson's Disease Case Study===

Background: See: http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata

 <b>#imputation for logistic regression</b>
 # load the data: <b>08_PPMI_GSA_clinical.csv</b>
 ppmi_new<-read.csv("https://umich.instructure.com/files/330401/download?download_frd=1",header=TRUE)

 # install.packages("psych")
 library(psych)
 # report descriptive statistics

 ppmi_new$\$$ResearchGroup <- ifelse(ppmi_new$\$$ResearchGroup == "Control", "Control", "Patient")
 ppmi_new$\$$ResearchGroup <- as.factor(ppmi_new$\$$ResearchGroup): table(ppmi_new$\$$ResearchGroup)

# Reduce the Dataset: extract only the columns of interest (to ensure real time calculations)
ppmi_new_1 <- as.data.frame(ppmi_new)[,   c("ResearchGroup","R_fusiform_gyrus_Curvedness","Sex","Age","chr12_rs34637584_GT","UPDRS_Part_I_Summary_Score_Baseline","UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline","UPDRS_Part_III_Summary_Score_Baseline",  "UPDRS_Part_IV_Summary_Score_Baseline")]

descript <- describe(ppmi_new_1); descript	# reports means, SE’s, etc.

dim(ppmi_new); dim(ppmi_new_1)	# check dimensions before

# imputation 1
# install.packages("mi"); install.packages("betareg")
library(mi); library(betareg)

mdf<-missing_data.frame(ppmi_new_1[-1,])
apply(ppmi_new_1[-1,],1,mean)
imputations<-mi(mdf, seed=900);   summary(imputations)
ppmi_complete<-complete(imputations)

# Save the imputation
# again you may want to cast imputed numerical values as integers, perhaps
# (indx <- sapply(data.frames5, is.numeric)  # get the indices of numeric columns
# data.frames5[indx] <- lapply(data.frames5[indx], function(x) as.numeric(as.integer(x)))   # cast each value
# write.csv(ppmi_complete,"C:\\Users\\Dinov\\Desktop\\ppmi_new_1_complete.csv")
write.csv(ppmi_complete,"ppmi_new_1_complete.csv")

# Get the imputations of 4 iterations
ppmi_complete_1<- ppmi_complete1; ppmi_complete_2<- ppmi_complete2; 
ppmi_complete_3<- ppmi_complete3; ppmi_complete_4<- ppmi_complete4

# average the imputations 
ppmi_updrs <- (ppmi_complete_1+ppmi_complete_2+ppmi_complete_4+ppmi_complete_3)/4
# ppmi_top<-cbind(ppmi_new_1[-1,1:335], ppmi_updrs) #delete the first observation
# colnames(ppmi_top)<-colnames(ppmi_new[,1:393])

# Save results – complete dataset
#  write.csv(ppmi_top,"C:\\Users\\Dinov\\Desktop\\ppmi_top_complete_1.csv")
# write.csv(ppmi_top,"ppmi_top_complete_1.csv")

# Now, run some Data analytics on complete data:
# library("mi")
#lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()
# Define Linear Regression for multiply imputed dataset
##linear regression for each imputed data set - 5 regression models are fit

model_results <- pool(ResearchGroup ~  R_fusiform_gyrus_Curvedness + factor(Sex) + Age + 
factor(chr12_rs34637584_GT) + UPDRS_Part_I_Summary_Score_Baseline + UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline + UPDRS_Part_III_Summary_Score_Baseline, family = "binomial", data=imputations,  m=3)
display (model_results); summary (model_results)

(default)	Deutsch	Español	Français	Italiano	Português	日本語	България	الامارات العربية المتحدة	Suomi	इस भाषा में	Norge
한국어	中文	繁体中文	Русский	Nederlands	Ελληνικά	Hrvatska	Česká republika	Danmark	Polska	România	Sverige

Difference between revisions of "SMHS MissingData"

Latest revision as of 09:46, 23 May 2016

Contents

Scientific Methods for Health Sciences - Missing Data

Questions

Overview

Multiple Imputation Protocol

(1) Imputation:

(2) Analysis:

(3) Combination:

Simulated Example

TBI Data Example

Appendix

See also

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools

@@ Line 1: / Line 1: @@
 ==[[SMHS| Scientific Methods for Health Sciences]] - Missing Data ==
-===Motivation===
+===Questions===
-In complete-case analysis using multiple regression modeling, response results may be missing may involve automatically excluding the cases with missing response value. This leads to restricting the amount of information available in the analysis, especially if the model has many parameters that need to be estimated and many responses are potentially missing. Alternatively, missing outcomes
+*Why is data usually incomplete?
-in a regression can be handled by modeling the outcome variable to impute missing values at each
+*What are the best strategies for dealing with missing data- ignore cases, replace them by some population derived values, or impute them?
-iteration.
+*What is the impact of data manipulations on the core scientific inference?
-A more challenging situation in regression analysis involves missing values in predictor variables.
+===Overview===
-The options here are to remove the missing values, impute them, or analytically model them by supplying distributions for the input variables.
-===Theory===
+Many research studies encounter incomplete (missing) data that require special handling (e.g., processing, statistical analysis, visualization). There are a variety of methods (e.g., multiple imputation) to deal with missing data, detect missingness, impute the data, analyze the completed data-set and compare the characteristics of the raw and imputed data.
-====Types of data missingness====
-Knowing why and how is data missing is impoetant for determining hte appropriate protocol for handling the data. There are several categories of data-missingness.
-* ''Missingness completely at random'' (MCAR). A variable is missing completely at random if the probability of missingness is the same for all units. For instance, if each respondent decides whether to answer an ''income'' question by [[SOCR_EduMaterials_Activities_DiceExperiment|rolling a (fair) die]] and refusing to answer if a a number > 3 turn up. Inference based on imputing data missing completely at random, by throwing out cases with missing data, is unbiased.
+Multiple imputation involves 3 steps:
-* ''Missingness at random'' (MAR). Most missingness is not completely random. For example, different non-response rates for whites and minorities on ''income'' question may be due to socioeconomic factors. Missing at random is a more general assumption where the probability a missing variable depends only on available information. Suppose demographic variables (e.g., age, gender, race, education) are recorded for all the people in the study. Then ''income'' will be missing at random if the probability of non-response to this question depends only on these other, fully recorded variables. This process many be modeled by logistic regression with the outcome variable (Y) representing indicator of missingness ($Y=1$ for observed cases and $Y=0$ for missing cases). When an outcome variable is missing at random, a regression model can exclude the missing cases (treat them as NA’s), if it controls for all the variables that affect the probability of missingness for the outcome. In our case, regression models of ''income'' would have to include predictors for ethnicity to avoid possible non-response bias.
+<b>1) Impute</b>: Create sets of plausible values for the missing observations that reflect uncertainty about the non-response model. Each of these sets of plausible values can be used to “fill-in” or complete the data-set.
-* ''Non-random missingness'': When the data missingness depends on unobserved predictors, this indicates non-random gaps in the observed data, dues to information that may not be available, which may also be predictive of the missing values. For instance, highly-educated (high-income?) people may be less likely to respond to ''income'' questions. So, college degree may have predictive value for ''income''. Another example is a new clinical treatment that causes side-effects which may cause attrition of participants (patients drop out of study) dependent on their level of ability to deal with the side effects. Non-random missingness has to be explicitly modeled, otherwise, bias would creep into the scientific inference and impact the results of the study.
+<b>2) Analyze</b>: process each of these imputed data-sets using complete-data methods.
-* ''Missingness that depends on the missing value itself''. If the probability of missingness depends on the (potentially missing) variable itself the situation is a bit more interesting. For example, higher earners may be less likely to reveal their ''income''.  In these situations, missing-data have to be modeled or accounted for by including more predictors in the missing-data model to bringing it closer to ''missing at random'' situation. For example, whites and highly-educated participants may have higher-than-average ''incomes'' and we can control for such predictors to correct for higher rates of non-response (missingness) among higher-income people. Sometime, the missing data situation may require predictive models extrapolating beyond the range of the observed data.
+<b>3) Combine</b>: synthesize the results accounting for the uncertainty within each imputation round.
-===Example===
+In a general regression setting, let’s denote the scalar, or vector valued, outcomes by <i>Y</i>, and the corresponding vector of predictors by <i>X</i>. For a given case (e.g., subject, unit), these quantities are either observed (<i>obs</i>) or missing (<i>mis</i>). Thus, Y<sub>obs</sub> and X<sub>obs</sub> represent the observed component of the outcome and the predictors; and Y<sub>mis</sub> and X<sub>mis</sub> denote the unobserved components of the outcome and predictors, respectively. Imputation involves the estimation of the regression parameters β governing the conditional distribution of Y given X: f(Y|X,β). The efficiency, bias and precision of the estimates are important in this process.
-Suppose we are interested in identifying patterns, relations and associations between demographic, clinical and cognitive variables in a cohort of traumatic brain injury (TBI) patients. The table below shows the raw data. Notice the missing values in this table. Imputing the missing data would allow us to use all cases in our analysis of the multivariate relations using the completed dataset.
-====Raw TBI data====
+The type of missingness in the data is an important factor in the imputation process. The basic patterns of missingness include:
-The variables in the table include: id=participant ID; age=age; sex=gender; mechanism=type of TBI injury; field.gcs=field [http://en.wikipedia.org/wiki/Glasgow_Coma_Scale Glasgow Coma Scale]; er.gcs=emergency room [http://en.wikipedia.org/wiki/Glasgow_Coma_Scale Glasgow Coma Scale]; icu.gcs=intensive care unit [http://en.wikipedia.org/wiki/Glasgow_Coma_Scale Glasgow Coma Scale]; worst.gcs=lowest [http://en.wikipedia.org/wiki/Glasgow_Coma_Scale Glasgow Coma Scale]; 6m.gose=[http://tbims.org/combi/gose/ Extended Glasgow Outcome Scale] score at 6-month follow up; 2013.gose=[http://tbims.org/combi/gose/ Extended Glasgow Outcome Scale] score in 2013; skull.fx=skull fracture criterion for impact-induced head injury; temp.injury=injury characteristic; surgery=index of surgery; spikes.hr=EEG spikes per hour; min.hr=min EEG per hour; max.hr=max EEG spikes per hour; acute.sz=indicator of seizure at acute state; late.sz=indicator of seizure at chronic state; ever.sz=indicator of seizures at any time. Period, '''"."''', indicates missing data.
+*Missing completely at random (MCAR) assumes that the missing data is not related to any factor, known or unknown, in the study.
+*Missingness at random (MAR) assumes that the missingness depends only on observed quantities, which may include outcomes and/or predictors.
+*Non-ignorable missingness occurs when the missing data depends on unobserved quantities.
+===Multiple Imputation Protocol===
+====(1) Imputation:====
+Generate a set of $m > 1$ plausible values for $Z_{mis}=(Y_{mis},X_{mis})$.
+The imputation step relies upon assumptions regarding the cause of missingness in the dataset. The goal of the imputation is to account for the relationships between the unobserved and observed variables, while taking into account the uncertainty of the imputation. The commonly made MAR assumption for missing data is untestable without additional information. With MAR assumption we can generate imputations (Z<sub>{1}</sub>,Z<sub>{2}</sub>,...Z<sub>{m}</sub>) from the distribution $f(Z_{mis}|Z_{obs})$, since after conditioning on $Z_{obs}$ the missingness is assumed to be random.
+The missingness is <b>monotone</b> when the data matrix can be rearranged so that there is a hierarchy of missingness where observing a particular variable $Z_b$ for a subject implies that $Z_a$  is observed, for $a < b$. <i>In monotone condition settings many imputation methods may be employed including (for continuous variables) propensity methods, predictive mean matching, and (for discrete variables) discriminant analysis or logistic regression</i>. For non-monotonic missingness, Markov Chain Monte Carlo (MCMC) approaches may be used.
+Each method has its own assumptions. For instance, predictive mean matching and MCMC approaches require multivariate normality. Predictive mean matching approach employs linear regression for the distribution of a partially observed variable, conditional on other factors. To impute the data using the <b>predictive mean matching approach</b> for a variable $Z_i$ with missing values, we fit a model using complete observations for $Z_1, . . . , Z_{i-1}$ (not monotonicity assumption!):
+$E[Z_i|φ]=φ_0+φ_1Z_1+φ_2Z_2+...+φ_{i-1}Z_{i-1}$
+Next, new parameters φ<sup>*</sup> are randomly drawn from the distribution of the parameters (since these values are estimated instead of exactly known). For the 1≤l≤m imputation, the missing values are replaced by:
+$Z_i^l= φ_0^* +φ_1^*Z_1 +φ_2^*Z_2+...+φ_{i-1}^*Z_{i-1}+σ^*ϵ$,
+where σ<sup>*</sup> is the estimate of variance from the model and ϵ is a Gaussian, <i>N</i>(0,1). This is the simple regression method. We can impute the <u>observed value of $Z_i$</u> that is closest to predicted $\hat{Z}_i=Z_i^l$ in the dataset – this is the <b>predictive mean matching method.</b>
+The <b>propensity score</b> method uses an alternative model for imputation where the values are imputed from observations that are equally likely to be missing, by fitting a <u>logistic regression model for the missingness indicators</u>. In some situations, propensity score imputation may lead to serious bias for missing covariates. For discrete incomplete variables, discriminant analysis or dichotomous logistic regression may be employed to impute values based on the estimated probability that a missing observation takes on a certain value.
+====(2) Analysis:====
+Analyze the <i>m</i> datasets using complete-case methods. This second step in the protocol involves carrying out the analysis of interest for each of the m imputed complete-observation datasets and storing the parameter vectors (e.g., effect-size coefficients, β’s) and their standard error estimates.
+====(3) Combination:====
+Combine the results from the <i>m</i> analyses and calculate the estimates of the <i>within imputation and between imputation</i> variability. If the imputation model is correct (assumptions are valid), these statistics account for the variability of the imputations and provide consistent estimates of the parameters and their standard errors.
+===Simulated Example===
+ set.seed(123)
+ # create MCAR missing-data generator
+ create.missing <- function (data, pct.mis = 10)
+ {
+ n <- nrow(data)
+ J <- ncol(data)
+ if (length(pct.mis) == 1) {
+ n.mis <- rep((n * (pct.mis/100)), J)
+ }
+ else {
+ if (length(pct.mis) < J)
+ stop("The length of missing does not equal to the column of the data")
+ n.mis <- n * (pct.mis/100)
+ }
+    for (i in 1:ncol(data)) {
+        if (n.mis[i] == 0) { # if column has nothing missing, do nothing.
+            data[, i] <- data[, i]
+        }
+        else {
+            data[sample(1:n, n.mis[i], replace = FALSE), i] <- NA
+ # For each given column (i), sample the row indices (1:n),
+ # a number of indices to replace as “missing”, n.mis[i], “NA”,
+  	  # without replacement
+        }
+    }
+    return(as.data.frame(data))
+ }
+ # Simulate some real multivariate data <b>sim_data</b>={y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10}
+ #
+ n <- 1000; u1 <- rbinom(n, 1, .5); v1 <- log(rnorm(n, 5, 1)); x1 <- u1*exp(v1)
+ u2 <- rbinom(n, 1, .5); v2 <- log(rnorm(n, 5, 1)); x2 <- u2*exp(v2)
+ x3 <- rbinom(n, 1, prob=0.45); x4 <- ordered(rep(seq(1, 5),n)[sample(1:n, n)]); x5 <- rep(letters[1:10],n)[sample(1:n, n)]; x6 <- trunc(runif(n, 1, 10)); x7 <- rnorm(n); x8 <- factor(rep(seq(1,10),n)[sample(1:n, n)]); x9 <-   runif(n, 0.1, .99); x10 <- rpois(n, 4); y <- x1 + x2 + x7 + x9 + rnorm(n)
+ # package the simulated data as a data frame object
+ sim_data <- cbind.data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
+ # randomly create missing values
+ sim_data_30pct_missing <- create.missing(sim_data, pct.mis=30);
+ head(sim_data_30pct_missing); summary(sim_data_30pct_missing)
+ # install.packages("mi")
+ # install.packages("betareg")
+ library("betareg"); library("mi")
+ # get show the missing information matrix
+ mdf <- missing_data.frame(sim_data_30pct_missing)
+ show(mdf); mdf@patterns; image(mdf)   # remember the visual pattern of this MCAR
+ # Next try to impute the missing values …
+ imputations <- mi(sim_data_30pct_missing, n.iter=10, n.chains=3, verbose=TRUE)
+ hist(imputations)
+ # Extracts several multiply imputed data.frames from “imputations” object
+ data.frames <- complete(imputations, 3)
+ # Compare the summary stats for the original data (prior to introducing missing
+ # values) with missing data and the re-completed data following imputation
+ summary(sim_data); summary(sim_data_30pct_missing); summary(data.frames[[1]]);
+ lapply(data.frames, summary)
 <center>
-{| class="wikitable" style="text-align:center; " border="1"
+{| class="wikitable" style="text-align:center; width:99%" border="1"
+! colspan="22" |<b>Original</b>
 |-
-!id||age||sex||mechanism||field.gcs||er.gcs||icu.gcs||worst.gcs||6m.gose||2013.gose||skull.fx||temp.injury||surgery||spikes.hr||min.hr||max.hr||acute.sz||late.sz||ever.sz
+|||y||||x1||x2||x3||x4||x5||x6||x7||||x8||x9||x10|| || || || || || || ||
 |-
-|1||19||Male||Fall||10||10||10||10||5||5||0||1||1||.||.||.||1||1||1
+|Min||:-3.046||Min||:0.000||Min.||:0.000||Min||:0.000||0.180555556||a||:100||Min||:1.000||Min.||:-2.880570||1||:100||Min.||:0.1024||Min.||:0.000||
 |-
-|2||55||Male||Blunt||.||3||3||3||5||7||1||1||1||168.74||14||757||0||1||1
+|1st Qu.:2.788||1st Qu.:0.000||1st Qu.:0.000||1st Qu.:0.000||0.222222222||b||:100||1st Qu.:3.000||1st Qu.:-0.687578||2||:100||1st Qu.:0.3356||1st Qu.:2.000||||||||||||||||||
 |-
-|3||24||Male||Fall||12||12||8||8||7||7||1||0||0||37.37||0||351||0||0||0
+|Median :5.613||Median :3.186||Median :2.024||Median :0.000||0.263888889||c||:100||Median :5.000||Median :0.000836||3||:100||Median :0.5546||Median :4.000||||||||||||||||||
 |-
-|4||57||Female||Fall||4||4||6||4||3||3||1||1||1||4.35||0||59||0||0||0
+|Mean||:5.658||Mean||:2.626||Mean||:2.480||Mean||:0.459||0.305555556||d||:100||Mean||:4.998||Mean||:-0.020541||4||:100||Mean||:0.5542||Mean||:3.898||
 |-
-|5||54||Female||Peds_vs_Auto||14||11||8||8||5||7||0||1||1||54.59||0||284||0||0||0
+| 3rd Qu.:8.525||3rd Qu.:5.101||3rd Qu.:4.962||3rd Qu.:1.000||0.347222222||e||:100||3rd Qu.:7.000||3rd Qu.:0.635832||5||:100||3rd Qu.:0.7829||3rd Qu.:5.000||||||||||||||||||
 |-
-|6||16||Female||MVA||13||7||7||7||7||8||1||1||1||75.92||7||180||0||1||1
+| Max.||:17.263||Max.||:8.774||Max.||:8.085||Max||:1.000||f||:100||Max.||:9.000||Max.||:3.281171||6||:100||Max.||:0.9887||Max.||:12.000||||
 |-
-|7||21||Male||Fall||3||3||6||3||3||3||1||0||1||.||.||.||0||0||0
+! colspan="22" |<b>Missing Data</b>
 |-
-|8||25||Male||Fall||3||4||3||3||3||3||0||1||0||5.26||0||88||0||1||1
+|||y||||x1||x2||x3||x4||x5||x6||x7||||x8||x9||x10||6||:100||||||||||||
 |-
-|9||30||Male||GSW||3||9||3||3||3||5||1||1||1||43.88||0||367||0||1||1
+| Min.||:-3.046||Min||:0.000||Min.||:0.000||Min.||:0.0000||1||:144||h||:76||Min.||:1.000||Min.||:-2.8806||8||:77||Min.||:0.1024||Min.||:0.000
 |-
-|10||38||Male||Fall||3||6||6||3||3||3||1||1||1||45.6||4||107||0||1||1
+| 1st Qu.:2.768||1st Qu.:0.000||1st Qu.:0.000||1st Qu.:0.0000||2||:128||a||:72||1st Qu.:3.000||1st Qu.:-0.6974||4||:75||1st Qu.:0.3504||1st Qu.:2.000|| || || || || || || ||
 |-
-|11||43||Male||Blunt||8||7||7||7||6||7||1||0||1||7.76||0||72||0||0||0
+| Median :5.650||Median :3.127||Median :0.000||Median :0.0000||3||:143||d||:72||Median :5.000||Median :-0.0490||9||:72||Median :0.5590||Median :4.000|| || || || || || || ||
 |-
-|12||40||Male||Fall||12||14||14||12||7||8||0||1||1||26.64||0||125||0||0||0
+| Mean||:5.699||Mean||:2.633||Mean||:2.455||Mean||:0.4586||4||:138||f||:72||Mean||:4.994||Mean||:-0.0261||6||:71||Mean||:0.5619||Mean||:3.853
 |-
-|13||21||Male||MVA||12||13||9||9||7||7||1||0||1||.||.||.||0||1||1
+| 3rd Qu.:8.552||3rd Qu.:5.088||3rd Qu.:4.966||3rd Qu.:1.0000||5||:147||g||:72||3rd Qu.:7.000||3rd Qu.:0.6379||10||:71||3rd Qu.:0.7843||3rd Qu.:5.000|| || || || || || || ||
 |-
-|14||35||Female||MVA||6||5||6||5||5||7||1||1||0||65.14||0||655||1||1||1
+| Max.||:17.263||Max.||:8.774||Max.||:8.085||Max.||:1.0000||NA's:300||(Other):336||Max.||:9.000||Max.||:2.7863||(Other):334||Max.||:0.9887||Max.||:12.000||||||
 |-
-|15||59||Male||Peds_vs_Auto||14||14||0||0||8||8||1||1||0||.||.||.||0||0||0
+| NA's||:300||NA's||:300||NA's||:300||NA's||:300||||NA's||:300||NA's||:300||NA's||:300||NA's||:300||NA's||300||NA's||:300||
 |-
-|16||32||Male||MCA||5||6||3||3||4||5||1||0||0||.||.||.||0||0||0
+! colspan="22" |<b>Imputed</b>
 |-
-|17||31||Male||MVA||7||3||9||3||5||7||1||0||0||3.82||0||28||0||0||0
+|||y||||x1||x2||x3||x4||x5||x6||x7||||x8||x9||x10||missing y||||||||||||||
 |-
-|18||57||Male||MVA||4||3||7||3||3||3||0||1||1||.||.||.||0||1||1
+| Min.||:-4.646||Min.||:-3.687||Min.||:-4.672||0.381944444||0.199305556||a||:108||Min.||:-1.005||Min.||:-2.88057||4||:114||Min.||:0.04046||Min.||:-2.009||Mode: logical||
 |-
-|19||18||Male||Blunt||4||3||6||3||5||3||1||1||1||.||.||.||0||1||1
+| 1st Qu.:3.015||1st Qu.:0.000||1st Qu.:0.000||0.354166667||0.211111111||h||:108||1st Qu.:3.000||1st Qu.:-0.70985||9||:105||1st Qu.:0.36273||1st Qu.:2.352||FALSE:700||||||||||||||||
 |-
-|20||48||Male||Bike_vs_Auto||3||8||7||3||5||7||0||0||0||.||.||.||0||0||0
+|  Median :5.741||Median :2.836||Median :2.564||0.259722222||d||:107||Median :5.000||Median :-0.03695||5||:103||Median :0.57188||Median :4.000||TRUE :300||||||||||||||||||
 |-
-|21||19||Male||GSW||15||15||3||3||.||6||1||0||1||.||.||.||1||1||1
+| Mean||:5.846||Mean||:2.649||Mean||:2.533||0.297916667||c||:104||Mean||:5.012||Mean||:-0.01444||3||:102||Mean||:0.56607||Mean||:3.858||NA's :0||||
 |-
-|22||22||Male||Fall||3||3||3||3||2||2||1||1||1||9.7||0||80||0||1||1
+| 3rd Qu.:8.667||3rd Qu.:4.978||3rd Qu.:4.871||0.351388889||j||:100||3rd Qu.:7.000||3rd Qu.:0.70157||6||:102||3rd Qu.:0.78259||3rd Qu.:5.000||||||||||||||||||||
 |-
-|23||20||Male||Peds_vs_Auto||15||14||13||13||5||8||1||1||1||.||.||.||0||1||1
+| Max.||:17.263||Max.||:9.979||Max.||:12.833||||f||:99||Max.||:12.178||Max.||:2.78629||8||:102||Max.||:0.99670||Max.||:12.000||||||
-|-
-|24||41||Male||MVA||3||3||6||3||3||7||1||0||0||.||.||.||0||0||0
-|-
-|25||27||Male||MCA||15||13||6||6||6||7||1||0||1||.||.||.||0||0||0
-|-
-|26||23||Male||MVA||14||14||7||7||4||7||1||1||1||.||.||.||0||0||0
-|-
-|27||36||Male||MCA||3||3||3||3||5||6||0||0||0||.||.||.||0||1||1
-|-
-|28||83||Female||Fall||14||14||9||9||.||5||0||1||1||208.92||42||641||1||1||1
-|-
-|29||26||Male||MCA||5||7||5||5||6||7||0||1||0||.||.||.||0||0||0
-|-
-|30||21||Male||Fall||14||14||14||14||5||7||0||1||1||294||30||1199||1||1||1
-|-
-|31||23||Male||MCA||12||13||13||12||.||7||1||0||1||.||.||.||0||0||0
-|-
-|32||45||Male||MCA||6||6||6||6||3||6||0||0||1||.||.||.||0||0||0
-|-
-|33||18||Male||Bike_vs_Auto||8||8||7||7||7||7||0||0||0||7.14||0||20||0||1||1
-|-
-|34||34||Male||MVA||7||7||3||3||4||6||0||1||1||47.73||0||226||0||1||1
-|-
-|35||19||Male||MVA||3||7||7||3||7||8||0||0||0||97.43||0||300||0||0||0
-|-
-|36||77||Female||Peds_vs_Auto||3||6||3||3||3||3||1||1||0||7.09||0||31||0||1||1
-|-
-|37||75||Male||Bike_vs_Auto||.||.||.||.||.||8||1||0||0||5.9||0||42||0||1||1
-|-
-|38||25||Male||Fall||14||.||6||6||8||8||0||0||1||29.61||0||175||1||0||1
-|-
-|39||62||Female||Fall||12||8||8||8||3||3||0||1||1||6.16||0||33||0||1||1
-|-
-|40||41||Male||MCA||7||3||7||3||5||5||1||1||1||1.66||0||23||0||1||1
-|-
-|41||60||Male||Bike_vs_Auto||3||12||7||3||3||5||1||1||0||3.8||0||12||0||1||1
-|-
-|42||29||Female||Peds_vs_Auto||9||14||3||3||8||7||1||0||1||.||.||.||0||1||1
-|-
-|43||48||Male||Blunt||12||12||11||11||6||7||0||0||1||5.39||0||43||0||0||0
-|-
-|44||41||Male||Peds_vs_Auto||3||3||3||3||2||2||1||1||0||1.28||0||15||1||1||1
-|-
-|45||34||Male||Fall||6||8||3||3||3||3||1||1||1||213.84||3||824||1||1||1
-|-
-|46||25||Female||MVA||6||8||3||3||.||7||0||1||0||1.7||0||36||0||0||0
 |}
 </center>
-====R imputation script====
-The R code below illustrates the imputation of a raw data.
- ###########################################
- # example of multiple imputation (R MI package)
- # See Docs: http://www.stat.ucla.edu/~yajima/Publication/mipaper.rev04.pdf
- #
- # If we don't have real or observed or derived data, we can simulate fake data
- # See this SOCR Data/Activity: [[SOCR_Data_PD_BiomedBigMetadata| Predictive Big Data Analytics, Modeling and Visualization of Clinical, Genetic and Imaging Data for Parkinson’s Disease]]
- #
- # Alternatively use the example below.
- # set.seed(123)
- # n <- 1000
- # u1 <- rbinom(n, 1, .5); v1 <- log(rnorm(n, 5, 1)); x1 <- u1*exp(v1)
- # u2 <- rbinom(n, 1, .5); v2 <- log(rnorm(n, 5, 1)); x2 <- u2*exp(v2)
- # x3 <- rbinom(n, 1, prob=0.45); x4 <- ordered(rep(seq(1, 5),100)[sample(1:n, n)]); x5 <- rep(letters[1:10],10)[sample(1:n, n)]; x6 <- trunc(runif(n, 1, 10)); x7 <- rnorm(n); x8 <- factor(rep(seq(1,10),10)[sample(1:n, n)]); x9 <- runif(n, 0.1, .99); x10 <- rpois(n, 4); y <- x1 + x2 + x7 + x9 + rnorm(n)
- # fakedata <- cbind.data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
- # randomly create missing values
- # dat <- mi:::.create.missing(fakedata, pct.mis=30)
- ##########################################################
- #
- library("mi")
- # copy-paste the (raw) data from the table into a plain text file "EpiBioSData.csv"
- EpiBiosData <- read.csv("~/EpiBioSData.csv", na.strings=c("",".","NA"))
- # get information matrix of the data
- inf <- mi.info(EpiBiosData)
- # to update the variable type of a specific variable to mi.info
- # inf <- update(inf, "type", list(x10="count"))
- # run the imputation without data transformation
- IMP <- mi(EpiBiosData, info=inf, check.coef.convergence=TRUE, add.noise=noise.control(post.run.iter=10))
- #
- # run the imputation with data transformation
- EpiBiosData.transformed <- mi.preprocess(EpiBiosData, inf)
- #IMP <- mi(EpiBiosData.transformed, n.iter=6, check.coef.convergence=TRUE, add.noise=noise.control(post.run.iter=6))
- #
- # IMP <- mi(EpiBiosData.transformed, n.iter=6, add.noise=TRUE)
- # no noise
- # IMP <- mi(dat, info=inf, n.iter=6, add.noise=FALSE) ## NOT RUN
- #
- # convergence checking
- converged(IMP, check = "data")  ## You should get FALSE here because only n.iter is small
- # converged(IMP, check = "coefs")
- IMP.bugs1 <- bugs.mi(IMP, check = "data")    ## BUGS object to look at the R hat statistics
- IMP.bugs2 <- bugs.mi(IMP, check = "coefs")   ## BUGS object to look at the R hat statistics
- plot(IMP.bugs1)  ## visually check R.hat
- #
- # visually check the imputation
- plot(IMP)
- #
- missing.pattern.plot(EpiBiosData, gray.scale = TRUE)
- #
- # to obtain the Completed/Imputed Dataset
- IMP.EpiBiosData.all <- mi.completed(IMP)
- #
- # save results out
- write.mi(IMP, format = "csv")
- write.csv(IMP.EpiBiosData.all, "~/EpiBioS_MIData.csv")
- #
- # Complete Data analytics: ression functions:
- # lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()
- #
- fit <- lm.mi(ever.sz ~ surgery + worst.gcs + sex + age, IMP)
- # fit1 <- lm.mi(ever.sz.1 ~ surgery.1 + worst.gcs.1 + sex.1 + age.1, IMP)
- # fit2 <- lm.mi(ever.sz.2 ~ surgery.2 + worst.gcs.2 + sex.2 + age.2, IMP)
- display(fit)
-====Imputed Complete Data====
+ # Check imputation convergence (details provided below)
-The table below shows 3 alternative imputation results for the [[SMHS_MissingData#Raw_TBI_data|same TBI data]] (generated using the [[SMHS_MissingData#R_imputation_script|R script above]]). Variable indices (ID, ..., ever.sz; ID.1, ..., ever.sz.1; ID.2, ..., ever.sz.2) indicate the results of the 3 complementary multiple imputations.
+ round(mipply(imputations, mean, to.matrix = TRUE), 3)
+ Rhats(imputations, statistic = "moments") # assess the convergence of MI algorithm
+ plot(imputations); hist(imputations); image(imputations); summary(imputations)
+ # Finally, pool over the m = 5 imputed datasets when we fit the “model”
+ # Pool from across the 4 chains – in order to estimate a linear regression model
+ model_results <- pool(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=imputations,  m=5 )
+ display (model_results); summary (model_results)
+ # Report the summaries of the imputations
+ data.frames <- complete(imputations, 3)  	# extract the first 3 chains
+ lapply(data.frames, summary)
+===TBI Data Example===
+See these traumatic brain injury (TBI) data: http://wiki.socr.umich.edu/index.php/SMHS_MissingData#Raw_TBI_data
+<center>
 {| class="wikitable" style="text-align:center; " border="1"
 |-
-!ID||age||sex||mechanism||field.gcs||er.gcs||icu.gcs||worst.gcs||X6m.gose||X2013.gose||skull.fx||temp.injury||surgery||spikes.hr||min.hr||max.hr||acute.sz||late.sz||ever.sz||ID.1||age.1||sex.1||mechanism.1||field.gcs.1||er.gcs.1||icu.gcs.1||worst.gcs.1||X6m.gose.1||X2013.gose.1||skull.fx.1||temp.injury.1||surgery.1||spikes.hr.1||min.hr.1||max.hr.1||acute.sz.1||late.sz.1||ever.sz.1||ID.2||age.2||sex.2||mechanism.2||field.gcs.2||er.gcs.2||icu.gcs.2||worst.gcs.2||X6m.gose.2||X2013.gose.2||skull.fx.2||temp.injury.2||surgery.2||spikes.hr.2||min.hr.2||max.hr.2||acute.sz.2||late.sz.2||ever.sz.2
+|Ind||ID||Age ||Sex ||TBI||field.gcs||er.gcs||icu.gcs||worst.gcs||X6m.gose||X2013.gose||skull.fx||temp.injury||surgery||spikes.hr||min.hr||max.hr||acute.sz||late.sz||ever.sz
 |-
-|1||19||Male||Fall||10||10||10||10||5||5||0||1||1||31.33408337||17.64389893||329.4906597||1||1||1||1||19||Male||Fall||10||10||10||10||5||5||0||1||1||96.74023746||18.35575322||474.5394709||1||1||1||1||19||Male||Fall||10||10||10||10||5||5||0||1||1||134.3949936||13.93977114||357.0245178||1||1||1
+|1 ||1||19||M||Fall||10||10||10||10||5||5||0||1||1||NA||NA||NA||1||1||1
 |-
-|2||55||Male||Blunt||8.949465016||3||3||3||5||7||1||1||1||168.74||14||757||0||1||1||2||55||Male||Blunt||4.583430169||3||3||3||5||7||1||1||1||168.74||14||757||0||1||1||2||55||Male||Blunt||6.759506638||3||3||3||5||7||1||1||1||168.74||14||757||0||1||1
+|2 ||2||55||M||Blunt||NA||3||3||3||5||7||1||1||1||168.74||14||757||0||1||1
 |-
-|3||24||Male||Fall||12||12||8||8||7||7||1||0||0||37.37||0||351||0||0||0||3||24||Male||Fall||12||12||8||8||7||7||1||0||0||37.37||0||351||0||0||0||3||24||Male||Fall||12||12||8||8||7||7||1||0||0||37.37||0||351||0||0||0
+|3 ||3||24||M||Fall||12||12||8||8||7||7||1||0||0||37.37||0||351||0||0||0
 |-
-|4||57||Female||Fall||4||4||6||4||3||3||1||1||1||4.35||0||59||0||0||0||4||57||Female||Fall||4||4||6||4||3||3||1||1||1||4.35||0||59||0||0||0||4||57||Female||Fall||4||4||6||4||3||3||1||1||1||4.35||0||59||0||0||0
+|4 ||4||57||F||Fall||4||4||6||4||3||3||1||1||1||4.35||0||59||0||0||0
 |-
-|5||54||Female||Peds_vs_Auto||14||11||8||8||5||7||0||1||1||54.59||0||284||0||0||0||5||54||Female||Peds_vs_Auto||14||11||8||8||5||7||0||1||1||54.59||0||284||0||0||0||5||54||Female||Peds_vs_Auto||14||11||8||8||5||7||0||1||1||54.59||0||284||0||0||0
+|5||5||54||F||Peds_vs_Auto||14||11||8||8||5||7||0||1||1||54.59||0||284||0||0||0
 |-
-|6||16||Female||MVA||13||7||7||7||7||8||1||1||1||75.92||7||180||0||1||1||6||16||Female||MVA||13||7||7||7||7||8||1||1||1||75.92||7||180||0||1||1||6||16||Female||MVA||13||7||7||7||7||8||1||1||1||75.92||7||180||0||1||1
+|6||6||16||F||MVA||13||7||7||7||7||8||1||1||1||75.92||7||180||0||1||1
-|-
+|}
-|7||21||Male||Fall||3||3||6||3||3||3||1||0||1||97.81993169||-0.213863843||38.62625185||0||0||0||7||21||Male||Fall||3||3||6||3||3||3||1||0||1||57.41204562||22.47601237||-152.4300823||0||0||0||7||21||Male||Fall||3||3||6||3||3||3||1||0||1||-88.08486298||-17.52260473||-126.0982147||0||0||0
+</center>
+ # Load the (raw) data from the table into a plain text file "08_EpiBioSData_Incomplete.csv"
+ TBI_Data <- read.csv("https://umich.instructure.com/files/720782/download?download_frd=1", na.strings=c("",".","NA"))
+ summary(TBI_Data)
+ # Get information matrix of the data
+ # (1) Convert to a missing_data.frame
+ # library("betareg"); library("mi")
+ mdf <- missing_data.frame(TBI_Data) # warnings about missingness patterns
+ show(mdf); mdf@patterns
+ # (2) change things: '''mi::change()''' method changes the family, imputation method,
+ # size, type, and so forth of a missing variable. It’s called
+ # before calling '''mi''' to affect how the conditional expectation of each
+ # missing variable is modeled.
+ mdf <- '''change'''(mdf, y = "spikes.hr", what = "transformation", to = "identity")
+<center>
+{| class="wikitable" style="text-align:left; " border="1"
 |-
-|8||25||Male||Fall||3||4||3||3||3||3||0||1||0||5.26||0||88||0||1||1||8||25||Male||Fall||3||4||3||3||3||3||0||1||0||5.26||0||88||0||1||1||8||25||Male||Fall||3||4||3||3||3||3||0||1||0||5.26||0||88||0||1||1
+|Arg||Description
 |-
-|9||30||Male||GSW||3||9||3||3||3||5||1||1||1||43.88||0||367||0||1||1||9||30||Male||GSW||3||9||3||3||3||5||1||1||1||43.88||0||367||0||1||1||9||30||Male||GSW||3||9||3||3||3||5||1||1||1||43.88||0||367||0||1||1
+|y||A character vector naming one or more missing variables within the missing_data.frame specified by the <b>data</b> argument, or a vector of integers or a logical vector indicating which missing_variables to change.
 |-
-|10||38||Male||Fall||3||6||6||3||3||3||1||1||1||45.6||4||107||0||1||1||10||38||Male||Fall||3||6||6||3||3||3||1||1||1||45.6||4||107||0||1||1||10||38||Male||Fall||3||6||6||3||3||3||1||1||1||45.6||4||107||0||1||1
+|what||A character string naming what is to be changed, such as "family", "imputation_method", "size","transformation", "type", "link", or "model."
 |-
-|11||43||Male||Blunt||8||7||7||7||6||7||1||0||1||7.76||0||72||0||0||0||11||43||Male||Blunt||8||7||7||7||6||7||1||0||1||7.76||0||72||0||0||0||11||43||Male||Blunt||8||7||7||7||6||7||1||0||1||7.76||0||72||0||0||0
+|to||A character string naming what y should be changed to, such as one of the admissible families, imputation methods, transformations, or types. If missing, then possible choices for the "to" argument will be helpfully printed on the screen.
-|-
-|12||40||Male||Fall||12||14||14||12||7||8||0||1||1||26.64||0||125||0||0||0||12||40||Male||Fall||12||14||14||12||7||8||0||1||1||26.64||0||125||0||0||0||12||40||Male||Fall||12||14||14||12||7||8||0||1||1||26.64||0||125||0||0||0
-|-
-|13||21||Male||MVA||12||13||9||9||7||7||1||0||1||-139.5137892||-33.84480983||19.23090912||0||1||1||13||21||Male||MVA||12||13||9||9||7||7||1||0||1||125.7529181||19.14860131||291.9026199||0||1||1||13||21||Male||MVA||12||13||9||9||7||7||1||0||1||-12.93052956||-11.43761507||39.10220788||0||1||1
-|-
-|14||35||Female||MVA||6||5||6||5||5||7||1||1||0||65.14||0||655||1||1||1||14||35||Female||MVA||6||5||6||5||5||7||1||1||0||65.14||0||655||1||1||1||14||35||Female||MVA||6||5||6||5||5||7||1||1||0||65.14||0||655||1||1||1
-|-
-|15||59||Male||Peds_vs_Auto||14||14||0||0||8||8||1||1||0||104.0330205||28.99974405||718.6130268||0||0||0||15||59||Male||Peds_vs_Auto||14||14||0||0||8||8||1||1||0||158.5435588||11.79083406||788.0040332||0||0||0||15||59||Male||Peds_vs_Auto||14||14||0||0||8||8||1||1||0||70.32014857||-4.21119291||-20.90694906||0||0||0
-|-
-|16||32||Male||MCA||5||6||3||3||4||5||1||0||0||99.20158013||-17.25973621||351.0875441||0||0||0||16||32||Male||MCA||5||6||3||3||4||5||1||0||0||-217.7065141||3.020094958||-464.0898326||0||0||0||16||32||Male||MCA||5||6||3||3||4||5||1||0||0||-33.098168||-0.137931315||487.9378096||0||0||0
-|-
-|17||31||Male||MVA||7||3||9||3||5||7||1||0||0||3.82||0||28||0||0||0||17||31||Male||MVA||7||3||9||3||5||7||1||0||0||3.82||0||28||0||0||0||17||31||Male||MVA||7||3||9||3||5||7||1||0||0||3.82||0||28||0||0||0
-|-
-|18||57||Male||MVA||4||3||7||3||3||3||0||1||1||-149.9422408||-30.42385829||-302.9962493||0||1||1||18||57||Male||MVA||4||3||7||3||3||3||0||1||1||56.13694368||18.8199818||-62.49856966||0||1||1||18||57||Male||MVA||4||3||7||3||3||3||0||1||1||45.55341087||4.468479615||-213.454323||0||1||1
-|-
-|19||18||Male||Blunt||4||3||6||3||5||3||1||1||1||40.39410742||-36.87895336||97.37341326||0||1||1||19||18||Male||Blunt||4||3||6||3||5||3||1||1||1||-106.5749596||5.559437069||-1149.613796||0||1||1||19||18||Male||Blunt||4||3||6||3||5||3||1||1||1||51.18634141||-1.937828826||-342.6008879||0||1||1
-|-
-|20||48||Male||Bike_vs_Auto||3||8||7||3||5||7||0||0||0||118.8000347||32.36213678||-73.13271061||0||0||0||20||48||Male||Bike_vs_Auto||3||8||7||3||5||7||0||0||0||-87.77082486||3.717360906||-4.392630662||0||0||0||20||48||Male||Bike_vs_Auto||3||8||7||3||5||7||0||0||0||108.2372319||0.187220054||519.1251498||0||0||0
-|-
-|21||19||Male||GSW||15||15||3||3||9.25042468||6||1||0||1||40.13687855||-14.22895294||736.3143588||1||1||1||21||19||Male||GSW||15||15||3||3||7.223305755||6||1||0||1||18.46733942||25.54649386||-315.4151426||1||1||1||21||19||Male||GSW||15||15||3||3||3.828147759||6||1||0||1||111.5912998||17.80387347||200.2606779||1||1||1
-|-
-|22||22||Male||Fall||3||3||3||3||2||2||1||1||1||9.7||0||80||0||1||1||22||22||Male||Fall||3||3||3||3||2||2||1||1||1||9.7||0||80||0||1||1||22||22||Male||Fall||3||3||3||3||2||2||1||1||1||9.7||0||80||0||1||1
-|-
-|23||20||Male||Peds_vs_Auto||15||14||13||13||5||8||1||1||1||143.8993295||9.086586358||410.1307881||0||1||1||23||20||Male||Peds_vs_Auto||15||14||13||13||5||8||1||1||1||68.95037612||-3.634136214||745.7692261||0||1||1||23||20||Male||Peds_vs_Auto||15||14||13||13||5||8||1||1||1||17.52584031||4.326014737||307.7486916||0||1||1
-|-
-|24||41||Male||MVA||3||3||6||3||3||7||1||0||0||36.38713558||-7.70043125||397.1825116||0||0||0||24||41||Male||MVA||3||3||6||3||3||7||1||0||0||16.44376591||-3.227208192||318.7734905||0||0||0||24||41||Male||MVA||3||3||6||3||3||7||1||0||0||39.56564601||2.069381684||695.0459207||0||0||0
-|-
-|25||27||Male||MCA||15||13||6||6||6||7||1||0||1||89.67182293||-5.545616243||290.1814949||0||0||0||25||27||Male||MCA||15||13||6||6||6||7||1||0||1||17.66215864||4.730421601||191.3041522||0||0||0||25||27||Male||MCA||15||13||6||6||6||7||1||0||1||67.26462511||8.869208934||468.9527863||0||0||0
-|-
-|26||23||Male||MVA||14||14||7||7||4||7||1||1||1||46.02145191||-13.18053664||91.69057998||0||0||0||26||23||Male||MVA||14||14||7||7||4||7||1||1||1||39.36793917||12.35289402||405.9565656||0||0||0||26||23||Male||MVA||14||14||7||7||4||7||1||1||1||154.8598602||30.03873428||384.849982||0||0||0
-|-
-|27||36||Male||MCA||3||3||3||3||5||6||0||0||0||57.6777688||2.679615385||664.0843475||0||1||1||27||36||Male||MCA||3||3||3||3||5||6||0||0||0||-90.21498545||-17.3682012||-26.4769941||0||1||1||27||36||Male||MCA||3||3||3||3||5||6||0||0||0||-50.90586656||-6.319652137||352.0473406||0||1||1
-|-
-|28||83||Female||Fall||14||14||9||9||4.636755103||5||0||1||1||208.92||42||641||1||1||1||28||83||Female||Fall||14||14||9||9||6.242101503||5||0||1||1||208.92||42||641||1||1||1||28||83||Female||Fall||14||14||9||9||2.05329258||5||0||1||1||208.92||42||641||1||1||1
-|-
-|29||26||Male||MCA||5||7||5||5||6||7||0||1||0||-13.37057197||-14.89381792||152.0008772||0||0||0||29||26||Male||MCA||5||7||5||5||6||7||0||1||0||91.7509579||-0.870090941||529.0874936||0||0||0||29||26||Male||MCA||5||7||5||5||6||7||0||1||0||-0.534176227||-9.783851967||490.89377||0||0||0
-|-
-|30||21||Male||Fall||14||14||14||14||5||7||0||1||1||294||30||1199||1||1||1||30||21||Male||Fall||14||14||14||14||5||7||0||1||1||294||30||1199||1||1||1||30||21||Male||Fall||14||14||14||14||5||7||0||1||1||294||30||1199||1||1||1
-|-
-|31||23||Male||MCA||12||13||13||12||2.860415458||7||1||0||1||74.98397791||13.22365165||155.338988||0||0||0||31||23||Male||MCA||12||13||13||12||3.086836617||7||1||0||1||-11.96623577||-7.703251349||530.2622312||0||0||0||31||23||Male||MCA||12||13||13||12||5.904815799||7||1||0||1||16.65737111||2.374428875||436.7728378||0||0||0
-|-
-|32||45||Male||MCA||6||6||6||6||3||6||0||0||1||104.8995511||-16.52390582||483.006168||0||0||0||32||45||Male||MCA||6||6||6||6||3||6||0||0||1||-80.67017517||-15.81483739||116.5352423||0||0||0||32||45||Male||MCA||6||6||6||6||3||6||0||0||1||71.76729005||-1.066523637||893.1056629||0||0||0
-|-
-|33||18||Male||Bike_vs_Auto||8||8||7||7||7||7||0||0||0||7.14||0||20||0||1||1||33||18||Male||Bike_vs_Auto||8||8||7||7||7||7||0||0||0||7.14||0||20||0||1||1||33||18||Male||Bike_vs_Auto||8||8||7||7||7||7||0||0||0||7.14||0||20||0||1||1
-|-
-|34||34||Male||MVA||7||7||3||3||4||6||0||1||1||47.73||0||226||0||1||1||34||34||Male||MVA||7||7||3||3||4||6||0||1||1||47.73||0||226||0||1||1||34||34||Male||MVA||7||7||3||3||4||6||0||1||1||47.73||0||226||0||1||1
-|-
-|35||19||Male||MVA||3||7||7||3||7||8||0||0||0||97.43||0||300||0||0||0||35||19||Male||MVA||3||7||7||3||7||8||0||0||0||97.43||0||300||0||0||0||35||19||Male||MVA||3||7||7||3||7||8||0||0||0||97.43||0||300||0||0||0
-|-
-|36||77||Female||Peds_vs_Auto||3||6||3||3||3||3||1||1||0||7.09||0||31||0||1||1||36||77||Female||Peds_vs_Auto||3||6||3||3||3||3||1||1||0||7.09||0||31||0||1||1||36||77||Female||Peds_vs_Auto||3||6||3||3||3||3||1||1||0||7.09||0||31||0||1||1
-|-
-|37||75||Male||Bike_vs_Auto||13.73432996||14.02844584||12.08707554||6.608860636||5.076496724||8||1||0||0||5.9||0||42||0||1||1||37||75||Male||Bike_vs_Auto||1.439149068||9.833777842||1.828027302||-0.395951549||8.130388643||8||1||0||0||5.9||0||42||0||1||1||37||75||Male||Bike_vs_Auto||4.627381629||11.70604238||4.818391501||2.101228568||7.755182097||8||1||0||0||5.9||0||42||0||1||1
-|-
-|38||25||Male||Fall||14||18.45286032||6||6||8||8||0||0||1||29.61||0||175||1||0||1||38||25||Male||Fall||14||7.292553675||6||6||8||8||0||0||1||29.61||0||175||1||0||1||38||25||Male||Fall||14||9.021238862||6||6||8||8||0||0||1||29.61||0||175||1||0||1
-|-
-|39||62||Female||Fall||12||8||8||8||3||3||0||1||1||6.16||0||33||0||1||1||39||62||Female||Fall||12||8||8||8||3||3||0||1||1||6.16||0||33||0||1||1||39||62||Female||Fall||12||8||8||8||3||3||0||1||1||6.16||0||33||0||1||1
-|-
-|40||41||Male||MCA||7||3||7||3||5||5||1||1||1||1.66||0||23||0||1||1||40||41||Male||MCA||7||3||7||3||5||5||1||1||1||1.66||0||23||0||1||1||40||41||Male||MCA||7||3||7||3||5||5||1||1||1||1.66||0||23||0||1||1
-|-
-|41||60||Male||Bike_vs_Auto||3||12||7||3||3||5||1||1||0||3.8||0||12||0||1||1||41||60||Male||Bike_vs_Auto||3||12||7||3||3||5||1||1||0||3.8||0||12||0||1||1||41||60||Male||Bike_vs_Auto||3||12||7||3||3||5||1||1||0||3.8||0||12||0||1||1
-|-
-|42||29||Female||Peds_vs_Auto||9||14||3||3||8||7||1||0||1||43.10920277||-12.88149901||504.4628083||0||1||1||42||29||Female||Peds_vs_Auto||9||14||3||3||8||7||1||0||1||206.9761351||13.04583926||-237.7973505||0||1||1||42||29||Female||Peds_vs_Auto||9||14||3||3||8||7||1||0||1||117.2635425||-5.753348151||233.8244434||0||1||1
-|-
-|43||48||Male||Blunt||12||12||11||11||6||7||0||0||1||5.39||0||43||0||0||0||43||48||Male||Blunt||12||12||11||11||6||7||0||0||1||5.39||0||43||0||0||0||43||48||Male||Blunt||12||12||11||11||6||7||0||0||1||5.39||0||43||0||0||0
-|-
-|44||41||Male||Peds_vs_Auto||3||3||3||3||2||2||1||1||0||1.28||0||15||1||1||1||44||41||Male||Peds_vs_Auto||3||3||3||3||2||2||1||1||0||1.28||0||15||1||1||1||44||41||Male||Peds_vs_Auto||3||3||3||3||2||2||1||1||0||1.28||0||15||1||1||1
-|-
-|45||34||Male||Fall||6||8||3||3||3||3||1||1||1||213.84||3||824||1||1||1||45||34||Male||Fall||6||8||3||3||3||3||1||1||1||213.84||3||824||1||1||1||45||34||Male||Fall||6||8||3||3||3||3||1||1||1||213.84||3||824||1||1||1
-|-
-|46||25||Female||MVA||6||8||3||3||8.135915557||7||0||1||0||1.7||0||36||0||0||0||46||25||Female||MVA||6||8||3||3||7.477618429||7||0||1||0||1.7||0||36||0||0||0||46||25||Female||MVA||6||8||3||3||6.093298745||7||0||1||0||1.7||0||36||0||0||0
 |}
+</center>
+ # (3) examine missingness patterns
+ summary(mdf); hist(mdf);
+ image(mdf)
+ # (4) Perform initial imputation
+ imputations <- mi(mdf, n.iter=30, n.chains=5, verbose=TRUE)
+ hist(imputations)
+ # (5) Extracts several multiply imputed data.frames from “imputations” object
+ data.frames <- complete(imputations, 5)
+ # (6) Report a list of “summaries” for each element (imputation instance)
+ lapply(data.frames, summary)
+ # (6.a) To cast the imputed numbers as integers (not necessary, but may be useful)
+ indx <- sapply(data.frames[[5]], is.numeric)  # get the indices of numeric columns
+ data.frames[[5]][indx] <- lapply(data.frames[[5]][indx], function(x) as.numeric(as.integer(x))) 		    # cast each value as integer
+ # (7) Save results out
+ write.csv(data.frames[[5]], "C:\\Users\\Dinov\\Desktop\\TBI_MIData.csv")
+ # (8) Complete Data analytics functions:
+ # library("mi")
+ #lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()
+ # (8.1) Define Linear Regression for multiply imputed dataset – Also see Step (10)
+ ##linear regression for each imputed data set - 5 regression models are fit
+ <b>fit_lm1</b> <- glm(ever.sz ~ surgery + worst.gcs + factor(sex) + age, data.frames$\$$`chain:1`, family = "binomial"); summary(fit_lm1); display(fit_lm1)
+ # Fit the appropriate model and pool the results (estimates over MI chains)
+ model_results <- pool(ever.sz ~ surgery + worst.gcs + factor(sex) + age, family = "binomial", data=imputations,  m=5)
+ display (model_results); summary (model_results)
+ # Report the summaries of the imputations
+ data.frames <- complete(imputations, 3)  	# extract the first 3 chains
+ lapply(data.frames, summary)
+ # (9) Validation: we now verify whether enough iterations were conducted.
+ # Validation criteria demands that the mean of each completed variable should
+ # be similar for each of the k chains (in this case k=5).
+ # <b>mipply</b> is wrapper for <b>sapply</b> invoked on mi-class objects to compute the col means
+ round(mipply(imputations, mean, to.matrix = TRUE), 3)
+ # <b>Rhat</b> convergence statistics compares the variance between chains to the variance
+ # across chains.
+ # <b>Rhat Values ~ 1.0</b> indicate likely convergence,
+ # <b>Rhat Values > 1.1</b> indicate that the chains should be run longer
+ # (use large number of iterations)
+ Rhats(imputations, statistic = "moments") # assess the convergence of MI algorithm
+ # When convergence is unstable, we can continue the iterations for all chains, e.g.
+ imputations <- mi(imputations, n.iter=20) # add additional 20 iterations
+ # To <b>plot</b> the produced mi results, for all missing_variables we can generate
+ # a histogram of the observed, imputed, and completed data.
+ # We can compare of the completed data to the fitted values implied by the model
+ # for the completed data, by plotting binned residuals.
+ # <b>hist</b> function works similarly as plot.
+ # <b>image</b> function gives a sense of the missingness patterns in the data
+ plot(imputations); hist(imputations); image(imputations); summary(imputations)
+ # (10) Finally, pool over the m = 5 imputed datasets when we fit the “model”
+ # Pool from across the 4 chains – in order to estimate a linear regression model
+ # and impact ov various predictors
+ <span style="background-color: #37FDFC">model_results <- <b><u>pool</u></b>(ever.sz ~ surgery + worst.gcs + factor(sex) + age, data =  imputations,  m =  5 ); display (model_results); summary (model_results)  </span>
+ # Report the summaries of the imputations
+ data.frames <- complete(imputations, 3)  	# extract the first 3 chains
+ lapply(data.frames, summary)			# report summaries
+===Parkinson's Disease Case Study===
+Background: See: http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata
+ <b>#imputation for logistic regression</b>
+ # load the data: <b>08_PPMI_GSA_clinical.csv</b>
+ ppmi_new<-read.csv("https://umich.instructure.com/files/330401/download?download_frd=1",header=TRUE)
+ # install.packages("psych")
+ library(psych)
+ # report descriptive statistics
+ ppmi_new$\$$ResearchGroup <- ifelse(ppmi_new$\$$ResearchGroup == "Control", "Control", "Patient")
+ ppmi_new$\$$ResearchGroup <- as.factor(ppmi_new$\$$ResearchGroup): table(ppmi_new$\$$ResearchGroup)
+ # Reduce the Dataset: extract only the columns of interest (to ensure real time calculations)
+ ppmi_new_1 <- as.data.frame(ppmi_new)[,   c("ResearchGroup","R_fusiform_gyrus_Curvedness","Sex","Age","chr12_rs34637584_GT","UPDRS_Part_I_Summary_Score_Baseline","UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline","UPDRS_Part_III_Summary_Score_Baseline",  "UPDRS_Part_IV_Summary_Score_Baseline")]
+ descript <- <b><u>describe</u></b>(ppmi_new_1); descript	# reports means, SE’s, etc.
+ dim(ppmi_new); dim(ppmi_new_1)	# check dimensions before
+ # imputation 1
+ # install.packages("mi"); install.packages("betareg")
+ library(mi); library(betareg)
+ mdf<-missing_data.frame(ppmi_new_1[-1,])
+ apply(ppmi_new_1[-1,],1,mean)
+ imputations<-mi(mdf, seed=900);   summary(imputations)
+ ppmi_complete<-complete(imputations)
+ # Save the imputation
+ # again you may want to cast imputed numerical values as integers, perhaps
+ # (indx <- sapply(data.frames[[5]], is.numeric)  # get the indices of numeric columns
+ # data.frames[[5]][indx] <- lapply(data.frames[[5]][indx], function(x) as.numeric(as.integer(x)))   # cast each value
+ # write.csv(ppmi_complete,"C:\\Users\\Dinov\\Desktop\\ppmi_new_1_complete.csv")
+ write.csv(ppmi_complete,"ppmi_new_1_complete.csv")
+ # Get the imputations of 4 iterations
+ ppmi_complete_1<- ppmi_complete[[1]]; ppmi_complete_2<- ppmi_complete[[2]];
+ ppmi_complete_3<- ppmi_complete[[3]]; ppmi_complete_4<- ppmi_complete[[4]]
+ # average the imputations
+ ppmi_updrs <- (ppmi_complete_1+ppmi_complete_2+ppmi_complete_4+ppmi_complete_3)/4
+ # ppmi_top<-cbind(ppmi_new_1[-1,1:335], ppmi_updrs) #delete the first observation
+ # colnames(ppmi_top)<-colnames(ppmi_new[,1:393])
+ # Save results – complete dataset
+ #  write.csv(ppmi_top,"C:\\Users\\Dinov\\Desktop\\ppmi_top_complete_1.csv")
+ # write.csv(ppmi_top,"ppmi_top_complete_1.csv")
+ # Now, run some Data analytics on complete data:
+ # library("mi")
+ #lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()
+ # Define Linear Regression for multiply imputed dataset
+ ##linear regression for each imputed data set - 5 regression models are fit
+ model_results <- pool(ResearchGroup ~  R_fusiform_gyrus_Curvedness + factor(Sex) + Age +
+ factor(chr12_rs34637584_GT) + UPDRS_Part_I_Summary_Score_Baseline + UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline + UPDRS_Part_III_Summary_Score_Baseline, family = "binomial", data=imputations,  m=3)
+ display (model_results); summary (model_results)
+===[[SMHS_MissingData_Appendix|Appendix]]===
 ===See also===