SMHS MissingData

From SOCR
Revision as of 13:33, 19 May 2016 by Pineaumi (talk | contribs) (Scientific Methods for Health Sciences - Missing Data)
Jump to: navigation, search

Scientific Methods for Health Sciences - Missing Data

Questions

  • Why is data usually incomplete?
  • What are the best strategies for dealing with missing data- ignore cases, replace them by some population derived values, or impute them?
  • What is the impact of data manipulations on the core scientific inference?
  • Overview

    Many research studies encounter incomplete (missing) data that require special handling (e.g., processing, statistical analysis, visualization). There are a variety of methods (e.g., multiple imputation) to deal with missing data, detect missingness, impute the data, analyze the completed data-set and compare the characteristics of the raw and imputed data.

    Multiple imputation involves 3 steps.

    1) Impute: Create sets of plausible values for the missing observations that reflect uncertainty about the non-response model. Each of these sets of plausible values can be used to “fill-in” or complete the data-set.

    2) Analyze: process each of these imputed data-sets using complete-data methods.

    3) Combine: synthesize the results accounting for the uncertainty within each imputation round.

    In a general regression setting, let’s denote the scalar, or vector valued, outcomes by Y, and the corresponding vector of predictors by X. For a given case (e.g., subject, unit), these quantities are either observed (obs) or missing (mis). Thus, Yobs and Xobs represent the observed component of the outcome and the predictors; and Ymis and Xmis denote the unobserved components of the outcome and predictors, respectively. Imputation involves the estimation of the regression parameters β governing the conditional distribution of Y given X: f(Y|X,β). The efficiency, bias and precision of the estimates are important in this process.

    The type of missingness in the data is an important factor in the imputation process. The basic patterns of missingness include:

    • Missing completely at random (MCAR) assumes that the missing data is not related to any factor, known or unknown, in the study.

    • Missingness at random (MAR) assumes that the missingness depends only on observed quantities, which may include outcomes and/or predictors.

    • Non-ignorable missingness occurs when the missing data depends on unobserved quantities.

    Multiple Imputation Protocol

    (1) Imputation: Generate a set of m > 1 plausible values for $Z_{mis}=(Y_{mis},X_{mis})$.

    The imputation step relies upon assumptions regarding the cause of missingness in the dataset. The goal of the imputation is to account for the relationships between the unobserved and observed variables, while taking into account the uncertainty of the imputation. The commonly made MAR assumption for missing data is untestable without additional information. With MAR assumption we can generate imputations (Z{1},Z{2},...Z{m}) from the distribution $f(Z_{mis}|Z_{obs})$, since after conditioning on $Z_{obs}$the missingness is assumed to be random.

    The missingness is monotone when the data matrix can be rearranged so that there is a hierarchy of missingness where observing a particular variable $Z_b$ for a subject implies that $Z_a$ is observed, for a < b. In monotone condition settings many imputation methods may be employed including (for continuous variables) propensity methods, predictive mean matching, and (for discrete variables) discriminant analysis or logistic regression. For non-monotonic missingness, Markov Chain Monte Carlo (MCMC) approaches may be used.

    Each method has its own assumptions. For instance, predictive mean matching and MCMC approaches require multivariate normality. Predictive mean matching approach employs linear regression for the distribution of a partially observed variable, conditional on other factors. To impute the data using the predictive mean matching approach for a variable $Z_i$ with missing values, we fit a model using complete observations for Z1, . . . , Z(i-1) (not monotonicity assumption!):

    $E[Z_i|φ]=φ_0+φ_1Z_1+φ_2Z_2+...+φ_{i-1}Z_{i-1}$

    Next, new parameters φ* are randomly drawn from the distribution of the parameters (since these values are estimated instead of exactly known). For the 1≤l≤m imputation, the missing values are replaced by:

    $Z_i^l= φ_0^* +φ_1^*Z_1 +φ_2^*Z_2+...+φ_{i-1}^*Z_{i-1}+σ^*ϵ$,

    where σ* is the estimate of variance from the model and ϵ is a Gaussian, N(0,1). This is the simple regression method. We can impute the observed value of $Z_i$ that is closest to predicted $\hat{Z}_i=Z_i^l$ in the dataset – this is the predictive mean matching method.

    The propensity score method uses an alternative model for imputation where the values are imputed from observations that are equally likely to be missing, by fitting a logistic regression model for the missingness indicators. In some situations, propensity score imputation may lead to serious bias for missing covariates. For discrete incomplete variables, discriminant analysis or dichotomous logistic regression may be employed to impute values based on the estimated probability that a missing observation takes on a certain value.

    (2) Analysis: Analyze the m datasets using complete-case methods. This second step in the protocol involves carrying out the analysis of interest for each of the m imputed complete-observation datasets and storing the parameter vectors (e.g., effect-size coefficients, β’s) and their standard error estimates.

    (3) Combination: Combine the results from the m analyses and calculate the estimates of the within imputation and between imputation variability. If the imputation model is correct (assumptions are valid), these statistics account for the variability of the imputations and provide consistent estimates of the parameters and their standard errors.

    Simulated Example

    set.seed(123)
    # create MCAR missing-data generator
    create.missing <- function (data, pct.mis = 10)
    {
    n <- nrow(data)
    J <- ncol(data)
    if (length(pct.mis) == 1) {
    n.mis <- rep((n * (pct.mis/100)), J)
    
    }
    else {
    if (length(pct.mis) < J) 
    stop("The length of missing does not equal to the column of the data")
    n.mis <- n * (pct.mis/100)
    
    }
       for (i in 1:ncol(data)) {
           if (n.mis[i] == 0) { # if column has nothing missing, do nothing.
               data[, i] <- data[, i]
           }
           else {
               data[sample(1:n, n.mis[i], replace = FALSE), i] <- NA
    # For each given column (i), sample the row indices (1:n), 
    # a number of indices to replace as “missing”, n.mis[i], “NA”,
     	  # without replacement
           }
       }
       return(as.data.frame(data))
    }
    
    # Simulate some real multivariate data sim_data={y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10}
    # 
    n <- 1000; u1 <- rbinom(n, 1, .5); v1 <- log(rnorm(n, 5, 1)); x1 <- u1*exp(v1)
    u2 <- rbinom(n, 1, .5); v2 <- log(rnorm(n, 5, 1)); x2 <- u2*exp(v2)
    x3 <- rbinom(n, 1, prob=0.45); x4 <- ordered(rep(seq(1, 5),n)[sample(1:n, n)]); x5 <- rep(letters[1:10],n)[sample(1:n, n)]; x6 <- trunc(runif(n, 1, 10)); x7 <- rnorm(n); x8 <- factor(rep(seq(1,10),n)[sample(1:n, n)]); x9 <-   runif(n, 0.1, .99); x10 <- rpois(n, 4); y <- x1 + x2 + x7 + x9 + rnorm(n)
    
    # package the simulated data as a data frame object
    sim_data <- cbind.data.frame(y, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
    
    # randomly create missing values
    sim_data_30pct_missing <- create.missing(sim_data, pct.mis=30); 
    head(sim_data_30pct_missing); summary(sim_data_30pct_missing)
    
    # install.packages("mi")
    # install.packages("betareg")
    library("betareg"); library("mi")
    
    # get show the missing information matrix			
    mdf <- missing_data.frame(sim_data_30pct_missing) 
    show(mdf); mdf@patterns; image(mdf)   # remember the visual pattern of this MCAR
    
    # Next try to impute the missing values …
    imputations <- mi(sim_data_30pct_missing, n.iter=10, n.chains=3, verbose=TRUE)
    hist(imputations)
    
    # Extracts several multiply imputed data.frames from “imputations” object
    data.frames <- complete(imputations, 3)
    
    # Compare the summary stats for the original data (prior to introducing missing
    # values) with missing data and the re-completed data following imputation
    summary(sim_data); summary(sim_data_30pct_missing); summary(data.frames1);
    lapply(data.frames, summary)
    
    Original
    y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
    Min :-3.046 Min :0.000 Min. :0.000 Min :0.000 0.180555556 a :100 Min :1.000 Min. :-2.880570 1 :100 Min. :0.1024 Min. :0.000
    1st Qu.:2.788 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 0.222222222 b :100 1st Qu.:3.000 1st Qu.:-0.687578 2 :100 1st Qu.:0.3356 1st Qu.:2.000
    Median :5.613 Median :3.186 Median :2.024 Median :0.000 0.263888889 c :100 Median :5.000 Median :0.000836 3 :100 Median :0.5546 Median :4.000
    Mean :5.658 Mean :2.626 Mean :2.480 Mean :0.459 0.305555556 d :100 Mean :4.998 Mean :-0.020541 4 :100 Mean :0.5542 Mean :3.898
    3rd Qu.:8.525 3rd Qu.:5.101 3rd Qu.:4.962 3rd Qu.:1.000 0.347222222 e :100 3rd Qu.:7.000 3rd Qu.:0.635832 5 :100 3rd Qu.:0.7829 3rd Qu.:5.000
    Max. :17.263 Max. :8.774 Max. :8.085 Max :1.000 f :100 Max. :9.000 Max. :3.281171 6 :100 Max. :0.9887 Max. :12.000
    Missing Data
    y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 6 :100
    Min. :-3.046 Min :0.000 Min. :0.000 Min. :0.0000 1 :144 h :76 Min. :1.000 Min. :-2.8806 8 :77 Min. :0.1024 Min. :0.000
    1st Qu.:2.768 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0000 2 :128 a :72 1st Qu.:3.000 1st Qu.:-0.6974 4 :75 1st Qu.:0.3504 1st Qu.:2.000
    Median :5.650 Median :3.127 Median :0.000 Median :0.0000 3 :143 d :72 Median :5.000 Median :-0.0490 9 :72 Median :0.5590 Median :4.000
    Mean :5.699 Mean :2.633 Mean :2.455 Mean :0.4586 4 :138 f :72 Mean :4.994 Mean :-0.0261 6 :71 Mean :0.5619 Mean :3.853
    3rd Qu.:8.552 3rd Qu.:5.088 3rd Qu.:4.966 3rd Qu.:1.0000 5 :147 g :72 3rd Qu.:7.000 3rd Qu.:0.6379 10 :71 3rd Qu.:0.7843 3rd Qu.:5.000
    Max. :17.263 Max. :8.774 Max. :8.085 Max. :1.0000 NA's:300 (Other):336 Max. :9.000 Max. :2.7863 (Other):334 Max. :0.9887 Max. :12.000
    NA's :300 NA's :300 NA's :300 NA's :300 NA's :300 NA's :300 NA's :300 NA's :300 NA's 300 NA's :300
    Imputed
    y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 missing y
    Min. :-4.646 Min. :-3.687 Min. :-4.672 0.381944444 0.199305556 a :108 Min. :-1.005 Min. :-2.88057 4 :114 Min. :0.04046 Min. :-2.009 Mode: logical
    1st Qu.:3.015 1st Qu.:0.000 1st Qu.:0.000 0.354166667 0.211111111 h :108 1st Qu.:3.000 1st Qu.:-0.70985 9 :105 1st Qu.:0.36273 1st Qu.:2.352 FALSE:700
    Median :5.741 Median :2.836 Median :2.564 0.259722222 d :107 Median :5.000 Median :-0.03695 5 :103 Median :0.57188 Median :4.000 TRUE :300
    Mean :5.846 Mean :2.649 Mean :2.533 0.297916667 c :104 Mean :5.012 Mean :-0.01444 3 :102 Mean :0.56607 Mean :3.858 NA's :0
    3rd Qu.:8.667 3rd Qu.:4.978 3rd Qu.:4.871 0.351388889 j :100 3rd Qu.:7.000 3rd Qu.:0.70157 6 :102 3rd Qu.:0.78259 3rd Qu.:5.000
    Max. :17.263 Max. :9.979 Max. :12.833 f :99 Max. :12.178 Max. :2.78629 8 :102 Max. :0.99670 Max. :12.000


    # Check imputation convergence (details provided below)
    round(mipply(imputations, mean, to.matrix = TRUE), 3)
    Rhats(imputations, statistic = "moments") # assess the convergence of MI algorithm
    
    plot(imputations); hist(imputations); image(imputations); summary(imputations)
    
    # Finally, pool over the m = 5 imputed datasets when we fit the “model”
    # Pool from across the 4 chains – in order to estimate a linear regression model
    model_results <- pool(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10, data=imputations,  m=5 )
    display (model_results); summary (model_results)  
    # Report the summaries of the imputations
    data.frames <- complete(imputations, 3)  	# extract the first 3 chains
    lapply(data.frames, summary)
    

    TBI Data Example

    See these traumatic brain injury (TBI) data: http://wiki.socr.umich.edu/index.php/SMHS_MissingData#Raw_TBI_data

    Ind ID Age Sex TBI field.gcs er.gcs icu.gcs worst.gcs X6m.gose X2013.gose skull.fx temp.injury surgery spikes.hr min.hr max.hr acute.sz late.sz ever.sz
    1 1 19 M Fall 10 10 10 10 5 5 0 1 1 NA NA NA 1 1 1
    2 2 55 M Blunt NA 3 3 3 5 7 1 1 1 168.74 14 757 0 1 1
    3 3 24 M Fall 12 12 8 8 7 7 1 0 0 37.37 0 351 0 0 0
    4 4 57 F Fall 4 4 6 4 3 3 1 1 1 4.35 0 59 0 0 0
    5 5 54 F Peds_vs_Auto 14 11 8 8 5 7 0 1 1 54.59 0 284 0 0 0
    6 6 16 F MVA 13 7 7 7 7 8 1 1 1 75.92 7 180 0 1 1
    # Load the (raw) data from the table into a plain text file "08_EpiBioSData_Incomplete.csv"
    TBI_Data <- read.csv("https://umich.instructure.com/files/720782/download?download_frd=1", na.strings=c("",".","NA"))
    summary(TBI_Data)
    
    # Get information matrix of the data
    # (1) Convert to a missing_data.frame
    # library("betareg"); library("mi")			
    mdf <- missing_data.frame(TBI_Data) # warnings about missingness patterns
    show(mdf); mdf@patterns
    
    # (2) change things: mi::change() method changes the family, imputation method,
    # size, type, and so forth of a missing variable. It’s called 
    # before calling mi to affect how the conditional expectation of each 
    # missing variable is modeled.
    mdf <- change(mdf, y = "spikes.hr", what = "transformation", to = "identity")
    
    Arg Description
    y A character vector naming one or more missing variables within the missing_data.frame specified by the data argument, or a vector of integers or a logical vector indicating which missing_variables to change.
    what A character string naming what is to be changed, such as "family", "imputation_method", "size","transformation", "type", "link", or "model."
    to A character string naming what y should be changed to, such as one of the admissible families, imputation methods, transformations, or types. If missing, then possible choices for the "to" argument will be helpfully printed on the screen.
    # (3) examine missingness patterns
    summary(mdf); hist(mdf); 
    image(mdf)
    
    # (4) Perform initial imputation
    imputations <- mi(mdf, n.iter=30, n.chains=5, verbose=TRUE)
    hist(imputations)
    
    # (5) Extracts several multiply imputed data.frames from “imputations” object
    data.frames <- complete(imputations, 5)
    
    # (6) Report a list of “summaries” for each element (imputation instance)
    lapply(data.frames, summary)
    
    # (6.a) To cast the imputed numbers as integers (not necessary, but may be useful)
    indx <- sapply(data.frames5, is.numeric)  # get the indices of numeric columns
    data.frames5[indx] <- lapply(data.frames5[indx], function(x) as.numeric(as.integer(x))) 		    # cast each value as integer
    
    # (7) Save results out
    write.csv(data.frames5, "C:\\Users\\Dinov\\Desktop\\TBI_MIData.csv")
    
    # (8) Complete Data analytics functions:
    # library("mi")
    #lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()
    
    # (8.1) Define Linear Regression for multiply imputed dataset – Also see Step (10)
    ##linear regression for each imputed data set - 5 regression models are fit
    fit_lm1 <- glm(ever.sz ~ surgery + worst.gcs + factor(sex) + age, data.frames$\$$`chain:1`, family = "binomial"); summary(fit_lm1); display(fit_lm1)
    
     # Fit the appropriate model and pool the results (estimates over MI chains)
     model_results <- pool(ever.sz ~ surgery + worst.gcs + factor(sex) + age, family = "binomial", data=imputations,  m=5)
     display (model_results); summary (model_results)  
    
     # Report the summaries of the imputations
     data.frames <- complete(imputations, 3)  	# extract the first 3 chains
     lapply(data.frames, summary)
    
     # (9) Validation: we now verify whether enough iterations were conducted. 
     # Validation criteria demands that the mean of each completed variable should
     # be similar for each of the k chains (in this case k=5).
     # <b>mipply</b> is wrapper for <b>sapply</b> invoked on mi-class objects to compute the col means
     round(mipply(imputations, mean, to.matrix = TRUE), 3)
    
     # <b>Rhat</b> convergence statistics compares the variance between chains to the variance
     # across chains. 
     # <b>Rhat Values ~ 1.0</b> indicate likely convergence, 
     # <b>Rhat Values > 1.1</b> indicate that the chains should be run longer 
     # (use large number of iterations)
     Rhats(imputations, statistic = "moments") # assess the convergence of MI algorithm
    
     # When convergence is unstable, we can continue the iterations for all chains, e.g.
     imputations <- mi(imputations, n.iter=20) # add additional 20 iterations
    
     # To <b>plot</b> the produced mi results, for all missing_variables we can generate
     # a histogram of the observed, imputed, and completed data.
     # We can compare of the completed data to the fitted values implied by the model
     # for the completed data, by plotting binned residuals. 
     # <b>hist</b> function works similarly as plot. 
     # <b>image</b> function gives a sense of the missingness patterns in the data
     plot(imputations); hist(imputations); image(imputations); summary(imputations)
    
     # (10) Finally, pool over the m = 5 imputed datasets when we fit the “model”
     # Pool from across the 4 chains – in order to estimate a linear regression model
     # and impact ov various predictors
    
     <span style="background-color: #37FDFC">model_results <- <b><u>pool</u></b>(ever.sz ~ surgery + worst.gcs + factor(sex) + age, data =  imputations,  m =  5 ); display (model_results); summary (model_results)  </span>
    
     # Report the summaries of the imputations
     data.frames <- complete(imputations, 3)  	# extract the first 3 chains
     lapply(data.frames, summary)			# report summaries
    
    ==='"`UNIQ--h-6--QINU`"'Parkinson's Disease Case Study===
    
    Background: See: http://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata
    
     <b>#imputation for logistic regression</b>
     # load the data: <b>08_PPMI_GSA_clinical.csv</b>
     ppmi_new<-read.csv("https://umich.instructure.com/files/330401/download?download_frd=1",header=TRUE)
    
     # install.packages("psych")
     library(psych)
     # report descriptive statistics
    
     ppmi_new$\$$ResearchGroup <- ifelse(ppmi_new$\$$ResearchGroup == "Control", "Control", "Patient")
     ppmi_new$\$$ResearchGroup <- as.factor(ppmi_new$\$$ResearchGroup): table(ppmi_new$\$$ResearchGroup)
    
    # Reduce the Dataset: extract only the columns of interest (to ensure real time calculations)
    ppmi_new_1 <- as.data.frame(ppmi_new)[,   c("ResearchGroup","R_fusiform_gyrus_Curvedness","Sex","Age","chr12_rs34637584_GT","UPDRS_Part_I_Summary_Score_Baseline","UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline","UPDRS_Part_III_Summary_Score_Baseline",  "UPDRS_Part_IV_Summary_Score_Baseline")]
    
    descript <- describe(ppmi_new_1); descript	# reports means, SE’s, etc.
    
    dim(ppmi_new); dim(ppmi_new_1)	# check dimensions before 
    
    # imputation 1
    # install.packages("mi"); install.packages("betareg")
    library(mi); library(betareg)
    
    mdf<-missing_data.frame(ppmi_new_1[-1,])
    apply(ppmi_new_1[-1,],1,mean)
    imputations<-mi(mdf, seed=900);   summary(imputations)
    ppmi_complete<-complete(imputations)
    
    # Save the imputation
    # again you may want to cast imputed numerical values as integers, perhaps
    # (indx <- sapply(data.frames5, is.numeric)  # get the indices of numeric columns
    # data.frames5[indx] <- lapply(data.frames5[indx], function(x) as.numeric(as.integer(x)))   # cast each value
    # write.csv(ppmi_complete,"C:\\Users\\Dinov\\Desktop\\ppmi_new_1_complete.csv")
    write.csv(ppmi_complete,"ppmi_new_1_complete.csv")
    
    # Get the imputations of 4 iterations
    ppmi_complete_1<- ppmi_complete1; ppmi_complete_2<- ppmi_complete2; 
    ppmi_complete_3<- ppmi_complete3; ppmi_complete_4<- ppmi_complete4
    
    # average the imputations 
    ppmi_updrs <- (ppmi_complete_1+ppmi_complete_2+ppmi_complete_4+ppmi_complete_3)/4
    # ppmi_top<-cbind(ppmi_new_1[-1,1:335], ppmi_updrs) #delete the first observation
    # colnames(ppmi_top)<-colnames(ppmi_new[,1:393])
    
    # Save results – complete dataset
    #  write.csv(ppmi_top,"C:\\Users\\Dinov\\Desktop\\ppmi_top_complete_1.csv")
    # write.csv(ppmi_top,"ppmi_top_complete_1.csv")
    
    # Now, run some Data analytics on complete data:
    # library("mi")
    #lm.mi(); glm.mi(); polr.mi(); bayesglm.mi(); bayespolr.mi(); lmer.mi(); glmer.mi()
    # Define Linear Regression for multiply imputed dataset
    ##linear regression for each imputed data set - 5 regression models are fit
    
    model_results <- pool(ResearchGroup ~  R_fusiform_gyrus_Curvedness + factor(Sex) + Age + 
    factor(chr12_rs34637584_GT) + UPDRS_Part_I_Summary_Score_Baseline + UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline + UPDRS_Part_III_Summary_Score_Baseline, family = "binomial", data=imputations,  m=3)
    display (model_results); summary (model_results)
    

    Appendix

    See also





    Translate this page:

    (default)
    Uk flag.gif

    Deutsch
    De flag.gif

    Español
    Es flag.gif

    Français
    Fr flag.gif

    Italiano
    It flag.gif

    Português
    Pt flag.gif

    日本語
    Jp flag.gif

    България
    Bg flag.gif

    الامارات العربية المتحدة
    Ae flag.gif

    Suomi
    Fi flag.gif

    इस भाषा में
    In flag.gif

    Norge
    No flag.png

    한국어
    Kr flag.gif

    中文
    Cn flag.gif

    繁体中文
    Cn flag.gif

    Русский
    Ru flag.gif

    Nederlands
    Nl flag.gif

    Ελληνικά
    Gr flag.gif

    Hrvatska
    Hr flag.gif

    Česká republika
    Cz flag.gif

    Danmark
    Dk flag.gif

    Polska
    Pl flag.png

    România
    Ro flag.png

    Sverige
    Se flag.gif