Imputation of Data with Non-linear Associations

Preface

R packages

In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:

R version 3.6.1 (2019-07-05)
mice (version: 3.6.0)
JointAI (version: 0.6.0)
ggplot2 (version: 3.2.1)
reshape2 (version: 1.4.3)
ggpubr (version: 0.2.2)

Dataset

For this practical, we will use the NHANES3 data, another subset of the data we have already seen in the lecture slides and the previous practicals. It contains only those cases that have observed wgt and some columns that are not needed were excluded.

To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file NHANES3_for_practicals.RData on your computer. If you know the path to the file, you can also use load("<path>/NHANES3_for_practicals.RData").

Aim

The focus of this practical is the imputation of data that has features that require special attention.

In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.

Our aim is to fit the following linear regression model for weight:

lm(wgt ~ gender + bili + age * (chol + HDL) + hgt)

We expect that the effects of cholesterol and HDL may differ with age, and, hence, include interaction terms between age and chol and HDL, respectively.

Additionally, we want to include the other variables in the dataset as auxiliary variables.

Imputation using mice

Use of the Just Another Variable approach can in some settings reduce bias. Alternatively, we can use passive imputation, i.e., calculate the interaction terms in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than normal imputation models.

Just Another Variable approach

Task 1

Calculate the interaction terms in the incomplete data.
Perform the setup-run of mice() without any iterations.

Solution 1

# calculate the interaction terms
NHANES3$agechol <- NHANES3$age * NHANES3$chol
NHANES3$ageHDL <- NHANES3$age * NHANES3$HDL

# setup run
imp0 <- mice(NHANES3, maxit = 0, 
             defaultMethod = c('norm', 'logreg', 'polyreg', 'polr'))
imp0

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##      wgt   gender     bili      age     chol      HDL      hgt     educ     race      SBP   hypten 
##       ""       ""   "norm"       ""   "norm"   "norm"   "norm"   "polr"       ""   "norm" "logreg" 
##       WC  agechol   ageHDL 
##   "norm"   "norm"   "norm" 
## PredictorMatrix:
##        wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt      0      1    1   1    1   1   1    1    1   1      1  1       1      1
## gender   1      0    1   1    1   1   1    1    1   1      1  1       1      1
## bili     1      1    0   1    1   1   1    1    1   1      1  1       1      1
## age      1      1    1   0    1   1   1    1    1   1      1  1       1      1
## chol     1      1    1   1    0   1   1    1    1   1      1  1       1      1
## HDL      1      1    1   1    1   0   1    1    1   1      1  1       1      1

Task 2

Apply the necessary change to the imputation method and predictor matrix.

Since the interaction terms are calculated from the orignal variables, these interaction terms should not be used to impute the original variables.

Solution 2

meth <- imp0$method 
pred <- imp0$predictorMatrix

# change imputation for "bili" to pmm (to prevent negative values)
meth["bili"] <- 'pmm'
 
# changes in predictor matrix to prevent original variables being imputer based 
# on the interaction terms
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"] <- 0

meth

##      wgt   gender     bili      age     chol      HDL      hgt     educ     race      SBP   hypten 
##       ""       ""    "pmm"       ""   "norm"   "norm"   "norm"   "polr"       ""   "norm" "logreg" 
##       WC  agechol   ageHDL 
##   "norm"   "norm"   "norm"

pred

##         wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt       0      1    1   1    1   1   1    1    1   1      1  1       1      1
## gender    1      0    1   1    1   1   1    1    1   1      1  1       1      1
## bili      1      1    0   1    1   1   1    1    1   1      1  1       1      1
## age       1      1    1   0    1   1   1    1    1   1      1  1       1      1
## chol      1      1    1   1    0   1   1    1    1   1      1  1       0      1
## HDL       1      1    1   1    1   0   1    1    1   1      1  1       1      0
## hgt       1      1    1   1    1   1   0    1    1   1      1  1       1      1
## educ      1      1    1   1    1   1   1    0    1   1      1  1       1      1
## race      1      1    1   1    1   1   1    1    0   1      1  1       1      1
## SBP       1      1    1   1    1   1   1    1    1   0      1  1       1      1
## hypten    1      1    1   1    1   1   1    1    1   1      0  1       1      1
## WC        1      1    1   1    1   1   1    1    1   1      1  0       1      1
## agechol   1      1    1   1    1   1   1    1    1   1      1  1       0      1
## ageHDL    1      1    1   1    1   1   1    1    1   1      1  1       1      0

Task 3

Run the imputation using the JAV approach and check the traceplot.

Solution 3

# run imputation with the JAV approach
impJAV <- mice(NHANES3, method = meth, predictorMatrix = pred,
                maxit = 10, m = 5)

plot(impJAV, layout = c(4, 6))

Task 4

We skip the more detailed evaluation of the imputed values. With the settings given in the solution the chains have converged and distributions of the imputed values match the distributions of the observed data closely enough.

Analyse the imputed data and pool the results.

Solution 4

miraJAV <- with(impJAV, 
                lm(wgt  ~ gender + bili + age + chol + HDL + agechol + ageHDL + hgt))
summary(pool(miraJAV), conf.int = TRUE)

Passive Imputation

For the passive imputation, we can re-use the adjusted versions of meth and pred we created for the JAV approach, but additional changes to meth are necessary.

Task 1

Specify the new imputation method, i.e., adapt meth and save it as methPAS.

For passive imputation instead of an imputation method you need to specify the formula used to calculate the value that is imputed passively.

Solution 1

# changes in imputation method for passive imputation
methPAS <- meth
methPAS[c("agechol", "ageHDL")] <- c("~I(age*chol)", "~I(age*HDL)")

Task 2

Run the imputation using passive imputation and check the traceplot.

Solution 2

# run imputation with passive imputation
impPAS <- mice(NHANES3, method = methPAS, predictorMatrix = pred,
                maxit = 10, m = 5)
plot(impPAS, layout = c(4, 6))

Task 3

We will again skip the detailed evaluation of convergence and the imputed values.

Analyse the imputed data and pool the results.

Solution 3

miraPAS <- with(impPAS, 
                lm(wgt ~ gender + bili + age + chol + HDL + agechol + ageHDL + hgt))

summary(pool(miraPAS), conf.int = TRUE)

Imputation with JointAI

JointAI provides functions that allow us to fit Bayesian regression models with incomplete covariates. The main functions are designed to resemble the standard functions to fit regression models with complete data.

For univariate outcomes the following functions are available:

lm_imp(): for linear regression
glm_imp(): for generalized linear regression (e.g., logistic, gamma or Poisson)
clm_imp(): for ordinal (cumulative logit) regression

Specification of the analysis model

Similar to the complete data versions, the functions from JointAI take the following arguments:


`formula`	model formula
`data`	original, incomplete dataset
`family`	for glm’s: the distribution family of the outcome (e.g., `binomial()` for a logistic model)

Task 1

Specify the linear regression model using the function lm_imp() with the model formula given at the beginning of this practical (wgt ~ gender + bili + age * (chol + HDL) + hgt).

You need to specify the arguments formula, data and n.iter. Set n.iter = 100 (we will learn about this argument further down).

Solution 1

lm1 <- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = NHANES3,
               n.iter = 100)

class(lm1)

## [1] "JointAI"

Task 2

The result is an object of class JointAI, which contains * the original data (data), * information on the type of model (call, analysis_type, models, fixed, random, hyperpars, scale_pars) and * information about the MCMC sampling (mcmc_settings), * the JAGS model (model) and * the MCMC sample (MCMC; if a sample was generated), * the computational time (time) of the MCMC sampling, and * some additional elements that are used by methods for objects of class JointAI but are typically not of interest for the user.

Check which imputation models were used in lm1.

The imputation model types are returned in the JointAI object under “models”.

Solution 2

lm1$models

##    hgt   chol    HDL   bili 
## "norm" "norm" "norm" "norm"

Specification of the imputation models

In JointAI, there are some arguments related to the imputation part of the model:


`models`	vector of imputation methods (for details see below and the vignette on Model Specification)
`auxvars`	one-sided formula of variables that are not part of the analysis model but should be used to predict missing values (optional; for details see the vignette on Model Specification)
`refcats`	allows specification of which category of categorical variables is used as reference (optional; for details see the vignette on Model Specification)
`trunc`	allow truncation of distributions of incomplete continuous covariates (for details see the vignette on Model Specification)

Like in mice default imputation models are chosen based on the class of each of the incomplete variables. The default choices for baseline (not time-varying) covariates are

name	model	variable type
`norm`	linear regression	continuous variables
`logit`	logistic regression	factors with two levels
`multilogit`	multinomial logit model	unordered factors with >2 levels
`cumlogit`	cumulative logit model	ordered factors with >2 levels

Alternative imputation methods are available for continuous baseline covariates:

name	model	variable type
`lognorm`	normal regression of the log-transformed variable	right-skewed variables >0
`gamma`	Gamma regression (with log-link)	right-skewed variables >0
`beta`	beta regression (with logit-link)	continuous variables with values in [0, 1]

Task

Re-fit the linear regression model, but now

specify a log-normal distribution for bili
set the reference category to the largest group
use the other variables that are in the data as auxiliary variables

To specify a non-default imputation method use the argument models = c(bili = ...).

To set the respective largest group as reference category for all categorical variables use the argument refcats = "largest".

Auxiliary variables need to be specified explicitely using the argument auxvars. It takes a one-sided formula

Solution

lm2 <- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = NHANES3,
               auxvars = ~ race + SBP + hypten + WC,
               models = c(bili = 'lognorm'), refcats = 'largest',
               n.iter = 100)

Specification of the MCMC settings

Specification of the basic settings for the MCMC sampling can be achieved using the following arguments:


`n.chains`	number of MCMC chains
`n.adapt`	number of iterations used in the adaptive phase
`n.iter`	number of iterations per MCMC chain in the sampling phase

JointAI has more arguments than the ones given above, but in this practical we focus only on the most important. You may find out more about all the arguments in the vignette on MCMC Settings.

By default, n.chains = 3, n.adapt = 100 and n.iter = 0.

It is useful to use more than one chain to be able to evaluate convergence of the MCMC chains.

Samples in the adaptive phase are not used for the final MCMC sample. They are needed to optimize the MCMC sampler. When the number provided via the argument n.adapt is insufficient for this optimization a warning message will be printed. In simple models (e.g., linear regression) usually the default value of n.adapt = 100 can be used.

The default value for n.iter, the number of iterations in the sampling phase is 0 (no MCMC sample will be created). The number of iterations that is needed depends on how complex the model and the data is and can range from somewhere as low as 100 up to several million.

In the following we will look at some criteria to determine if the number of MCMC samples that was used is sufficient.

Evaluation of the MCMC sample

The first step after fitting a Bayesian model should be to confirm that the MCMC chains have converged. This can be done visually, using a traceplot (plotting the sampled values per parameter and chain across iterations) or using the Gelman-Rubin criterion.

Task 1

Investigate convergence of lm2 by creating a traceplot using the function traceplot(). The plot should show a horizontal band without trends or patterns and the different chains should be mixed.

Solution 1

traceplot(lm2)

Task 2

Investigate convergence of lm2 using the Gelman-Rubin criterion, implemented in the function GR_crit().

The upper limit of the confidence interval should not be much larger than 1.

Solution 2

GR_crit(lm2)

## Potential scale reduction factors:
## 
##              Point est. Upper C.I.
## (Intercept)       1.002      1.012
## genderfemale      0.996      0.997
## bili              0.997      0.999
## age               0.999      1.008
## chol              0.997      0.998
## HDL               1.000      1.010
## hgt               1.007      1.029
## age:chol          0.998      1.001
## age:HDL           1.008      1.036
## sigma_wgt         1.004      1.024
## 
## Multivariate psrf
## 
## 1.02

Continue

When we are satisfied with the convergence of the MCMC chains we can take a look at the MCMC sample is precise enough. We can do this by comparing the Monte Carlo error (which describes the error made since we have just a finite sample) to the estimated standard deviation. This is implemented in the function MC_error().

Task 3

Calculate the Monte Carlo error for lm2 with the help of MC_error().

Solution 3

MC_error(lm2)

##                  est   MCSE     SD MCSE/SD
## (Intercept)  -107.18 0.7399 20.211   0.037
## genderfemale    4.19 0.1154  2.010   0.057
## bili           -7.43 0.1810  2.816   0.064
## age             0.63 0.0110  0.242   0.045
## chol            9.49 0.1193  2.190   0.054
## HDL           -21.89 0.3558  6.242   0.057
## hgt            99.74 0.4445  9.266   0.048
## age:chol       -0.15 0.0025  0.044   0.057
## age:HDL         0.15 0.0079  0.126   0.063
## sigma_wgt      15.43 0.0285  0.483   0.059

par(mar = c(3.2, 5, 1, 1), mgp = c(2, 0.6, 0))
plot(MC_error(lm2))

The plot is just shows the ratio MCSE/SD. It is usually faster to check if all dots are left of the vertical 5% line than to read all numers in the table. For our final results, we should increase the number of iterations (either by inceasing n.iter or n.chains) to get a more precise estimate of the posterior distribution.

Results

Task

Get the summary of the model using the function summary().

Solution

summary(lm2)

## 
##  Linear model fitted with JointAI 
## 
## Call:
## lm_imp(formula = wgt ~ gender + bili + age * (chol + HDL) + hgt, 
##     data = NHANES3, n.iter = 100, auxvars = ~race + SBP + hypten + 
##         WC, refcats = "largest", models = c(bili = "lognorm"), 
##     seed = 2019)
## 
## Posterior summary:
##                  Mean      SD      2.5%    97.5% tail-prob. GR-crit
## (Intercept)  -107.175 20.2107 -147.0738 -68.6469     0.0000   1.012
## genderfemale    4.190  2.0102    0.6021   8.1219     0.0133   0.997
## bili           -7.432  2.8160  -12.0517  -0.9728     0.0133   0.999
## age             0.627  0.2422    0.1641   1.0533     0.0133   1.008
## chol            9.490  2.1903    5.0501  13.4212     0.0000   0.998
## HDL           -21.894  6.2417  -33.7844  -9.9311     0.0000   1.010
## hgt            99.740  9.2662   82.5247 116.3341     0.0000   1.029
## age:chol       -0.147  0.0442   -0.2306  -0.0635     0.0000   1.001
## age:HDL         0.148  0.1261   -0.0836   0.4101     0.2533   1.036
## 
## Posterior summary of residual std. deviation:
##           Mean    SD 2.5% 97.5% GR-crit
## sigma_wgt 15.4 0.483 14.4  16.4    1.02
## 
## 
## MCMC settings:
## Iterations = 101:200
## Sample size per chain = 100 
## Thinning interval = 1 
## Number of chains = 3 
## 
## Number of observations: 500

Additional exercise JointAI

Monitoring imputed values

JointAI also allows us to extract imputed values to generate multiple imputed datasets that can, for instance, be used for a secondary analysis.

To be able to extract the imputed values, it is necessary to specify in advance that these values should be monitored (“recorded”). This can be done using the argument monitor_params.

monitor_params uses a number of key words to specify which (groups of) parameters or values should be monitored. Each key word works like a switch and can be set to TRUE or FALSE. For more details see the vignette on Parameter Selection.

For monitoring imputed values, monitor_params = c(imps = TRUE) needs to be specified.

Task

Re-fit the linear regression model, but now additionally set the argument monitor_params to keep the imputed values.

Solution

lm3 <- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = NHANES3,
               auxvars = ~ race + SBP + hypten + WC,
               models = c(bili = 'lognorm'), refcats = 'largest',
               n.iter = 100, monitor_params = c(imps = TRUE))

Extracting imputed data

The function get_MIdat() allows us to create multiple completed datasets from an object of class JointAI.

It takes the following arguments


`object`	an object of class `JointAI`
`m`	number of imputed datasets
`include`	should the original, incomplete data be included? (default is `TRUE`)
`start`	first iteration of interest; allows discarding burn-in iterations
`minspace`	minimum number of iterations between iterations chosen as imputed values (default is 50)
`seed`	optional seed value
`export_to_SPSS`	logical; should the completed data be exported to SPSS?
`resdir`	optional directory for results (for export to SPSS)
`filename`	optional file name (for export to SPSS)

Task 1

Extract 5 imputed datasets from lm3.
Inspect the resulting object.

Solution 1

MIdat3 <- get_MIdat(lm3, m = 5)

head(MIdat3)

summary(MIdat3)

##   Imputation_       wgt            gender          bili             age             chol       
##  Min.   :0.0   Min.   : 39.01   male  :1512   Min.   :0.2000   Min.   :20.00   Min.   : 1.871  
##  1st Qu.:1.0   1st Qu.: 65.20   female:1488   1st Qu.:0.6000   1st Qu.:31.00   1st Qu.: 4.270  
##  Median :2.5   Median : 76.20                 Median :0.7000   Median :43.00   Median : 4.884  
##  Mean   :2.5   Mean   : 78.25                 Mean   :0.7403   Mean   :45.02   Mean   : 5.002  
##  3rd Qu.:4.0   3rd Qu.: 86.41                 3rd Qu.:0.9000   3rd Qu.:58.00   3rd Qu.: 5.640  
##  Max.   :5.0   Max.   :167.38                 Max.   :2.9000   Max.   :79.00   Max.   :10.680  
##                                               NA's   :47                       NA's   :41      
##       HDL              hgt                          educ                     race     
##  Min.   :0.3419   Min.   :1.397   Less than 9th grade :186   Mexican American  : 312  
##  1st Qu.:1.1168   1st Qu.:1.626   9-11th grade        :414   Other Hispanic    : 348  
##  Median :1.3376   Median :1.676   High school graduate:690   Non-Hispanic White:1092  
##  Mean   :1.3995   Mean   :1.686   some college        :888   Non-Hispanic Black: 672  
##  3rd Qu.:1.6000   3rd Qu.:1.753   College or above    :816   other             : 576  
##  Max.   :3.1300   Max.   :1.930   NA's                :  6                            
##  NA's   :41       NA's   :11                                                          
##       SBP          hypten           WC            agechol           ageHDL           .rownr     
##  Min.   : 77.35   no  :2205   Min.   : 53.91   Min.   : 45.54   Min.   :  7.92   Min.   :  1.0  
##  1st Qu.:109.33   yes : 774   1st Qu.: 85.00   1st Qu.:140.16   1st Qu.: 39.37   1st Qu.:125.8  
##  Median :118.67   NA's:  21   Median : 95.44   Median :222.31   Median : 57.40   Median :250.5  
##  Mean   :120.37               Mean   : 96.35   Mean   :227.71   Mean   : 63.14   Mean   :250.5  
##  3rd Qu.:128.67               3rd Qu.:105.10   3rd Qu.:296.40   3rd Qu.: 80.04   3rd Qu.:375.2  
##  Max.   :202.00               Max.   :154.70   Max.   :725.90   Max.   :171.36   Max.   :500.0  
##  NA's   :29                   NA's   :23       NA's   :246      NA's   :246                     
##       .id       
##  Min.   :  1.0  
##  1st Qu.:125.8  
##  Median :250.5  
##  Mean   :250.5  
##  3rd Qu.:375.2  
##  Max.   :500.0  
##

We see that some columns were added to the data:

Imputation_ identifies the imputation number
.id is the subject ID
.rownr refers to the row number of the original data

Task 2

Similar to the functions densplot() from the mice package and propplot(), the function plot_imp_distr() from JointAI allows us to plot the distribution of the observed and imputed values for the incomplete variables.

It takes the following arguments


`data`	a data.frame in long format containing multiple imputations (and the original incomplete data)
`imp`	the name of the variable specifying the imputation indicator
`id`	the name of the variable specifying the subject indicator
`rownr`	the name of a variable identifying which rows correspond to the same observation in the original (unimputed) data
`ncol`, `nrow`	optional number of rows and columns in the plot layout; automatically chosen if unspecified

Check the imputed values in MIdat3 using plot_imp_distr().

Solution 2

plot_imp_distr(MIdat3)

Transforming imputed data to a mids object

To perform standard analyses on the imputed data it is usefull to convert them to a mids object, so that they can be treated as if they had been imputed with mice().

The mice package proves the function as.mids() to convert a long-format dataset (with original and multiple imputed datasets stacked onto each other) to a mids object.

Task

Transform MIdat3 to a mids object and confirm that it has worked by checking the class of the result.

Solution

mids3 <- as.mids(MIdat3, .imp = "Imputation_", .id = '.id')

class(mids3)