In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.6.0)JointAI
(version: 0.6.0)ggplot2
(version: 3.2.1)reshape2
(version: 1.4.3)ggpubr
(version: 0.2.2)For this practical, we will use the NHANES3 data, another subset of the data we have already seen in the lecture slides and the previous practicals. It contains only those cases that have observed wgt
and some columns that are not needed were excluded.
To load this dataset, you can use the command file.choose()
which opens the explorer and allows you to navigate to the location of the file NHANES3_for_practicals.RData
on your computer. If you know the path to the file, you can also use load("<path>/NHANES3_for_practicals.RData")
.
The focus of this practical is the imputation of data that has features that require special attention.
In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
Our aim is to fit the following linear regression model for weight:
We expect that the effects of cholesterol and HDL may differ with age, and, hence, include interaction terms between age
and chol
and HDL
, respectively.
Additionally, we want to include the other variables in the dataset as auxiliary variables.
Use of the Just Another Variable approach can in some settings reduce bias. Alternatively, we can use passive imputation, i.e., calculate the interaction terms in each iteration of the MICE algorithm. Furthermore, predictive mean matching tends to lead to less bias than normal imputation models.
mice()
without any iterations.# calculate the interaction terms
NHANES3$agechol <- NHANES3$age * NHANES3$chol
NHANES3$ageHDL <- NHANES3$age * NHANES3$HDL
# setup run
imp0 <- mice(NHANES3, maxit = 0,
defaultMethod = c('norm', 'logreg', 'polyreg', 'polr'))
imp0
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## wgt gender bili age chol HDL hgt educ race SBP hypten
## "" "" "norm" "" "norm" "norm" "norm" "polr" "" "norm" "logreg"
## WC agechol ageHDL
## "norm" "norm" "norm"
## PredictorMatrix:
## wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## gender 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## bili 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## age 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## chol 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## HDL 1 1 1 1 1 0 1 1 1 1 1 1 1 1
Since the interaction terms are calculated from the orignal variables, these interaction terms should not be used to impute the original variables.
meth <- imp0$method
pred <- imp0$predictorMatrix
# change imputation for "bili" to pmm (to prevent negative values)
meth["bili"] <- 'pmm'
# changes in predictor matrix to prevent original variables being imputer based
# on the interaction terms
pred["chol", "agechol"] <- 0
pred["HDL", "ageHDL"] <- 0
meth
## wgt gender bili age chol HDL hgt educ race SBP hypten
## "" "" "pmm" "" "norm" "norm" "norm" "polr" "" "norm" "logreg"
## WC agechol ageHDL
## "norm" "norm" "norm"
## wgt gender bili age chol HDL hgt educ race SBP hypten WC agechol ageHDL
## wgt 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## gender 1 0 1 1 1 1 1 1 1 1 1 1 1 1
## bili 1 1 0 1 1 1 1 1 1 1 1 1 1 1
## age 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## chol 1 1 1 1 0 1 1 1 1 1 1 1 0 1
## HDL 1 1 1 1 1 0 1 1 1 1 1 1 1 0
## hgt 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## educ 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## race 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## SBP 1 1 1 1 1 1 1 1 1 0 1 1 1 1
## hypten 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## WC 1 1 1 1 1 1 1 1 1 1 1 0 1 1
## agechol 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## ageHDL 1 1 1 1 1 1 1 1 1 1 1 1 1 0
Run the imputation using the JAV approach and check the traceplot.
We skip the more detailed evaluation of the imputed values. With the settings given in the solution the chains have converged and distributions of the imputed values match the distributions of the observed data closely enough.
For the passive imputation, we can re-use the adjusted versions of meth
and pred
we created for the JAV approach, but additional changes to meth
are necessary.
Specify the new imputation method, i.e., adapt meth
and save it as methPAS
.
For passive imputation instead of an imputation method you need to specify the formula used to calculate the value that is imputed passively.
Run the imputation using passive imputation and check the traceplot.
We will again skip the detailed evaluation of convergence and the imputed values.
JointAI provides functions that allow us to fit Bayesian regression models with incomplete covariates. The main functions are designed to resemble the standard functions to fit regression models with complete data.
For univariate outcomes the following functions are available:
lm_imp()
: for linear regressionglm_imp()
: for generalized linear regression (e.g., logistic, gamma or Poisson)clm_imp()
: for ordinal (cumulative logit) regression
formula
|
model formula |
data
|
original, incomplete dataset |
family
|
for glm’s: the distribution family of the outcome (e.g., binomial() for a logistic model)
|
Specify the linear regression model using the function lm_imp()
with the model formula given at the beginning of this practical (wgt ~ gender + bili + age * (chol + HDL) + hgt
).
You need to specify the arguments formula
, data
and n.iter
. Set n.iter = 100
(we will learn about this argument further down).
lm1 <- lm_imp(wgt ~ gender + bili + age * (chol + HDL) + hgt, data = NHANES3,
n.iter = 100)
class(lm1)
## [1] "JointAI"
The result is an object of class JointAI
, which contains * the original data (data
), * information on the type of model (call
, analysis_type
, models
, fixed
, random
, hyperpars
, scale_pars
) and * information about the MCMC sampling (mcmc_settings
), * the JAGS model (model
) and * the MCMC sample (MCMC
; if a sample was generated), * the computational time (time
) of the MCMC sampling, and * some additional elements that are used by methods for objects of class JointAI
but are typically not of interest for the user.
Check which imputation models were used in lm1
.
The imputation model types are returned in the JointAI object under “models”.
models
|
vector of imputation methods (for details see below and the vignette on Model Specification) |
auxvars
|
one-sided formula of variables that are not part of the analysis model but should be used to predict missing values (optional; for details see the vignette on Model Specification) |
refcats
|
allows specification of which category of categorical variables is used as reference (optional; for details see the vignette on Model Specification) |
trunc
|
allow truncation of distributions of incomplete continuous covariates (for details see the vignette on Model Specification) |
class
of each of the incomplete variables. The default choices for baseline (not time-varying) covariates are
name | model | variable type |
---|---|---|
norm
|
linear regression | continuous variables |
logit
|
logistic regression | factors with two levels |
multilogit
|
multinomial logit model | unordered factors with >2 levels |
cumlogit
|
cumulative logit model | ordered factors with >2 levels |
name | model | variable type |
---|---|---|
lognorm
|
normal regression of the log-transformed variable | right-skewed variables >0 |
gamma
|
Gamma regression (with log-link) | right-skewed variables >0 |
beta
|
beta regression (with logit-link) | continuous variables with values in [0, 1] |
Re-fit the linear regression model, but now
bili
To specify a non-default imputation method use the argument models = c(bili = ...)
.
To set the respective largest group as reference category for all categorical variables use the argument refcats = "largest"
.
Auxiliary variables need to be specified explicitely using the argument auxvars
. It takes a one-sided formula
n.chains
|
number of MCMC chains |
n.adapt
|
number of iterations used in the adaptive phase |
n.iter
|
number of iterations per MCMC chain in the sampling phase |
JointAI has more arguments than the ones given above, but in this practical we focus only on the most important. You may find out more about all the arguments in the vignette on MCMC Settings.
By default, n.chains = 3
, n.adapt = 100
and n.iter = 0
.
It is useful to use more than one chain to be able to evaluate convergence of the MCMC chains.
Samples in the adaptive phase are not used for the final MCMC sample. They are needed to optimize the MCMC sampler. When the number provided via the argument n.adapt
is insufficient for this optimization a warning message will be printed. In simple models (e.g., linear regression) usually the default value of n.adapt = 100
can be used.
The default value for n.iter
, the number of iterations in the sampling phase is 0
(no MCMC sample will be created). The number of iterations that is needed depends on how complex the model and the data is and can range from somewhere as low as 100 up to several million.
In the following we will look at some criteria to determine if the number of MCMC samples that was used is sufficient.
The first step after fitting a Bayesian model should be to confirm that the MCMC chains have converged. This can be done visually, using a traceplot (plotting the sampled values per parameter and chain across iterations) or using the Gelman-Rubin criterion.
Investigate convergence of lm2
by creating a traceplot using the function traceplot()
. The plot should show a horizontal band without trends or patterns and the different chains should be mixed.
Investigate convergence of lm2
using the Gelman-Rubin criterion, implemented in the function GR_crit()
.
The upper limit of the confidence interval should not be much larger than 1.
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.002 1.012
## genderfemale 0.996 0.997
## bili 0.997 0.999
## age 0.999 1.008
## chol 0.997 0.998
## HDL 1.000 1.010
## hgt 1.007 1.029
## age:chol 0.998 1.001
## age:HDL 1.008 1.036
## sigma_wgt 1.004 1.024
##
## Multivariate psrf
##
## 1.02
When we are satisfied with the convergence of the MCMC chains we can take a look at the MCMC sample is precise enough. We can do this by comparing the Monte Carlo error (which describes the error made since we have just a finite sample) to the estimated standard deviation. This is implemented in the function MC_error()
.
Calculate the Monte Carlo error for lm2
with the help of MC_error()
.
## est MCSE SD MCSE/SD
## (Intercept) -107.18 0.7399 20.211 0.037
## genderfemale 4.19 0.1154 2.010 0.057
## bili -7.43 0.1810 2.816 0.064
## age 0.63 0.0110 0.242 0.045
## chol 9.49 0.1193 2.190 0.054
## HDL -21.89 0.3558 6.242 0.057
## hgt 99.74 0.4445 9.266 0.048
## age:chol -0.15 0.0025 0.044 0.057
## age:HDL 0.15 0.0079 0.126 0.063
## sigma_wgt 15.43 0.0285 0.483 0.059
The plot is just shows the ratio MCSE/SD
. It is usually faster to check if all dots are left of the vertical 5% line than to read all numers in the table. For our final results, we should increase the number of iterations (either by inceasing n.iter
or n.chains
) to get a more precise estimate of the posterior distribution.
Get the summary of the model using the function summary()
.
##
## Linear model fitted with JointAI
##
## Call:
## lm_imp(formula = wgt ~ gender + bili + age * (chol + HDL) + hgt,
## data = NHANES3, n.iter = 100, auxvars = ~race + SBP + hypten +
## WC, refcats = "largest", models = c(bili = "lognorm"),
## seed = 2019)
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit
## (Intercept) -107.175 20.2107 -147.0738 -68.6469 0.0000 1.012
## genderfemale 4.190 2.0102 0.6021 8.1219 0.0133 0.997
## bili -7.432 2.8160 -12.0517 -0.9728 0.0133 0.999
## age 0.627 0.2422 0.1641 1.0533 0.0133 1.008
## chol 9.490 2.1903 5.0501 13.4212 0.0000 0.998
## HDL -21.894 6.2417 -33.7844 -9.9311 0.0000 1.010
## hgt 99.740 9.2662 82.5247 116.3341 0.0000 1.029
## age:chol -0.147 0.0442 -0.2306 -0.0635 0.0000 1.001
## age:HDL 0.148 0.1261 -0.0836 0.4101 0.2533 1.036
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5% GR-crit
## sigma_wgt 15.4 0.483 14.4 16.4 1.02
##
##
## MCMC settings:
## Iterations = 101:200
## Sample size per chain = 100
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 500
JointAI also allows us to extract imputed values to generate multiple imputed datasets that can, for instance, be used for a secondary analysis.
To be able to extract the imputed values, it is necessary to specify in advance that these values should be monitored (“recorded”). This can be done using the argument monitor_params
.
monitor_params
uses a number of key words to specify which (groups of) parameters or values should be monitored. Each key word works like a switch and can be set to TRUE
or FALSE
. For more details see the vignette on Parameter Selection.
For monitoring imputed values, monitor_params = c(imps = TRUE)
needs to be specified.
Re-fit the linear regression model, but now additionally set the argument monitor_params
to keep the imputed values.
The function get_MIdat()
allows us to create multiple completed datasets from an object of class JointAI
.
object
|
an object of class JointAI
|
m
|
number of imputed datasets |
include
|
should the original, incomplete data be included? (default is TRUE )
|
start
|
first iteration of interest; allows discarding burn-in iterations |
minspace
|
minimum number of iterations between iterations chosen as imputed values (default is 50) |
seed
|
optional seed value |
export_to_SPSS
|
logical; should the completed data be exported to SPSS? |
resdir
|
optional directory for results (for export to SPSS) |
filename
|
optional file name (for export to SPSS) |
lm3
.## Imputation_ wgt gender bili age chol
## Min. :0.0 Min. : 39.01 male :1512 Min. :0.2000 Min. :20.00 Min. : 1.871
## 1st Qu.:1.0 1st Qu.: 65.20 female:1488 1st Qu.:0.6000 1st Qu.:31.00 1st Qu.: 4.270
## Median :2.5 Median : 76.20 Median :0.7000 Median :43.00 Median : 4.884
## Mean :2.5 Mean : 78.25 Mean :0.7403 Mean :45.02 Mean : 5.002
## 3rd Qu.:4.0 3rd Qu.: 86.41 3rd Qu.:0.9000 3rd Qu.:58.00 3rd Qu.: 5.640
## Max. :5.0 Max. :167.38 Max. :2.9000 Max. :79.00 Max. :10.680
## NA's :47 NA's :41
## HDL hgt educ race
## Min. :0.3419 Min. :1.397 Less than 9th grade :186 Mexican American : 312
## 1st Qu.:1.1168 1st Qu.:1.626 9-11th grade :414 Other Hispanic : 348
## Median :1.3376 Median :1.676 High school graduate:690 Non-Hispanic White:1092
## Mean :1.3995 Mean :1.686 some college :888 Non-Hispanic Black: 672
## 3rd Qu.:1.6000 3rd Qu.:1.753 College or above :816 other : 576
## Max. :3.1300 Max. :1.930 NA's : 6
## NA's :41 NA's :11
## SBP hypten WC agechol ageHDL .rownr
## Min. : 77.35 no :2205 Min. : 53.91 Min. : 45.54 Min. : 7.92 Min. : 1.0
## 1st Qu.:109.33 yes : 774 1st Qu.: 85.00 1st Qu.:140.16 1st Qu.: 39.37 1st Qu.:125.8
## Median :118.67 NA's: 21 Median : 95.44 Median :222.31 Median : 57.40 Median :250.5
## Mean :120.37 Mean : 96.35 Mean :227.71 Mean : 63.14 Mean :250.5
## 3rd Qu.:128.67 3rd Qu.:105.10 3rd Qu.:296.40 3rd Qu.: 80.04 3rd Qu.:375.2
## Max. :202.00 Max. :154.70 Max. :725.90 Max. :171.36 Max. :500.0
## NA's :29 NA's :23 NA's :246 NA's :246
## .id
## Min. : 1.0
## 1st Qu.:125.8
## Median :250.5
## Mean :250.5
## 3rd Qu.:375.2
## Max. :500.0
##
We see that some columns were added to the data:
Imputation_
identifies the imputation number.id
is the subject ID.rownr
refers to the row number of the original dataSimilar to the functions densplot()
from the mice package and propplot()
, the function plot_imp_distr()
from JointAI allows us to plot the distribution of the observed and imputed values for the incomplete variables.
It takes the following arguments
data
|
a data.frame in long format containing multiple imputations (and the original incomplete data) |
imp
|
the name of the variable specifying the imputation indicator |
id
|
the name of the variable specifying the subject indicator |
rownr
|
the name of a variable identifying which rows correspond to the same observation in the original (unimputed) data |
ncol , nrow
|
optional number of rows and columns in the plot layout; automatically chosen if unspecified |
Check the imputed values in MIdat3
using plot_imp_distr()
.
To perform standard analyses on the imputed data it is usefull to convert them to a mids
object, so that they can be treated as if they had been imputed with mice()
.
The mice package proves the function as.mids()
to convert a long-format dataset (with original and multiple imputed datasets stacked onto each other) to a mids
object.
Transform MIdat3
to a mids
object and confirm that it has worked by checking the class
of the result.
© Nicole Erler