In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.6.0)JointAI
(version: 0.6.0)For this practical, we will use the NHANES2 dataset, a subset of the data we have seen in the lecture slides.
To load this dataset, you can use the command file.choose()
which opens the explorer and allows you to navigate to the location of the file NHANES2_for_practicals.RData
on your computer. If you know the path to the file, you can also use load("<path>/NHANES2_for_practicals.RData")
. RStudio users can also just click on the file in the “Files” pane/tab to load it.
Imputation needs to be tailored to the dataset at hand and, hence, using the function mice()
well requires several arguments to be specified. To make the specification easier it is useful to do a dry-run which will create the default versions of everything that needs to be specified.
These default settings can then be adapted to our data.
Do the set-up run of mice()
with the NHANES2 data without any iterations (maxit = 0
).
There are many imputation methods available in mice. You can find the list in the help page of the mice()
function. We will focus here on the following ones:
name | variable type | description |
---|---|---|
pmm | any | Predictive mean matching |
norm | numeric | Bayesian linear regression |
logreg | binary | Logistic regression |
polr | ordered | Proportional odds model |
polyreg | unordered | Polytomous logistic regression |
The default imputation methods that mice()
selects can be specified in the argument defaultMethod
.
If unspecified, mice
will use
pmm
for numerical columns,logreg
for factor columns with two categories,polyreg
for columns with unordered andpolr
for columns with ordered factors with more than two categories.In the NHANES2
data we have the following variables:
When a normal imputation model seems to be appropriate for most of the continuous covariates, you may want to specify norm
as the default method in the setup run. Let’s do that:
The order of the types of variable is: continuous, binary, factor, ordered factor.
## Warning: Number of logged events: 1
In the plot of the NHANES2
data shown above we can see that the variable creat
has a skewed distribution, hence, using a normal imputation model may not work well.
meth
from imp0
.creat
so that this variable will be imputed using predictive mean matching.""
) the variable will not be imputed.## HDL race bili smoke DM gender WC chol HyperMed alc
## "norm" "" "norm" "polr" "" "" "norm" "norm" "polr" "polr"
## SBP wgt hypten cohort occup age educ albu creat uricacid
## "norm" "norm" "logreg" "" "polyreg" "" "polr" "norm" "pmm" "norm"
## BMI hypchol hgt
## "norm" "logreg" "norm"
The predictor matrix specifies which variables are used in the linear predictors of each of the imputation models.
A value of 1
specifies that the variable given in the column name is used in the model to impute the variable given in the row name (and 0
specifies that this variable is not used in that model).
Get the predictorMatrix
from imp0
. Notice that mice has already set some of the values to 0
. Do you understand why?
## HDL race bili smoke DM gender WC chol HyperMed alc SBP wgt hypten cohort occup age educ
## HDL 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## race 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## bili 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## smoke 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1
## DM 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1
## gender 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1
## WC 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1
## chol 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1
## HyperMed 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1
## alc 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
## SBP 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1
## wgt 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
## hypten 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1
## cohort 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## occup 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1
## age 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1
## educ 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0
## albu 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## creat 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## uricacid 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## BMI 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## hypchol 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## hgt 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
## albu creat uricacid BMI hypchol hgt
## HDL 1 1 1 1 1 1
## race 1 1 1 1 1 1
## bili 1 1 1 1 1 1
## smoke 1 1 1 1 1 1
## DM 1 1 1 1 1 1
## gender 1 1 1 1 1 1
## WC 1 1 1 1 1 1
## chol 1 1 1 1 1 1
## HyperMed 1 1 1 1 1 1
## alc 1 1 1 1 1 1
## SBP 1 1 1 1 1 1
## wgt 1 1 1 1 1 1
## hypten 1 1 1 1 1 1
## cohort 1 1 1 1 1 1
## occup 1 1 1 1 1 1
## age 1 1 1 1 1 1
## educ 1 1 1 1 1 1
## albu 0 1 1 1 1 1
## creat 1 0 1 1 1 1
## uricacid 1 1 0 1 1 1
## BMI 1 1 1 0 1 1
## hypchol 1 1 1 1 0 1
## hgt 1 1 1 1 1 0
The column corresponding to the variable cohort
is set to 0
which means that this variable is not used in any of the imputation models. cohort
has the same value for all observations, so it would not be useful as a covariate.
Because BMI is calculated from height (hgt
) and weight (wgt
), and there are cases where only one of these two variables is missing, we want to impute hgt
and wgt
separately. BMI
should be imputed using passive imputation.
To avoid multicollinearity (which may lead to problems during imputation), imputation models should not include all three variables as predictor variables. In this example, we will use BMI
to impute the other variables.
Moreover, we need to exclude waist circumference (WC
) from the imputation model for wgt
because the high correlation between WC
, BMI
and wgt
would otherwise lead to problems during imputation. (In practie you would first try to leave it in and only exclude it if it causes problems.)
And since HyperMed
does not give us a lot more information than hypten
, but has a lot more missing values, we do not want to use it as a predictor variable.
Apply the necessary changes to pred
and meth
.
For passive imputation, you need to specify the formula used to calculate BMI in meth using "~I(...)"
.
# BMI will not be used as predictor of height and weight
pred[c("hgt", "wgt"), "BMI"] <- 0
# height and weight will not be used as predictor in any model
pred[, c("hgt", "wgt")] <- 0
# height and weight will be used as predictors for each other
pred["hgt", "wgt"] <- 1
pred["wgt", "hgt"] <- 1
# WC is not used as predictor for weight
pred["wgt", "WC"] <- 0
# HyperMed will not be used as predictor in any model
pred[, "HyperMed"] <- 0
# hypchol will not be used as predictor in the imputation model for chol
pred["chol", "hypchol"] <- 0
# BMI will be imputed passively
meth["BMI"] <- "~I(wgt/hgt^2)"
# HyperMed will not be imputed
meth["HyperMed"] <- ""
The visit sequence specifies the order in which the variables are imputed.
To be sure that the imputed values of BMI
match the imputed values of hgt
and wgt
, BMI
needs to be imputed after hgt
and wgt
.
visitSequence
from imp0
, and## [1] "HDL" "race" "bili" "smoke" "DM" "gender" "WC" "chol"
## [9] "HyperMed" "alc" "SBP" "wgt" "hypten" "cohort" "occup" "age"
## [17] "educ" "albu" "creat" "uricacid" "BMI" "hypchol" "hgt"
With the changes that we have made to the predictorMatrix
, method
and visitSequence
, we can now perform the imputation. For now, use no more then m = 5
and maxit = 10
(to save some time).
imp <- mice(NHANES2, method = meth, predictorMatrix = pred, visitSequence = visSeq,
maxit = 10, m = 5, seed = 2019)
mice()
prints the name of the variable being imputed for each iteration and imputation. If you run mice()
on your own computer the output will show up continuously. There, you may notice that imputation is slowest for categorical variables, especially when they have many categories.
You can hide the lengthy output by specifying printFlag = FALSE
.
mice()
does not return a data.frame
. Find out the class of the object returned by mice()
function using the function class()
and names()
, and take a look at the help file for this class.
## [1] "mids"
## [1] "data" "imp" "m" "where" "blocks"
## [6] "call" "nmis" "method" "predictorMatrix" "visitSequence"
## [11] "formulas" "post" "blots" "seed" "iteration"
## [16] "lastSeedValue" "chainMean" "chainVar" "loggedEvents" "version"
## [21] "date"
We see that imp
is an object of class mids
.
The help file tells us that a mids
object is a list with several elements:
data :
|
Original (incomplete) data set. |
imp :
|
The imputed values: A list of ncol(data) components, each list component is a matrix with nmis[j] rows and m columns.
|
m :
|
The number of imputations. |
where :
|
The missingness indicator matrix. |
blocks
|
The blocks argument of the mice() function.
|
call :
|
The call that created the mids object.
|
nmis :
|
The number of missing observations per variable. |
method :
|
The vector imputation methods. |
predictorMatrix :
|
The predictor matrix. |
visitSequence :
|
The sequence in which columns are visited during imputation. |
formulas
|
A named list of formulas corresponding the the imputed variables (blocks). |
post :
|
A vector of strings of length length(blocks) with commands for post-processing.
|
seed :
|
The seed value of the solution. |
iteration :
|
The number of iterations. |
lastSeedValue :
|
The most recent seed value. |
chainMean:
|
The mean of imputed values per variable and iteration: a list of m components. Each component is a matrix with maxit columns and length(visitSequence) rows.
|
chainVar :
|
The variances of imputed values per variable and iteration(same structure as chainMean ).
|
loggedEvents :
|
A data.frame with the record of automatic corrective actions and warnings; (NULL if no action was made).
|
version
|
Version number of the mice package that created the object. |
date
|
Date at which the object was created. |
Details of the loggedEvents
:
mice()
does some pre-processing of the data:
Furthermore, during each iteration
polr
imputation that does not converge is replaced by polyreg
.data.frame
in loggedEvents
has the following columns:
it
|
iteration number |
im
|
imputation number |
dep
|
name of the name of the variable being imputed |
meth
|
imputation method used |
out
|
character vector with names of altered/removed predictors |
© Nicole Erler