In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.6.0)RColorBrewer
(version: 1.1.2)reshape2
(version: 1.4.3)ggplot2
(version: 3.2.1)For this practical, we will again use the NHANES2 dataset that we have seen in the previous practical.
To load this dataset, you can use the command file.choose()
which opens the explorer and allows you to navigate to the location of the file NHANES2_for_practicals.RData
on your computer. If you know the path to the file, you can also use load("<path>/NHANES2_for_practicals.RData")
. RStudio users can also just click on the file in the “Files” pane/tab to load it.
The imputed data are stored in a mids
object called imp
that we created in the previous practical.
You can load it into your workspace by clicking the object imps.RData
if you are using RStudio. Alternatively, you can load this workspace using load("<path>/imps.RData")
. You then need to run:
It is good practice to make sure that mice()
has not done any processing of the data that was not planned or that you are not aware of. This means checking that the correct method
, predictorMatrix
and visitSequence
were used.
Do these checks for imp
.
## [1] TRUE
## [1] TRUE
## [1] TRUE
## HDL race bili smoke DM gender
## "norm" "" "norm" "polr" "" ""
## WC chol HyperMed alc SBP wgt
## "norm" "norm" "" "polr" "norm" "norm"
## hypten cohort occup age educ albu
## "logreg" "" "polyreg" "" "polr" "norm"
## creat uricacid BMI hypchol hgt
## "pmm" "norm" "~I(wgt/hgt^2)" "logreg" "norm"
## HDL race bili smoke DM gender WC chol HyperMed alc SBP wgt hypten cohort occup age educ
## HDL 0 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## race 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## bili 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## smoke 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 1
## DM 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1 1 1
## gender 1 1 1 1 1 0 1 1 0 1 1 0 1 0 1 1 1
## WC 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1
## chol 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1
## HyperMed 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## alc 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 1 1
## SBP 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1
## wgt 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 1
## hypten 1 1 1 1 1 1 1 1 0 1 1 0 0 0 1 1 1
## cohort 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## occup 1 1 1 1 1 1 1 1 0 1 1 0 1 0 0 1 1
## age 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1
## educ 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 0
## albu 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## creat 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## uricacid 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## BMI 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## hypchol 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1
## hgt 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1
## albu creat uricacid BMI hypchol hgt
## HDL 1 1 1 1 1 0
## race 1 1 1 1 1 0
## bili 1 1 1 1 1 0
## smoke 1 1 1 1 1 0
## DM 1 1 1 1 1 0
## gender 1 1 1 1 1 0
## WC 1 1 1 1 1 0
## chol 1 1 1 1 0 0
## HyperMed 1 1 1 1 1 0
## alc 1 1 1 1 1 0
## SBP 1 1 1 1 1 0
## wgt 1 1 1 0 1 1
## hypten 1 1 1 1 1 0
## cohort 1 1 1 1 1 0
## occup 1 1 1 1 1 0
## age 1 1 1 1 1 0
## educ 1 1 1 1 1 0
## albu 0 1 1 1 1 0
## creat 1 0 1 1 1 0
## uricacid 1 1 0 1 1 0
## BMI 1 1 1 0 1 0
## hypchol 1 1 1 1 0 0
## hgt 1 1 1 0 1 0
## [1] "HDL" "race" "bili" "smoke" "DM" "gender" "WC" "chol"
## [9] "HyperMed" "alc" "SBP" "wgt" "hypten" "cohort" "occup" "age"
## [17] "educ" "albu" "creat" "uricacid" "hypchol" "hgt" "BMI"
Checking the loggedEvent
shows us if mice()
detected any problems during the imputation.
Check the loggedEvents
for imp
.
Let’s see what would have happened if we had not prepared the predictorMatrix
, method
and visitSequence
before imputation.
Run the imputation without setting any additional arguments:
impnaive <- mice(NHANES2, m = 5, maxit = 10)
Take a look at the loggedEvents
of impnaive
.
The loggedEvents
of the “naive” imputation show that the constant variable cohort
was excluded before the imputation (as it should be). Furthermore, in the imputation model for HyperMed
, the variable hyptenyes
was excluded (hyptenyes
is the dummy variable belonging to hypten
).
We did not change the visitSequence
in impnaive
. Find out how that affected the imputed values of BMI
.
naiveDF1 <- complete(impnaive, 1)
naivecalcBMI <- with(naiveDF1, wgt/hgt^2)
impDF1 <- complete(imp, 1)
impcalcBMI <- with(impDF1, wgt/hgt^2)
cbind(naiveBMI = naiveDF1$BMI, naivecalcBMI,
impBMI = impDF1$BMI, impcalcBMI)[which(is.na(NHANES2$BMI)), ]
## naiveBMI naivecalcBMI impBMI impcalcBMI
## [1,] 27.72183 27.39723 24.83157 24.83157
## [2,] 42.57342 41.14373 29.34160 29.34160
## [3,] 34.96218 37.19945 39.39542 39.39542
## [4,] 25.09025 23.43565 30.54933 30.54933
## [5,] 31.11811 32.50800 26.77603 26.77603
## [6,] 53.45653 44.30180 44.55843 44.55843
## [7,] 33.45054 33.85769 33.95596 33.95596
## [8,] 25.82705 24.18514 24.84901 24.84901
## [9,] 20.25445 20.67391 35.92126 35.92126
## [10,] 40.35064 39.02754 38.00768 38.00768
## [11,] 27.56521 26.88638 35.19163 35.19163
## [12,] 26.29231 24.03070 28.91268 28.91268
## [13,] 24.96105 25.29304 36.65114 36.65114
## [14,] 31.41156 30.99954 38.40088 38.40088
## [15,] 22.70998 22.46495 21.42614 21.42614
## [16,] 30.91019 30.66649 32.16342 32.16342
## [17,] 24.79963 23.41127 22.42358 22.42358
## [18,] 21.78935 22.04695 19.05868 19.05868
## [19,] 24.69459 24.58942 26.68989 26.68989
## [20,] 31.50786 32.05961 21.82161 21.82161
## [21,] 24.40730 23.02497 21.44369 21.44369
## [22,] 31.79397 31.24754 23.63931 23.63931
## [23,] 27.60408 26.54276 35.61949 35.61949
## [24,] 23.91393 23.41127 23.33941 23.33941
## [25,] 30.91019 29.26408 27.79779 27.79779
## [26,] 45.61438 49.80076 40.85093 40.85093
## [27,] 38.62077 38.37203 32.16596 32.16596
## [28,] 25.69610 24.41214 24.00366 24.00366
## [29,] 23.01259 22.46495 33.98395 33.98395
## [30,] 26.52057 26.45250 28.13602 28.13602
## [31,] 23.61830 23.41127 31.71061 31.71061
## [32,] 21.69968 20.98261 24.12301 24.12301
## [33,] 34.96218 34.20240 34.34352 34.34352
## [34,] 29.44466 29.47564 44.54863 44.54863
## [35,] 22.03277 22.04695 24.68605 24.68605
## [36,] 43.04508 46.76502 44.28761 44.28761
## [37,] 33.20051 32.61309 28.55831 28.55831
When we compare the imputed and calculated values of BMI
from impnaive
we can see that the imputed hgt
and wgt
give a different BMI
than is imputed. This is because BMI
is imputed before wgt
, which means that the most recent imputed value of wgt
is from the previous iteration.
Changing the visitSequence
in imp
prevented this inconsistency.
In order to obtain correct results, the MICE algorithm needs to have converged. This can be checked visually by plotting summaries of the imputed values accross the iterations.
The mean and variance of the imputed values per iteration and variable are stored in the elements chainMean
and chainVar
of the mids
object.
Plot them to see if our imputation imp
has converged.
# implemented plotting function (use layout to change the number of rows and columns)
plot(imp, layout = c(6, 6))
The chains in imp
seem to have converged, however it is difficult to judge this based on only 10 iterations. In practice, more iterations should be done.
To save you some time, I ran the imputation again with 30 iterations and the traceplots confirm convergence:
In comparison, impnaive
(in my version with maxit = 30
) had some convergence problems:
hgt
, and wgt
show a clear trend and the chains do not mix well, i.e., there is more variation between the chains than within each chain. (the same is the case for BMI
).
These are clear signs that there is correlation or identification problems between these variables and some other variables (which is why we made adjustments to the predictorMatrix
for imp
).
Now that we know that imp
has converged, we can compare the distribution of the imputed values against the distribution of the observed values. When our imputation models fit the data well, they should have similar distributions (conditional on the covariates used in the imputation model).
You can use densityplot()
and propplot()
to get plots for all continuous and categorical variables.
propplot()
is not part of any package. Copy the following syntax that defines this function:
To check all imputed values you can either get a summary of the imp
element of the mids
object or create a complete dataset containing all imputations using the function complete()
and get the summary of that.
## Warning: attributes are not identical across measure variables; they will be dropped
# get the summary of the imputed values
sapply(Filter(function(x) nrow(x) > 0, imp$imp),
function(x) summary(unlist(x))
)
## $HDL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06009 1.19649 1.42542 1.42400 1.68001 2.59759
##
## $bili
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1174 0.5719 0.7760 0.7856 0.9980 1.7547
##
## $smoke
## never former current
## 5 3 2
##
## $WC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 64.46 87.59 98.09 99.49 109.60 152.22
##
## $chol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.844 4.163 5.057 4.961 5.763 8.431
##
## $HyperMed
## Mode NA's
## logical 4065
##
## $alc
## 0 <=1 1-3 3-7 >7
## 409 712 192 161 121
##
## $SBP
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.95 111.94 124.57 124.38 137.66 170.34
##
## $wgt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.18 68.43 80.13 81.41 94.91 136.87
##
## $hypten
## no yes
## 153 57
##
## $occup
## working looking for work not working
## 49 6 30
##
## $educ
## Less than 9th grade 9-11th grade High school graduate some college
## 0 0 2 1
## College or above
## 2
##
## $albu
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.234 4.079 4.292 4.286 4.503 5.106
##
## $creat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.450 0.710 0.850 0.909 0.980 7.460
##
## $uricacid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.094 4.403 5.337 5.375 6.304 10.834
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.53 24.69 28.23 29.55 33.96 48.07
##
## $hypchol
## no yes
## 368 62
##
## $hgt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.563 1.608 1.631 1.704 1.942
## .imp .id HDL race bili
## Min. :1 Min. : 1.0 Min. :0.06009 Mexican American : 560 Min. :-0.1174
## 1st Qu.:2 1st Qu.: 250.8 1st Qu.:1.11000 Other Hispanic : 550 1st Qu.: 0.6000
## Median :3 Median : 500.5 Median :1.32000 Non-Hispanic White:1750 Median : 0.7000
## Mean :3 Mean : 500.5 Mean :1.39411 Non-Hispanic Black:1145 Mean : 0.7558
## 3rd Qu.:4 3rd Qu.: 750.2 3rd Qu.:1.60000 other : 995 3rd Qu.: 0.9000
## Max. :5 Max. :1000.0 Max. :4.03000 Max. : 2.9000
## smoke DM gender WC chol HyperMed
## never :3045 no :4265 male :2465 Min. : 61.90 Min. : 1.844 no : 125
## former : 958 yes: 735 female:2535 1st Qu.: 85.10 1st Qu.: 4.270 previous: 100
## current: 997 Median : 95.40 Median : 4.910 yes : 710
## Mean : 96.51 Mean : 4.995 NA's :4065
## 3rd Qu.:105.52 3rd Qu.: 5.640
## Max. :154.70 Max. :10.680
## alc SBP wgt hypten cohort occup
## 0 : 974 Min. : 78.95 Min. : 25.18 no :3618 Length:5000 working :2769
## <=1:2117 1st Qu.:109.33 1st Qu.: 63.50 yes:1382 Class :character looking for work: 236
## 1-3: 717 Median :118.22 Median : 77.11 Mode :character not working :1995
## 3-7: 566 Mean :120.40 Mean : 78.42
## >7 : 626 3rd Qu.:129.33 3rd Qu.: 89.36
## Max. :202.00 Max. :167.83
## age educ albu creat uricacid
## Min. :20.00 Less than 9th grade : 415 Min. :3.000 Min. :0.4400 Min. : 1.094
## 1st Qu.:31.00 9-11th grade : 665 1st Qu.:4.100 1st Qu.:0.6900 1st Qu.: 4.400
## Median :43.00 High school graduate:1142 Median :4.300 Median :0.8200 Median : 5.300
## Mean :45.23 some college :1416 Mean :4.289 Mean :0.8576 Mean : 5.358
## 3rd Qu.:59.00 College or above :1362 3rd Qu.:4.500 3rd Qu.:0.9600 3rd Qu.: 6.200
## Max. :79.00 Max. :5.400 Max. :7.4600 Max. :10.834
## BMI hypchol hgt
## Min. :12.53 no :4433 Min. :1.397
## 1st Qu.:23.30 yes: 567 1st Qu.:1.616
## Median :26.61 Median :1.676
## Mean :27.56 Mean :1.684
## 3rd Qu.:30.85 3rd Qu.:1.753
## Max. :60.54 Max. :1.981
Unfortunately, we have some negative imputed values for bili
. Often, this would not result in bias in the analysis, but may be difficult to explain when providing a summary of the imputed data in a publication. In the present example we can see that the observed values have a slightly right-skewed distribution compared to the imputed values. Re-doing the imputation with pmm
instead of norm
for bili
should fix this. (However, since the imputations seem fine overall, and there is little knowledge gain in re-doing the previous steps, we will skip this repetition in this practical.)
The distributions of the imputed values for hgt
and SBP
differ a bit from the distributions of the observed data.
We also imputed a larger proportion than might have been expected in the highest category of alc
, and the distribution of values for smoke
looks a bit weird (but smoke
only has one missing value, which makes it difficult to judge the distribution of the imputed values).
Investigate if differences in the distributions of observed and imputed values, can be explained by other variables. Check this for
SBP
conditional on gender
and hypten
hgt
conditional on gender
alc
conditional on gender
or smoke
© Nicole Erler