Imputation of Longitudinal or Multi-level Data

Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

R version 3.6.1 (2019-07-05)
mice (version: 3.6.0)
nlme (version: 3.1.140)
JointAI (version: 0.6.0)
splines (version: 3.6.1)
ggplot2 (version: 3.2.1)

Dataset

In this practical we will work with data from a trial on primary biliary cirrhosis (PBC) of the liver.

To get the pbclong data, load the file pbclong.RData. You can download it here. To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the downloaded file on your computer. If you know the path to the file, you can also use load("<path>/<pbclong.RData").

The variables contained in the dataset pbclong are:


`id`	patient identifier
`day`	continuously measured day of follow-up time (the time variable)
`sex`	patients’ sex (f: female, m: male)
`trt`	treatment group (0: D-penicillmain, 1: placebo)
`age`	patients’ age at intake
`ascites`	presence of ascites at baseline (0:no, 1:yes)
`hepato`	presence of hepatomegaly or enlarged liver
`bili`	serum bilirubin level at baseline
`copper`	urine copper (ug/day)
`albumin`	serum albumin level at follow-up (time-varying)

The variables have the following distributions and proportions of missing values:

The missing data pattern is:

The longitudinal outcome albumin shows relatively linear trajectories over time:

To analyse the trajectories of albumin we want to use the following linear mixed effects model with random intercept and slope (either using lme from the package nlme or using lmer from the package lme4):

# using package nlme
lme(albumin ~ day + sex + trt + age + ascites + bili + copper, random = ~day|id)


# using package lme4
lmer(albumin ~ day + sex + trt + age + ascites + bili + copper + (day|id))

Imputation using mice

For imputation of longitudinal data, the mice package provides special imputation methods that take into account the two-level structure of the data.

In the pbclong data, missing values occur in baseline covariates ascites and copper, and the time-varying variable hepato.

For imputation of baseline covariates, the imputation methods

2lonly.pmm and
2lonly.norm

are available.

Time-varying variables can be imputed with imputation methods such as 2l.norm or 2l.bin. Since the imputation of hepato with 2l.bin takes quite some time, we will omit it in this part of the practical.

The predictorMatrix requires some extra specification to identify the “id” variable (set to -2) and the random effects structure (set to 2).

Task 1

Run the setup imputation and perform the necessary changes in the imputation method and predictor matrix.

mice does not recognize automatically that the data are multil-level, hence the settings chosen by default are not correct. You need to specify imputation methods for all incomplete variables.

Solution 1

micelong0 <- mice(pbclong, maxit = 0)
meth_micelong <- micelong0$method
pred_micelong <- micelong0$predictorMatrix

# don't impute hepato
meth_micelong[c("hepato")] <- ""
# exclude hepato from predictor of other models (because incomplete)
pred_micelong[, c("hepato")] <- 0

meth_micelong[c("copper", "ascites")] <- "2lonly.pmm"

pred_micelong[, "id"] <- -2
pred_micelong[, "day"] <- 2


# check the imputation method
meth_micelong

##           id          trt          age          sex          day      ascites       hepato 
##           ""           ""           ""           ""           "" "2lonly.pmm"           "" 
##         bili      albumin       copper 
##           ""           "" "2lonly.pmm"

# check the predictor matrix
pred_micelong

##         id trt age sex day ascites hepato bili albumin copper
## id      -2   1   1   1   2       1      0    1       1      1
## trt     -2   0   1   1   2       1      0    1       1      1
## age     -2   1   0   1   2       1      0    1       1      1
## sex     -2   1   1   0   2       1      0    1       1      1
## day     -2   1   1   1   2       1      0    1       1      1
## ascites -2   1   1   1   2       0      0    1       1      1
## hepato  -2   1   1   1   2       1      0    1       1      1
## bili    -2   1   1   1   2       1      0    0       1      1
## albumin -2   1   1   1   2       1      0    1       0      1
## copper  -2   1   1   1   2       1      0    1       1      0

Task 2

run the imputation with the adapted imputation method meth_micelong and predictor matrix pred_micelong
analyse the imputed data
pool the results

Solution 2

micelong <- mice(pbclong, meth = meth_micelong, pred = pred_micelong,
                 maxit = 20, seed = 2019, printFlag = FALSE)

library(nlme)
micelong_mira <- with(micelong, lme(albumin ~ day + sex + trt + age + ascites + 
                                      bili + copper, random = ~day|id)
)

# alternative:
# library(lme4)
# micelong_mira <- with(micelong, lmer(albumin ~ day + sex + trt + age + ascites +
#                                        bili + copper + (day|id))
# )

summary(pool(micelong_mira), conf.int = TRUE)

Imputation using JointAI

To analyse incomplete longitudinal data using a linear mixed model the R package JointAI provides the function lme_imp(). The specification of the main model components is analogous to the function lme() from the nlme package.

Specification of longitudinal models:
When imputing variables in a longitudinal (or other multi-level) model and there are missing values in baseline (level-2) covariates, models need to be specified for all longitudinal covariates, even if they do not have missing values. Specifying no model would imply that the incomplete baseline covariates are independent of the complete longitudinal variable (see also here). Therefore, JointAI automatically specifies models for all longitudinal covariates in such a setting.

An exception may be the time variable: it is often reasonable to assume that the baseline covariates are independent of the measurement times of the outcome and longitudinal covariates. To tell JointAI not to specify a model for the time variable, the argument no_model can be used.

Model types for longitudinal covariates:
For longitudinal covariates the following model types are implemented:

name	model	variable type
`lmm`	linear mixed model	continuous variables
`glmm_logit`	logistic mixed model	factors with two levels
`glmm_gamma`	gamma mixed model (with log-link)	skewed continuous variables
`glmm_poisson`	poisson mixed model	count variables
`clmm`	cumulative logit mixed model	ordered factors with >2 levels

More info:
For the specification of the other arguments of lme_imp(), refer to

the help page,
the vignette on Model Specification, or
the details given in the practical on Imputation with Non-linear Associations

Task

Run the imputation (start with n.iter = 500; this will take a few seconds).

Remember to specify appropriate models for the incomplete covariates and longitudinal variables.
Prevent specification of a model for age.
Check convergence using a traceplot().
If you are satisfied by convergence and mixing of the chains, get the model summary().

Solution

library(JointAI)
JointAIlong <- lme_imp(albumin ~ day + sex + trt + age + ascites + 
                         bili + copper, random = ~day|id,
                       models = c(copper = 'lognorm'),
                       no_model = 'day', data = pbclong, n.iter = 500, seed = 2019)

traceplot(JointAIlong)
summary(JointAIlong)

## 
##  Linear mixed model fitted with JointAI 
## 
## Call:
## lme_imp(fixed = albumin ~ day + sex + trt + age + ascites + bili + 
##     copper, data = pbclong, random = ~day | id, n.adapt = 100, 
##     n.iter = 500, models = c(copper = "lognorm"), no_model = "day", 
##     seed = 2019)
## 
## Posterior summary:
##                  Mean       SD      2.5%     97.5% tail-prob. GR-crit
## (Intercept)  4.180254 1.25e-01  3.932808  4.420392     0.0000    1.23
## sexf        -0.115628 6.03e-02 -0.231414 -0.002686     0.0453    1.53
## trt1         0.004183 3.82e-02 -0.072827  0.076834     0.9227    1.06
## age         -0.007347 1.95e-03 -0.011078 -0.003536     0.0000    1.01
## ascites1    -0.180464 8.83e-02 -0.357131 -0.006822     0.0387    1.01
## copper      -0.000905 2.63e-04 -0.001443 -0.000372     0.0000    1.21
## day         -0.000197 1.41e-05 -0.000224 -0.000170     0.0000    1.01
## bili        -0.023937 2.46e-03 -0.028894 -0.019229     0.0000    1.01
## 
## Posterior summary of random effects covariance matrix:
##          Mean      SD   2.5%  97.5% tail-prob. GR-crit
## D[1,1] 0.0992 0.01078 0.0803 0.1214               1.01
## D[1,2] 0.0166 0.00525 0.0072 0.0278          0    1.01
## D[2,2] 0.0196 0.00455 0.0122 0.0297               1.05
## 
## Posterior summary of residual std. deviation:
##               Mean      SD  2.5% 97.5% GR-crit
## sigma_albumin 0.32 0.00594 0.309 0.332    1.01
## 
## 
## MCMC settings:
## Iterations = 101:600
## Sample size per chain = 500 
## Thinning interval = 1 
## Number of chains = 3 
## 
## Number of observations: 1945 
## Number of groups: 312

Additional exercise JointAI

We want to fit a logistic mixed model for the variable hepato and explore if the association is non-linear over time.

Task 1

Fit a logistic mixed model using the function glme_imp using the same covariates as before.
Specify a natural cubic spline with 3 degrees of freedom for day.
Check convergence using a traceplot().

When specifying a generalized (mixed) model remember to specify the model family and link function.

To use natural cubic splines use the function ns() from the package splines, i.e., ns(day, df = 3).

Solution 1

library(splines)
JointAIlong2 <- glme_imp(hepato ~ ns(day, df = 3) + sex + trt + age + ascites +
                           bili + copper, random = ~1|id, family = binomial(),
                         models = c(copper = 'lognorm'),
                         no_model = 'day', data = pbclong, n.iter = 1000,
                         seed = 2019)

traceplot(JointAIlong2)

Task 2

When the model has converged, we want to visualize the potentially non-linear association of day. To do that, we can create a new dataset containing information on an “average” subject, with different values for day.

The function predDF() creates such a dataset from an object of class JointAI. It sets reference values (i.e., the median for continuous variables and the reference category for categorical variables) for all variables other than the one specified in the argument var. The variable given in var will range across the range of values of that variable encountered in the data.

Use predDF() to create a dataset that allows visualization of the effect of day.

Solution 2

newdf <- predDF(JointAIlong2, var = 'day')

head(newdf)

Task 3

We can now predict the outcome of our model for our “average” subject using the function predict(). It takes a JointAI object and a data.frame containing the data to predict from as arguments. The argument quantiles can be used to specify which quantiles of the distribution of each fitted value are returned (default is 2.5% and 97.5%).

predict() returns a list with the following elements

dat: the data.frame provided by the user extended with the fitted values and 2.5% and 97.5% quantiles that form the credible interval for the fitted values
fit: a vector containing the fitted values (the mean of the distribution of the fitted value)
quantiles: a matrix containing the credible interval for each fitted value

Use predict() to obtain the fitted values and corresponding intervals
Visualize the result by plotting fitted values and quantiles (y-axis) over time (day; x-axis)

Solution 3

pred <- predict(JointAIlong2, newdata = newdf)

ggplot(pred$dat, aes(x = day, y = fit)) +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.3) +
  geom_line()

Note:
The fitted values and quantiles are on the scale of the linear predictor, i.e., obtained by multiplying the data in newdf (\(\mathbf x\)) with the samples of the posterior distribution of the parameters (\(\boldsymbol \beta\)).

For a logistic model it is more intuitive to present the fitted values on the probability scale.

\[ \log\left(\frac{\pi}{1-\pi}\right) = \mathbf x^\top\boldsymbol\beta \qquad \Rightarrow \pi = \frac{\exp(\mathbf x^\top\boldsymbol\beta)}{1 + \exp(\mathbf x^\top\boldsymbol\beta)} \]

The function plogis() does this transformation.

ggplot(pred$dat, aes(x = day, y = plogis(fit))) +
  geom_ribbon(aes(ymin = plogis(`2.5%`), ymax = plogis(`97.5%`)), alpha = 0.3) +
  geom_line() +
  ylab('probability of hepatomegaly')