In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.6.0)nlme
(version: 3.1.140)JointAI
(version: 0.6.0)splines
(version: 3.6.1)ggplot2
(version: 3.2.1)In this practical we will work with data from a trial on primary biliary cirrhosis (PBC) of the liver.
To get the pbclong data, load the file pbclong.RData
. You can download it here. To load this dataset, you can use the command file.choose()
which opens the explorer and allows you to navigate to the location of the downloaded file on your computer. If you know the path to the file, you can also use load("<path>/<pbclong.RData")
.
The variables contained in the dataset pbclong
are:
id
|
patient identifier |
day
|
continuously measured day of follow-up time (the time variable) |
sex
|
patients’ sex (f: female, m: male) |
trt
|
treatment group (0: D-penicillmain, 1: placebo) |
age
|
patients’ age at intake |
ascites
|
presence of ascites at baseline (0:no, 1:yes) |
hepato
|
presence of hepatomegaly or enlarged liver |
bili
|
serum bilirubin level at baseline |
copper
|
urine copper (ug/day) |
albumin
|
serum albumin level at follow-up (time-varying) |
The variables have the following distributions and proportions of missing values:
The missing data pattern is:
The longitudinal outcome albumin
shows relatively linear trajectories over time:
To analyse the trajectories of albumin
we want to use the following linear mixed effects model with random intercept and slope (either using lme
from the package nlme or using lmer
from the package lme4):
For imputation of longitudinal data, the mice package provides special imputation methods that take into account the two-level structure of the data.
In the pbclong
data, missing values occur in baseline covariates ascites
and copper
, and the time-varying variable hepato
.
For imputation of baseline covariates, the imputation methods
2lonly.pmm
and2lonly.norm
are available.
Time-varying variables can be imputed with imputation methods such as 2l.norm
or 2l.bin
. Since the imputation of hepato
with 2l.bin
takes quite some time, we will omit it in this part of the practical.
The predictorMatrix
requires some extra specification to identify the “id” variable (set to -2) and the random effects structure (set to 2).
Run the setup imputation and perform the necessary changes in the imputation method and predictor matrix.
mice does not recognize automatically that the data are multil-level, hence the settings chosen by default are not correct. You need to specify imputation methods for all incomplete variables.
micelong0 <- mice(pbclong, maxit = 0)
meth_micelong <- micelong0$method
pred_micelong <- micelong0$predictorMatrix
# don't impute hepato
meth_micelong[c("hepato")] <- ""
# exclude hepato from predictor of other models (because incomplete)
pred_micelong[, c("hepato")] <- 0
meth_micelong[c("copper", "ascites")] <- "2lonly.pmm"
pred_micelong[, "id"] <- -2
pred_micelong[, "day"] <- 2
# check the imputation method
meth_micelong
## id trt age sex day ascites hepato
## "" "" "" "" "" "2lonly.pmm" ""
## bili albumin copper
## "" "" "2lonly.pmm"
## id trt age sex day ascites hepato bili albumin copper
## id -2 1 1 1 2 1 0 1 1 1
## trt -2 0 1 1 2 1 0 1 1 1
## age -2 1 0 1 2 1 0 1 1 1
## sex -2 1 1 0 2 1 0 1 1 1
## day -2 1 1 1 2 1 0 1 1 1
## ascites -2 1 1 1 2 0 0 1 1 1
## hepato -2 1 1 1 2 1 0 1 1 1
## bili -2 1 1 1 2 1 0 0 1 1
## albumin -2 1 1 1 2 1 0 1 0 1
## copper -2 1 1 1 2 1 0 1 1 0
meth_micelong
and predictor matrix pred_micelong
micelong <- mice(pbclong, meth = meth_micelong, pred = pred_micelong,
maxit = 20, seed = 2019, printFlag = FALSE)
library(nlme)
micelong_mira <- with(micelong, lme(albumin ~ day + sex + trt + age + ascites +
bili + copper, random = ~day|id)
)
# alternative:
# library(lme4)
# micelong_mira <- with(micelong, lmer(albumin ~ day + sex + trt + age + ascites +
# bili + copper + (day|id))
# )
summary(pool(micelong_mira), conf.int = TRUE)
To analyse incomplete longitudinal data using a linear mixed model the R package JointAI provides the function lme_imp()
. The specification of the main model components is analogous to the function lme()
from the nlme package.
Specification of longitudinal models:
When imputing variables in a longitudinal (or other multi-level) model and there are missing values in baseline (level-2) covariates, models need to be specified for all longitudinal covariates, even if they do not have missing values. Specifying no model would imply that the incomplete baseline covariates are independent of the complete longitudinal variable (see also here). Therefore, JointAI automatically specifies models for all longitudinal covariates in such a setting.
An exception may be the time variable: it is often reasonable to assume that the baseline covariates are independent of the measurement times of the outcome and longitudinal covariates. To tell JointAI not to specify a model for the time variable, the argument no_model
can be used.
name | model | variable type |
---|---|---|
lmm
|
linear mixed model | continuous variables |
glmm_logit
|
logistic mixed model | factors with two levels |
glmm_gamma
|
gamma mixed model (with log-link) | skewed continuous variables |
glmm_poisson
|
poisson mixed model | count variables |
clmm
|
cumulative logit mixed model | ordered factors with >2 levels |
More info:
For the specification of the other arguments of lme_imp()
, refer to
Run the imputation (start with n.iter = 500
; this will take a few seconds).
age
.traceplot()
.summary()
.library(JointAI)
JointAIlong <- lme_imp(albumin ~ day + sex + trt + age + ascites +
bili + copper, random = ~day|id,
models = c(copper = 'lognorm'),
no_model = 'day', data = pbclong, n.iter = 500, seed = 2019)
traceplot(JointAIlong)
summary(JointAIlong)
##
## Linear mixed model fitted with JointAI
##
## Call:
## lme_imp(fixed = albumin ~ day + sex + trt + age + ascites + bili +
## copper, data = pbclong, random = ~day | id, n.adapt = 100,
## n.iter = 500, models = c(copper = "lognorm"), no_model = "day",
## seed = 2019)
##
## Posterior summary:
## Mean SD 2.5% 97.5% tail-prob. GR-crit
## (Intercept) 4.180254 1.25e-01 3.932808 4.420392 0.0000 1.23
## sexf -0.115628 6.03e-02 -0.231414 -0.002686 0.0453 1.53
## trt1 0.004183 3.82e-02 -0.072827 0.076834 0.9227 1.06
## age -0.007347 1.95e-03 -0.011078 -0.003536 0.0000 1.01
## ascites1 -0.180464 8.83e-02 -0.357131 -0.006822 0.0387 1.01
## copper -0.000905 2.63e-04 -0.001443 -0.000372 0.0000 1.21
## day -0.000197 1.41e-05 -0.000224 -0.000170 0.0000 1.01
## bili -0.023937 2.46e-03 -0.028894 -0.019229 0.0000 1.01
##
## Posterior summary of random effects covariance matrix:
## Mean SD 2.5% 97.5% tail-prob. GR-crit
## D[1,1] 0.0992 0.01078 0.0803 0.1214 1.01
## D[1,2] 0.0166 0.00525 0.0072 0.0278 0 1.01
## D[2,2] 0.0196 0.00455 0.0122 0.0297 1.05
##
## Posterior summary of residual std. deviation:
## Mean SD 2.5% 97.5% GR-crit
## sigma_albumin 0.32 0.00594 0.309 0.332 1.01
##
##
## MCMC settings:
## Iterations = 101:600
## Sample size per chain = 500
## Thinning interval = 1
## Number of chains = 3
##
## Number of observations: 1945
## Number of groups: 312
We want to fit a logistic mixed model for the variable hepato
and explore if the association is non-linear over time.
glme_imp
using the same covariates as before.day
.traceplot()
.When specifying a generalized (mixed) model remember to specify the model family and link function.
To use natural cubic splines use the function ns()
from the package splines, i.e., ns(day, df = 3)
.
When the model has converged, we want to visualize the potentially non-linear association of day
. To do that, we can create a new dataset containing information on an “average” subject, with different values for day
.
The function predDF()
creates such a dataset from an object of class JointAI
. It sets reference values (i.e., the median for continuous variables and the reference category for categorical variables) for all variables other than the one specified in the argument var
. The variable given in var
will range across the range of values of that variable encountered in the data.
Use predDF()
to create a dataset that allows visualization of the effect of day
.
We can now predict the outcome of our model for our “average” subject using the function predict()
. It takes a JointAI
object and a data.frame
containing the data to predict from as arguments. The argument quantiles
can be used to specify which quantiles of the distribution of each fitted value are returned (default is 2.5%
and 97.5%
).
predict()
returns a list with the following elements
dat
: the data.frame
provided by the user extended with the fitted values and 2.5% and 97.5% quantiles that form the credible interval for the fitted valuesfit
: a vector containing the fitted values (the mean of the distribution of the fitted value)quantiles
: a matrix containing the credible interval for each fitted valuepredict()
to obtain the fitted values and corresponding intervalsday
; x-axis)pred <- predict(JointAIlong2, newdata = newdf)
ggplot(pred$dat, aes(x = day, y = fit)) +
geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.3) +
geom_line()
Note:
The fitted values and quantiles are on the scale of the linear predictor, i.e., obtained by multiplying the data in newdf
(\(\mathbf x\)) with the samples of the posterior distribution of the parameters (\(\boldsymbol \beta\)).
For a logistic model it is more intuitive to present the fitted values on the probability scale.
\[ \log\left(\frac{\pi}{1-\pi}\right) = \mathbf x^\top\boldsymbol\beta \qquad \Rightarrow \pi = \frac{\exp(\mathbf x^\top\boldsymbol\beta)}{1 + \exp(\mathbf x^\top\boldsymbol\beta)} \]
The function plogis()
does this transformation.
© Nicole Erler