In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice
(version: 3.6.0)survival
(version: 2.44.1.1; note: in versions 2.44-1 and 2.44-1.1 confint()
returns NA
)JointAI
(version: 0.6.0)ggplot2
(version: 3.2.1)In this practical we will work with a different subset of the PBC data, omitting the longitudinal structure and focusing on the survival component.
To get the pbcdat data, load the file pbcdat.RData
. You can download it here. To load this dataset, you can use the command file.choose()
which opens the explorer and allows you to navigate to the location of the downloaded file on your computer. If you know the path to the file, you can also use load("<path>/<pbcdat.RData")
.
pbcdat
) are:
time | number of years between inclusion and death, transplantion, or end of follow-up |
status |
status at time (censored, transplant, dead)
|
age | patient’s age at intake |
sex | patient’s sex |
platelet | platelet count |
chol | serum cholesterol |
stage | histologic stage of disease |
The variables in pbcdat
dataset have the following distributions:
The missing data pattern is
We are interested to determine predictor variables for patient survival, using the following Cox proportional hazards model:
As we have seen in the lecture, the mice package provides the function nelsonaalen()
that calculates the Nelson-Aalen estimator with which the cumulative Hazard can be approximated.
mice()
.Note:
nelsonaalen()
does not accept the status == 'dead'
specification of the event, hence, we need to create a new event indicator event
.
pbcdat$event <- pbcdat$status == 'dead'
pbcdat$na <- nelsonaalen(data = pbcdat, timevar = "time", statusvar = "event")
micesurv0 <- mice(pbcdat, maxit = 0)
micesurvmeth <- micesurv0$meth
micesurvpred <- micesurv0$pred
micesurvmeth[c("chol")] <- "midastouch"
micesurvmeth[c("platelet")] <- "norm"
micesurvpred[, "time"] <- 0
# check the method and predictorMatrix
micesurvmeth
## time status platelet age sex chol stage
## "" "" "norm" "" "" "midastouch" "polr"
## event na
## "" ""
## time status platelet age sex chol stage event na
## time 0 1 1 1 1 1 1 1 1
## status 0 0 1 1 1 1 1 1 1
## platelet 0 1 0 1 1 1 1 1 1
## age 0 1 1 0 1 1 1 1 1
## sex 0 1 1 1 0 1 1 1 1
## chol 0 1 1 1 1 0 1 1 1
## stage 0 1 1 1 1 1 0 1 1
## event 0 1 1 1 1 1 1 0 1
## na 0 1 1 1 1 1 1 1 0
Run the imputation and analyse the imputed data.
## Warning: Unknown or uninitialised column: 'df.residual'.
## Warning in pool.fitlist(getfit(object), dfcom = dfcom): Large sample assumed.
Analysis of a Cox proportional hazards model using JointAI works analogous to the function coxph()
from packages survival.
coxph_imp()
(and 500 iterations).library(survival)
JointAIcox <- coxph_imp(Surv(time = time, event) ~ platelet +
age + sex + chol + stage, data = pbcdat,
n.iter = 500, seed = 2019)
traceplot(JointAIcox)
The mixing of the parameters for stage
is not great. We would have to increase the number of iterations and/or the number of chains to get better results.
In the following section we will see a trick that can sometimes help to improve problems with convergence or mixing.
For larger datasets with many different event times the Cox model implemented in JointAI can become quite slow. This is because it has to use the counting process notation which requires a loop over all event times in each iteration of the MCMC sampler.
A parametric survival model, e.g. assuming a Weibull distribution (see Section 4.2 of the slides of EP03: Biostatistical Methods II: Survival Analysis), is often faster. This is implemented in the function survreg_imp()
.
traceplot()
and the Gelman-Rubin criterion (GR_crit()
).JointAIsurv <- survreg_imp(Surv(time = time, event) ~ platelet +
age + sex + chol + stage, data = pbcdat,
n.iter = 500, seed = 2019)
traceplot(JointAIsurv)
GR_crit(JointAIsurv)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.34 1.91
## platelet 1.01 1.04
## age 1.02 1.07
## sexf 1.03 1.05
## chol 1.02 1.07
## stage2 2.07 3.91
## stage3 2.00 4.03
## stage4 2.12 4.26
## shape_time 1.01 1.04
##
## Multivariate psrf
##
## 1.78
Both the traceplot and the Gelman-Rubin criterion show that the parameters for stage
don’t converge well. There are clear patterns visible in the plots and the upper limit of the confidence intverval of the Gelman-Rubin criterion is much larger than one.
In some cases, convergence of coefficients for categorical variables can be improved by changing the reference category. Especially when the categories are very unbalanced, convergence it better when the largest (or a large) category is chosen as the reference.
The plot of the distribution of the variables in thepbcdat
data at the beginning of this practical shows that there are few patients with stage 1.
stage
as reference (using the argument refcats
).Check out the help file and the vignette on Model Specification for more details on how to use the argument refcats
.
traceplot()
and GR_crit()
to see if the change in reference category improved convergence.JointAIsurv2 <- survreg_imp(Surv(time = time, event) ~ platelet +
age + sex + chol + stage, data = pbcdat,
n.iter = 500, seed = 2019, refcats = 'largest')
traceplot(JointAIsurv2)
GR_crit(JointAIsurv2)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.02 1.08
## platelet 1.04 1.12
## age 1.02 1.06
## sexm 1.01 1.04
## chol 1.02 1.06
## stage1 1.14 1.41
## stage2 1.15 1.47
## stage3 1.05 1.16
## shape_time 1.00 1.01
##
## Multivariate psrf
##
## 1.12
The traceplots look a lot better and the Gelman-Rubin criterion has improved. Nevertheless, more iterations or chaines would be necessary to obtain more reliable results.
© Nicole Erler